{"title": "Bayesian Active Model Selection with an Application to Automated Audiometry", "book": "Advances in Neural Information Processing Systems", "page_first": 2386, "page_last": 2394, "abstract": "We introduce a novel information-theoretic approach for active model selection and demonstrate its effectiveness in a real-world application. Although our method can work with arbitrary models, we focus on actively learning the appropriate structure for Gaussian process (GP) models with arbitrary observation likelihoods. We then apply this framework to rapid screening for noise-induced hearing loss (NIHL), a widespread and preventible disability, if diagnosed early. We construct a GP model for pure-tone audiometric responses of patients with NIHL. Using this and a previously published model for healthy responses, the proposed method is shown to be capable of diagnosing the presence or absence of NIHL with drastically fewer samples than existing approaches. Further, the method is extremely fast and enables the diagnosis to be performed in real time.", "full_text": "Bayesian Active Model Selection\n\nwith an Application to Automated Audiometry\n\nJacob R. Gardner\nCS, Cornell University\n\nIthaca, NY 14850\n\njrg365@cornell.edu\n\nKilian Q. Weinberger\nCS, Cornell University\n\nIthaca, NY 14850\n\nkqw4@cornell.edu\n\nGustavo Malkomes\n\nCSE, WUSTL\n\nSt. Louis, MO 63130\n\nluizgustavo@wustl.edu\n\nRoman Garnett\nCSE, WUSTL\n\nSt. Louis, MO 63130\n\ngarnett@wustl.edu\n\nDennis Barbour\nBME, WUSTL\n\nSt. Louis, MO 63130\n\ndbarbour@wustl.edu\n\nJohn P. Cunningham\n\nStatistics, Columbia University\n\nNew York, NY 10027\n\njpc2181@columbia.edu\n\nAbstract\n\nWe introduce a novel information-theoretic approach for active model selection\nand demonstrate its effectiveness in a real-world application. 
Although our method can work with arbitrary models, we focus on actively learning the appropriate structure for Gaussian process (GP) models with arbitrary observation likelihoods. We then apply this framework to rapid screening for noise-induced hearing loss (NIHL), a widespread disability that is preventable if diagnosed early. We construct a GP model for pure-tone audiometric responses of patients with NIHL. Using this and a previously published model for healthy responses, the proposed method is shown to be capable of diagnosing the presence or absence of NIHL with drastically fewer samples than existing approaches. Further, the method is extremely fast and enables the diagnosis to be performed in real time.

1 Introduction

Personalized medicine has long been a critical application area for machine learning [1-3], in which automated decision making and diagnosis are key components. Beyond improving quality of life, machine learning in diagnostic settings is particularly important because collecting additional medical data often incurs significant financial burden, time cost, and patient discomfort. In machine learning, one often considers this problem to be one of active feature selection: acquiring each new feature (e.g., a blood test) incurs some cost but will, with hope, better inform diagnosis, treatment, and prognosis. By careful analysis, we may optimize this trade-off.
However, many diagnostic settings in medicine do not involve feature selection, but rather involve querying a sample space to discriminate different models describing patient attributes. A particular, clarifying example that motivates this work is noise-induced hearing loss (NIHL), a prevalent disorder affecting 26 million working-age adults in the United States alone [4] and affecting over half of workers in particular occupations such as mining and construction.
Most tragically, NIHL is entirely preventable with simple, low-cost solutions (e.g., earplugs). The critical requirement for prevention is effective early diagnosis.
To be tested for NIHL, patients must complete a time-consuming audiometric exam that presents a series of tones at various frequencies and intensities; at each tone the patient indicates whether he or she hears the tone [5-7]. From the responses, the clinician infers the patient's audible threshold at a set of discrete frequencies (the audiogram); this process requires the delivery of up to hundreds of tones. Audiologists scan the audiogram for a hearing deficit with a characteristic notch shape: a narrow band, which can be anywhere in the frequency domain, that is indicative of NIHL. Unfortunately, at early stages of the disorder, notches can be small enough that they are undetectable in a standard audiogram, leaving many cases undiagnosed until the condition has become severe. Increasing audiogram resolution would require higher sample counts (more presented tones) and thus only lengthen an already burdensome procedure. We present here a better approach.
Note that the NIHL diagnostic challenge is not one of feature selection (choosing the next test to run and classifying the result), but rather of model selection: is this patient's hearing better described by a normal-hearing model or a notched NIHL model? Here we propose a novel active model selection algorithm to make the NIHL diagnosis in as few tones as possible, which directly reflects the time and personnel resources required to make accurate diagnoses in large populations. We note that this is a model-selection problem in the truest sense: a diagnosis corresponds to selecting between two or more sets of indexed probability distributions (models), rather than the more common misnomer of choosing an index from within a model (i.e., hyperparameter optimization).
In the NIHL case this distinction is critical. We are choosing between two models: the set of possible NIHL hearing functions and the set of normal hearing functions. This approach suggests a very different and more direct algorithm than the standard approach of first learning the most likely NIHL function and then accepting or rejecting it as different from normal.
We make the following contributions. First, we design a completely general active-model-selection method based on maximizing the mutual information between the response to a tone and the posterior on the model class. Critically, we develop an analytical approximation of this criterion for Gaussian process (GP) models with arbitrary observation likelihoods, enabling active structure learning for GPs. Second, we extend the work of Gardner et al. [8] (which uses active learning to speed up audiogram inference) to the broader question of identifying which model, normal or NIHL, best fits a given patient. Third, we develop a novel GP prior mean that parameterizes notched hearing loss for NIHL patients. To our knowledge, this is the first publication with an active model-selection approach that does not require updating each model for every candidate point, allowing audiometric diagnosis of NIHL to be performed in real time. Finally, using patient data from a clinical trial, we show empirically that our method typically detects simulated noise-induced hearing loss automatically with fewer than 15 query tones.
This is vastly fewer than the number required to infer a conventional audiogram or even an actively learned audiogram [8], highlighting the importance of both the active-learning approach and our focus on model selection.

2 Bayesian model selection

We consider supervised learning problems defined on an input space X and an output space Y. Suppose we are given a set of observed data D = (X, y), where X represents the design matrix of independent variables xi ∈ X and y the associated vector of dependent variables yi = y(xi) ∈ Y. Let M be a probabilistic model, and let θ be an element of the parameter space indexing M. Given a set of observations D, we wish to compute the probability of M being the correct model to explain D, compared to other models. The key quantity of interest to model selection is the model evidence:

p(y | X, M) = ∫ p(y | X, θ, M) p(θ | M) dθ,    (1)

which represents the probability of having generated the observed data under the model, marginalized over θ to account for all possible members of that model under a prior p(θ | M) [9]. Given a set of M candidate models {Mi}, i = 1, ..., M, and the computed evidence for each, we can apply Bayes' rule to compute the posterior probability of each model given the data:

p(M | D) = p(y | X, M) p(M) / p(y | X) = p(y | X, M) p(M) / Σi p(y | X, Mi) p(Mi),    (2)

where p(M) represents the prior probability distribution over the models.

2.1 Active Bayesian model selection

Suppose that we have a mechanism for actively selecting new data, choosing x∗ ∈ X and observing y∗ = y(x∗), to add to our dataset D = (X, y) in order to better distinguish the candidate models {Mi}.
After making this observation, we will form an augmented dataset D′ = D ∪ {(x∗, y∗)}, from which we can recompute a new model posterior p(M | D′).
An approach motivated by information theory is to select the location maximizing the mutual information between the observation value y∗ and the unknown model:

I(y∗; M | x∗, D) = H[M | D] − E_{y∗}[H[M | D′]]    (3)
                 = H[y∗ | x∗, D] − E_M[H[y∗ | x∗, D, M]],    (4)

where H indicates (differential) entropy. Whereas Equation (3) is computationally problematic (involving costly model retraining), the equivalent expression (4) is typically more tractable, has been applied fruitfully in various active-learning settings [10, 11, 8, 12, 13], and requires only computing the differential entropy of the model-marginal predictive distribution:

p(y∗ | x∗, D) = Σ_{i=1}^{M} p(y∗ | x∗, D, Mi) p(Mi | D),    (5)

and the model-conditional predictive distributions {p(y∗ | x∗, D, Mi)}, with all models trained with the currently available data. In contrast to (3), this does not involve any retraining cost. Although computing the entropy in (5) might be problematic, we note that this is a one-dimensional integral that can easily be resolved with quadrature. Our proposed approach, which we call Bayesian active model selection (BAMS), is then to compute, for each candidate location x∗, the mutual information between y∗ and the unknown model, and query where this is maximized:

arg max_{x∗} I(y∗; M | x∗, D).    (6)

2.2 Related work

Although active learning and model selection have been widely investigated, active model selection has received comparatively less attention. Ali et al. [14] proposed an active learning model selection method that requires leave-two-out cross-validation when evaluating each candidate x∗, requiring O(B^2 M |X∗|) model updates per iteration, where B is the total budget. Kulick et al. [15] also considered an information-theoretic approach to active model selection, suggesting maximizing the expected cross entropy between the current model posterior p(M | D) and the updated distribution p(M | D′). This approach also requires extensive model retraining, with O(M |X∗|) model updates per iteration, to estimate this expectation for each candidate. These approaches become prohibitively expensive for real-time applications with a large number of candidates. In our audiometric experiments, for example, we consider 10 000 candidate points, expending 1-2 seconds per iteration, whereas these techniques would take several hours to select the next point to query.

3 Active model selection for Gaussian processes

In the previous section, we proposed a general framework for performing sequential active Bayesian model selection, without making any assumptions about the forms of the models {Mi}. Here we will discuss specific details of our proposal when these models represent alternative structures for Gaussian process priors on a latent function.
We assume that our observations are generated via a latent function f : X → R with a known observation model p(y | f), where fi = f(xi). A standard nonparametric Bayesian approach with such models is to place a Gaussian process (GP) prior distribution on f, p(f) = GP(f; µ, K), where µ : X → R is a mean function and K : X^2 → R is a positive-definite covariance function or kernel [16]. We condition on the observed data to form a posterior distribution p(f | D), which is typically an updated Gaussian process (making approximations if necessary).
We make predictions at a new input x∗ via the predictive distribution p(y∗ | x∗, D) = ∫ p(y∗ | f∗, D) p(f∗ | x∗, D) df∗, where f∗ = f(x∗). The mean and kernel functions are parameterized by hyperparameters that we concatenate into a vector θ, and different choices of these hyperparameters imply that the functions drawn from the GP will have particular frequency, amplitude, and other properties. Together, µ and K define a model parameterized by the hyperparameters θ. Much attention is paid to learning these hyperparameters in a fixed model class, sometimes under the unfortunate term "model selection."
Note, however, that the structural (not hyperparameter) choices of the mean function µ and covariance function K themselves are typically made by selecting (often blindly!) from several off-the-shelf solutions (see, for example, [17, 16]; though also see [18, 19]), and this choice has substantial bearing on the resulting functions f we can model. Indeed, in many settings, choosing the nature of plausible functions is precisely the problem of model selection; for example, to decide whether the function has periodic structure, exhibits nonstationarity, etc. Our goal is to automatically and actively make these structural choices during GP modeling through intelligent sampling.
To connect to our active-learning formulation, let {Mi} be a set of Gaussian process models for the latent function f. Each model comprises a mean function µi, covariance function Ki, and associated hyperparameters θi. Our approach outlined in Section 2.1 requires the computation of three quantities that are not typically encountered in GP modeling and inference: the hyperparameter posterior p(θ | D, M), the model evidence p(y | X, M), and the predictive distribution p(y∗ | x∗, D, M), where we have marginalized over θ in the latter two quantities.
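Given (approximate) log evidences for each candidate model, the model posterior (2) reduces to a numerically stable softmax over log evidence plus log prior. A minimal sketch in numpy (the example model priors are illustrative):

```python
import numpy as np

def model_posterior(log_evidences, log_priors):
    """Compute p(M_i | D) from log p(y | X, M_i) and log p(M_i), as in eq. (2)."""
    log_joint = np.asarray(log_evidences) + np.asarray(log_priors)
    log_joint = log_joint - log_joint.max()  # shift by max for numerical stability
    post = np.exp(log_joint)
    return post / post.sum()

# e.g., two models with equal priors and log evidences of -10 and -12 nats:
posterior = model_posterior([-10.0, -12.0], np.log([0.5, 0.5]))
```

Working in log space avoids underflow: raw evidences of GP models are often far below floating-point range even when their ratios are moderate.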
The most common approaches to GP inference are maximum likelihood-II (MLE-II) or maximum a posteriori-II (MAP-II) estimation, where we maximize the hyperparameter posterior [20, 16]:1

θ̂ = arg max_θ log p(θ | D, M) = arg max_θ log p(θ | M) + log p(y | X, θ, M).    (7)

Typically, predictive distributions and other desired quantities are then reported at the MLE/MAP hyperparameters, implicitly making the assumption that p(θ | D, M) ≈ δ(θ̂). Although a computationally convenient choice, this does not account for uncertainty in the hyperparameters, which can be nontrivial with small datasets [9]. Furthermore, accounting correctly for model parameter uncertainty is crucial to model selection, where it naturally introduces a model-complexity penalty. We discuss less-drastic approximations to these quantities below.

3.1 Approximating the model evidence and hyperparameter posterior

The model evidence p(y | X, M) and hyperparameter posterior distribution p(θ | D, M) are in general intractable for GPs, as there is no conjugate prior distribution p(θ | M) available. Instead, we will use a Laplace approximation, in which we make a second-order Taylor expansion of log p(θ | D, M) around its mode θ̂ (7). The result is a multivariate Gaussian approximation:

p(θ | D, M) ≈ N(θ; θ̂, Σ);    Σ^−1 = −∇^2 log p(θ | D, M) evaluated at θ = θ̂.    (8)

The Laplace approximation also results in an approximation to the model evidence:

log p(y | X, M) ≈ log p(y | X, θ̂, M) + log p(θ̂ | M) − (1/2) log det Σ^−1 + (d/2) log 2π,    (9)

where d is the dimension of θ [21, 22]. The Laplace approximation to the model evidence can be interpreted as rewarding explaining the data well while penalizing model complexity.
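Equation (9) is straightforward to evaluate once the mode θ̂ and the Hessian of the negative log posterior are available. A minimal sketch, assuming the Hessian is supplied externally (e.g., by automatic differentiation or finite differences):

```python
import numpy as np

def laplace_log_evidence(log_lik_at_mode, log_prior_at_mode, neg_log_post_hessian):
    """Laplace approximation to the log model evidence, eq. (9):
    log p(y|X,M) ≈ log p(y|X,θ̂,M) + log p(θ̂|M) − ½ log det Σ⁻¹ + (d/2) log 2π,
    where Σ⁻¹ is the Hessian of −log p(θ|D,M) at the mode θ̂."""
    d = neg_log_post_hessian.shape[0]
    # slogdet avoids overflow in the determinant of the d×d precision matrix
    _, logdet = np.linalg.slogdet(neg_log_post_hessian)
    return log_lik_at_mode + log_prior_at_mode - 0.5 * logdet + 0.5 * d * np.log(2 * np.pi)
```

A convenient sanity check: for a fully Gaussian model (prior θ ~ N(0, 1), likelihood y | θ ~ N(θ, 1), one observation y = 0) the approximation is exact, recovering log N(0; 0, 2) = −½ log 4π.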
Note that the Bayesian information criterion (BIC), commonly used for model selection, can be seen as an approximation to the Laplace approximation [23, 24].

3.2 Approximating the predictive distribution

We next consider the predictive distribution:

p(y∗ | x∗, D, M) = ∫ p(y∗ | f∗) [ ∫ p(f∗ | x∗, D, θ, M) p(θ | D, M) dθ ] df∗,    (10)

where the inner integral is p(f∗ | x∗, D, M). The posterior p(f∗ | x∗, D, θ, M) in (10) is typically a known Gaussian distribution, derived analytically for Gaussian observation likelihoods or approximately using standard approximate GP inference techniques [25, 26]. However, the integral over θ in (10) is intractable, even with a Gaussian approximation to the hyperparameter posterior as in (8).
Garnett et al. [11] introduced a mechanism for approximately marginalizing GP hyperparameters (called the MGP), which we adopt here due to its strong empirical performance. The MGP assumes that we have a Gaussian approximation to the hyperparameter posterior, p(θ | D, M) ≈ N(θ; θ̂, Σ).2 We define the posterior predictive mean and variance functions as

µ∗(θ) = E[f∗ | x∗, D, θ, M];    ν∗(θ) = Var[f∗ | x∗, D, θ, M].

The MGP works by making an expansion of the predictive distribution around the posterior mean hyperparameters θ̂. The nature of this expansion is chosen so as to match various derivatives of the true predictive distribution; see [11] for details.

1Using a noninformative prior p(θ | M) ∝ 1 in the case of maximum likelihood.
The posterior distribution of f∗ is approximated by

p(f∗ | x∗, D, M) ≈ N(f∗; µ∗(θ̂), σ^2_MGP),    (11)

where

σ^2_MGP = (4/3) ν∗(θ̂) + [∇µ∗(θ̂)]ᵀ Σ [∇µ∗(θ̂)] + (1 / (3 ν∗(θ̂))) [∇ν∗(θ̂)]ᵀ Σ [∇ν∗(θ̂)].    (12)

The MGP thus inflates the predictive variance from the posterior mean hyperparameters θ̂ by a term that is commensurate with the uncertainty in θ, measured by the posterior covariance Σ, and the dependence of the latent predictive mean and variance on θ, measured by the gradients ∇µ∗ and ∇ν∗. With the Gaussian approximation in (11), the integral in (10) now reduces to integrating the observation likelihood against a univariate Gaussian. This integral is often analytic [16] and at worst requires one-dimensional quadrature.

3.3 Implementation

Given the development above, we may now efficiently compute an approximation to the BAMS criterion for active GP model selection. Given currently observed data D, for each of our candidate models Mi, we first find the Laplace approximation to the hyperparameter posterior (8) and model evidence (9). Given the approximations to the model evidence, we may compute an approximation to the model posterior (2). Suppose we have a set of candidate points X∗ from which we may select our next point. For each of our models, we compute the MGP approximation (11) to the latent posteriors {p(f∗ | X∗, D, Mi)}, from which we use standard techniques to compute the predictive distributions {p(y∗ | X∗, D, Mi)}.
Finally, with the ability to compute the differential entropies of these model-conditional predictive distributions, as well as the marginal predictive distribution (5), we may compute the mutual information of each candidate in parallel. See the Appendix for explicit formulas for common likelihoods and a description of general-purpose, reusable code we will release in conjunction with this manuscript to ease implementation.

4 Audiometric threshold testing

Standard audiometric tests [5-7] are calibrated such that the average human subject has a 50% chance of hearing a tone at any frequency; this empirical unit of intensity is defined as 0 dB HL. Humans give binary reports (whether or not a tone was heard) in response to stimuli, and these observations are inherently noisy. Typical audiometric tests present tones in a predefined order on a grid, in increments of 5-10 dB HL at each of six octaves. Recently, Gardner et al. [8] demonstrated that Bayesian active learning of a patient's audiometric function significantly improves the state of the art in terms of accuracy and number of stimuli required.
However, learning a patient's entire audiometric function may not always be necessary. Audiometric testing is frequently performed on otherwise young and healthy patients to detect noise-induced hearing loss (NIHL), which occurs when an otherwise healthy individual is habitually subjected to high-intensity sound [27]. This can result in sharp, notch-shaped hearing loss in a narrow (sometimes less than one octave) frequency range. Early detection of NIHL is critical to desirable long-term clinical outcomes, so large-scale screenings of susceptible populations (for example, factory workers) are commonplace [28]. Noise-induced hearing loss is difficult to diagnose with standard audiometry, because a frequency-intensity grid must be very fine to ensure that a notch is detected.
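For the binary (heard / not heard) responses used in audiometric testing, the entropies in (4) and (5) are Bernoulli entropies, so the BAMS criterion for every candidate tone reduces to a few vectorized operations. A minimal sketch, assuming each model's predictive detection probabilities have already been computed as in Section 3.3:

```python
import numpy as np

def bernoulli_entropy(p):
    """Entropy of a Bernoulli(p) observation, in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def bams_scores(pred_probs, model_post):
    """Mutual information (4) for each candidate.
    pred_probs: (num_models, num_candidates) array, Pr(y* = 1 | x*, D, M_i);
    model_post: (num_models,) array, Pr(M_i | D)."""
    marginal = model_post @ pred_probs  # model-marginal predictive, eq. (5)
    conditional = model_post @ bernoulli_entropy(pred_probs)
    return bernoulli_entropy(marginal) - conditional

# query the candidate maximizing mutual information, eq. (6):
# next_idx = np.argmax(bams_scores(pred_probs, model_post))
```

A tone on which all models agree scores near zero; a tone that two equally probable models predict oppositely scores near log 2, the maximum for a binary response.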
The full audiometric test of Gardner et al. [8] may also be inefficient if the only goal of testing is to determine whether a notch is present, as would be the case for large-scale screening.
We cast the detection of noise-induced hearing loss as an active model selection problem. We will describe two Gaussian process models of audiometric functions: a baseline model of normal human hearing, and a model reflecting NIHL. We then use the BAMS framework introduced above to determine, as rapidly as possible for a given patient, which model best describes his or her hearing.
Normal-patient model. To model a healthy patient's audiometric function, we use the model described in [8]. The GP prior proposed in that work combines a constant prior mean µhealthy = c (modeling a frequency-independent natural threshold) with a kernel taken to be the sum of two components: a linear covariance in intensity and a squared-exponential covariance in frequency. Let [i, φ] represent a tone stimulus, with i representing its intensity and φ its frequency. We define:

K([i, φ], [i′, φ′]) = α i i′ + β exp(−|φ − φ′|^2 / (2ℓ^2)),    (13)

where α, β > 0 weight each component and ℓ > 0 is a length scale of frequency-dependent random deviations from a constant hearing threshold. This kernel encodes two fundamental properties of human audiologic response. First, hearing is monotonic in intensity: the linear contribution α i i′ ensures that the posterior probability of detecting a fixed frequency will be monotonically increasing after conditioning on a few tones. Second, human hearing ability is locally smooth in frequency, because nearby locations in the cochlea are mechanically coupled.

2This is arbitrary and need not be the Laplace approximation in (8), so this is a slight abuse of notation.
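The kernel in (13) is simple to implement for a pair of tone stimuli. A minimal sketch (the hyperparameter values shown are illustrative placeholders, not values from the paper):

```python
import numpy as np

def k_tone(x, x2, alpha=1.0, beta=1.0, ell=1.0):
    """Covariance (13) between tone stimuli x = [i, phi] and x2 = [i2, phi2]:
    linear in intensity plus squared-exponential in frequency."""
    (i, phi), (i2, phi2) = x, x2
    return alpha * i * i2 + beta * np.exp(-0.5 * abs(phi - phi2) ** 2 / ell ** 2)
```

The linear term grows with the product of intensities (driving monotonicity in intensity after conditioning), while the squared-exponential term decays with frequency separation (local smoothness in frequency).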
The combination of µhealthy with K specifies our healthy model Mhealthy, with parameters θhealthy = [c, α, β, ℓ]ᵀ.
Noise-induced hearing loss model. We extend the model above to create a second GP model reflecting a localized, notch-shaped hearing deficit characteristic of NIHL. We create a novel, flexible prior mean function for this purpose, the parameters of which specify the exact nature of the hearing loss. Our proposed notch mean is:

µNIHL(i, φ) = c − d N′(φ; ν, w^2),    (14)

where N′(φ; ν, w^2) denotes the unnormalized normal probability density function with mean ν and standard deviation w, which we scale by a depth parameter d > 0 to reflect the prominence of the hearing loss. This contribution results in a localized subtractive notch feature with tunable center, width, and depth. We retain a constant offset c to revert to the normal-hearing model outside the vicinity of the localized hearing deficit. Note that we completely model the effect of NIHL on patient responses with this mean notch function; the kernel K above remains appropriate. The combination of µNIHL with K specifies our NIHL model MNIHL with, in addition to the parameters of our healthy model, the additional parameters θNIHL = [ν, w, d]ᵀ.

5 Results

To test BAMS on our NIHL detection task, we evaluate our algorithm using audiometric data, comparing to several baselines. From the results of a small-scale clinical trial, we have examples of high-fidelity audiometric functions inferred for several human patients using the method of Gardner et al. [8]. We may use these to simulate audiometric examinations of healthy patients using different methods to select tone presentations. We simulate patients with NIHL by adjusting ground truth inferred from nine healthy patients with in-model samples from our notch mean prior.
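The notch perturbation added to the healthy ground truth follows the mean in (14), which depends only on frequency. A minimal sketch (the notch parameters ν, w, d shown are illustrative, not draws from the paper's expert-informed prior):

```python
import numpy as np

def mu_nihl(phi, c=0.0, d=20.0, nu=11.0, w=0.5):
    """Notch mean (14): constant threshold c minus an unnormalized Gaussian
    notch of depth d, centered at frequency nu (log2 Hz) with width w."""
    return c - d * np.exp(-0.5 * (phi - nu) ** 2 / w ** 2)
```

At the notch center the mean drops by exactly d; a few widths away it reverts to the constant offset c, matching the normal-hearing model.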
Recall that high-resolution audiogram data is extremely scarce.
We first took a thorough pure-tone audiometric test of each of nine patients from our trial with normal hearing, using 100 samples selected by the algorithm in [8] on the domain X = [250, 8000] Hz × [−10, 80] dB HL,3 typical ranges for audiometric testing [6]. We inferred the audiometric function over the entire domain from the measured responses, using the healthy-patient GP model Mhealthy with parameters learned via MLE-II inference. The observation model was p(y = 1 | f) = Φ(f), where Φ is the standard normal CDF, and approximate GP inference was performed via a Laplace approximation. We then used the approximate GP posterior p(f | D, θ̂, Mhealthy) for this patient as ground truth for simulating a healthy patient's responses. The posterior probability of tone detection learned from one patient is shown in the background of Figure 1(a). We simulated a healthy patient's response to a given query tone x∗ = [i∗, φ∗] by sampling a conditionally independent Bernoulli random variable with parameter p(y∗ = 1 | x∗, D, θ̂, Mhealthy).
We simulated a patient with NIHL by then drawing notch parameters (the parameters of (14)) from an expert-informed prior, adding the corresponding notch to the learned healthy ground-truth latent mean, recomputing the detection probabilities, and proceeding as above. Example NIHL ground-truth detection probabilities generated in this manner are depicted in the background of Figure 1(b).

3Inference was done in the log-frequency domain.

(a) Normal hearing model ground truth. (b) Notch model ground truth.

Figure 1: Samples selected by BAMS (red) and the method of Gardner et al. [8] (white) when run on (a) the normal-hearing ground truth and (b) the NIHL model ground truth. Contours denote probability of detection at 10% intervals.
Circles indicate presentations that were heard by the simulated patient; exes indicate presentations that were not heard.

5.1 Diagnosing NIHL

To test our active model-selection approach to diagnosing NIHL, we simulated a series of audiometric tests, selecting tones using three alternatives: BAMS, the algorithm of [8], and random sampling.4 Each algorithm shared a candidate set of 10 000 quasirandom tones X∗, generated using a scrambled Halton set so as to densely cover the two-dimensional search space. We simulated nine healthy patients and a total of 27 patients exhibiting a range of NIHL presentations, using independent draws from our notch mean prior in the latter case. For each audiometric test simulation, we initialized with five random tones, then allowed each algorithm to actively select a maximum of 25 additional tones, a very small fraction of the hundreds typically used in a regular audiometric test. We repeated this procedure for each of our nine healthy patients using the normal-patient ground-truth model. We further simulated, for each patient, three separate presentations of NIHL as described above. We plot the posterior probability of the correct model after each iteration for each method in Figure 2.
In all runs with both ground-truth models, BAMS was able to rapidly achieve greater than 99% confidence in the correct model without expending the entire budget. Although all methods correctly inferred high healthy posterior probability for healthy patients, BAMS was more confident. For the NIHL patients, neither baseline inferred the correct model, whereas BAMS rarely required more than 15 actively chosen samples to confidently make the correct diagnosis. Note that, when BAMS was used on NIHL patients, there was often an initial period during which the healthy model was favored, followed by a rapid shift towards the correct model.
This is because our method penalizes the increased complexity of the notch model until sufficient evidence for a notch is acquired.

4We also compared with uncertainty sampling and query by committee (QBC); the performance was comparable to random sampling and is omitted for clarity.

(a) Notch model ground truth. (b) Normal hearing model ground truth.

Figure 2: Posterior probability of the correct model as a function of iteration number.

Figure 1 shows the samples selected by BAMS for typical healthy and NIHL patients. The fundamental strategy employed by BAMS in this application is logical: it samples in a row of relatively high-intensity tones. The intuition for this design is that failure to recognize a normally heard, high-intensity sound is strong evidence of a notch deficit. Once the notch has been found (Figure 1(b)), BAMS continues to sample within the notch to confirm its existence and rule out the possibility of the miss (tone not heard) being due to the stochasticity of the process. Once satisfied, the BAMS approach then samples on the periphery of the notch to further solidify its belief.
The BAMS algorithm sequentially makes observations where the healthy and NIHL models disagree the most, typically in the top-center of the MAP notch location. The exact intensity at which BAMS samples is determined by the prior over the notch-depth parameter d. When we changed the notch-depth prior to support shallower or deeper notches (data not shown), BAMS sampled at lower or higher intensities, respectively, to continue to maximize model disagreement. Similarly, the spacing between samples is controlled by the prior over the notch-width parameter w.
Finally, it is worth emphasizing the stark difference between the sampling pattern of BAMS and the audiometric tests of [8]; see Figure 1.
Indeed, when the goal is learning the patient's audiometric function, the audiometric testing algorithm proposed in that work typically has a very good estimate after 20 samples. However, when using BAMS, the primary goal is to detect or rule out NIHL. As a result, the samples selected by BAMS reveal little about the nuances of the patient's audiometric function, while being highly informative about the correct model to explain the data. This is precisely the tradeoff one seeks in a large-scale diagnostic setting, highlighting the critical importance of focusing on the model-selection problem directly.

6 Conclusion

We introduced a novel information-theoretic approach for active model selection, Bayesian active model selection, and successfully applied it to rapid screening for noise-induced hearing loss. Our method for active model selection does not require model retraining to evaluate candidate points, making it more feasible than previous approaches. Further, we provided an effective and efficient analytic approximation to our criterion that can be used for automatically learning the model class of Gaussian processes with arbitrary observation likelihoods, a rich and commonly used class of potential models.

Acknowledgments

This material is based upon work supported by the National Science Foundation (NSF) under award number IIA-1355406. Additionally, JRG and KQW are supported by NSF grants IIS-1525919, IIS-1550179, and EFMA-1137211; GM is supported by CAPES/BR; DB acknowledges NIH grant R01-DC009215 as well as the CIMIT; JPC acknowledges the Sloan Foundation.

References

[1] I. Kononenko. Machine Learning for Medical Diagnosis: History, State of the Art and Perspective. Artificial Intelligence in Medicine, 23(1):89–109, 2001.

[2] S. Saria, A. K. Rajani, J. Gould, D. L. Koller, and A. A. Penn. Integration of Early Physiological Responses Predicts Later Illness Severity in Preterm Infants.
Science Translational Medicine, 2(48):48ra65, 2010.

[3] T. C. Bailey, Y. Chen, Y. Mao, C. Lu, G. Hackmann, S. T. Micek, K. M. Heard, K. M. Faulkner, and M. H. Kollef. A Trial of a Real-Time Alert for Clinical Deterioration in Patients Hospitalized on General Medical Wards. Journal of Hospital Medicine, 8(5):236–242, 2013.

[4] J. Shargorodsky, S. G. Curhan, G. C. Curhan, and R. Eavey. Change in Prevalence of Hearing Loss in US Adolescents. Journal of the American Medical Association, 304(7):772–778, 2010.

[5] R. Carhart and J. Jerger. Preferred Method for Clinical Determination of Pure-Tone Thresholds. Journal of Speech and Hearing Disorders, 24(4):330–345, 1959.

[6] W. Hughson and H. Westlake. Manual for program outline for rehabilitation of aural casualties both military and civilian. Transactions of the American Academy of Ophthalmology and Otolaryngology, 48(Supplement):1–15, 1944.

[7] M. Don, J. J. Eggermont, and D. E. Brackmann. Reconstruction of the audiogram using brain stem responses and high-pass noise masking. Annals of Otology, Rhinology and Laryngology, 88(3 Part 2, Supplement 57):1–20, 1979.

[8] J. R. Gardner, X. Song, K. Q. Weinberger, D. Barbour, and J. P. Cunningham. Psychophysical Detection Testing with Bayesian Active Learning. In UAI, 2015.

[9] D. J. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[10] N. Houlsby, F. Huszár, Z. Ghahramani, and J. M. Hernández-Lobato. Collaborative Gaussian Processes for Preference Learning. In NIPS, pages 2096–2104, 2012.

[11] R. Garnett, M. A. Osborne, and P. Hennig. Active Learning of Linear Embeddings for Gaussian Processes. In UAI, pages 230–239, 2014.

[12] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani.
Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. In NIPS, pages 918–926, 2014.

[13] N. Houlsby, J. M. Hernández-Lobato, and Z. Ghahramani. Cold-start Active Learning with Robust Ordinal Matrix Factorization. In ICML, pages 766–774, 2014.

[14] A. Ali, R. Caruana, and A. Kapoor. Active Learning with Model Selection. In AAAI, 2014.

[15] J. Kulick, R. Lieck, and M. Toussaint. Active Learning of Hyperparameters: An Expected Cross Entropy Criterion for Active Model Selection. CoRR, abs/1409.7552, 2014.

[16] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[17] D. Duvenaud. Automatic Model Construction with Gaussian Processes. PhD thesis, Computational and Biological Learning Laboratory, University of Cambridge, 2014.

[18] D. Duvenaud, J. R. Lloyd, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Structure Discovery in Nonparametric Regression through Compositional Kernel Search. In ICML, pages 1166–1174, 2013.

[19] A. G. Wilson, E. Gilboa, A. Nehorai, and J. P. Cunningham. Fast Kernel Learning for Multidimensional Pattern Extrapolation. In NIPS, pages 3626–3634, 2014.

[20] C. Williams and C. Rasmussen. Gaussian processes for regression. In NIPS, 1996.

[21] A. E. Raftery. Approximate Bayes Factors and Accounting for Model Uncertainty in Generalised Linear Models. Biometrika, 83(2):251–266, 1996.

[22] J. Kuha. AIC and BIC: Comparisons of Assumptions and Performance. Sociological Methods and Research, 33(2):188–229, 2004.

[23] G. Schwarz. Estimating the Dimension of a Model. Annals of Statistics, 6(2):461–464, 1978.

[24] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[25] M. Kuss and C. E. Rasmussen. Assessing Approximate Inference for Binary Gaussian Process Classification. Journal of Machine Learning Research, 6:1679–1704, 2005.

[26] T. P. Minka.
Expectation Propagation for Approximate Bayesian Inference. In UAI, pages 362–369, 2001.

[27] D. McBride and S. Williams. Audiometric notch as a sign of noise induced hearing loss. Occupational and Environmental Medicine, 58(1):46–51, 2001.

[28] D. I. Nelson, R. Y. Nelson, M. Concha-Barrientos, and M. Fingerhut. The global burden of occupational noise-induced hearing loss. American Journal of Industrial Medicine, 48(6):446–458, 2005.