{"title": "Active Learning of Model Evidence Using Bayesian Quadrature", "book": "Advances in Neural Information Processing Systems", "page_first": 46, "page_last": 54, "abstract": "Numerical integration is a key component of many problems in scientific computing, statistical modelling, and machine learning. Bayesian Quadrature is a model-based method for numerical integration which, relative to standard Monte Carlo methods, offers increased sample efficiency and a more robust estimate of the uncertainty in the estimated integral. We propose a novel Bayesian Quadrature approach for numerical integration when the integrand is non-negative, such as the case of computing the marginal likelihood, predictive distribution, or normalising constant of a probabilistic model. Our approach approximately marginalises the quadrature model's hyperparameters in closed form, and introduces an active learning scheme to optimally select function evaluations, as opposed to using Monte Carlo samples. We demonstrate our method on both a number of synthetic benchmarks and a real scientific problem from astronomy.", "full_text": "Active Learning of Model Evidence\n\nUsing Bayesian Quadrature\n\nMichael A. Osborne\nUniversity of Oxford\n\nmosb@robots.ox.ac.uk\n\nDavid Duvenaud\n\nUniversity of Cambridge\ndkd23@cam.ac.uk\n\nRoman Garnett\n\nCarnegie Mellon University\nrgarnett@cs.cmu.edu\n\nCarl E. Rasmussen\n\nUniversity of Cambridge\ncer54@cam.ac.uk\n\nStephen J. Roberts\nUniversity of Oxford\n\nZoubin Ghahramani\nUniversity of Cambridge\n\nsjrob@robots.ox.ac.uk\n\nzoubin@eng.cam.ac.uk\n\nAbstract\n\nNumerical integration is a key component of many problems in scienti\ufb01c comput-\ning, statistical modelling, and machine learning. Bayesian Quadrature is a model-\nbased method for numerical integration which, relative to standard Monte Carlo\nmethods, offers increased sample ef\ufb01ciency and a more robust estimate of the\nuncertainty in the estimated integral. 
We propose a novel Bayesian Quadrature\napproach for numerical integration when the integrand is non-negative, such as\nthe case of computing the marginal likelihood, predictive distribution, or normal-\nising constant of a probabilistic model. Our approach approximately marginalises\nthe quadrature model\u2019s hyperparameters in closed form, and introduces an ac-\ntive learning scheme to optimally select function evaluations, as opposed to using\nMonte Carlo samples. We demonstrate our method on both a number of synthetic\nbenchmarks and a real scienti\ufb01c problem from astronomy.\n\n1 Introduction\n\nThe \ufb01tting of complex models to big data often requires computationally intractable integrals to be\napproximated. In particular, machine learning applications often require integrals over probabilities\n\nZ = \u27e8\u2113\u27e9 = \u222b \u2113(x)p(x)dx,\n\n(1)\n\nwhere \u2113(x) is non-negative. Examples include computing marginal likelihoods, partition functions,\npredictive distributions at test points, and integrating over (latent) variables or parameters in a model.\nWhile the methods we will describe are applicable to all such problems, we will explicitly con-\nsider computing model evidences, where \u2113(x) is the unnormalised likelihood of some parameters\nx1, . . . , xD. This is a particular challenge in modelling big data, where evaluating the likelihood\nover the entire dataset is extremely computationally demanding.\n\nThere exist several standard randomised methods for computing model evidence, such as annealed\nimportance sampling (AIS) [1], nested sampling [2] and bridge sampling. For a review, see [3].\nThese methods estimate Z given the value of the integrand on a set of sample points, whose size is\nlimited by the expense of evaluating \u2113(x). It is well known that convergence diagnostics are often\nunreliable for Monte Carlo estimates of partition functions [4, 5, 6]. 
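As a concrete instance of (1): for a single Gaussian observation with a Gaussian prior over its mean, Z is available in closed form, which makes a convenient sanity check for any estimator. A minimal sketch (the particular numbers are assumptions for illustration only):

```python
import numpy as np

def npdf(y, m, v):
    """Gaussian density N(y; m, v), parameterised by variance v."""
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

# Toy model: likelihood l(x) = N(d; x, sigma2) for one datum d, and
# prior p(x) = N(x; nu, lam); then Z = N(d; nu, sigma2 + lam) exactly.
d, sigma2 = 0.3, 0.5
nu, lam = 0.0, 1.0

# Dense-grid quadrature as a stand-in for the integral in (1).
x = np.linspace(-10.0, 10.0, 20001)
Z_grid = np.sum(npdf(d, x, sigma2) * npdf(x, nu, lam)) * (x[1] - x[0])

Z_exact = npdf(d, nu, sigma2 + lam)
print(Z_grid, Z_exact)  # the two agree to several decimal places
```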
Most such algorithms also have\nparameters which must be set by hand, such as proposal distributions or annealing schedules.\n\nAn alternative, model-based, approach is Bayesian Quadrature (BQ) [7, 8, 9, 10], which speci\ufb01es\na distribution over likelihood functions, using observations of the likelihood to infer a distribution\n\n1\n\n\fFigure 1: Model-based integration computes a posterior for the integral Z = \u222b \u2113(x)p(x)dx, condi-\ntioned on sampled values of the function \u2113(x). For this plot, we assume a Gaussian process model\nfor \u2113(x) and a broad Gaussian prior p(x). The variously probable integrands permitted under the\nmodel will give different possible values for Z, with associated differing probabilities.\n\nfor Z (see Figure 1). This approach offers improved sample ef\ufb01ciency [10], crucial for expensive\nsamples computed on big data. We improve upon this existing work in three ways:\nLog-GP: [10] used a GP prior on the likelihood function; this is a poor model in this case, unable to\nexpress the non-negativity and high dynamic range of most likelihood functions. [11] introduced an\napproximate means of exploiting a GP on the logarithm of a function (henceforth, a log-GP), which\nbetter captures these properties of likelihood functions. We apply this method to estimate Z, and\nextend it to compute Z\u2019s posterior variance and expected variance after adding a sample.\nActive Sampling: Previous work on BQ has used randomised or a priori \ufb01xed sampling schedules.\nWe use active sampling, selecting locations which minimise the expected uncertainty in Z.\nHyperparameter Marginalisation: Uncertainty in the hyperparameters of the model used for\nquadrature has previously been ignored, leading to overcon\ufb01dence in the estimate of Z. 
We in-\ntroduce a tractable approximate marginalisation of input scale hyperparameters.\n\nFrom a Bayesian perspective, numerical integration is fundamentally an inference and sequential\ndecision making problem: given a set of function evaluations, what can we infer about the integral,\nand how do we decide where to next evaluate the function? Monte Carlo methods, including MCMC,\nprovide simple but generally suboptimal and non-adaptive answers: compute a sample mean, and\nevaluate randomly. Our approach attempts to learn about the integrand as it evaluates the function\nat different points, and decide based on information gain where to evaluate next. We compare\nour approach against standard Monte Carlo techniques and previous Bayesian approaches on both\nsimulated and real problems.\n\n2 Bayesian Quadrature\n\nBayesian quadrature [8, 10] is a means of performing Bayesian inference about the value of a\npotentially nonanalytic integral, \u27e8f\u27e9 := \u222b f(x)p(x)dx. For clarity, we henceforth assume the do-\nmain of integration X = \u211d, although all results generalise to \u211d\u207f. We assume a Gaussian density\np(x) := N(x; \u03bdx, \u03bbx), although other convenient forms, or, if necessary, the use of an importance\nre-weighting trick (writing q(x) = [q(x)/p(x)] p(x) for any q(x)), allow any other integral to be approximated.\nQuadrature involves evaluating f(x) at a vector of sample points xs, giving f s := f(xs). Often this\nevaluation is computationally expensive; the consequent sparsity of samples introduces uncertainty\nabout the function f between them, and hence uncertainty about the integral \u27e8f\u27e9.\nPrevious work on BQ chooses a Gaussian process (GP) [12] prior for f, with mean \u00b5f and Gaussian\ncovariance function\n\nK(x1, x2) := h\u00b2 N(x1; x2, w) .\n\n(2)\n\nHere hyperparameter h speci\ufb01es the output scale, while hyperparameter w de\ufb01nes a (squared) input\nscale over x. 
These scales are typically \ufb01tted using type two maximum likelihood (MLII); we will\nlater introduce an approximate means of marginalising them in Section 4. We\u2019ll use the following\ndense notation for the standard GP expressions for the posterior mean m, covariance C, and variance\nV, respectively: mf|s(x\u22c6) := m(f\u22c6|f s), Cf|s(x\u22c6, x\u2032\u22c6) := C(f\u22c6, f\u2032\u22c6|f s) and Vf|s(x\u22c6) := V(f\u22c6|f s).\n\n2\n\n\fFigure 2: A GP \ufb01tted to a peaked log-likelihood function is typically a better model than a GP \ufb01t to\nthe likelihood function (which is non-negative and has high dynamic range). The former GP also\nusually has the longer input scale, allowing it to generalise better to distant parts of the function.\n\nNote that this notation assumes implicit conditioning on hyperparameters. Where required for dis-\nambiguation, we\u2019ll make this explicit, as per mf|s,w(x\u22c6) := m(f\u22c6|f s, w) and so forth.\nVariables possessing a multivariate Gaussian distribution are jointly Gaussian distributed with any\naf\ufb01ne transformations of those variables. Because integration is af\ufb01ne, we can hence use computed\nsamples f s to perform analytic Gaussian process inference about the value of integrals over f(x),\nsuch as \u27e8f\u27e9. The mean estimate for \u27e8f\u27e9 given f s is\n\nm(\u27e8f\u27e9|f s) = \u222c \u27e8f\u27e9 p(\u27e8f\u27e9|f) p(f|f s) d\u27e8f\u27e9 df\n= \u222c \u27e8f\u27e9 \u03b4(\u27e8f\u27e9 \u2212 \u222b f(x) p(x) dx) N(f; mf|s, Cf|s) d\u27e8f\u27e9 df\n= \u222b mf|s(x) p(x) dx ,\n\n(3)\n\nwhich is expressible in closed-form due to standard Gaussian identities [10]. The corresponding\nclosed-form expression for the posterior variance of \u27e8f\u27e9 lends itself as a natural convergence diag-\nnostic. Similarly, we can compute the posteriors for integrals over the product of multiple, indepen-\ndent functions. 
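With the Gaussian covariance (2) and a Gaussian prior p(x) = N(x; ν, λ), the integrals in (3) are analytic: defining z_i := ∫ K(x, x_i) p(x) dx = h² N(x_i; ν, w + λ), the posterior mean is m(⟨f⟩|f_s) = zᵀ K(x_s, x_s)⁻¹ f_s. A minimal sketch of this fixed-hyperparameter case (the hyperparameter values and the test integrand are assumptions for illustration, not taken from the paper):

```python
import numpy as np

def npdf(y, m, v):
    """Gaussian density N(y; m, v), parameterised by variance v."""
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def bq_mean(xs, fs, h2, w, nu, lam, jitter=1e-9):
    """Posterior mean of <f> = int f(x) N(x; nu, lam) dx under a zero-mean GP
    prior with covariance K(x, x') = h2 N(x; x', w), as in (2) and (3)."""
    K = h2 * npdf(xs[:, None], xs[None, :], w) + jitter * np.eye(len(xs))
    z = h2 * npdf(xs, nu, w + lam)  # z_i = int K(x, x_i) p(x) dx, closed form
    return z @ np.linalg.solve(K, fs)

# Example integrand: f(x) = exp(-x^2); for p = N(0, 1) the true value of <f>
# is 1/sqrt(3), so the quadrature estimate can be checked directly.
xs = np.linspace(-4.0, 4.0, 25)
est = bq_mean(xs, np.exp(-xs ** 2), h2=1.0, w=0.4, nu=0.0, lam=1.0)
print(est)  # close to 1/sqrt(3) ~ 0.577
```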
For example, we can calculate the posterior mean m(\u27e8fg\u27e9|f s, gs) for an integral\n\u222b f(x)g(x)p(x)dx. In the following three sections, we will expand upon the improvements this\npaper introduces in the use of Bayesian Quadrature for computing model evidences.\n\n3 Modelling Likelihood Functions\n\nWe wish to evaluate the evidence (1), an integral over non-negative likelihoods, \u2113(x). Assigning\na standard GP prior to \u2113(x) ignores prior information about the range and non-negativity of \u2113(x),\nleading to pathologies such as potentially negative evidences (as observed in [10]). A much better\nprior would be a GP prior on log \u2113(x) (see Figure 2). However, the resulting integral is intractable,\n\nm(Z|log \u2113s) = \u222b (\u222b exp(log \u2113(x)) p(x) dx) N(log \u2113; mlog \u2113|s, Clog \u2113|s) dlog \u2113 ,\n\n(4)\n\nas (4) does not possess the af\ufb01ne property exploited in (3). To progress, we adopt an approximate\ninference method inspired by [11] to tractably integrate under a log-GP prior.1 Speci\ufb01cally, we\nlinearise the problematic exponential term around some point log \u21130(x), as\n\nexp(log \u2113(x)) \u2243 exp(log \u21130(x)) + exp(log \u21130(x)) (log \u2113(x) \u2212 log \u21130(x)) .\n\n(5)\n\nThe integral (4) consists of the product of Z and a GP for log \u2113. The former is \u223c exp log \u2113, the\nlatter is \u223c exp(\u2212(log \u2113 \u2212 m)\u00b2), effectively permitting only a small range of log \u2113 functions. Over\nthis narrow region, it is reasonable to assume that Z does not vary too dramatically, and can be\napproximated as linear in log \u2113, as is assumed by (5). 
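The linearisation (5) is simply a first-order Taylor expansion of exp about log ℓ0, so its relative error decays quadratically in Δ := log ℓ − log ℓ0. A quick numerical check (the expansion point is chosen arbitrarily):

```python
import numpy as np

def linearised_exp(log_l, log_l0):
    """First-order expansion of exp(log_l) about log_l0, as in (5)."""
    return np.exp(log_l0) + np.exp(log_l0) * (log_l - log_l0)

log_l0 = 2.0
for delta in [0.001, 0.01, 0.1]:
    exact = np.exp(log_l0 + delta)
    approx = linearised_exp(log_l0 + delta, log_l0)
    # relative error is ~ delta^2 / 2, so it shrinks fast as delta -> 0
    print(delta, abs(exact - approx) / exact)
```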
Using this approximation, and making the\nde\ufb01nition \u2206log \u2113|s := mlog \u2113|s \u2212 log \u21130, we arrive at\n\nm(Z|log \u2113s) \u2243 m(Z|log \u21130, log \u2113s) := \u222b \u21130(x)p(x) dx + \u222b \u21130(x)\u2206log \u2113|s(x)p(x) dx .\n\n(6)\n\n1In practice, we use the transform log (\u2113(x) + 1), allowing us to assume the transformed quantity has zero\nmean. For the sake of simplicity, we omit this detail in the following derivations.\n\n3\n\n\fFigure 3: Our approximate use of a GP for log \u2113(x) improves upon the use of a GP for \u2113(x) alone.\nHere the \u2018\ufb01nal approx\u2019 is m\u2113|s(1 + \u2206log \u2113|s), from (5) and (6).\n\nWe now choose \u21130 to allow us to resolve the \ufb01rst integral in (6). First, we introduce a secondary GP\nmodel for \u2113, the non-log space, and choose \u21130 := m\u2113|s, where m\u2113|s is the standard GP conditional\nmean for \u2113 given observations \u2113(xs). For both GPs2 (over both log and non-log spaces), we take\nzero prior means and Gaussian covariances of the form (2). It is reasonable to use zero prior means:\n\u2113(x) is expected to be negligible except at a small number of peaks. If a quantity is dependent upon\nthe GP prior for \u2113, it will be represented as conditional on \u2113s; if dependent upon the former GP prior\nover log \u2113, it will be conditional upon log \u2113s. We expect \u2206log \u2113|s(x) to be small everywhere relative\nto the magnitude of log \u2113(x) (see Figure 3). Hence log \u21130 is close to the peaks of the Gaussian over\nlog \u2113, rendering our linearisation appropriate. 
For \u21130, the \ufb01rst integral in (6) becomes tractable.\nUnfortunately, the second integral in (6) is non-analytic due to the log \u21130 term within \u2206log \u2113|s. As\nsuch, we perform another stage of Bayesian quadrature by treating \u2206log \u2113|s as an unknown function\nof x. For tractability, we assume this prior is independent of the prior for log \u2113. We use another GP\nfor \u2206log \u2113|s, with zero prior mean and Gaussian covariance (2). A zero prior mean here is reasonable:\n\u2206log \u2113|s is exactly zero at xs, and tends to zero far away from xs, where both mlog \u2113|s and log \u21130 are\ngiven by the compatible prior means for log \u2113 and \u2113. We must now choose candidate points xc at\nwhich to evaluate the \u2206log \u2113|s function (note we do not need to evaluate \u2113(xc) in order to compute\n\u2206c := \u2206log \u2113|s(xc)). xc should \ufb01rstly include xs, where we know that \u2206log \u2113|s is equal to zero.\nWe select the remainder of xc at random on the hyper-ellipses (whose axes are de\ufb01ned by the input\nscales for \u2113) surrounding existing observations; we expect \u2206log \u2113|s to be extremised at such xc. We\nlimit ourselves to a number of candidates that scales linearly with the dimensionality of the integral\nfor all experiments.\nGiven these candidates, we can now marginalise (6) over \u2206log \u2113|s to give\n\nm(Z|log \u2113s) \u2243 m(Z|log \u21130, log \u2113s, \u2206c) = m(Z|\u2113s) + m(cid:0)h\u2113\u2206log \u2113|si(cid:12)(cid:12)\u2113s, \u2206c(cid:1) ,\n\n(7)\nwhere both terms are analytic as per Section 2; m(Z|\u2113s) is of the form (3). The correction factor,\nthe second term in (7), is expected to be small, since \u2206log \u2113|s is small. 
We extend the work of [11]\nto additionally calculate the variance in the evidence,\n\nV(Z|log \u21130, log \u2113s, \u2206c) = S(Z|log \u21130, log \u2113s) \u2212 m(Z|log \u21130, log \u2113s, \u2206c)\u00b2 ,\n\n(8)\n\nwhere the second moment is\n\nS(Z|log \u21130, log \u2113s) := m(\u27e8\u2113 Clog \u2113|s \u2113\u27e9 | log \u2113s) + m(Z|log \u21130, log \u2113s, \u2206c)\u00b2 ,\n\n(9)\n\nand hence\n\nV(Z|log \u21130, log \u2113s, \u2206c) = m(\u27e8\u2113 Clog \u2113|s \u2113\u27e9 | log \u2113s) := \u222c m\u2113|s(x)m\u2113|s(x\u2032)Clog \u2113|s(x, x\u2032)p(x)p(x\u2032) dx dx\u2032 ,\n\n(10)\n\nwhich is expressible in closed form, although space precludes us from doing so. This variance can\nbe employed as a convergence diagnostic; it describes our uncertainty in the model evidence Z.\n\n2Note that separately modelling \u2113 and log \u2113 is not inconsistent: we use the posterior mean of the GP for \u2113\nonly as a convenient parameterisation for \u21130; we do not treat this GP as a full probabilistic model. While this\nmodelling choice may seem excessive, this approach provides signi\ufb01cant advantages in the sampling ef\ufb01ciency\nof the overall algorithm by approximately capturing the non-negativity of our integrand and allowing active\nsampling.\n\n4\n\n\fFigure 4: a) Integrating hyperparameters increases the marginal posterior variance (in regions whose\nmean varies as the input scales change) to more closely match the true posterior marginal variance.\nb) An example showing the expected uncertainty in the evidence after observing the likelihood\nfunction at that location. p(x) and \u2113(x) are plotted at the top in green and black respectively, the\nnext sample location in red. 
Note the model discovering a new mode on the right hand side, sampling\naround it, then moving on to other regions of high uncertainty on the left hand side.\n\nIn summary, we have described a linearisation approach to exploiting a GP prior over log-likelihoods;\nthis permitted the calculation of the analytic posterior mean (7) and variance (10) of Z. Note that our\napproximation will improve with increasing numbers of samples: \u2206log \u2113|s will eventually be small\neverywhere, since it is clamped to zero at each observation. The quality of the linearisation can also\nbe improved by increasing the number of candidate locations, at the cost of slower computation.\n\n4 Marginalising hyperparameters\n\nWe now present a novel means of approximately marginalising the hyperparameters of the GP used\nto model the log-integrand, log \u2113. In previous approaches to Bayesian Quadrature, hyperparameters\nwere estimated using MLII, which approximates the likelihood as a delta function. However, ignor-\ning the uncertainty in the hyperparameters can lead to pathologies. In particular, the reliability of\nthe variance for Z depends crucially upon marginalising over all unknown quantities.\nThe hyperparameters of most interest are the input scales w for the GP over the log-likelihood;\nthese hyperparameters can have a powerful in\ufb02uence on the \ufb01t to a function. We use MLII to \ufb01t all\nhyperparameters other than w. Marginalisation of w is confounded by the complex dependence of\nour predictions upon these input scales. 
We make the following essential approximations:\nFlat prior: We assume that the prior for w is broad, so that our posterior is the normalised likelihood.\nLaplace approximation: p(log \u2113s|w) is taken as Gaussian with mean equal to the MLII value \u02c6w and\nwith diagonal covariance Cw, diagonal elements \ufb01tted using the second derivatives of the likelihood.\nWe represent the posterior mean for log \u2113 conditioned on \u02c6w as \u02c6m := mlog \u2113|s,\u02c6w.\nGP mean af\ufb01ne in w: Given the narrow width of the likelihood for w, p(log \u2113|log \u2113s, w) is approx-\nimated as having a GP mean which is af\ufb01ne in w around the MLII values, and a constant covariance;\nmlog \u2113|s,w \u2243 \u02c6m + (\u2202\u02c6m/\u2202w)(w \u2212 \u02c6w) and Clog \u2113|s,w \u2243 Clog \u2113|s,\u02c6w.\nThe implication of these approximations is that the marginal posterior mean over log \u2113 is simply\n\u02dcmlog \u2113|s := mlog \u2113|s,\u02c6w. The marginal posterior variance is \u02dcClog \u2113|s := Clog \u2113|s,\u02c6w + (\u2202\u02c6m/\u2202w) Cw (\u2202\u02c6m/\u2202w).\nAn example of our approximate posterior is depicted in Figure 4a. Our approximations give the\nmarginal posterior mean for Z:\n\n\u02dcm(Z|log \u21130, log \u2113s, \u2206c) := m(Z|log \u21130, log \u2113s, \u2206c, \u02c6w) ,\n\n(11)\n\nof the form (7). The marginal posterior variance\n\n\u02dcV(Z|log \u21130, log \u2113s, \u2206c) = \u222c dx dx\u2032 m\u2113|s(x) m\u2113|s(x\u2032) (Clog \u2113|s(x, x\u2032) + (\u2202\u02c6m(x)/\u2202w) Cw (\u2202\u02c6m(x\u2032)/\u2202w))\n\n(12)\n\nis possible, although laborious, to express analytically, as with (10).\n\n5\n\n\f5 Active Sampling\n\nOne major bene\ufb01t of model-based integration is that samples can be chosen by any method, in\ncontrast to Monte Carlo methods, which typically must sample from a speci\ufb01c distribution. In\nthis section, we describe a scheme to select samples xs sequentially, by minimising the expected\nuncertainty in the evidence that remains after taking each additional sample.3 We take the variance\nin the evidence as our loss function, and proceed according to Bayesian decision theory.\n\nSurprisingly, the posterior variance of a GP model with \ufb01xed hyperparameters does not depend\non the function values at sampled locations at all; only the location of those samples matters. In\ntraditional Bayesian quadrature, the evidence is an af\ufb01ne transformation of the sampled likelihood\nvalues, hence its estimate for the variance in the evidence is also independent of likelihood values.\nAs such, active learning with \ufb01xed hyperparameters is pointless, and the optimal sampling design\ncan be found in advance [13].\n\nIn Section 3, we took Z as an af\ufb01ne transform of the log-likelihood, which we model with a GP. As\nthe af\ufb01ne transformation (5) itself depends on the function values (via the dependence of log \u21130), the\nconclusions of the previous paragraph do not apply, and active learning is desirable. 
The uncertainty\nover the hyperparameters of the GP further motivates active learning: without assuming a priori\nknowledge of the hyperparameters, we can\u2019t evaluate the GP to precompute a sampling schedule.\nThe approximate marginalisation of hyperparameters permits an approach to active sampling that\nacknowledges the in\ufb02uence new samples may have on the posterior over hyperparameters.\nActive sampling selects a new sample xa so as to minimise the expected variance in the evidence\nafter adding the sample to the model of \u2113. The objective is therefore to choose the xa that minimises\n\nconditioned on, as usual for function inputs) where the expected loss is\n\nthe expected loss; xa = argminxa(cid:10)V (Z|log \u21130, log \u2113s,a) | log \u21130, log \u2113s(cid:11) (note xa is implicitly\n(cid:10)V (Z|log \u21130, log \u2113s,a) | log \u21130, log \u2113s(cid:11) = S(Z | log \u21130, log \u2113s) \u2212Z m(Z|log \u21130, log \u2113a,s, \u2206c)2\n\u2202w (cid:19) dlog \u2113a ,\n\n\u2202 \u02c6mT\na\n\n\u00d7 N(cid:18)log \u2113a; \u02c6ma, \u02c6Ca +\n\n\u2202 \u02c6ma\n\u2202w\n\nCw\n\n(13)\n\nand we de\ufb01ne \u02c6ma := m(log \u2113a|log \u2113s, \u02c6w) and \u02c6Ca := V (log \u2113a|log \u2113s, \u02c6w). The \ufb01rst term in (13),\nthe second moment, is independent of the selection of xa and can hence be safely ignored for active\nsampling (true regardless of the model chosen for the likelihood). The second term, the negative\nexpected squared mean, can be resolved analytically4 for any trial xa (we omit the laborious details).\nImportantly, we do not have to make a linearisation approximation for this new sample. That is, the\nGP posterior over log \u2113a can be fully exploited when performing active sampling.\nIn order to minimise the expected variance, the objective in (13) encourages the maximisation of the\nexpected squared mean of Z. 
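For intuition, variance-driven selection can be sketched in the simpler fixed-hyperparameter setting discussed above, where the posterior variance of the integral depends only on the sample locations; greedy minimisation then merely recovers the precomputable design of [13] rather than the full objective (13), but the machinery is the same. All numerical values below are assumptions for illustration:

```python
import numpy as np

def npdf(y, m, v):
    """Gaussian density N(y; m, v), parameterised by variance v."""
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

def integral_variance(xs, h2, w, nu, lam, jitter=1e-9):
    """Posterior variance of <f> for a GP with covariance (2) and p = N(nu, lam).
    Note: it depends on the sample locations xs, not on the sampled values."""
    c0 = h2 * npdf(0.0, 0.0, w + 2.0 * lam)  # int int K(x, x') p(x) p(x') dx dx'
    K = h2 * npdf(xs[:, None], xs[None, :], w) + jitter * np.eye(len(xs))
    z = h2 * npdf(xs, nu, w + lam)           # z_i = int K(x, x_i) p(x) dx
    return c0 - z @ np.linalg.solve(K, z)

h2, w, nu, lam = 1.0, 0.5, 0.0, 1.0
xs = np.array([0.0])                         # samples taken so far
candidates = np.linspace(-3.0, 3.0, 61)
for _ in range(4):                           # greedily add four samples
    scores = [integral_variance(np.append(xs, xa), h2, w, nu, lam)
              for xa in candidates]
    xs = np.append(xs, candidates[int(np.argmin(scores))])
print(np.sort(xs))  # the chosen samples spread out over the mass of p(x)
```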
Due to our log-GP model, one means the method can use to do this\nis to seek points where the log-likelihood is predicted to be large, which we call exploitation. The\nobjective in (13) naturally balances exploitation against exploration: the selection of points where our\ncurrent variance in the log-likelihood is signi\ufb01cant (see Figure 4b). Note that the variance for log \u2113a\nis increased by approximate integration over hyperparameters, encouraging exploration.\n\n6 Experiments\n\nWe now present empirical evaluation of our algorithm in a variety of different experiments.\n\nMetrics: We judged our methods according to three metrics, all averages over N similar exper-\niments indexed by i. De\ufb01ne Zi as the ground truth evidence for the ith experiment, m(Zi) as its\nestimated mean and V(Zi) as its predicted variance. Firstly, we computed the average log error,\nALE := (1/N) \u2211i |log m(Zi) \u2212 log Zi|. Next we computed the negative log-density of the truth,\nassuming experiments are independent, \u2212log p(Z) = \u2212\u2211i log N(Zi; m(Zi), V(Zi)), which\nquanti\ufb01es the accuracy of our variance estimates. We also computed the calibration C, de\ufb01ned\nas the fraction of experiments in which the ground truth lay within our 50% con\ufb01dence interval\n(m(Zi) \u2212 0.6745\u221aV(Zi), m(Zi) + 0.6745\u221aV(Zi)).\n\n3We also expect such samples to be useful not just for estimating the evidence, but also for any other related\nexpectations, such as would be required to perform prediction using the model.\n\n4Here we use the fact that \u222b exp(c y) N(y; m, \u03c3\u00b2) dy = exp(c m + c\u00b2\u03c3\u00b2/2). We assume that \u2206log \u2113|s\ndoes not depend on log \u2113a, only its location xa: we know \u2206(xa) = 0 and assume \u2206log \u2113|s elsewhere remains\nunchanged.\n\n6\n\n\f
Ideally, C would be 50%: any higher, and a\nmethod is under-con\ufb01dent, any lower and it is over-con\ufb01dent.\n\nMethods: We \ufb01rst compared against simple Monte Carlo (SMC). SMC generates samples\nx1, . . . , xN from p(x), and estimates Z by \u02c6Z = (1/N) \u2211n \u2113(xn). An estimate of the variance\nof \u02c6Z is given by the standard error of \u2113(x). As an alternative Monte Carlo technique, we imple-\nmented Annealed Importance Sampling (AIS) using a Metropolis-Hastings sampler. The inverse\ntemperature schedule was linear as in [10], and the proposal width was adjusted to attain approxi-\nmately a 50% acceptance rate. Note that a single AIS chain provides no ready means of determining\nthe posterior variance for its estimate of Z. Our \ufb01rst model-based method was Bayesian Monte\nCarlo (BMC) \u2013 the algorithm used in [10]. Here samples were drawn from the AIS chain above, and\na GP was \ufb01t to the likelihood samples. For this and other methods, where not otherwise mentioned,\nGP hyperparameters were selected using MLII.\n\nWe then tested four novel methods. Firstly, Bayesian Quadrature (BQ), which employed the lin-\nearisation approach of Section 3 to modelling the log-transformed likelihood values. The samples\nsupplied to it were drawn from the same AIS chain as used above, and 400 candidate points were per-\nmitted. BQ* is the same algorithm as BQ but with hyperparameters approximately marginalised, as\nper Section 4. Note that this in\ufb02uences only the variance of the estimate; the means for BQ and BQ*\nare identical. The performance of these methods allows us to quantify to what extent our innovations\nimprove estimation given a \ufb01xed set of samples.\n\nNext, we tested a novel algorithm, Doubly Bayesian Quadrature (BBQ). 
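The SMC baseline and the three metrics defined above can be sketched together on a toy integrand whose Z is known in closed form (the model, sample size and seed here are assumptions for illustration):

```python
import numpy as np

def npdf(y, m, v):
    """Gaussian density N(y; m, v), parameterised by variance v."""
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

rng = np.random.default_rng(0)

# Toy problem: l(x) = N(0; x, 1) and p(x) = N(x; 0, 1), so Z = N(0; 0, 2).
Z_true = npdf(0.0, 0.0, 2.0)

# Simple Monte Carlo: Z_hat = mean of l(x_n) for x_n ~ p(x), with the squared
# standard error of the mean as a variance estimate.
x = rng.normal(0.0, 1.0, size=10000)
l = npdf(0.0, x, 1.0)
m_Z = l.mean()
V_Z = l.var(ddof=1) / len(l)

# The three metrics from the text, over a batch of experiments (here just one).
m_all, V_all, Z_all = np.array([m_Z]), np.array([V_Z]), np.array([Z_true])
ALE = np.mean(np.abs(np.log(m_all) - np.log(Z_all)))
neg_log_p = -np.sum(np.log(npdf(Z_all, m_all, V_all)))
half = 0.6745 * np.sqrt(V_all)       # half-width of the 50% central interval
C = np.mean((Z_all > m_all - half) & (Z_all < m_all + half))
print(m_Z, Z_true, ALE, C)
```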
The method is so named\nfor the fact that we use not only Bayesian inference (with a GP over the log-transformed likelihood)\nto compute the posterior for the evidence, but also Bayesian decision theory to select our samples\nactively, as described in Section 5. BBQ* is identical, but with hyperparameters approximately\nmarginalised. Both algorithms demonstrate the in\ufb02uence of active sampling on our performance.\n\nProblems: We used these methods to evaluate evidences given Gaussian priors and a variety of\nlikelihood functions. As in [10] and [11], we focus on low numbers of samples; we permitted tested\nmethods 150 samples on synthetic integrands, and 300 when using real data. We are motivated by\nreal-world, big-data, problems where evaluating likelihood samples is expensive, making it desirable\nto determine the techniques for evidence estimation that can operate best when permitted only a\nsmall number of samples. Ground truth Z is available for some integrals; for the non-analytic\nintegrals, Z was estimated by a run of SMC with 105 samples.\nWe considered seven synthetic examples. We \ufb01rstly tested using single Gaussians, in one, four,\nten and twenty dimensions. We also tested on mixtures of two Gaussians in one dimension (two\nexamples, alternately widely separated and overlapping) and four dimensions (a single example).\n\nWe additionally tested methods on a real scienti\ufb01c problem: detecting a damped Lyman-\u03b1 absorber\n(DLA) between the Earth and an observed quasar from spectrographic readings of the quasar. DLAs\nare large objects consisting primarily of neutral hydrogen gas. The statistical properties of DLAs\ninform us about the distribution of neutral hydrogen in the universe, which is of fundamental cos-\nmological importance. We model the quasar spectra using a GP; the presence of a DLA is repre-\nsented as an observation fault with known dynamics [14]. 
This model has \ufb01ve hyperparameters to\nbe marginalised, to which we assign priors drawn from the large corpus of data obtained from the\nSloan Digital Sky Survey (SDSS) [15]. We tested over four datasets; the expense of evaluating a GP\nlikelihood sample on the large datasets available from the SDSS (140TB of data have been released\nin total) motivates the small sample sizes considered.\n\nEvaluation: Table 1 shows combined performance on the synthetic integrands listed above. The\ncalibration scores C show that all methods5 are systematically overcon\ufb01dent, although our ap-\nproaches are at least as well calibrated as alternatives. On average, BBQ* provides an estimate\n\n5Because a single AIS chain gives no estimate of uncertainty, it has no likelihood or calibration scores.\n\n7\n\n\fFigure 5: a) The posterior distribution over Z for several methods on a one-dimensional example\nas the number of samples increases. Shaded regions denote \u00b12 SD\u2019s from the mean. The shaded\nregions for SMC and BMC are off the vertical scale of this \ufb01gure. 
b) The log density of the true\nevidence for different methods (colours identical to those in a), compared to the true Z (in black).\nThe integrand is the same as that in Figure 4b.\n\nTable 1: Combined Synthetic Results\n\nMethod  \u2212log p(Z)  ALE    C\nSMC     > 1000     1.101  0.286\nAIS     N/A        1.695  N/A\nBMC     > 1000     2.695  0.143\nBQ      > 1000     6.760  0.429\nBQ*     > 1000     6.760  0.429\nBBQ     13.597     0.919  0.286\nBBQ*    \u221211.909    0.271  0.286\n\nTable 2: Combined Real Results\n\nMethod  \u2212log p(Z)  ALE    C\nSMC     5.001      0.632  0.250\nAIS     N/A        2.146  N/A\nBMC     9.536      1.455  0.500\nBQ      37.017     0.635  0.000\nBQ*     33.040     0.635  0.000\nBBQ     3.734      0.400  0.000\nBBQ*    74.242     1.732  0.250\n\nof Z which is closer to the truth than the other methods given the same number of samples, and as-\nsigns much higher likelihood to the true value of Z. BBQ* also achieved the lowest error on \ufb01ve, and\nbest likelihood on six, of the seven problems, including the twenty dimensional problem for both\nmetrics. Figure 5a shows a case where both SMC and BBQ* are relatively close to the true value,\nhowever BBQ*\u2019s posterior variance is much smaller. Figure 5b demonstrates the typical behaviour\nof the active sampling of BBQ*, which quickly concentrates the posterior distribution at the true Z.\nThe negative likelihoods of BQ* are for every problem slightly lower than for BQ (\u2212log p(Z) is\non average 0.2 lower), indicating that the approximate marginalisation of hyperparameters grants a\nsmall improvement in variance estimate.\n\nTable 2 shows results for the various methods on the real integration problems. Here BBQ is clearly\nthe best performer; the additional exploration induced by the hyperparameter marginalisation of\nBBQ* may have led to local peaks being incompletely exploited. 
Exploration in a relatively high\ndimensional, multi-modal space is inherently risky; nonetheless, BBQ* achieved lower error than\nBBQ on two of the problems.\n\n7 Conclusions\n\nIn this paper, we have made several advances to the BQ method for evidence estimation. These are:\napproximately imposing a positivity constraint6, approximately marginalising hyperparameters, and\nusing active sampling to select the location of function evaluations. Of these contributions, the active\nlearning approach yielded the most signi\ufb01cant gains for integral estimation.\n\nAcknowledgements\n\nM.A.O. was funded by the ORCHID project (http://www.orchid.ac.uk/).\n\n6Our approximations mean that we cannot guarantee non-negativity, but our approach improves upon alter-\n\nnatives that make no attempt to enforce the non-negativity constraint.\n\n8\n\n\fReferences\n[1] R.M. Neal. Annealed importance sampling. Statistics and Computing, 11(2):125\u2013139, 2001.\n[2] J. Skilling. Nested sampling. Bayesian inference and maximum entropy methods in science and engineer-\n\ning, 735:395\u2013405, 2004.\n\n[3] M.H. Chen, Q.M. Shao, and J.G. Ibrahim. Monte Carlo methods in Bayesian computation. Springer,\n\n2000.\n\n[4] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-\n\nTR-93-1, University of Toronto, 1993.\n\n[5] S.P. Brooks and G.O. Roberts. Convergence assessment techniques for Markov chain Monte Carlo. Statis-\n\ntics and Computing, 8(4):319\u2013335, 1998.\n\n[6] M.K. Cowles, G.O. Roberts, and J.S. Rosenthal. Possible biases induced by MCMC convergence diagnos-\n\ntics. Journal of Statistical Computation and Simulation, 64(1):87, 1999.\n\n[7] P. Diaconis. Bayesian numerical analysis. In S. Gupta J. Berger, editor, Statistical Decision Theory and\n\nRelated Topics IV, volume 1, pages 163\u2013175. Springer-Verlag, New York, 1988.\n\n[8] A. O\u2019Hagan. Bayes-Hermite quadrature. 
Journal of Statistical Planning and Inference, 29:245\u2013260, 1991.\n\n[9] M. Kennedy. Bayesian quadrature with non-normal approximating functions. Statistics and Computing,\n8(4):365\u2013375, 1998.\n\n[10] C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In S. Becker and K. Obermayer, editors,\nAdvances in Neural Information Processing Systems, volume 15. MIT Press, Cambridge, MA, 2003.\n\n[11] M.A. Osborne, R. Garnett, S.J. Roberts, C. Hart, S. Aigrain, and N.P. Gibson. Bayesian\nquadrature for ratios. In Proceedings of the Fifteenth International Conference on Arti\ufb01cial Intelligence\nand Statistics (AISTATS 2012), 2012.\n\n[12] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[13] T. P. Minka. Deriving quadrature rules from Gaussian processes. Technical report, Statistics Department,\nCarnegie Mellon University, 2000.\n\n[14] R. Garnett, M.A. Osborne, S. Reece, A. Rogers, and S.J. Roberts. Sequential Bayesian prediction in the\npresence of changepoints and faults. The Computer Journal, 53(9):1430, 2010.\n\n[15] Sloan Digital Sky Survey, 2011. http://www.sdss.org/.\n\n9\n\n\f", "award": [], "sourceid": 32, "authors": [{"given_name": "Michael", "family_name": "Osborne", "institution": null}, {"given_name": "Roman", "family_name": "Garnett", "institution": ""}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "David", "family_name": "Duvenaud", "institution": null}, {"given_name": "Stephen", "family_name": "Roberts", "institution": null}, {"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}