{"title": "Variational Bayesian Monte Carlo", "book": "Advances in Neural Information Processing Systems", "page_first": 8213, "page_last": 8223, "abstract": "Many probabilistic models of interest in scientific computing and machine learning have expensive, black-box likelihoods that prevent the application of standard techniques for Bayesian inference, such as MCMC, which would require access to the gradient or a large number of likelihood evaluations.\nWe introduce here a novel sample-efficient inference framework, Variational Bayesian Monte Carlo (VBMC). VBMC combines variational inference with Gaussian-process based, active-sampling Bayesian quadrature, using the latter to efficiently approximate the intractable integral in the variational objective.\nOur method produces both a nonparametric approximation of the posterior distribution and an approximate lower bound of the model evidence, useful for model selection.\nWe demonstrate VBMC both on several synthetic likelihoods and on a neuronal model with data from real neurons. Across all tested problems and dimensions (up to D = 10), VBMC performs consistently well in reconstructing the posterior and the model evidence with a limited budget of likelihood evaluations, unlike other methods that work only in very low dimensions. Our framework shows great promise as a novel tool for posterior and model inference with expensive, black-box likelihoods.", "full_text": "Variational Bayesian Monte Carlo\n\nLuigi Acerbi\u2217\n\nDepartment of Basic Neuroscience\n\nUniversity of Geneva\n\nluigi.acerbi@unige.ch\n\nAbstract\n\nMany probabilistic models of interest in scienti\ufb01c computing and machine learning\nhave expensive, black-box likelihoods that prevent the application of standard\ntechniques for Bayesian inference, such as MCMC, which would require access\nto the gradient or a large number of likelihood evaluations. 
We introduce here a\nnovel sample-ef\ufb01cient inference framework, Variational Bayesian Monte Carlo\n(VBMC). VBMC combines variational inference with Gaussian-process based,\nactive-sampling Bayesian quadrature, using the latter to ef\ufb01ciently approximate\nthe intractable integral in the variational objective. Our method produces both\na nonparametric approximation of the posterior distribution and an approximate\nlower bound of the model evidence, useful for model selection. We demonstrate\nVBMC both on several synthetic likelihoods and on a neuronal model with data\nfrom real neurons. Across all tested problems and dimensions (up to D = 10),\nVBMC performs consistently well in reconstructing the posterior and the model\nevidence with a limited budget of likelihood evaluations, unlike other methods that\nwork only in very low dimensions. Our framework shows great promise as a novel\ntool for posterior and model inference with expensive, black-box likelihoods.\n\nIntroduction\n\n1\nIn many scienti\ufb01c, engineering, and machine learning domains, such as in computational neuro-\nscience and big data, complex black-box computational models are routinely used to estimate model\nparameters and compare hypotheses instantiated by different models. Bayesian inference allows us\nto do so in a principled way that accounts for parameter and model uncertainty by computing the\nposterior distribution over parameters and the model evidence, also known as marginal likelihood or\nBayes factor. However, Bayesian inference is generally analytically intractable, and the statistical\ntools of approximate inference, such as Markov Chain Monte Carlo (MCMC) or variational inference,\ngenerally require knowledge about the model (e.g., access to the gradients) and/or a large number of\nmodel evaluations. 
Both of these requirements cannot be met by black-box probabilistic models with computationally expensive likelihoods, precluding the application of standard Bayesian techniques of parameter and model uncertainty quantification to domains that would most need them.
Given a dataset D and model parameters x ∈ R^D, here we consider the problem of computing both the posterior p(x|D) and the marginal likelihood (or model evidence) p(D), defined as, respectively,

$$p(x|D) = \frac{p(D|x)\,p(x)}{p(D)} \qquad \text{and} \qquad p(D) = \int p(D|x)\,p(x)\,dx, \tag{1}$$

where p(D|x) is the likelihood of the model of interest and p(x) is the prior over parameters. Crucially, we consider the case in which p(D|x) is a black-box, expensive function for which we have a limited budget of function evaluations (of the order of a few hundred).
A promising approach to deal with such computational constraints consists of building a probabilistic model-based approximation of the function of interest, for example via Gaussian processes (GP) [1]. This statistical surrogate can be used in lieu of the original, expensive function, allowing faster computations.

*Website: luigiacerbi.com. Alternative e-mail: luigi.acerbi@gmail.com.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
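To make the surrogate idea concrete, here is a minimal NumPy sketch of a GP regression surrogate standing in for an expensive log-likelihood. This is an illustration, not the paper's implementation; the function `gp_posterior` and the toy target `f` are our own hypothetical names, and the kernel hyperparameters are fixed by hand rather than learned.

```python
import numpy as np

def gp_posterior(X, y, Xs, ell=1.0, sf=1.0, noise=1e-6):
    """GP posterior mean/variance with a Gaussian (RBF) kernel.
    X: (n,) training inputs, y: (n,) function values, Xs: (m,) query points."""
    k = lambda a, b: sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(X, X) + noise * np.eye(len(X))       # kernel matrix with jitter
    Ks = k(Xs, X)                              # cross-covariances
    mean = Ks @ np.linalg.solve(K, y)          # posterior predictive mean
    # Posterior predictive variance: prior variance minus explained variance
    var = sf**2 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, var

# Stand-in for an "expensive" log-likelihood, evaluated at only 7 points.
f = lambda x: -0.5 * x**2
X = np.linspace(-3, 3, 7)
mean, var = gp_posterior(X, f(X), np.array([0.5]))
```

The surrogate's predictive mean can now be queried anywhere at negligible cost, and its predictive variance quantifies where the approximation is uncertain — the quantity that active sampling exploits below.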
Moreover, uncertainty in the surrogate can be used to actively guide sampling of the\noriginal function to obtain a better approximation in regions of interest for the application at hand.\nThis approach has been extremely successful in Bayesian optimization [2, 3, 4, 5, 6] and in Bayesian\nquadrature for the computation of intractable integrals [7, 8].\nIn particular, recent works have applied GP-based Bayesian quadrature to the estimation of the\nmarginal likelihood [8, 9, 10, 11], and GP surrogates to build approximations of the posterior [12, 13].\nHowever, none of the existing approaches deals simultaneously with posterior and model inference.\nMoreover, it is unclear how these approximate methods would deal with likelihoods with realistic\nproperties, such as medium dimensionality (up to D \u223c 10), mild multi-modality, heavy tails, and\nparameters that exhibit strong correlations\u2014all common issues of real-world applications.\nIn this work, we introduce Variational Bayesian Monte Carlo (VBMC), a novel approximate inference\nframework that combines variational inference and active-sampling Bayesian quadrature via GP\nsurrogates.1 Our method affords simultaneous approximation of the posterior and of the model\nevidence in a sample-ef\ufb01cient manner. We demonstrate the robustness of our approach by testing\nVBMC and other inference algorithms on a variety of synthetic likelihoods with realistic, challenging\nproperties. We also apply our method to a real problem in computational neuroscience, by \ufb01tting a\nmodel of neuronal selectivity in visual cortex [14]. 
Among the tested methods, VBMC is the only one with consistently good performance across problems, showing promise as a novel tool for posterior and model inference with expensive likelihoods in scientific computing and machine learning.

2 Theoretical background

2.1 Variational inference

Variational Bayes is an approximate inference method whereby the posterior p(x|D) is approximated by a simpler distribution q(x) ≡ q_φ(x) that usually belongs to a parametric family [15, 16]. The goal of variational inference is to find the variational parameters φ for which the variational posterior q_φ "best" approximates the true posterior. In variational methods, the mismatch between the two distributions is quantified by the Kullback-Leibler (KL) divergence,

$$\mathrm{KL}\left[q_\phi(x) \,\|\, p(x|D)\right] = \mathbb{E}_\phi\left[\log \frac{q_\phi(x)}{p(x|D)}\right], \tag{2}$$

where we adopted the compact notation E_φ ≡ E_{q_φ}. Inference is then reduced to an optimization problem, that is, finding the variational parameter vector φ that minimizes Eq. 2. We rewrite Eq. 2 as

$$\log p(D) = \mathcal{F}[q_\phi] + \mathrm{KL}\left[q_\phi(x) \,\|\, p(x|D)\right], \tag{3}$$

where

$$\mathcal{F}[q_\phi] = \mathbb{E}_\phi\left[\log \frac{p(D|x)\,p(x)}{q_\phi(x)}\right] = \mathbb{E}_\phi\left[f(x)\right] + \mathcal{H}[q_\phi(x)] \tag{4}$$

is the negative free energy, or evidence lower bound (ELBO). Here f(x) ≡ log p(D|x)p(x) = log p(D, x) is the log joint probability and H[q] is the entropy of q. Note that since the KL divergence is always non-negative, from Eq. 3 we have F[q] ≤ log p(D), with equality holding if q(x) ≡ p(x|D). Thus, maximization of the variational objective, Eq.
4, is equivalent to minimization of the KL divergence, and produces both an approximation of the posterior q_φ and a lower bound on the marginal likelihood, which can be used as a metric for model selection.
Normally, q is chosen to belong to a family (e.g., a factorized posterior, or mean field) such that the expected log joint in Eq. 4 and the entropy can be computed analytically, possibly providing closed-form equations for a coordinate ascent algorithm. Here, we assume that f(x), like many computational models of interest, is an expensive black-box function, which prevents a direct computation of Eq. 4 analytically or via simple numerical integration.

¹Code available at https://github.com/lacerbi/vbmc.

2.2 Bayesian quadrature

Bayesian quadrature, also known as Bayesian Monte Carlo, is a means to obtain Bayesian estimates of the mean and variance of non-analytical integrals of the form ⟨f⟩ = ∫ f(x)π(x)dx, defined on a domain X = R^D [7, 8]. Here, f is a function of interest and π a known probability distribution. Typically, a Gaussian Process (GP) prior is specified for f(x).

Gaussian processes  GPs are a flexible class of models for specifying prior distributions over unknown functions f : X ⊆ R^D → R [1]. GPs are defined by a mean function m : X → R and a positive definite covariance, or kernel function κ : X × X → R. In Bayesian quadrature, a common choice is the Gaussian kernel

$$\kappa(x, x') = \sigma_f^2\, \mathcal{N}\left(x;\, x', \Sigma_\ell\right), \quad \text{with} \quad \Sigma_\ell = \mathrm{diag}\left[\ell^{(1)2}, \ldots, \ell^{(D)2}\right],$$

where σ_f is the output scale and ℓ is the vector of input length scales. Conditioned on training inputs X = {x₁, …, x_n} and associated function values y = f(X), the GP posterior will have latent posterior conditional mean f̄_Ξ(x) ≡ f̄(x; Ξ, ψ) and covariance C_Ξ(x, x′) ≡ C(x, x′; Ξ, ψ) in closed form (see [1]), where Ξ = {X, y} is the set of training function data for the GP and ψ is a hyperparameter vector for the GP mean, covariance, and likelihood.

Bayesian integration  Since integration is a linear operator, if f is a GP, the posterior mean and variance of the integral ∫ f(x)π(x)dx are [8]

$$\mathbb{E}_{f|\Xi}\left[\langle f \rangle\right] = \int \bar{f}_{\Xi}(x)\,\pi(x)\,dx, \qquad \mathbb{V}_{f|\Xi}\left[\langle f \rangle\right] = \int\!\!\int C_{\Xi}(x, x')\,\pi(x)\,dx\,\pi(x')\,dx'. \tag{5}$$

Crucially, if f has a Gaussian kernel and π is a Gaussian or mixture of Gaussians (among other functional forms), the integrals in Eq. 5 can be computed analytically.

Active sampling  For a given budget of samples n_max, a smart choice of the input samples X would aim to minimize the posterior variance of the final integral (Eq. 5) [11]. Interestingly, for a standard GP and fixed GP hyperparameters ψ, the optimal variance-minimizing design does not depend on the function values at X, thereby allowing precomputation of the optimal design. However, if the GP hyperparameters are updated online, or the GP is warped (e.g., via a log transform [9] or a square-root transform [10]), the variance of the posterior will depend on the function values obtained so far, and an active sampling strategy is desirable. The acquisition function a : X → R determines which point in X should be evaluated next via a proxy optimization x_next = argmax_x a(x).
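As a concrete 1-D illustration of Eq. 5, the sketch below computes the Bayesian quadrature posterior mean of ∫ f(x)π(x)dx for a GP with Gaussian kernel and a standard normal π, using the well-known closed form z_i = σ_f² N(x_i; 0, ℓ² + s²) for the kernel–prior convolution. This is a toy sketch under our own choice of hyperparameters; `bq_mean` is a hypothetical helper, not an interface of the paper's code.

```python
import numpy as np

def normpdf(x, mu, var):
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

def bq_mean(X, y, ell=0.6, sf=1.0, s=1.0, noise=1e-8):
    """Posterior mean of <f> = ∫ f(x) π(x) dx for a GP with Gaussian kernel
    κ(x, x') = sf^2 N(x; x', ell^2) and Gaussian π(x) = N(x; 0, s^2).
    The kernel integrals z_i = ∫ κ(x, x_i) π(x) dx = sf^2 N(x_i; 0, ell^2 + s^2)
    are analytic, so E[<f>] = z^T K^{-1} y."""
    K = sf**2 * normpdf(X[:, None], X[None, :], ell**2) + noise * np.eye(len(X))
    z = sf**2 * normpdf(X, 0.0, ell**2 + s**2)
    return z @ np.linalg.solve(K, y)

# Integrand f(x) = x^2: the true value of ∫ x^2 N(x; 0, 1) dx is 1.
X = np.linspace(-4, 4, 15)
est = bq_mean(X, X**2)
```

With only 15 function evaluations the quadrature estimate lands close to the true value of 1, which is the sample efficiency that motivates building VBMC on Bayesian quadrature.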
Examples of\nacquisition functions for Bayesian quadrature include the expected entropy, which minimizes the\nexpected entropy of the integral after adding x to the training set [9], and the much faster to compute\nuncertainty sampling strategy, which maximizes the variance of the integrand at x [10].\n\n3 Variational Bayesian Monte Carlo (VBMC)\n\nWe introduce here Variational Bayesian Monte Carlo (VBMC), a sample-ef\ufb01cient inference method\nthat combines variational Bayes and Bayesian quadrature, particularly useful for models with (moder-\nately) expensive likelihoods. The main steps of VBMC are described in Algorithm 1, and an example\nrun of VBMC on a nontrivial 2-D target density is shown in Fig. 1.\nVBMC in a nutshell\nIn each iteration t, the algorithm: (1) sequentially samples a batch of\n\u2018promising\u2019 new points that maximize a given acquisition function, and evaluates the (expensive) log\njoint f at each of them; (2) trains a GP model of the log joint f, given the training set \u039et = {Xt, yt}\nof points evaluated so far; (3) updates the variational posterior approximation, indexed by \u03c6t, by\noptimizing the ELBO. This loop repeats until the budget of function evaluations is exhausted, or some\nother termination criterion is met (e.g., based on the stability of the found solution). VBMC includes\nan initial warm-up stage to avoid spending computations in regions of low posterior probability mass\n(see Section 3.5). 
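The three-step loop described above can be sketched structurally as follows. All of the components here (`fit_gp`, `optimize_elbo`, `acquire`, the toy `log_joint`) are crude stand-ins of our own invention — random proposals instead of acquisition optimization, a trivial "GP" and "ELBO" — meant only to show the control flow, not the actual VBMC machinery.

```python
import numpy as np

rng = np.random.default_rng(0)
log_joint = lambda x: -0.5 * np.sum(x**2)   # stand-in for the expensive f

def fit_gp(train):            # placeholder: VBMC trains a GP on (X_t, y_t)
    return train

def optimize_elbo(gp):        # placeholder: VBMC optimizes the ELBO by SGD
    return {"mu": np.mean([x for x, _ in gp], axis=0)}

def acquire(gp, n_active=5):  # placeholder: VBMC maximizes an acquisition fn
    return [rng.normal(size=2) for _ in range(n_active)]

# The three-step VBMC loop: (1) actively sample and evaluate the log joint,
# (2) train the GP surrogate, (3) update the variational posterior.
train = [(x, log_joint(x)) for x in [rng.normal(size=2)]]
for t in range(10):
    for x in acquire(fit_gp(train)):          # step 1: active sampling
        train.append((x, log_joint(x)))
    gp = fit_gp(train)                        # step 2: GP training
    phi = optimize_elbo(gp)                   # step 3: variational update
```

The real algorithm adds warm-up, hyperparameter sampling, adaptive K, and termination checks on top of this skeleton (Algorithm 1).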
In the following sections, we describe various features of VBMC.\n\n3.1 Variational posterior\n\nWe choose for the variational posterior q(x) a \ufb02exible \u201cnonparametric\u201d family, a mixture of K\nGaussians with shared covariances, modulo a scaling factor,\n\nK(cid:88)\n\nwkN(cid:0)x; \u00b5k, \u03c32\nk\u03a3(cid:1) ,\n\nq(x) \u2261 q\u03c6(x) =\n\n(6)\n\nwhere wk, \u00b5k, and \u03c3k are, respectively, the mixture weight, mean, and scale of the k-th component,\nand \u03a3 is a covariance matrix common to all elements of the mixture. In the following, we assume\n\nk=1\n\n3\n\n\f(cid:46) Initial design, Section 3.5\n\nt \u2190 t + 1\nif t (cid:44) 1 then\n\nelse\n\nfor 1 . . . nactive do\n\nEvaluate y0 \u2190 f (x0) and add (x0, y0) to the training set \u039e\nfor 2 . . . ninit do\n\nSample a new point xnew \u2190 Uniform[PLB, PUB]\nEvaluate ynew \u2190 f (xnew) and add (xnew, ynew) to the training set \u039e\n\nAlgorithm 1 Variational Bayesian Monte Carlo\nInput: target log joint f, starting point x0, plausible bounds PLB, PUB, additional options\n1: Initialization: t \u2190 0, initialize variational posterior \u03c60, STOPSAMPLING \u2190 false\n2: repeat\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n10:\n11:\n12:\n13:\n14:\n15:\n16:\n17:\n18:\n19:\n20:\n21: until fevals > MaxFunEvals or TERMINATIONCRITERIA (cid:46) Stopping criteria, Section 3.7\n\nelse\nKt \u2190 Update number of variational components\n\u03c6t \u2190 Optimize ELBO via stochastic gradient descent\nEvaluate whether to STOPSAMPLING and other TERMINATIONCRITERIA\n\nActively sample a new point xnew \u2190 argmaxxa(x)\nEvaluate ynew \u2190 f (xnew) and add (xnew, ynew) to the training set \u039e\nfor each \u03c81, . . . , \u03c8ngp, perform rank-1 update of the GP posterior\n\n22: return variational posterior \u03c6t, E [ELBO],(cid:112)V [ELBO]\n\nif not STOPSAMPLING then\n\n{\u03c81, . . . 
, \u03c8ngp} \u2190 Sample GP hyperparameters\n\u03c81 \u2190 Optimize GP hyperparameters\n\n(cid:46) GP hyperparameter training, Section 3.4\n\n(cid:46) Active sampling, Section 3.3\n\n(cid:46) Section 3.6\n(cid:46) Section 3.2\n\na diagonal matrix \u03a3 \u2261 diag[\u03bb(1)2\n]. The variational posterior for a given number of\nmixture components K is parameterized by \u03c6 \u2261 (w1, . . . , wK, \u00b51, . . . , \u00b5K, \u03c31, . . . , \u03c3K, \u03bb), which\nhas K(D + 2) + D parameters. The number of components K is set adaptively (see Section 3.6).\n\n, . . . , \u03bb(D)2\n\n3.2 The evidence lower bound\n\nWe approximate the ELBO (Eq. 4) in two ways. First, we approximate the log joint probability f with\na GP with a squared exponential (rescaled Gaussian) kernel, a Gaussian likelihood with observation\nnoise \u03c3obs > 0 (for numerical stability [17]), and a negative quadratic mean function, de\ufb01ned as\n\n(cid:17)2\n\n(cid:16)\n\nD(cid:88)\n\ni=1\n\nx(i) \u2212 x(i)\n\nm\n\u03c9(i)2\n\nm(x) = m0 \u2212 1\n2\n\n,\n\n(7)\n\nwhere m0 is the maximum value of the mean, xm is the location of the maximum, and \u03c9 is a vector of\nlength scales. This mean function, unlike for example a constant mean, ensures that the posterior GP\npredictive mean f is a proper log probability distribution (that is, it is integrable when exponentiated).\nCrucially, our choice of variational family (Eq. 6) and kernel, likelihood and mean function of the\nGP affords an analytical computation of the posterior mean and variance of the expected log joint\nE\u03c6 [f ] (using Eq. 5), and of their gradients (see Supplementary Material for details). 
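The variational family of Eq. 6 is simple to evaluate numerically. The sketch below (our own toy code, not the paper's implementation) computes q_φ(x) for a mixture with shared diagonal covariance Σ = diag(λ²) scaled per-component by σ_k², and counts the K(D + 2) + D free parameters.

```python
import numpy as np

def q_pdf(x, w, mu, sigma, lam):
    """Variational posterior of Eq. 6: a mixture of K Gaussians with
    component means mu[k], per-component scales sigma[k], and a shared
    diagonal covariance Sigma = diag(lam**2), scaled by sigma[k]**2."""
    total = 0.0
    for k in range(len(w)):
        var = (sigma[k] * lam)**2                   # diagonal of sigma_k^2 * Sigma
        norm = np.prod(np.sqrt(2 * np.pi * var))    # Gaussian normalization
        total += w[k] * np.exp(-0.5 * np.sum((x - mu[k])**2 / var)) / norm
    return total

K, D = 3, 2
w = np.ones(K) / K            # K mixture weights
mu = np.zeros((K, D))         # K means of dimension D
sigma = np.ones(K)            # K per-component scales
lam = np.ones(D)              # D shared length scales
# phi = (w_1..w_K, mu_1..mu_K, sigma_1..sigma_K, lambda): K(D+2)+D parameters
n_params = K * (D + 2) + D
```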
Second, we approximate the entropy of the variational posterior, H[q_φ], via simple Monte Carlo sampling, and we propagate its gradient through the samples via the reparametrization trick [18, 19].² Armed with expressions for the mean expected log joint, the entropy, and their gradients, we can efficiently optimize the (negative) mean ELBO via stochastic gradient descent [21].

Evidence lower confidence bound  We define the evidence lower confidence bound (ELCBO) as

$$\mathrm{ELCBO}(\phi, f) = \mathbb{E}_{f|\Xi}\left[\mathbb{E}_\phi[f]\right] + \mathcal{H}[q_\phi] - \beta_{\mathrm{LCB}}\sqrt{\mathbb{V}_{f|\Xi}\left[\mathbb{E}_\phi[f]\right]}, \tag{8}$$

where the first two terms are the ELBO (Eq. 4) estimated via Bayesian quadrature, and the last term is the uncertainty in the computation of the expected log joint multiplied by a risk-sensitivity parameter β_LCB (we set β_LCB = 3 unless specified otherwise). Eq. 8 establishes a probabilistic lower bound on the ELBO, used to assess the improvement of the variational solution (see following sections).

²We also tried a deterministic approximation of the entropy proposed in [20], with mixed results.

Figure 1: Example run of VBMC on a 2-D pdf. A. Contour plots of the variational posterior at different iterations of the algorithm. Red crosses indicate the centers of the variational mixture components, black dots are the training samples. B. ELBO as a function of iteration. Shaded area is 95% CI of the ELBO in the current iteration as per the Bayesian quadrature approximation (not the error wrt ground truth). The black line is the true log marginal likelihood (LML). C. True target pdf.

3.3 Active sampling

In VBMC, we are performing active sampling to compute a sequence of integrals E_φ₁[f], E_φ₂[f], …, E_φT[f], across iterations 1, . . .
, T such that (1) the sequence of variational parameters φ_t converges to the variational posterior that minimizes the KL divergence with the true posterior, and (2) we have minimum variance on our final estimate of the ELBO. Note how this differs from active sampling in simple Bayesian quadrature, for which we only care about minimizing the variance of a single fixed integral. The ideal acquisition function for VBMC will correctly balance exploration of uncertain regions and exploitation of regions with high probability mass to ensure a fast convergence of the variational posterior as closely as possible to the ground truth.
We describe here two acquisition functions for VBMC based on uncertainty sampling. Let V_Ξ(x) ≡ C_Ξ(x, x) be the posterior GP variance at x given the current training set Ξ. 'Vanilla' uncertainty sampling for E_φ[f] is a_us(x) = V_Ξ(x) q_φ(x)², where q_φ is the current variational posterior. Since a_us only maximizes the variance of the integrand under the current variational parameters, we expect it to be lacking in exploration. To promote exploration, we introduce prospective uncertainty sampling,

$$a_{\mathrm{pro}}(x) = V_{\Xi}(x)\, q_\phi(x)\, \exp\left(\bar{f}_{\Xi}(x)\right), \tag{9}$$

where f̄_Ξ is the GP posterior predictive mean. a_pro aims at reducing uncertainty of the variational objective both for the current posterior and at prospective locations where the variational posterior might move to in the future, if not already there (high GP posterior mean). The variational posterior in a_pro acts as a regularizer, preventing active sampling from following too eagerly fluctuations of the GP mean. For numerical stability of the GP, we include in all acquisition functions a regularization factor to prevent selection of points too close to existing training points (see Supplementary Material).
At the beginning of each iteration after the first, VBMC actively samples n_active points (n_active = 5 by default in this work). We select each point sequentially, by optimizing the chosen acquisition function via CMA-ES [22], and apply fast rank-one updates of the GP posterior after each acquisition.

3.4 Adaptive treatment of GP hyperparameters

The GP model in VBMC has 3D + 3 hyperparameters, ψ = (ℓ, σ_f, σ_obs, m₀, x_m, ω). We impose an empirical Bayes prior on the GP hyperparameters based on the current training set (see Supplementary Material), and we sample from the posterior over hyperparameters via slice sampling [23]. In each iteration, we collect n_gp = round(80/√n) samples, where n is the size of the current GP training set, with the rationale that we require fewer samples as the posterior over hyperparameters becomes narrower due to more observations. Given samples {ψ} ≡ {ψ₁, …, ψ_ngp}, and a random variable χ that depends on ψ, we compute the expected mean and variance of χ as

$$\mathbb{E}\left[\chi|\{\psi\}\right] = \frac{1}{n_{\mathrm{gp}}}\sum_{j=1}^{n_{\mathrm{gp}}} \mathbb{E}\left[\chi|\psi_j\right], \qquad \mathbb{V}\left[\chi|\{\psi\}\right] = \frac{1}{n_{\mathrm{gp}}}\sum_{j=1}^{n_{\mathrm{gp}}} \mathbb{V}\left[\chi|\psi_j\right] + \mathrm{Var}\left[\left\{\mathbb{E}\left[\chi|\psi_j\right]\right\}_{j=1}^{n_{\mathrm{gp}}}\right], \tag{10}$$

where Var[·] is the sample variance. We use Eq.
10 to compute the GP posterior predictive mean and\nvariances for the acquisition function, and to marginalize the expected log joint over hyperparameters.\nThe algorithm adaptively switches to a faster maximum-a-posteriori (MAP) estimation of the hyper-\nparameters (via gradient-based optimization) when the additional variability of the expected log joint\nbrought by multiple samples falls below a threshold for several iterations, a signal that sampling is\nbringing little advantage to the precision of the computation.\n\n3.5\n\nInitialization and warm-up\n\nThe algorithm is initialized by providing a starting point x0 (ideally, in a region of high posterior\nprobability mass) and vectors of plausible lower/upper bounds PLB, PUB, that identify a region of high\nposterior probability mass in parameter space. In the absence of other information, we obtained good\nresults with plausible bounds containing the peak of prior mass in each coordinate dimension, such\nas the top \u223c 0.68 probability region (that is, mean \u00b1 1 SD for a Gaussian prior). The initial design\nconsists of the provided starting point(s) x0 and additional points generated uniformly at random\ninside the plausible box, for a total of ninit = 10 points. The plausible box also sets the reference\nscale for each variable, and in future work might inform other aspects of the algorithm [6]. The\nVBMC algorithm works in an unconstrained space (x \u2208 RD), but bound constraints to the variables\ncan be easily handled via a nonlinear remapping of the input space, with an appropriate Jacobian\ncorrection of the log probability density [24] (see Section 4.2 and Supplementary Material).3\nWarm-up We initialize the variational posterior with K = 2 components in the vicinity of x0,\nand with small values of \u03c31, \u03c32, and \u03bb (relative to the width of the plausible box). 
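The nonlinear remapping of bounded variables mentioned above (a logit-type transform with a Jacobian correction of the log density) can be sketched as follows. This is a generic illustration under standard conventions, not the paper's code; the helper names are ours.

```python
import numpy as np

def to_unconstrained(theta, lb, ub):
    """Map theta in (lb, ub) elementwise to x in R via a logit transform."""
    z = (theta - lb) / (ub - lb)
    return np.log(z) - np.log1p(-z)

def log_density_unconstrained(x, log_p, lb, ub):
    """Log density of the transformed variable: the original log density
    evaluated at theta(x), plus the log Jacobian of the inverse (sigmoid)
    map, log|d theta/d x| = log(ub-lb) + log s + log(1-s)."""
    s = 1.0 / (1.0 + np.exp(-x))          # sigmoid
    theta = lb + (ub - lb) * s
    log_jac = np.sum(np.log(ub - lb) + np.log(s) + np.log(1.0 - s))
    return log_p(theta) + log_jac

# Example: a parameter constrained to (0, 1), mapped to the real line.
x = to_unconstrained(np.array([0.2]), lb=np.array([0.0]), ub=np.array([1.0]))
```

Without the Jacobian term the transformed "density" would not integrate to one, which would bias both the posterior approximation and the ELBO.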
The algorithm\nstarts in warm-up mode, during which VBMC tries to quickly improve the ELBO by moving to\nregions with higher posterior probability. During warm-up, K is clamped to only two components\nwith w1 \u2261 w2 = 1/2, and we collect a maximum of ngp = 8 hyperparameter samples. Warm-up\nends when the ELCBO (Eq. 8) shows an improvement of less than 1 for three consecutive iterations,\nsuggesting that the variational solution has started to stabilize. At the end of warm-up, we trim\nthe training set by removing points whose value of the log joint probability y is more than 10 \u00b7 D\npoints lower than the maximum value ymax observed so far. While not necessary in theory, we found\nthat trimming generally increases the stability of the GP approximation, especially when VBMC\nis initialized in a region of very low probability under the true posterior. To allow the variational\nposterior to adapt, we do not actively sample new points in the \ufb01rst iteration after the end of warm-up.\n\n3.6 Adaptive number of variational mixture components\n\nAfter warm-up, we add and remove variational components following a simple set of rules.\nAdding components We de\ufb01ne the current variational solution as improving if the ELCBO of the\nlast iteration is higher than the ELCBO in the past few iterations (nrecent = 4). In each iteration, we\nincrement the number of components K by 1 if the solution is improving and no mixture component\nwas pruned in the last iteration (see below). To speed up adaptation of the variational solution\nto a complex true posterior when the algorithm has nearly converged, we further add two extra\ncomponents if the solution is stable (see below) and no component was recently pruned. Each new\ncomponent is created by splitting and jittering a randomly chosen existing component. 
We set a\nmaximum number of components Kmax = n2/3, where n is the size of the current training set \u039e.\nRemoving components At the end of each variational optimization, we consider as a candidate for\npruning a random mixture component k with mixture weight wk < wmin. We recompute the ELCBO\n\n3The available code for VBMC currently supports both unbounded variables and bound constraints.\n\n6\n\n\fwithout the selected component (normalizing the remaining weights). If the \u2018pruned\u2019 ELCBO differs\nfrom the original ELCBO less than \u03b5, we remove the selected component. We iterate the process\nthrough all components with weights below threshold. For VBMC we set wmin = 0.01 and \u03b5 = 0.01.\n\n3.7 Termination criteria\nAt the end of each iteration, we assign a reliability index \u03c1(t) \u2265 0 to the current variational solution\nbased on the following features: change in ELBO between the current and the previous iteration;\nestimated variance of the ELBO; KL divergence between the current and previous variational posterior\n(see Supplementary Material for details). By construction, a \u03c1(t) (cid:46) 1 is suggestive of a stable solution.\nThe algorithm terminates when obtaining a stable solution for nstable = 8 iterations (with at most one\nnon-stable iteration in-between), or when reaching a maximum number nmax of function evaluations.\nThe algorithm returns the estimate of the mean and standard deviation of the ELBO (a lower bound\non the marginal likelihood), and the variational posterior, from which we can cheaply draw samples\nfor estimating distribution moments, marginals, and other properties of the posterior. 
If the algorithm terminates before achieving long-term stability, it warns the user and returns a recent solution with the best ELCBO, using a conservative β_LCB = 5.

4 Experiments

We tested VBMC and other common inference algorithms on several artificial and real problems consisting of a target likelihood and an associated prior. The goal of inference consists of approximating the posterior distribution and the log marginal likelihood (LML) with a fixed budget of likelihood evaluations, assumed to be (moderately) expensive.

Algorithms  We tested VBMC with the 'vanilla' uncertainty sampling acquisition function a_us (VBMC-U) and with prospective uncertainty sampling, a_pro (VBMC-P). We also tested simple Monte Carlo (SMC), annealed importance sampling (AIS), the original Bayesian Monte Carlo (BMC), doubly-Bayesian quadrature (BBQ [9])⁴, and warped sequential active Bayesian integration (WSABI, both in its linearized and moment-matching variants, WSABI-L and WSABI-M [10]). For the basic setup of these methods, we follow [10]. Most of these algorithms only compute an approximation of the marginal likelihood based on a set of sampled points, but do not directly compute a posterior distribution. We obtain a posterior by training a GP model (equal to the one used by VBMC) on the log joint evaluated at the sampled points, and then drawing 2·10⁴ MCMC samples from the GP posterior predictive mean via parallel slice sampling [23, 25]. We also tested two methods for posterior estimation via GP surrogates, BAPE [12] and AGP [13]. Since these methods only compute an approximate posterior, we obtain a crude estimate of the log normalization constant (the LML) as the average difference between the log of the approximate posterior and the evaluated log joint at the top 20% points in terms of posterior density.
For all algorithms, we use default settings, allowing only changes based on knowledge of the mean and (diagonal) covariance of the provided prior.
Procedure  For each problem, we allow a fixed budget of 50 × (D + 2) likelihood evaluations, where D is the number of variables. Given the limited number of samples, we judge the quality of the posterior approximation in terms of its first two moments, by computing the "Gaussianized" symmetrized KL divergence (gsKL) between posterior approximation and ground truth. The gsKL is defined as the symmetrized KL between two multivariate normal distributions with mean and covariances equal, respectively, to the moments of the approximate posterior and the moments of the true posterior. We measure the quality of the approximation of the LML in terms of absolute error from ground truth, the rationale being that differences of LML are used for model comparison. Ideally, we want the LML error to be of order 1 or less, since much larger errors could severely affect the results of a comparison (e.g., differences of LML of 10 points or more are often presented as decisive evidence in favor of one model [26]). On the other hand, errors ≲ 0.1 can be considered negligible; higher precision is unnecessary. For each algorithm, we ran at least 20 separate runs per test problem with different random seeds, and report the median gsKL and LML error and the 95% CI of the median calculated by bootstrap.
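The gsKL metric is straightforward to compute from the first two moments, since the KL divergence between multivariate normals has a closed form. The sketch below uses the averaging convention for symmetrization; the paper's exact normalization may differ, and the helper names are ours.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    """Closed-form KL[N(m0, S0) || N(m1, S1)]."""
    D = len(m0)
    S1inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - D
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def gskl(m0, S0, m1, S1):
    """'Gaussianized' symmetrized KL: plug the first two moments of the
    approximate and true posteriors into Gaussians and symmetrize
    (here by averaging the two directions)."""
    return 0.5 * (kl_gauss(m0, S0, m1, S1) + kl_gauss(m1, S1, m0, S0))

m = np.zeros(2)
S = np.eye(2)
```

For identical moments the gsKL is zero; shifting one mean by one standard deviation along an axis yields a gsKL of 0.5 under this convention, giving a feel for the "below one" threshold used in the figures.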
For each run, we draw the starting point x₀ (if requested by the algorithm) uniformly from a box within 1 prior standard deviation (SD) from the prior mean. We use the same box to define the plausible bounds for VBMC.

⁴We also tested BBQ* (approximate GP hyperparameter marginalization), which performed similarly to BBQ.

4.1 Synthetic likelihoods

Problem set  We built a benchmark set of synthetic likelihoods belonging to three families that represent typical features of target densities (see Supplementary Material for details). Likelihoods in the lumpy family are built out of a mixture of 12 multivariate normals with component means drawn randomly in the unit D-hypercube, distinct diagonal covariances with SDs in the [0.2, 0.6] range, and mixture weights drawn from a Dirichlet distribution with unit concentration parameter. The lumpy distributions are mildly multimodal, in that modes are nearby and connected by regions with non-negligible probability mass. In the Student family, the likelihood is a multivariate Student's t-distribution with diagonal covariance and degrees of freedom equally spaced in the [2.5, 2 + D/2] range across different coordinate dimensions. These distributions have heavy tails which might be problematic for some methods. Finally, in the cigar family the likelihood is a multivariate normal in which one axis is 100 times longer than the others, and the covariance matrix is non-diagonal after a random rotation. The cigar family tests the ability of an algorithm to explore non-axis-aligned directions. For each family, we generated test functions for D ∈ {2, 4, 6, 8, 10}, for a total of 15 synthetic problems. For each problem, we pick as a broad prior a multivariate normal with mean centered at the expected mean of the family of distributions, and diagonal covariance matrix with SD equal to 3-4 times the SD in each dimension.
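A density with the stated properties of the lumpy family can be generated as in the sketch below. This follows the recipe in the text (12 components, means in the unit hypercube, diagonal SDs in [0.2, 0.6], Dirichlet(1) weights) but is our own reconstruction, not the paper's benchmark code, and the random seed and helper name are arbitrary.

```python
import numpy as np

def make_lumpy(D, rng=np.random.default_rng(1)):
    """Sample one 'lumpy' test log-density: a mixture of 12 multivariate
    normals with means in the unit D-hypercube, diagonal SDs drawn from
    [0.2, 0.6], and mixture weights from a Dirichlet(1) distribution."""
    means = rng.uniform(0.0, 1.0, size=(12, D))
    sds = rng.uniform(0.2, 0.6, size=(12, D))
    w = rng.dirichlet(np.ones(12))
    def log_lik(x):
        # Per-component diagonal Gaussian log densities, then log-sum.
        logs = (-0.5 * np.sum(((x - means) / sds)**2, axis=1)
                - np.sum(np.log(sds), axis=1) - 0.5 * D * np.log(2 * np.pi))
        return np.log(np.sum(w * np.exp(logs)))
    return log_lik

log_lik = make_lumpy(D=2)
```

Because nearby components overlap, the resulting density has several soft modes connected by mass, rather than isolated peaks.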
For all problems, we compute ground truth values for the LML and the posterior mean and covariance analytically or via multiple 1-D numerical integrals.
Results We show the results for D ∈ {2, 6, 10} in Fig. 2 (see Supplementary Material for full results, in higher resolution). Almost all algorithms perform reasonably well in very low dimension (D = 2), and in fact several algorithms converge faster than VBMC to the ground truth (e.g., WSABI-L). However, as the dimension increases, we see that all algorithms start failing, with only VBMC performing consistently well across problems. In particular, besides the simple D = 2 case, only VBMC obtains acceptable results for the LML with non-axis-aligned distributions (cigar). Some algorithms (such as AGP and BAPE) exhibited large numerical instabilities on the cigar family, despite our best attempts at regularization, such that many runs were unable to complete.

Figure 2: Synthetic likelihoods. A. Median absolute error of the LML estimate with respect to ground truth, as a function of number of likelihood evaluations, on the lumpy (top), Student (middle), and cigar (bottom) problems, for D ∈ {2, 6, 10} (columns). B. Median "Gaussianized" symmetrized KL divergence between the algorithm's posterior and ground truth. For both metrics, shaded areas are 95% CI of the median, and we consider a desirable threshold to be below one (dashed line).

4.2 Real likelihoods of neuronal model
Problem set For a test with real models and data, we consider a computational model of neuronal orientation selectivity in visual cortex [14]. We fit the neural recordings of one V1 and one V2 cell with the authors' neuronal model that combines effects of filtering, suppression, and response nonlinearity [14]. The model is analytical but still computationally expensive due to large datasets and a cascade of several nonlinear operations.
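For the synthetic problems of Section 4.1, when the target factorizes across coordinates (e.g., the Student family), the ground-truth normalizing constant and posterior moments reduce to one-dimensional integrals. A minimal sketch using adaptive quadrature; the likelihood and prior parameters below are illustrative placeholders, not the benchmark's actual settings:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm, t as student_t

def moments_1d(unnorm_pdf):
    """Normalizing constant, mean, and variance of a 1-D unnormalized
    density, computed by adaptive quadrature over the real line."""
    z, _ = integrate.quad(unnorm_pdf, -np.inf, np.inf)
    m, _ = integrate.quad(lambda x: x * unnorm_pdf(x), -np.inf, np.inf)
    s, _ = integrate.quad(lambda x: x ** 2 * unnorm_pdf(x), -np.inf, np.inf)
    mean = m / z
    return z, mean, s / z - mean ** 2

# One coordinate of a separable problem: Student's t likelihood times a
# broad Gaussian prior (df, loc, scale values chosen for illustration only).
lik = lambda x: student_t.pdf(x, df=3.0, loc=0.5, scale=1.0)
prior = lambda x: norm.pdf(x, loc=0.0, scale=3.0)
z_i, mean_i, var_i = moments_1d(lambda x: lik(x) * prior(x))
# Summing log z_i over all D coordinates gives the ground-truth LML.
```

For a diagonal-covariance target, stacking the per-coordinate means and variances yields the full posterior mean and covariance.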
For the purpose of our benchmark, we fix some parameters of the original model to their MAP values, yielding an inference problem with D = 7 free parameters of experimental interest. We transform bounded parameters to unconstrained space via a logit transform [24], and we place a broad Gaussian prior on each of the transformed variables, based on estimates from other neurons in the same study [14] (see Supplementary Material for more details on the setup). For both datasets, we computed the ground truth with 4 × 10^5 samples from the posterior, obtained via parallel slice sampling after a long burn-in. We calculated the ground truth LML from posterior MCMC samples via Geyer's reverse logistic regression [27], and we independently validated it with a Laplace approximation, obtained via numerical calculation of the Hessian at the MAP (for both datasets, Geyer's and Laplace's estimates of the LML are within ∼ 1 point).

Figure 3: Neuronal model likelihoods. A. Median absolute error of the LML estimate, as a function of number of likelihood evaluations, for two distinct neurons (D = 7). B. Median "Gaussianized" symmetrized KL divergence between the algorithm's posterior and ground truth. See also Fig.
2.

Results For both datasets, VBMC is able to find a reasonable approximation of the LML and of the posterior, whereas no other algorithm produces a usable solution (Fig. 3). Importantly, the behavior of VBMC is fairly consistent across runs (see Supplementary Material). We argue that the superior results of VBMC stem from a better exploration of the posterior landscape, and from a better approximation of the log joint (used in the ELBO), related but distinct features. To show this, we first trained GPs (as we did for the other methods) on the samples collected by VBMC (see Supplementary Material). The posteriors obtained by sampling from the GPs trained on the VBMC samples scored a better gsKL than the other methods (and occasionally better than VBMC itself). Second, we estimated the marginal likelihood with WSABI-L using the samples collected by VBMC. The LML error in this hybrid approach is much lower than the error of WSABI-L alone, but still higher than the LML error of VBMC. These results combined suggest that VBMC builds better (and more stable) surrogate models and obtains higher-quality samples than the compared methods.
The performance of VBMC-U and VBMC-P is similar on synthetic functions, but the 'prospective' acquisition function converges faster on the real problem set, so we recommend a_pro as the default. Besides scoring well on quantitative metrics, VBMC is able to capture nontrivial features of the true posteriors (see Supplementary Material for examples). Moreover, VBMC achieves these results with a relatively small computational cost (see Supplementary Material for discussion).

5 Conclusions

In this paper, we have introduced VBMC, a novel Bayesian inference framework that combines variational inference with active-sampling Bayesian quadrature for models with expensive black-box likelihoods.
Our method affords both posterior estimation and model inference by providing an approximate posterior and a lower bound to the model evidence. We have shown on both synthetic and real model-fitting problems that, given a contained budget of likelihood evaluations, VBMC is able to reliably compute valid, usable approximations in realistic scenarios, unlike previous methods whose applicability seems to be limited to very low dimension or simple likelihoods. Our method thus represents a novel, useful tool for approximate inference in science and engineering.
We believe this is only the starting point to harness the combined power of variational inference and Bayesian quadrature. Not unlike the related field of Bayesian optimization, VBMC paves the way to a plenitude of both theoretical (e.g., analysis of convergence, development of principled acquisition functions) and applied work (e.g., application to case studies of interest, extension to noisy likelihood evaluations, algorithmic improvements), which we plan to pursue as future directions.

Acknowledgments

We thank Robbe Goris for sharing data and code for the neuronal model; Michael Schartner and Rex Liu for comments on an earlier version of the paper; and three anonymous reviewers for useful feedback.

References
[1] Rasmussen, C. & Williams, C. K. I. (2006) Gaussian Processes for Machine Learning. (MIT Press).
[2] Jones, D. R., Schonlau, M., & Welch, W. J. (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13, 455–492.
[3] Brochu, E., Cora, V. M., & De Freitas, N.
(2010) A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
[4] Snoek, J., Larochelle, H., & Adams, R. P. (2012) Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 25, 2951–2959.
[5] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & de Freitas, N. (2016) Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE 104, 148–175.
[6] Acerbi, L. & Ma, W. J. (2017) Practical Bayesian optimization for model fitting with Bayesian adaptive direct search. Advances in Neural Information Processing Systems 30, 1834–1844.
[7] O'Hagan, A. (1991) Bayes–Hermite quadrature. Journal of Statistical Planning and Inference 29, 245–260.
[8] Ghahramani, Z. & Rasmussen, C. E. (2002) Bayesian Monte Carlo. Advances in Neural Information Processing Systems 15, 505–512.
[9] Osborne, M., Duvenaud, D. K., Garnett, R., Rasmussen, C. E., Roberts, S. J., & Ghahramani, Z. (2012) Active learning of model evidence using Bayesian quadrature. Advances in Neural Information Processing Systems 25, 46–54.
[10] Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., & Roberts, S. J. (2014) Sampling for inference in probabilistic models with fast Bayesian quadrature. Advances in Neural Information Processing Systems 27, 2789–2797.
[11] Briol, F.-X., Oates, C., Girolami, M., & Osborne, M. A. (2015) Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. Advances in Neural Information Processing Systems 28, 1162–1170.
[12] Kandasamy, K., Schneider, J., & Póczos, B. (2015) Bayesian active learning for posterior estimation. Twenty-Fourth International Joint Conference on Artificial Intelligence.
[13] Wang, H. & Li, J.
(2018) Adaptive Gaussian process approximation for Bayesian inference with expensive likelihood functions. Neural Computation pp. 1–23.
[14] Goris, R. L., Simoncelli, E. P., & Movshon, J. A. (2015) Origin and function of tuning diversity in macaque visual cortex. Neuron 88, 819–831.
[15] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999) An introduction to variational methods for graphical models. Machine Learning 37, 183–233.
[16] Bishop, C. M. (2006) Pattern Recognition and Machine Learning. (Springer).
[17] Gramacy, R. B. & Lee, H. K. (2012) Cases for the nugget in modeling computer experiments. Statistics and Computing 22, 713–722.
[18] Kingma, D. P. & Welling, M. (2013) Auto-encoding variational Bayes. Proceedings of the 2nd International Conference on Learning Representations.
[19] Miller, A. C., Foti, N., & Adams, R. P. (2017) Variational boosting: Iteratively refining posterior approximations. Proceedings of the 34th International Conference on Machine Learning 70, 2420–2429.
[20] Gershman, S., Hoffman, M., & Blei, D. (2012) Nonparametric variational inference. Proceedings of the 29th International Conference on Machine Learning.
[21] Kingma, D. P. & Ba, J. (2014) Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations.
[22] Hansen, N., Müller, S. D., & Koumoutsakos, P. (2003) Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation 11, 1–18.
[23] Neal, R. M. (2003) Slice sampling. Annals of Statistics 31, 705–741.
[24] Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017) Stan: A probabilistic programming language. Journal of Statistical Software 76.
[25] Gilks, W. R., Roberts, G. O., & George, E. I.
(1994) Adaptive direction sampling. The Statistician 43, 179–189.
[26] Kass, R. E. & Raftery, A. E. (1995) Bayes factors. Journal of the American Statistical Association 90, 773–795.
[27] Geyer, C. J. (1994) Estimating normalizing constants and reweighting mixtures. (Technical report).