{"title": "Linear Response Methods for Accurate Covariance Estimates from Mean Field Variational Bayes", "book": "Advances in Neural Information Processing Systems", "page_first": 1441, "page_last": 1449, "abstract": "Mean field variational Bayes (MFVB) is a popular posterior approximation method due to its fast runtime on large-scale data sets. However, a well known failing of MFVB is that it underestimates the uncertainty of model variables (sometimes severely) and provides no information about model variable covariance. We generalize linear response methods from statistical physics to deliver accurate uncertainty estimates for model variables---both for individual variables and coherently across variables. We call our method linear response variational Bayes (LRVB). When the MFVB posterior approximation is in the exponential family, LRVB has a simple, analytic form, even for non-conjugate models. Indeed, we make no assumptions about the form of the true posterior. We demonstrate the accuracy and scalability of our method on a range of models for both simulated and real data.", "full_text": "Linear Response Methods for Accurate Covariance\n\nEstimates from Mean Field Variational Bayes\n\nRyan Giordano\nUC Berkeley\n\nTamara Broderick\n\nMIT\n\nMichael Jordan\nUC Berkeley\n\nrgiordano@berkeley.edu\n\ntbroderick@csail.mit.edu\n\njordan@cs.berkeley.edu\n\nAbstract\n\nMean \ufb01eld variational Bayes (MFVB) is a popular posterior approximation\nmethod due to its fast runtime on large-scale data sets. However, a well known ma-\njor failing of MFVB is that it underestimates the uncertainty of model variables\n(sometimes severely) and provides no information about model variable covari-\nance. We generalize linear response methods from statistical physics to deliver\naccurate uncertainty estimates for model variables\u2014both for individual variables\nand coherently across variables. We call our method linear response variational\nBayes (LRVB). 
When the MFVB posterior approximation is in the exponential\nfamily, LRVB has a simple, analytic form, even for non-conjugate models. In-\ndeed, we make no assumptions about the form of the true posterior. We demon-\nstrate the accuracy and scalability of our method on a range of models for both\nsimulated and real data.\n\n1\n\nIntroduction\n\nWith increasingly ef\ufb01cient data collection methods, scientists are interested in quickly analyzing\never larger data sets. In particular, the promise of these large data sets is not simply to \ufb01t old models\nbut instead to learn more nuanced patterns from data than has been possible in the past. In theory,\nthe Bayesian paradigm yields exactly these desiderata. Hierarchical modeling allows practitioners\nto capture complex relationships between variables of interest. Moreover, Bayesian analysis allows\npractitioners to quantify the uncertainty in any model estimates\u2014and to do so coherently across all\nof the model variables.\nMean \ufb01eld variational Bayes (MFVB), a method for approximating a Bayesian posterior distribu-\ntion, has grown in popularity due to its fast runtime on large-scale data sets [1\u20133]. But a well known\nmajor failing of MFVB is that it gives underestimates of the uncertainty of model variables that\ncan be arbitrarily bad, even when approximating a simple multivariate Gaussian distribution [4\u2013\n6]. 
Also, MFVB provides no information about how the uncertainties in different model variables interact [5–8].
By generalizing linear response methods from statistical physics [9–12] to exponential family variational posteriors, we develop a methodology that augments MFVB to deliver accurate uncertainty estimates for model variables—both for individual variables and coherently across variables. In particular, as we elaborate in Section 2, when the approximating posterior in MFVB is in the exponential family, MFVB defines a fixed-point equation in the means of the approximating posterior, and our approach yields a covariance estimate by perturbing this fixed point. We call our method linear response variational Bayes (LRVB).
We provide a simple, intuitive formula for calculating the linear response correction by solving a linear system based on the MFVB solution (Section 2.2). We show how the sparsity of this system for many common statistical models may be exploited for scalable computation (Section 2.3). We demonstrate the wide applicability of LRVB by working through a diverse set of models to show that the LRVB covariance estimates are nearly identical to those produced by a Markov Chain Monte Carlo (MCMC) sampler, even when the MFVB variance is dramatically underestimated (Section 3). Finally, we focus in more depth on models for finite mixtures of multivariate Gaussians (Section 3.3), which have historically been a sticking point for MFVB covariance estimates [5, 6]. We show that LRVB can give accurate covariance estimates orders of magnitude faster than MCMC (Section 3.3). We demonstrate both theoretically and empirically that, for this Gaussian mixture model, LRVB scales linearly in the number of data points and approximately cubically in the dimension of the parameter space (Section 3.4).

Previous Work. 
Linear response methods originated in the statistical physics literature [10–13]. These methods have been applied to find new learning algorithms for Boltzmann machines [13], covariance estimates for discrete factor graphs [14], and independent component analysis [15]. [16] states that linear response methods could be applied to general exponential family models but works out details only for Boltzmann machines. [10], which is closest in spirit to the present work, derives general linear response corrections to variational approximations; indeed, the authors go further to formulate linear response as the first term in a functional Taylor expansion to calculate full pairwise joint marginals. However, it may not be obvious to the practitioner how to apply the general formulas of [10]. Our contributions in the present work are (1) the provision of concrete, straightforward formulas for covariance correction that are fast and easy to compute, (2) demonstrations of the success of our method on a wide range of new models, and (3) an accompanying suite of code.

2 Linear response covariance estimation

2.1 Variational Inference

Suppose we observe N data points, denoted by the N-long column vector x, and denote our unobserved model parameters by θ. Here, θ is a column vector residing in some space Θ; it has J subgroups and total dimension D. Our model is specified by a distribution of the observed data given the model parameters—the likelihood p(x|θ)—and a prior distributional belief on the model parameters p(θ). Bayes' Theorem yields the posterior p(θ|x).
Mean-field variational Bayes (MFVB) approximates p(θ|x) by a factorized distribution of the form q(θ) = ∏_{j=1}^J q(θ_j). q is chosen so that the Kullback-Leibler divergence KL(q||p) between q and p is minimized. 
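The variance-underestimation problem that motivates this work can be seen in a few lines. The numerical sketch below (our illustration, not the paper's code) minimizes KL(q||p) in closed form over factorized Gaussians q for a correlated bivariate Gaussian p, and checks that the optimal factorized variances 1/Λ_jj fall below the true marginal variances Σ_jj.

```python
import numpy as np

# Illustration (not from the paper): for a correlated bivariate Gaussian
# p = N(0, Sigma), compare KL(q||p) for two factorized Gaussian candidates
# q = N(0, Vq) with diagonal Vq. The KL-optimal factorized variances are
# 1/Lambda_jj (Lambda = Sigma^{-1}), which underestimate Sigma_jj.
# (The variational means match the true means, so we set all means to 0.)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lambda = np.linalg.inv(Sigma)

def kl_gaussian(Vq, Sigma):
    """KL( N(0, Vq) || N(0, Sigma) ) in closed form."""
    d = Sigma.shape[0]
    Lam = np.linalg.inv(Sigma)
    return 0.5 * (np.trace(Lam @ Vq) - d
                  + np.log(np.linalg.det(Sigma))
                  - np.log(np.linalg.det(Vq)))

v_mfvb = np.diag(1.0 / np.diag(Lambda))   # the factorized KL optimum
v_marginal = np.diag(np.diag(Sigma))      # matches the true marginals

assert kl_gaussian(v_mfvb, Sigma) < kl_gaussian(v_marginal, Sigma)
assert np.all(np.diag(v_mfvb) < np.diag(Sigma))  # variances underestimated
```

The stronger the correlation in Sigma, the more severe the shortfall, which is the behavior documented in [4–6].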
Equivalently, q is chosen so that E := L + S, for L := E_q[log p(θ|x)] (the expected log posterior) and S := −E_q[log q(θ)] (the entropy of the variational distribution), is maximized:

q* := argmin_q KL(q||p) = argmin_q E_q[log q(θ) − log p(θ|x)] = argmax_q E.   (1)

Up to a constant in θ, the objective E is sometimes called the "evidence lower bound", or the ELBO [5]. In what follows, we further assume that our variational distribution, q(θ), is in the exponential family with natural parameter η and log partition function A: log q(θ|η) = η^T θ − A(η) (expressed with respect to some base measure in θ). We assume that p(θ|x) is expressed with respect to the same base measure in θ as for q. Below, we will make only mild regularity assumptions about the true posterior p(θ|x) and no assumptions about its form.

If we assume additionally that the parameters η* at the optimum q*(θ) = q(θ|η*) are in the interior of the feasible space, then q(θ|η) may instead be described by the mean parameterization: m := E_q[θ], with m* := E_{q*}[θ]. Thus, the objective E can be expressed as a function of m, and the first-order condition for the optimality of q* becomes the fixed point equation

(∂E/∂m)|_{m=m*} = 0  ⇔  (∂E/∂m + m)|_{m=m*} = m*  ⇔  M(m*) = m*  for  M(m) := ∂E/∂m + m.   (2)

2.2 Linear Response

Let V denote the covariance matrix of θ under the variational distribution q*(θ), and let Σ denote the covariance matrix of θ under the true posterior, p(θ|x):

V := Cov_{q*}[θ],   Σ := Cov_p[θ].

In MFVB, V may be a poor estimator of Σ, even when m* ≈ E_p[θ], i.e., when the marginal estimated means match well [5–7]. Our goal is to use the MFVB solution and linear response methods to construct an improved estimator for Σ. We will focus on the covariance of the natural sufficient statistic θ, though the covariance of functions of θ can be estimated similarly (see Appendix A).

The essential idea of linear response is to perturb the first-order condition M(m*) = m* around its optimum. In particular, define the distribution p_t(θ|x) as a log-linear perturbation of the posterior:

log p_t(θ|x) := log p(θ|x) + t^T θ − C(t),   (3)

where C(t) is a constant in θ. We assume that p_t(θ|x) is a well-defined distribution for any t in an open ball around 0. Since C(t) normalizes p_t(θ|x), it is in fact the cumulant-generating function of p(θ|x), so the derivatives of C(t) evaluated at t = 0 give the cumulants of θ. To see why this perturbation may be useful, recall that the second cumulant of a distribution is the covariance matrix, our desired estimand:

Σ = Cov_p[θ] = (d²C(t)/(dt^T dt))|_{t=0} = (d E_{p_t}[θ]/dt^T)|_{t=0}.

The practical success of MFVB relies on the fact that its estimates of the mean are often good in practice. 
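The cumulant identity behind Eq. (3) is easy to verify numerically in one dimension. The sketch below (our illustration with arbitrary parameter values; simple quadrature, not the paper's code) computes C(t) = log E_p[exp(tθ)] for a Gaussian p and confirms that its second derivative at t = 0 recovers the variance.

```python
import numpy as np

# Illustration of Eq. (3): for p(theta) = N(mu, sigma2), the normalizer of
# the tilted density is C(t) = log E_p[exp(t * theta)], whose second
# derivative at t = 0 is the second cumulant, i.e. the variance.
mu, sigma2 = 1.5, 0.7            # illustrative values
grid = np.linspace(-20.0, 20.0, 200001)
dx = grid[1] - grid[0]
log_p = -0.5 * (grid - mu) ** 2 / sigma2 - 0.5 * np.log(2 * np.pi * sigma2)

def C(t):
    """Cumulant-generating function by numerical quadrature."""
    return np.log(np.sum(np.exp(log_p + t * grid)) * dx)

# Central finite-difference second derivative of C at t = 0.
eps = 0.1
second_cumulant = (C(eps) - 2 * C(0.0) + C(-eps)) / eps ** 2
assert abs(second_cumulant - sigma2) < 1e-3
```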
So we assume that m*_t ≈ E_{p_t}[θ], where m*_t is the mean parameter characterizing q*_t, and q*_t is the MFVB approximation to p_t. (We examine this assumption further in Section 3.) Taking derivatives with respect to t on both sides of this mean approximation and setting t = 0 yields

Σ = Cov_p[θ] ≈ (dm*_t/dt^T)|_{t=0} =: Σ̂,   (4)

where we call Σ̂ the linear response variational Bayes (LRVB) estimate of the posterior covariance of θ.

We next show that there exists a simple formula for Σ̂. Recalling the form of the KL divergence (see Eq. (1)), we have that −KL(q||p_t) = E + t^T m =: E_t. Then by Eq. (2), we have m*_t = M_t(m*_t) for M_t(m) := M(m) + t. It follows from the chain rule that

dm*_t/dt = (∂M_t/∂m^T)|_{m=m*_t} (dm*_t/dt) + ∂M_t/∂t = (∂M_t/∂m^T)|_{m=m*_t} (dm*_t/dt) + I,   (5)

where I is the identity matrix. If we assume that we are at a strict local optimum and so can invert the Hessian of E, then evaluating at t = 0 yields

Σ̂ = (dm*_t/dt^T)|_{t=0} = (∂M/∂m) Σ̂ + I = (∂²E/(∂m∂m^T) + I) Σ̂ + I  ⇒  Σ̂ = −(∂²E/(∂m∂m^T))⁻¹,   (6)

where we have used the form for M in Eq. (2). So the LRVB estimator Σ̂ is the negative inverse Hessian of the optimization objective, E, as a function of the mean parameters. It follows from Eq. (6) that Σ̂ is both symmetric and positive definite when the variational distribution q* is at least a local maximum of E.

We can further simplify Eq. (6) by using the exponential family form of the variational approximating distribution q. 
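The multivariate-normal case gives a quick end-to-end check of Eq. (6). The sketch below is our illustration, not the paper's code: for a Gaussian target and a factorized Gaussian q, the objective E as a function of the variational means is the quadratic stated in the comments (this closed form is an assumption of the sketch; cf. the multivariate normal discussion and Appendix B), and numerically inverting its Hessian recovers the full covariance even though the MFVB marginal variances are too small.

```python
import numpy as np

# Check of Eq. (6) on a Gaussian target (illustration only). With
# p = N(mu, Sigma) and a factorized Gaussian q, the objective as a
# function of the variational means m is (up to a constant in m)
#     E(m) = -0.5 * (m - mu)^T Lambda (m - mu),   Lambda = Sigma^{-1},
# so -(Hessian of E)^{-1} recovers Sigma exactly, while the MFVB
# variances 1/Lambda_jj underestimate the true marginals.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
Lambda = np.linalg.inv(Sigma)

def E(m):
    return -0.5 * (m - mu) @ Lambda @ (m - mu)

def hessian(f, m0, eps=1e-4):
    """Central finite-difference Hessian of a scalar function."""
    d = len(m0)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.eye(d)[i] * eps
            ej = np.eye(d)[j] * eps
            H[i, j] = (f(m0 + ei + ej) - f(m0 + ei - ej)
                       - f(m0 - ei + ej) + f(m0 - ei - ej)) / (4 * eps ** 2)
    return H

Sigma_lrvb = -np.linalg.inv(hessian(E, mu))
assert np.allclose(Sigma_lrvb, Sigma, atol=1e-5)        # LRVB is exact here
assert np.all(1.0 / np.diag(Lambda) < np.diag(Sigma))   # MFVB is too small
```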
For q in exponential family form as above, the negative entropy −S is dual to the log partition function A [17], so S = −η^T m + A(η); hence,

dS/dm = ∂S/∂m + (∂S/∂η^T)(dη/dm) = −η(m) + (∂A/∂η − m)^T (dη/dm) = −η(m).

Recall that for exponential families, ∂η(m)/∂m = V⁻¹. So Eq. (6) becomes¹

Σ̂ = −(∂²L/(∂m∂m^T) + ∂²S/(∂m∂m^T))⁻¹ = −(H − V⁻¹)⁻¹, for H := ∂²L/(∂m∂m^T)  ⇒  Σ̂ = (I − V H)⁻¹ V.   (7)

When the true posterior p(θ|x) is in the exponential family and contains no products of the variational moment parameters, then H = 0 and Σ̂ = V. In this case, the mean field assumption is correct, and the LRVB and MFVB covariances coincide at the true posterior covariance. Furthermore, even when the variational assumptions fail, as long as certain mean parameters are estimated exactly, then this formula is also exact for covariances. E.g., notably, MFVB is well-known to provide arbitrarily bad estimates of the covariance of a multivariate normal posterior [4–7], but since MFVB estimates the means exactly, LRVB estimates the covariance exactly (see Appendix B).

2.3 Scaling the matrix inverse

Eq. (7) requires the inverse of a matrix as large as the parameter dimension of the posterior p(θ|x), which may be computationally prohibitive. Suppose we are interested in the covariance of parameter sub-vector α, and let z denote the remaining parameters: θ = (α, z)^T. We can partition Σ = (Σ_α, Σ_{αz}; Σ_{zα}, Σ_z). Similar partitions exist for V and H. If we assume a mean-field factorization q(α, z) = q(α)q(z), then V_{αz} = 0. (The variational distributions may factor further as well.) 
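Both the algebraic identity behind Eq. (7) and the Schur-complement reduction of Eq. (8) below are pure linear algebra and can be checked numerically. The sketch below is our illustration, not the paper's code; V and H are arbitrary well-conditioned test matrices with the required structure (V block-diagonal, H symmetric), not quantities from any particular model.

```python
import numpy as np

# Numerical check (illustration only) of two facts used in this section:
#   (i)  -(H - V^{-1})^{-1} = (I - V H)^{-1} V  whenever the inverses exist;
#   (ii) the Schur-complement formula for the alpha block, which needs
#        only an alpha-sized inverse when V_az = 0.
rng = np.random.default_rng(0)
da, dz = 2, 3
d = da + dz

def spd(k):
    """A well-conditioned symmetric positive definite test matrix."""
    A = rng.normal(size=(k, k))
    return A @ A.T + k * np.eye(k)

Va, Vz = spd(da), spd(dz)
V = np.block([[Va, np.zeros((da, dz))],
              [np.zeros((dz, da)), Vz]])
H = -0.1 * spd(d)          # symmetric; sign keeps every inverse well defined

# (i) Eq. (7): two equivalent expressions for the LRVB covariance.
Sigma_hat = np.linalg.inv(np.eye(d) - V @ H) @ V
assert np.allclose(Sigma_hat, -np.linalg.inv(H - np.linalg.inv(V)))

# (ii) Eq. (8): the alpha block via the Schur complement.
Ha, Haz = H[:da, :da], H[:da, da:]
Hza, Hz = H[da:, :da], H[da:, da:]
inner = np.linalg.inv(np.eye(dz) - Vz @ Hz)
Sigma_alpha = np.linalg.inv(
    np.eye(da) - Va @ Ha - Va @ Haz @ inner @ Vz @ Hza) @ Va
assert np.allclose(Sigma_hat[:da, :da], Sigma_alpha)
```

Here the inner (I_z − V_z H_z)⁻¹ is still formed explicitly; in the models of Section 3 it is the structure of this block (e.g., H_z = 0 for the mixture model) that makes the reduction cheap.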
We calculate the Schur complement of Σ̂ in Eq. (7) with respect to its zth component to find that

Σ̂_α = (I_α − V_α H_α − V_α H_{αz} (I_z − V_z H_z)⁻¹ V_z H_{zα})⁻¹ V_α.   (8)

Here, I_α and I_z refer to α- and z-sized identity matrices, respectively. In cases where (I_z − V_z H_z)⁻¹ can be efficiently calculated (e.g., all the experiments in Section 3; see Fig. (5) in Appendix D), Eq. (8) requires only an α-sized inverse.

3 Experiments

We compare the covariance estimates from LRVB and MFVB in a range of models, including models both with and without conjugacy.² We demonstrate the superiority of the LRVB estimate to MFVB in all models before focusing in on Gaussian mixture models for a more detailed scalability analysis. For each model, we simulate datasets with a range of parameters. In the graphs, each point represents the outcome from a single simulation. The horizontal axis is always the result from an MCMC procedure, which we take as the ground truth. As discussed in Section 2.2, the accuracy of the LRVB covariance for a sufficient statistic depends on the approximation m*_t ≈ E_{p_t}[θ]. In the models to follow, we focus on regimes of moderate dependence where this is a reasonable assumption for most of the parameters (see Section 3.2 for an exception). Except where explicitly mentioned, the MFVB means of the parameters of interest coincided well with the MCMC means, so our key assumption in the LRVB derivations of Section 2 appears to hold.

¹For a comparison of this formula with the frequentist "supplemented expectation-maximization" procedure, see Appendix C.

²All the code is available on our Github repository, rgiordan/LinearResponseVariationalBayesNIPS2015.

3.1 Normal-Poisson model

Model. First consider a Poisson generalized linear mixed model, exhibiting non-conjugacy. We observe Poisson draws y_n and a design vector x_n, for n = 1, ..., N. Implicitly below, we will everywhere condition on the x_n, which we consider to be a fixed design matrix. The generative model is:

z_n | β, τ ~indep~ N(z_n | β x_n, τ⁻¹),    y_n | z_n ~indep~ Poisson(y_n | exp(z_n)),
β ~ N(β | 0, σ²_β),    τ ~ Γ(τ | α_τ, β_τ).   (9)

For MFVB, we factorize q(β, τ, z) = q(β) q(τ) ∏_{n=1}^N q(z_n). Inspection reveals that the optimal q(β) will be Gaussian, and the optimal q(τ) will be gamma (see Appendix D). Since the optimal q(z_n) does not take a standard exponential family form, we restrict further to Gaussian q(z_n). There are product terms in L (for example, the term E_q[τ] E_q[β] E_q[z_n]), so H ≠ 0, and the mean field approximation does not hold; we expect LRVB to improve on the MFVB covariance estimate. A detailed description of how to calculate the LRVB estimate can be found in Appendix D.

Results. We simulated 100 datasets, each with 500 data points and a randomly chosen value for μ and τ. We drew the design matrix x from a normal distribution and held it fixed throughout. We set prior hyperparameters σ²_β = 10, α_τ = 1, and β_τ = 1. To get the "ground truth" covariance matrix, we took 20000 draws from the posterior with the R MCMCglmm package [18], which used a combination of Gibbs and Metropolis-Hastings sampling. Our LRVB estimates used the autodifferentiation software JuMP [19].

Results are shown in Fig. (1). Since τ is high in many of the simulations, z and β are correlated, and MFVB underestimates the standard deviation of β and τ. 
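For reference, the generative process of Eq. (9) is short to simulate. In the sketch below (our illustration, not the paper's code), the fixed values of β and τ are illustrative stand-ins for the random draws used in the experiments, chosen to sit in the moderate-dependence regime described above.

```python
import numpy as np

# Direct simulation of the normal-Poisson model in Eq. (9).
# beta and tau are illustrative fixed values, not the paper's draws.
rng = np.random.default_rng(42)
N = 500
beta, tau = 0.5, 4.0                   # illustrative parameter values
x = rng.normal(size=N)                 # fixed design vector
z = rng.normal(loc=beta * x, scale=tau ** -0.5, size=N)  # latent z_n
y = rng.poisson(lam=np.exp(z))         # Poisson observations y_n

assert y.shape == (N,) and np.all(y >= 0)
```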
LRVB matches the MCMC standard deviation for all β, and matches for τ in all but the most correlated simulations. When τ gets very high, the MFVB assumption starts to bias the point estimates of τ, and the LRVB standard deviations start to differ from MCMC. Even in that case, however, the LRVB standard deviations are much more accurate than the MFVB estimates, which underestimate the uncertainty dramatically. The final plot shows that LRVB estimates the covariances of z with β, τ, and log τ reasonably well, while MFVB considers them independent.

Figure 1: Posterior mean and covariance estimates on normal-Poisson simulation data.

3.2 Linear random effects

Model. Next, we consider a simple random slope linear model, with full details in Appendix E. We observe scalars y_n and r_n and a vector x_n, for n = 1, ..., N. Implicitly below, we will everywhere condition on all the x_n and r_n, which we consider to be fixed design matrices. In general, each random effect may appear in multiple observations, and the index k(n) indicates which random effect, z_k, affects which observation, y_n. The full generative model is:

y_n | β, z, τ ~indep~ N(y_n | β^T x_n + r_n z_{k(n)}, τ⁻¹),    z_k | ν ~iid~ N(z_k | 0, ν⁻¹),
β ~ N(β | 0, Σ_β),    ν ~ Γ(ν | α_ν, β_ν),    τ ~ Γ(τ | α_τ, β_τ).

We assume the mean-field factorization q(β, ν, τ, z) = q(β) q(τ) q(ν) ∏_{k=1}^K q(z_k). Since this is a conjugate model, the optimal q will be in the exponential family with no additional assumptions.

Results. We simulated 100 datasets of 300 datapoints each and 30 distinct random effects. We set prior hyperparameters to α_ν = 2, β_ν = 2, α_τ = 2, β_τ = 2, and Σ_β = 0.1⁻¹ I. 
Our x_n was 2-dimensional. As in Section 3.1, we implemented the variational solution using the autodifferentiation software JuMP [19]. The MCMC fit was performed using MCMCglmm [18].

Intuitively, when the random effect explanatory variables r_n are highly correlated with the fixed effects x_n, then the posteriors for z and β will also be correlated, leading to a violation of the mean field assumption and an underestimated MFVB covariance. In our simulation, we used r_n = x_{1n} + N(0, 0.4), so that r_n is correlated with x_{1n} but not x_{2n}. The result, as seen in Fig. (2), is that β_1 is underestimated by MFVB, but β_2 is not. The ν parameter, in contrast, is not well-estimated by the MFVB approximation in many of the simulations. Since LRVB depends on the approximation m*_t ≈ E_{p_t}[θ], its covariance estimate for ν is not accurate either (Fig. (2)). However, LRVB still improves on the MFVB standard deviation.

Figure 2: Posterior mean and covariance estimates on linear random effects simulation data.

3.3 Mixture of normals

Model. Mixture models constitute some of the most popular models for MFVB application [1, 2] and are often used as an example of where MFVB covariance estimates may go awry [5, 6]. Thus, we will consider in detail a Gaussian mixture model (GMM) consisting of a K-component mixture of P-dimensional multivariate normals with unknown component means, covariances, and weights. In what follows, the weight π_k is the probability of the kth component, μ_k is the P-dimensional mean of the kth component, and Λ_k is the P × P precision matrix of the kth component (so Λ_k⁻¹ is the covariance parameter). N is the number of data points, and x_n is the nth observed P-dimensional data point. 
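The mixture just described is straightforward to simulate. In the sketch below (our illustration; the weights, means, and precisions are illustrative choices, not the paper's actual simulation settings), each point first draws a component label and then a multivariate normal observation from that component.

```python
import numpy as np

# Simulating a K-component, P-dimensional Gaussian mixture.
# pi, mu, and Lam are illustrative choices, not the paper's settings.
rng = np.random.default_rng(0)
N, K, P = 10000, 2, 2
pi = np.array([0.4, 0.6])                      # component weights pi_k
mu = np.array([[0.0, 0.0], [3.0, 3.0]])        # component means mu_k
Lam = np.stack([np.eye(P), 2.0 * np.eye(P)])   # component precisions Lam_k
cov = np.linalg.inv(Lam)                       # per-component covariances

z = rng.choice(K, size=N, p=pi)                # latent component labels
x = mu[z] + np.stack(
    [rng.multivariate_normal(np.zeros(P), cov[k]) for k in z])

assert x.shape == (N, P)
assert abs((z == 1).mean() - pi[1]) < 0.03     # empirical mixing weight
```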
We employ the standard trick of augmenting the data generating process with the latent indicator variables z_{nk}, for n = 1, ..., N and k = 1, ..., K, such that z_{nk} = 1 implies x_n ~ N(μ_k, Λ_k⁻¹). So the generative model is:

P(z_{nk} = 1) = π_k,    p(x | π, μ, Λ, z) = ∏_{n=1}^N ∏_{k=1}^K N(x_n | μ_k, Λ_k⁻¹)^{z_{nk}}.   (10)

We used diffuse conditionally conjugate priors (see Appendix F for details). We make the variational assumption q(μ, π, Λ, z) = ∏_{k=1}^K q(μ_k) q(Λ_k) q(π_k) ∏_{n=1}^N q(z_n). We compare the accuracy and speed of our estimates to Gibbs sampling on the augmented model (Eq. (10)) using the function rnmixGibbs from the R package bayesm. We implemented LRVB in C++, making extensive use of RcppEigen [20]. We evaluate our results both on simulated data and on the MNIST data set [21].

Results. For simulations, we generated N = 10000 data points from K = 2 multivariate normal components in P = 2 dimensions. MFVB is expected to underestimate the marginal variance of μ, Λ, and log(π) when the components overlap, since that induces correlation in the posteriors due to the uncertain classification of points between the clusters. We check the covariances estimated with Eq. (7) against a Gibbs sampler, which we treat as the ground truth.³

We performed 198 simulations, each of which had at least 500 effective Gibbs samples in each variable, calculated with the R tool effectiveSize from the coda package [22]. The first three plots show the diagonal standard deviations, and the fourth plot shows the off-diagonal covariances. Note that the off-diagonal covariance plot excludes the MFVB estimates since most of the values are zero. Fig. 
(3) shows that the raw MFVB covariance estimates are often quite different from the Gibbs sampler results, while the LRVB estimates match the Gibbs sampler closely.

For a real-world example, we fit a K = 2 GMM to the N = 12665 instances of handwritten 0s and 1s in the MNIST data set. We used PCA to reduce the pixel intensities to P = 25 dimensions. Full details are provided in Appendix G. In this MNIST analysis, the Λ standard deviations were under-estimated by MFVB but correctly estimated by LRVB (Fig. (3)); the other parameter standard deviations were estimated correctly by both and are not shown.

Figure 3: Posterior mean and covariance estimates on GMM simulation and MNIST data.

3.4 Scaling experiments

We here explore the computational scaling of LRVB in more depth for the finite Gaussian mixture model (Section 3.3). In the terms of Section 2.3, α includes the sufficient statistics from μ, π, and Λ, and grows as O(KP²). The sufficient statistics for the variational posterior of μ contain the P-length vectors μ_k, for each k, and the (P + 1)P/2 second-order products in the covariance matrix μ_k μ_k^T. Similarly, for each k, the variational posterior of Λ involves the (P + 1)P/2 sufficient statistics in the symmetric matrix Λ_k as well as the term log |Λ_k|. The sufficient statistics for the posterior of π are the K terms log π_k.⁴ So, minimally, Eq. (7) will require the inverse of a matrix of size O(KP²). The sufficient statistics for z have dimension K × N. Though the number of parameters thus grows with the number of data points, H_z = 0 for the multivariate normal (see Appendix F), so we can apply Eq. (8) to replace the inverse of an O(KN)-sized matrix with multiplication by the same matrix. Since a matrix inverse is cubic in the size of the matrix, the worst-case scaling for LRVB is then O(K²) in K, O(P⁶) in P, and O(N) in N.

In our simulations (Fig. (4)) we can see that, in practice, LRVB scales linearly⁵ in N and approximately cubically in P across the dimensions considered.⁶ The P scaling is presumably better than the theoretical worst case of O(P⁶) due to extra efficiency in the numerical linear algebra. Note that the vertical axis of the leftmost plot is on the log scale. At all the values of N, K and P considered here, LRVB was at least as fast as Gibbs sampling and often orders of magnitude faster.

³The likelihood described in Section 3.3 is symmetric under relabeling. When the component locations and shapes have a real-life interpretation, the researcher is generally interested in the uncertainty of μ, Λ, and π for a particular labeling, not the marginal uncertainty over all possible re-labelings. This poses a problem for standard MCMC methods, and we restrict our simulations to regimes where label switching did not occur in our Gibbs sampler. The MFVB solution conveniently avoids this problem since the mean field assumption prevents it from representing more than one mode of the joint posterior.

⁴Since ∑_{k=1}^K π_k = 1, using K sufficient statistics involves one redundant parameter. However, this does not violate any of the necessary assumptions for Eq. (7), and it considerably simplifies the calculations. Note that though the perturbation argument of Section 2 requires the parameters of p(θ|x) to be in the interior of the feasible space, it does not require that the parameters of p(x|θ) be interior.

Figure 4: Scaling of LRVB and Gibbs on simulation data in both log and linear scales. 
Before taking logs, the line in the two left-hand (N) graphs is y ∝ x, and in the right-hand (P) graph it is y ∝ x³.

4 Conclusion

Accurate covariance estimates have been a longstanding shortcoming of the widely used mean-field variational Bayes (MFVB) methodology. We have demonstrated that in sparse models, our method, linear response variational Bayes (LRVB), can correct MFVB to deliver these covariance estimates in time that scales linearly with the number of data points. Furthermore, we provide an easy-to-use formula for applying LRVB to a wide range of inference problems. Our experiments on a diverse set of models have demonstrated the efficacy of LRVB, and our detailed study of the scaling of mixtures of multivariate Gaussians shows that LRVB can be considerably faster than traditional MCMC methods. We hope that in future work our results can be extended to more complex models, including Bayesian nonparametric models, where MFVB has already proven practically successful.

Acknowledgments. The authors thank Alex Blocker for helpful comments. R. Giordano and T. Broderick were funded by Berkeley Fellowships.

⁵The Gibbs sampling time was linearly rescaled to the amount of time necessary to achieve 1000 effective samples in the slowest-mixing component of any parameter. Interestingly, this rescaling leads to increasing efficiency in the Gibbs sampling at low P due to improved mixing, though the benefits cease to accrue at moderate dimensions.

⁶For numeric stability we started the optimization procedures for MFVB at the true values, so the time to compute the optimum in our simulations was very fast and not representative of practice. On real data, the optimization time will depend on the quality of the starting point. Consequently, the times shown for LRVB are only the times to compute the LRVB estimate. The optimization times were on the same order.

References
[1] D. M. Blei, A. Y. 
Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning\n\nResearch, 3:993\u20131022, 2003.\n\n[2] D. M. Blei and M. I. Jordan. Variational inference for Dirichlet process mixtures. Bayesian Analysis,\n\n1(1):121\u2013143, 2006.\n\n[3] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine\n\nLearning Research, 14(1):1303\u20131347, 2013.\n\n[4] D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press,\n\n2003. Chapter 33.\n\n[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006. Chapter 10.\n[6] R. E. Turner and M. Sahani. Two problems with variational expectation maximisation for time-series\n\nmodels. In D. Barber, A. T. Cemgil, and S. Chiappa, editors, Bayesian Time Series Models. 2011.\n\n[7] B. Wang and M. Titterington.\n\nInadequacy of interval estimates corresponding to variational Bayesian\n\napproximations. In Workshop on Arti\ufb01cial Intelligence and Statistics, pages 373\u2013380, 2004.\n\n[8] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using\nintegrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (statistical\nmethodology), 71(2):319\u2013392, 2009.\n\n[9] G. Parisi. Statistical Field Theory, volume 4. Addison-Wesley New York, 1988.\n[10] M. Opper and O. Winther. Variational linear response. In Advances in Neural Information Processing\n\nSystems, 2003.\n\n[11] M. Opper and D. Saad. Advanced mean \ufb01eld methods: Theory and practice. MIT press, 2001.\n[12] T. Tanaka. Information geometry of mean-\ufb01eld approximation. Neural Computation, 12(8):1951\u20131968,\n\n2000.\n\n[13] H. J. Kappen and F. B. Rodriguez. Ef\ufb01cient learning in Boltzmann machines using linear response theory.\n\nNeural Computation, 10(5):1137\u20131156, 1998.\n\n[14] M. Welling and Y. W. Teh. 
Linear response algorithms for approximate inference in graphical models. Neural Computation, 16(1):197–221, 2004.
[15] P. A. d. F. R. Højen-Sørensen, O. Winther, and L. K. Hansen. Mean-field approaches to independent component analysis. Neural Computation, 14(4):889–918, 2002.
[16] T. Tanaka. Mean-field theory of Boltzmann machine learning. Physical Review E, 58(2):2302, 1998.
[17] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1-2):1–305, 2008.
[18] J. D. Hadfield. MCMC methods for multi-response generalized linear mixed models: The MCMCglmm R package. Journal of Statistical Software, 33(2):1–22, 2010.
[19] M. Lubin and I. Dunning. Computing in operations research using Julia. INFORMS Journal on Computing, 27(2):238–248, 2015.
[20] D. Bates and D. Eddelbuettel. Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software, 52(5):1–24, 2013.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[22] M. Plummer, N. Best, K. Cowles, and K. Vines. CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1):7–11, 2006.
[23] X. L. Meng and D. B. Rubin. Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86(416):899–909, 1991.
[24] A. Wächter and L. T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. 
Mathematical Programming, 106(1):25–57, 2006.", "award": [], "sourceid": 875, "authors": [{"given_name": "Ryan", "family_name": "Giordano", "institution": "UC Berkeley"}, {"given_name": "Tamara", "family_name": "Broderick", "institution": "MIT"}, {"given_name": "Michael", "family_name": "Jordan", "institution": "UC Berkeley"}]}