{"title": "Perturbative Black Box Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 5079, "page_last": 5088, "abstract": "Black box variational inference (BBVI) with reparameterization gradients triggered the exploration of divergence measures other than the Kullback-Leibler (KL) divergence, such as alpha divergences. In this paper, we view BBVI with generalized divergences as a form of estimating the marginal likelihood via biased importance sampling. The choice of divergence determines a bias-variance trade-off between the tightness of a bound on the marginal likelihood (low bias) and the variance of its gradient estimators. Drawing on variational perturbation theory of statistical physics, we use these insights to construct a family of new variational bounds. Enumerated by an odd integer order $K$, this family captures the standard KL bound for $K=1$, and converges to the exact marginal likelihood as $K\\to\\infty$. Compared to alpha-divergences, our reparameterization gradients have a lower variance. We show in experiments on Gaussian Processes and Variational Autoencoders that the new bounds are more mass covering, and that the resulting posterior covariances are closer to the true posterior and lead to higher likelihoods on held-out data.", "full_text": "Perturbative Black Box Variational Inference\n\nRobert Bamler\u2217\nDisney Research\nPittsburgh, USA\n\nCheng Zhang\u2217\nDisney Research\nPittsburgh, USA\n\nManfred Opper\n\nTU Berlin\n\nBerlin, Germany\n\nStephan Mandt\u2217\nDisney Research\nPittsburgh, USA\n\n\ufb01rstname.lastname@{disneyresearch.com, tu-berlin.de}\n\nAbstract\n\nBlack box variational inference (BBVI) with reparameterization gradients triggered\nthe exploration of divergence measures other than the Kullback-Leibler (KL) diver-\ngence, such as alpha divergences. 
In this paper, we view BBVI with generalized divergences as a form of estimating the marginal likelihood via biased importance sampling. The choice of divergence determines a bias-variance trade-off between the tightness of a bound on the marginal likelihood (low bias) and the variance of its gradient estimators. Drawing on variational perturbation theory of statistical physics, we use these insights to construct a family of new variational bounds. Enumerated by an odd integer order K, this family captures the standard KL bound for K = 1, and converges to the exact marginal likelihood as K → ∞. Compared to alpha-divergences, our reparameterization gradients have a lower variance. We show in experiments on Gaussian Processes and Variational Autoencoders that the new bounds are more mass covering, and that the resulting posterior covariances are closer to the true posterior and lead to higher likelihoods on held-out data.

1 Introduction

Variational inference (VI) (Jordan et al., 1999) provides a way to convert Bayesian inference to optimization by minimizing a divergence measure. Recent advances of VI have been devoted to scalability (Hoffman et al., 2013; Ranganath et al., 2014), divergence measures (Minka, 2005; Li and Turner, 2016; Hernandez-Lobato et al., 2016), and structured variational distributions (Hoffman and Blei, 2015; Ranganath et al., 2016).

While traditional stochastic variational inference (SVI) (Hoffman et al., 2013) was limited to conditionally conjugate Bayesian models, black box variational inference (BBVI) (Ranganath et al., 2014) enables SVI on a large class of models. It expresses the gradient as an expectation, and estimates it by Monte-Carlo sampling. A variant of BBVI uses reparameterized gradients and has lower variance (Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014; Ruiz et al., 2016). 
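As a toy illustration of the reparameterization trick mentioned above (our own sketch, not code from the paper): for a Gaussian q = N(μ, σ²) one writes z = μ + σε with ε ~ N(0, 1), so a Monte-Carlo gradient can be taken through the samples themselves. The example and its target value 2μ are hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical example: reparameterization gradient of E_q[z^2] for
# q = N(mu, sigma^2). Writing z = mu + sigma*eps with eps ~ N(0, 1),
# the per-sample derivative d(z^2)/dmu = 2*z, and its Monte-Carlo
# average estimates the exact gradient d(mu^2 + sigma^2)/dmu = 2*mu.
rng = np.random.default_rng(0)
mu, sigma, n_samples = 1.5, 0.8, 200_000
eps = rng.standard_normal(n_samples)
z = mu + sigma * eps                  # reparameterized samples
grad_mu_estimate = np.mean(2.0 * z)   # averaged per-sample gradients
print(grad_mu_estimate)               # close to 2*mu = 3.0
```

In BBVI frameworks the per-sample derivative is obtained by automatic differentiation rather than by hand; the mechanism is the same.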
BBVI paved the way for approximate inference in complex and deep generative models (Kingma and Welling, 2014; Rezende et al., 2014; Ranganath et al., 2015; Bamler and Mandt, 2017).

Before the advent of BBVI, divergence measures other than the KL divergence had been of limited practical use due to their complexity in both mathematical derivation and computation (Minka, 2005), but have since then been revisited. Alpha-divergences (Hernandez-Lobato et al., 2016; Dieng et al., 2017; Li and Turner, 2016) achieve a better matching of the variational distribution to different regions of the posterior and may be tuned to either fit its dominant mode or to cover its entire support. The problem with reparameterizing the gradient of the alpha-divergence is, however, that the resulting gradient estimates have large variances. It is therefore desirable to find other divergence measures with low-variance reparameterization gradients.

*Equal contributions. First authorship determined by coin flip among first two authors.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In this paper, we use concepts from perturbation theory of statistical physics to propose a new family of variational bounds on the marginal likelihood with low-variance reparameterization gradients. The lower bounds are enumerated by an order K, which takes odd integer values, and are given by

L^(K)(λ, V_0) = e^{−V_0} Σ_{k=0}^{K} (1/k!) E_{z∼q}[(log p(x, z) − log q(z; λ) + V_0)^k].   (1)

Here, p(x, z) denotes the joint probability density function of the model with observations x and latent variables z, q is the variational distribution, which depends on variational parameters λ, and V_0 ∈ R is a reference point for the perturbative expansion, see below. In this paper, we motivate and discuss Eq. 
1 (Section 3), and we analyze the properties of the proposed bound experimentally (Section 4). Our contributions are as follows.

• We establish a view on black box variational inference with generalized divergences as a form of biased importance sampling (Section 3.1). The choice of divergence allows us to trade off between a low-variance stochastic gradient with a loose bound, and a tight variational bound with higher-variance Monte-Carlo gradients. As we explain below, importance sampling and point estimation are at opposite ends of this spectrum.

• We combine these insights with ideas from perturbation theory of statistical physics to motivate the objective function in Eq. 1 (Section 3.2). We show that, for all odd K, L^(K)(λ, V_0) is a nontrivial lower bound on the marginal likelihood p(x). Thus, we propose the perturbative black box variational inference algorithm (PBBVI), which maximizes L^(K)(λ, V_0) over λ and V_0 with stochastic gradient descent (SGD). For K = 1, our algorithm is equivalent to standard BBVI with the KL divergence (KLVI). On the variance-bias spectrum, KLVI sits at the end of large bias and low gradient variance. Increasing K to larger odd integers allows us to gradually trade in some increase in the gradient variance for some reduction of the bias.

• We evaluate our PBBVI algorithm experimentally for the lowest nonstandard order K = 3 (Section 4). Compared to KLVI (K = 1), our algorithm fits variational distributions that cover more of the mass of the true posterior. Compared to alpha-VI, our experiments confirm that PBBVI uses gradient estimates with lower variance, and converges faster.

2 Related work

Our approach is related to BBVI, VI with generalized divergences, and variational perturbation theory. We thus briefly discuss related work in these three directions.

Black box variational inference (BBVI). 
BBVI has already been addressed in the introduction (Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014; Ranganath et al., 2014; Ruiz et al., 2016); it enables variational inference for many models. Our work builds upon BBVI in that BBVI makes a large class of new divergence measures between the posterior and the approximating distribution tractable. Depending on the divergence measure, however, BBVI may suffer from high-variance stochastic gradients. This is a practical problem that we aim to mitigate in this paper.

Generalized divergence measures. Our work connects to generalized information-theoretic divergences (Amari, 2012). Minka (2005) introduced a broad class of divergences for variational inference, including alpha-divergences. Most of these divergences had been intractable in large-scale applications until the advent of BBVI. In this context, alpha-divergences were first suggested by Hernandez-Lobato et al. (2016) for local divergence minimization, and later for global minimization by Li and Turner (2016) and Dieng et al. (2017). As we show in this paper, alpha-divergences have the disadvantage of inducing high-variance gradients, since the ratio between the posterior and the variational distribution enters the bound polynomially instead of logarithmically. In contrast, our approach leads to a more stable inference scheme in high dimensions.

Variational perturbation theory. Perturbation theory refers to methods that aim to truncate a typically divergent power series to a convergent series. In machine learning, these approaches have been addressed from an information-theoretic perspective by Tanaka (1999, 2000). Thouless-Anderson-Palmer (TAP) equations (Thouless et al., 1977) are a form of second-order perturbation theory. They were originally developed in statistical physics to include perturbative corrections to the mean-field solution of Ising models. 
They have been adopted into Bayesian inference in (Plefka, 1982) and were advanced by many authors (Kappen and Wiegerinck, 2001; Paquet et al., 2009; Opper et al., 2013; Opper, 2015). In variational inference, perturbation theory yields extra terms to the mean-field variational objective which are difficult to calculate analytically. This may be a reason why the methods discussed are not widely adopted by practitioners. In this paper, we emphasize the ease of including perturbative corrections in a black box variational inference framework. Furthermore, in contrast to earlier formulations, our approach yields a strict lower bound to the marginal likelihood which can be conveniently optimized. Our approach is different from the traditional variational perturbation formulation (Kleinert, 2009), which generally does not result in a bound.

Figure 1: Different choices for f in Eq. 4. KLVI corresponds to f(x) = log(x) + const. (red), and importance sampling to f(x) = x (black). Our proposed PBBVI bound uses f^(K)_{V_0} (green, Eq. 7), which lies between KLVI and importance sampling (we set K = 3 and V_0 = 0 for PBBVI here).

Figure 2: Behavior of different VI methods on fitting a univariate Gaussian to a bimodal target distribution (black). PBBVI (proposed, green) covers more of the mass of the entire distribution than the traditional KLVI (red). Alpha-VI is mode seeking for large α and mass covering for smaller α.

Figure 3: Sampling variance of the stochastic gradient (averaged over its components) in the optimum, for alpha-divergences (orange, purple, gray), and the proposed PBBVI (green). The variance grows exponentially with the latent dimension N for alpha-VI, and only algebraically for PBBVI.

3 Method

In this section, we present our main contributions. 
We first present our view of black box variational inference (BBVI) as a form of biased importance sampling in Section 3.1. With this view, we bridge the gap between variational inference and importance sampling. In Section 3.2, we introduce our family of new variational bounds, and analyze their properties further in Section 3.3.

3.1 Black Box Variational Inference as Biased Importance Sampling

Consider a probabilistic model with data x, latent variables z, and joint distribution p(x, z). We are interested in the posterior distribution over the latent variables, p(z|x) = p(x, z)/p(x). This involves the intractable marginal likelihood p(x). In variational inference (Jordan et al., 1999), we instead minimize a divergence measure between a variational distribution q(z; λ) and the posterior. Here, λ are the parameters of the variational distribution, and we aim to find the parameters λ* that minimize the distance to the posterior. This is equivalent to maximizing a lower bound on the marginal likelihood.

We call the difference between the log variational distribution and the log joint distribution the interaction energy,

V(z; λ) = log q(z; λ) − log p(x, z).   (2)

We use V or V(z) interchangeably to denote V(z; λ), and q(z) to denote q(z; λ), when more convenient. Using this notation, the marginal likelihood is

p(x) = E_{q(z)}[e^{−V(z)}].   (3)

We call e^{−V(z)} = p(x, z)/q(z) the importance ratio, since sampling from q(z) to estimate the right-hand side of Eq. 3 is equivalent to importance sampling. As importance sampling is inefficient in high dimensions, we resort to variational inference. To this end, let f(·) be any concave function defined on the positive reals. We assume furthermore that for all x > 0, we have f(x) ≤ x. 
Applying Jensen's inequality, we can lower-bound the marginal likelihood,

p(x) ≥ f(p(x)) ≥ E_{q(z)}[f(e^{−V(z; λ)})] ≡ L_f(λ).   (4)

Figure 1 shows exemplary choices of f. We maximize L_f(λ) using reparameterization gradients, where the bound is not computed analytically, but rather its gradients are estimated by sampling from q(z) (Kingma and Welling, 2014). This leads to a stochastic gradient descent scheme, where the noise is a result of the Monte-Carlo estimation of the gradients.

Our approach builds on the insight that black box variational inference is a type of biased importance sampling, where we estimate a lower bound of the marginal likelihood by sampling from a proposal distribution, iteratively improving this distribution. The approach is biased since we do not estimate the exact marginal likelihood but only a lower bound to this quantity. As we argue below, the introduced bias allows us to estimate the bound more easily, because we decrease the variance of this estimator. The choice of the function f thereby trades off between bias and variance in the following way:

• For f = id being the identity, we obtain importance sampling (see the black line in Figure 1). In this case, Eq. 4 does not depend on the variational parameters, so there is nothing to optimize and we can directly sample from any proposal distribution q. 
Since the expectation under q of the importance ratio e^{−V(z)} gives the exact marginal likelihood, there is no bias. If the model has a large number of latent variables, the importance ratio e^{−V(z)} becomes tightly peaked around the minimum of the interaction energy V, resulting in a very high variance of this estimator. Importance sampling is therefore at one extreme end of the bias-variance spectrum.

• For f = log, we obtain the familiar Kullback-Leibler (KL) bound (see the red line in Figure 1; here we add a constant of 1 for comparison, which does not affect the optimization). Since f(e^{−V(z)}) = −V(z), the bound is

L_KL(λ) = E_{q(z)}[−V(z)] = E_{q(z)}[log p(x, z) − log q(z)].   (5)

The Monte-Carlo estimate of E_q[−V] has a much smaller variance than that of E_q[e^{−V}], implying efficient learning (Bottou, 2010). However, by replacing e^{−V} with −V we introduce a bias. We can further trade off less variance for even more bias by dropping the entropy term on the right-hand side of Eq. 5. A flexible enough variational distribution will shrink to zero variance, which completely eliminates the sampling noise. This is equivalent to point estimation, and is at the opposite end of the bias-variance spectrum.

• Now, consider any f which is between the logarithm and the identity, e.g., the green line in Figure 1 (this is the regularizing function we propose in Section 3.2 below). The more similar f is to the identity, the less biased is our estimate of the marginal likelihood, but the larger the variance. Conversely, the more f behaves like the logarithm, the easier it is to estimate f(e^{−V(z)}) by sampling, while at the same time the bias grows.

One example of alternative divergences to the KL divergence that have been discussed in the literature are alpha-divergences (Minka, 2005; Hernandez-Lobato et al., 2016; Li and Turner, 2016; Dieng et al., 2017). 
Up to a constant, they correspond to the following choice of f:

f^(α)(e^{−V}) ∝ e^{−(1−α)V}.   (6)

The real parameter α determines the distance to the importance sampling case (α = 0). As α approaches 1 from below, this bound leads to a better-behaved estimation problem of the Monte-Carlo gradient. However, unless taking the limit α → 1 (where the objective becomes the KL bound), V still enters exponentially in the bound. As we show, this leads to a high variance of the gradient estimator in high dimensions (see Figure 3, discussed below). The alpha-divergence bound is therefore similarly as hard to estimate as the marginal likelihood in importance sampling.

Our analysis relies on the observation that expectations of exponentials in V are difficult to estimate, whereas expectations of polynomials in V are easy to estimate. We derive a family of new variational bounds which are polynomials in V, where increasing the order of the polynomial reduces the bias.

3.2 Perturbative Black Box Variational Inference

Perturbative bounds. We now motivate the family of lower bounds proposed in Eq. 1 in the introduction based on the considerations outlined above. For fixed odd integer K and fixed real value V_0, the bound L^(K)(λ, V_0) is of the form of Eq. 
4 with the following regularizing function f:

f^(K)_{V_0}(x) = e^{−V_0} Σ_{k=0}^{K} (V_0 + log x)^k / k!   ⟹   f^(K)_{V_0}(e^{−V}) = e^{−V_0} Σ_{k=0}^{K} (V_0 − V)^k / k!.   (7)

Algorithm 1: Perturbative Black Box Variational Inference (PBBVI)
Input: joint probability p(x, z); order of perturbation K (odd integer); learning rate schedule ρ_t; number of Monte Carlo samples S; number of training iterations T; variational family q(z; λ) that allows for reparameterization gradients, i.e., z ∼ q(·; λ) ⟺ z = g(ε, λ) where ε ∼ p_n with a fixed noise distribution p_n and a differentiable reparameterization function g.
Output: fitted variational parameters λ*.
1  initialize λ randomly and V_0 ← 0;
2  for t ← 1 to T do
3      draw S samples ε_1, ..., ε_S ∼ p_n from the noise distribution;
       // obtain reparameterization gradient estimates using automatic differentiation:
4      g_λ ← ∇_λ [ (1/S) Σ_{s=1}^{S} Σ_{k=0}^{K} (1/k!) (log p(x, g(ε_s, λ)) − log q(g(ε_s, λ); λ) + V_0)^k ];
5      g_{V_0} ← ∇_{V_0} [ (1/S) Σ_{s=1}^{S} Σ_{k=0}^{K} (1/k!) (log p(x, g(ε_s, λ)) − log q(g(ε_s, λ); λ) + V_0)^k ];
       // perform variable updates (see second to last paragraph of Section 3.2):
6      λ ← λ + ρ_t g_λ;
7      V_0 ← V_0 + ρ_t [ g_{V_0} − (1/S) Σ_{s=1}^{S} Σ_{k=0}^{K} (1/k!) (log p(x, g(ε_s, λ)) − log q(g(ε_s, λ); λ) + V_0)^k ];
   end

Here, the second (equivalent) formulation in Eq. 7 makes it explicit that f^(K)_{V_0} is the Kth order Taylor expansion of its argument e^{−V} in V around some reference energy V_0. Figure 1 shows f^(K)_{V_0}(x) for K = 1 (red) and K = 3 (green). The curves are concave and lie below the identity, touching it at x = e^{−V_0}. We show in Section 3.3 that these properties extend to every odd K and every V_0 ∈ R. Therefore, L^(K)(λ, V_0) is indeed a lower bound on the marginal likelihood.

The rationale for the design of the regularizing function in Eq. 7 is as follows. On the one hand, the gradients of the resulting bound should be easy to estimate via the reparameterization approach. We achieve low-variance gradient estimates by making f^(K)_{V_0}(e^{−V}) a polynomial in V; i.e., in contrast to the alpha-bound, V never appears in the exponent.

On the other hand, the regularizing function should be close to the identity function so that the resulting bound has low bias. For K = 1, we have L^(1)(λ, V_0) = e^{−V_0} E_q[1 + log p − log q + V_0]. Maximizing L^(1) over λ is independent of the value of V_0 and equivalent to maximizing the standard KL bound L_KL, see Eq. 5, which has low gradient variance and large bias. Increasing the order K to larger odd integers makes the Taylor expansion tighter, leading to a bound with lower bias. In fact, in the limit K → ∞, the right-hand side of Eq. 7 is the series representation of the exponential function, and thus f^(K)_{V_0} converges pointwise to the identity. In practice, we propose to set K to a small odd integer larger than 1. Increasing K further reduces the bias, but it comes at the cost of increasing the gradient variance because the random variable V appears in higher orders under the expectation in Eq. 4.

As discussed in Section 3.1, the KL bound L_KL can be derived from a regularizing function f = log that does not depend on any further parameters like V_0. The derivation of the KL bound therefore does not require the first inequality in Eq. 
4, and one directly obtains a bound on the model evidence log p(x) ≡ f(p(x)) from the second inequality alone. For K > 1, the bound L^(K)(λ, V_0) depends nontrivially on V_0, and we have to employ the first inequality in Eq. 4 in order to make the bounded quantity on the left-hand side independent of V_0. This expends some tightness of the bound but makes the method more flexible by allowing us to optimize over V_0 as well, as we describe next.

Optimization algorithm. We now propose the perturbative black box variational inference (PBBVI) algorithm. Since L^(K)(λ, V_0) is a lower bound on the marginal likelihood for all λ and all V_0, we can find the values λ* and V_0* for which the bound is tightest by maximizing simultaneously over λ and V_0. Algorithm 1 summarizes the PBBVI algorithm. We minimize −L^(K)(λ, V_0) using stochastic gradient descent (SGD) with reparameterization gradients and a learning rate ρ_t that decreases with the training iteration t according to Robbins-Monro bounds (Robbins and Monro, 1951). We obtain unbiased gradient estimators using standard techniques: we replace the expectation E_q[·] in Eq. 1 with the empirical average over a fixed number of S samples from q, and we calculate the reparameterization gradients with respect to λ and V_0 using automatic differentiation.

In practice, we typically discard the value of V_0* once the optimization is converged since we are only interested in the fitted variational parameters λ*. 
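As a minimal sketch (ours, not the authors' released code) of the Monte-Carlo estimate of the surrogate objective L̃^(K)(λ, V_0) that appears in Algorithm 1, assuming per-sample evaluations of log p and log q are already available:

```python
import math
import numpy as np

def pbbvi_surrogate(log_p, log_q, V0, K):
    # Monte-Carlo estimate of the surrogate objective
    #   (1/S) * sum_s sum_{k=0..K} (1/k!) * (log p_s - log q_s + V0)^k,
    # where log_p[s] and log_q[s] are evaluated at samples z_s ~ q.
    u = np.asarray(log_p) - np.asarray(log_q) + V0   # u_s = V0 - V(z_s)
    return float(np.mean(sum(u**k / math.factorial(k) for k in range(K + 1))))

# Sanity checks on made-up inputs:
rng = np.random.default_rng(1)
log_p = rng.normal(-2.0, 0.1, size=1000)
log_q = rng.normal(-1.5, 0.1, size=1000)
elbo_style = pbbvi_surrogate(log_p, log_q, V0=0.5, K=1)   # K = 1: 1 + mean(u)
taylor_53 = pbbvi_surrogate(log_p, log_q, V0=0.5, K=53)   # large K: ~ mean(e^u)
```

The actual bound is L^(K) = e^{−V_0} times this estimate; in a real implementation the estimate would be built from differentiable tensor operations so that g_λ and g_{V_0} come out of automatic differentiation, as in Algorithm 1.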
However, during the optimization process, V_0 is an important auxiliary quantity, and the inference algorithm would be inconsistent without an optimization over V_0: if we were to extend the model p(x, z) by an additional observed variable x̃ which is statistically independent of the latent variables z, then the joint distribution (as a function of z alone) would be rescaled by a constant positive prefactor, i.e., the log joint would shift by an additive constant. The posterior remains unchanged by the constant prefactor, and a consistent VI algorithm must therefore produce the same approximate posterior distribution q for both models. Optimizing over V_0 ensures this consistency since the log joint appears in the lower bound only in the combination log p(x, z) + V_0. Therefore, a rescaling of the joint by a constant positive prefactor can be completely absorbed by a change in the reference energy V_0.

We observed in our experiments that the reference energy V_0 can become very large (in absolute value) for models with many latent variables. To avoid numerical overflow or underflow from the prefactor e^{−V_0}, we consider the surrogate objective L̃^(K)(λ, V_0) ≡ e^{V_0} L^(K)(λ, V_0). The gradients with respect to λ of L^(K)(λ, V_0) and L̃^(K)(λ, V_0) are equal up to a positive prefactor, so we can replace the former with the latter in the update step (line 6 in Algorithm 1). The gradient with respect to V_0 is ∇_{V_0} L^(K)(λ, V_0) ∝ ∇_{V_0} L̃^(K)(λ, V_0) − L̃^(K)(λ, V_0) (line 7). Using the surrogate L̃^(K)(λ, V_0) avoids numerical underflow or overflow, as well as exponentially increasing or decreasing gradients.

Mass covering effect. In Figure 2, we fit a Gaussian distribution to a one-dimensional bimodal target distribution (black line), using different divergences. 
Compared to BBVI with the standard KL divergence (KLVI, red line), alpha-divergences are more mode-seeking (purple line) for large values of α, and more mass-covering (orange line) for small α (Li and Turner, 2016). Our PBBVI bound (K = 3, green line) achieves a similar mass-covering effect as alpha-divergences, but with associated low-variance reparameterization gradients. This is seen in Figure 3, discussed in Section 4.2, which compares the gradient variances of alpha-VI and PBBVI as a function of dimensions.

3.3 Proof of Correctness and Nontriviality of the Bound

To conclude the presentation of the PBBVI algorithm, we prove that the objective in Eq. 1 is indeed a lower bound on the marginal likelihood for all odd orders K, and that the bound is nontrivial.

Correctness. The lower bound L^(K)(λ, V_0) results from inserting the regularizing function f^(K)_{V_0} from Eq. 7 into Eq. 4. For odd K, it is indeed a valid lower bound because f^(K)_{V_0} is concave and lies below the identity. To see this, note that the second derivative ∂²f^(K)_{V_0}(x)/∂x² = −e^{−V_0} (V_0 + log x)^{K−1} / ((K − 1)! x²) is non-positive everywhere for odd K. Therefore, the function is concave. Next, consider the function g(x) = f^(K)_{V_0}(x) − x, which has a stationary point at x = x_0 ≡ e^{−V_0}. Since g is also concave, x_0 is a global maximum, and thus g(x) ≤ g(x_0) = 0 for all x, implying that f^(K)_{V_0}(x) ≤ x. Thus, for odd K, the function f^(K)_{V_0} satisfies all requirements for Eq. 4, and L^(K)(λ, V_0) ≡ E_q[f^(K)_{V_0}(e^{−V})] is a lower bound on the marginal likelihood. Note that an even order K does not lead to a valid concave regularizing function.

Nontriviality. Since the marginal likelihood p(x) is always positive, a lower bound would be trivial if it was negative. We show that once the optimization algorithm has converged, the bound at the optimum is always positive. At the optimum, all gradients vanish. 
By setting the derivative with respect to V_0 of the right-hand side of Eq. 1 to zero we find that E_{q*}[(V_0* − V)^K] = 0, where q* ≡ q(·; λ*) is the variational distribution at the optimum. Thus, the lower bound at the optimum is L(λ*, V_0*) = e^{−V_0*} E_{q*}[h(V)] with h(V) = Σ_{k=0}^{K−1} (1/k!) (V_0* − V)^k, where the sum runs only to K − 1 because the term with k = K vanishes at V_0 = V_0*. We show that h(V) is positive for all V. If K = 1, then h(V) = 1 is a positive constant. For K ≥ 3, h(V) is a polynomial in V of even order K − 1, whose highest order term has a positive coefficient 1/(K − 1)!. Therefore, as V → ±∞, the function h(V) goes to positive infinity and it thus has a global minimum at some value Ṽ ∈ R. At the global minimum, its derivative vanishes, 0 = ∇_Ṽ h(Ṽ) = −Σ_{k=0}^{K−2} (1/k!) (V_0* − Ṽ)^k. Thus, at the global minimum of the polynomial h, all terms except the highest order term cancel, and we find h(Ṽ) = (1/(K−1)!) (V_0* − Ṽ)^{K−1} ≥ 0, which is nonnegative because K − 1 is even. The case h(Ṽ) = 0 is achieved if and only if Ṽ = V_0*, but this would violate the condition ∇_Ṽ h(Ṽ) = 0. Therefore, h(Ṽ) is strictly positive, and since Ṽ is a global minimum of h, we have h(V) ≥ h(Ṽ) > 0 for all V ∈ R. Inserting into the expression for L(λ*, V_0*) concludes the proof that the lower bound at the optimum is positive.

(a) KLVI   (b) PBBVI with K = 3
Figure 4: Gaussian process regression on synthetic data (green dots). Three standard deviations are shown in varying shades of orange. The blue dashed lines show three standard deviations of the true posterior. The red dashed lines show the inferred three standard deviations using KLVI (a) and PBBVI (b). We see that the results from our proposed PBBVI are close to the analytic solution while traditional KLVI underestimates the variances.

Table 1: Average variances across training examples in the synthetic data experiment. The closer to the analytic solution, the better.
Method    | Avg variances
Analytic  | 0.0415
KLVI      | 0.0176
PBBVI     | 0.0355

Table 2: Error rate of GP classification on the test set. The lower the better. Our proposed PBBVI consistently obtains better classification results.
Data set | Crab  | Pima  | Heart  | Sonar
KLVI     | 0.22  | 0.245 | 0.148  | 0.212
PBBVI    | 0.11  | 0.240 | 0.1333 | 0.1731

4 Experiments

We evaluate PBBVI with different models. First we investigate its behavior in a controlled setup of Gaussian processes on synthetic data (Section 4.1). We then evaluate PBBVI based on a classification task using Gaussian process classifiers, where we use data from the UCI machine learning repository (Section 4.2). This is a Bayesian non-conjugate setup where black box inference is required. Finally, we use an experiment with the variational autoencoder (VAE) to explore our approach on a deep generative model (Section 4.3). This experiment is carried out on MNIST data. We use the perturbative order K = 3 for all experiments with PBBVI. This corresponds to the lowest order beyond standard KLVI, since KLVI is equivalent to PBBVI with K = 1, and K has to be an odd integer. Across all the experiments, PBBVI demonstrates advantages based on different metrics.

4.1 GP Regression on Synthetic Data

In this section, we inspect the inference behavior using a synthetic data set with Gaussian processes (GP). 
We generate the data according to a Gaussian noise distribution centered around a mixture of sinusoids, and sample 50 data points (green dots in Figure 4). We then use a GP to model the data, thus assuming the generative process f ∼ GP(0, Λ) and y_i ∼ N(f_i, ε).

We first compute an analytic solution of the posterior of the GP (three standard deviations shown in blue dashed lines) and compare it to approximate posteriors obtained by KLVI (Figure 4 (a)) and the proposed PBBVI (Figure 4 (b)). The results from PBBVI are almost identical to the analytic solution. In contrast, KLVI underestimates the posterior variance. This is consistent with Table 1, which shows the average diagonal variances. PBBVI results are much closer to the exact posterior variances.

4.2 Gaussian Process Classification

We evaluate the performance of PBBVI and KLVI on a GP classification task. Since the model is non-conjugate, no analytical baseline is available in this case. We model the data with the following generative process:

f ∼ GP(0, Λ),    z_i = σ(f_i),    y_i ∼ Bern(z_i).

Figure 5: Test log-likelihood (normalized by the number of test points) as a function of training iterations using GP classification on the Sonar data set. PBBVI converges faster than alpha-VI even though we tuned the number of Monte Carlo samples per training step (100) and the constant learning rate (10^{−5}) so as to maximize the performance of alpha-VI on a validation set.

Figure 6: Predictive likelihood of a VAE trained on different sizes of the data. The training data are randomly sampled subsets of the MNIST training set. The higher value the better. Our proposed PBBVI method outperforms KLVI mainly when the size of the training data set is small. The fewer the training data, the more advantage PBBVI obtains. 
The fewer training data there are, the larger the advantage of PBBVI.

Above, Λ is the GP kernel, σ denotes the sigmoid function, and Bern denotes the Bernoulli distribution. We furthermore use the Matérn-3/2 kernel,

    Λij = s² (1 + √3 rij / l) exp(−√3 rij / l),    rij = √((xi − xj)ᵀ(xi − xj)).

Data. We use four data sets from the UCI machine learning repository that are suitable for binary classification: Crab (200 data points), Pima (768 data points), Heart (270 data points), and Sonar (208 data points). We randomly split each data set into two halves; one half is used for training and the other for testing. We set the hyperparameters s = 1 and l = √D/2 throughout all experiments, where D is the dimensionality of the input x.

Table 2 shows the classification performance (error rate) for these data sets. Our proposed PBBVI consistently performs better than traditional KLVI.

Convergence speed comparison. We also compare PBBVI and alpha-divergence VI in terms of speed of convergence. Our results indicate that the smaller variance of the reparameterization gradient leads to faster convergence of the optimization algorithm.

We train the GP classifier from Section 4.2 on the Sonar UCI data set using a constant learning rate. Figure 5 shows the test log-likelihood under the posterior mean as a function of training iterations. We split the data set into equally sized training, validation, and test sets. We then tune the learning rate and the number of Monte Carlo samples per gradient step to obtain optimal performance on the validation set after minimizing the alpha-divergence with a fixed budget of random samples. We use α = 0.5 here; smaller values of α lead to even slower convergence. We optimize the PBBVI lower bound using the same learning rate and number of Monte Carlo samples.
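The Matérn-3/2 kernel defined above is straightforward to vectorize. A minimal NumPy sketch (function and variable names are ours), using the hyperparameters s = 1 and l = √D/2 from the text:

```python
import numpy as np

def matern32(X1, X2, s=1.0, l=1.0):
    """Matern-3/2 kernel from the text:
    K_ij = s^2 * (1 + sqrt(3)*r_ij/l) * exp(-sqrt(3)*r_ij/l),
    with r_ij the Euclidean distance between x_i and x_j."""
    r = np.sqrt(((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1))
    a = np.sqrt(3.0) * r / l
    return s**2 * (1.0 + a) * np.exp(-a)

# Hyperparameters as in the experiments: s = 1, l = sqrt(D)/2
D = 4
X = np.random.default_rng(0).standard_normal((10, D))
K = matern32(X, X, s=1.0, l=np.sqrt(D) / 2)
```

The resulting Gram matrix is symmetric with unit diagonal (since r_ii = 0 and s = 1) and positive semi-definite, as required for a GP covariance.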
The \ufb01nal test error rate is\n22% on an approximately balanced data set. PBBVI converges an order of magnitude faster.\nFigure 3 in Section 3 provides more insight in the scaling of the gradient variance. Here, we \ufb01t GP\nregression models on synthetically generated data by maximizing the PBBVI lower bound and the\nalpha-VI lower bound with \u03b1 \u2208 {0.2, 0.5, 2}. We generate a separate synthetic data set for each\nN \u2208 {1, . . . , 200} by drawing N random data points around a sinusoidal curve. For each N, we \ufb01t\na one-dimensional GP regression with PBBVI and alpha-VI, respectively, using the same data set\nfor both methods. The variational distribution is a fully factorized Gaussian with N latent variables.\nAfter convergence, we estimate the sampling variance of the gradient of each lower bound with respect\nto the posterior mean. We calculate the empirical variance of the gradient based on 105 samples from\nq, and we average over the N coordinates. Figure 3 shows the average sampling variance as a function\nof N on a logarithmic scale. The variance of the gradient of the alpha-VI bound grows exponentially\nin the number of latent variables. By contrast, we \ufb01nd only algebraic growth for PBBVI.\n4.3 Variational Autoencoder\nWe experiment on Variational Autoencoders (VAEs), and we compare the PBBVI and the KLVI\nbound in terms of predictive likelihoods on held-out data (Kingma and Welling, 2014). Autoencoders\ncompress unlabeled training data into low-dimensional representations by \ufb01tting it to an encoder-\ndecoder model that maps the data to itself. 
These models are prone to learning the identity function when the hyperparameters are not carefully tuned, or when the network is too expressive, especially for a moderately sized training set. VAEs are designed to partially avoid this problem by estimating the uncertainty associated with each data point in the latent space. It is therefore important that the inference method does not underestimate posterior variances. We show that, for small data sets, training a VAE by maximizing the PBBVI lower bound leads to higher predictive likelihoods than maximizing the KLVI lower bound.

We train the VAE on the MNIST data set of handwritten digits (LeCun et al., 1998). We build on the publicly available implementation by Burda et al. (2016) and use the same architecture and hyperparameters, with L = 2 stochastic layers and S = 5 samples from the variational distribution per gradient step. The model has 100 latent units in the first stochastic layer and 50 latent units in the second stochastic layer.

The VAE model factorizes over all data points. We train it by stochastically maximizing the sum of the PBBVI lower bounds for all data points using a minibatch size of 20. The VAE amortizes the gradient signal across data points by training inference networks. The inference networks express the mean and variance of the variational distribution as a function of the data point. We add an additional inference network that learns the mapping from a data point to the reference energy V0. Here, we use a network with four fully connected hidden layers of 200, 200, 100, and 50 units, respectively.

MNIST contains 60,000 training images.
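The additional inference network for V0 can be sketched as a plain fully connected network. Below is a minimal NumPy stand-in with the layer widths from the text; the tanh activation, the initialization, and the function names are our assumptions (the actual implementation builds on Burda et al. (2016)).

```python
import numpy as np

def make_v0_network(sizes=(784, 200, 200, 100, 50, 1), seed=0):
    """Fully connected network mapping a data point (here: a flattened
    28x28 MNIST image) to a scalar reference energy V0. Hidden widths
    200, 200, 100, 50 follow the text; tanh is an assumed activation."""
    rng = np.random.default_rng(seed)
    params = [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(x):
        h = x
        for W, b in params[:-1]:
            h = np.tanh(h @ W + b)
        W, b = params[-1]
        return (h @ W + b)[..., 0]  # one scalar V0 per data point

    return forward

v0_net = make_v0_network()
V0 = v0_net(np.zeros((20, 784)))  # a minibatch of 20, as in the text
```

In practice this network would be trained jointly with the encoder and decoder by backpropagating through the PBBVI lower bound.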
To test our approach on smaller-scale data, where Bayesian uncertainty matters more, we evaluate the test likelihood after training the model on randomly sampled fractions of the training set. We use the same training schedules as in the publicly available implementation, keeping the total number of training iterations independent of the size of the training set. Differently from the original implementation, we shuffle the training set before each training epoch, as this turns out to increase the performance of both our method and the baseline.

Figure 6 shows the predictive log-likelihood on the whole test set, where the VAE is trained on random subsets of the training set of different sizes. For each training set size, we use the same subset to train with PBBVI and KLVI. PBBVI leads to a higher predictive likelihood than traditional KLVI on subsets of the data. We explain this finding with our observation that the variational distributions obtained from PBBVI capture more of the posterior variance. As the size of the training set grows, and the posterior uncertainty decreases, the performance of KLVI catches up with PBBVI.

As a potential explanation for why PBBVI converges to the KLVI result for large training sets, we note that Eq∗[(V0∗ − V)³] = 0 at the optimal variational distribution q∗ and reference energy V0∗ (see Section 3.3). If V becomes a symmetric random variable (such as a Gaussian) in the limit of a large training set, this implies that Eq∗[V] = V0∗, and PBBVI reduces to KLVI for large training sets.

5 Conclusion

We first presented a view of black box variational inference as a form of biased importance sampling, where one can trade off bias against variance through the choice of divergence. Bias refers to the deviation of the bound from the true marginal likelihood, and variance refers to that of its reparameterization gradient estimator.
We then proposed a family of new variational bounds that connect to variational perturbation theory and include corrections to the standard Kullback-Leibler bound. Our proposed PBBVI bound converges to the true marginal likelihood for large order K of the perturbative expansion, and we showed both theoretically and experimentally that it has lower-variance reparameterization gradients than alpha-VI. To scale our method up to massive data sets, future work will explore stochastic versions of PBBVI; since the PBBVI bound contains interaction terms between all data points, breaking it up into mini-batches is not straightforward. Moreover, while our experiments used a fixed perturbative order of K = 3, it could be beneficial to increase the perturbative order during training once an empirical estimate of the gradient variance drops below a certain threshold. Furthermore, the PBBVI and alpha-divergence bounds can also be combined, such that PBBVI further approximates alpha-VI. This could lead to promising results on large data sets, where traditional alpha-VI is hard to optimize due to its variance and traditional PBBVI converges to KLVI. As a final remark, a tighter variational bound is not guaranteed to always result in a better posterior approximation, since the variational family limits the quality of the solution. However, in the context of variational EM, where one performs gradient-based hyperparameter optimization on the log marginal likelihood, our bound gives more reliable results, since higher orders of K can be assumed to approximate the marginal likelihood better.

References

Amari, S. (2012). Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media.
Bamler, R. and Mandt, S. (2017). Dynamic word embeddings. In ICML.
Bottou, L. (2010).
Large-scale machine learning with stochastic gradient descent. In COMPSTAT. Springer.
Burda, Y., Grosse, R., and Salakhutdinov, R. (2016). Importance weighted autoencoders. In ICLR.
Dieng, A. B., Tran, D., Ranganath, R., Paisley, J., and Blei, D. M. (2017). Variational inference via χ upper bound minimization. In ICML.
Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., and Turner, R. (2016). Black-box alpha divergence minimization. In ICML.
Hoffman, M. and Blei, D. (2015). Stochastic structured variational inference. In AISTATS.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. (2013). Stochastic variational inference. JMLR, 14(1).
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2).
Kappen, H. J. and Wiegerinck, W. (2001). Second order approximations for probability models. In NIPS. MIT Press.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.
Kleinert, H. (2009). Path Integrals in Quantum Mechanics, Statistics, Polymer Physics, and Financial Markets. World Scientific.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11).
Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. In NIPS.
Minka, T. (2005). Divergence measures and message passing. Technical report, Microsoft Research.
Opper, M. (2015). Expectation propagation. In Krzakala, F., Ricci-Tersenghi, F., Zdeborova, L., Zecchina, R., Tramel, E. W., and Cugliandolo, L. F., editors, Statistical Physics, Optimization, Inference, and Message-Passing Algorithms, chapter 9, pages 263–292. Oxford University Press.
Opper, M., Paquet, U., and Winther, O. (2013). Perturbative corrections for approximate inference in Gaussian latent variable models.
JMLR, 14(1).
Paquet, U., Winther, O., and Opper, M. (2009). Perturbation corrections in approximate inference: Mixture modelling applications. JMLR, 10(Jun).
Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. Journal of Physics A: Mathematical and General, 15(6):1971.
Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In AISTATS.
Ranganath, R., Tang, L., Charlin, L., and Blei, D. (2015). Deep exponential families. In AISTATS.
Ranganath, R., Tran, D., and Blei, D. (2016). Hierarchical variational models. In ICML.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics.
Ruiz, F., Titsias, M., and Blei, D. (2016). The generalized reparameterization gradient. In NIPS.
Salimans, T. and Knowles, D. A. (2013). Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4).
Tanaka, T. (1999). A theory of mean field approximation. In NIPS.
Tanaka, T. (2000). Information geometry of mean-field approximation. Neural Computation, 12(8).
Thouless, D., Anderson, P. W., and Palmer, R. G. (1977). Solution of 'solvable model of a spin glass'. Philosophical Magazine, 35(3).