{"title": "R\u00e9nyi Divergence Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1073, "page_last": 1081, "abstract": "This paper introduces the variational R\u00e9nyi bound (VR) that extends traditional variational inference to R\u00e9nyi's alpha-divergences. This new family of variational methods unifies a number of existing approaches, and enables a smooth interpolation from the evidence lower-bound to the log (marginal) likelihood that is controlled by the value of alpha that parametrises the divergence. The reparameterization trick, Monte Carlo approximation and stochastic optimisation methods are deployed to obtain a tractable and unified framework for optimisation. We further consider negative alpha values and propose a novel variational inference method as a new special case in the proposed framework. Experiments on Bayesian neural networks and variational auto-encoders demonstrate the wide applicability of the VR bound.", "full_text": "R\u00e9nyi Divergence Variational Inference\n\nYingzhen Li\n\nUniversity of Cambridge\nCambridge, CB2 1PZ, UK\n\nyl494@cam.ac.uk\n\nRichard E. Turner\n\nUniversity of Cambridge\nCambridge, CB2 1PZ, UK\n\nret26@cam.ac.uk\n\nAbstract\n\nThis paper introduces the variational R\u00e9nyi bound (VR) that extends traditional vari-\national inference to R\u00e9nyi\u2019s \u03b1-divergences. This new family of variational methods\nuni\ufb01es a number of existing approaches, and enables a smooth interpolation from\nthe evidence lower-bound to the log (marginal) likelihood that is controlled by the\nvalue of \u03b1 that parametrises the divergence. The reparameterization trick, Monte\nCarlo approximation and stochastic optimisation methods are deployed to obtain a\ntractable and uni\ufb01ed framework for optimisation. We further consider negative \u03b1\nvalues and propose a novel variational inference method as a new special case in\nthe proposed framework. 
Experiments on Bayesian neural networks and variational\nauto-encoders demonstrate the wide applicability of the VR bound.\n\n1\n\nIntroduction\n\nApproximate inference, that is approximating posterior distributions and likelihood functions, is at the\ncore of modern probabilistic machine learning. This paper focuses on optimisation-based approximate\ninference algorithms, popular examples of which include variational inference (VI), variational Bayes\n(VB) [1, 2] and expectation propagation (EP) [3, 4]. Historically, VI has received more attention\ncompared to other approaches, although EP can be interpreted as iteratively minimising a set of local\ndivergences [5]. This is mainly because VI has elegant and useful theoretical properties such as the\nfact that it proposes a lower-bound of the log-model evidence. Such a lower-bound can serve as\na surrogate to both maximum likelihood estimation (MLE) of the hyper-parameters and posterior\napproximation by Kullback-Leibler (KL) divergence minimisation.\nRecent advances of approximate inference follow three major trends. First, scalable methods,\ne.g. stochastic variational inference (SVI) [6] and stochastic expectation propagation (SEP) [7, 8],\nhave been developed for datasets comprising millions of datapoints. Recent approaches [9, 10, 11]\nhave also applied variational methods to coordinate parallel updates arising from computations\nperformed on chunks of data. Second, Monte Carlo methods and black-box inference techniques have\nbeen deployed to assist variational methods, e.g. see [12, 13, 14, 15] for VI and [16] for EP. They all\nproposed ascending the Monte Carlo approximated variational bounds to the log-likelihood using\nnoisy gradients computed with automatic differentiation tools. Third, tighter variational lower-bounds\nhave been proposed for (approximate) MLE. 
The importance weighted auto-encoder (IWAE) [17] improved upon the variational auto-encoder (VAE) [18, 19] framework, by providing tighter lower-bound approximations to the log-likelihood using importance sampling. These recent developments are rather separated and little work has been done to understand their connections.\nIn this paper we try to provide a unified framework from an energy function perspective that encompasses a number of recent advances in variational methods, and we hope our effort could potentially motivate new algorithms in the future. This is done by extending traditional VI to R\u00e9nyi\u2019s \u03b1-divergence [20], a rich family that includes many well-known divergences as special cases. After reviewing useful properties of R\u00e9nyi divergences and the VI framework, we make the following contributions:\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTable 1: Special cases in the R\u00e9nyi divergence family.\n\u03b1 | Definition | Notes\n\u03b1 \u2192 1 | \u222b p(\u03b8) log (p(\u03b8)/q(\u03b8)) d\u03b8 | Kullback-Leibler (KL) divergence, used in VI (KL[q||p]) and EP (KL[p||q])\n\u03b1 = 0.5 | \u22122 log(1 \u2212 Hel\u00b2[p||q]) | function of the square Hellinger distance\n\u03b1 \u2192 0 | \u2212log \u222b_{p(\u03b8)>0} q(\u03b8) d\u03b8 | zero when supp(q) \u2286 supp(p) (not a divergence)\n\u03b1 = 2 | \u2212log(1 \u2212 \u03c7\u00b2[p||q]) | proportional to the \u03c7\u00b2-divergence\n\u03b1 \u2192 +\u221e | log max_{\u03b8\u2208\u0398} p(\u03b8)/q(\u03b8) | worst-case regret in the minimum description length principle [24]\n\n\u2022 We introduce the variational R\u00e9nyi bound (VR) as an extension of VI/VB. 
We then discuss connections to existing approaches, including VI/VB, VAE, IWAE [17], SEP [7] and black-box alpha (BB-\u03b1) [16], thereby showing the richness of this new family of variational methods.\n\u2022 We develop an optimisation framework for the VR bound. An analysis of the bias introduced by stochastic approximation is also provided with theoretical guarantees and empirical results.\n\u2022 We propose a novel approximate inference algorithm called VR-max as a new special case. Evaluations on VAEs and Bayesian neural networks show that this new method is often comparable to, or even better than, a number of the state-of-the-art variational methods.\n\n2 Background\n\nThis section reviews R\u00e9nyi\u2019s \u03b1-divergence and variational inference upon which the new framework is based. Note that there exist other \u03b1-divergence definitions [21, 22] (see appendix). However we mainly focus on R\u00e9nyi\u2019s definition as it enables us to derive a new class of variational lower-bounds.\n\n2.1 R\u00e9nyi\u2019s \u03b1-divergence\n\nWe first review R\u00e9nyi\u2019s \u03b1-divergence [20, 23]. R\u00e9nyi\u2019s \u03b1-divergence, defined on {\u03b1 : \u03b1 > 0, \u03b1 \u2260 1, |D\u03b1| < +\u221e}, measures the \u201ccloseness\u201d of two distributions p and q on a random variable \u03b8 \u2208 \u0398:\n\nD\u03b1[p||q] = 1/(\u03b1 \u2212 1) log \u222b p(\u03b8)^\u03b1 q(\u03b8)^{1\u2212\u03b1} d\u03b8.   (1)\n\nThe definition is extended to \u03b1 = 0, 1, +\u221e by continuity. We note that when \u03b1 \u2192 1 the Kullback-Leibler (KL) divergence is recovered, which plays a crucial role in machine learning and information theory. Some other special cases are presented in Table 1. The method proposed in this work also considers \u03b1 \u2264 0 (although (1) is no longer a divergence for these \u03b1 values), and we include from [23] some useful properties for forthcoming derivations.\nProposition 1. 
(Monotonicity) R\u00e9nyi\u2019s \u03b1-divergence definition (1), extended to negative \u03b1, is continuous and non-decreasing on \u03b1 \u2208 {\u03b1 : \u2212\u221e < D\u03b1 < +\u221e}.\nProposition 2. (Skew symmetry) For \u03b1 \u2209 {0, 1}, D\u03b1[p||q] = (\u03b1/(1 \u2212 \u03b1)) D_{1\u2212\u03b1}[q||p]. This implies D\u03b1[p||q] \u2264 0 for \u03b1 < 0. For the limiting case, D\u2212\u221e[p||q] = \u2212D+\u221e[q||p].\nA critical question that is still under active research is how to choose a divergence in this rich family to obtain the optimal solution for a particular application, an issue which is discussed in the appendix.\n\n2.2 Variational inference\n\nNext we review the variational inference algorithm [1, 2] using posterior approximation as a running example. Consider observing a dataset of N i.i.d. samples D = {x_n}_{n=1}^N from a probabilistic model p(x|\u03b8) parametrised by a random variable \u03b8 that is drawn from a prior p0(\u03b8). Bayesian inference involves computing the posterior distribution of the parameters given the data,\n\np(\u03b8|D, \u03d5) = p(\u03b8, D|\u03d5) / p(D|\u03d5) = p0(\u03b8|\u03d5) \u220f_{n=1}^N p(x_n|\u03b8, \u03d5) / p(D|\u03d5),   (2)\n\nwhere p(D|\u03d5) = \u222b p0(\u03b8|\u03d5) \u220f_{n=1}^N p(x_n|\u03b8, \u03d5) d\u03b8 is called the marginal likelihood or model evidence. The hyper-parameters of the model are denoted as \u03d5, which might be omitted henceforth for notational ease. For many powerful models the exact posterior is typically intractable, and approximate inference introduces an approximation q(\u03b8), in some tractable distribution family Q, to the exact posterior.\n\nFigure 1: Mean-field approximation for Bayesian linear regression. (a) Approximated posterior. (b) Hyper-parameter optimisation. In this case \u03d5 = \u03c3, the observation noise variance. The bound is tight as \u03c3 \u2192 +\u221e, biasing the VI solution to large \u03c3 values.\n\n
One way to obtain this approximation is to minimise the KL divergence KL[q(\u03b8)||p(\u03b8|D)], which is also intractable due to the difficult term p(D). Variational inference (VI) sidesteps this difficulty by considering an equivalent optimisation problem that maximises the variational lower-bound:\n\nLVI(q; D, \u03d5) = log p(D|\u03d5) \u2212 KL[q(\u03b8)||p(\u03b8|D, \u03d5)] = E_q[ log (p(\u03b8, D|\u03d5) / q(\u03b8)) ].   (3)\n\nThe variational lower-bound can also be used to optimise the hyper-parameters \u03d5.\nTo illustrate the approximation quality of VI we present a mean-field approximation example to Bayesian linear regression in Figure 1(a) (in magenta). Readers are referred to the appendix for details, but essentially a factorised Gaussian approximation is fitted to the true posterior, a correlated Gaussian in this case. The approximation recovers the posterior mean correctly, but is over-confident. Moreover, as LVI is the difference between the marginal likelihood and the KL divergence, hyper-parameter optimisation can be biased away from the exact MLE towards the region of parameter space where the KL term is small [25] (see Figure 1(b)).\n\n3 Variational R\u00e9nyi bound\n\nRecall from Section 2.1 that the family of R\u00e9nyi divergences includes the KL divergence. Perhaps variational free-energy approaches can be generalised to the R\u00e9nyi case? Consider approximating the exact posterior p(\u03b8|D) by minimizing R\u00e9nyi\u2019s \u03b1-divergence D\u03b1[q(\u03b8)||p(\u03b8|D)] for some selected \u03b1 > 0. 
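Definition (1) can be estimated by simple Monte Carlo whenever p and q can be evaluated pointwise, since D\u03b1[p||q] = 1/(\u03b1 \u2212 1) log E_q[(p/q)^\u03b1]. The sketch below is our own illustration (not code from the paper): it uses a 2-D Gaussian pair with equal identity covariances, for which the closed form reduces to D\u03b1 = (\u03b1/2)||\u00b5_p \u2212 \u00b5_q||\u00b2, and numerically reflects the monotonicity in \u03b1 stated in Proposition 1.

```python
import numpy as np

def log_gauss(x, mu):
    # log density of N(mu, I) in 2-D
    d = x - mu
    return -0.5 * np.sum(d * d, axis=-1) - np.log(2 * np.pi)

def renyi_divergence_mc(alpha, mu_p, mu_q, n_samples=200_000, seed=0):
    """MC estimate of D_alpha[p||q] = 1/(alpha-1) log E_q[(p/q)^alpha], alpha != 1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, 2)) + mu_q          # samples from q = N(mu_q, I)
    log_ratio = log_gauss(x, mu_p) - log_gauss(x, mu_q)
    a = alpha * log_ratio
    m = a.max()                                         # log-mean-exp for stability
    log_mean = m + np.log(np.mean(np.exp(a - m)))
    return log_mean / (alpha - 1.0)

mu_p, mu_q = np.zeros(2), np.ones(2)
# With equal identity covariances, D_alpha = alpha/2 * ||mu_p - mu_q||^2 = alpha here.
for alpha in [0.5, 2.0]:
    print(alpha, renyi_divergence_mc(alpha, mu_p, mu_q))
```

The estimates should sit near 0.5 and 2.0 respectively, and increase with \u03b1 as Proposition 1 predicts.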
Now we consider the equivalent optimization problem max_{q\u2208Q} log p(D) \u2212 D\u03b1[q(\u03b8)||p(\u03b8|D)], whose objective, when \u03b1 \u2260 1, can be rewritten as\n\nL\u03b1(q; D) := 1/(1 \u2212 \u03b1) log E_q[ (p(\u03b8, D) / q(\u03b8))^{1\u2212\u03b1} ].   (4)\n\nWe name this new objective the variational R\u00e9nyi (VR) bound. Importantly the above definition can be extended to \u03b1 \u2264 0, and the following theorem is a direct result of Proposition 1.\nTheorem 1. The objective L\u03b1(q; D) is continuous and non-increasing on \u03b1 \u2208 {\u03b1 : |L\u03b1| < +\u221e}. Especially for all 0 < \u03b1+ < 1 and \u03b1\u2212 < 0,\n\nLVI(q; D) = lim_{\u03b1\u21921} L\u03b1(q; D) \u2264 L\u03b1+(q; D) \u2264 L0(q; D) \u2264 L\u03b1\u2212(q; D).\n\nAlso L0(q; D) = log p(D) if and only if the support supp(p(\u03b8|D)) \u2286 supp(q(\u03b8)).\nTheorem 1 indicates that the VR bound can be useful for model selection by sandwiching the marginal likelihood with bounds computed using positive and negative \u03b1 values, which we leave to future work. In particular L0 = log p(D) under the mild assumption that q is supported where the exact posterior is supported. This assumption holds for many commonly used distributions, e.g. Gaussians are supported on the entire space, and in the following we assume that this condition is satisfied.\nChoosing different alpha values allows the approximation to balance between zero-forcing (\u03b1 \u2192 +\u221e; when using uni-modal approximations it is usually called mode-seeking) and mass-covering (\u03b1 \u2192 \u2212\u221e) behaviour. This is illustrated by the Bayesian linear regression example, again in Figure 1(a). 
First notice that \u03b1 \u2192 +\u221e (in cyan) returns non-zero uncertainty estimates (although it is more over-confident than VI), which is different from the maximum a posteriori (MAP) method that only returns a point estimate. Second, setting \u03b1 = 0.0 (in green) returns q(\u03b8) = \u220f_i p(\u03b8_i|D) and the exact marginal likelihood log p(D) (Figure 1(b)). Also the approximate MLE is less biased for \u03b1 = 0.5 (in blue) since now the tightness of the bound is less hyper-parameter dependent.\n\n4 The VR bound optimisation framework\n\nThis section addresses several issues of the VR bound optimisation by proposing further approximations. First, when \u03b1 \u2260 1, the VR bound is usually just as intractable as the marginal likelihood for many useful models. However Monte Carlo (MC) approximation is applied here to extend the set of models that can be handled. The resulting method can be applied to any model that MC-VI [12, 13, 14, 15] is applied to. Second, Theorem 1 suggests that the VR bound is to be minimised when \u03b1 < 0, which performs disastrously in the MLE context. As we shall see, this issue is also solved by the MC approximation under certain conditions. Third, a mini-batch training method is developed for large-scale datasets in the posterior approximation context. Hence the proposed optimisation framework of the VR bound enables tractable application to the same class of models as SVI.\n\n4.1 Monte Carlo approximation of the VR bound\n\nConsider learning a latent variable model with MLE as a running example, where the model is specified by a conditional distribution p(x|h, \u03d5) and a prior p(h|\u03d5) on the latent variables h. Examples include models treated by the variational auto-encoder (VAE) approach [18, 19] that parametrises the likelihood with a (deep) neural network. 
MLE requires log p(x), which is obtained by marginalising out h and is often intractable, so the VR bound is considered as an alternative optimisation objective. However, instead of using exact bounds, a simple Monte Carlo (MC) method is deployed, which uses finite samples h_k \u223c q(h|x), k = 1, ..., K to approximate L\u03b1 \u2248 \u02c6L\u03b1,K:\n\n\u02c6L\u03b1,K(q; x) = 1/(1 \u2212 \u03b1) log [ (1/K) \u2211_{k=1}^K (p(h_k, x) / q(h_k|x))^{1\u2212\u03b1} ].   (5)\n\nThe importance weighted auto-encoder (IWAE) [17] is a special case of this framework with \u03b1 = 0 and K < +\u221e. But unlike traditional VI, here the MC approximation is biased. Fortunately we can characterise the bias by the following theorems proved in the appendix.\nTheorem 2. Assume E_{{h_k}}[|\u02c6L\u03b1,K(q; x)|] < +\u221e and |L\u03b1| < +\u221e. Then E_{{h_k}}[\u02c6L\u03b1,K(q; x)], as a function of \u03b1 \u2208 R and K \u2265 1, is:\n1) non-decreasing in K for fixed \u03b1 \u2264 1, and non-increasing in K for fixed \u03b1 \u2265 1;\n2) E_{{h_k}}[\u02c6L\u03b1,K(q; x)] \u2192 L\u03b1 as K \u2192 +\u221e;\n3) continuous and non-increasing in \u03b1 with fixed K.\nCorollary 1. For finite K, either E_{{h_k}}[\u02c6L\u03b1,K(q; x)] < log p(x) for all \u03b1, or there exists \u03b1_K \u2264 0 such that E_{{h_k}}[\u02c6L_{\u03b1_K,K}(q; x)] = log p(x) and E_{{h_k}}[\u02c6L\u03b1,K(q; x)] > log p(x) for all \u03b1 < \u03b1_K. Also \u03b1_K is non-decreasing in K if it exists, with lim_{K\u21921} \u03b1_K = \u2212\u221e and lim_{K\u2192+\u221e} \u03b1_K = 0.\nThe intuition behind the theorems is visualised in Figure 2(a). By definition, the exact VR bound is a lower-bound or upper-bound of log p(x) when \u03b1 > 0 or \u03b1 < 0, respectively. However the MC approximation E[\u02c6L\u03b1,K] biases the estimate towards LVI, where the approximation quality can be improved using more samples. Thus for finite samples and under mild conditions, negative alpha values can potentially be used to improve the accuracy of the approximation, at the cost of losing the upper-bound guarantee. 
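The estimator (5) is a log-sum-exp over log importance weights. The sketch below (our own illustration, not the paper's code) implements \u02c6L\u03b1,K and checks part 1) of Theorem 2 at \u03b1 = 0 on a toy conjugate model p(h) = N(0, 1), p(x|h) = N(h, 1), so that log p(x) = log N(x; 0, 2) is known, with a deliberately mismatched q(h|x) = N(0, 1): the K = 1 estimate (the standard ELBO) sits below the K = 10 estimate, which in turn sits below log p(x).

```python
import numpy as np

def vr_bound_hat(log_w, alpha):
    """MC estimate of the VR bound, eq. (5); log_w has shape (reps, K)."""
    K = log_w.shape[-1]
    if alpha == 1.0:                      # VI limit: plain average of log-weights
        return np.mean(log_w, axis=-1)
    a = (1.0 - alpha) * log_w
    m = a.max(axis=-1, keepdims=True)     # log-sum-exp for numerical stability
    lse = m.squeeze(-1) + np.log(np.exp(a - m).sum(axis=-1))
    return (lse - np.log(K)) / (1.0 - alpha)

# Toy model: h ~ N(0,1), x|h ~ N(h,1)  =>  log p(x) = log N(x; 0, 2).
x = 1.0
logpx = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

rng = np.random.default_rng(1)
reps, K = 20_000, 10
h = rng.normal(size=(reps, K))            # q(h|x) = N(0,1), mismatched on purpose
log_joint = (-0.5 * np.log(2 * np.pi) - 0.5 * h**2           # log p(h)
             - 0.5 * np.log(2 * np.pi) - 0.5 * (x - h)**2)   # log p(x|h)
log_q = -0.5 * np.log(2 * np.pi) - 0.5 * h**2
log_w = log_joint - log_q

elbo_K1 = np.mean(vr_bound_hat(log_w[:, :1], alpha=0.0))   # K = 1: plain ELBO
iwae_K10 = np.mean(vr_bound_hat(log_w, alpha=0.0))         # K = 10, alpha = 0 (IWAE)
print(elbo_K1, iwae_K10, logpx)   # non-decreasing in K, and below log p(x)
```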
Figure 2(b) shows an empirical evaluation by computing the exact and the MC approximation of the R\u00e9nyi divergences. In this example p, q are 2-D Gaussian distributions with \u00b5_p = [0, 0], \u00b5_q = [1, 1] and \u03a3_p = \u03a3_q = I. The sampling procedure is repeated 200 times to estimate the expectation. Clearly for K = 1 it is equivalent to an unbiased estimate of the KL-divergence for all \u03b1 (even though now the estimation is biased for D\u03b1). For K > 1 and \u03b1 < 1, the MC method under-estimates the VR bound, and the bias decreases with increasing K. For \u03b1 > 1 the inequality is reversed, also as predicted.\n\nFigure 2: (a) An illustration for the bounding properties of MC approximations to the VR bounds. (b) The bias of the MC approximation (simulated MC approximations). Best viewed in colour and see the main text for details.\n\n4.2 Unified implementation with the reparameterization trick\n\nReaders may have noticed that LVI has a different form compared to L\u03b1 with \u03b1 \u2260 1. In this section we show how to unify the implementation for all finite \u03b1 settings using the reparameterization trick [13, 18] as an example. This trick assumes the existence of the mapping \u03b8 = g_\u03c6(\u03b5), where the distribution of the noise term \u03b5 satisfies q(\u03b8)d\u03b8 = p(\u03b5)d\u03b5. Then the expectation of a function F(\u03b8) over distribution q(\u03b8) can be computed as E_{q(\u03b8)}[F(\u03b8)] = E_{p(\u03b5)}[F(g_\u03c6(\u03b5))]. One prevalent example is the Gaussian reparameterization: \u03b8 \u223c N(\u00b5, \u03a3) \u21d2 \u03b8 = \u00b5 + \u03a3^{1/2}\u03b5, \u03b5 \u223c N(0, I). 
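As a sanity check of the identity E_q[F(\u03b8)] = E_{p(\u03b5)}[F(g_\u03c6(\u03b5))], the sketch below (our own toy choice of F, not from the paper) differentiates a reparameterized Monte Carlo estimate by hand: for q = N(\u00b5, \u03c3\u00b2) and F(\u03b8) = \u03b8\u00b2, E_q[F] = \u00b5\u00b2 + \u03c3\u00b2, so the exact gradient w.r.t. \u00b5 is 2\u00b5.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.8
eps = rng.normal(size=100_000)       # eps ~ N(0, 1)
theta = mu + sigma * eps             # theta = g_phi(eps): Gaussian reparameterization

# F(theta) = theta^2, dF/dmu = 2*theta: pathwise (reparameterized) gradient estimate.
grad_mc = np.mean(2.0 * theta)
print(grad_mc)                       # close to the exact gradient 2*mu = 3.0
```

The same pattern, with automatic differentiation in place of the hand-written dF/dmu, is what makes the unified gradient (7) implementable for any finite \u03b1.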
Now we apply the reparameterization trick to the VR bound:\n\nL\u03b1(q_\u03c6; x) = 1/(1 \u2212 \u03b1) log E_\u03b5[ (p(g_\u03c6(\u03b5), x) / q(g_\u03c6(\u03b5)))^{1\u2212\u03b1} ].   (6)\n\nThen the gradient of the VR bound w.r.t. \u03c6 is (similarly for \u03d5; see appendix for the derivation)\n\n\u2207_\u03c6 L\u03b1(q_\u03c6; x) = E_\u03b5[ w\u03b1(\u03b5; \u03c6, x) \u2207_\u03c6 log (p(g_\u03c6(\u03b5), x) / q(g_\u03c6(\u03b5))) ],   (7)\n\nwhere w\u03b1(\u03b5; \u03c6, x) = (p(g_\u03c6(\u03b5), x) / q(g_\u03c6(\u03b5)))^{1\u2212\u03b1} / E_\u03b5[ (p(g_\u03c6(\u03b5), x) / q(g_\u03c6(\u03b5)))^{1\u2212\u03b1} ] denotes the normalised importance weight. One can show that this recovers the stochastic gradients of LVI by setting \u03b1 = 1 in (7), since then w_1(\u03b5; \u03c6, x) = 1, which means the resulting algorithm unifies the computation for all finite \u03b1 settings. For MC approximations, we use K samples to approximately compute the weight \u02c6w\u03b1,k(\u03b5_k; \u03c6, x) \u221d (p(g_\u03c6(\u03b5_k), x) / q(g_\u03c6(\u03b5_k)))^{1\u2212\u03b1}, k = 1, ..., K, and the stochastic gradient becomes\n\n\u2207_\u03c6 \u02c6L\u03b1,K(q_\u03c6; x) = \u2211_{k=1}^K \u02c6w\u03b1,k(\u03b5_k; \u03c6, x) \u2207_\u03c6 log (p(g_\u03c6(\u03b5_k), x) / q(g_\u03c6(\u03b5_k))).   (8)\n\nWhen \u03b1 = 1, \u02c6w_{1,k}(\u03b5_k; \u03c6, x) = 1/K, and it recovers the stochastic gradient VI method [18].\nTo speed up learning, [17] suggested back-propagating only one sample \u03b5_j with j \u223c p_j = \u02c6w\u03b1,j, which can be easily extended to our framework. 
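This single-sample selection (summarised as steps 2-3 of Algorithm 1 below) amounts to a categorical draw over tempered importance weights, with VR-max as the \u03b1 \u2192 \u2212\u221e limit. A minimal sketch (our own, with the unnormalised log-weights passed in as a precomputed array):

```python
import numpy as np

def select_sample(log_w, alpha, rng):
    """Choose the index j to back-propagate (Algorithm 1, step 3).

    log_w: unnormalised log importance weights for the K samples.
    alpha = -inf gives VR-max (argmax); finite alpha samples j with
    probability proportional to w_k^(1 - alpha).
    """
    if np.isinf(alpha) and alpha < 0:
        return int(np.argmax(log_w))
    a = (1.0 - alpha) * log_w
    p = np.exp(a - a.max())              # stable softmax over tempered weights
    p /= p.sum()
    return int(rng.choice(len(log_w), p=p))

rng = np.random.default_rng(0)
log_w = np.array([-3.0, -1.0, -2.5, -0.2])
print(select_sample(log_w, alpha=-np.inf, rng=rng))  # index of the largest weight: 3
print(select_sample(log_w, alpha=1.0, rng=rng))      # alpha = 1: uniform (VI case)
```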
Importantly, the use of different \u03b1 < 1 indicates the degree of emphasis placed upon locations where the approximation q under-estimates p; in the extreme case \u03b1 \u2192 \u2212\u221e, the algorithm chooses the sample that has the maximum unnormalised importance weight. We name this approach VR-max and summarise it and the general case in Algorithm 1. Note that VR-max (and VR-\u03b1 with \u03b1 < 0 and MC approximations) does not minimise D_{1\u2212\u03b1}[p||q]. It is true that L\u03b1 \u2265 log p(x) for negative \u03b1 values. However Corollary 1 suggests that the tightest MC approximation for a given K has a non-positive \u03b1_K value, or might not even exist. Furthermore \u03b1_K becomes more negative as the mismatch between q and p increases, e.g. VAE uses a uni-modal q distribution to approximate the typically multi-modal exact posterior.\n\nAlgorithm 1 One gradient step for VR-\u03b1/VR-max with a single backward pass. Here \u02c6w(\u03b5_k; x) is shorthand for \u02c6w_{0,k}(\u03b5_k; \u03c6, x) in the main text.\n1: given the current datapoint x, sample \u03b5_1, ..., \u03b5_K \u223c p(\u03b5)\n2: for k = 1, ..., K, compute the unnormalised weight log \u02c6w(\u03b5_k; x) = log p(g_\u03c6(\u03b5_k), x) \u2212 log q(g_\u03c6(\u03b5_k)|x)\n3: choose the sample \u03b5_j to back-propagate:\n   if |\u03b1| < \u221e: j \u223c p_k, where p_k \u221d \u02c6w(\u03b5_k; x)^{1\u2212\u03b1}\n   if \u03b1 = \u2212\u221e: j = arg max_k log \u02c6w(\u03b5_k; x)\n4: return the gradients \u2207_\u03c6 log \u02c6w(\u03b5_j; x)\n\nFigure 3: Connecting local and global divergence minimisation.\n\n4.3 Stochastic approximation for large-scale learning\n\nVR bounds can also be applied to full Bayesian inference with posterior approximation. However for large datasets full batch learning is very inefficient. 
Mini-batch training is non-trivial here since the VR bound cannot be represented by the expectation of a datapoint-wise loss, except when \u03b1 = 1. This section introduces two proposals for mini-batch training, and interestingly, this recovers two existing algorithms that were motivated from a different perspective. In the following we define the \u201caverage likelihood\u201d \u00aff_D(\u03b8) = [\u220f_{n=1}^N p(x_n|\u03b8)]^{1/N}. Hence the joint distribution can be rewritten as p(\u03b8, D) = p0(\u03b8) \u00aff_D(\u03b8)^N. Also for a mini-batch of M datapoints S = {x_{n_1}, ..., x_{n_M}} \u223c D, we define the \u201csubset average likelihood\u201d \u00aff_S(\u03b8) = [\u220f_{m=1}^M p(x_{n_m}|\u03b8)]^{1/M}.\nThe first proposal considers fixed point approximations with mini-batch sub-sampling. It first derives the fixed point conditions for the variational parameters (e.g. the natural parameters of q) using the exact VR bound (4), then designs an iterative algorithm using those fixed point equations, but with \u00aff_D(\u03b8) replaced by \u00aff_S(\u03b8). The second proposal also applies this subset average likelihood approximation idea, but directly to the VR bound (4) (so this approach is named energy approximation):\n\n\u02dcL\u03b1(q; S) = 1/(1 \u2212 \u03b1) log E_q[ (p0(\u03b8) \u00aff_S(\u03b8)^N / q(\u03b8))^{1\u2212\u03b1} ].   (9)\n\nIn the appendix we demonstrate with detailed derivations that the fixed point approximation returns Stochastic EP (SEP) [7], and black-box alpha (BB-\u03b1) [16] corresponds to the energy approximation. 
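The only change eq. (9) makes to the full-batch bound (4) is inside the log-joint: log p0(\u03b8) + N log \u00aff_S(\u03b8) = log p0(\u03b8) + (N/M) \u2211_m log p(x_{n_m}|\u03b8). A sketch of this rescaling (our own illustration, with placeholder Gaussian densities):

```python
import numpy as np

def minibatch_log_joint(theta, x_batch, N, log_prior, log_lik):
    """log [ p0(theta) * f_S(theta)^N ]: the mini-batch surrogate for the
    full-data log joint used inside the VR bound (9)."""
    M = len(x_batch)
    return log_prior(theta) + (N / M) * sum(log_lik(x, theta) for x in x_batch)

# Toy check: with the full dataset as the "mini-batch" (M = N), the
# surrogate coincides exactly with the full-data log joint.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.5, size=8)
log_prior = lambda th: -0.5 * th**2 - 0.5 * np.log(2 * np.pi)      # p0 = N(0, 1)
log_lik = lambda x, th: -0.5 * (x - th)**2 - 0.5 * np.log(2 * np.pi)

full = minibatch_log_joint(0.3, data, N=len(data), log_prior=log_prior, log_lik=log_lik)
sub = minibatch_log_joint(0.3, data[:4], N=len(data), log_prior=log_prior, log_lik=log_lik)
print(full, sub)   # the half-batch surrogate approximates the full log joint
```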
Both algorithms were originally proposed to approximate (power) EP [3, 26], which usually minimises \u03b1-divergences locally, and considers M = 1, \u03b1 \u2208 [1 \u2212 1/N, 1) and exponential family distributions. These approximations were done by factor tying, which significantly reduces the memory overhead of full EP and makes both SEP and BB-\u03b1 scalable to large datasets just as SVI. The new derivation provides a theoretical justification from the energy perspective, and also sheds light on the connections between local and global divergence minimisations as depicted in Figure 3. Note that all these methods recover SVI when \u03b1 \u2192 1, in which case global and local divergence minimisation are equivalent. Also these results suggest that recent attempts at distributed posterior approximation (by carving up the dataset into pieces with M > 1 [10, 11]) can be extended to both SEP and BB-\u03b1.\nMonte Carlo methods can also be applied to both proposals. For SEP the moment computation can be approximated with MCMC [10, 11]. For BB-\u03b1 one can show, in the same way as in the proof of Theorem 2, that simple MC approximation in expectation lower-bounds the BB-\u03b1 energy when \u03b1 \u2264 1. In general it is also an open question how to choose \u03b1 for a given mini-batch size M and number of samples K, but there is evidence that intermediate \u03b1 values can be superior [27, 28].\n\n5 Experiments\n\nWe evaluate the VR bound methods on Bayesian neural networks and variational auto-encoders. All the experiments used the ADAM optimizer [29], and the detailed experimental set-up (batch size, learning rate, etc.) can be found in the appendix. The implementation of all the experiments in Python is released at https://github.com/YingzhenLi/VRbound.\n\nFigure 4: Test LL and RMSE results for Bayesian neural network regression. 
The lower the better.\n\n5.1 Bayesian neural network\n\nThe first experiment considers Bayesian neural network regression. The datasets are collected from the UCI dataset repository.1 The model is a single-layer neural network with 50 hidden units (ReLUs) for all datasets except Protein and Year (100 units). We use a Gaussian prior \u03b8 \u223c N(\u03b8; 0, I) for the network weights and a Gaussian approximation to the true posterior q(\u03b8) = N(\u03b8; \u00b5_q, diag(\u03c3_q)). We follow the toy example in Section 3 to consider \u03b1 \u2208 {\u2212\u221e, 0.0, 0.5, 1.0, +\u221e} in order to examine the effect of mass-covering/zero-forcing behaviour. Stochastic optimisation uses the energy approximation proposed in Section 4.3. MC approximation is also deployed to compute the energy function, in which K = 100, 10 is used for small and large datasets (Protein and Year), respectively.\nWe summarise the test negative log-likelihood (LL) and RMSE with standard error (across different random splits except for Year) for selected datasets in Figure 4, where the full results are provided in the appendix. These results indicate that for posterior approximation problems, the optimal \u03b1 may vary for different datasets. Also the MC approximation complicates the selection of \u03b1 (see appendix). Future work should develop algorithms to automatically select the best \u03b1 values, although a naive approach could use validation sets. We observed two major trends: zero-forcing/mode-seeking methods tend to focus on improving the predictive error, while mass-covering methods return better-calibrated uncertainty estimates and better test log-likelihoods. In particular VI returns lower test log-likelihood for most of the datasets. 
Furthermore, \u03b1 = 0.5 produced overall good results for both test LL and RMSE, possibly because the skew symmetry is centred at \u03b1 = 0.5 and the corresponding divergence is the only symmetric distance measure in the family.\n\n5.2 Variational auto-encoder\n\nThe second experiment considers variational auto-encoders for unsupervised learning. We mainly compare three approaches: VAE (\u03b1 = 1.0), IWAE (\u03b1 = 0), and VR-max (\u03b1 = \u2212\u221e), which are implemented upon the publicly available code.2 Four datasets are considered: Frey Face (with 10-fold cross validation), Caltech 101 Silhouettes, MNIST and OMNIGLOT. The VAE model has L = 1, 2 stochastic layers with deterministic layers stacked between, and the network architecture is detailed in the appendix. We reproduce the IWAE experiments to obtain a fair comparison, since the results in the original publication [17] mismatch those evaluated on the publicly available code.\nWe report test log-likelihood results in Table 2 by computing log p(x) \u2248 \u02c6L_{0,5000}(q; x) following [17]. We also present some samples from the trained models in the appendix. Overall VR-max is almost indistinguishable from IWAE. Other positive alpha settings (e.g. \u03b1 = 0.5) return worse results, e.g. 1374.64 \u00b1 5.62 for Frey Face and \u221285.50 for MNIST with \u03b1 = 0.5, L = 1 and K = 5. These worse results for \u03b1 > 0 indicate a preference for tighter approximations to the likelihood function in MLE problems. Small negative \u03b1 values (e.g. \u03b1 = \u22121.0, \u22122.0) return better results on different splits of the Frey Face data, and overall the best \u03b1 value is dataset-specific.\n\n1http://archive.ics.uci.edu/ml/datasets.html\n2https://github.com/yburda/iwae\n\nTable 2: Average test log-likelihood. Results for VAE on MNIST and OMNIGLOT are collected from [17].\nDataset | L | K | VR-max | VAE | IWAE\nFrey Face (\u00b1 std. err.) | 1 | 5 | 1377.40 \u00b14.59 | 1322.96 \u00b110.03 | 1380.30 \u00b14.60\nCaltech 101 Silhouettes | 1 | 5 | -118.01 | -119.69 | -117.89\nCaltech 101 Silhouettes | 1 | 50 | -117.10 | -119.61 | -117.21\nMNIST | 1 | 5 | -85.42 | -86.47 | -85.41\nMNIST | 1 | 50 | -84.81 | -86.35 | -84.80\nMNIST | 2 | 5 | -84.04 | -85.01 | -83.92\nMNIST | 2 | 50 | -83.44 | -84.78 | -83.05\nOMNIGLOT | 1 | 5 | -106.33 | -107.62 | -106.30\nOMNIGLOT | 1 | 50 | -105.05 | -107.80 | -104.68\nOMNIGLOT | 2 | 5 | -104.71 | -106.31 | -104.64\nOMNIGLOT | 2 | 50 | -103.72 | -106.30 | -103.25\n\nFigure 5: Bias of the sampling approximation. Results for K = 5, 50 samples are shown on the left and right, respectively.\n\nFigure 6: Importance weights during training; see main text for details. (a) Log of ratio R = wmax/(1 \u2212 wmax). (b) Weights of samples. Best viewed in colour.\n\nVR-max\u2019s success might be explained by the tightness of the bound. To evaluate this, we compute the VR bounds on 100 test datapoints using the 1-layer VAE trained on Frey Face, with K \u2208 {5, 50} and \u03b1 \u2208 {0, \u22121, \u22125, \u221250, \u2212500}. Figure 5 presents the estimated gap \u02c6L\u03b1,K \u2212 \u02c6L_{0,5000}. The results indicate that \u02c6L\u03b1,K provides a lower-bound, and that the gap is narrowed as \u03b1 \u2192 \u2212\u221e. Also increasing K provides improvements. The standard error of estimation is almost constant for different \u03b1 (with K fixed), and is negligible when compared to the MC approximation bias.\nAnother explanation for VR-max\u2019s success is that the sample with the largest normalised importance weight wmax dominates the contributions of all the gradients. This is confirmed by tracking R = wmax/(1 \u2212 wmax) during training on Frey Face (Figure 6(a)). Also Figure 6(b) shows the 10 largest importance weights from K = 50 samples in descending order, which exhibit an exponential decay behaviour, with the largest weight occupying more than 75% of the probability mass. 
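The ratio R tracked in Figure 6(a) is cheap to compute from the same normalised weights used in eq. (8); a sketch (our own):

```python
import numpy as np

def weight_concentration(log_w):
    """Normalised importance weights and the ratio R = wmax / (1 - wmax)."""
    a = log_w - log_w.max()      # shift for numerical stability
    w = np.exp(a)
    w /= w.sum()                 # normalise: sum_k w_k = 1
    wmax = w.max()
    return w, wmax / (1.0 - wmax)

log_w = np.array([-0.2, -2.0, -3.5, -4.0, -1.0])
w, R = weight_concentration(log_w)
print(np.sort(w)[::-1], R)   # the heaviest weight dominates the normalised mass
```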
Hence VR-max provides a fast approximation to IWAE when tested on CPUs or multiple GPUs with high communication costs. Indeed our numpy implementation of VR-max achieves up to 3 times speed-up compared to IWAE (9.7s vs. 29.0s per epoch, tested on Frey Face data with K = 50 and batch size M = 100, CPU info: Intel Core i7-4930K CPU @ 3.40GHz). However this speed advantage is less significant when the gradients can be computed very efficiently on a single GPU.\n\n6 Conclusion\n\nWe have introduced the variational R\u00e9nyi bound and an associated optimisation framework. We have shown the richness of the new family, not only by connecting to existing approaches including VI/VB, SEP, BB-\u03b1, VAE and IWAE, but also by proposing the VR-max algorithm as a new special case. Empirical results on Bayesian neural networks and variational auto-encoders indicate that VR bound methods are widely applicable and can obtain state-of-the-art results. Future work will focus on both experimental and theoretical sides. Theoretical work will study the interaction of the biases introduced by MC approximation and datapoint sub-sampling. A guide on choosing optimal \u03b1 values is needed for practitioners when applying the framework to their applications.\n\nAcknowledgements\n\nWe thank the Cambridge MLG members and the reviewers for comments. YL thanks the Schlumberger Foundation FFTF fellowship. RET thanks EPSRC grants # EP/M026957/1 and EP/L000776/1.\n\nReferences\n[1] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, \u201cAn introduction to variational methods for graphical models,\u201d Machine Learning, vol. 37, no. 2, pp. 183\u2013233, 1999.\n[2] M. J. Beal, Variational algorithms for approximate Bayesian inference. PhD thesis, University College\n\n[3] T. 
Minka, \u201cExpectation propagation for approximate Bayesian inference,\u201d in Conference on Uncertainty in\n\nLondon, 2003.\n\nArti\ufb01cial Intelligence (UAI), 2001.\n\n[4] M. Opper and O. Winther, \u201cExpectation consistent approximate inference,\u201d The Journal of Machine\n\nLearning Research, vol. 6, pp. 2177\u20132204, 2005.\n\n[5] T. Minka, \u201cDivergence measures and message passing,\u201d tech. rep., Microsoft Research, 2005.\n[6] M. D. Hoffman, D. M. Blei, C. Wang, and J. W. Paisley, \u201cStochastic variational inference,\u201d Journal of\n\nMachine Learning Research, vol. 14, no. 1, pp. 1303\u20131347, 2013.\n\n[7] Y. Li, J. M. Hern\u00e1ndez-Lobato, and R. E. Turner, \u201cStochastic expectation propagation,\u201d in Advances in\n\nNeural Information Processing Systems (NIPS), 2015.\n\n[8] G. Dehaene and S. Barthelm\u00e9, \u201cExpectation propagation in the large-data limit,\u201d arXiv:1503.08060, 2015.\n[9] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan, \u201cStreaming variational Bayes,\u201d in\n\nAdvances in Neural Information Processing Systems (NIPS), 2013.\n\n[10] A. Gelman, A. Vehtari, P. Jyl\u00e4nki, C. Robert, N. Chopin, and J. P. Cunningham, \u201cExpectation propagation\n\nas a way of life,\u201d arXiv:1412.4869, 2014.\n\n[11] M. Xu, B. Lakshminarayanan, Y. W. Teh, J. Zhu, and B. Zhang, \u201cDistributed Bayesian posterior sampling\n\nvia moment sharing,\u201d in Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[12] J. Paisley, D. Blei, and M. Jordan, \u201cVariational Bayesian inference with stochastic search,\u201d in Proceedings\n\nof The 29th International Conference on Machine Learning (ICML), 2012.\n\n[13] T. Salimans and D. A. Knowles, \u201cFixed-form variational posterior approximation through stochastic linear\n\nregression,\u201d Bayesian Analysis, vol. 8, no. 4, pp. 837\u2013882, 2013.\n\n[14] R. Ranganath, S. Gerrish, and D. M. 
Blei, \u201cBlack box variational inference,\u201d in Proceedings of the 17th\n\nInternational Conference on Arti\ufb01cial Intelligence and Statistics (AISTATS), 2014.\n\n[15] A. Kucukelbir, R. Ranganath, A. Gelman, and D. M. Blei, \u201cAutomatic variational inference in Stan,\u201d in\n\nAdvances in Neural Information Processing Systems (NIPS), 2015.\n\n[16] J. M. Hern\u00e1ndez-Lobato, Y. Li, M. Rowland, D. Hern\u00e1ndez-Lobato, T. Bui, and R. E. Turner, \u201cBlack-box\n\u03b1-divergence minimization,\u201d in Proceedings of The 33rd International Conference on Machine Learning\n(ICML), 2016.\n\n[17] Y. Burda, R. Grosse, and R. Salakhutdinov, \u201cImportance weighted autoencoders,\u201d in International Confer-\n\nence on Learning Representations (ICLR), 2016.\n\n[18] D. P. Kingma and M. Welling, \u201cAuto-encoding variational Bayes,\u201d in International Conference on Learning\n\nRepresentations (ICLR), 2014.\n\n[19] D. J. Rezende, S. Mohamed, and D. Wierstra, \u201cStochastic backpropagation and approximate inference\nin deep generative models,\u201d in Proceedings of The 30th International Conference on Machine Learning\n(ICML), 2014.\n\n[20] A. R\u00e9nyi, \u201cOn measures of entropy and information,\u201d Fourth Berkeley symposium on mathematical\n\nstatistics and probability, vol. 1, 1961.\n\n[21] S.-i. Amari, Differential-Geometrical Methods in Statistic. New York: Springer, 1985.\n[22] C. Tsallis, \u201cPossible generalization of Boltzmann-Gibbs statistics,\u201d Journal of statistical physics, vol. 52,\n\nno. 1-2, pp. 479\u2013487, 1988.\n\n[23] T. Van Erven and P. Harremo\u00ebs, \u201cR\u00e9nyi divergence and Kullback-Leibler divergence,\u201d Information Theory,\n\nIEEE Transactions on, vol. 60, no. 7, pp. 3797\u20133820, 2014.\n\n[24] P. Gr\u00fcnwald, Minimum Description Length Principle. MIT press, Cambridge, MA, 2007.\n[25] R. E. Turner and M. 
Sahani, \u201cTwo problems with variational expectation maximisation for time-series\nmodels,\u201d in Bayesian Time series models (D. Barber, T. Cemgil, and S. Chiappa, eds.), ch. 5, pp. 109\u2013130,\nCambridge University Press, 2011.\n\n[26] T. Minka, \u201cPower EP,\u201d Tech. Rep. MSR-TR-2004-149, Microsoft Research, 2004.\n[27] T. D. Bui, D. Hern\u00e1ndez-Lobato, Y. Li, J. M. Hern\u00e1ndez-Lobato, and R. E. Turner, \u201cDeep gaussian processes\nfor regression using approximate expectation propagation,\u201d in Proceedings of The 33rd International\nConference on Machine Learning (ICML), 2016.\n\n[28] S. Depeweg, J. M. Hern\u00e1ndez-Lobato, F. Doshi-Velez, and S. Udluft, \u201cLearning and policy search in\n\nstochastic dynamical systems with bayesian neural networks,\u201d arXiv preprint arXiv:1605.07127, 2016.\n\n[29] D. P. Kingma and J. Ba, \u201cAdam: A method for stochastic optimization,\u201d in International Conference on\n\nLearning Representations (ICLR), 2015.\n\n9\n\n\f", "award": [], "sourceid": 616, "authors": [{"given_name": "Yingzhen", "family_name": "Li", "institution": "University of Cambridge"}, {"given_name": "Richard", "family_name": "Turner", "institution": "University of Cambridge"}]}