{"title": "Fast Variational Inference in the Conjugate Exponential Family", "book": "Advances in Neural Information Processing Systems", "page_first": 2888, "page_last": 2896, "abstract": "We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic models optimized using our bound.", "full_text": "Fast Variational Inference in the\nConjugate Exponential Family\n\nJames Hensman\u2217\n\nDepartment of Computer Science\n\nThe University of Shef\ufb01eld\n\nMagnus Rattray\n\nFaculty of Life Science\n\nThe University of Manchester\n\njames.hensman@sheffield.ac.uk\n\nmagnus.rattray@manchester.ac.uk\n\nNeil D. Lawrence\u2217\n\nDepartment of Computer Science\n\nThe University of Shef\ufb01eld\n\nn.lawrence@sheffield.ac.uk\n\nAbstract\n\nWe present a general method for deriving collapsed variational inference algo-\nrithms for probabilistic models in the conjugate exponential family. Our method\nuni\ufb01es many existing approaches to collapsed variational inference. Our collapsed\nvariational inference leads to a new lower bound on the marginal likelihood. We\nexploit the information geometry of the bound to derive much faster optimization\nmethods based on conjugate gradients for these models. Our approach is very\ngeneral and is easily applied to any model where the mean \ufb01eld update equations\nhave been derived. 
Empirically we show signi\ufb01cant speed-ups for probabilistic\ninference using our bound.\n\n1\n\nIntroduction\n\nVariational bounds provide a convenient approach to approximate inference in a range of intractable\nmodels [Ghahramani and Beal, 2001]. Classical variational optimization is achieved through coordi-\nnate ascent which can be slow to converge. A popular solution [King and Lawrence, 2006, Teh et al.,\n2007, Kurihara et al., 2007, Sung et al., 2008, L\u00b4azaro-Gredilla and Titsias, 2011, L\u00b4azaro-Gredilla\net al., 2011] is to marginalize analytically a portion of the variational approximating distribution,\nremoving this from the optimization. In this paper we provide a unifying framework for collapsed\ninference in the general class of models composed of conjugate-exponential graphs (CEGs).\nFirst we review the body of earlier work with a succinct and unifying derivation of the collapsed\nbounds. We describe how the applicability of the collapsed bound to any particular CEG can be\ndetermined with a simple d-separation test. Standard variational inference via coordinate ascent\nturns out to be steepest ascent with a unit step length on our unifying bound. This motivates us\nto consider natural gradients and conjugate gradients for fast optimization of these models. We\napply our unifying approach to a range of models from the literature obtaining, often, an order of\nmagnitude or more increase in convergence speed. Our unifying view allows collapsed variational\nmethods to be integrated into general inference tools like infer.net [Minka et al., 2010].\n\n\u2217also at Shef\ufb01eld Institute for Translational Neuroscience, SITraN\n\n1\n\n\f2 The Marginalised Variational Bound\n\nThe advantages to marginalising analytically a subset of variables in variational bounds seem to\nbe well understood: several different approaches have been suggested in the context of speci\ufb01c\nmodels. In Dirichlet process mixture models Kurihara et al. 
[2007] proposed a collapsed approach using both truncated stick-breaking and symmetric priors. Sung et al. [2008] proposed 'latent space variational Bayes', where both the cluster parameters and the mixing weights were marginalised, again with some approximations. Teh et al. [2007] proposed a collapsed inference procedure for latent Dirichlet allocation (LDA). In this paper we unify all these results from the perspective of the 'KL corrected bound' [King and Lawrence, 2006]. This lower bound on the model evidence is also an upper bound on the original variational bound; the difference between the two bounds is given by a Kullback Leibler divergence. The approach has also been referred to as the marginalised variational bound by Lázaro-Gredilla et al. [2011], Lázaro-Gredilla and Titsias [2011]. The connection between the KL corrected bound and the collapsed bounds is not immediately obvious. The key difference between the frameworks is the order in which the marginalisation and the variational approximation are applied. However, for CEGs this order turns out to be irrelevant. Our framework leads to a more succinct derivation of the collapsed approximations. The resulting bound can then be optimised without recourse to approximations in either the bound's evaluation or its optimization.\n\n2.1 Variational Inference\n\nAssume we have a probabilistic model for data, D, given parameters (and/or latent variables), X, Z, of the form p(D, X, Z) = p(D | Z, X) p(Z | X) p(X). In variational Bayes (see e.g. Bishop [2006]) we approximate the posterior p(Z, X | D) by a distribution q(Z, X). We use Jensen's inequality to derive a lower bound on the model evidence \mathcal{L}, which serves as an objective function in the variational optimisation:\n\np(D) \geq \mathcal{L} = \int q(Z, X) \ln \frac{p(D, Z, X)}{q(Z, X)} \, dZ \, dX.   (1)\n\nFor tractability the mean field (MF) approach assumes q factorises across its variables, q(Z, X) = q(Z) q(X). 
It is then possible to implement an optimisation scheme which analytically optimises each factor alternately, with the optimal distribution given by\n\nq^\star(X) \propto \exp \left\{ \int q(Z) \ln p(D, X | Z) \, dZ \right\},   (2)\n\nand similarly for Z: these are often referred to as VBE and VBM steps. King and Lawrence [2006] substituted the expression for the optimal distribution (for example q^\star(X)) back into the bound (1), eliminating one set of parameters from the optimisation, an approach that has been reused by Lázaro-Gredilla et al. [2011], Lázaro-Gredilla and Titsias [2011]. The resulting bound is not dependent on q(X). King and Lawrence [2006] referred to this new bound as 'the KL corrected bound'. The difference between this bound, which we denote \mathcal{L}_{KL}, and a standard mean field approximation \mathcal{L}_{MF} is the Kullback Leibler divergence between the optimal form q^\star(X) and the current q(X).\n\nWe re-derive their bound by first using Jensen's inequality to construct the variational lower bound on the conditional distribution,\n\n\ln p(D | X) \geq \int q(Z) \ln \frac{p(D, Z | X)}{q(Z)} \, dZ \triangleq \mathcal{L}_1.   (3)\n\nThis object turns out to be of central importance in computing the final KL-corrected bound, and also in computing gradients, curvatures and the distribution of the collapsed variables q^\star(X). It is easy to see that it is a function of X which lower-bounds the log likelihood \ln p(D | X), and indeed our derivation treats it as such. We now marginalize the conditioned variable from this expression,\n\n\ln p(D) \geq \ln \int p(X) \exp\{\mathcal{L}_1\} \, dX \triangleq \mathcal{L}_{KL},   (4)\n\ngiving us the bound of King and Lawrence [2006] and Lázaro-Gredilla et al. [2011]. 
Note that one set of parameters was marginalised after the variational approximation was made.\n\nUsing (2), this expression also provides the approximate posterior for the marginalised variables X:\n\nq^\star(X) = p(X) \exp\{\mathcal{L}_1 - \mathcal{L}_{KL}\},   (5)\n\nand \exp\{\mathcal{L}_{KL}\} appears as the constant of proportionality in the mean-field update equation (2).\n\n3 Partial Equivalence of the Bounds\n\nWe can recover \mathcal{L}_{MF} from \mathcal{L}_{KL} by again applying Jensen's inequality,\n\n\mathcal{L}_{KL} = \ln \int q(X) \frac{p(X)}{q(X)} \exp\{\mathcal{L}_1\} \, dX \geq \int q(X) \ln \left\{ \frac{p(X)}{q(X)} \exp\{\mathcal{L}_1\} \right\} dX,   (6)\n\nwhich can be re-arranged to give the mean-field bound,\n\n\mathcal{L}_{KL} \geq \int \int q(X) q(Z) \ln \left\{ \frac{p(D | Z, X) p(Z) p(X)}{q(Z) q(X)} \right\} dX \, dZ,   (7)\n\nand it follows that \mathcal{L}_{KL} = \mathcal{L}_{MF} + KL(q(X) \| q^\star(X)) and^1 \mathcal{L}_{KL} \geq \mathcal{L}_{MF}. For a given q(Z), the bounds are equal after q(X) is updated via the mean field method: the approximations are ultimately the same. The advantage of the new bound is to reduce the number of parameters in the optimisation. It is particularly useful when variational parameters are optimised by gradient methods. Since VBEM is equivalent to a steepest-ascent gradient method with a fixed step size, there appears to be a lot to gain by combining the KLC bound with more sophisticated optimization techniques.\n\n3.1 Gradients\n\nConsider the gradient of the KL corrected bound with respect to the parameters \theta_z of q(Z):\n\n\frac{\partial \mathcal{L}_{KL}}{\partial \theta_z} = \exp\{-\mathcal{L}_{KL}\} \frac{\partial}{\partial \theta_z} \int \exp\{\mathcal{L}_1\} p(X) \, dX = E_{q^\star(X)} \left[ \frac{\partial \mathcal{L}_1}{\partial \theta_z} \right],   (8)\n\nwhere we have used the relation (5). To find the gradient of the mean-field bound we note that it can be written in terms of our conditional bound (3) as \mathcal{L}_{MF} = E_{q(X)} \left[ \mathcal{L}_1 + \ln p(X) - \ln q(X) \right], giving\n\n\frac{\partial \mathcal{L}_{MF}}{\partial \theta_z} = E_{q(X)} \left[ \frac{\partial \mathcal{L}_1}{\partial \theta_z} \right],   (9)\n\nthus setting q(X) = q^\star(X) not only makes the bounds equal, \mathcal{L}_{MF} = \mathcal{L}_{KL}, but also equates their gradients with respect to \theta_Z.\n\nSato [2001] has shown that the variational update equation can be interpreted as a gradient method, where each update is also a step in the steepest direction in the canonical parameters of q(Z). We can combine this important insight with the above result to realize that we have a simple method for computing the gradients of the KL corrected bound: we only need to look at the update expressions for the mean-field method. This result also reveals the weakness of standard variational Bayesian expectation maximization (VBEM): it is a steepest ascent algorithm. Honkela et al. [2010] looked to rectify this weakness by applying a conjugate gradient algorithm to the mean field bound. However, they didn't obtain a significant improvement in convergence speed. Our suggestion is to apply conjugate gradients to the KLC bound. Whilst the value and gradient of the MF bound match those of the KLC bound after an update of the collapsed variables, the curvature is always greater. In practice this means that much larger steps (which we compute using conjugate gradient methods) can be taken when optimizing the KLC bound than for the MF bound, leading to more rapid convergence.\n\n3.2 Curvature of the Bounds\n\nKing and Lawrence [2006] showed empirically that the KLC bound could lead to faster convergence because the bounds differ in their curvature: the curvature of the KLC bound enables larger steps to be taken by an optimizer. 
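Before turning to the curvature, the identities of this section are easy to check numerically on a toy discrete model, where every integral becomes a finite sum. The tables below are arbitrary illustrations (not from the paper); the check confirms that the bounds differ by the KL divergence from the current q(X) to the implicit q*(X), so the KLC bound always upper-bounds the MF bound:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nz = 3, 4

# An arbitrary toy model p(D, X, Z) = p(D|X,Z) p(Z|X) p(X), with the data D
# held fixed at its observed value; all tables are illustrative.
p_x = rng.dirichlet(np.ones(nx))                   # p(X)
p_z_given_x = rng.dirichlet(np.ones(nz), size=nx)  # p(Z|X), one row per X
lik = rng.uniform(0.1, 1.0, size=(nx, nz))         # p(D|X,Z) at the observed D

# Arbitrary (unoptimised) variational factors q(X) and q(Z).
q_x = rng.dirichlet(np.ones(nx))
q_z = rng.dirichlet(np.ones(nz))

# Conditional bound (3): L1(X) = E_q(Z)[ln p(D,Z|X) - ln q(Z)] <= ln p(D|X).
L1 = (q_z * (np.log(lik * p_z_given_x) - np.log(q_z))).sum(axis=1)

# KL corrected bound (4) and the implicit posterior (5).
L_KL = np.log((p_x * np.exp(L1)).sum())
q_star = p_x * np.exp(L1 - L_KL)

# Mean field bound (7) for the same q(Z) and the arbitrary q(X).
log_joint = np.log(lik * p_z_given_x * p_x[:, None])
L_MF = (q_x[:, None] * q_z
        * (log_joint - np.log(q_x)[:, None] - np.log(q_z))).sum()

# The gap between the bounds is KL(q(X) || q*(X)) >= 0.
kl = (q_x * np.log(q_x / q_star)).sum()
assert np.isclose(L_KL, L_MF + kl)
assert L_KL >= L_MF
```

Setting q(X) = q*(X) makes the KL term vanish, recovering the equality of the bounds noted above.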
We now derive analytical expressions for the curvature of both bounds. For the mean field bound we have\n\n\frac{\partial^2 \mathcal{L}_{MF}}{\partial \theta_z^2} = E_{q(X)} \left[ \frac{\partial^2 \mathcal{L}_1}{\partial \theta_z^2} \right],   (10)\n\nand for the KLC bound, with some manipulation of (4) and using (5):\n\n\frac{\partial^2 \mathcal{L}_{KL}}{\partial \theta_z^{[i]} \partial \theta_z^{[j]}} = e^{-\mathcal{L}_{KL}} \frac{\partial^2 e^{\mathcal{L}_{KL}}}{\partial \theta_z^{[i]} \partial \theta_z^{[j]}} - e^{-2\mathcal{L}_{KL}} \left\{ \frac{\partial e^{\mathcal{L}_{KL}}}{\partial \theta_z^{[i]}} \right\} \left\{ \frac{\partial e^{\mathcal{L}_{KL}}}{\partial \theta_z^{[j]}} \right\} = E_{q^\star(X)} \left[ \frac{\partial^2 \mathcal{L}_1}{\partial \theta_z^{[i]} \partial \theta_z^{[j]}} \right] + E_{q^\star(X)} \left[ \frac{\partial \mathcal{L}_1}{\partial \theta_z^{[i]}} \frac{\partial \mathcal{L}_1}{\partial \theta_z^{[j]}} \right] - \left\{ E_{q^\star(X)} \left[ \frac{\partial \mathcal{L}_1}{\partial \theta_z^{[i]}} \right] \right\} \left\{ E_{q^\star(X)} \left[ \frac{\partial \mathcal{L}_1}{\partial \theta_z^{[j]}} \right] \right\}.   (11)\n\nIn this result the first term is equal to (10), and the second two terms combine to be always positive semi-definite, proving King and Lawrence [2006]'s intuition about the curvature of the bound. When the curvature is negative definite (e.g. near a maximum), the KLC bound's curvature is less negative definite, enabling larger steps to be taken in optimization. Figure 1(b) illustrates the effect of this as well as the bounds' similarities.\n\n1 We use KL(\cdot \| \cdot) to denote the Kullback Leibler divergence between two distributions.\n\n3.3 Relationship to Collapsed VB\n\nIn collapsed inference some parameters are marginalized before applying the variational bound. For example, Sung et al. [2008] proposed a latent variable model where the model parameters were marginalised, and Teh et al. [2007] proposed a non-parametric topic model where the document proportions were collapsed. 
These procedures lead to improved inference, or faster convergence. The KLC bound derivation we have provided also marginalises parameters, but after a variational approximation is made. The difference between the two approaches is distilled in these expressions:\n\n\ln E_{p(X)} \left[ \exp \left\{ E_{q(Z)} \left[ \ln p(D | X, Z) \right] \right\} \right] \quad \text{and} \quad E_{q(Z)} \left[ \ln E_{p(X)} \left[ p(D | X, Z) \right] \right],   (12)\n\nwhere the left expression appears in the KLC bound, and the right expression appears in the bound for collapsed variational Bayes, with the remainder of the bounds being equal. Whilst an appropriately conjugate formulation of the model will always ensure that the KLC expression is analytically tractable, the expectation in the collapsed VB expression is not. Sung et al. [2008] propose a first order approximation to the expectation, of the form E_{q(Z)}[f(Z)] \approx f(E_{q(Z)}[Z]), which reduces the right expression to that on the left. Under this approximation^2 the KL corrected approach is equivalent to the collapsed variational approach.\n\n3.4 Applicability\n\nTo apply the KLC bound we need to specify a subset, X, of variables to marginalize. We select the variables that break the dependency structure of the graph to enable the analytic computation of the integral in (4). Assuming the appropriate conjugate exponential structure for the model, we are left with the requirement to select a subset that induces the appropriate factorisation. These induced factorisations are discussed in some detail in Bishop [2006]. They are factorisations in the approximate posterior which arise from the form of the variational approximation and from the structure of the model. 
These factorisations allow application of the KLC bound, and can be identified using a simple d-separation test, as Bishop discusses.\n\nThe d-separation test involves checking for independence amongst the marginalised variables (X in the above) conditioned on the observed data D and the approximated variables (Z in the above). The requirement is to select a sufficient set of variables, Z, such that the effective likelihood for X, given by (3), becomes conjugate to the prior. Figure 1(a) illustrates the d-separation test with application to the KLC bound.\n\nFor latent variable models, it is often sufficient to select the latent variables for X whilst collapsing the model variables. For example, in the specific case of mixture models and topic models, approximating the component labels allows for the marginalisation of the cluster parameters (topic allocations) and mixing proportions. This allowed Sung et al. [2008] to derive a general form for latent variable models, though our formulation is general to any conjugate exponential graph.\n\n2 Kurihara et al. [2007] and Teh et al. [2007] suggest a further second order correction and assume that q(Z) is Gaussian to obtain tractability. This leads to additional correction terms that augment the KLC bound. The form of these corrections would need to be determined on a case by case basis, and has in fact been shown to be less effective than the methods unified here [Asuncion et al., 2012].\n\n[Figure 1: (a) An example directed graphical model (with nodes A-F) on which we could use the KLC bound. Given the observed node C, the nodes A,F d-separate given nodes B,D,E. Thus we could make an explicit variational approximation for A,F, whilst marginalising B,D,E. Alternatively, we could select B,D,E for a parameterised approximate distribution, whilst marginalising A,F. (b) A sketch of the KLC and MF bounds. At the point where the mean field method has q(X) = q^\star(X), the bounds are equal in value as well as in gradient. Away from this point, the difference between the bounds is the Kullback Leibler divergence between the current MF approximation for X and the implicit distribution q^\star(X) of the KLC bound.]\n\n4 Riemannian Gradient Based Optimisation\n\nSato [2001] and Hoffman et al. [2012] showed that the VBEM procedure performs gradient ascent in the space of the natural parameters. Using the KLC bound to collapse the problem, gradient methods seem a natural choice for optimisation, since there are fewer parameters to deal with, and we have shown that computation of the gradients is straightforward (the variational update equations contain the model gradients). It turns out that the KLC bound is particularly amenable to Riemannian or natural gradient methods, because the information geometry of the exponential family distribution(s) over which we are optimising leads to a simple expression for the natural gradient. Previous investigations of natural gradients for variational Bayes [Honkela et al., 2010, Kuusela et al., 2009] required the inversion of the Fisher information at every step (ours does not), and also used VBEM steps for some parameters and Riemannian optimisation for other variables. The collapsed nature of the KLC bound means that these VBEM steps are unnecessary: the bound can be computed by parameterizing the distribution of only one set of variables (q(Z)), whilst the implicit distribution of the other variables is given in terms of the first distribution and the data by equation (5).\n\nWe optimize the lower bound \mathcal{L}_{KL} with respect to the parameters of the approximating distribution of the non-collapsed variables. 
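In outline, such an optimisation might be sketched as follows. This is a hedged sketch only: the `bound_grads` interface and the surrounding loop are illustrative assumptions (not code from the paper), combining unit step lengths with a Fletcher-Reeves conjugate direction of the kind discussed later in section 4.3:

```python
import numpy as np

def natural_cg(bound_grads, eta0, max_iter=200, tol=1e-6):
    """Fletcher-Reeves natural conjugate ascent with unit step lengths.

    `bound_grads(eta) -> (bound, grad, nat_grad)` is an assumed interface
    returning the collapsed bound, its plain gradient and its natural
    gradient at eta; it stands in for model-specific derivations.
    """
    eta = np.array(eta0, dtype=float)
    bound, g, ng = bound_grads(eta)
    s = ng.copy()                      # first direction: steepest (natural) ascent
    inner_old = ng @ g                 # Riemannian norm <g~, g~> = g~^T G g~ = g~^T g
    for _ in range(max_iter):
        eta = eta + s                  # unit step, as in VBEM
        new_bound, g, ng = bound_grads(eta)
        if abs(new_bound - bound) < tol:
            bound = new_bound
            break
        inner = ng @ g
        beta = inner / inner_old if inner_old > 1e-12 else 0.0
        s = ng + beta * s              # Fletcher-Reeves conjugate direction
        bound, inner_old = new_bound, inner
    return eta, bound

# Toy usage: maximise -0.5 ||eta - c||^2; here the Fisher matrix is the
# identity, so the natural and plain gradients coincide.
c = np.array([1.0, -2.0, 0.5])
def toy(eta):
    g = c - eta
    return -0.5 * g @ g, g, g

eta_opt, final_bound = natural_cg(toy, np.zeros(3))
assert np.allclose(eta_opt, c)
```

With steepest ascent alone (beta fixed at zero) the same loop reduces to the VBEM updates described below.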
We showed in section 3.1 that the gradient of the KLC bound is given by the gradient of the standard MF variational bound after an update of the collapsed variables. It is clear from their definition that the same is true of the natural gradients.\n\n4.1 Variable Transformations\n\nWe can compute the natural gradient of our collapsed bound by considering the update equations of the non-collapsed problem, as described above. However, if we wish to make use of more powerful optimisation methods like conjugate gradient ascent, it is helpful to re-parameterize the natural parameters in an unconstrained fashion. The natural gradient is given by [Amari and Nagaoka, 2007]:\n\n\widetilde{g}(\theta) = G(\theta)^{-1} \frac{\partial \mathcal{L}_{KL}}{\partial \theta},   (13)\n\nwhere G(\theta) is the Fisher information matrix, whose i,jth element is given by\n\nG(\theta)_{[i,j]} = -E_{q(X | \theta)} \left[ \frac{\partial^2 \ln q(X | \theta)}{\partial \theta_{[i]} \partial \theta_{[j]}} \right].   (14)\n\nFor exponential family distributions, this reduces to \nabla^2_\theta \psi(\theta), where \psi is the log-normaliser. Further, for exponential family distributions, the Fisher information in the canonical parameters (\theta) and that in the expectation parameters (\eta) are reciprocal, and we also have G(\theta) = \partial \eta / \partial \theta. This means that the natural gradient in \theta is given by\n\n\widetilde{g}(\theta) = G(\theta)^{-1} \frac{\partial \eta}{\partial \theta} \frac{\partial \mathcal{L}_{KL}}{\partial \eta} = \frac{\partial \mathcal{L}_{KL}}{\partial \eta}, \quad \text{and} \quad \widetilde{g}(\eta) = \frac{\partial \mathcal{L}_{KL}}{\partial \theta}.   (15)\n\nThe gradient in one set of parameters provides the natural gradient in the other. 
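This identity is easy to verify for a single Bernoulli factor (an illustrative example, not a model from the paper; the objective L is just a convenient smooth function of the expectation parameter):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# A Bernoulli distribution in exponential-family form: natural parameter
# theta, expectation parameter eta = sigmoid(theta), psi(theta) = ln(1 + e^theta).
theta = 0.7
eta = sigmoid(theta)

# Fisher information G(theta) = psi''(theta) = eta (1 - eta) = d eta / d theta.
G = eta * (1.0 - eta)

# Any smooth objective written through eta; this particular L (a binomial
# log likelihood with k successes in n trials) is an arbitrary stand-in.
k, n = 3.0, 10.0
def L(th):
    e = sigmoid(th)
    return k * np.log(e) + (n - k) * np.log(1.0 - e)

eps = 1e-6
dL_dtheta = (L(theta + eps) - L(theta - eps)) / (2.0 * eps)  # gradient in theta
dL_deta = k / eta - (n - k) / (1.0 - eta)                    # gradient in eta

# As in (15): the natural gradient in theta is the plain gradient in eta.
assert np.isclose(dL_dtheta / G, dL_deta, rtol=1e-5)
```

The matrix inverse in (13) never needs to be formed: differentiating with respect to one parameterisation yields the natural gradient in the other.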
Thus when our approximating distribution q is in the exponential family, we can compute the natural gradient without the expensive matrix inverse.\n\n4.2 Steepest Ascent is Coordinate Ascent\n\nSato [2001] showed that the VBEM algorithm is a gradient-based algorithm. In fact, VBEM consists of taking unit steps in the direction of the natural gradient of the canonical parameters. From equation (9) and the work of Sato [2001], we see that the gradient of the KLC bound can be obtained by considering the standard mean-field update for the non-collapsed parameter Z. We confirm these relationships for the models studied in the next section in the supplementary material. Having confirmed that the VB-E step is equivalent to steepest-gradient ascent, we now explore whether the procedure could be improved by the use of conjugate gradients.\n\n4.3 Conjugate Gradient Optimization\n\nOne idea for solving some of the problems associated with steepest ascent is to ensure that each gradient step is conjugate (geometrically) to the previous one. Honkela et al. [2010] applied conjugate gradients to the standard mean field bound; we expect much faster convergence for the KLC bound due to its differing curvature. Since VBEM uses a step length of 1 to optimize,^3 we also used this step length in conjugate gradients. In the natural conjugate gradient method, the search direction at the ith iteration is given by s_i = -\widetilde{g}_i + \beta s_{i-1}. Empirically the Fletcher-Reeves method for estimating \beta worked well for us:\n\n\beta_{FR} = \frac{\langle \widetilde{g}_i, \widetilde{g}_i \rangle_i}{\langle \widetilde{g}_{i-1}, \widetilde{g}_{i-1} \rangle_{i-1}},   (16)\n\nwhere \langle \cdot, \cdot \rangle_i denotes the inner product in Riemannian geometry, which is given by \widetilde{g}^\top G(\rho) \widetilde{g}. We note from Kuusela et al. [2009] that this can be simplified, since \widetilde{g}^\top G \widetilde{g} = \widetilde{g}^\top G G^{-1} g = \widetilde{g}^\top g, and other conjugate methods, defined in the supplementary material, can be applied similarly.\n\n3 We empirically evaluated a line-search procedure, but found that in most cases the Wolfe-Powell conditions were met after a single step of unit length.\n\n5 Experiments\n\nFor an empirical investigation of the potential speed-ups we selected a range of probabilistic models. We provide derivations of the bound and fuller explanations of the models in the supplementary material. In each experiment, the algorithm was considered to have converged when the change in the bound or the Riemannian gradient fell below 10^{-6}. Comparisons between optimisation procedures always used the same initial conditions (or set of initial conditions) for each method. First we recreate the mixture of Gaussians example described by Honkela et al. [2010].\n\n5.1 Mixtures of Gaussians\n\nFor a mixture of Gaussians, using the d-separation rule, we select for X the cluster allocation (latent) variables. These are parameterised through the softmax function for unconstrained optimisation. Our model includes a fully Bayesian treatment of the cluster parameters and the mixing proportions, whose approximate posterior distributions appear as (5). Full details of the algorithm derivation are given in the supplementary material. A neat feature is that we can make use of the discussion above to derive an expression for the natural gradient without a matrix inverse.\n\nTable 1: Iterations to convergence for the mixture of Gaussians problem, with varying overlap (R). This table reports the average number of iterations taken to reach (within 10 nats of) the best known solution. 
For the more difficult scenarios (with more overlap in the clusters) the VBEM method failed to reach the optimum solution within 500 restarts.\n\nCG method          | R = 1     | R = 2      | R = 3    | R = 4    | R = 5\nPolack-Ribière     | 3,100.37  | 15,698.57  | 5,767.12 | 1,613.09 | 3,046.25\nHestenes-Stiefel   | 1,371.55  | 5,501.25   | 5,922.4  | 358.03   | 172.39\nFletcher-Reeves    | 416.18    | 1,161.35   | 5,091.0  | 792.10   | 494.24\nVBEM               | ∞         | ∞          | ∞        | 992.07   | 429.57\n\nTable 2: Time and iterations taken to run LDA on the NIPS 2011 corpus, ± one standard deviation, for two conjugate methods and VBEM. The Fletcher-Reeves conjugate algorithm is almost ten times as fast as VBEM. The value of the bound at the optimum was largely the same: deviations are likely just due to the choice of initialisations, of which we used 12.\n\nMethod            | Time (minutes) | Iterations     | Bound\nHestenes-Stiefel  | 56.4 ± 18.5    | 644.3 ± 214.5  | −1,998,780 ± 201\nFletcher-Reeves   | 38.5 ± 8.7     | 447.8 ± 100.5  | −1,998,743 ± 194\nVBEM              | 370 ± 105      | 4,459 ± 1,296  | −1,998,732 ± 241\n\nIn Honkela et al. [2010] data are drawn from a mixture of five two-dimensional Gaussians with equal weights, each with unit spherical covariance. The centers of the components are at (0, 0) and (±R, ±R). R is varied from 1 (almost completely overlapping) to 5 (completely separate). The model is initialised with eight components with an uninformative prior over the mixing proportions: the optimisation procedure is left to select an appropriate number of components.\n\nSung et al. [2008] reported that their collapsed method led to improved convergence over VBEM. Since our objective is identical, though our optimisation procedure different, we devised a metric for measuring the efficacy of our algorithms which also accounts for their propensity to fall into local minima. 
Using many randomised restarts, we measured the average number of iterations taken to reach the best-known optimum. If the algorithm converged at a lesser optimum, those iterations were still included in the numerator (the total iteration count), but we didn't increment the denominator (the count of successful runs) when computing the average. We compared three different conjugate gradient approaches and standard VBEM (which is also steepest ascent on the KLC bound) using 500 restarts.\n\nTable 1 shows the number of iterations required (on average) to come within 10 nats of the best known solution for three different conjugate-gradient methods and VBEM. VBEM sometimes failed to find the optimum in any of the 500 restarts. Even relaxing the stringency of our selection to 100 nats, the VBEM method was always at least twice as slow as the best conjugate method.\n\n5.2 Topic Models\n\nLatent Dirichlet allocation (LDA) [Blei et al., 2003] is a popular approach for extracting topics from documents. To demonstrate the KLC bound we applied it to 200 papers from the 2011 NIPS conference. The PDFs were preprocessed with pdftotext, removing non-alphabetical characters and coarsely filtering words by popularity to form a vocabulary of size 2000.^4 We selected the latent topic-assignment variables for parameterisation, collapsing the topics and the document proportions. Conjugate gradient optimization was compared to the standard VBEM approach.\n\nWe used twelve random initializations, starting each algorithm from each initial condition. Topic and document distributions were treated with fixed, uninformative priors. 
On average, the Fletcher-Reeves algorithm was almost ten times as fast as standard VB, as shown in Table 2, whilst the final bound varied little between approaches.\n\n4 Some extracted topics are presented in the supplementary material.\n\n5.3 RNA-seq alignment\n\nAn emerging problem in computational biology is inference of transcript structure and expression levels using next-generation sequencing technology (RNA-Seq). Several models have been proposed. The BitSeq method [Glaus et al., 2012] is based on a probabilistic model and uses Gibbs sampling for approximate inference. The sampler can suffer from particularly slow convergence due to the large size of the problem, which has six million latent variables for the data considered here. We implemented a variational version of their model and optimised it using VBEM and our collapsed Riemannian method. We applied the model to data described in Xu et al. [2010], a study of human microRNA. The model was initialised using four random initial conditions, and optimised using standard VBEM and the conjugate gradient versions of the algorithm. The Polack-Ribière conjugate method performed very poorly for this problem, often giving negative conjugation: we omit it here. The solutions found by the other algorithms were all fairly close, with bounds coming within 60 nats. The VBEM method was dramatically outperformed by the Fletcher-Reeves and Hestenes-Stiefel methods: it took 4600 ± 20 iterations to converge, whilst the conjugate methods took only 268 ± 4 and 265 ± 1 iterations respectively. At about 8 seconds per iteration, our collapsed Riemannian method requires around forty minutes, whilst VBEM takes almost eleven hours. 
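The wall-clock figures quoted here follow directly from the iteration counts at roughly 8 seconds per iteration (a quick arithmetic check, nothing more):

```python
# Sanity check of the RNA-seq timing claims at ~8 seconds per iteration.
sec_per_iter = 8.0
vbem_hours = 4600 * sec_per_iter / 3600.0   # VBEM: ~10.2 hours
cg_minutes = 268 * sec_per_iter / 60.0      # conjugate method: ~35.7 minutes

assert 10.0 < vbem_hours < 11.0             # "almost eleven hours"
assert 30.0 < cg_minutes < 40.0             # "around forty minutes"
```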
All the variational approaches represent an improvement over a Gibbs sampler, which takes approximately one week to run for this data [Glaus et al., 2012].\n\n6 Discussion\n\nUnder very general conditions (conjugate exponential family) we have shown the equivalence of collapsed variational bounds and marginalized variational bounds using the KL corrected perspective of King and Lawrence [2006]. We have provided a succinct derivation of these bounds, unifying several strands of work and laying the foundations for much wider application of this approach. When the collapsed variables are updated in the standard MF bound, the KLC bound is identical to the MF bound in value and gradient. Sato [2001] has shown that coordinate ascent of the MF bound (as prescribed by the VBEM updates) is equivalent to steepest ascent of the MF bound using natural gradients. This implies that standard variational inference is also performing steepest ascent on the KLC bound. This equivalence between natural gradients and the VBEM update equations means our method is quickly implementable for any model where the mean field update equations have been computed. It is only necessary to determine which variables to collapse using a d-separation test. Importantly, this implies our approach can readily be incorporated in automated inference engines such as that provided by infer.net [Minka et al., 2010]. We'd like to emphasise the ease with which the method can be applied: we have provided derivations of the equivalences of the bounds and gradients which should enable collapsed conjugate optimisation of any existing mean field algorithm, with minimal changes to the software. Indeed our own implementations (see supplementary material) use just a few lines of code to switch between the VBEM and conjugate methods.\n\nThe improved performance arises from the curvature of the KLC bound. 
We have shown that it is always less negative than that of the original variational bound, allowing much larger steps in the variational parameters, as King and Lawrence [2006] suggested. This also provides a gateway to second-order optimisation, which could prove even faster.\n\nWe provided empirical evidence of the performance increases that are possible using our method in three models. In a thorough exploration of the convergence properties of a mixture of Gaussians model, we concluded that a conjugate Riemannian algorithm can find solutions that are not found with standard VBEM. In a large LDA model, we found that performance can be improved many times over that of the VBEM method. In the BitSeq model for differential expression of gene transcripts we showed that very large improvements in performance are possible for models with huge numbers of latent variables.\n\nAcknowledgements\n\nThe authors would like to thank Michalis Titsias for helpful commentary on a previous draft and Peter Glaus for help with a C++ implementation of the RNAseq alignment algorithm. This work was funded by EU FP7-KBBE Project Ref 289434 and BBSRC grant number BB/1004769/1.\n\nReferences\n\nS. Amari and H. Nagaoka. Methods of Information Geometry. AMS, 2007.\n\nA. Asuncion, M. Welling, P. Smyth, and Y. Teh. On smoothing and inference for topic models. arXiv preprint arXiv:1205.2662, 2012.\n\nC. M. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.\n\nD. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.\n\nZ. Ghahramani and M. Beal. Propagation algorithms for variational Bayesian learning. Advances in Neural Information Processing Systems, pages 507–513, 2001.\n\nP. Glaus, A. Honkela, and M. Rattray. Identifying differentially expressed transcripts from RNA-seq data with biological variation. Bioinformatics, 2012. 
doi: 10.1093/bioinformatics/bts260. Advance Access.\n\nM. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. arXiv preprint arXiv:1206.7051, 2012.\n\nA. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. The Journal of Machine Learning Research, 9999:3235–3268, 2010.\n\nN. King and N. D. Lawrence. Fast variational inference for Gaussian process models through KL-correction. Machine Learning: ECML 2006, pages 270–281, 2006.\n\nK. Kurihara, M. Welling, and Y. W. Teh. Collapsed variational Dirichlet process mixture models. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 20, page 19, 2007.\n\nM. Kuusela, T. Raiko, A. Honkela, and J. Karhunen. A gradient-based algorithm competitive with variational Bayesian EM for mixture of Gaussians. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 1688–1695. IEEE, 2009.\n\nM. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regression. In Proceedings of the International Conference on Machine Learning (ICML), 2011.\n\nM. Lázaro-Gredilla, S. Van Vaerenbergh, and N. Lawrence. Overlapping mixtures of Gaussian processes for the data association problem. Pattern Recognition, 2011.\n\nT. P. Minka, J. M. Winn, J. P. Guiver, and D. A. Knowles. Infer.NET 2.4. Microsoft Research Cambridge, 2010.\n\nM. A. Sato. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649–1681, 2001.\n\nJ. Sung, Z. Ghahramani, and S. Bang. Latent-space variational Bayes. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(12):2236–2242, 2008.\n\nY. W. Teh, D. Newman, and M. Welling. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. 
Advances in neural information processing systems, 19:1353, 2007.\n\nG. Xu et al. Transcriptome and targetome analysis in MIR155 expressing cells using RNA-\nseq. RNA, pages 1610\u20131622, June 2010. ISSN 1355-8382. doi: 10.1261/rna.2194910. URL\nhttp://rnajournal.cshlp.org/cgi/doi/10.1261/rna.2194910.\n\n9\n\n\f", "award": [], "sourceid": 1314, "authors": [{"given_name": "James", "family_name": "Hensman", "institution": null}, {"given_name": "Magnus", "family_name": "Rattray", "institution": null}, {"given_name": "Neil", "family_name": "Lawrence", "institution": null}]}