{"title": "Decoupled Variational Gaussian Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 1547, "page_last": 1555, "abstract": "Variational Gaussian (VG) inference methods that optimize a lower bound to the marginal likelihood are a popular approach for Bayesian inference. These methods are fast and easy to use, while being reasonably accurate. A difficulty remains in computation of the lower bound when the latent dimensionality $L$ is large. Even though the lower bound is concave for many models, its computation requires optimization over $O(L^2)$ variational parameters. Efficient reparameterization schemes can reduce the number of parameters, but give inaccurate solutions or destroy concavity leading to slow convergence. We propose decoupled variational inference that brings the best of both worlds together. First, it maximizes a Lagrangian of the lower bound reducing the number of parameters to $O(N)$, where $N$ is the number of data examples. The reparameterization obtained is unique and recovers maxima of the lower-bound even when the bound is not concave. Second, our method maximizes the lower bound using a sequence of convex problems, each of which is parallellizable over data examples and computes gradient efficiently. Overall, our approach avoids all direct computations of the covariance, only requiring its linear projections. Theoretically, our method converges at the same rate as existing methods in the case of concave lower bounds, while remaining convergent at a reasonable rate for the non-concave case.", "full_text": "Decoupled Variational Gaussian Inference\n\nEcole Polytechnique F\u00b4ed\u00b4erale de Lausanne (EPFL), Switzerland\n\nMohammad Emtiyaz Khan\n\nemtiyaz@gmail.com\n\nAbstract\n\nVariational Gaussian (VG) inference methods that optimize a lower bound to the\nmarginal likelihood are a popular approach for Bayesian inference. 
A difficulty remains in computation of the lower bound when the latent dimensionality L is large. Even though the lower bound is concave for many models, its computation requires optimization over O(L^2) variational parameters. Efficient reparameterization schemes can reduce the number of parameters, but give inaccurate solutions or destroy concavity, leading to slow convergence. We propose decoupled variational inference, which brings the best of both worlds together. First, it maximizes a Lagrangian of the lower bound, reducing the number of parameters to O(N), where N is the number of data examples. The reparameterization obtained is unique and recovers maxima of the lower bound even when it is not concave. Second, our method maximizes the lower bound using a sequence of convex problems, each of which is parallelizable over data examples. Each gradient computation reduces to prediction in a pseudo linear regression model, thereby avoiding all direct computations of the covariance and only requiring its linear projections. Theoretically, our method converges at the same rate as existing methods in the case of concave lower bounds, while remaining convergent at a reasonable rate for the non-concave case.

1 Introduction

Large-scale Bayesian inference remains intractable for many models, such as logistic regression, sparse linear models, or dynamical systems with non-Gaussian observations. Approximate Bayesian inference requires fast, robust, and reliable algorithms. In this context, algorithms based on variational Gaussian (VG) approximations are growing in popularity [17, 3, 13, 6] since they strike a favorable balance between accuracy, generality, speed, and ease of use.
VG inference remains problematic for models with large latent dimensionality. While some variants are convex [3], they require O(L^2) variational parameters to be optimized, where L is the latent dimensionality. This slows down the optimization.
One solution is to restrict the covariance representation, by naive mean-field [2] or restricted Cholesky [3], but this can result in a considerable loss of accuracy when significant posterior correlations exist. An alternative is to reparameterize the covariance to obtain O(N) parameters, where N is the number of data examples [17]. However, this destroys convexity and converges slowly [12]. A recent approach called dual variational inference [10] obtains fast convergence while retaining this parameterization, but is applicable only to some models, such as Poisson regression.
In this paper, we propose an approach called decoupled variational Gaussian inference, which extends dual variational inference to a large class of models. Our method relies on the theory of Lagrange multiplier methods. While remaining widely applicable, our approach reduces the number of variational parameters similarly to [17, 10] and converges at rates similar to those of convex methods such as [3]. Our method is similar in spirit to parallel expectation propagation (EP), but has provable convergence guarantees even when likelihoods are not log-concave.

2 The Model

In this paper, we apply our method to Bayesian inference in Latent Gaussian Models (LGMs). This choice is motivated by a large amount of existing work on VG approximations for LGMs [16, 17, 3, 10, 12, 11, 7, 2], and because LGMs include many popular models, such as Gaussian processes, Bayesian regression and classification, Gaussian Markov random fields, and probabilistic PCA. An extensive list of these models is given in Chapter 1 of [9]; we have also included a few examples in the supplementary material.
Given a vector of observations y of length N, LGMs model the dependencies among its components using a latent Gaussian vector z of length L.
The joint distribution is shown below:

p(y, z) = \prod_{n=1}^{N} p_n(y_n | \eta_n) \, p(z), \qquad p(z) := N(z | \mu, \Sigma), \qquad \eta = W z,   (1)

where W is a known real-valued matrix of size N × L, used to define the linear predictors η. Each η_n is used to model the observation y_n using a link function p_n(y_n | \eta_n). The exact form of this function depends on the type of observations; e.g., a Bernoulli-logit distribution can be used for binary data [14, 7] (see the supplementary material for an example). Usually, an exponential-family distribution is used, although there are other choices (such as the T-distribution [8, 17]). The parameter set θ includes {W, µ, Σ} and other parameters of the link function, and is assumed to be known. We suppress θ in our notation for simplicity.
In Bayesian inference, we wish to compute expectations with respect to the posterior distribution p(z | y), shown below. Another important task is the computation of the marginal likelihood p(y), which can be maximized to estimate the parameters θ, for example using empirical Bayes [18]:

p(z | y) \propto \prod_{n=1}^{N} p(y_n | \eta_n) \, N(z | \mu, \Sigma), \qquad p(y) = \int \prod_{n=1}^{N} p(y_n | \eta_n) \, N(z | \mu, \Sigma) \, dz.   (2)

For non-Gaussian likelihoods, both of these tasks are intractable. Applications in practice demand good approximations that scale favorably in N and L.

3 VG Inference by Lower Bound Maximization

In variational Gaussian (VG) inference [17], we assume the posterior to be a Gaussian q(z) = N(z | m, V). The posterior mean m and covariance V form the set of variational parameters, and are chosen to maximize the variational lower bound to the log marginal likelihood shown in Eq.
(3). To get this lower bound, we first multiply and divide by q(z) and then apply Jensen's inequality using the concavity of the log:

\log p(y) = \log \int q(z) \, \frac{\prod_n p(y_n | \eta_n) \, p(z)}{q(z)} \, dz \ge E_{q(z)} \left[ \log \frac{\prod_n p(y_n | \eta_n) \, p(z)}{q(z)} \right].   (3)

The simplified lower bound is shown in Eq. (4). The detailed derivation can be found in Eqs. (4)–(7) of [11] (and in the supplementary material). Below, we provide a summary of its components.

\max_{m, V \succ 0} \; -D[q(z) \,\|\, p(z)] - \sum_{n=1}^{N} f_n(\bar{m}_n, \bar{\sigma}_n), \qquad f_n(\bar{m}_n, \bar{\sigma}_n) := E_{N(\eta_n | \bar{m}_n, \bar{\sigma}_n^2)}[-\log p(y_n | \eta_n)].   (4)

The first term is the KL divergence D[q \| p] = E_q[\log q(z) - \log p(z)], whose negative is jointly concave in (m, V). The second term sums over data examples, where each term f_n is the expectation of -\log p(y_n | \eta_n) with respect to \eta_n. Since \eta_n = w_n^T z, it follows a Gaussian distribution q(\eta_n) = N(\bar{m}_n, \bar{\sigma}_n^2) with mean \bar{m}_n = w_n^T m and variance \bar{\sigma}_n^2 = w_n^T V w_n. The terms f_n are not always available in closed form, but can be computed using quadrature or look-up tables [14]. Note that, unlike many other methods such as [2, 11, 10, 7, 21], we do not bound or approximate these terms; such approximations lead to loss of accuracy.
We denote the lower bound of Eq. (3) by f and expand it below in Eq. (5):

f(m, V) := \tfrac{1}{2} \left[ \log |V| - \mathrm{Tr}(V \Sigma^{-1}) - (m - \mu)^T \Sigma^{-1} (m - \mu) + L \right] - \sum_{n=1}^{N} f_n(\bar{m}_n, \bar{\sigma}_n).   (5)

Here |V| denotes the determinant of V. We now discuss existing methods and their pros and cons.

3.1 Related Work

A straightforward approach is to optimize Eq. (5) directly in (m, V) [2, 3, 14, 11]. In practice,
direct methods are slow and memory-intensive because of the very large number, L + L(L + 1)/2, of variables. Challis and Barber [3] show that for log-concave likelihoods p(y_n | \eta_n), the original problem of Eq. (4) is jointly concave in m and the Cholesky factor of V. This fact, however, does not result in any reduction in the number of parameters, and they propose to use factorizations of a restricted form, which negatively affects the approximation accuracy.
[17] and [16] note that the optimal V^* must be of the form V^* = [\Sigma^{-1} + W^T \mathrm{diag}(\lambda) W]^{-1}, which suggests reparameterizing Eq. (5) in terms of L + N parameters (m, \lambda), where \lambda is the new variable. However, the problem is not concave in this alternative parameterization [12]. Moreover, as shown in [12] and [10], convergence can be exceedingly slow. The coordinate-ascent approach of [12] and dual variational inference [10] both speed up convergence, but only for a limited class of models.
A range of other deterministic inference approximations exists as well. The local variational method is convex for log-concave potentials and can be solved at very large scales [23], but applies only to models with super-Gaussian likelihoods. The bound it maximizes is provably less tight than Eq. (4) [22, 3], making it less accurate. Expectation propagation (EP) [15, 21] is more general and can be more accurate than most other approximations mentioned here. However, it is based on a saddle point rather than an optimization problem, and the standard EP algorithm does not always converge and can be numerically unstable. Among these alternatives, the variational Gaussian approximation stands out as a compromise between accuracy and good algorithmic properties.

4 Decoupled Variational Gaussian Inference using a Lagrangian

We simplify the form of the objective function by decoupling the KL divergence term from the terms involving f_n.
In other words, we separate the prior distribution from the likelihoods. We do so by introducing real-valued auxiliary variables h_n and \sigma_n > 0 such that the following constraints hold: h_n = \bar{m}_n and \sigma_n = \bar{\sigma}_n. This gives us the following (equivalent) optimization problem over x := {m, V, h, \sigma}:

\max_x \; g(x) := \tfrac{1}{2} \left[ \log |V| - \mathrm{Tr}(V \Sigma^{-1}) - (m - \mu)^T \Sigma^{-1} (m - \mu) + L \right] - \sum_{n=1}^{N} f_n(h_n, \sigma_n)   (6)

subject to the constraints c^1_n(x) := h_n - w_n^T m = 0 and c^2_n(x) := \tfrac{1}{2}(\sigma_n^2 - w_n^T V w_n) = 0 for all n.
For log-concave likelihoods, the function g(x) is concave in V, unlike the original function f (see Eq. (5)), which is concave with respect to the Cholesky factor of V. The difficulty now lies with the non-linear constraints c^2_n(x). We will now establish that the new problem gives rise to a convenient parameterization, but does not affect the maximum.
The significance of this reformulation lies in its Lagrangian, shown below:

L(x, \alpha, \lambda) := g(x) + \sum_{n=1}^{N} \alpha_n (h_n - w_n^T m) + \tfrac{1}{2} \lambda_n (\sigma_n^2 - w_n^T V w_n).   (7)

Here, \alpha_n and \lambda_n are Lagrange multipliers for the constraints c^1_n(x) and c^2_n(x). We will now show that the maximum of f of Eq. (5) can be parameterized in terms of these multipliers, and that this reparameterization is unique. The following theorem states this result along with three other useful relationships between the maxima of Eqs. (5), (6), and (7). The proof is in the supplementary material.
Theorem 4.1. The following holds for the maxima of Eqs. (5), (6), and (7):

1. A stationary point x^* of Eq. (6) will also be a stationary point of Eq. (5).
For every such stationary point x^*, there exist unique \alpha^* and \lambda^* such that

V^* = [\Sigma^{-1} + W^T \mathrm{diag}(\lambda^*) W]^{-1}, \qquad m^* = \mu - \Sigma W^T \alpha^*.   (8)

2. The \alpha^*_n and \lambda^*_n depend on the gradients of the function f_n and satisfy the conditions

\nabla_{h_n} f_n(h^*_n, \sigma^*_n) = \alpha^*_n, \qquad \nabla_{\sigma_n} f_n(h^*_n, \sigma^*_n) = \sigma^*_n \lambda^*_n,   (9)

where h^*_n = w_n^T m^* and (\sigma^*_n)^2 = w_n^T V^* w_n for all n, and \nabla_x f(x^*) denotes the gradient of f(x) with respect to x at x = x^*.

3. When {m^*, V^*} is a local maximizer of Eq. (5), the set {m^*, V^*, h^*, \sigma^*, \alpha^*, \lambda^*} is a strict maximizer of Eq. (7).

4. When the likelihoods p(y_n | \eta_n) are log-concave, there is only one global maximum of f, and any {m^*, V^*} obtained by maximizing Eq. (7) will be the global maximizer of Eq. (5).

Part 1 establishes the parameterization of (m^*, V^*) by (\alpha^*, \lambda^*) and its uniqueness, while part 2 gives the conditions that (\alpha^*, \lambda^*) satisfy. This form has also been used in [12] for Gaussian processes, where a fixed-point iteration was employed to search for \lambda^*. Part 3 shows that such a parameterization is obtained at maxima of the Lagrangian rather than at minima or saddle points. The final part considers the case when f is concave and shows that the global maximum can be obtained by maximizing the Lagrangian. Note that concavity of the lower bound is required only for the last part; the other three parts hold irrespective of concavity.
Note that the conditions of Eq.
(9) restrict the values that \alpha^*_n and \lambda^*_n can take: their values are valid only within the range of the gradients of f_n. This is unlike the formulation of [17], which does not constrain these variables, but is similar to the method of [10]. We will see later that our algorithm makes the problem infeasible for values outside this range. The ranges of these variables vary with the likelihood p(y_n | \eta_n); however, we show below in Eq. (10) that \lambda^*_n is always strictly positive for log-concave likelihoods. The first equality is obtained using Eq. (9), the second equality is simply a change of variables from \sigma_n to \sigma_n^2, the third equality is obtained using Eq. (19) of [17], and the final inequality holds since f_n is convex for all log-concave likelihoods (\nabla^2_{xx} f(x) denotes the Hessian of f(x)):

\lambda^*_n = (\sigma^*_n)^{-1} \nabla_{\sigma_n} f_n(h^*_n, \sigma^*_n) = 2 \nabla_{\sigma_n^2} f_n(h^*_n, \sigma^*_n) = \nabla^2_{h_n h_n} f_n(h^*_n, \sigma^*_n) > 0.   (10)

5 Optimization Algorithms for Decoupled Variational Gaussian Inference

Theorem 4.1 suggests that the optimal solution can be obtained by maximizing g(x) or the Lagrangian L. The maximization is difficult for two reasons: first, the constraints c^2_n(x) are non-linear, and second, the function g(x) may not always be concave. Note that it is not easy to apply the augmented Lagrangian method or first-order methods (see Chapter 4 of [1]) because their application would require storage of V. Instead, we use a method based on linearization of the constraints, which avoids explicit computation and storage of V. First, we will show that when g(x) is concave, we can maximize it by minimizing a sequence of convex problems.
We will then solve each convex problem using the dual variational method of [10].

5.1 Linearly Constrained Lagrangian (LCL) Method

We now derive an algorithm based on the linearly constrained Lagrangian (LCL) method [19]. The LCL approach involves linearization of the non-linear constraints and is an effective method for large-scale optimization, used e.g. in packages such as MINOS [24]. There are variants of this method that are globally convergent and robust [4], but we use the variant described in Chapter 17 of [24].
The final algorithm: See Algorithm 1. We start with initial \alpha, \lambda, and \sigma. At every iteration k, we minimize the following dual:

\min_{\alpha, \lambda \in S} \; -\tfrac{1}{2} \log |\Sigma^{-1} + W^T \mathrm{diag}(\lambda) W| + \tfrac{1}{2} \alpha^T \tilde{\Sigma} \alpha - \tilde{\mu}^T \alpha + \sum_{n=1}^{N} f^{k*}_n(\alpha_n, \lambda_n)   (11)

Here, \tilde{\Sigma} = W \Sigma W^T and \tilde{\mu} = W \mu. The functions f^{k*}_n are obtained as follows:

f^{k*}_n(\alpha_n, \lambda_n) := \max_{h_n, \sigma_n > 0} \; -f_n(h_n, \sigma_n) + \alpha_n h_n + \tfrac{1}{2} \lambda_n \sigma^k_n (2\sigma_n - \sigma^k_n) - \tfrac{1}{2} \lambda^k_n (\sigma_n - \sigma^k_n)^2   (12)

where \lambda^k_n and \sigma^k_n were obtained at the previous iteration.

Algorithm 1 Linearly constrained Lagrangian (LCL) method for VG approximation
  Initialize \alpha, \lambda \in S and \sigma \succ 0.
  for k = 1, 2, 3, ... do
    \lambda^k \leftarrow \lambda and \sigma^k \leftarrow \sigma.
    repeat
      For all n, compute the predictive means \hat{m}^*_n and variances \hat{v}^*_n using linear regression (Eq.
(13)).
      For all n, in parallel, compute the (h^*_n, \sigma^*_n) that maximize Eq. (12).
      Find the next (\alpha, \lambda) \in S using the gradients g^\alpha_n = h^*_n - \hat{m}^*_n and g^\lambda_n = \tfrac{1}{2}[-(\sigma^k_n)^2 + 2\sigma^k_n \sigma^*_n - \hat{v}^*_n].
    until convergence
  end for

The constraint set S places box constraints on \alpha_n and \lambda_n such that a global minimum of Eq. (12) exists. We show some examples later in this section.
Efficient gradient computation: An advantage of this approach is that the gradient at each iteration can be computed efficiently, especially for large N and L. The gradient computation decouples into two terms. The first term can be computed by evaluating the f^{k*}_n in parallel, while the second term involves prediction in a linear model. The gradients with respect to \alpha_n and \lambda_n (derived in the supplementary material) are given by g^\alpha_n = h^*_n - \hat{m}^*_n and g^\lambda_n = \tfrac{1}{2}[-(\sigma^k_n)^2 + 2\sigma^k_n \sigma^*_n - \hat{v}^*_n], where (h^*_n, \sigma^*_n) are the maximizers of Eq. (12) and \hat{v}^*_n and \hat{m}^*_n are computed as follows:

\hat{v}^*_n := w_n^T V^* w_n = w_n^T (\Sigma^{-1} + W^T \mathrm{diag}(\lambda) W)^{-1} w_n = \tilde{\Sigma}_{nn} - \tilde{\Sigma}_{n,:} (\tilde{\Sigma} + \mathrm{diag}(\lambda)^{-1})^{-1} \tilde{\Sigma}_{n,:}^T
\hat{m}^*_n := w_n^T m^* = w_n^T (\mu - \Sigma W^T \alpha) = \tilde{\mu}_n - \tilde{\Sigma}_{n,:} \alpha   (13)

The quantities (h^*_n, \sigma^*_n) can be computed in parallel over all n. Sometimes this can be done in closed form (as we show in the next section); otherwise, we can compute them by numerically optimizing two-dimensional functions. Since these problems are only two-dimensional, a Newton method can easily be implemented to obtain fast convergence.
The other two terms, \hat{v}^*_n and \hat{m}^*_n, can be interpreted as the predictive variances and means of a pseudo linear model; e.g., compare Eq. (13) with Eqs. 2.25 and 2.26 of Rasmussen's book [18].
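As a concrete illustration, the identities in Eq. (13) can be sketched in a few lines of NumPy. This is a hypothetical helper, not the paper's code; `projected_posterior` and all variable names are our own. Only N × N solves are used, so the L × L covariance V^* is never formed:

```python
import numpy as np

def projected_posterior(W, mu, Sigma, alpha, lam):
    """Predictive means/variances of Eq. (13) without forming V*.

    m_hat_n = mu_tilde_n - Sigma_tilde[n, :] @ alpha
    v_hat_n = Sigma_tilde[n, n]
              - Sigma_tilde[n, :] @ inv(Sigma_tilde + diag(1/lam)) @ Sigma_tilde[:, n]
    """
    St = W @ Sigma @ W.T            # Sigma_tilde, N x N
    mt = W @ mu                     # mu_tilde
    m_hat = mt - St @ alpha
    K = St + np.diag(1.0 / lam)     # requires lam > 0 (log-concave case)
    # diagonal of St @ inv(K) @ St, computed with one N x N solve
    v_hat = np.diag(St) - np.einsum('ij,ji->i', St, np.linalg.solve(K, St))
    return m_hat, v_hat

# sanity check against the direct O(L^3) formulas on a small toy problem
rng = np.random.default_rng(0)
N, L = 4, 6
W = rng.standard_normal((N, L))
A = rng.standard_normal((L, L)); Sigma = A @ A.T + L * np.eye(L)
mu = rng.standard_normal(L)
alpha = rng.standard_normal(N); lam = rng.uniform(0.5, 2.0, N)

m_hat, v_hat = projected_posterior(W, mu, Sigma, alpha, lam)
V = np.linalg.inv(np.linalg.inv(Sigma) + W.T @ np.diag(lam) @ W)   # Eq. (8)
m = mu - Sigma @ W.T @ alpha                                        # Eq. (8)
assert np.allclose(m_hat, W @ m)
assert np.allclose(v_hat, np.diag(W @ V @ W.T))
```

The second form of \hat{v}^*_n follows from the Woodbury identity, which is why the check against the direct inversion passes.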
Hence, every gradient computation can be expressed as Bayesian prediction in a linear model, for which we can use existing implementations. For example, for binary or multi-class GP classification, we can reuse efficient implementations of GP regression. In general, we can use Bayesian inference in a conjugate model to compute the gradient of a non-conjugate model. This way, the method also avoids forming V^* and works only with its linear projections, which can be computed efficiently using vector-matrix-vector products.
The "decoupling" nature of our algorithm should now be clear. The non-linear computations, which depend on the data, are done in parallel to compute h^*_n and \sigma^*_n. These are completely decoupled from the linear computations for \hat{m}_n and \hat{v}_n. This is summarized in Algorithm 1.
Derivation: To derive the algorithm, we first linearize the constraints. Given the multipliers \lambda^k and a point x^k at the k'th iteration, we linearize the constraints c^2_n(x):

\bar{c}^2_{nk}(x) := c^2_n(x^k) + \nabla c^2_n(x^k)^T (x - x^k)   (14)
 = \tfrac{1}{2} [(\sigma^k_n)^2 - w_n^T V^k w_n + 2\sigma^k_n (\sigma_n - \sigma^k_n) - (w_n^T V w_n - w_n^T V^k w_n)]   (15)
 = -\tfrac{1}{2} [(\sigma^k_n)^2 - 2\sigma^k_n \sigma_n + w_n^T V w_n]   (16)

Since we want the linearized constraint \bar{c}^2_{nk}(x) to be close to the original constraint c^2_n(x), we penalize the difference between the two:

c^2_n(x) - \bar{c}^2_{nk}(x) = \tfrac{1}{2} \{ \sigma_n^2 - w_n^T V w_n - [-(\sigma^k_n)^2 + 2\sigma^k_n \sigma_n - w_n^T V w_n] \} = \tfrac{1}{2} (\sigma_n - \sigma^k_n)^2.   (17)

The key point is that this term is independent of V, allowing us to obtain a closed-form solution for V^*.
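The independence of V in Eq. (17) can also be checked numerically. The following standalone sketch (ours, not the paper's code) draws an arbitrary V ≻ 0 and confirms that the gap between c^2_n and its linearization is exactly ½(σ_n − σ^k_n)², with the w^T V w terms cancelling:

```python
import numpy as np

# Numerical check of Eq. (17): the gap between the true constraint c2_n and its
# linearization involves no V term, reducing to (1/2)(sigma_n - sigma_n^k)^2.
rng = np.random.default_rng(1)
L = 5
w = rng.standard_normal(L)
A = rng.standard_normal((L, L)); V = A @ A.T + np.eye(L)   # any V > 0 works
sigma, sigma_k = 1.7, 0.9

c2 = 0.5 * (sigma**2 - w @ V @ w)                            # c2_n(x), Eq. (6)
c2_lin = -0.5 * (sigma_k**2 - 2*sigma_k*sigma + w @ V @ w)   # Eq. (16)
assert np.isclose(c2 - c2_lin, 0.5 * (sigma - sigma_k)**2)
```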
This will also be crucial for the extension to the non-concave case in the next section.

The k'th subproblem is defined with the linearized constraints and the penalization term:

\max_x \; g^k(x) := g(x) - \sum_{n=1}^{N} \tfrac{1}{2} \lambda^k_n (\sigma_n - \sigma^k_n)^2
subject to h_n - w_n^T m = 0 and -\tfrac{1}{2} [(\sigma^k_n)^2 - 2\sigma^k_n \sigma_n + w_n^T V w_n] = 0, \forall n.   (18)

This is a concave problem with linear constraints and can be optimized using dual variational inference [10]. A detailed derivation is given in the supplementary material.
Convergence: When the LCL algorithm converges, it converges at a quadratic rate [19]. However, it may not always converge. Globally convergent variants do exist (e.g. [4]), although we do not explore them in this paper. Below, we present a simple approach that improves convergence for non-log-concave likelihoods.
Augmented Lagrangian method for non-log-concave likelihoods: When the likelihoods p(y_n | \eta_n) are not log-concave, the lower bound can have multiple local optima, making the optimization of f(m, V) difficult. In such scenarios, the algorithm may not converge for all starting values. Convergence can be improved in such cases by adding an augmented Lagrangian term [\bar{c}^2_{nk}(x)]^2 to the linearly constrained Lagrangian defined in Eq. (18), as shown below [24]. Here \delta^k_i > 0 and i indexes the iterations of the k'th subproblem:

\max_x \; g^k_{\mathrm{aug}}(x) := g(x) - \sum_{n=1}^{N} \left[ \tfrac{1}{2} \lambda^k_n (\sigma_n - \sigma^k_n)^2 + \tfrac{1}{2} \delta^k_i (\sigma_n - \sigma^k_n)^4 \right]   (19)

subject to the same constraints as Eq. (18).
The sequence \delta^k_i can either be set to a constant or be increased slowly to ensure convergence to a local maximum. More details on setting this sequence and its effect on convergence can be found in Chapter 4.2 of [1].
It is in fact possible to choose the value of \delta^k_i such that the algorithm always converges. This value can be set by examining the primal function (a function of the deviations in the constraints): it should be set larger than the largest eigenvalue of the Hessian of the primal function at 0. A good discussion of this can be found in Chapter 4.2 of [1].
The fact that the linearized constraint \bar{c}^2_{nk}(x) does not depend on V is very useful here, since the addition of this term then only affects the computation of f^{k*}_n. We modify the algorithm by simply changing that computation to the optimization of the following function:

\max_{h_n, \sigma_n > 0} \; -f_n(h_n, \sigma_n) + \alpha_n h_n + \tfrac{1}{2} \lambda_n \sigma^k_n (2\sigma_n - \sigma^k_n) - \tfrac{1}{2} \lambda^k_n (\sigma_n - \sigma^k_n)^2 - \tfrac{1}{2} \delta^k_i (\sigma_n - \sigma^k_n)^4   (20)

It is clear from this that the augmented Lagrangian term tries to "convexify" the non-convex function f_n, leading to improved convergence.
Computation of f^{k*}_n(\alpha_n, \lambda_n): These functions are obtained by solving the optimization problem shown in Eq. (12). In some cases, we can compute them in closed form. For example, as shown in the supplementary material, we can compute h^* and \sigma^* in closed form for the Poisson likelihood, as shown below; we also give the range S of \alpha_n and \lambda_n for which f^{k*}_n is finite:

\sigma^*_n = \frac{\lambda_n + \lambda^k_n}{y_n + \alpha_n + \lambda^k_n} \, \sigma^k_n, \qquad h^*_n = -\tfrac{1}{2} (\sigma^*_n)^2 + \log(y_n + \alpha_n), \qquad S = \{\alpha_n > -y_n, \; \lambda_n > 0, \; \forall n\}.   (21)

An expression for the Laplace likelihood is also derived in the supplementary material.
When we do not have a closed-form expression for f^{k*}_n, we can use a 2-D Newton method for the optimization. To facilitate convergence, we must warm-start the optimization.
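For instance, the Poisson closed form of Eq. (21) can be checked against a direct evaluation of the subproblem in Eq. (12). The sketch below assumes the standard log-link Poisson term f_n(h, σ) = −y_n h + exp(h + σ²/2) (constants dropped); this choice of f_n and all function names are our assumptions for illustration, not code from the paper:

```python
import numpy as np

def poisson_inner(y, alpha, lam, lam_k, sigma_k):
    """Closed-form maximizer of the Eq. (12) subproblem, Poisson case (Eq. (21))."""
    sigma = sigma_k * (lam + lam_k) / (y + alpha + lam_k)
    h = np.log(y + alpha) - 0.5 * sigma**2
    return h, sigma

def inner_objective(h, sigma, y, alpha, lam, lam_k, sigma_k):
    """Objective of Eq. (12) with the assumed Poisson f_n(h, s) = -y*h + exp(h + s^2/2)."""
    f = -y * h + np.exp(h + 0.5 * sigma**2)
    return (-f + alpha * h
            + 0.5 * lam * sigma_k * (2*sigma - sigma_k)
            - 0.5 * lam_k * (sigma - sigma_k)**2)

# the closed form should dominate nearby perturbations (needs alpha > -y, lam > 0)
y, alpha, lam, lam_k, sigma_k = 3.0, 0.5, 1.2, 0.8, 0.6
h_s, s_s = poisson_inner(y, alpha, lam, lam_k, sigma_k)
best = inner_objective(h_s, s_s, y, alpha, lam, lam_k, sigma_k)
for dh in (-0.05, 0.05):
    for ds in (-0.05, 0.05):
        assert inner_objective(h_s + dh, s_s + ds, y, alpha, lam, lam_k, sigma_k) <= best
```

Because this objective is concave in (h, σ) for the Poisson term, the stationary point of Eq. (21) is the global maximizer, which is what the perturbation check exercises.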
When f_n is concave, this usually converges in a few iterations, and since we can parallelize over n, a significant speed-up can be obtained. A significant engineering effort is required for this parallelization, and we have not done so for the experiments in this paper.
An issue that remains open is the evaluation of the range S for which each f^{k*}_n is finite. For now, we simply set it to the range of the gradients of f_n, as shown in Eq. (9) (see also the last paragraph of that section). It is not clear whether this always assures convergence of the 2-D optimization.
Prediction: Given \alpha^* and \lambda^*, we can compute predictions using equations similar to those of GP regression; see Rasmussen's book [18] for details.

6 Results

We demonstrate the advantages of our approach on a binary GP classification problem. We model the binary data using Bernoulli-logit likelihoods. The functions f_n are computed to reasonable accuracy using the piecewise bound [14] with 20 pieces.
We apply this model to a subproblem of the USPS digit data [18], where the task is to classify 3's vs. 5's. There are a total of 1540 data examples with feature dimensionality 256. Since we want to compare convergence, we show results for different data sizes obtained by subsampling randomly from these examples.
We set µ = 0 and use a squared-exponential kernel, for which the (i, j)'th entry of Σ is defined as \Sigma_{ij} = \sigma^2 \exp[-\tfrac{1}{2} \|x_i - x_j\|^2 / s], where x_i is the i'th feature vector. We show results for log(σ) = 4 and log(s) = −1, which corresponds to a difficult case where VG approximations converge slowly (due to the ill-conditioning of the kernel) [18]. Our conclusions hold for other parameter settings as well.
We compare our algorithm with the approaches of Opper and Archambeau [17] and Challis and Barber [3].
We refer to them as 'Opper' and 'Cholesky', respectively, and call our approach 'Decoupled'. For all methods, we use the L-BFGS method for optimization (implemented in minFunc by Mark Schmidt), since a Newton method would be too expensive for large N. All algorithms were stopped when the subsequent change in the lower-bound value of Eq. (5) was less than 10^{-4}. All methods were randomly initialized; our results are not sensitive to the initialization. We compare convergence in terms of the value of the lower bound. The prediction errors show a very similar trend, and therefore we do not present them.
The results are summarized in Figure 1. Each plot shows the negative of the lower bound vs. time in seconds for increasing data sizes N = 200, 500, 1000, and 1500. For Opper and Cholesky, we show markers for every iteration. For Decoupled, we show markers after the completion of each subproblem. The result of the first subproblem is not visible here; the first visible marker is obtained from the second subproblem onwards.
We see that as the data size increases, Decoupled converges faster than the other methods, showing a clear advantage for large dimensionality.

7 Discussion and Future Work

In this paper, we proposed the decoupled VG inference method for approximate Bayesian inference. We obtain an efficient reparameterization using a Lagrangian of the lower bound. We showed that such a parameterization is unique, even for non-log-concave likelihood functions, and that the maximum of the lower bound can be obtained by maximizing the Lagrangian. For concave likelihood functions, our method recovers the global maximum. We proposed a linearly constrained Lagrangian method to maximize the Lagrangian. The algorithm has the desired property that it reduces each gradient computation to a linear-model computation, while parallelizing the non-linear computations over data examples.
Our proposed algorithm is capable of attaining convergence rates similar to those of convex methods.
Unlike methods such as the mean-field approximation, our method preserves all posterior correlations and can be useful for generalizing stochastic variational inference (SVI) methods [5] to non-conjugate models. Existing SVI methods rely on mean-field approximations and are widely applied to conjugate models. Under our method, we can stochastically include only a few constraints while maximizing the Lagrangian. This amounts to a low-rank approximation of the covariance matrix and can be used to construct an unbiased estimate of the gradient.
We have focused only on latent Gaussian models for simplicity. It is easy to extend our approach to other non-Gaussian latent models, e.g. the sparse Bayesian linear model [21] and Bayesian non-negative matrix factorization [20]. A similar decoupling method can also be applied to general latent variable models. Note that a proper choice of posterior distribution is required to get an efficient parameterization of the posterior.
It is also possible to obtain sparse posterior covariance approximations using our decoupled formulation. One possible idea is to use a hinge-type loss to approximate the likelihood terms; a dualization similar to the one shown here would then give a sparse posterior covariance.

Figure 1: Convergence results for GP classification on the USPS-3vs5 data set. Each plot shows the negative of the lower bound vs. time in seconds for data sizes N = 200, 500, 1000, and 1500. For Opper and Cholesky, we show markers for every iteration. For Decoupled, we show markers after the completion of each subproblem. The result of the first subproblem is not visible here; the first visible marker is obtained from the second subproblem.
As the data size increases, Decoupled
converges faster, showing a clear advantage over the other methods for large dimensionality.

A weakness of our paper is the lack of strong experiments showing that the decoupled method indeed
converges at a fast rate. The implementation of the decoupled method requires a good engineering effort
for it to scale to big data. In the future, we plan to build an efficient implementation of this method and
demonstrate that it enables variational inference to scale to large data.

Acknowledgments

This work was supported by the School of Computer Science and Communication at EPFL. I would
specifically like to thank Matthias Grossglauser, Rüdiger Urbanke, and James Larus for providing me
support and funding during this work. I would like to personally thank Volkan Cevher, Quoc Tran-
Dinh, and Matthias Seeger from EPFL for early discussions of this work, and Marc Desgroseilliers
from EPFL for checking some proofs.
I would also like to thank the reviewers for their valuable feedback. The experiments in this paper
are less extensive than what I promised them. Due to time and space constraints, I have not been
able to add all of them. More experiments will appear in an arXiv version of this paper.

References
[1] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[2] M. Braun and J. McAuliffe. Variational inference for large-scale models of discrete choice. Journal of
the American Statistical Association, 105(489):324–335, 2010.
[3] E. Challis and D. Barber.
Concave Gaussian variational approximations for inference in large-scale
Bayesian linear models. In International Conference on Artificial Intelligence and Statistics, 2011.
[4] Michael P. Friedlander and Michael A. Saunders. A globally convergent linearly constrained Lagrangian
method for nonlinear optimization. SIAM Journal on Optimization, 15(3):863–897, 2005.
[5] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine
Learning Research, 14:1303–1347, 2013.
[6] A. Honkela, T. Raiko, M. Kuusela, M. Tornio, and J. Karhunen. Approximate Riemannian conjugate
gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235–
3268, 2011.
[7] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression problems and their
extensions. In International Conference on Artificial Intelligence and Statistics, 1996.
[8] P. Jylänki, J. Vanhatalo, and A. Vehtari. Robust Gaussian process regression with a Student-t likelihood.
Journal of Machine Learning Research, 12:3227–3257, 2011.
[9] Mohammad Emtiyaz Khan. Variational Learning for Latent Gaussian Models of Discrete Data. PhD
thesis, University of British Columbia, 2012.
[10] Mohammad Emtiyaz Khan, Aleksandr Y. Aravkin, Michael P. Friedlander, and Matthias Seeger. Fast
dual variational inference for non-conjugate latent Gaussian models. In International Conference on
Machine Learning, volume 28 of JMLR Proceedings, pages 951–959, 2013.
[11] Mohammad Emtiyaz Khan, Shakir Mohamed, Benjamin Marlin, and Kevin Murphy. A stick-breaking
likelihood for categorical data analysis with latent Gaussian models. In International Conference on
Artificial Intelligence and Statistics, 2012.
[12] Mohammad Emtiyaz Khan, Shakir Mohamed, and Kevin Murphy. Fast Bayesian inference for non-
conjugate Gaussian process regression.
In Advances in Neural Information Processing Systems, 2012.
[13] M. Lázaro-Gredilla and M. Titsias. Variational heteroscedastic Gaussian process regression. In Interna-
tional Conference on Machine Learning, 2011.
[14] B. Marlin, M. Khan, and K. Murphy. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian
models. In International Conference on Machine Learning, 2011.
[15] T. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Conference
on Uncertainty in Artificial Intelligence, 2001.
[16] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of
Machine Learning Research, 9(10), 2008.
[17] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation,
21(3):786–792, 2009.
[18] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT
Press, 2006.
[19] Stephen M. Robinson. A quadratically-convergent algorithm for general nonlinear programming prob-
lems. Mathematical Programming, 3(1):145–156, 1972.
[20] Mikkel N. Schmidt, Ole Winther, and Lars Kai Hansen. Bayesian non-negative matrix factorization. In
Independent Component Analysis and Signal Separation, pages 540–547. Springer, 2009.
[21] M. Seeger. Bayesian inference and optimal design in the sparse linear model. Journal of Machine
Learning Research, 9:759–813, 2008.
[22] M. Seeger. Sparse linear models: Variational approximate inference and Bayesian experimental design.
Journal of Physics: Conference Series, 197(012001), 2009.
[23] M. Seeger and H. Nickisch. Large scale Bayesian inference and experimental design for sparse linear
models. SIAM Journal on Imaging Sciences, 4(1):166–199, 2011.
[24] S. J. Wright and J. Nocedal. Numerical Optimization, volume 2.
Springer New York, 1999.