{"title": "Minimization of Continuous Bethe Approximations: A Positive Variation", "book": "Advances in Neural Information Processing Systems", "page_first": 2564, "page_last": 2572, "abstract": "We develop convergent minimization algorithms for Bethe variational approximations which explicitly constrain marginal estimates to families of valid distributions. While existing message passing algorithms define fixed point iterations corresponding to stationary points of the Bethe free energy, their greedy dynamics do not distinguish between local minima and maxima, and can fail to converge. For continuous estimation problems, this instability is linked to the creation of invalid marginal estimates, such as Gaussians with negative variance. Conversely, our approach leverages multiplier methods with well-understood convergence properties, and uses bound projection methods to ensure that marginal approximations are valid at all iterations. We derive general algorithms for discrete and Gaussian pairwise Markov random fields, showing improvements over standard loopy belief propagation. We also apply our method to a hybrid model with both discrete and continuous variables, showing improvements over expectation propagation.", "full_text": "Minimization of Continuous Bethe Approximations:\n\nA Positive Variation\n\nDepartment of Computer Science, Brown University, Providence, RI\n\nJason L. Pacheco and Erik B. Sudderth\n{pachecoj,sudderth}@cs.brown.edu\n\nAbstract\n\nWe develop convergent minimization algorithms for Bethe variational approxima-\ntions which explicitly constrain marginal estimates to families of valid distribu-\ntions. While existing message passing algorithms de\ufb01ne \ufb01xed point iterations cor-\nresponding to stationary points of the Bethe free energy, their greedy dynamics do\nnot distinguish between local minima and maxima, and can fail to converge. 
For\ncontinuous estimation problems, this instability is linked to the creation of invalid\nmarginal estimates, such as Gaussians with negative variance. Conversely, our\napproach leverages multiplier methods with well-understood convergence properties,\nand uses bound projection methods to ensure that marginal approximations\nare valid at all iterations. We derive general algorithms for discrete and Gaussian\npairwise Markov random fields, showing improvements over standard loopy belief\npropagation. We also apply our method to a hybrid model with both discrete\nand continuous variables, showing improvements over expectation propagation.\n\n1 Introduction\n\nVariational inference algorithms pose probabilistic inference as an optimization over distributions.\nTypically the optimization is formulated by minimizing an objective known as the Gibbs free\nenergy [1]. Variational methods relax an otherwise intractable optimal inference problem by\napproximating the entropy-based objective, and considering appropriately simplified families of\napproximating distributions [2]. Local message passing algorithms offer a computationally efficient method\nfor extremizing variational free energies. Loopy belief propagation (LBP), for example, optimizes\na relaxed objective known as the Bethe free energy [1, 2], which we review in Sec. 2. Expectation\npropagation (EP) [3] is a generalization of LBP which shares the same objective, but optimizes over\na relaxed set of constraints [4] applicable to a broader family of continuous inference problems.\nIn general, neither LBP nor EP is guaranteed to converge. Even in simple continuous models, both\nmethods may produce invalid or degenerate marginal distributions, such as Gaussians\nwith negative variance. 
Such degeneracy typically occurs in classes of models for which convergence\nproperties are poor, and there is evidence that these problems are related [5, 6].\nExtensive work has gone into developing algorithms which improve on LBP for models with discrete\nvariables, for example by bounding [7, 8] or convexifying [9] the free energy objective. Gradient\noptimization methods have been applied successfully to binary Ising models [10], but when applied\nto Gaussian models this approach suffers from non-convergence and degeneracy issues similar to those of LBP.\nWork on optimization of continuous variational free energies has primarily focused on addressing\nconvergence problems [11]. None of these approaches directly addresses degeneracy in the continuous\ncase, and computation may be prohibitively expensive for these direct minimization schemes.\nBy leveraging gradient projection methods from the extensive literature on constrained nonlinear\noptimization, we develop an algorithm which ensures that marginal estimates remain valid and\nnormalizable at all iterations. In doing so, we account for important constraints which have been ignored\nby previous variational derivations of the expectation propagation algorithm [12, 6, 11]. Moreover,\nby adapting the method of multipliers [13], we guarantee that our inference algorithm converges for\nmost models of practical interest.\nWe begin by introducing the Bethe variational problem (Sec. 2). We briefly review the correspondence\nbetween the Lagrangian formalism and message passing and discuss implicit normalizability\nassumptions which, when violated, lead to degeneracy in message passing algorithms. We discuss\nthe method of multipliers, gradient projection, and convergence properties (Sec. 3). We then\nprovide derivations (Sec. 4) for discrete MRFs, Gaussian MRFs, and hybrid models with potentials\ndefined by discrete mixtures of Gaussian distributions. Experimental results in Sec. 
5 demonstrate\nsubstantial improvements over baseline message passing algorithms.\n\n2 Bethe Variational Problems\n\nFor simplicity, we restrict our attention to pairwise Markov random fields (MRF) [2], with graphs\nG(V,E) defined by nodes V and undirected edges E. The joint distribution then factorizes as\n\n$$p(x) = \\frac{1}{Z_p} \\prod_{s \\in V} \\phi_s(x_s) \\prod_{(s,t) \\in E} \\phi_{st}(x_s, x_t) \\tag{1}$$\n\nfor some non-negative potential functions \u03d5(\u00b7). Often this distribution is a posterior given fixed\nobservations y, but we suppress this dependence for notational simplicity. We are interested in\ncomputing the log partition function log Zp, and/or the marginal distributions p(xs), s \u2208 V.\nLet q(x; \u00b5) denote an exponential family of densities with sufficient statistics \u03c6(x) \u2208 R^d:\n\n$$q(x; \\mu) \\propto \\exp\\{\\theta^T \\varphi(x)\\}, \\qquad \\mu = E_q[\\varphi(x)]. \\tag{2}$$\n\nTo simplify subsequent algorithm development, we index distributions via their mean parameters \u00b5.\nWe associate each node s \u2208 V with an exponential family qs(xs; \u00b5s), \u03c6s(x) \u2208 R^{d_s}, and each edge\n(s, t) \u2208 E with a family qst(xs, xt; \u00b5st), \u03c6st(x) \u2208 R^{d_{st}}. Because qs(xs; \u00b5s) is a valid probability\ndistribution, \u00b5s must lie in a set of realizable mean parameters, \u00b5s \u2208 Ms. Similarly, \u00b5st \u2208 Mst.\nFor example, Ms and Mst might require Gaussians to have positive semidefinite covariances.\nWe can express the log partition as the solution to an optimization problem,\n\n$$-\\log Z_p = \\min_{\\mu \\in M(G)} E_\\mu[-\\log p(x)] - H[\\mu] = \\min_{\\mu \\in M(G)} F(\\mu), \\tag{3}$$\n\nwhere H[\u00b5] is the entropy of q(x; \u00b5), E\u00b5[\u00b7] denotes expectation with respect to q(x; \u00b5), and F(\u00b5) is\nknown as the variational free energy. 
Mean parameters \u00b5 lie in the marginal polytope M(G) if and\nonly if there exists some valid, joint probability distribution with those moments.\nExactly characterizing M(G) may require exponentially many constraints, so we relax the optimization\nto be over a set of locally consistent marginal distributions L(G), which are properly normalized\nand satisfy expectation constraints associated with each edge of the graph:\n\n$$C_s(\\mu) = 1 - \\int q_s(x_s; \\mu_s) \\, dx_s, \\qquad C_{ts}(\\mu) = \\mu_s - E_{q_{st}}[\\varphi_s(x_s)]. \\tag{4}$$\n\nThis is a relaxation in the sense that M(G) \u2282 L(G) with strict equality if G does not contain cycles.\nWe approximate the entropy H[\u00b5] with the entropy of a tree-structured distribution q(x; \u00b5). Such an\napproximation is tractable and consistent with L(G), and yields the Bethe free energy:\n\n$$F_B(\\mu) = \\sum_{(s,t) \\in E} E_{q_{st}}[\\log q_{st}(x_s, x_t; \\mu_{st}) - \\psi_{st}(x_s, x_t)] - \\sum_{s \\in V} (n_s - 1) E_{q_s}[\\log q_s(x_s; \\mu_s) - \\phi_s(x_s)]. \\tag{5}$$\n\nHere, \u03c8st(\u00b7) = \u03d5st(\u00b7)\u03d5s(\u00b7)\u03d5t(\u00b7), and the mean parameters \u00b5 are valid within the constraint set\n$M = \\bigcup_s M_s \\bigcup_{st} M_{st}$. The resulting objective is the Bethe variational problem (BVP):\n\n$$\\begin{aligned} \\underset{\\mu}{\\text{minimize}} \\quad & F_B(\\mu) \\\\ \\text{subject to} \\quad & C_{ts}(\\mu) = 0, \\; \\forall s \\in V, \\; t \\in N(s) \\\\ & C_s(\\mu) = 0, \\; \\forall s \\in V, \\\\ & \\{\\mu_s : s \\in V\\} \\cup \\{\\mu_{st} : (s,t) \\in E\\} \\in M. \\end{aligned} \\tag{6}$$\n\nHere, N (s) denotes the set of neighbors of node s \u2208 V.\n\n2.1 Correspondence to Message Passing\n\nWe can optimize the BVP (6) by relaxing the normalization and local consistency constraints with\nLagrange multipliers. 
Constrained minima are characterized by stationary points of the Lagrangian,\n\n$$L(x, \\lambda) = F_B(q) + \\sum_s \\lambda_s C_s + \\sum_s \\sum_{t \\in N(s)} \\lambda_{ts} C_{ts}. \\tag{7}$$\n\nEquivalence between LBP fixed points and stationary points of the Lagrangian for the discrete case\nhas been discussed extensively [1, 2]. A similar correspondence has been shown more generally for\nEP fixed points [2, 4]. Since our focus is on the continuous case, we briefly review the correspondence\nbetween Gaussian LBP fixed points and the Gaussian Bethe free energy. For simplicity we\nfocus on zero-mean p(x) = N (x | 0, J^{\u22121}), where diagonal precision entries Jss = As and\n\n$$\\phi_s(x_s) = \\exp\\left\\{ -\\frac{1}{2} x_s^2 A_s \\right\\}, \\qquad \\phi_{st}(x_s, x_t) = \\exp\\left\\{ -\\frac{1}{2} \\begin{pmatrix} x_s & x_t \\end{pmatrix} \\begin{pmatrix} 0 & J_{st} \\\\ J_{st} & 0 \\end{pmatrix} \\begin{pmatrix} x_s \\\\ x_t \\end{pmatrix} \\right\\}.$$\n\nLet\n\n$$q(x_s) = N(x_s \\mid 0, V_s), \\quad q(x_s, x_t) = N\\left( \\begin{pmatrix} x_s \\\\ x_t \\end{pmatrix} \\mid 0, \\Sigma_{st} \\right), \\quad \\Sigma_{st} = \\begin{pmatrix} V_{ts} & P_{st} \\\\ P_{ts} & V_{st} \\end{pmatrix}, \\quad \\tilde{B}_{st} = \\begin{pmatrix} A_s & J_{st} \\\\ J_{st} & A_t \\end{pmatrix}.$$\n\nThe Gaussian Bethe free energy then equals\n\n$$F_{GB}(V, \\Sigma) = \\frac{1}{2} \\sum_{(s,t) \\in E} \\left( \\operatorname{tr}(\\Sigma_{st} \\tilde{B}_{st}) - \\log |\\Sigma_{st}| \\right) - \\sum_{s \\in V} \\left( \\frac{n_s - 1}{2} \\right) (V_s A_s - \\log V_s). \\tag{8}$$\n\nThe locally consistent marginal polytope L(G) consists of the constraints Cts(V ) = Vs \u2212 Vts for\nall nodes s \u2208 V and edges (s, t) \u2208 E. The Lagrangian is given by\n\n$$L(V, \\Sigma, \\lambda) = F_{GB}(V, \\Sigma) + \\sum_s \\sum_{t \\in N(s)} \\lambda_{ts} [V_s - V_{ts}]. 
\\tag{9}$$\n\nTaking the derivative with respect to the node marginal variance, $\\partial L / \\partial V_s = 0$, yields the stationary point\n\n$$V_s^{-1} = A_s + \\frac{1}{n_s - 1} \\sum_{t \\in N(s)} \\lambda_{ts}.$$\n\n
For a Gaussian LBP algorithm with messages parametrized as\n$m_{t \\to s}(x_s) = \\exp\\{ -\\frac{1}{2} x_s^2 \\Lambda_{t \\to s} \\}$, fixed points of the node marginal precision are given by\n\n$$\\Lambda_s = A_s + \\sum_{t \\in N(s)} \\Lambda_{t \\to s}.$$\n\nLet $\\lambda_{ts} = -\\frac{1}{2} \\sum_{a \\in N(s) \\setminus t} \\Lambda_{a \\to s}$. Substituting back into the stationary point conditions yields\n$V_s^{-1} \\Rightarrow \\Lambda_s$. A similar construction holds for the pairwise marginals. Inverting the correspondence\nbetween multipliers and message parameters yields the converse $V_s^{-1} \\Leftarrow \\Lambda_s$ (cf. [4]).\n\n2.2 Message Passing Non-Convergence and Degeneracy\n\nWhile local message passing algorithms are convenient for many applications, their convergence\nis not guaranteed in general. In particular, LBP often fails to converge for networks with tight\nloops [1] such as the 3\u00d73 lattice of Figure 1(a). For non-Gaussian models with continuous variables,\nconvergence of the EP algorithm can be even more problematic [11].\nFor continuous models message updates may yield degenerate, unnormalizable marginal distributions\nwhich do not correspond to stationary points of the Lagrangian. For example, for Gaussian\nMRFs the Bethe free energy F B(\u00b7) in (5) is derived from expectations with respect to variational\ndistributions qs(xs; \u00b5s), qst(xs, xt; \u00b5st). If a set of hypothesized marginals is not normalizable\n(that is, lacks positive variance), the Gaussian Bethe free energy F GB(\u00b7) is invalid and undefined.\nDegenerate marginals arise because the constraint set M is not represented in the Lagrangian (7);\nthis issue is mentioned briefly in [2] but is not dealt with computationally. Figure 1(b) demonstrates\nthis issue for a simple, three-node Gaussian MRF. Here LBP produces marginal variances which\noscillate between impossibly large positive, and nonsensical negative, values. 
Such degeneracies\nare arguably more problematic for EP since its moment matching steps require expected values with\nrespect to an augmented distribution [3], which may involve an unbounded integral.\n\n[Figure 1 panels: (a) Bethe Free Energy vs. Iteration (Belief Propagation, MoM); (b) Variance of Node x1 vs. Iteration # (True, LBP, MoM); (c) Gaussian Bethe Free Energy vs. Variance (V) for \u03c1 \u2208 {\u22120.9, \u22120.87, \u22120.85, \u22120.7, \u22120.5}]\n\nFigure 1: (a) Bethe free energy versus iteration for 3x3 toroidal binary MRF. (b) Node marginal variance\nestimates per iteration for a symmetric, single-cycle Gaussian MRF with three nodes (plot is of x1, other nodes\nare similar). (c) For the model from (b), the Gaussian Bethe free energy is unbounded on the constraint set.\n\n2.3 Unboundedness of the Gaussian Bethe Free Energy\n\nConditions under which the simple LBP and EP updates are guaranteed to be accurate are of great\npractical interest. For Gaussian MRFs, pairwise normalizability is sufficient to\nguarantee LBP stability and convergence [5]. For non-pairwise normalizable models the Gaussian\nBethe free energy is unbounded below [6] on the set of local consistency constraints L(G).\nWe offer a small example consisting of a non-pairwise normalizable symmetric single cycle with 3\nnodes. Diagonal precision elements are Jss = 1.0, and off-diagonal elements Jst = 0.6. We embed\nmarginalization constraints into a symmetric parametrization\n\n$$V_s = V, \\qquad \\Sigma_{st} = \\begin{pmatrix} V & \\rho V \\\\ \\rho V & V \\end{pmatrix}.$$\n\nFeasible solutions within the constraint set are characterized by V > 0 and \u22121 < \u03c1 < 1. 
Substituting\nthis parametrization into the Gaussian free energy (8), and performing some simple algebra, yields\n\n$$F_{GB}(V, \\rho) = -\\frac{3}{2} \\log V + \\frac{3}{2} V (1 + 1.2\\rho) - \\frac{3}{2} \\log(1 - \\rho^2).$$\n\nFor $\\rho < -\\frac{1}{1.2}$ the free energy is unbounded below at rate O(\u2212V ). 
Figure 1(c) illustrates the Bethe\nfree energy for this model as a function of V , and for several values of \u03c1.\nMore generally, it has been shown that Gaussian EP messages are always normalizable (positive\nvariance) for models with log-concave potentials [14]. It has been conjectured, but not proven,\nthat EP is also guaranteed to converge for such models [15]. For Gaussian MRFs, we note that\nthe family of log-concave models coincides with the pairwise normalizability condition. Our work\nseeks to improve inference for non-log-concave models with bounded Bethe free energies.\n\n3 Method of Multipliers\n\nGiven our complete constrained formulation of the Bethe variational problem, we avoid convergence\nand degeneracy problems via direct minimization using the method of multipliers (MoM) [13]. In\ngeneral terms, given some convex feasible region M, consider the equality constrained problem\n\n$$\\underset{x \\in M}{\\text{minimize}} \\; f(x) \\quad \\text{subject to} \\quad h(x) = 0.$$\n\nWith penalty parameter c > 0, we form the augmented Lagrangian function,\n\n$$L_c(x, \\lambda) = f(x) + \\lambda^T h(x) + \\frac{1}{2} c \\|h(x)\\|^2. \\tag{10}$$\n\nGiven a multiplier vector \u03bbk and penalty parameter ck we update the primal and dual variables as\n\n$$x^k = \\arg\\min_{x \\in M} L_{c^k}(x, \\lambda^k), \\qquad \\lambda^{k+1} = \\lambda^k + c^k h(x^k).$$\n\nThe penalty multiplier can be updated as ck+1 \u2265 ck according to some fixed update schedule, or\nbased on the results of the optimization step. 
An update rule that we find useful [13] is to multiply\nthe penalty parameter by a factor \u03b2 > 1 if the constraint violation is not improved by a factor 0 < \u03b3 < 1\nover the previous iteration,\n\n$$c^{k+1} = \\begin{cases} \\beta c^k & \\text{if } \\|h(x^k)\\| > \\gamma \\|h(x^{k-1})\\|, \\\\ c^k & \\text{if } \\|h(x^k)\\| \\le \\gamma \\|h(x^{k-1})\\|. \\end{cases}$$\n\n3.1 Gradient Projection Methods\n\nThe augmented Lagrangian Lc(x, \u03bb) is a partial one, where feasibility of mean parameters (x \u2208 M)\nis enforced explicitly by projection. A simple gradient projection method [13] defines a sequence\n\n$$x^{k+1} = x^k + \\alpha^k (\\bar{x}^k - x^k), \\qquad \\bar{x}^k = [x^k - s^k \\nabla f(x^k)]^+.$$\n\nThe notation [\u00b7]+ denotes a projection onto the constraint set M. After taking a step sk > 0 in the\ndirection of the negative gradient, we project the result onto the constraint set to obtain a feasible\npoint $\\bar{x}^k$. We then compute xk+1 by taking a step \u03b1k \u2208 (0, 1] in the direction of ($\\bar{x}^k - x^k$). If\nxk \u2212 sk\u2207f (xk) is feasible, gradient projection reduces to unconstrained steepest descent.\nThere are multiple such projection steps in each inner-loop iteration of MoM (e.g. each xk update).\nFor our experiments we use a projected quasi-Newton method [16], and step sizes \u03b1k and sk are\nchosen using an Armijo rule [13, Prop. 
2.3.1].\n\n3.2 Convergence of Multiplier Methods\n\nConvergence and rate of convergence results have been proven [17, Proposition 2.4] for the Method\nof Multipliers with a quadratic penalty and multiplier iteration $\\lambda^{k+1} = \\lambda^k + c^k h(x^k)$. The main\nregularity assumptions are that the sequence {\u03bbk} is bounded, and there is a local minimum for\nwhich a Lagrange multiplier pair (x\u2217, \u03bb\u2217) exists satisfying second-order sufficiency conditions, so\nthat $\\nabla_x L_0(x^*, \\lambda^*) = 0$ and $z^T \\nabla^2_{xx} L_0(x^*, \\lambda^*) z > 0$ for all z \u2260 0. It then follows that there exists\nsome $\\bar{c}$ such that for all $c \\ge \\bar{c}$, the augmented Lagrangian also contains a strict local minimum\n$z^T \\nabla^2_{xx} L_c(x^*, \\lambda^*) z > 0$.\nFor convergence, the initialization of the Lagrange multiplier \u03bb0 and penalty parameter c0 must be\nsuch that $\\|\\lambda^0 - \\lambda^*\\| < \\delta c^0$ for some \u03b4 > 0 and $c \\ge \\bar{c}$ which depend on the objective and constraints.\nIn practice, a poor initialization of the multiplier \u03bb0 can often be offset by a sufficiently high c0. A\nfinal technical note is that convergence proofs assume the sequence of unconstrained optimizations\nwhich yield xk stays in the neighborhood of x\u2217 after some k. This does not hold in general, but can\nbe encouraged by warm-starting the unconstrained optimization with the previous xk\u22121.\nTo invoke existing convergence results we must show that a local minimum x\u2217 exists for each of\nthe free energies we consider; a sufficient condition is then that the Bethe free energy is bounded\nfrom below. This property has been previously established for general discrete MRFs [18], for pairwise\nnormalizable Gaussian MRFs [6], and for the clutter model [3]. 
For non-pairwise normalizable\nGaussian MRFs, the example of Section 2.3 shows that the Bethe variational objective is unbounded\nbelow, and further may not contain any local optima. While the method of multipliers does not converge\nin this situation, its non-convergence is due to fundamental flaws in the Bethe approximation.\n\n4 MoM Algorithms for Probabilistic Inference\n\nWe derive MoM algorithms which minimize the Bethe free energy for three different families of\ngraphical models. For each model we define the form of the joint distribution, Bethe free energy (5),\nlocal consistency constraints, augmented Lagrangian, and the gradient projection step. Gradients,\nwhich can be notationally cumbersome, are given in the supplemental material.\n\n4.1 Gaussian Markov Random Fields\n\nWe have already introduced the Lagrangian (9) for the Gaussian MRF. The Gaussian Bethe free\nenergy (8) is always unbounded below off of the constraint set in node marginal variances Vs. We\ncorrect this by adding an additional fixed penalty in the augmented Lagrangian,\n\n$$L_c(V, \\Sigma, \\lambda) = F_{GB}(V, \\Sigma) + \\sum_s \\sum_{t \\in N(s)} \\lambda_{ts} [V_s - V_{ts}] + \\frac{\\kappa}{2} \\sum_s \\sum_{t \\in N(s)} [\\log V_s - \\log V_{ts}]^2 + \\frac{c}{2} \\sum_s \\sum_{t \\in N(s)} [V_s - V_{ts}]^2.$$\n\nWe keep \u03ba \u2265 1 fixed so that existing convergence theory remains applicable. The set of realizable\nmean parameters M is the set of symmetric positive semidefinite matrices Vs, \u03a3st. We therefore\nmust solve a series of constrained optimizations of the form $\\min_{V, \\Sigma} L_{c^k}(V, \\Sigma, \\lambda^k)$, subject to Vs \u2265\n0, \u03a3st \u2ab0 0. 
The gradient projection step is easily expressed in terms of correlation coefficients \u03c1st,\n\n$$\\Sigma_{st} = \\begin{pmatrix} V_{st} & \\rho_{st} \\sqrt{V_{st} V_{ts}} \\\\ \\rho_{st} \\sqrt{V_{st} V_{ts}} & V_{ts} \\end{pmatrix}.$$\n\nThen, \u03a3st \u2ab0 0 if and only if Vst \u2265 0, Vts \u2265 0, and \u22121 \u2264 \u03c1st \u2264 1. The projection step is then,\n\n$$V_{st} = \\max(0, V_{st}), \\qquad V_{ts} = \\max(0, V_{ts}), \\qquad \\rho_{st} = \\max(-1, \\min(1, \\rho_{st})).$$\n\nThe full MoM algorithm then follows from gradient derivations in the supplemental material.\nRecall that in Section 2.3, we showed that the Gaussian Bethe free energy is unbounded on the\nconstraint set for non-pairwise normalizable models. We run MoM on the symmetric three-node\ncycle from this discussion and find that MoM correctly identifies an unbounded direction, and\nFigure 1(b) shows that the node marginal variances indeed diverge to infinity.\n\n4.2 Discrete Markov Random Fields\n\nConsider a discrete MRF where all variables xs \u2208 Xs = {1, . . . , Ks}. The variational marginal\ndistributions are then $q_s(x_s; \\tau) = \\prod_{k=1}^{K_s} \\tau(x_s)^{I(x_s, k)}$, and have mean parameters \u03c4 \u2208 R^{K_s}. Let\n\u03c4 (xs) denote element xs of vector \u03c4. Pairwise marginals have mean parameters \u03c4st \u2208 R^{K_s \u00d7 K_t}\nsimilarly indexed as \u03c4st(xs, xt). 
The discrete Bethe free energy is then\n\n$$F_B(\\tau; \\phi) = \\sum_{(s,t) \\in E} \\sum_{x_s} \\sum_{x_t} \\tau_{st}(x_s, x_t) [\\log \\tau_{st}(x_s, x_t) - \\log \\phi_{st}(x_s, x_t)] - \\sum_{s \\in V} \\sum_{x_s} (n_s - 1) \\tau_s(x_s) [\\log \\tau_s(x_s) - \\log \\phi_s(x_s)].$$\n\nFor this discrete model, our expectation constraints reduce to the following normalization and\nmarginalization constraints:\n\n$$C_s(\\tau) = 1 - \\sum_{x_s} \\tau_s(x_s), \\qquad C_{ts}(x_s; \\tau) = \\tau_s(x_s) - \\sum_{x_t} \\tau_{st}(x_s, x_t).$$\n\nThe augmented Lagrangian is then,\n\n$$\\begin{aligned} L_c(\\tau, \\lambda, \\xi; \\phi) = {} & F_B(\\tau; \\phi) + \\sum_{(s,t) \\in E} \\left[ \\sum_{x_s} \\lambda_{ts}(x_s) C_{ts}(x_s; \\tau) + \\sum_{x_t} \\lambda_{st}(x_t) C_{st}(x_t; \\tau) \\right] \\\\ & + \\sum_{s \\in V} \\xi_{ss} C_s(\\tau) + \\frac{c}{2} \\sum_{s \\in V} C_s(\\tau)^2 + \\frac{c}{2} \\sum_{(s,t) \\in E} \\left[ \\sum_{x_s} C_{ts}(x_s; \\tau)^2 + \\sum_{x_t} C_{st}(x_t; \\tau)^2 \\right]. \\end{aligned} \\tag{11}$$\n\nMean parameters must be non-negative to be valid, so M = {\u03c4s, \u03c4st : \u03c4s \u2265 0, \u03c4st \u2265 0}. This\nconstraint is enforced by a bound projection \u03c4s(xs) = max(0, \u03c4s(xs)), and similarly for the pairwise\nmarginals. While these constraints are never active in BP fixed point iterations, they must be\nenforced in gradient optimization. With these pieces and the gradient computations presented in the\nsupplement, implementation of MoM optimization for the discrete MRF is straightforward.\n\n4.3 Discrete Mixtures of Gaussian Potentials\n\nWe are particularly interested in tractable inference in hybrid models with discrete and conditionally\nGaussian random variables. A simple example of such a model is the clutter problem [3], whose\njoint distribution models N conditionally independent Gaussian observations $\\{y_i\\}_{i=1}^N$. 
These observations\nmay either be centered on a target scalar x \u2208 R (zi = 1) or drawn from a background clutter\ndistribution (zi = 0). If target observations occur with frequency \u03b20, we then have\n\n$$x \\sim N(\\mu_0, P_0), \\qquad z_i \\sim \\text{Ber}(\\beta_0), \\qquad y_i \\mid x, z_i \\sim N(0, \\sigma_0^2)^{(1 - z_i)} N(x, \\sigma_1^2)^{z_i}.$$\n\nThe corresponding variational posterior distributions are,\n\n$$q_0(x) = N(m_0, V_0), \\qquad q_i(x, z_i) = \\left( (1 - \\beta_i) N(x \\mid m_{i0}, V_{i0}) \\right)^{(1 - z_i)} \\left( \\beta_i N(x \\mid m_{i1}, V_{i1}) \\right)^{z_i}.$$\n\nAssuming normalizable marginals with V0 \u2265 0, Vi0 \u2265 0, Vi1 \u2265 0, as always ensured by our\nmultiplier method, we define the Bethe free energy F CGB(m, V, \u03b2) in terms of the mean parameters\nin the supplemental material. Expectation constraints are given by,\n\n$$C_i^{\\text{mean}} = E_0[x] - E_i[x], \\qquad C_i^{\\text{var}} = \\text{Var}_0[x] - \\text{Var}_i[x],$$\n\nwhere Ei[\u00b7] and Vari[\u00b7] denote the mean and variance of the Gaussian mixture qi(x, zi). Combining\nthe free energy, constraints, and additional positive semidefinite constraints on the marginal\nvariances we have the BVP for the clutter model,\n\n$$\\begin{aligned} \\underset{m, V, \\beta}{\\text{minimize}} \\quad & F_{CGB}(m, V, \\beta; \\phi) \\\\ \\text{subject to} \\quad & C_i^{\\text{mean}} = 0, \\; C_i^{\\text{var}} = 0, \\quad \\text{for all } i = 1, 2, \\ldots, N \\\\ & V_0 \\ge 0, \\; V_{i0} \\ge 0, \\; V_{i1} \\ge 0 \\end{aligned} \\tag{12}$$\n\nDerivation of the free energy and augmented Lagrangian is somewhat lengthy, and so is deferred to\nthe supplement. 
Projection of the variances onto the constraint set is a simple thresholding operation.\n\n5 Experimental Results\n\n5.1 Discrete Markov Random Fields\n\nWe consider binary Ising models, with variables arranged in NxN lattices with toroidal boundary\nconditions. 
Potentials are parametrized as in [19], so that\n\n$$\\psi_s = \\begin{pmatrix} \\exp(h_s) \\\\ \\exp(-h_s) \\end{pmatrix}, \\qquad \\psi_{st} = \\begin{pmatrix} \\exp(J_{st}) & \\exp(-J_{st}) \\\\ \\exp(-J_{st}) & \\exp(J_{st}) \\end{pmatrix}.$$\n\nWe sample 500 instances at random from a 10x10 toroidal lattice with each Jst \u223c N (0, 1) and\nhs \u223c N (0, 0.01). LBP is run for a maximum of 1000 iterations, and MoM is initialized with a\nsingle iteration of LBP. We report average L1 error of the approximate marginals as compared to\nthe true marginals computed with the junction tree algorithm [20]. Marginal errors are reported in\nFigure 2(a,top), and there is a clear improvement over LBP in the majority of cases.\nDirect evaluation of the Bethe free energy does not take into account constraint violations for\nnon-convergent LBP runs. The augmented Lagrangian penalizes constraint violation, but requires a\npenalty parameter which LBP does not provide. For an objective comparison, we construct a\npenalized Bethe free energy by evaluating the augmented Lagrangian with fixed penalty c = 1 and\nmultipliers \u03bb = 0. We evaluate this objective at the final iteration of both algorithms. As we see in\nFigure 2(a,bottom), MoM finds a lower free energy for most trials.\nOur implementations of LBP and MoM are in Matlab, and emphasize correctness over efficiency.\nNevertheless, computation time for LBP exceeds that of MoM. Wall clock time is measured in\nseconds across various trials, and the percentiles for LBP are 25%: 1040.46, 50%: 1042.57, and\n75%: 1044.85. For MoM they are 25%: 290.25, 50%: 381.62, and 75%: 454.52.\n\n5.2 Gaussian Markov Random Fields\n\nFor the Gaussian case we again sample 500 random instances from a 10x10 lattice with toroidal\nboundary conditions. We randomly sample only pairwise normalizable instances and initialization\nis provided with a single iteration of Gaussian LBP. We find that MoM is generally insensitive to\ninitialization in this model. 
True marginals are computed by explicitly inverting the model precision\nmatrix, and average symmetric L1 error with respect to truth is reported in Figure 2(b,top).\nFor pairwise normalizable models, Gaussian LBP is guaranteed to converge to the unique fixed point\nof the Bethe free energy, so it is reassuring that MoM optimization matches LBP performance. The\nvalue of the augmented Lagrangian at the final iteration is shown in Figure 2(b,bottom) and again\nshows that MoM matches Gaussian LBP on pairwise normalizable models. MoM is slightly\nfaster, with a median wall clock time of 58.76 seconds as compared to 103.17 seconds\nfor LBP. The 25% and 75% percentiles are 37.81 and 92.10 seconds for MoM compared to 88.40\nand 125.59 seconds for LBP.\n\nFigure 2: Performance of MoM and LBP on randomly generated (a) discrete 10 \u00d7 10 toroidal Ising MRFs,\n(b) 10 \u00d7 10 toroidal Gaussian MRFs, and (c) clutter models with N = 30 observations. Each point corresponds\nto a single model instance. Top: L1 error between estimated and true marginal distributions, averaged over all\nnodes. Bottom: Penalized Bethe free energy constructed by setting \u03bb = 0, c = 1 in the augmented Lagrangian.\n\n5.3 Discrete Mixtures of Gaussian Potentials\n\nTo test the benefits of avoiding degenerate marginals, we consider the clutter model of Sec. 4.3 with\n\u00b50 = 0, P0 = 100 and \u03b20 = 0.25. The variance of the clutter distribution is $\\sigma_0^2 = 10$, and of the\ntarget distribution $\\sigma_1^2 = 1$. We sample N = 30 observations for each trial instance.\nA good initialization of the multipliers is critical to performance of MoM. We generate 10 initializations\nby running 5 iterations of EP, each with a different random message update schedule, compute\nthe corresponding Lagrange multipliers for each, and use the one with the lowest value of the augmented\nLagrangian. 
Similarly, we measure EP\u2019s performance by the best performing of 10 longer\nruns. Both methods are run for a maximum of 1000 iterations, and true marginals are computed\nnumerically by finely discretizing the scalar target x.\nWe sample 500 random instances and report average L1 error with respect to true marginals in\nFigure 2(c,top). We see a significant improvement in the majority of runs. Similarly, the augmented\nLagrangian comparison is shown in Figure 2(c,bottom) and MoM often finds a better penalized free\nenergy. While MoM and EP can both suffer from local optima, MoM avoids non-convergence and\nthe output of invalid (negative variance) marginal distributions. Median wall clock time for EP is\n0.59 seconds, and 9.80 seconds for MoM. The 25% and 75% percentiles are 0.42 and 0.84 seconds\nfor EP and 0.51 and 49.19 seconds for MoM.\n\n6 Discussion\n\nWe have proposed an approach for directly minimizing the Bethe variational problem motivated by\nsuccessful methods in nonlinear programming. Our approach is unique in that we do not relax the\nconstraint on normalizability of the marginals; rather, we explicitly enforce it at all points in the\noptimization. This method directly avoids the creation of degenerate distributions (for example, with\nnegative variance) which frequently occur in more greedy approaches for minimizing the Bethe\nfree energy. In addition we obtain convergence guarantees under broadly applicable assumptions.\n\n[Figure 2 panels: Avg. Marginal Error (top) and Augmented Lagrange (bottom); axes: Belief Propagation or Expectation Propagation versus Method of Multipliers]\n\n
References\n[1] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized\nBelief Propagation algorithms. Information Theory, IEEE Transactions on, 51(7):2282\u20132312, 2005.\n\n[2] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational\ninference. Technical report, UC Berkeley, Dept. of Statistics, 2003.\n\n[3] T. P. Minka. Expectation Propagation for approximate Bayesian inference. Uncertainty in\nArtificial Intelligence, 17:362\u2013369, 2001.\n\n[4] Tom Heskes, Wim Wiegerinck, Ole Winther, and Onno Zoeter. Approximate inference techniques\nwith expectation constraints. Journal of Statistical Mechanics: Theory and Experiment,\npage 11015, 2005.\n\n[5] Dmitry M. Malioutov, Jason K. Johnson, and Alan S. Willsky. Walk-sums and Belief Propagation\nin Gaussian graphical models. Journal of Machine Learning Research, 7:2031\u20132064,\n2006.\n\n[6] B. Cseke and T. Heskes. Properties of Bethe free energies and message passing in Gaussian\nmodels. Journal of Artificial Intelligence Research, 41(2):1\u201324, 2011.\n\n[7] A. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent\nalternatives to Belief Propagation. Neural Computation, 14:1691\u20131722, 2002.\n\n[8] T. Heskes, K. Albers, and B. Kappen. Approximate inference and constrained optimization.\nUncertainty in Artificial Intelligence, 13:313\u2013320, 2003.\n\n[9] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. 
Tree-reweighted Belief Propagation algorithms and approximate ML estimation by pseudo-moment matching. In AISTATS, 2003.\n\n[10] M. Welling and Y.W. Teh. Belief optimization for binary networks: A stable alternative to Loopy Belief Propagation. In Uncertainty in Artificial Intelligence, 2001.\n\n[11] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. Uncertainty in Artificial Intelligence, 18:216\u2013223, 2002.\n\n[12] T. Minka. The EP energy function and minimization schemes. Technical report, MIT Media Lab, 2001.\n\n[13] D.P. Bertsekas. Nonlinear programming. Athena Scientific, 1999.\n\n[14] M. Seeger. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759\u2013813, 2008.\n\n[15] C. Rasmussen. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[16] M. Schmidt, E. Van Den Berg, M. Friedlander, and K. Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In AI & Statistics, 2009.\n\n[17] D.P. Bertsekas. Constrained optimization and Lagrange multiplier methods. Computer Science and Applied Mathematics. Academic Press, Boston, 1982.\n\n[18] T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11):2379\u20132413, 2004.\n\n[19] J.S. Yedidia, W.T. Freeman, and Y. Weiss. Generalized Belief Propagation. Advances in neural information processing systems, pages 689\u2013695, 2001.\n\n[20] Joris M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169\u20132173, August 2010.", "award": [], "sourceid": 1219, "authors": [{"given_name": "Jason", "family_name": "Pacheco", "institution": null}, {"given_name": "Erik", "family_name": "Sudderth", "institution": null}]}