{"title": "Decomposition Bounds for Marginal MAP", "book": "Advances in Neural Information Processing Systems", "page_first": 3267, "page_last": 3275, "abstract": "Marginal MAP inference involves making MAP predictions in systems defined with latent variables or missing information. It is significantly more difficult than pure marginalization and MAP tasks, for which a large class of efficient and convergent variational algorithms, such as dual decomposition, exist. In this work, we generalize dual decomposition to a generic powered-sum inference task, which includes marginal MAP, along with pure marginalization and MAP, as special cases. Our method is based on a block coordinate descent algorithm on a new convex decomposition bound, that is guaranteed to converge monotonically, and can be parallelized efficiently. We demonstrate our approach on various inference queries over real-world problems from the UAI approximate inference challenge, showing that our framework is faster and more reliable than previous methods.", "full_text": "Decomposition Bounds for Marginal MAP\n\nWei Ping\u2217\n\n\u2217Computer Science, UC Irvine\n\nQiang Liu\u2020\n{wping,ihler}@ics.uci.edu\n\nAlexander Ihler\u2217\n\n\u2020Computer Science, Dartmouth College\nqliu@cs.dartmouth.edu\n\nAbstract\n\nMarginal MAP inference involves making MAP predictions in systems de\ufb01ned\nwith latent variables or missing information. It is signi\ufb01cantly more dif\ufb01cult than\npure marginalization and MAP tasks, for which a large class of ef\ufb01cient and con-\nvergent variational algorithms, such as dual decomposition, exist. In this work, we\ngeneralize dual decomposition to a generic power sum inference task, which in-\ncludes marginal MAP, along with pure marginalization and MAP, as special cases.\nOur method is based on a block coordinate descent algorithm on a new convex\ndecomposition bound, that is guaranteed to converge monotonically, and can be\nparallelized ef\ufb01ciently. 
We demonstrate our approach on marginal MAP queries defined on real-world problems from the UAI approximate inference challenge, showing that our framework is faster and more reliable than previous methods.

1 Introduction

Probabilistic graphical models such as Bayesian networks and Markov random fields provide a useful framework and powerful tools for machine learning. Given a graphical model, inference refers to answering probabilistic queries about the model. There are three common types of inference tasks. The first are max-inference or maximum a posteriori (MAP) tasks, which aim to find the most probable state of the joint probability; exact and approximate MAP inference is widely used in structured prediction [26]. Sum-inference tasks include calculating marginal probabilities and the normalization constant of the distribution, and play a central role in many learning tasks (e.g., maximum likelihood). Finally, marginal MAP tasks are "mixed" inference problems, which generalize the first two types by marginalizing a subset of variables (e.g., hidden variables) before optimizing over the remainder.1 These tasks arise in latent variable models [e.g., 29, 25] and many decision-making problems [e.g., 13]. All three inference types are generally intractable; as a result, approximate inference, particularly via convex relaxations or upper bounding methods, is of great interest.

Decomposition methods provide a useful and computationally efficient class of bounds on inference problems. For example, dual decomposition methods for MAP [e.g., 31] give a class of easy-to-evaluate upper bounds which can be directly optimized using coordinate descent [36, 6], subgradient updates [14], or other methods [e.g., 22]. It is easy to ensure both convergence, and that the objective is monotonically decreasing (so that more computation always provides a better bound). 
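To make the dual-decomposition idea concrete, here is a toy numeric sketch (our own minimal example, not code from the paper): for MAP, the joint maximization is upper-bounded by maximizing each factor independently, and a cost-shifting (reparameterization) variable on the shared node is then optimized to tighten the bound, with every intermediate shift still giving a valid upper bound:

```python
import itertools

# Toy chain x1 - x2 - x3 (binary) with two pairwise log-potentials.
theta_a = [[1.0, 0.0], [0.8, 0.2]]   # theta_a[x1][x2]
theta_b = [[0.0, 0.5], [0.9, 1.5]]   # theta_b[x2][x3]

# Exact MAP value of the joint: max over all 8 configurations.
exact = max(theta_a[x1][x2] + theta_b[x2][x3]
            for x1, x2, x3 in itertools.product((0, 1), repeat=3))

def dd_bound(delta):
    """Dual-decomposition upper bound: a cost shift delta(x2) moves mass
    between the two factors; each factor is then maximized independently,
    so the sum of the two maxima upper-bounds the joint maximum."""
    fa = max(theta_a[x1][x2] + delta[x2] for x1 in (0, 1) for x2 in (0, 1))
    fb = max(theta_b[x2][x3] - delta[x2] for x2 in (0, 1) for x3 in (0, 1))
    return fa + fb

# Any shift gives a valid upper bound; optimizing the shift tightens it.
assert dd_bound((0.0, 0.0)) >= exact
best = min(dd_bound((d0 / 10.0, d1 / 10.0))
           for d0 in range(-30, 31) for d1 in range(-30, 31))
assert exact <= best + 1e-9 <= dd_bound((0.0, 0.0)) + 1e-9
```

Here a crude grid search stands in for the coordinate-descent or subgradient updates cited above; on this tree-structured toy the optimized bound becomes tight.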
The resulting bounds can be used either as stand-alone approximation methods [6, 14], or as a component of search [11]. In summation problems, a notable decomposition bound is tree-reweighted BP (TRW), which bounds the partition function with a combination of trees [e.g., 34, 21, 12, 3]. These bounds are useful in joint inference and learning (or "inferning") frameworks, allowing learning with approximate inference to be framed as a joint optimization over the model parameters and decomposition bound, often leading to more efficient learning [e.g., 23]. However, far fewer methods have been developed for marginal MAP problems.

1 In some literature [e.g., 28], marginal MAP is simply called MAP, and the joint MAP task is called MPE.

In this work, we develop a decomposition bound that has a number of desirable properties: (1) Generality: our bound is sufficiently general to be applied easily to marginal MAP. (2) Any-time: it yields a bound at any point during the optimization (not just at convergence), so it can be used in an any-time way. (3) Monotonic and convergent: more computational effort gives strictly tighter bounds; note that (2) and (3) are particularly important for high-width approximations, which are expensive to represent and update. (4) Allows optimization over all parameters, including the "weights", or fractional counting numbers, of the approximation; these parameters often have a significant effect on the tightness of the resulting bound. (5) Compact representation: within a given class of bounds, using fewer parameters to express the bound reduces memory and typically speeds up optimization.

We organize the rest of the paper as follows. Section 2 gives some background and notation, followed by connections to related work in Section 3. We derive our decomposed bound in Section 4, and present a (block) coordinate descent algorithm for monotonically tightening it in Section 5. 
We report experimental results in Section 6 and conclude the paper in Section 7.

2 Background

Here, we review some background on graphical models and inference tasks. A Markov random field (MRF) on discrete random variables x = [x_1, ..., x_n] \in \mathcal{X}^n is a probability distribution

    p(x; \theta) = \exp\Big[ \sum_{\alpha \in F} \theta_\alpha(x_\alpha) - \Phi(\theta) \Big],  \qquad  \Phi(\theta) = \log \sum_{x \in \mathcal{X}^n} \exp\Big[ \sum_{\alpha \in F} \theta_\alpha(x_\alpha) \Big],    (1)

where F is a set of subsets of the variables, each associated with a factor \theta_\alpha, and \Phi(\theta) is the log partition function. We associate an undirected graph G = (V, E) with p(x) by mapping each x_i to a node i \in V, and adding an edge ij \in E iff there exists \alpha \in F such that \{i, j\} \subseteq \alpha. We say nodes i and j are neighbors if ij \in E. Then F is a subset of the cliques (fully connected subgraphs) of G.

The use and evaluation of a given MRF often involves different types of inference tasks. Marginalization, or sum-inference, tasks perform a sum over the configurations to calculate the log partition function \Phi in (1), marginal probabilities, or the probability of some observed evidence. On the other hand, maximum a posteriori (MAP), or max-inference, tasks perform joint maximization to find configurations with the highest probability, that is, \Phi_0(\theta) = \max_x \sum_{\alpha \in F} \theta_\alpha(x_\alpha).

A generalization of max- and sum-inference is marginal MAP, or mixed-inference, in which we are interested in first marginalizing a subset A of variables (e.g., hidden variables), and then maximizing over the remaining variables B (whose values are of direct interest), that is,

    \Phi_{AB}(\theta) = \max_{x_B} Q(x_B),  \qquad  Q(x_B) = \log \sum_{x_A} \exp\Big[ \sum_{\alpha \in F} \theta_\alpha(x_\alpha) \Big],    (2)

where A \cup B = V (all the variables) and A \cap B = \emptyset. 
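On a model this small, all three inference values can be computed by brute force; the following minimal sketch (our own code, a single pairwise factor on two binary variables) makes the distinction concrete, with A = {x1} summed and B = {x2} maximized:

```python
import math
import itertools

# Tiny MRF on binary x = (x1, x2): one pairwise log-potential theta(x1, x2).
theta = [[1.0, -0.5], [0.2, 0.8]]

configs = list(itertools.product((0, 1), repeat=2))
# Sum-inference: log partition function, Phi = log sum_x exp(theta).
log_Z = math.log(sum(math.exp(theta[a][b]) for a, b in configs))
# Max-inference: MAP value, Phi_0 = max_x theta.
map_val = max(theta[a][b] for a, b in configs)
# Mixed-inference: marginal MAP as in (2), summing x1 then maximizing x2.
mmap = max(math.log(sum(math.exp(theta[a][b]) for a in (0, 1)))
           for b in (0, 1))

# The three values interleave: max <= marginal MAP <= log partition.
assert map_val <= mmap <= log_Z
```

The chain of inequalities in the final assertion holds for any model, since replacing a sum by a max can only decrease the value.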
Obviously, both sum- and max-inference are special cases of marginal MAP when A = V and B = V, respectively.

It will be useful to define an even more general inference task, based on a power sum operator:

    \sum_{x_i}^{\tau_i} f(x_i) = \Big[ \sum_{x_i} f(x_i)^{1/\tau_i} \Big]^{\tau_i},

where f(x_i) is any non-negative function and \tau_i is a temperature or weight parameter. The power sum reduces to a standard sum when \tau_i = 1, and approaches \max_{x_i} f(x_i) when \tau_i \to 0^+, so that we define the power sum with \tau_i = 0 to equal the max operator.

The power sum is helpful for unifying max- and sum-inference [e.g., 35], as well as marginal MAP [15]. Specifically, we can apply power sums with different weights \tau_i to each variable x_i along a predefined elimination order (e.g., [x_1, ..., x_n]), to define the weighted log partition function:

    \Phi_\tau(\theta) = \log \sum_{x}^{\tau} \exp(\theta(x)) = \log \sum_{x_n}^{\tau_n} \cdots \sum_{x_1}^{\tau_1} \exp(\theta(x)),    (3)

where we note that the value of (3) depends on the elimination order unless all the weights are equal. Obviously, (3) includes marginal MAP (2) as a special case by setting weights \tau_A = 1 and \tau_B = 0. This representation provides a useful tool for understanding and deriving new algorithms for general inference tasks, especially marginal MAP, for which relatively few efficient algorithms exist.

3 Related Work

Variational upper bounds on MAP and the partition function, along with algorithms for providing fast, convergent optimization, have been widely studied in the last decade. 
In MAP, dual decompo-\nsition and linear programming methods have become a dominating approach, with numerous opti-\nmization techniques [36, 6, 32, 14, 37, 30, 22], and methods to tighten the approximations [33, 14].\nFor summation problems, most upper bounds are derived from the tree-reweighted (TRW) family\nof convex bounds [34], or more generally conditional entropy decompositions [5]. TRW bounds\ncan be framed as optimizing over a convex combination of tree-structured models, or in a dual\nrepresentation as a message-passing, TRW belief propagation algorithm. This illustrates a basic\ntension in the resulting bounds: in its primal form 2 (combination of trees), TRW is inef\ufb01cient:\nit maintains a weight and O(|V |) parameters for each tree, and a large number of trees may be\nrequired to obtain a tight bound; this uses memory and makes optimization slow. On the other hand,\nthe dual, or free energy, form uses only O(|E|) parameters (the TRW messages) to optimize over the\nset of all possible spanning trees \u2013 but, the resulting optimization is only guaranteed to be a bound\nat convergence, 3 making it dif\ufb01cult to use in an anytime fashion. Similarly, the gradient of the\nweights is only correct at convergence, making it dif\ufb01cult to optimize over these parameters; most\nimplementations [e.g., 24] simply adopt \ufb01xed weights.\nThus, most algorithms do not satisfy all the desirable properties listed in the introduction. For exam-\nple, many works have developed convergent message-passing algorithms for convex free energies\n[e.g., 9, 10]. However, by optimizing the dual they do not provide a bound until convergence, and the\nrepresentation and constraints on the counting numbers do not facilitate optimizing the bound over\nthese parameters. 
To optimize counting numbers, [8] adopt a more restrictive free energy form requiring positive counting numbers on the entropies; but this cannot represent marginal MAP, whose free energy involves conditional entropies (equivalent to the difference between two entropy terms). On the other hand, working in the primal domain ensures a bound, but usually at the cost of enumerating a large number of trees. [12] heuristically select a small number of trees to avoid being too inefficient, while [21] focus on trying to speed up the updates on a given collection of trees. Another primal bound is weighted mini-bucket (WMB, [16]), which can represent a large collection of trees compactly and is easily applied to marginal MAP using the weighted log partition function viewpoint [15, 18]; however, existing optimization algorithms for WMB are non-monotonic, and often fail to converge, especially on marginal MAP tasks.

While our focus is on variational bounds [16, 17], there are many non-variational approaches for marginal MAP as well. [27, 38] provide upper bounds on marginal MAP by reordering the order in which variables are eliminated, and using exact inference in the reordered join-tree; however, this is exponential in the size of the (unconstrained) treewidth, and can easily become intractable. [20] give an approximation closely related to mini-bucket [2] to bound the marginal MAP; however, unlike (weighted) mini-bucket, these bounds cannot be improved iteratively. The same is true for the algorithm of [19], which also has a strong dependence on treewidth. Other examples of marginal MAP algorithms include local search [e.g., 28] and Markov chain Monte Carlo methods [e.g., 4, 39].

4 Fully Decomposed Upper Bound

In this section, we develop a new general form of upper bound and provide an efficient, monotonically convergent optimization algorithm. Our new bound is based on fully decomposing the graph into disconnected cliques, allowing very efficient local computation, but can still be as tight as WMB or the TRW bound with a large collection of spanning trees once the weights and shifting variables are chosen or optimized properly. Our bound reduces to dual decomposition for MAP inference, but is applicable to more general mixed-inference settings.

Our main result is based on the following generalization of the classical Hölder's inequality [7]:

Theorem 4.1. For a given graphical model p(x; \theta) in (1) with cliques F = \{\alpha\} and a set of non-negative weights \tau = \{\tau_i \geq 0, i \in V\}, we define a set of "split weights" w^\alpha = \{w_i^\alpha \geq 0, i \in \alpha\} on each variable-clique pair (i, \alpha) that satisfies \sum_{\alpha \ni i} w_i^\alpha = \tau_i. Then we have

    \sum_{x}^{\tau} \exp\Big[ \sum_{\alpha \in F} \theta_\alpha(x_\alpha) \Big] \leq \prod_{\alpha \in F} \sum_{x_\alpha}^{w^\alpha} \exp\big[ \theta_\alpha(x_\alpha) \big],    (4)

where the left-hand side is the powered-sum along the order [x_1, ..., x_n] as defined in (3), and the right-hand side is the product of powered-sums on the subvectors x_\alpha with weights w^\alpha along the same elimination order; that is, \sum_{x_\alpha}^{w^\alpha} \exp[\theta_\alpha(x_\alpha)] = \sum_{x_{k_c}}^{w_{k_c}^\alpha} \cdots \sum_{x_{k_1}}^{w_{k_1}^\alpha} \exp[\theta_\alpha(x_\alpha)], where x_\alpha = [x_{k_1}, ..., x_{k_c}] is ranked with increasing index, consistent with the elimination order [x_1, ..., x_n] used on the left-hand side.

Proof details can be found in Section E of the supplement. A key advantage of the bound (4) is that it decomposes the joint power sum on x into a product of independent power sums over smaller cliques x_\alpha, which significantly reduces computational complexity and enables parallel computation.

2 Despite the term "dual decomposition" used in MAP tasks, in this work we refer to decomposition bounds as "primal" bounds, since they can be viewed as directly bounding the result of variable elimination. This is in contrast to, for example, the linear programming relaxation of MAP, which bounds the result only after optimization.
3 See an example on an Ising model in Supplement A.

4.1 Including Cost-shifting Variables

In order to increase the flexibility of the upper bound, we introduce a set of cost-shifting or reparameterization variables \delta = \{\delta_i^\alpha(x_i) \mid \forall (i, \alpha), i \in \alpha\} on each variable-factor pair (i, \alpha), which can be optimized to provide a much tighter upper bound. Note that \Phi_\tau(\theta) can be rewritten as

    \Phi_\tau(\theta) = \log \sum_{x}^{\tau} \exp\Big[ \sum_{i \in V} \sum_{\alpha \in N_i} \delta_i^\alpha(x_i) + \sum_{\alpha \in F} \Big( \theta_\alpha(x_\alpha) - \sum_{i \in \alpha} \delta_i^\alpha(x_i) \Big) \Big],

where N_i = \{\alpha \mid \alpha \ni i\} is the set of cliques incident to i. Applying inequality (4), we have that

    \Phi_\tau(\theta) \leq \sum_{i \in V} \log \sum_{x_i}^{w_i} \exp\Big[ \sum_{\alpha \in N_i} \delta_i^\alpha(x_i) \Big] + \sum_{\alpha \in F} \log \sum_{x_\alpha}^{w^\alpha} \exp\Big[ \theta_\alpha(x_\alpha) - \sum_{i \in \alpha} \delta_i^\alpha(x_i) \Big] \overset{\mathrm{def}}{=} L(\delta, w),    (5)

where the nodes i \in V are also treated as cliques within inequality (4), and a new weight w_i is introduced on each variable i; the new weights w = \{w_i, w_i^\alpha \mid \forall (i, \alpha), i \in \alpha\} should satisfy

    w_i + \sum_{\alpha \in N_i} w_i^\alpha = \tau_i,  \quad  w_i \geq 0,  \quad  w_i^\alpha \geq 0,  \quad  \forall (i, \alpha).    (6)

The bound L(\delta, w) is convex w.r.t. the cost-shifting variables \delta and the weights w, enabling an efficient optimization algorithm that we present in Section 5. As we will discuss in Section 5.1, these shifting variables correspond to Lagrange multipliers that enforce a moment matching condition.

4.2 Dual Form and Connection With Existing Bounds

It is straightforward to see that our bound in (5) reduces to dual decomposition [31] when applied to MAP inference with all \tau_i = 0, and hence w_i = w_i^\alpha = 0. On the other hand, its connection with sum-inference bounds such as WMB and TRW is seen more clearly via a dual representation of (5):

Theorem 4.2. 
The tightest upper bound obtainable by (5), that is,

    \min_{w} \min_{\delta} L(\delta, w) = \min_{w} \max_{b \in \mathbb{L}(G)} \Big\{ \langle \theta, b \rangle + \sum_{i \in V} w_i H(x_i; b_i) + \sum_{\alpha \in F} \sum_{i \in \alpha} w_i^\alpha H(x_i \mid x_{\mathrm{pa}_i^\alpha}; b_\alpha) \Big\},    (7)

where b = \{b_i(x_i), b_\alpha(x_\alpha) \mid \forall (i, \alpha), i \in \alpha\} is a set of pseudo-marginals (or beliefs) defined on the singleton variables and the cliques, and \mathbb{L}(G) = \{b \mid b_i(x_i) = \sum_{x_{\alpha \setminus i}} b_\alpha(x_\alpha), \sum_{x_i} b_i(x_i) = 1\} is the corresponding local consistency polytope. Here, the H(\cdot) are the corresponding marginal or conditional entropies, and \mathrm{pa}_i^\alpha is the set of variables in \alpha that rank later than i in the global elimination order [x_1, ..., x_n], that is, \mathrm{pa}_i^\alpha = \{j \in \alpha \mid j \succ i\}.

Figure 1: Illustrating WMB, TRW, and our bound on (a) a 3 × 3 grid. (b) WMB uses a covering tree with a minimal number of splits and cost-shifting. (c) Our decomposition (5) further splits the graph into small cliques (here, edges), introducing additional cost-shifting variables but allowing for easier, monotonic optimization. (d) Primal TRW splits the graph into many spanning trees, requiring even more cost-shifting variables. Note that all three bounds attain the same tightness after optimization.

4 The primal form derived in [5] (a geometric program) is computationally infeasible.

The proof details can be found in Section F of the supplement. It is useful to compare Theorem 4.2 with other dual representations. As the sum of non-negatively weighted conditional entropies, the bound is clearly convex and within the general class of conditional entropy decompositions (CED) [5], but unlike generic CED it has a simple and efficient primal form (5).4 Comparing to the dual form of WMB in Theorem 4.2 of [16], our bound is as tight as WMB, and hence as tight as the class of TRW / CED bounds attainable by WMB [16]. Most duality-based forms [e.g., 9, 10] are expressed in terms of joint entropies, \langle \theta, b \rangle + \sum_{\beta} c_\beta H(b_\beta), rather than conditional entropies; while the two can be converted, the resulting counting numbers c_\beta will be differences of the weights \{w_i^\alpha\},5 which obfuscates convexity, makes it harder to maintain the relative constraints on the counting numbers during optimization, and makes some counting numbers negative (rendering some methods inapplicable [8]). Finally, like most variational bounds in dual form, the RHS of (7) has an inner maximization and hence is guaranteed to bound \Phi_\tau(\theta) only at its optimum.

In contrast, our Eq. (5) is a primal bound (hence, a bound for any \delta). It is similar to the primal form of TRW, except that (1) the individual regions are single cliques, rather than spanning trees of the graph,6 and (2) the fractional weights w^\alpha associated with each region are vectors, rather than a single scalar. The representation's efficiency can be seen with the example in Figure 1, which shows a 3 × 3 grid model and three relaxations that achieve the same bound. Assuming d states per variable and ignoring the equality constraints, our decomposition in Figure 1(c) uses 24d cost-shifting parameters (\delta) and 24 weights. WMB (Figure 1(b)) is slightly more efficient, with only 8d parameters for \delta and 8 weights, but its lack of decomposition makes parallel and monotonic updates difficult. On the other hand, the equivalent primal TRW uses 16 spanning trees, shown in Figure 1(d), for 16 · 8 · d^2 parameters and 16 weights. 
The increased dimensionality of the optimization slows convergence, and updates are non-local, requiring full message-passing sweeps on the involved trees (although this cost can be amortized in some cases [21]).

5 Monotonically Tightening the Bound

In this section, we propose a block coordinate descent algorithm (Algorithm 1) to minimize the upper bound L(\delta, w) in (5) w.r.t. the shifting variables \delta and the weights w. Our algorithm has a monotonic convergence property, and allows efficient, distributable local computation due to the full decomposition of our bound. Our framework allows generic powered-sum inference, including max-, sum-, or mixed-inference as special cases by setting different weights.

Algorithm 1 Generalized Dual-decomposition (GDD)
  Input: weights \{\tau_i \mid i \in V\}, elimination order o.
  Output: the optimal \delta^*, w^* giving the tightest upper bound L(\delta^*, w^*) for \Phi_\tau(\theta) in (5).
  initialize \delta = 0 and the weights w = \{w_i, w_i^\alpha\}.
  repeat
    for node i (in parallel with any node j such that (i, j) \notin E) do
      if \tau_i = 0 then
        update \delta_{N_i} = \{\delta_i^\alpha \mid \forall \alpha \in N_i\} with the closed-form update (11);
      else if \tau_i \neq 0 then
        update \delta_{N_i} and w_{N_i} with gradient descent, (8) and (12), combined with line search;
      end if
    end for
  until convergence
  \delta^* \leftarrow \delta, w^* \leftarrow w, and evaluate L(\delta^*, w^*) by (5);
  Remark. GDD solves max-, sum-, and mixed-inference by setting different values of the weights \{\tau_i\}.

5.1 Moment Matching and Entropy Matching

We start by deriving the gradient of L(\delta, w) w.r.t. \delta and w. We show that the zero-gradient equation w.r.t. \delta has the simple form of a moment matching condition that enforces consistency between the singleton beliefs and their related clique beliefs, and that the zero-gradient equation for the weights w enforces a consistency of marginal and conditional entropies.

Theorem 5.1. (1) For L(\delta, w) in (5), its zero-gradient w.r.t. \delta_i^\alpha(x_i) is

    \frac{\partial L}{\partial \delta_i^\alpha(x_i)} = \mu_i(x_i) - \sum_{x_{\alpha \setminus i}} \mu_\alpha(x_\alpha) = 0,    (8)

where \mu_i(x_i) \propto \exp\big[ \frac{1}{w_i} \sum_{\alpha \in N_i} \delta_i^\alpha(x_i) \big] can be interpreted as a singleton belief on x_i, and \mu_\alpha(x_\alpha) can be viewed as a clique belief on x_\alpha, defined via the chain rule (assuming x_\alpha = [x_1, ..., x_c]) \mu_\alpha(x_\alpha) = \prod_{i=1}^{c} \mu_\alpha(x_i \mid x_{i+1:c}), with \mu_\alpha(x_i \mid x_{i+1:c}) = \big( Z_{i-1}(x_{i:c}) / Z_i(x_{i+1:c}) \big)^{1/w_i^\alpha}, where Z_i is the partial powered-sum up to x_{1:i} on the clique, that is,

    Z_i(x_{i+1:c}) = \sum_{x_i}^{w_i^\alpha} \cdots \sum_{x_1}^{w_1^\alpha} \exp\Big[ \theta_\alpha(x_\alpha) - \sum_{i \in \alpha} \delta_i^\alpha(x_i) \Big],  \qquad  Z_0(x_\alpha) = \exp\Big[ \theta_\alpha(x_\alpha) - \sum_{i \in \alpha} \delta_i^\alpha(x_i) \Big],

where the summation order should be consistent with the global elimination order o = [x_1, ..., x_n].

(2) The gradients of L(\delta, w) w.r.t. the weights \{w_i, w_i^\alpha\} are marginal and conditional entropies defined on the beliefs \{\mu_i, \mu_\alpha\}, respectively:

    \frac{\partial L}{\partial w_i} = H(x_i; \mu_i),  \qquad  \frac{\partial L}{\partial w_i^\alpha} = H(x_i \mid x_{i+1:c}; \mu_\alpha) = - \sum_{x_\alpha} \mu_\alpha(x_\alpha) \log \mu_\alpha(x_i \mid x_{i+1:c}).    (9)

Therefore, the optimal weights should satisfy the following KKT conditions:

    w_i \big( H(x_i; \mu_i) - \bar{H}_i \big) = 0,  \qquad  w_i^\alpha \big( H(x_i \mid x_{i+1:c}; \mu_\alpha) - \bar{H}_i \big) = 0,  \qquad  \forall (i, \alpha),    (10)

where \bar{H}_i = w_i H(x_i; \mu_i) + \sum_{\alpha} w_i^\alpha H(x_i \mid x_{i+1:c}; \mu_\alpha) is the (weighted) average entropy on node i.

5 See more details of this connection in Section F.3 of the supplement.
6 While non-spanning subgraphs can be used in the primal TRW form, doing so leads to loose bounds; in contrast, our decomposition's terms consist of individual cliques.

The proof details can be found in Section G of the supplement. 
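Because (5) is a primal bound, L(\delta, w) upper-bounds \Phi_\tau(\theta) for any value of the cost-shifting variables, not only at the stationary point characterized above. A self-contained numerical check on a tiny chain (our own illustrative code; the model values and the split weights, chosen to satisfy (6), are assumptions of the sketch):

```python
import math

def power_sum(vals, tau):
    """(sum_i vals[i]^(1/tau))^tau for non-negative vals; tau = 0 means max."""
    if tau == 0:
        return max(vals)
    return sum(v ** (1.0 / tau) for v in vals) ** tau

# Chain x1 - x2 - x3, binary states, cliques a = {x1, x2}, b = {x2, x3}.
ta = [[0.4, -0.3], [1.2, 0.1]]   # ta[x1][x2]
tb = [[0.7, -0.5], [0.2, 0.9]]   # tb[x2][x3]

# Marginal MAP weights: sum over x1, x2 (tau = 1), max over x3 (tau = 0);
# exact Phi_tau eliminates x1, x2 by summation, then x3 by max.
phi = math.log(max(sum(math.exp(ta[x1][x2] + tb[x2][x3])
                       for x1 in (0, 1) for x2 in (0, 1))
                   for x3 in (0, 1)))

def L(d2a, d2b):
    """Bound (5) with cost shifts only on the shared node x2 (others zero).
    Split weights satisfy (6): w2 = 0 (singleton), w2a = w2b = 0.5."""
    single = max(d2a[x2] + d2b[x2] for x2 in (0, 1))           # w2 = 0 -> max
    fa = math.log(power_sum(
        [sum(math.exp(ta[x1][x2] - d2a[x2]) for x1 in (0, 1))  # x1: weight 1
         for x2 in (0, 1)], 0.5))                              # x2: weight 0.5
    fb = math.log(max(
        power_sum([math.exp(tb[x2][x3] - d2b[x2]) for x2 in (0, 1)], 0.5)
        for x3 in (0, 1)))                                     # x3: weight 0 -> max
    return single + fa + fb

# An anytime primal bound: L >= Phi_tau for arbitrary shifts.
for d2a, d2b in [((0.0, 0.0), (0.0, 0.0)), ((0.3, -0.2), (-0.1, 0.4)),
                 ((-1.0, 0.5), (0.8, -0.6))]:
    assert L(d2a, d2b) >= phi - 1e-9
```

Optimizing the shifts (and the split of x2's weight) with the updates below would then tighten L monotonically toward the best bound in this family.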
The matching condition (8) enforces that \mu = \{\mu_i, \mu_\alpha \mid \forall (i, \alpha)\} belong to the local consistency polytope \mathbb{L} as defined in Theorem 4.2; similar moment matching results appear commonly in variational inference algorithms [e.g., 34]. [34] also derive a gradient of the weights, but it is based on the free energy form and is correct only after optimization; our form holds at any point, enabling efficient joint optimization of \delta and w.

5.2 Block Coordinate Descent

We derive a block coordinate descent method in Algorithm 1 to minimize our bound, in which we sweep through all the nodes i and update each block \delta_{N_i} = \{\delta_i^\alpha(x_i) \mid \forall \alpha \in N_i\} and w_{N_i} = \{w_i, w_i^\alpha \mid \forall \alpha \in N_i\} with the neighborhood parameters fixed. Our algorithm applies two update types, depending on whether the variables have zero weight: (1) For nodes with \tau_i = 0 (corresponding to max nodes i \in B in marginal MAP), we derive a closed-form coordinate descent rule for the associated shifting variables \delta_{N_i}; these nodes do not require optimizing w_{N_i}, since it is fixed to zero. (2) For nodes with \tau_i \neq 0 (e.g., sum nodes i \in A in marginal MAP), we lack a closed-form update for \delta_{N_i} and w_{N_i}, and optimize by local gradient descent combined with line search. The lack of a closed-form coordinate update for nodes with \tau_i \neq 0 is mainly because the order of power sums with different weights cannot be exchanged. However, the gradient descent inner loop is still efficient, because each gradient evaluation only involves the local variables in clique \alpha.

Closed-form Update. For any node i with \tau_i = 0 (i.e., max nodes i \in B in marginal MAP) and its associated \delta_{N_i} = \{\delta_i^\alpha(x_i) \mid \forall \alpha \in N_i\}, the following update gives a closed-form solution of the zero (sub-)gradient equation in (8) (keeping the other \{\delta_j^\alpha \mid j \neq i, \forall \alpha \in N_i\} fixed):

    \delta_i^\alpha(x_i) \leftarrow \frac{|N_i|}{|N_i| + 1} \gamma_i^\alpha(x_i) - \frac{1}{|N_i| + 1} \sum_{\beta \in N_i \setminus \alpha} \gamma_i^\beta(x_i),    (11)

where |N_i| is the number of neighborhood cliques, and \gamma_i^\alpha(x_i) = \log \sum_{x_{\alpha \setminus i}}^{w^{\alpha \setminus i}} \exp\big[ \theta_\alpha(x_\alpha) - \sum_{j \in \alpha \setminus i} \delta_j^\alpha(x_j) \big]. Note that the update in (11) works regardless of the weights of the nodes \{\tau_j \mid \forall j \in \alpha, \forall \alpha \in N_i\} in the neighborhood cliques; when all the neighboring nodes also have zero weight (\tau_j = 0 for \forall j \in \alpha, \forall \alpha \in N_i), it is analogous to the "star" update of dual decomposition for MAP [31]. The detailed derivation is shown in Proposition H.2 in the supplement.

The update in (11) can be calculated with a cost of only O(|N_i| · d^{|\alpha|}), where d is the number of states of x_i and |\alpha| is the clique size, by computing and saving all the shared \{\gamma_i^\alpha(x_i)\} before updating \delta_{N_i}. Furthermore, the updates of \delta_{N_i} for different nodes i are independent if the nodes are not directly connected by some clique \alpha; this makes it easy to parallelize the coordinate descent process by partitioning the graph into independent sets, and parallelizing the updates within each set.

Local Gradient Descent. For nodes with \tau_i \neq 0 (or i \in A in marginal MAP), there is no closed-form solution for \{\delta_i^\alpha(x_i)\} and \{w_i, w_i^\alpha\} that minimizes the upper bound. However, because of the fully decomposed form, the gradients w.r.t. \delta_{N_i} and w_{N_i}, (8)–(9), can be evaluated efficiently via local computation with cost O(|N_i| · d^{|\alpha|}), and again can be parallelized between nonadjacent nodes. To handle the normalization constraint (6) on w_{N_i}, we use an exponentiated gradient descent: let w_i = \exp(v_i) / \big[ \exp(v_i) + \sum_{\alpha} \exp(v_i^\alpha) \big] and w_i^\alpha = \exp(v_i^\alpha) / \big[ \exp(v_i) + \sum_{\alpha} \exp(v_i^\alpha) \big]; taking the gradient w.r.t. v_i and v_i^\alpha and transforming back gives the following update:

    w_i \propto w_i \exp\Big[ -\eta\, w_i \big( H(x_i; \mu_i) - \bar{H}_i \big) \Big],  \qquad  w_i^\alpha \propto w_i^\alpha \exp\Big[ -\eta\, w_i^\alpha \big( H(x_i \mid x_{\mathrm{pa}_i^\alpha}; \mu_\alpha) - \bar{H}_i \big) \Big],    (12)

where \eta is the step size and \mathrm{pa}_i^\alpha = \{j \in \alpha \mid j \succ i\}. In our implementation, we find that a few gradient steps (e.g., 5) with a backtracking line search using the Armijo rule work well in practice. 
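As a minimal illustration of the closed-form rule (our own sketch, not the authors' implementation): take a max node x2 (\tau_2 = 0) with two incident pairwise cliques, unit-weight sum neighbors x1 and x3 eliminated first, and all other shifts held at zero. One star update (11) with |N_i| = 2 reaches the block optimum and never loosens the bound:

```python
import math

# Pairwise log-potentials ta[x1][x2], tb[x2][x3]; x1, x3 are sum nodes
# (weight 1, eliminated first), x2 is a max node (weight 0, eliminated last).
ta = [[0.5, -0.2], [0.1, 0.9]]
tb = [[1.0, -0.3], [-0.8, 0.6]]

def g_a(x2):  # gamma for clique a: sum out x1, as in (11)
    return math.log(sum(math.exp(ta[x1][x2]) for x1 in (0, 1)))

def g_b(x2):  # gamma for clique b: sum out x3
    return math.log(sum(math.exp(tb[x2][x3]) for x3 in (0, 1)))

def bound(da, db):
    """Bound (5) restricted to node x2's block (singleton weight w2 = 0):
    singleton term plus one term per clique, each maximized over x2."""
    t0 = max(da[x2] + db[x2] for x2 in (0, 1))
    t1 = max(g_a(x2) - da[x2] for x2 in (0, 1))
    t2 = max(g_b(x2) - db[x2] for x2 in (0, 1))
    return t0 + t1 + t2

before = bound([0.0, 0.0], [0.0, 0.0])
# Star update (11) with |N_i| = 2: delta_a <- (2/3) g_a - (1/3) g_b, symmetrically.
da = [(2 * g_a(x2) - g_b(x2)) / 3 for x2 in (0, 1)]
db = [(2 * g_b(x2) - g_a(x2)) / 3 for x2 in (0, 1)]
after = bound(da, db)

assert after <= before + 1e-12                 # the update never loosens the bound
block_opt = max(g_a(x2) + g_b(x2) for x2 in (0, 1))
assert abs(after - block_opt) < 1e-12          # and reaches the block optimum
```

After the update, all three terms of the block objective share the same maximizing value of x2, which is exactly the matching condition (8) specialized to zero-weight nodes.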
Other\nmore advanced optimization methods, such as L-BFGS and Newton\u2019s method are also applicable.\n\ni )(cid:3); taking the\n\n(cid:1)(cid:3),\n\n\u03b1 exp(v\u03b1\n\n; \u00b5\u03b1) \u2212 \u00afHi\n\ni\n\ni\n\n6 Experiments\nIn this section, we demonstrate our algorithm on a set of real-world graphical models from recent\nUAI inference challenges, including two diagnostic Bayesian networks with 203 and 359 variables\nand max domain sizes 7 and 6, respectively, and several MRFs for pedigree analysis with up to 1289\nvariables, max domain size of 7 and clique size 5.7 We construct marginal MAP problems on these\nmodels by randomly selecting half of the variables to be max nodes, and the rest as sum nodes.\nWe implement several algorithms that optimize the same primal marginal MAP bound, including\nour GDD (Algorithm 1), the WMB algorithm in [16] with ibound = 1, which uses the same cliques\nand a \ufb01xed point heuristic for optimization, and an off-the-shelf L-BFGS implementation that di-\nrectly optimizes our decomposed bound. For comparison, we also computed several related primal\nbounds, including standard mini-bucket [2] and elimination reordering [27, 38], limited to the same\ncomputational limits (ibound = 1). We also tried MAS [20] but found its bounds extremely loose.8\nDecoding (\ufb01nding a con\ufb01guration \u02c6xB) is more dif\ufb01cult in marginal MAP than in joint MAP. We use\nthe same local decoding procedure that is standard in dual decomposition [31]. However, evaluating\nthe objective Q(\u02c6xB) involves a potentially dif\ufb01cult sum over xA, making it hard to score each\ndecoding. 
For this reason, we evaluate the score of each decoding, but show the most recent decoding rather than the best (as is standard in MAP) to simulate behavior in practice.

Figure 2 and Figure 3 compare the convergence of the different algorithms, where we define one iteration of each algorithm to correspond to a full sweep over the graph, with the same order of time complexity: one iteration of GDD is defined in Algorithm 1; for WMB it is a full forward and backward message pass, as in Algorithm 2 of [16]; and for L-BFGS it is a joint quasi-Newton step on all variables. The elimination order that we use is obtained by a weighted-min-fill heuristic [1], constrained to eliminate the sum nodes first.

(a) BN-1 (203 nodes)    (b) BN-2 (359 nodes)
Figure 2: Marginal MAP results on BN-1 and BN-2 with 50% randomly selected max-nodes (additional plots are in supplement B). We plot the upper bounds of the different algorithms across iterations; the objective function Q(x_B) (2) of the decoded solutions x_B is also shown (dashed lines). At the beginning, Q(x_B) may equal −∞ because of zero probabilities.

(a) pedigree1 (334 nodes)    (b) pedigree7 (1068 nodes)    (c) pedigree9 (1118 nodes)
Figure 3: Marginal MAP inference on three pedigree models (additional plots are in supplement C). We randomly select half the nodes as max-nodes in these models. We tune the damping rate of WMB from 0.01 to 0.05.

Diagnostic Bayesian Networks. Figure 2(a)-(b) shows that our GDD converges quickly and monotonically on both networks, while WMB does not converge without proper damping; we experimented with different damping ratios for WMB, and found that it is slower than GDD even with the best ratio found (e.g., in Figure 2(a), WMB works best with damping ratio 0.035 (WMB-0.035), but is still significantly slower than GDD). Our GDD also gives better decoded marginal MAP solutions x_B (obtained by rounding the singleton beliefs). Both WMB and our GDD provide a much tighter bound than the non-iterative mini-bucket elimination (MBE) [2] or reordered elimination [27, 38] methods.

Genetic Pedigree Instances. Figure 3 shows similar results on a set of pedigree instances. Again, GDD outperforms WMB even with the best possible damping, and outperforms the non-iterative bounds after only one iteration (one pass through the graph).

⁷ See http://graphmod.ics.uci.edu/uai08/Evaluation/Report/Benchmarks.
⁸ The instances tested have many zero probabilities, which make finding lower bounds difficult; since MAS's bounds are symmetrized, this likely contributes to its upper bounds being loose.

7 Conclusion

In this work, we propose a new class of decomposition bounds for general powered-sum inference, which is capable of representing a large class of primal variational bounds but is much more computationally efficient. Unlike previous primal sum bounds, our bound decomposes into computations on small, local cliques, increasing efficiency and enabling parallel and monotonic optimization. We derive a block coordinate descent algorithm for optimizing our bound over both the cost-shifting parameters (reparameterization) and the weights (fractional counting numbers), which generalizes dual decomposition and enjoys a similar monotonic convergence property. Taking advantage of its monotonic convergence, our new algorithm can be widely applied as a building block for improved heuristic construction in search, or for more efficient learning algorithms.

Acknowledgments

This work is sponsored in part by NSF grants IIS-1065618 and IIS-1254071. Alexander Ihler is also funded in part by the United States Air Force under Contract No.
FA8750-14-C-0011 under the DARPA PPAML program.

[Figure 2 and Figure 3 plot data: upper bound vs. iterations for WMB (several damping ratios), GDD, L-BFGS, MBE, and elimination reordering; Figure 2 also shows the decoded values Q(x_B) for WMB and GDD as dashed lines.]

References
[1] R. Dechter. Reasoning with probabilistic and deterministic graphical models: Exact algorithms. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2013.
[2] R. Dechter and I. Rish. Mini-buckets: A general scheme for bounded inference. JACM, 2003.
[3] J. Domke. Dual decomposition for marginal inference. In AAAI, 2011.
[4] A. Doucet, S. Godsill, and C. Robert. Marginal maximum a posteriori estimation using Markov chain Monte Carlo. Statistics and Computing, 2002.
[5] A. Globerson and T. Jaakkola. Approximate inference using conditional entropy decompositions. In AISTATS, 2007.
[6] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2008.
[7] G. H. Hardy, J. E. Littlewood, and G. Polya. Inequalities. Cambridge University Press, 1952.
[8] T. Hazan, J. Peng, and A. Shashua. Tightening fractional covering upper bounds on the partition function for high-order region graphs. In UAI, 2012.
[9] T. Hazan and A. Shashua. Convergent message-passing algorithms for inference over general graphs with convex free energies. In UAI, 2008.
[10] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 2010.
[11] A. Ihler, N. Flerova, R. Dechter, and L. Otten. Join-graph based cost-shifting schemes. In UAI, 2012.
[12] J. Jancsary and G. Matz. Convergent decomposition solvers for TRW free energies. In AISTATS, 2011.
[13] I. Kiselev and P. Poupart. Policy optimization by marginal MAP probabilistic inference in generative models. In AAMAS, 2014.
[14] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. TPAMI, 2011.
[15] Q. Liu. Reasoning and Decisions in Probabilistic Graphical Models – A Unified Framework. PhD thesis, University of California, Irvine, 2014.
[16] Q. Liu and A. Ihler. Bounding the partition function using Hölder's inequality. In ICML, 2011.
[17] Q. Liu and A. Ihler. Variational algorithms for marginal MAP. JMLR, 2013.
[18] R. Marinescu, R. Dechter, and A. Ihler. AND/OR search for marginal MAP. In UAI, 2014.
[19] D. Maua and C. de Campos. Anytime marginal maximum a posteriori inference. In ICML, 2012.
[20] C. Meek and Y. Wexler. Approximating max-sum-product problems using multiplicative error bounds. Bayesian Statistics, 2011.
[21] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing algorithms: a unifying view. In UAI, 2009.
[22] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In ECML/PKDD, 2011.
[23] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning efficiently with approximate inference via dual losses. In ICML, 2010.
[24] J. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. JMLR, 2010.
[25] J. Naradowsky, S. Riedel, and D. Smith. Improving NLP through marginalization of hidden syntactic structure. In EMNLP, 2012.
[26] S. Nowozin and C. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 6, 2011.
[27] J. Park and A. Darwiche. Solving MAP exactly using systematic search. In UAI, 2003.
[28] J. Park and A. Darwiche. Complexity results and approximation strategies for MAP explanations. JAIR, 2004.
[29] W. Ping, Q. Liu, and A. Ihler. Marginal structured SVM with hidden variables. In ICML, 2014.
[30] N. Ruozzi and S. Tatikonda. Message-passing algorithms: Reparameterizations and splittings. IEEE Transactions on Information Theory, 2013.
[31] D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. Optimization for Machine Learning, 2011.
[32] D. Sontag and T. Jaakkola. Tree block coordinate descent for MAP in graphical models. In AISTATS, 2009.
[33] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In UAI, 2008.
[34] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 2005.
[35] Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.
[36] T. Werner. A linear programming approach to max-sum problem: A review. TPAMI, 2007.
[37] J. Yarkony, C. Fowlkes, and A. Ihler. Covering trees and lower-bounds on quadratic assignment. In CVPR, 2010.
[38] C. Yuan and E. Hansen. Efficient computation of jointree bounds for systematic MAP search. In IJCAI, 2009.
[39] C. Yuan, T. Lu, and M. Druzdzel. Annealed MAP. In UAI, 2004.