{"title": "Concavity of reweighted Kikuchi approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 3473, "page_last": 3481, "abstract": "We analyze a reweighted version of the Kikuchi approximation for estimating the log partition function of a product distribution defined over a region graph. We establish sufficient conditions for the concavity of our reweighted objective function in terms of weight assignments in the Kikuchi expansion, and show that a reweighted version of the sum product algorithm applied to the Kikuchi region graph will produce global optima of the Kikuchi approximation whenever the algorithm converges. When the region graph has two layers, corresponding to a Bethe approximation, we show that our sufficient conditions for concavity are also necessary. Finally, we provide an explicit characterization of the polytope of concavity in terms of the cycle structure of the region graph. We conclude with simulations that demonstrate the advantages of the reweighted Kikuchi approach.", "full_text": "Concavity of reweighted Kikuchi approximation\n\nPo-Ling Loh\n\nloh@wharton.upenn.edu\n\nDepartment of Statistics\n\nThe Wharton School\n\nUniversity of Pennsylvania\n\nAndre Wibisono\n\nComputer Science Division\n\nUniversity of California, Berkeley\nwibisono@berkeley.edu\n\nAbstract\n\nWe analyze a reweighted version of the Kikuchi approximation for estimating the\nlog partition function of a product distribution de\ufb01ned over a region graph. We\nestablish suf\ufb01cient conditions for the concavity of our reweighted objective func-\ntion in terms of weight assignments in the Kikuchi expansion, and show that a\nreweighted version of the sum product algorithm applied to the Kikuchi region\ngraph will produce global optima of the Kikuchi approximation whenever the al-\ngorithm converges. 
When the region graph has two layers, corresponding to a\nBethe approximation, we show that our suf\ufb01cient conditions for concavity are\nalso necessary. Finally, we provide an explicit characterization of the polytope of\nconcavity in terms of the cycle structure of the region graph. We conclude with\nsimulations that demonstrate the advantages of the reweighted Kikuchi approach.\n\nIntroduction\n\n1\nUndirected graphical models are a familiar framework in diverse application domains such as com-\nputer vision, statistical physics, coding theory, social science, and epidemiology. In certain settings\nof interest, one is provided with potential functions de\ufb01ned over nodes and (hyper)edges of the\ngraph. A crucial step in probabilistic inference is to compute the log partition function of the distri-\nbution based on these potential functions for a given graph structure. However, computing the log\npartition function either exactly or approximately is NP-hard in general [2, 17]. An active area of re-\nsearch involves \ufb01nding accurate approximations of the log partition function and characterizing the\ngraph structures for which such approximations may be computed ef\ufb01ciently [29, 22, 7, 19, 25, 18].\nWhen the underlying graph is a tree, the log partition function may be computed exactly via the sum\nproduct algorithm in time linear in the number of nodes [15]. However, when the graph contains\ncycles, a generalized version of the sum product algorithm known as loopy belief propagation may\neither fail to converge or terminate in local optima of a nonconvex objective function [26, 20, 8, 13].\nIn this paper, we analyze the Kikuchi approximation method, which is constructed from a variational\nrepresentation of the log partition function by replacing the entropy with an expression that decom-\nposes with respect to a region graph. Kikuchi approximations were previously introduced in the\nphysics literature [9] and reformalized by Yedidia et al. 
[28, 29] and others [1, 14] in the language of\ngraphical models. The Bethe approximation, which is a special case of the Kikuchi approximation\nwhen the region graph has only two layers, has been studied by various authors [3, 28, 5, 25]. In ad-\ndition, a reweighted version of the Bethe approximation was proposed by Wainwright et al. [22, 16].\nAs described in Vontobel [21], computing the global optimum of the Bethe variational problem may\nin turn be used to approximate the permanent of a nonnegative square matrix.\nThe particular objective function that we study generalizes the Kikuchi objective appearing in pre-\nvious literature by assigning arbitrary weights to individual terms in the Kikuchi entropy expansion.\nWe establish necessary and suf\ufb01cient conditions under which this class of objective functions is\nconcave, so a global optimum may be found ef\ufb01ciently. Our theoretical results synthesize known re-\nsults on Kikuchi and Bethe approximations, and our main theorem concerning concavity conditions\nfor the reweighted Kikuchi entropy recovers existing results when specialized to the unweighted\n\n1\n\n\fKikuchi [14] or reweighted Bethe [22] case. Furthermore, we provide a valuable converse result\nin the reweighted Bethe case, showing that when our concavity conditions are violated, the entropy\nfunction cannot be concave over the whole feasible region. As demonstrated by our experiments,\na message-passing algorithm designed to optimize the Kikuchi objective may terminate in local\noptima for weights outside the concave region. Watanabe and Fukumizu [24, 25] provide a similar\nconverse in the unweighted Bethe case, but our proof is much simpler and our result is more general.\nIn the reweighted Bethe setting, we also present a useful characterization of the concave region of\nthe Bethe entropy function in terms of the geometry of the graph. 
Speci\ufb01cally, we show that if the\nregion graph consists of only singleton vertices and pairwise edges, then the region of concavity\ncoincides with the convex hull of incidence vectors of single-cycle forest subgraphs of the original\ngraph. When the region graph contains regions with cardinality greater than two, the latter region\nmay be strictly contained in the former; however, our result provides a useful way to generate weight\nvectors within the region of concavity. Whereas Wainwright et al. [22] establish the concavity of\nthe reweighted Bethe objective on the spanning forest polytope, that region is contained within the\nsingle-cycle forest polytope, and our simulations show that generating weight vectors in the latter\npolytope may yield closer approximations to the log partition function.\nThe remainder of the paper is organized as follows: In Section 2, we review background information\nabout the Kikuchi and Bethe approximations. In Section 3, we provide our main results on concavity\nconditions for the reweighted Kikuchi approximation, including a geometric characterization of the\nregion of concavity in the Bethe case. Section 4 outlines the reweighted sum product algorithm\nand proves that \ufb01xed points correspond to global optima of the Kikuchi approximation. Section 5\npresents experiments showing the improved accuracy of the reweighted Kikuchi approximation over\nthe region of concavity. Technical proofs and additional simulations are contained in the Appendix.\n2 Background and problem setup\nIn this section, we review basic concepts of the Kikuchi approximation and establish some termi-\nnology to be used in the paper.\nLet G = (V, R) denote a region graph de\ufb01ned over the vertex set V , where each region r 2 R is a\nsubset of V . Directed edges correspond to inclusion, so r ! s is an edge of G if s \u2713 r. 
We use the following notation, for $r \\in R$:\n\n$\\mathcal{A}(r) := \\{s \\in R : r \\subsetneq s\\}$ (ancestors of $r$)\n$\\mathcal{F}(r) := \\{s \\in R : r \\subseteq s\\}$ (forebears of $r$)\n$\\mathcal{N}(r) := \\{s \\in R : r \\subseteq s \\text{ or } s \\subseteq r\\}$ (neighbors of $r$).\n\nFor $R' \\subseteq R$, we define $\\mathcal{A}(R') = \\bigcup_{r \\in R'} \\mathcal{A}(r)$, and we define $\\mathcal{F}(R')$ and $\\mathcal{N}(R')$ similarly.\n\nWe consider joint distributions $x = (x_s)_{s \\in V}$ that factorize over the region graph; i.e.,\n\n$$p(x) = \\frac{1}{Z(\\alpha)} \\prod_{r \\in R} \\alpha_r(x_r), \\tag{1}$$\n\nfor potential functions $\\alpha_r > 0$. Here, $Z(\\alpha)$ is the normalization factor, or partition function, which is a function of the potential functions $\\alpha_r$, and each variable $x_s$ takes values in a finite discrete set $\\mathcal{X}$. One special case of the factorization (1) is the pairwise Ising model, defined over a graph $G = (V, E)$, where the distribution is given by\n\n$$p_\\gamma(x) = \\exp\\Big( \\sum_{s \\in V} \\gamma_s(x_s) + \\sum_{(s,t) \\in E} \\gamma_{st}(x_s, x_t) - A(\\gamma) \\Big), \\tag{2}$$\n\nand $\\mathcal{X} = \\{-1, +1\\}$. Our goal is to analyze the log partition function\n\n$$\\log Z(\\alpha) = \\log \\Big\\{ \\sum_{x \\in \\mathcal{X}^{|V|}} \\prod_{r \\in R} \\alpha_r(x_r) \\Big\\}. \\tag{3}$$\n\n2.1 Variational representation\nIt is known from the theory of graphical models [14] that the log partition function (3) may be written in the variational form\n\n$$\\log Z(\\alpha) = \\sup_{\\{\\tau_r(x_r)\\} \\in \\Delta_R} \\Big\\{ \\sum_{r \\in R} \\sum_{x_r} \\tau_r(x_r) \\log(\\alpha_r(x_r)) + H(p_\\tau) \\Big\\}, \\tag{4}$$\n\nwhere $p_\\tau$ is the maximum entropy distribution with marginals $\\{\\tau_r(x_r)\\}$ and\n\n$$H(p) := -\\sum_x p(x) \\log p(x)$$\n\nis the usual entropy. Here, $\\Delta_R$ denotes the $R$-marginal polytope; i.e., $\\{\\tau_r(x_r) : r \\in R\\} \\in \\Delta_R$ if and only if there exists a distribution $\\tau(x)$ such that $\\tau_r(x_r) = \\sum_{x_{\\setminus r}} \\tau(x_r, x_{\\setminus r})$ for all $r$. For ease of notation, we also write $\\tau \\equiv \\{\\tau_r(x_r) : r \\in R\\}$. Let $\\theta \\equiv \\theta(x)$ denote the collection of log potential functions $\\{\\log(\\alpha_r(x_r)) : r \\in R\\}$. Then equation (4) may be rewritten as\n\n$$\\log Z(\\theta) = \\sup_{\\tau \\in \\Delta_R} \\{\\langle \\theta, \\tau \\rangle + H(p_\\tau)\\}. \\tag{5}$$\n\nSpecializing to the Ising model (2), equation (5) gives the variational representation\n\n$$A(\\gamma) = \\sup_{\\mu \\in \\mathbb{M}} \\{\\langle \\gamma, \\mu \\rangle + H(p_\\mu)\\}, \\tag{6}$$\n\nwhich appears in Wainwright and Jordan [23]. Here, $\\mathbb{M} \\equiv \\mathbb{M}(G)$ denotes the marginal polytope, corresponding to the collection of mean parameter vectors of the sufficient statistics in the exponential family representation (2), ranging over different values of $\\gamma$, and $p_\\mu$ is the maximum entropy distribution with mean parameters $\\mu$.\n\n2.2 Reweighted Kikuchi approximation\nAlthough the set $\\Delta_R$ appearing in the variational representation (5) is a convex polytope, it may have exponentially many facets [23]. Hence, we replace $\\Delta_R$ with the set\n\n$$\\Delta_R^K = \\Big\\{ \\tau : \\forall t, u \\in R \\text{ s.t. } t \\subseteq u, \\; \\sum_{x_{u \\setminus t}} \\tau_u(x_t, x_{u \\setminus t}) = \\tau_t(x_t) \\text{ and } \\forall u \\in R, \\; \\sum_{x_u} \\tau_u(x_u) = 1 \\Big\\}$$\n\nof locally consistent $R$-pseudomarginals. Note that $\\Delta_R \\subseteq \\Delta_R^K$ and the latter set has only polynomially many facets, making optimization more tractable.\nIn the case of the pairwise Ising model (2), we let $\\mathbb{L} \\equiv \\mathbb{L}(G)$ denote the polytope $\\Delta_R^K$. Then $\\mathbb{L}$ is the collection of nonnegative functions $\\tau = (\\tau_s, \\tau_{st})$ satisfying the marginalization constraints\n\n$$\\sum_{x_s} \\tau_s(x_s) = 1, \\quad \\forall s \\in V,$$\n$$\\sum_{x_t} \\tau_{st}(x_s, x_t) = \\tau_s(x_s) \\text{ and } \\sum_{x_s} \\tau_{st}(x_s, x_t) = \\tau_t(x_t), \\quad \\forall (s, t) \\in E.$$\n\nRecall that $\\mathbb{M}(G) \\subseteq \\mathbb{L}(G)$, with equality achieved if and only if the underlying graph $G$ is a tree. In the general case, we have $\\Delta_R = \\Delta_R^K$ when the Hasse diagram of the region graph admits a minimal representation that is loop-free (cf. Theorem 2 of Pakzad and Anantharam [14]).\nGiven a collection of $R$-pseudomarginals $\\tau$, we also replace the entropy term $H(p_\\tau)$, which is difficult to compute in general, by the approximation\n\n$$H(p_\\tau) \\approx \\sum_{r \\in R} \\rho_r H_r(\\tau_r) := H(\\tau; \\rho), \\tag{7}$$\n\nwhere $H_r(\\tau_r) := -\\sum_{x_r} \\tau_r(x_r) \\log \\tau_r(x_r)$ is the entropy computed over region $r$, and $\\{\\rho_r : r \\in R\\}$ are weights assigned to the regions. 
Note that in the pairwise Ising case (2), with $p := p_\\gamma$, we have the equality\n\n$$H(p) = \\sum_{s \\in V} H_s(p_s) - \\sum_{(s,t) \\in E} I_{st}(p_{st})$$\n\nwhen $G$ is a tree, where $I_{st}(p_{st}) = H_s(p_s) + H_t(p_t) - H_{st}(p_{st})$ denotes the mutual information and $p_s$ and $p_{st}$ denote the node and edge marginals. Hence, the approximation (7) is exact with\n\n$$\\rho_{st} = 1, \\quad \\forall (s,t) \\in E, \\qquad \\text{and} \\qquad \\rho_s = 1 - \\deg(s), \\quad \\forall s \\in V.$$\n\nUsing the approximation (7), we arrive at the following reweighted Kikuchi approximation:\n\n$$B(\\theta; \\rho) := \\sup_{\\tau \\in \\Delta_R^K} \\underbrace{\\{\\langle \\theta, \\tau \\rangle + H(\\tau; \\rho)\\}}_{B_{\\theta,\\rho}(\\tau)}. \\tag{8}$$\n\nNote that when $\\{\\rho_r\\}$ are the overcounting numbers $\\{c_r\\}$, defined recursively by\n\n$$c_r = 1 - \\sum_{s \\in \\mathcal{A}(r)} c_s, \\tag{9}$$\n\nthe expression (8) reduces to the usual (unweighted) Kikuchi approximation considered in Pakzad and Anantharam [14].\n\n3 Main results and consequences\nIn this section, we analyze the concavity of the Kikuchi variational problem (8). We derive a sufficient condition under which the function $B_{\\theta,\\rho}(\\tau)$ is concave over the set $\\Delta_R^K$, so global optima of the reweighted Kikuchi approximation may be found efficiently. In the Bethe case, we also show that the condition is necessary for $B_{\\theta,\\rho}(\\tau)$ to be concave over the entire region $\\Delta_R^K$, and we provide a geometric characterization of $\\Delta_R^K$ in terms of the edge and cycle structure of the graph.\n\n3.1 Sufficient conditions for concavity\n\nWe begin by establishing sufficient conditions for the concavity of $B_{\\theta,\\rho}(\\tau)$. Clearly, this is equivalent to establishing conditions under which $H(\\tau; \\rho)$ is concave. Our main result is the following:\nTheorem 1. 
If $\\rho \\in \\mathbb{R}^{|R|}$ satisfies\n\n$$\\sum_{s \\in \\mathcal{F}(S)} \\rho_s \\ge 0, \\quad \\forall S \\subseteq R, \\tag{10}$$\n\nthen the Kikuchi entropy $H(\\tau; \\rho)$ is strictly concave on $\\Delta_R^K$.\n\nThe proof of Theorem 1 is contained in Appendix A.1, and makes use of a generalization of Hall\u2019s marriage lemma for weighted graphs (cf. Lemma 1 in Appendix A.2).\nThe condition (10) depends heavily on the structure of the region graph. For the sake of interpretability, we now specialize to the case where the region graph has only two layers, with the first layer corresponding to vertices and the second layer corresponding to hyperedges. In other words, for $r, s \\in R$, we have $r \\subseteq s$ only if $|r| = 1$, and $R = V \\cup F$, where $F$ is the set of hyperedges and $V$ denotes the set of singleton vertices. This is the Bethe case, and the entropy\n\n$$H(\\tau; \\rho) = \\sum_{s \\in V} \\rho_s H_s(\\tau_s) + \\sum_{\\alpha \\in F} \\rho_\\alpha H_\\alpha(\\tau_\\alpha) \\tag{11}$$\n\nis consequently known as the Bethe entropy.\nThe following result is proved in Appendix A.3:\nCorollary 1. Suppose $\\rho_\\alpha \\ge 0$ for all $\\alpha \\in F$, and the following condition also holds:\n\n$$\\sum_{s \\in U} \\rho_s + \\sum_{\\alpha \\in F : \\alpha \\cap U \\neq \\emptyset} \\rho_\\alpha \\ge 0, \\quad \\forall U \\subseteq V. \\tag{12}$$\n\nThen the Bethe entropy $H(\\tau; \\rho)$ is strictly concave over $\\Delta_R^K$.\n\n3.2 Necessary conditions for concavity\n\nWe now establish a converse to Corollary 1 in the Bethe case, showing that condition (12) is also necessary for the concavity of the Bethe entropy. When $\\rho_\\alpha = 1$ for $\\alpha \\in F$ and $\\rho_s = 1 - |\\mathcal{N}(s)|$ for $s \\in V$, we recover the result of Watanabe and Fukumizu [25] for the unweighted Bethe case. However, our proof technique is significantly simpler and avoids the complex machinery of graph zeta functions. Our approach proceeds by considering the Bethe entropy $H(\\tau; \\rho)$ on appropriate slices of the domain $\\Delta_R^K$ so as to extract condition (12) for each $U \\subseteq V$. The full proof is provided in Appendix B.1.\nTheorem 2. 
If the Bethe entropy $H(\\tau; \\rho)$ is concave over $\\Delta_R^K$, then $\\rho_\\alpha \\ge 0$ for all $\\alpha \\in F$, and condition (12) holds.\n\nIndeed, as demonstrated in the simulations of Section 5, the Bethe objective function $B_{\\theta,\\rho}(\\tau)$ may have multiple local optima if $\\rho$ does not satisfy condition (12).\n\n3.3 Polytope of concavity\n\nWe now characterize the polytope defined by the inequalities (12). We show that in the pairwise Bethe case, the polytope may be expressed geometrically as the convex hull of single-cycle forests formed by the edges of the graph. In the more general (non-pairwise) Bethe case, however, the polytope of concavity may strictly contain the latter set.\nNote that the Bethe entropy (11) may be written in the alternative form\n\n$$H(\\tau; \\rho) = \\sum_{s \\in V} \\rho'_s H_s(\\tau_s) - \\sum_{\\alpha \\in F} \\rho_\\alpha \\widetilde{I}_\\alpha(\\tau_\\alpha), \\tag{13}$$\n\nwhere $\\widetilde{I}_\\alpha(\\tau_\\alpha) := \\{\\sum_{s \\in \\alpha} H_s(\\tau_s)\\} - H_\\alpha(\\tau_\\alpha)$ is the KL divergence between the joint distribution $\\tau_\\alpha$ and the product distribution $\\prod_{s \\in \\alpha} \\tau_s$, and the weights $\\rho'_s$ are defined appropriately.\nWe show that the polytope of concavity has a nice geometric characterization when $\\rho'_s = 1$ for all $s \\in V$, and $\\rho_\\alpha \\in [0, 1]$ for all $\\alpha \\in F$. Note that this assignment produces the expression for the reweighted Bethe entropy analyzed in Wainwright et al. [22] (when all elements of $F$ have cardinality two). Equation (13) then becomes\n\n$$H(\\tau; \\rho) = \\sum_{s \\in V} \\Big( 1 - \\sum_{\\alpha \\in \\mathcal{N}(s)} \\rho_\\alpha \\Big) H_s(\\tau_s) + \\sum_{\\alpha \\in F} \\rho_\\alpha H_\\alpha(\\tau_\\alpha), \\tag{14}$$\n\nand the inequalities (12) defining the polytope of concavity are\n\n$$\\sum_{\\alpha \\in F : \\alpha \\cap U \\neq \\emptyset} (|\\alpha \\cap U| - 1) \\rho_\\alpha \\le |U|, \\quad \\forall U \\subseteq V. \\tag{15}$$\n\nConsequently, we define\n\n$$\\mathbb{C} := \\Big\\{ \\rho \\in [0, 1]^{|F|} : \\sum_{\\alpha \\in F : \\alpha \\cap U \\neq \\emptyset} (|\\alpha \\cap U| - 1) \\rho_\\alpha \\le |U|, \\quad \\forall U \\subseteq V \\Big\\}.$$\n\nBy Theorem 2, the set $\\mathbb{C}$ is the region of concavity for the Bethe entropy (14) within $[0, 1]^{|F|}$.\nWe also define the set\n\n$$\\mathbb{F} := \\{1_{F'} : F' \\subseteq F \\text{ and } F' \\cup \\mathcal{N}(F') \\text{ is a single-cycle forest in } G\\} \\subseteq \\{0, 1\\}^{|F|},$$\n\nwhere a single-cycle forest is defined to be a subset of edges of a graph such that each connected component contains at most one cycle. (We disregard the directions of edges in $G$.) The following theorem gives our main result. The proof is contained in Appendix C.1.\nTheorem 3. In the Bethe case (i.e., the region graph $G$ has two layers), we have the containment $\\mathrm{conv}(\\mathbb{F}) \\subseteq \\mathbb{C}$. If in addition $|\\alpha| = 2$ for all $\\alpha \\in F$, then $\\mathrm{conv}(\\mathbb{F}) = \\mathbb{C}$.\nThe significance of Theorem 3 is that it provides us with a convenient graph-based method for constructing vectors $\\rho \\in \\mathbb{C}$. From the inequalities (15), it is not even clear how to efficiently verify whether a given $\\rho \\in [0, 1]^{|F|}$ lies in $\\mathbb{C}$, since it involves testing $2^{|V|}$ inequalities.\nComparing Theorem 3 with known results, note that in the pairwise case ($|\\alpha| = 2$ for all $\\alpha \\in F$), Theorem 1 of Wainwright et al. [22] states that the Bethe entropy is concave over $\\mathrm{conv}(\\mathbb{T})$, where $\\mathbb{T} \\subseteq \\{0, 1\\}^{|E|}$ is the set of edge indicator vectors for spanning forests of the graph. It is trivial to check that $\\mathbb{T} \\subseteq \\mathbb{F}$, since every spanning forest is also a single-cycle forest. Hence, Theorems 2 and 3 together imply a stronger result than in Wainwright et al. 
[22], characterizing the precise region of concavity for the Bethe entropy as a superset of the polytope $\\mathrm{conv}(\\mathbb{T})$ analyzed there. In the unweighted Kikuchi case, it is also known [1, 14] that the Kikuchi entropy is concave for the assignment $\\rho = 1_F$ when the region graph $G$ is connected and has at most one cycle. Clearly, $1_F \\in \\mathbb{C}$ in that case, so this result is a consequence of Theorems 2 and 3, as well. However, our theorems show that a much more general statement is true.\nIt is tempting to posit that $\\mathrm{conv}(\\mathbb{F}) = \\mathbb{C}$ holds more generally in the Bethe case. However, as the following example shows, settings arise where $\\mathrm{conv}(\\mathbb{F}) \\subsetneq \\mathbb{C}$. Details are contained in Appendix C.2.\nExample 1. Consider a two-layer region graph with vertices $V = \\{1, 2, 3, 4, 5\\}$ and factors $\\alpha_1 = \\{1, 2, 3\\}$, $\\alpha_2 = \\{2, 3, 4\\}$, and $\\alpha_3 = \\{3, 4, 5\\}$. Then $(1, \\frac{1}{2}, 1) \\in \\mathbb{C} \\setminus \\mathrm{conv}(\\mathbb{F})$.\nIn fact, Example 1 is a special case of a more general statement, which we state in the following proposition. Here, $\\mathfrak{F} := \\{F' \\subseteq F : 1_{F'} \\in \\mathbb{F}\\}$, and an element $F^* \\in \\mathfrak{F}$ is maximal if it is not contained in another element of $\\mathfrak{F}$.\n\nProposition 1. Suppose (i) $G$ is not a single-cycle forest, and (ii) there exists a maximal element $F^* \\in \\mathfrak{F}$ such that the induced subgraph $F^* \\cup \\mathcal{N}(F^*)$ is a forest. Then $\\mathrm{conv}(\\mathbb{F}) \\subsetneq \\mathbb{C}$.\nThe proof of Proposition 1 is contained in Appendix C.3. Note that if $|\\alpha| = 2$ for all $\\alpha \\in F$, then condition (ii) is violated whenever condition (i) holds, so Proposition 1 provides a partial converse to Theorem 3.\n4 Reweighted sum product algorithm\nIn this section, we provide an iterative message passing algorithm to optimize the Kikuchi variational problem (8). 
As in the case of the generalized belief propagation algorithm for the unweighted Kikuchi approximation [28, 29, 11, 14, 12, 27] and the reweighted sum product algorithm for the Bethe approximation [22], our message passing algorithm searches for stationary points of the Lagrangian version of the problem (8). When $\\rho$ satisfies condition (10), Theorem 1 implies that the problem (8) is strictly concave, so the unique fixed point of the message passing algorithm globally maximizes the Kikuchi approximation.\nLet $G = (V, R)$ be a region graph defining our Kikuchi approximation. Following Pakzad and Anantharam [14], for $r, s \\in R$, we write $r \\prec s$ if $r \\subsetneq s$ and there does not exist $t \\in R$ such that $r \\subsetneq t \\subsetneq s$. For $r \\in R$, we define the parent set of $r$ to be $\\mathcal{P}(r) = \\{s \\in R : r \\prec s\\}$ and the child set of $r$ to be $\\mathcal{C}(r) = \\{s \\in R : s \\prec r\\}$. With this notation, $\\tau = \\{\\tau_r(x_r) : r \\in R\\}$ belongs to the set $\\Delta_R^K$ if and only if $\\sum_{x_{s \\setminus r}} \\tau_s(x_r, x_{s \\setminus r}) = \\tau_r(x_r)$ for all $r \\in R$, $s \\in \\mathcal{P}(r)$.\nThe message passing algorithm we propose is as follows: For each $r \\in R$ and $s \\in \\mathcal{P}(r)$, let $M_{sr}(x_r)$ denote the message passed from $s$ to $r$ at assignment $x_r$. Starting with an arbitrary positive initialization of the messages, we repeatedly perform the following updates for all $r \\in R$, $s \\in \\mathcal{P}(r)$:\n\n$$M_{sr}(x_r) \\leftarrow C \\left[ \\frac{\\sum_{x_{s \\setminus r}} \\exp\\big(\\theta_s(x_s)/\\rho_s\\big) \\prod_{v \\in \\mathcal{P}(s)} M_{vs}(x_s)^{\\rho_v/\\rho_s} \\prod_{w \\in \\mathcal{C}(s) \\setminus r} M_{sw}(x_w)^{-1}}{\\exp\\big(\\theta_r(x_r)/\\rho_r\\big) \\prod_{u \\in \\mathcal{P}(r) \\setminus s} M_{ur}(x_r)^{\\rho_u/\\rho_r} \\prod_{t \\in \\mathcal{C}(r)} M_{rt}(x_t)^{-1}} \\right]^{\\frac{\\rho_r}{\\rho_r + \\rho_s}}. \\tag{16}$$\n\nHere, $C > 0$ may be chosen to ensure a convenient normalization condition; e.g., $\\sum_{x_r} M_{sr}(x_r) = 1$. Upon convergence of the updates (16), we compute the pseudomarginals according to\n\n$$\\tau_r(x_r) \\propto \\exp\\Big( \\frac{\\theta_r(x_r)}{\\rho_r} \\Big) \\prod_{s \\in \\mathcal{P}(r)} M_{sr}(x_r)^{\\rho_s/\\rho_r} \\prod_{t \\in \\mathcal{C}(r)} M_{rt}(x_t)^{-1}, \\tag{17}$$\n\nand we obtain the corresponding Kikuchi approximation by computing the objective function (8) with these pseudomarginals. 
We have the following result, which is proved in Appendix D:\nTheorem 4. The pseudomarginals \u2327 speci\ufb01ed by the \ufb01xed points of the messages {Msr(xr)} via\nthe updates (16) and (17) correspond to the stationary points of the Lagrangian associated with the\nKikuchi approximation problem (8).\n\nAs with the standard belief propagation and reweighted sum product algorithms, we have several\noptions for implementing the above message passing algorithm in practice. For example, we may\nperform the updates (16) using serial or parallel schedules. To improve the convergence of the\nalgorithm, we may damp the updates by taking a convex combination of new and previous messages\nusing an appropriately chosen step size. As noted by Pakzad and Anantharam [14], we may also use\na minimal graphical representation of the Hasse diagram to lower the complexity of the algorithm.\nFinally, we remark that although our message passing algorithm proceeds in the same spirit as clas-\nsical belief propagation algorithms by operating on the Lagrangian of the objective function, our\nalgorithm as presented above does not immediately reduce to the generalized belief propagation\nalgorithm for unweighted Kikuchi approximations or the reweighted sum product algorithm for\ntree-reweighted pairwise Bethe approximations. Previous authors use algebraic relations between\nthe overcounting numbers (9) in the Kikuchi case [28, 29, 11, 14] and the two-layer structure of the\nHasse diagram in the Bethe case [22] to obtain a simpli\ufb01ed form of the updates. 
Since the coefficients $\\rho$ in our problem lack the same algebraic relations, following the message-passing protocol used in previous work [11, 28] leads to more complicated updates, so we present a slightly different algorithm that still optimizes the general reweighted Kikuchi objective.\n\n5 Experiments\nIn this section, we present empirical results to demonstrate the advantages of the reweighted Kikuchi approximation that support our theoretical results. For simplicity, we focus on the binary pairwise Ising model given in equation (2). Without loss of generality, we may take the potentials to be $\\gamma_s(x_s) = \\gamma_s x_s$ and $\\gamma_{st}(x_s, x_t) = \\gamma_{st} x_s x_t$ for some $\\gamma = (\\gamma_s, \\gamma_{st}) \\in \\mathbb{R}^{|V|+|E|}$. We run our experiments on two types of graphs: (1) $K_n$, the complete graph on $n$ vertices, and (2) $T_n$, the $\\sqrt{n} \\times \\sqrt{n}$ toroidal grid graph where every vertex has degree four.\nBethe approximation. We consider the pairwise Bethe approximation of the log partition function $A(\\gamma)$ with weights $\\rho_{st} \\ge 0$ and $\\rho_s = 1 - \\sum_{t \\in \\mathcal{N}(s)} \\rho_{st}$. Because of the regularity structure of $K_n$ and $T_n$, we take $\\rho_{st} = \\rho \\ge 0$ for all $(s, t) \\in E$ and study the behavior of the Bethe approximation as $\\rho$ varies. For this particular choice of weight vector $\\vec{\\rho} = \\rho 1_E$, we define\n\n$$\\rho_{\\text{tree}} = \\max\\{\\rho \\ge 0 : \\vec{\\rho} \\in \\mathrm{conv}(\\mathbb{T})\\}, \\qquad \\text{and} \\qquad \\rho_{\\text{cycle}} = \\max\\{\\rho \\ge 0 : \\vec{\\rho} \\in \\mathrm{conv}(\\mathbb{F})\\}.$$\n\nIt is easily verified that for $K_n$, we have $\\rho_{\\text{tree}} = \\frac{2}{n}$ and $\\rho_{\\text{cycle}} = \\frac{2}{n-1}$; while for $T_n$, we have $\\rho_{\\text{tree}} = \\frac{n-1}{2n}$ and $\\rho_{\\text{cycle}} = \\frac{1}{2}$.\nOur results in Section 3 imply that the Bethe objective function $B_{\\gamma,\\rho}(\\tau)$ in equation (8) is concave if and only if $\\rho \\le \\rho_{\\text{cycle}}$, and Wainwright et al. [22] show that we have the bound $A(\\gamma) \\le B(\\gamma; \\rho)$ for $\\rho \\le \\rho_{\\text{tree}}$. Moreover, since the Bethe entropy may be written in terms of the edge mutual information (13), the function $B(\\gamma; \\rho)$ is decreasing in $\\rho$. 
In our results below, we observe that we may obtain a tighter approximation to $A(\\gamma)$ by moving from the upper bound region $\\rho \\le \\rho_{\\text{tree}}$ to the concavity region $\\rho \\le \\rho_{\\text{cycle}}$. In addition, for $\\rho > \\rho_{\\text{cycle}}$, we observe multiple local optima of $B_{\\gamma,\\rho}(\\tau)$.\nProcedure. We generate a random potential $\\gamma = (\\gamma_s, \\gamma_{st}) \\in \\mathbb{R}^{|V|+|E|}$ for the Ising model (2) by sampling each potential $\\{\\gamma_s\\}_{s \\in V}$ and $\\{\\gamma_{st}\\}_{(s,t) \\in E}$ independently. We consider two types of models:\n\nAttractive: $\\gamma_{st} \\sim \\text{Uniform}[0, \\omega_{st}]$, and Mixed: $\\gamma_{st} \\sim \\text{Uniform}[-\\omega_{st}, \\omega_{st}]$.\n\nIn each case, $\\gamma_s \\sim \\text{Uniform}[0, \\omega_s]$. We set $\\omega_s = 0.1$ and $\\omega_{st} = 2$. Intuitively, the attractive model encourages variables in adjacent nodes to assume the same value, and it has been shown [18, 19] that the ordinary Bethe approximation ($\\rho_{st} = 1$) in an attractive model lower-bounds the log partition function. For $\\rho \\in [0, 2]$, we compute stationary points of $B_{\\gamma,\\rho}(\\tau)$ by running the reweighted sum product algorithm of Wainwright et al. [22]. We use a damping factor of 0.5, a convergence threshold of $10^{-10}$ for the average change of messages, and at most 2500 iterations. We repeat this process with at least 8 random initializations for each value of $\\rho$. Figure 1 shows the scatter plots of $\\rho$ and the Bethe approximation $B_{\\gamma,\\rho}(\\tau)$. In each plot, the two vertical lines are the boundaries $\\rho = \\rho_{\\text{tree}}$ and $\\rho = \\rho_{\\text{cycle}}$, and the horizontal line is the value of the true log partition function $A(\\gamma)$.\n\nResults. Figures 1(a)\u20131(d) show the results of our experiments on small graphs ($K_5$ and $T_9$) for both attractive and mixed models. We see that the Bethe approximation with $\\rho \\le \\rho_{\\text{cycle}}$ generally provides a better approximation to $A(\\gamma)$ than the Bethe approximation computed over $\\rho \\le \\rho_{\\text{tree}}$. However, in general we cannot guarantee whether $B(\\gamma; \\rho)$ will give an upper or lower bound for $A(\\gamma)$ when $\\rho \\le \\rho_{\\text{cycle}}$. 
As noted above, we have $B(\\gamma; 1) \\le A(\\gamma)$ for attractive models.\nWe also observe from Figures 1(a)\u20131(d) that shortly after $\\rho$ leaves the concavity region $\\{\\rho \\le \\rho_{\\text{cycle}}\\}$, multiple local optima emerge for the Bethe objective function. The presence of the point clouds near $\\rho = 1$ in Figures 1(a) and 1(c) arises because the sum product algorithm has not converged after 2500 iterations. Indeed, the same phenomenon is true for all our results: in the region where multiple local optima begin to appear, it is more difficult for the algorithm to converge. See Figure 2 and the accompanying text in Appendix E for a plot of $\\rho$ against the base-10 logarithm of the final average change in the messages at termination of the algorithm. From Figure 2, we see that this final change is significantly higher for the values of $\\rho$ near where multiple local optima emerge. We suspect that for these values of $\\rho$, the sum product algorithm fails to converge since distinct local optima are close together, so messages oscillate between the optima. For larger values of $\\rho$, the local optima become sufficiently separated and the algorithm converges to one of them. 
However, it is interesting to note that this point cloud phenomenon does not appear for attractive models, despite the presence of distinct local optima.\nSimulations for larger graphs are shown in Figures 1(e)\u20131(h). If we zoom into the region near $\\rho \\le \\rho_{\\text{cycle}}$, we still observe the same behavior that $\\rho \\le \\rho_{\\text{cycle}}$ generally provides a better Bethe approximation than $\\rho \\le \\rho_{\\text{tree}}$. Moreover, the presence of the point clouds and multiple local optima are more pronounced, and we see from Figures 1(c), 1(g), and 1(h) that new local optima with even worse Bethe values arise for larger values of $\\rho$. Finally, we note that the same qualitative behavior also occurs in all the other graphs that we have tried ($K_n$ for $n \\in \\{5, 10, 15, 20, 25\\}$ and $T_n$ for $n \\in \\{9, 16, 25, 36, 49, 64\\}$), with multiple random instances of the Ising model $p_\\gamma$.\n\n[Figure 1: Values of the reweighted Bethe approximation as a function of $\\rho$. See text for details. Panels: (a) K5, mixed; (b) K5, attractive; (c) T9, mixed; (d) T9, attractive; (e) K15, mixed; (f) K15, attractive; (g) T25, mixed; (h) T25, attractive. Horizontal axis: $\\rho$; vertical axis: Bethe approximation; marked lines: $\\rho_{\\text{tree}}$, $\\rho_{\\text{cycle}}$, $A(\\gamma)$.]\n\n6 Discussion\nIn this paper, we have analyzed the reweighted Kikuchi approximation method for estimating the log partition function of a distribution that factorizes over a region graph. We have characterized necessary and sufficient conditions for the concavity of the variational objective function, generalizing existing results in literature. Our simulations demonstrate the advantages of using the reweighted Kikuchi approximation and show that multiple local optima may appear outside the region of concavity.\nAn interesting future research direction is to obtain a better understanding of the approximation guarantees of the reweighted Bethe and Kikuchi methods. In the Bethe case with attractive potentials $\\theta$, several recent results [22, 19, 18] establish that the Bethe approximation $B(\\theta; \\rho)$ is an upper bound to the log partition function $A(\\theta)$ when $\\rho$ lies in the spanning tree polytope, whereas $B(\\theta; \\rho) \\le A(\\theta)$ when $\\rho = 1_F$. 
By continuity, we must have B(\u03b8; \u03c1*) = A(\u03b8) for some values of \u03c1*, and it would\nbe interesting to characterize such values where the reweighted Bethe approximation is exact.\nAnother interesting direction is to extend our theoretical results on properties of the reweighted\nKikuchi approximation, which currently depend solely on the structure of the region graph and the\nweights \u03c1, to incorporate the effect of the model potentials \u03b8. For example, several authors [20, 6]\npresent conditions under which loopy belief propagation applied to the unweighted Bethe approximation\nhas a unique fixed point. The conditions for uniqueness of fixed points slightly generalize the\nconditions for convexity, and they involve both the graph structure and the strength of the potentials.\nWe suspect that similar results would hold for the reweighted Kikuchi approximation.\n\nAcknowledgments. The authors thank Martin Wainwright for introducing the problem to them\nand providing helpful guidance. The authors also thank Varun Jog for discussions regarding the\ngeneralization of Hall\u2019s lemma. The authors thank the anonymous reviewers for feedback that improved\nthe clarity of the paper. PL was partly supported by a Hertz Foundation Fellowship and an\nNSF Graduate Research Fellowship while at Berkeley.\n\nReferences\n[1] S. M. Aji and R. J. McEliece. The generalized distributive law and free energy minimization. In Proceedings of the 39th Allerton Conference, 2001.\n[2] F. Barahona. On the computational complexity of Ising spin glass models. Journal of Physics A: Mathematical and General, 15(10):3241, 1982.\n[3] H. A. Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 150(871):552\u2013575, 1935.\n[4] P. Hall. On representatives of subsets. Journal of the London Mathematical Society, 10:26\u201330, 1935.\n[5] T. Heskes. 
Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In Advances in Neural Information Processing Systems 15, 2002.\n[6] T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11):2379\u20132413, 2004.\n[7] T. Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26:153\u2013190, 2006.\n[8] A. T. Ihler, J. W. Fisher III, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6:905\u2013936, December 2005.\n[9] R. Kikuchi. A theory of cooperative phenomena. Phys. Rev., 81:988\u20131003, March 1951.\n[10] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer, 4th edition, 2007.\n[11] R. J. McEliece and M. Yildirim. Belief propagation on partially ordered sets. In Mathematical Systems Theory in Biology, Communications, Computation, and Finance, pages 275\u2013300, 2002.\n[12] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing algorithms: a unifying view. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI \u201909, 2009.\n[13] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422\u20134437, December 2007.\n[14] P. Pakzad and V. Anantharam. Estimation and marginalization using Kikuchi approximation methods. Neural Computation, 17:1836\u20131873, 2003.\n[15] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.\n[16] T. Roosta, M. J. Wainwright, and S. S. Sastry. Convergence analysis of reweighted sum-product algorithms. 
IEEE Transactions on Signal Processing, 56(9):4293\u20134305, 2008.\n[17] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(12):273\u2013302, 1996.\n[18] N. Ruozzi. The Bethe partition function of log-supermodular graphical models. In Advances in Neural Information Processing Systems 25, 2012.\n[19] E. B. Sudderth, M. J. Wainwright, and A. S. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In Advances in Neural Information Processing Systems 20, 2007.\n[20] S. C. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs measures. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, UAI \u201902, 2002.\n[21] P. O. Vontobel. The Bethe permanent of a nonnegative matrix. IEEE Transactions on Information Theory, 59(3):1866\u20131901, 2013.\n[22] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313\u20132335, 2005.\n[23] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1\u20132):1\u2013305, January 2008.\n[24] Y. Watanabe and K. Fukumizu. Graph zeta function in the Bethe free energy and loopy belief propagation. In Advances in Neural Information Processing Systems 22, 2009.\n[25] Y. Watanabe and K. Fukumizu. Loopy belief propagation, Bethe free energy and graph zeta function. arXiv preprint arXiv:1103.0605, 2011.\n[26] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1\u201341, 2000.\n[27] T. Werner. Primal view on belief propagation. In UAI 2010: Proceedings of the Conference of Uncertainty in Artificial Intelligence, pages 651\u2013657, Corvallis, Oregon, July 2010. AUAI Press.\n[28] J. S. Yedidia, W. T. Freeman, and Y. Weiss. 
Generalized belief propagation. In Advances in Neural Information Processing Systems 13, 2000.\n[29] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282\u20132312, 2005.", "award": [], "sourceid": 1808, "authors": [{"given_name": "Po-Ling", "family_name": "Loh", "institution": "University of Pennsylvania"}, {"given_name": "Andre", "family_name": "Wibisono", "institution": "UC Berkeley"}]}