{"title": "Making Pairwise Binary Graphical Models Attractive", "book": "Advances in Neural Information Processing Systems", "page_first": 1772, "page_last": 1780, "abstract": "Computing the partition function (i.e., the normalizing constant) of a given pairwise binary graphical model is NP-hard in general. As a result, the partition function is typically estimated by approximate inference algorithms such as belief propagation (BP) and tree-reweighted belief propagation (TRBP). The former provides reasonable estimates in practice but has convergence issues. The latter has better convergence properties but typically provides poorer estimates. In this work, we propose a novel scheme that has better convergence properties than BP and provably provides better partition function estimates in many instances than TRBP. In particular, given an arbitrary pairwise binary graphical model, we construct a specific ``attractive'' 2-cover. We explore the properties of this special cover and show that it can be used to construct an algorithm with the desired properties.", "full_text": "Making Pairwise Binary Graphical Models Attractive\n\nInstitute for Data Sciences and Engineering\n\nDepartment of Computer Science\n\nNicholas Ruozzi\n\nColumbia University\nNew York, NY 10027\nnr2493@columbia.edu\n\nTony Jebara\n\nColumbia University\nNew York, NY 10027\njebara@cs.columbia.edu\n\nAbstract\n\nComputing the partition function (i.e., the normalizing constant) of a given pairwise\nbinary graphical model is NP-hard in general. As a result, the partition\nfunction is typically estimated by approximate inference algorithms such as belief\npropagation (BP) and tree-reweighted belief propagation (TRBP). The former\nprovides reasonable estimates in practice but has convergence issues. The latter\nhas better convergence properties but typically provides poorer estimates. 
In this\nwork, we propose a novel scheme that has better convergence properties than BP\nand provably provides better partition function estimates in many instances than\nTRBP. In particular, given an arbitrary pairwise binary graphical model, we construct\na specific \u201cattractive\u201d 2-cover. We explore the properties of this special\ncover and show that it can be used to construct an algorithm with the desired\nproperties.\n\n1 Introduction\n\nGraphical models provide a mechanism for expressing the relationships among a collection of variables.\nMany applications in computer vision, coding theory, and machine learning can be reduced\nto performing statistical inference, either computing the partition function or the most likely configuration,\nof specific graphical models. In general models, both of these problems are NP-hard. As\na result, much effort has been invested in designing algorithms that can approximate, or in some\nspecial cases exactly solve, these inference problems.\nThe belief propagation algorithm (BP) is an efficient message-passing algorithm that is often used\nto approximate the partition function of a given graphical model. However, BP does not always\nconverge, and so-called convergent message-passing algorithms such as tree-reweighted belief propagation\n(TRBP) have been proposed as alternatives to BP. Such convergent message-passing algorithms\ncan be viewed as dual coordinate-descent schemes on a particular convex upper bound on the\npartition function [1]. 
While TRBP-style message-passing algorithms guarantee convergence under\nsuitable message-passing schedules, finding the optimal message-passing schedule can be cumbersome\nor impractical depending on the application, and TRBP often performs worse than BP in terms\nof estimating the partition function.\nThe primary goal of this work is to study alternatives to BP and TRBP that have better convergence\nproperties than BP and approximate the partition function better than TRBP. To that end, the so-called\n\u201cattractive\u201d graphical models (i.e., those models that do not contain frustrated cycles) stand\nout as a special case. Attractive graphical models have desirable computational properties: Weller\nand Jebara [2, 3] describe a polynomial time approximation scheme to minimize the Bethe free\nenergy of attractive models (note that BP, when it converges, is only guaranteed to reach a local optimum).\nIn addition, BP has much better convergence properties on attractive models than on general pairwise\nbinary models [4, 5].\n\nIn this work, we show how to approximate the inference problem over a general pairwise binary\ngraphical model as an inference problem over an attractive graphical model. Similar in spirit to\nthe work of Bayati et al. [6] and Ruozzi and Tatikonda [7], we will use graph covers in order to\nbetter understand the behavior of the Bethe approximation with respect to the partition function. In\nparticular, we will show that if a graphical model is strictly positive and contains even one frustrated\ncycle, then there exists a choice of external field and a 2-cover without frustrated cycles whose\npartition function provides a strict upper bound on the partition function of the original model.\nWe then show that the computation of the Bethe partition function can be approximated, or in some\ncases found exactly, by computing the Bethe partition function over this special cover. 
The required\ncomputations are easier on this \u201cattractive\u201d graph cover as computing the MAP assignment can be\ndone in polynomial time and there exists a polynomial time approximation scheme for computing\nthe Bethe partition function.\nWe illustrate the theory through a series of experiments on small models, grid graphs, and vertex-induced\nsubgraphs of the Epinions social network (see footnote 1). All of these models have frustrated cycles which\nmake the computation of their partition functions, marginals, and most-likely configurations exceedingly\ndifficult. In these experiments, the proposed scheme converges significantly more frequently\nthan BP and provides a better estimate of the partition function than TRBP.\n\n2 Prerequisites\n\nWe begin by reviewing pairwise binary graphical models, graph covers, the Bethe and TRBP approximations,\nand recent work on lower bounds.\n\n2.1 Pairwise Binary Graphical Models\nLet $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\geq 0}$ be a non-negative function. A function f factors with respect to a graph\nG = (V, E), if there exist potential functions $\\phi_i : \\{0,1\\} \\to \\mathbb{R}_{\\geq 0}$ for each i \u2208 V and $\\psi_{ij} : \\{0,1\\}^2 \\to \\mathbb{R}_{\\geq 0}$\nfor each (i, j) \u2208 E such that\n\n$$f(x_1, \\ldots, x_n) = \\prod_{i \\in V} \\phi_i(x_i) \\prod_{(i,j) \\in E} \\psi_{ij}(x_i, x_j).$$\n\nThe graph G together with the collection of potential functions \u03c6 and \u03c8 define a graphical model\nthat we will denote as (G; \u03c6, \u03c8). For clarity, we will often denote the corresponding function as\n$f^{(G;\\phi,\\psi)}(x)$. For a given graphical model (G; \u03c6, \u03c8), we are interested in computing the partition\nfunction\n\n$$Z(G; \\phi, \\psi) \\triangleq \\sum_{x \\in \\{0,1\\}^{|V|}} \\prod_{i \\in V} \\phi_i(x_i) \\prod_{(i,j) \\in E} \\psi_{ij}(x_i, x_j).$$\n\nWe will also be interested in computing the maximum value of f, sometimes referred to as the\nMAP problem. 
The problem of computing the MAP solution can be converted into the problem\nof computing the partition function by adding a temperature parameter, T, and taking the limit as\nT \u2192 0.\n\n$$\\max_x f^{(G;\\phi,\\psi)}(x) = \\lim_{T \\to 0} Z(G; \\phi^{1/T}, \\psi^{1/T})^T$$\n\nHere, $\\phi^{1/T}$ is the collection of potentials generated by taking each potential $\\phi_i(x_i)$ and raising it to\nthe 1/T power for all i \u2208 V, $x_i \\in \\{0,1\\}$.\n\n2.2 Graph Covers\n\nGraph covers have played an important role in our understanding of statistical inference in graphical\nmodels [8, 9]. Roughly speaking, if a graph H covers a graph G, then H looks locally the same as\nG.\nDefinition 2.1. A graph H covers a graph G = (V, E) if there exists a graph homomorphism\nh : H \u2192 G such that for all vertices i \u2208 G and all $j \\in h^{-1}(i)$, h maps the neighborhood \u2202j of j in\nH bijectively to the neighborhood \u2202i of i in G.\n\nFootnote 1: In the Epinions network, users are connected by agreement and disagreement edges and therefore frustrated\ncycles abound. By treating the network as a pairwise binary graphical model, we may compute the\ntrustworthiness of a user by performing marginal inference over a variable representing if the user is trusted or\nnot.\n\nFigure 1: An example of a graph cover, showing (a) a graph, G, and (b) one possible cover of G. The nodes in the cover are labeled for the node that they\ncopy in the base graph.\n\nIf h(j) = i, then we say that j \u2208 H is a copy of i \u2208 G. Further, H is said to be an M-cover of G\nif every vertex of G has exactly M copies in H. For an example of a graph cover, see Figure 1. 
For\na connected graph G = (V, E), each M-cover consists of M copies of each of the variable nodes\nof G with an edge joining each distinct copy of i \u2208 V to a distinct copy of j \u2208 V if and only if\n(i, j) \u2208 E.\nTo any M-cover $H = (V^H, E^H)$ of G given by the homomorphism h, we can associate a collection\nof potentials: the potential at node $i \\in V^H$ is equal to $\\phi_{h(i)}$, the potential at node h(i) \u2208 G, and\nfor each $(i, j) \\in E^H$, we associate the potential $\\psi_{h(i,j)}$. In this way, we can construct a function\n$f^{(H;\\phi^H,\\psi^H)} : \\{0,1\\}^{M|V|} \\to \\mathbb{R}_{\\geq 0}$ such that $f^{(H;\\phi^H,\\psi^H)}$ factorizes over H. We will say that the\ngraphical model $(H; \\phi^H, \\psi^H)$ is an M-cover of the graphical model (G; \u03c6, \u03c8) whenever H is an\nM-cover of G and $\\phi^H$ and $\\psi^H$ are derived from \u03c6 and \u03c8 as above.\n\n2.3 The Bethe Partition Function\n\nThe Bethe free energy is a standard approximation to the so-called Gibbs free energy that is motivated\nby ideas from statistical physics. 
TRBP and more general reweighted belief propagation\nalgorithms take advantage of a similar approximation.\nFor \u03c4 in the local marginal polytope,\n\n$$\\mathcal{T} \\triangleq \\Big\\{ \\tau \\geq 0 \\;\\Big|\\; \\forall (i,j) \\in E, \\sum_{x_j} \\tau_{ij}(x_i, x_j) = \\tau_i(x_i) \\text{ and } \\forall i \\in V, \\sum_{x_i} \\tau_i(x_i) = 1 \\Big\\},$$\n\nthe reweighted free energy approximation at temperature T = 1 is given by\n\n$$F_{B,\\rho}(G, \\tau; \\phi, \\psi) = U(\\tau; \\phi, \\psi) - H(\\tau, \\rho)$$\n\nwhere U is the energy,\n\n$$U(\\tau; \\phi, \\psi) = -\\sum_{i \\in V} \\sum_{x_i} \\tau_i(x_i) \\log \\phi_i(x_i) - \\sum_{(i,j) \\in E} \\sum_{x_i, x_j} \\tau_{ij}(x_i, x_j) \\log \\psi_{ij}(x_i, x_j),$$\n\nand H is an entropy approximation,\n\n$$H(\\tau, \\rho) = -\\sum_{i \\in V} \\sum_{x_i} \\tau_i(x_i) \\log \\tau_i(x_i) - \\sum_{(i,j) \\in E} \\sum_{x_i, x_j} \\rho_{ij} \\tau_{ij}(x_i, x_j) \\log \\frac{\\tau_{ij}(x_i, x_j)}{\\tau_i(x_i) \\tau_j(x_j)}.$$\n\nHere, $\\rho_{ij}$ controls the reweighting over the edge (i, j) in the graphical model.\nIf $\\rho_{ij} = 1$ for\nall (i, j) \u2208 E, then we call this the Bethe approximation and will typically drop the \u03c1, writing\n$Z_{B,\\vec{1}} = Z_B$. The reweighted partition function is then expressed in terms of the minimum value\nachieved by this approximation over $\\mathcal{T}$ as follows.\n\n$$Z_{B,\\rho}(G; \\phi, \\psi) = e^{-\\min_{\\tau \\in \\mathcal{T}} F_{B,\\rho}(G, \\tau; \\phi, \\psi)}$$\n\nSimilar to the exact partition function computation, the reweighted partition function at temperature\nT is given by $Z_{B,\\rho}(G; \\phi^{1/T}, \\psi^{1/T})^T$. The zero temperature limit corresponds to minimizing the\nenergy function over the local marginal polytope.\nIn practice, local optima of these free energy approximations can be found by a reweighted version of\nbelief propagation. The fixed points of this reweighted algorithm correspond to stationary points of\n$F_{B,\\rho}(G, \\tau; \\phi, \\psi)$ over $\\mathcal{T}$ [10]. 
The TRBP algorithm chooses the vector \u03c1 such that $\\rho_{ij}$ corresponds\nto the edge appearance probability of (i, j) over a convex combination of spanning trees. For these\nchoices of \u03c1, the reweighted free energy approximation is convex in \u03c4, $Z_{B,\\rho}(G; \\phi, \\psi)$ is always\nat least as large as the true partition function, and there exists an ordering of the message updates so that\nreweighted belief propagation is guaranteed to converge.\n\n2.4 Log-Supermodularity and Lower Bounds\n\nA recent theorem of Vontobel [8] provides a combinatorial characterization of the Bethe partition\nfunction in terms of graph covers.\nTheorem 2.2 (Vontobel [8]).\n\n$$Z_B(G; \\phi, \\psi) = \\limsup_{M \\to \\infty} \\sqrt[M]{\\sum_{H \\in \\mathcal{C}^M(G)} \\frac{Z(H; \\phi^H, \\psi^H)}{|\\mathcal{C}^M(G)|}}$$\n\nwhere $\\mathcal{C}^M(G)$ is the set of all M-covers of G.\nThis characterization suggests that bounds on the partition functions of individual graph covers can\nbe used to bound the Bethe partition function. This approach has recently been used to prove that the\nBethe partition function provides a lower bound on the true partition function in certain nice families\nof graphical models [8, 11, 12]. One such nice family is the family of so-called log-supermodular\n(aka attractive) graphical models.\nDefinition 2.3. A function $f : \\{0,1\\}^n \\to \\mathbb{R}_{\\geq 0}$ is log-supermodular if for all $x, y \\in \\{0,1\\}^n$\n\n$$f(x) f(y) \\leq f(x \\wedge y) f(x \\vee y)$$\n\nwhere $(x \\wedge y)_i = \\min\\{x_i, y_i\\}$ and $(x \\vee y)_i = \\max\\{x_i, y_i\\}$. Similarly, f is log-submodular if the\ninequality is reversed for all $x, y \\in \\{0,1\\}^n$.\nTheorem 2.4 (Ruozzi [11]). If (G; \u03c6, \u03c8) is a log-supermodular graphical model, then for any M-cover,\n$(H; \\phi^H, \\psi^H)$, of (G; \u03c6, \u03c8), $Z(H; \\phi^H, \\psi^H) \\leq Z(G; \\phi, \\psi)^M$.\nPlugging this result into Theorem 2.2, we can conclude that the Bethe partition function always\nlower bounds the true partition function in log-supermodular models.\n\n3 Switching Log-Supermodular Functions\n\nLet (G; \u03c6, \u03c8) be a pairwise binary graphical model. Each $\\psi_{ij}$, in this model, is either log-supermodular,\nlog-submodular, or both. In the case that each $\\psi_{ij}$ is log-supermodular, Theorem\n2.4 says that the partition function of the disconnected 2-cover of G provides an upper bound on the\npartition function of any other 2-cover of G.\nWhen the $\\psi_{ij}$ are not all log-supermodular, this is not necessarily the case. As an example, if G is\na 3-cycle, then, up to isomorphism, G has two distinct 2-covers: the 6-cycle and the graph consisting\nof two disconnected 3-cycles. Consider the pairwise binary graphical model for the independent set\nproblem on G = (V, E) given by the edge potentials $\\psi_{ij}(x_i, x_j) = 1 - x_i x_j$ for all (i, j) \u2208 E. We\ncan easily check that the 6-cycle has 18 distinct independent sets while the disconnected cover has\nonly 16 independent sets. That is, the disconnected 2-cover does not provide an upper bound on the\nnumber of independent sets in all 2-covers.\nSometimes graphical models that are not log-supermodular can be converted into log-supermodular\nmodels by performing a simple change of variables (e.g., for a fixed I \u2286 V, a change of variables\nthat sends $x_i \\mapsto 1 - x_i$ for each i \u2208 I and $x_i \\mapsto x_i$ for each i \u2208 V \\\\ I). As a change of variables\ndoes not change the partition function, we can directly apply Theorem 2.4 to the new model. We will\ncall such functions switching log-supermodular. 
These functions are the log-supermodular analog\nof the \u201cswitching supermodular\u201d and \u201cpermuted submodular\u201d functions considered by Crama and\nHammer [13] and Schlesinger [14] respectively.\nThe existence of a 2-cover whose partition function is larger than that of the disconnected 2-cover is not unique\nto the problem of counting independent sets. Such a cover exists whenever the base graphical model\nis not switching log-supermodular. In this section, we will describe one possible construction of a\nspecific 2-cover that is distinct from the disconnected 2-cover whenever the given graphical model\nis not switching log-supermodular and will always provide an upper bound on the true partition\nfunction.\n\nFigure 2: An example of the construction of the 2-cover $G_2$ for the same graph with different edge\npotentials. Here, dashed lines represent edges with log-submodular potentials. The graph in (b) is\nthe 2-cover construction of the graph in (a) and the graph in (d) is the 2-cover construction applied\nto the graph in (c). Note that the graph in (b) is connected while the graph in (d) is not.\n\n3.1 Signed Graphs\n\nIn order to understand when a graphical model can be converted into a log-supermodular model\nby switching some of the variables, we introduce the notion of a signed graph. A signed graph is\na graph in which each edge has an associated sign. For our graphical models, we will use a \u201c+\u201d\nto represent a log-supermodular edge and a \u201c\u2212\u201d to represent a log-submodular edge. The sign of\na cycle in the graph is positive if it has an even number of \u201c\u2212\u201d edges and negative otherwise. A\nsigned graph is said to be balanced if there are no negative cycles. 
Equivalently, a signed graph is\nbalanced if we can divide its vertices into two sets A and B such that all edges in the graph with\none endpoint in set A and the other endpoint in set B are negative and the remaining edges are\npositive [15]. Switching, or flipping, a variable as above has the effect of flipping the sign of all\nedges incident to the corresponding variable node in the graphical model: flipping a single variable\nconverts an incident log-supermodular edge into a log-submodular edge and vice versa. A graphical\nmodel is switching log-supermodular if and only if its signed graph is balanced.\nSigned graphs have been studied before in the context of graphical models. Watanabe [16] characterized\nsigned graphs for which belief propagation is guaranteed to have a unique fixed point. These\nresults depend only on the graph structure and the signs on the edges and not on the strength of the\npotentials.\n\n3.2 Switching Log-Supermodular 2-covers\n\nWe can always construct a 2-cover of a pairwise binary graphical model that is switching log-supermodular.\nDefinition 3.1. Given a pairwise binary graphical model (G; \u03c6, \u03c8), construct a 2-cover,\n$(G_2; \\phi^{G_2}, \\psi^{G_2})$ where $G_2 = (V^{G_2}, E^{G_2})$, as follows.\n\n\u2022 For each i \u2208 G, create two copies of i, denoted $i_1$ and $i_2$, in $V^{G_2}$.\n\u2022 For each edge (i, j) \u2208 E, if $\\psi_{ij}$ is log-supermodular, then add the edges $(i_1, j_1)$ and $(i_2, j_2)$\nto $E^{G_2}$. Otherwise, add the edges $(i_1, j_2)$ and $(i_2, j_1)$ to $E^{G_2}$.\n\n$G_2$ is switching log-supermodular. This follows from the characterization of Harary [15] as $G_2$ can\nbe divided into two sets $V_1$ and $V_2$ with only negative edges between the two partitions and positive\nedges elsewhere. See Figure 2 for an example of this construction.\nIf all of the potentials in (G; \u03c6, \u03c8) are log-supermodular, then $G_2$ is equal to the disconnected 2-cover\nof G. 
If all of the potentials in (G; \u03c6, \u03c8) are log-submodular, then $G_2$ is a bipartite graph.\nLemma 3.2. For a connected graph G, $(G_2; \\phi^{G_2}, \\psi^{G_2})$ is disconnected if and only if $f^{(G;\\phi,\\psi)}$ is\nswitching log-supermodular. Equivalently, $G_2$ is disconnected if and only if the signed version of G\nis balanced.\n\nReturning to the example of counting independent sets on a 3-cycle at the beginning of this section,\nwe can check that $G_2$ for this graphical model corresponds to the 6-cycle. The observation that the\n6-cycle has more independent sets than two disconnected copies of the 3-cycle is a special case of a\ngeneral theorem.\nTheorem 3.3. For any pairwise binary graphical model (G; \u03c6, \u03c8), $Z(G_2; \\phi^{G_2}, \\psi^{G_2}) \\geq Z(G; \\phi, \\psi)^2$.\n\nThe proof of Theorem 3.3 can be found in Appendix A of the supplementary material. Unlike\nTheorem 2.4, which yields lower bounds on the partition function, Theorem 3.3 yields an upper\nbound on the partition function: $Z(G; \\phi, \\psi) \\leq \\sqrt{Z(G_2; \\phi^{G_2}, \\psi^{G_2})}$.\n\n4 Properties of the Cover $G_2$\n\nIn this section, we study the implications that Theorem 2.4 and Theorem 3.3 have for characterizations\nof switching log-supermodular functions and the computation of the Bethe partition function.\n\n4.1 Field Independence\n\nWe begin with the simple observation that Theorem 3.3, like Theorem 2.4, does not depend on the\nchoice of external field. In fact, in the case that all of the edge potentials are strictly larger than zero,\nthis independence of external field completely characterizes switching log-supermodular graphical\nmodels.\nTheorem 4.1. For a pairwise binary graphical model (G; \u03c6, \u03c8) with strictly positive edge potentials\n\u03c8, the following are equivalent.\n\n1. $f^{(G;\\phi,\\psi)}(x)$ is switching log-supermodular.\n\n2. 
For all M \u2265 1, any external field $\\hat{\\phi}$, and any M-cover $(H; \\hat{\\phi}^H, \\psi^H)$ of $(G; \\hat{\\phi}, \\psi)$,\n$Z(H; \\hat{\\phi}^H, \\psi^H) \\leq Z(G; \\hat{\\phi}, \\psi)^M$.\n\n3. For all choices of the external field $\\hat{\\phi}$ and any 2-cover $(H; \\hat{\\phi}^H, \\psi^H)$ of $(G; \\hat{\\phi}, \\psi)$,\n$Z(H; \\hat{\\phi}^H, \\psi^H) \\leq Z(G; \\hat{\\phi}, \\psi)^2$.\n\nIn other words, if all of the edge potentials are strictly positive, and the graphical model has even\none negative cycle, then there exists an external field $\\hat{\\phi}$ and a 2-cover $(H; \\hat{\\phi}^H, \\psi^H)$ of $(G; \\hat{\\phi}, \\psi)$ such\nthat $Z(G; \\hat{\\phi}, \\psi)^2 < Z(H; \\hat{\\phi}^H, \\psi^H)$. In particular, the proof of the theorem shows that there exists an\nexternal field $\\hat{\\phi}$ such that $Z(G; \\hat{\\phi}, \\psi)^2 < Z(G_2; \\hat{\\phi}^{G_2}, \\psi^{G_2})$. See Appendix B in the supplementary\nmaterial for a proof of Theorem 4.1.\n\n4.2 Bethe Partition Function of Graph Covers\n\nAlthough the true partition function of an arbitrary graph cover could overestimate or underestimate\nthe true partition function of the base graphical model, the Bethe partition function on every cover\nalways provides an upper bound on the Bethe partition function of the base graph, that is,\n$Z_B(H; \\phi^H, \\psi^H) \\geq Z_B(G; \\phi, \\psi)^M$ for every M-cover H.\nIn addition,\nthe reweighted free energy is always convex for an appropriate choice of parameters $\\rho^{TRBP}$, which\nmeans that $Z_{B,\\rho^{TRBP}}(G; \\phi, \\psi)^2 = Z_{B,\\rho^{TRBP}}(G_2; \\phi^{G_2}, \\psi^{G_2})$. 
Consequently,\n\n$$Z_{B,\\rho^{TRBP}}(G; \\phi, \\psi)^2 \\geq Z(G_2; \\phi^{G_2}, \\psi^{G_2}) \\geq Z_B(G_2; \\phi^{G_2}, \\psi^{G_2}) \\geq Z_B(G; \\phi, \\psi)^2. \\qquad (1)$$\n\nBecause the graph cover $G_2$ is switching log-supermodular, the convergence properties of BP are\nbetter [5], and we can always apply the PTAS of Weller and Jebara [3] to $(G_2; \\phi^{G_2}, \\psi^{G_2})$ in order\nto obtain an upper bound on the Bethe partition function of the original model. That is, by forming\nthe special graph cover $G_2$, we accomplish our stated goal of deriving an algorithm that produces\nbetter estimates of the partition function than TRBP but has better convergence properties than BP.\nWe examine the convergence properties experimentally in Section 5.\nBefore we evaluate the empirical properties of this strategy, observe that (1) holds for the MAP\ninference problem as well. In the zero temperature limit, computing the Bethe partition function is\nequivalent to minimizing the energy over the local marginal polytope. Many provably convergent\nmessage-passing algorithms have been designed for this specific task [17, 18, 19, 1].\nBy Theorem 3.3, the MAP solution on $(G_2; \\phi^{G_2}, \\psi^{G_2})$ is always at least as good as the MAP solution\non the original graph. The problem of finding the MAP solution for a log-supermodular\npairwise binary graphical model is exactly solvable in strongly polynomial time using max-flow\n[20, 21]. We can show that the optimal solution to the Bethe approximation in the zero temperature\nlimit is attained as an integral assignment on this specific 2-cover. The argument goes as follows.\nThe graphical model $(G_2; \\phi^{G_2}, \\psi^{G_2})$ is switching log-supermodular. By Theorem 2.4, in the zero\ntemperature limit, no MAP solution on any cover of $(G_2; \\phi^{G_2}, \\psi^{G_2})$ can attain a higher value of the\nobjective function. 
This means that\n\n$$\\lim_{T \\to 0} Z_B(G_2; (\\phi^{G_2})^{1/T}, (\\psi^{G_2})^{1/T})^T = \\max_{x^{G_2}} f^{(G_2; \\phi^{G_2}, \\psi^{G_2})}(x^{G_2}).$$\n\nBy (1), the Bethe approximation on $(G_2; \\phi^{G_2}, \\psi^{G_2})$ is at least as good as the Bethe approximation\non the original problem. In fact, they are equivalent in the zero temperature limit: the only part of\nthe Bethe approximation that is not necessarily convex in \u03c4 is the entropy approximation, which\nbecomes negligible as T \u2192 0.\nAs a consequence, we can compute the optimum of the Bethe free energy in the zero temperature\nlimit in polynomial time without relying on convergent message-passing algorithms. This is particularly\ninteresting as the local marginal polytope for pairwise binary graphical models has an integer\npersistence property. Given any fractional optimum \u03c4 of the energy, U, over the local marginal polytope,\nif $\\tau_i(0) > \\tau_i(1)$, then there exists an integer optimum $\\tau'$ in the marginal polytope such that\n$\\tau'_i(0) > \\tau'_i(1)$ [22]. A similar result holds when the strict inequality is reversed. Therefore, we can\ncompute both the Bethe optimum and partial solutions to the exact MAP inference problem simply\nby solving a max-flow problem over $(G_2; \\phi^{G_2}, \\psi^{G_2})$.\nIn this restricted setting, the 2-cover $G_2$ is essentially the same as the graph construction produced\nas part of the quadratic pseudo-boolean optimization (QPBO) algorithm in the computer vision\ncommunity [23]. In this sense, we can view the technique presented in this work as a generalization\nof QPBO to approximate the partition function of pairwise binary graphical models.\n\n5 Experimental Results\n\nIn this section, we present several experimental results for the above procedure. 
For the experiments,\nwe used a standard implementation of reweighted, asynchronous message passing starting from a\nrandom initialization and a damping factor of 0.9. We test the performance of these algorithms on\nIsing models with a randomly selected external field and various interaction strengths on the edges.\nWe do not use the convergent version of TRBP as the message update order is graph dependent\nand not as easily parallelizable as the reweighted message-passing algorithm [1]. In addition, alternative\nmessage-passing schemes that guarantee convergence tend to converge more slowly than damped\nreweighted message passing [24]. In some cases where TRBP does not converge for the chosen\nparameters, additional damping does help but does not allow convergence within the specified number of\niterations.\nThe first experiment was conducted on a complete graph on four nodes. The convergence properties\nof BP have been studied both theoretically and empirically by Mooij and Kappen [5]. As expected,\nTRBP provides a looser bound on the partition function than BP on the 2-cover and both typically\nperform worse in terms of estimation than BP on the original graph (when BP converges there).\nThe experimental results are described in Figure 3. In all cases, the algorithms were run until the\nmessages in consecutive time steps differed by less than $10^{-8}$ or until more than 20,000 iterations\nwere performed (a single iteration consists of updating all of the messages in the model). In general,\nBP on the 2-cover construction converges more quickly than both BP and TRBP on the original\ngraph. BP failed to converge as the interaction strength decreased past \u22120.9. The number of iterations\nrequired for convergence of BP on the 2-cover has a spike at the first interaction strength such that\n$Z_B(G) \\neq \\sqrt{Z_B(G_2)}$. 
Empirically, this occurs because of the appearance of new BP fixed points on\nthe 2-cover that are close to the BP fixed point on the original graph. As the interaction strength\nincreases past this point, the new fixed points further separate from the old fixed points and the\nalgorithm converges significantly faster.\nOur second set of experiments evaluates the practical performance of these three message-passing\nschemes for Ising models on frustrated grid graphs (which arise in computer vision problems), subnetworks\nof the Epinions social network (the specific subnetworks tested can be found in Appendix\nD of the supplementary material), and simple four layer graphical models with five nodes per layer\n\nFigure 3: Plots of the log partition function and the number of iterations for the different algorithms\nto converge for a complete graph on four nodes with no external field as the strength of the negative\nedges goes from 0 to -2. For TRBP, $\\rho_{ij} = 0.5$ for all (i, j) \u2208 E. The dashed black line is the ground\ntruth. (Axes: log Z and Iterations, each plotted against \u2212J; legend: BP 2-cover, BP, TRBP.)\n\nTRBP BP 2-cover BP Iter. 
TRBP Iter BP 2-cover Iter.\n\n222.99\n44.14\n29.59\n21.12\n16.19\n15.9\n15.12\n14.84\n14.93\n16.67\n16.82\n18.17\n\nGrid\n\nEPIN1\n\nEPIN1\n\nDeep Networks\n\na\n1\n2\n4\n1\n2\n4\n1\n2\n4\n1\n2\n4\n\nBP\n100% 100%\n30%\n15%\n0%\n1%\n0%\n47%\n37%\n0%\n0%\n38%\n0%\n41%\n0%\n50%\n0%\n53%\n61%\n0%\n0%\n61%\n60%\n0%\n\n95%\n100%\n100%\n100%\n100%\n100%\n100%\n99%\n100%\n100%\n100%\n100%\n\n44.62\n210\n219\n63.53\n90.1\n93.63\n51.8\n42.46\n86.66\n89.2\n30.66\n24.88\n\n110.41\n815.3\n\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n\nFigure 4: Percent of samples on which each algorithm converged within 1000 iterations and the\naverage number of iterations for convergence for 100 samples of edges weights in [\u2212a, a] for the\ndesignated graphs. For TRBP, performance was poor independent of the spanning trees selected.\nsimilar to those used to model \u201cdeep\u201d belief networks (layer i and layer i + 1 form a complete bi-\npartite graph and there are no intralayer edges). In the Epinions network, the pairwise interactions\ncorrespond to trust relationships. If our goal was to \ufb01nd the most trusted users in the network, then\nwe could, for example, compute the marginal probability that each user is trusted and then rank the\nusers by these probabilities. For each of these models, the edge weights are drawn uniformly at\nrandom from the interval [\u2212a, a]. The performance of BP, TRBP, and BP on the 2-cover continue to\nbehave as they did for the simple four node model: as a increases, BP fails to converge and BP on the\n2-cover converges much faster and more frequently than the other methods. Here, convergence was\nrequired to an accuracy of 10\u22128 within 1, 000 iterations. The results for the different graphs appear\nin Figure 4. Notably, both BP and TRBP perform poorly on the real networks from the Epinions\ndata set.\n\nAcknowledgments\n\nThis work was supported in part by NSF grants IIS-1117631, CCF-1302269 and IIS-1451500.\n\nReferences\n[1] T. Meltzer, A. 
Globerson, and Y. Weiss. Convergent message passing algorithms: a unifying view. In Proc. 25th Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, 2009.\n\n[2] A. Weller and T. Jebara. Bethe bounds and approximating the global optimum. In Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.\n\n[3] A. Weller and T. Jebara. Approximating the Bethe partition function. In Uncertainty in Artificial Intelligence (UAI), 2014.\n\n[4] N. Taga and S. Mase. On the convergence of loopy belief propagation algorithm for different update rules. IEICE Trans. Fundam. Electron. Commun. Comput. Sci., E89-A(2):575\u2013582, Feb. 2006.\n\n[5] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. Information Theory, IEEE Transactions on, 53(12):4422\u20134437, Dec. 2007.\n\n[6] M. Bayati, C. Borgs, J. Chayes, and R. Zecchina. Belief propagation for weighted b-matchings on arbitrary graphs and its relation to linear programs with integer solutions. SIAM Journal on Discrete Mathematics, 25(2):989\u20131011, 2011.\n\n[7] N. Ruozzi and S. Tatikonda. Message-passing algorithms for quadratic minimization. Journal of Machine Learning Research, 14:2287\u20132314, 2013.\n\n[8] P. O. Vontobel. Counting in graph covers: A combinatorial characterization of the Bethe entropy function. Information Theory, IEEE Transactions on, Jan. 2013.\n\n[9] P. O. Vontobel and R. Koetter. Graph-cover decoding and finite-length analysis of message-passing iterative decoding of LDPC codes. CoRR, abs/cs/0512078, 2005.\n\n[10] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. Information Theory, IEEE Transactions on, 51(7):2282\u20132312, July 2005.\n\n[11] N. Ruozzi. The Bethe partition function of log-supermodular graphical models. 
In Neural Information Processing Systems (NIPS), Lake Tahoe, NV, Dec. 2012.\n\n[12] N. Ruozzi. Beyond log-supermodularity: Lower bounds and the Bethe partition function. In Proceedings of the Twenty-Ninth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 546\u2013555, Corvallis, Oregon, 2013. AUAI Press.\n\n[13] Y. Crama and P. L. Hammer. Boolean functions: Theory, algorithms, and applications, volume 142. Cambridge University Press, 2011.\n\n[14] D. Schlesinger. Exact solution of permuted submodular minsum problems. In Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), pages 28\u201338. Springer, 2007.\n\n[15] F. Harary. On the notion of balance of a signed graph. The Michigan Mathematical Journal, 2(2):143\u2013146, 1953.\n\n[16] Y. Watanabe. Uniqueness of belief propagation on signed graphs. In Advances in Neural Information Processing Systems, pages 1521\u20131529, 2011.\n\n[17] T. Werner. A linear programming approach to max-sum problem: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1165\u20131179, 2007.\n\n[18] A. Globerson and T. S. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Proc. 21st Neural Information Processing Systems (NIPS), Vancouver, B.C., Canada, 2007.\n\n[19] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697\u20133717, Nov. 2005. ISSN 0018-9448. doi: 10.1109/TIT.2005.856938.\n\n[20] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society. Series B (Methodological), pages 271\u2013279, 1989.\n\n[21] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? In Computer Vision - ECCV 2002, pages 65\u201381. 
Springer, 2002.\n\n[22] V. Kolmogorov and M. Wainwright. On the optimality of tree-reweighted max-product message-passing. In Proceedings of the Twenty-First Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-05), pages 316\u2013323, Arlington, Virginia, 2005. AUAI Press.\n\n[23] V. Kolmogorov and C. Rother. Minimizing nonsubmodular functions with graph cuts - a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1274\u20131279, July 2007.\n\n[24] A. Globerson and T. S. Jaakkola. Convergent propagation algorithms via oriented trees. In Proc. 23rd Uncertainty in Artificial Intelligence (UAI), 2007.\n\n[25] A. W. Marshall and I. Olkin. Inequalities: Theory of Majorization and its Applications. Academic Press, New York, 1979.\n\n[26] L. Lov\u00e1sz. Submodular functions and convexity. In A. Bachem, B. Korte, and M. Gr\u00f6tschel, editors, Mathematical Programming The State of the Art, pages 235\u2013257. Springer Berlin Heidelberg, 1983.\n\n[27] M. Richardson, R. Agrawal, and P. Domingos. Trust management for the semantic web. In Dieter Fensel, Katia Sycara, and John Mylopoulos, editors, The Semantic Web - ISWC 2003, volume 2870 of Lecture Notes in Computer Science, pages 351\u2013368. Springer Berlin Heidelberg, 2003.\n", "award": [], "sourceid": 942, "authors": [{"given_name": "Nicholas", "family_name": "Ruozzi", "institution": "Columbia University"}, {"given_name": "Tony", "family_name": "Jebara", "institution": "Columbia University"}]}