{"title": "Cooled and Relaxed Survey Propagation for MRFs", "book": "Advances in Neural Information Processing Systems", "page_first": 297, "page_last": 304, "abstract": null, "full_text": "Cooled and Relaxed Survey Propagation for MRFs\n\nHai Leong Chieu1,2, Wee Sun Lee2\n\n1Singapore MIT Alliance\n\n2Department of Computer Science\nNational University of Singapore\n\nYee-Whye Teh\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\nywteh@gatsby.ucl.ac.uk\n\nhaileong@nus.edu.sg,leews@comp.nus.edu.sg\n\nAbstract\n\nWe describe a new algorithm, Relaxed Survey Propagation (RSP), for \ufb01nding\nMAP con\ufb01gurations in Markov random \ufb01elds. We compare its performance with\nstate-of-the-art algorithms including the max-product belief propagation, its se-\nquential tree-reweighted variant, residual (sum-product) belief propagation, and\ntree-structured expectation propagation. We show that it outperforms all ap-\nproaches for Ising models with mixed couplings, as well as on a web person\ndisambiguation task formulated as a supervised clustering problem.\n\n1 Introduction\n\nEnergy minimization is the problem of \ufb01nding a maximum a posteriori (MAP) con\ufb01guration in a\nMarkov random \ufb01eld (MRF). A MAP con\ufb01guration is an assignment of values to variables that\nmaximizes the probability (or minimizes the energy) in the MRF. Energy minimization has many\napplications; for example, in computer vision it is used for applications such as stereo matching [11].\nAs energy minimization in general MRFs is computationally intractable, approximate inference al-\ngorithms based on belief propagation are often necessary in practice. Such algorithms are split into\ntwo classes: max-product and variants address the problem by trying to \ufb01nd a MAP con\ufb01guration\ndirectly, while sum-product and variants return approximate marginal distributions which can be\nused to estimate a MAP con\ufb01guration. 
It has been shown that the max-product algorithm converges to neighborhood maxima [18], while the sum-product algorithm converges to local minima of the Bethe approximation [20]. Convergence of these algorithms is important for good approximations. Recent work (e.g. [16, 8]) on sufficient conditions for convergence of sum-product algorithms suggests that they converge better on MRFs containing potentials with small strengths. In this paper, we propose a novel algorithm, called Relaxed Survey Propagation (RSP), based on performing the sum-product algorithm on a relaxed MRF. In the relaxed MRF, there is a parameter vector y that can be optimized for convergence. By optimizing y to reduce the strengths of the potentials, we show empirically that RSP converges on MRFs where other algorithms fail to converge.

The relaxed MRF is built in two steps: we first (i) convert the energy minimization problem into its weighted MAX-SAT equivalent [17], and then (ii) construct a relaxed version of the survey propagation MRF proposed in [14]. We prove that the relaxed MRF has approximately the same joint distribution (and hence the same MAP and marginals) as the original MRF, independent (to a large extent) of the parameter vector y. Empirically, we show that RSP, when run at low temperatures ("cooled"), performs well for the application of energy minimization. Among max-product algorithms, we compare against the max-product algorithm and its sequential tree-reweighted variant, which is guaranteed to converge [11]. Among sum-product algorithms, we compare against residual belief propagation [6], a state-of-the-art asynchronous belief propagation algorithm, as well as tree-structured expectation propagation [15], which has been shown to be a special case of generalized belief propagation [19].
We show that RSP outperforms all approaches for Ising models with mixed couplings, as well as in a supervised clustering application for web person disambiguation.

(a) G = (V, F)   (b) W = (B, C)

Figure 1: The variables x1, x2 in (a) are binary, resulting in 4 variables in (b). The clauses α1 to α4 in (b) are entries in the factor a in (a), γ1 and γ2 (resp. γ3 and γ4) are from b (resp. c). β(1) and β(2) are the positivity clauses. The relaxed MRF for RSP has a similar form to the graph in (b).

2 Preliminaries

A MRF, G = (V, F), is defined by a set of variables V and a set of factors F = {Φa}, where each Φa is a non-negative function depending on Xa ⊆ V. We assume for simplicity that all variables in V have the same cardinality q, taking values in Q = {1, .., q}. For Xi ∈ V and Xa ⊆ V, we denote by xi the event that Xi = xi, and by xa the event Xa = xa. To simplify notation, we will sometimes write i ∈ V for Xi ∈ V, or a ∈ F for Φa ∈ F. The joint distribution over configurations is defined by P(x) = (1/Z) ∏_a Φa(xa), where Z is the normalization factor. When each Φa is a positive function, the joint distribution can be written as P(x) = (1/Z) exp(−E(x)/T), where E(x) = ∑_a −log Φa(xa) is the energy function, and the temperature T is set to 1. A factor graph [13] is a graphical representation of a MRF, in the form of a bipartite graph with two types of nodes, the variables and the factors. Each factor Φa is connected to the variables in Xa, and each variable Xi is connected to the set of factors, N(i), that depend on it. See Figure 1(a) for a simple example.

Weighted MAX-SAT conversion [17]: Before describing RSP, we describe the weighted MAX-SAT (WMS) conversion of the energy minimization problem for a MRF. The WMS problem is a generalization of the satisfiability problem (SAT).
In SAT, a set of boolean variables is constrained by a boolean function in conjunctive normal form, which can be treated as a set of clauses. Each clause is a set of literals (a variable or its negation), and is satisfied if at least one of its literals evaluates to 1. The SAT problem consists of finding a configuration that satisfies all the clauses. In WMS, each clause has a weight, and the WMS problem consists of finding a configuration with maximum total weight of satisfied clauses (called the weight of the configuration). We describe the conversion [17] of a MRF G = (V, F) into a WMS problem W = (B, C), where B is the set of boolean variables and C the set of weighted clauses. Without loss of generality, we normalize factors in F to take values in (0, 1]. For each Xi ∈ V, introduce the variables σ(i,xi) ∈ B as the predicate that Xi = xi. For convenience, we index variables in B either by k or by (i, xi), denote factors in F with Roman letters (e.g. a, b, c) and clauses in C with Greek letters (e.g. α, β, γ). For a clause α ∈ C, we denote by C(α) the set of variables in α. There are two types of clauses in C: interaction and positivity clauses.

Definition 1. Interaction clauses: For each entry Φa(xa) in Φa ∈ F, introduce the clause α = ∨_{xi∈xa} ¬σ(i,xi) with the weight wα = −log(Φa(xa)). We write α ⊏ a to show that the clause α comes from the factor Φa ∈ F, and we denote by a = src(α) the factor Φa ∈ F for which α ⊏ a. The violation of the interaction clause corresponding to Φa(xa) entails that σ(i,xi) = 1 for all xi ∈ xa. This corresponds to the event that Xi = xi for Xi ∈ Xa.

Definition 2.
Positivity clauses: for each Xi ∈ V, introduce the clause β(i) = ∨_{xi∈Q} σ(i,xi) with weight w_β(i) = 2 ∗ ∑_{α∈Ci} wα, where Ci is the set of interaction clauses containing any variable in {σ(i,xi)}_{xi∈Q}. For Xi ∈ V, we denote by β(i) the corresponding positivity clause in C, and for a positivity clause β ∈ C, we denote by src(β) the corresponding variable in V.

Positivity clauses have large weights to ensure that for each Xi ∈ V, at least one predicate in {σ(i,xi)}_{xi∈Q} equals 1. To map σ back to a configuration in the original MRF, exactly one variable in each set {σ(i,xi)}_{xi∈Q} must take the value 1. We call such configurations valid configurations:

Definition 3. A configuration is valid if ∀Xi ∈ V, exactly one of the indicators {σ(i,xi)}_{xi∈Q} equals 1. There are two types of invalid configurations: MTO configurations, where more than one variable in the set {σ(i,xi)}_{xi∈Q} equals 1, and AZ configurations, where all variables in the set equal zero. For valid configurations σ, let x(σ) be the corresponding configuration of σ in V.

For a valid configuration σ, and for each a ∈ F, exactly one interaction clause in {α}_{α⊏a} is violated: when the α corresponding to Φa(xa) is violated, we have Xa = xa in x(σ). Valid configurations have locally maximal weights [17]: MTO configurations have low weights since in all interaction clauses, variables appear as negative literals. AZ configurations have low weights because they violate the positivity clauses.
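To make the conversion concrete, here is a minimal sketch (our own code, not from the paper; the function name and data layout are illustrative assumptions) that builds the interaction and positivity clauses from table-represented factors:

```python
import math
from itertools import product

def mrf_to_wms(variables, factors):
    """Convert a MRF into weighted MAX-SAT clauses.

    variables: dict mapping variable name -> list of values (the domain Q).
    factors: list of (scope, table) pairs, where scope is a tuple of variable
             names and table maps each value-tuple to a potential in (0, 1].
    Returns (interaction, positivity): each clause is a (literal-list, weight)
    pair, where a literal (i, xi) stands for the predicate sigma_(i,xi);
    interaction literals are understood to be negated.
    """
    interaction = []
    for scope, table in factors:
        for values in product(*(variables[i] for i in scope)):
            # One clause per factor entry: OR of negated predicates, violated
            # exactly when X_scope = values; its weight is -log(phi).
            literals = [(i, xi) for i, xi in zip(scope, values)]
            interaction.append((literals, -math.log(table[values])))
    positivity = []
    for i, domain in variables.items():
        # Interaction clauses touching any predicate sigma_(i, .) of X_i.
        touching = [w for lits, w in interaction if any(j == i for j, _ in lits)]
        # Definition 2: weight is twice the total weight of touching clauses.
        positivity.append(([(i, xi) for xi in domain], 2 * sum(touching)))
    return interaction, positivity
```

For the two-variable, one-pairwise-factor example of Figure 1, this yields 4 interaction clauses and 2 positivity clauses.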
See Figure 1 for an example of the WMS equivalent of a simple factor graph.

3 Relaxed Survey Propagation

In this section, we transform the WMS problem W = (B, C) into another MRF, Gs = (Vs, Fs), based on the construction of the MRF for survey propagation [14]. We show (in Section 3.1) that under this framework, we are able to remove MTO configurations, while AZ configurations have negligible probability. In survey propagation, in addition to the values {0, 1}, variables can take a third value, * (the "joker" state), signifying that the variable is free to take either 0 or 1 without violating any clause. In this section, we assume that variables σk take values in {0, 1, *}.

Definition 4. [14] A variable σk is constrained by the clause α ∈ C if it is the unique satisfying variable for clause α (all other variables violate α). Define CON_{k,α}(σ_{C(α)}) = δ(σk is constrained by α), where δ(P) equals 1 if the predicate P is true, and 0 otherwise.

We introduce the parameters {ya}_{a∈F} and {yi}_{i∈V} by modifying the definition of VAL in [14]:

Definition 5. An assignment σ is invalid for clause α if and only if all variables are unsatisfying except for exactly one for which σk = * (in this case, σk cannot take * as it is constrained). Define

    VALα(σ_{C(α)}) = { exp(wα) if σ_{C(α)} satisfies α;  exp(−y_{src(α)}) if σ_{C(α)} violates α;  0 if σ_{C(α)} is invalid }    (1)

The term exp(−y_{src(α)}) is the penalty for violating clauses, with src(α) ∈ V ∪ F defined in Definitions 1 and 2. For interaction clauses, we index ya by a ∈ F because among valid configurations, exactly one clause in the group {α}_{α⊏a} is violated, and exp(−ya) becomes a constant factor.
Positivity clauses are always satisfied, so the penalty factor does not appear for valid configurations.

Definition 6. [14] Define the parent set Pk of a variable σk to be the set of clauses for which σk is the unique satisfying variable (i.e. the set of clauses constraining σk).

We now construct the MRF Gs = (Vs, Fs), where variables λk ∈ Vs are of the form λk = (σk, Pk), with σk variables in the WMS problem W = (B, C) (see Figure 1). We define single-variable compatibilities (Ψk) and clause compatibilities (Ψα) as in [14]:

    Ψk(λk = (σk, Pk)) = { ω0 if Pk = ∅, σk ≠ *;  ω* if Pk = ∅, σk = *;  1 for any other valid (σk, Pk) },  where ω0 + ω* = 1    (2)

    Ψα(λα = {σk, Pk}_{k∈C(α)}) = VALα(σ_{C(α)}) × ∏_{k∈C(α)} δ((α ∈ Pk) = CON_{k,α}(σ_{C(α)})),    (3)

where δ is defined in Definition 4. The single-variable compatibilities Ψk(λk) are defined so that when σk is unconstrained (i.e. Pk = ∅), Ψk(λk) takes the value ω* or ω0 depending on whether σk equals *. The clause compatibilities introduce the clause weights and penalties into the joint distribution. The factor graph has the following underlying distribution:

    P({σk, Pk}_k) ∝ ω0^{n0} ω*^{n*} ∏_{α∈SAT(σ)} exp(wα) ∏_{α∈UNSAT(σ)} exp(−y_{src(α)}),    (4)

where n0 is the number of unconstrained variables in σ, and n* the number of variables taking *. Comparing RSP with SP-ρ in [14], we see that

Theorem 1.
In the limit where all ya, yi → ∞, RSP is equivalent to SP-ρ [14], with ρ = ω*.

Taking y to infinity corresponds to disallowing violated constraints, and SP-ρ was formulated for satisfiable SAT problems, where violated constraints are forbidden. In this case, all clauses must be satisfied, the term ∏_{α∈C} exp(wα) in Equation 4 is a constant, and P(σ) ∝ ω0^{n0} ω*^{n*}.

3.1 Main result

In the following, we assume the following settings: (1) ω* = 1 and ω0 = 0; (2) for positivity clauses β(i), let yi = 0; and (3) in the original MRF G = (V, F), single-variable factors are defined on all variables (we can always define uniform factors). Under these settings, we will prove the main result that the joint distribution on the relaxed MRF is approximately equal to that on the original MRF, and that RSP estimates marginals on the original MRF. First, we prove the following lemma:

Lemma 1. The joint probability over valid configurations on Gs is proportional to the joint probability of the corresponding configurations on the original MRF, G = (V, F).

Proof. For valid configurations, all positivity clauses are satisfied, and for each a ∈ F, every valid configuration violates exactly one constraint in the set of interaction clauses {α}_{α⊏a}. Hence the penalty term for violated constraints, ∏_{a∈F} exp(−ya), is a constant factor. Let W = ∑_{α∈C} wα be the sum of all weights. For a valid configuration σ,

    P(σ) ∝ exp(∑_{γ∈SAT(σ)} wγ) = exp(W − ∑_{γ∈UNSAT(σ)} wγ) ∝ ∏_{a∈F} Φa(x(σ))

Lemma 2.
All configurations containing * have zero probability in the MRF Gs, and there is a one-to-one mapping between configurations λ = {σk, Pk}_{k∈Vs} and configurations σ = {σk}_{k∈B}.

Proof. Single-variable factors on G translate into single-literal clauses in the WMS formulation, which in turn become single-variable factors in Gs. For a variable λk = (σk, Pk) with a single-variable factor Ψα, we have VALα(σk = *) = 0. This implies Ψα(λk = (*, Pk)) = 0.

Lemma 3. MTO configurations have n0 ≠ 0, and since ω0 = 0, they have zero probability.

Proof. In an MTO configuration, ∃(i, xi, xi') such that σ(i,xi) = σ(i,xi') = 1. The positivity clause β(i) is hence non-constraining for these variables, and since all other clauses connected to them are interaction clauses and contain them as negative literals, both variables are unconstrained. Hence n0 ≠ 0, and from Equation 4, for ω0 = 0, they have zero probability.

The above lemmas lead to the following theorem:

Theorem 2. Assuming that exp(w_β(i)) ≫ 1 for all Xi ∈ V, the joint distribution over the relaxed MRF Gs = (Vs, Fs) is approximately equal to the joint distribution over the original MRF, G = (V, F). Moreover, RSP estimates the marginals on the original MRF, and at the fixed points, the belief at each node, B(σ(i,xi) = 1), is an estimate of P(Xi = xi), and ∑_{xi∈Q} B(σ(i,xi) = 1) ≈ 1.

We can understand the above theorem as follows: if we assume that the probability of AZ invalid configurations is negligible (equivalent to assuming that the probability of violating positivity clauses is negligible, i.e. exp(wi) ≫ exp(−y_{src(β(i))}) = 1), then only valid configurations are left. MTO invalid configurations are ruled out by Lemma 3.
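The proportionality in Lemma 1 can be sanity-checked by brute force on a tiny example. The sketch below (our own code, not from the paper; the factor values are arbitrary) enumerates the valid assignments of a two-variable MRF and verifies that exp(∑_{γ∈SAT(σ)} wγ) is a constant multiple of ∏_a Φa(x(σ)):

```python
import math
from itertools import product

# A tiny MRF: two binary variables, one pairwise factor and two unary factors,
# with potentials normalized into (0, 1].
pairwise = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.25, (1, 1): 0.125}
unary1 = {0: 1.0, 1: 0.5}
unary2 = {0: 0.5, 1: 1.0}

def sat_weight(x1, x2):
    """Total weight of satisfied interaction clauses for the valid assignment
    sigma corresponding to (x1, x2). Each factor entry gives one clause of
    weight -log(phi); the assignment violates exactly its own entries.
    Positivity clauses are satisfied by every valid configuration, so they
    would only add another constant."""
    W = sum(-math.log(p) for p in pairwise.values())
    W += sum(-math.log(p) for p in unary1.values())
    W += sum(-math.log(p) for p in unary2.values())
    violated = (-math.log(pairwise[(x1, x2)])
                - math.log(unary1[x1]) - math.log(unary2[x2]))
    return W - violated

ratios = []
for x1, x2 in product([0, 1], repeat=2):
    prob_factor = pairwise[(x1, x2)] * unary1[x1] * unary2[x2]
    ratios.append(math.exp(sat_weight(x1, x2)) / prob_factor)

# The ratio equals the same constant exp(W) for every valid configuration.
assert max(ratios) - min(ratios) < 1e-9
```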
Since the positivity clauses have large weights, exp(wi) ≫ 1 is usually satisfied. Hence RSP, as the sum-product algorithm on the relaxed MRF, returns estimates of the marginals P(Xi = xi) as B(σ(i,xi) = 1).

3.2 Choosing y

Valid configurations have a joint probability carrying the factor ∏_{a∈F} exp(−ya), while AZ configurations do not. However, Theorem 2 states that, if exp(wi) ≫ 1, AZ configurations have negligible probability. Empirically, we observe that for a large range of values of {ya}_{a∈F}, RSP returns marginals satisfying ∑_{xi} B(σ(i,xi) = 1) ≈ 1, indicating that AZ configurations do indeed have negligible probability. We can hence select the values of {ya}_{a∈F} for better convergence properties.

We describe heuristics based on the sufficient conditions for convergence of sum-product algorithms in [16]. To simplify notation, we write the conditions for a MRF with pairwise factors Φa:

    max_{Xj∈V, b∈N(j)} ∑_{a∈N(j)\b} N(Φa) < 1,  where  N(Φa) = sup_{xi≠xi'} sup_{xj≠xj'} tanh( (1/4) log ( Φa(xi,xj) Φa(xi',xj') / (Φa(xi',xj) Φa(xi,xj')) ) )    (5)

Mooij and Kappen [16] have also derived another condition based on the spectral radius of a matrix having N(Φa) as entries. These conditions lead us to believe that the sum-product algorithm converges better on MRFs with small N(Φa) (the "strengths" of potentials in [8]). To calculate N(Ψα) for the interaction clause α, we characterize these factors as follows:

    Ψα((σk, Pk), (σl, Pl)) = { exp(−y_{src(α)}) if clause α is violated, i.e. (σk, σl) = (0, 0);  exp(wα) otherwise }    (6)

As ya is shared among the clauses α ⊏ a, we choose ya to minimize ∑_{α⊏a} N(Ψα) = ∑_{α⊏a} tanh((1/4)|wα + ya|). A good approximation for ya would be the median of {−wα}_{α⊏a}.
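The one-dimensional search for each ya can be sketched as follows (a minimal illustration of the stated objective, not the paper's Matlab code; the bin count and grid resolution are our own arbitrary choices):

```python
import math

def strength_sum(y, clause_weights):
    """Sum of N(Psi_alpha) over one factor's clauses: sum of tanh(|w + y| / 4)."""
    return sum(math.tanh(abs(w + y) / 4.0) for w in clause_weights)

def choose_y(clause_weights, bins=10, refine=200):
    """Pick y_a minimizing the summed potential strengths by a coarse grid
    pass over the range of candidate minimizers {-w_alpha}, followed by a
    fine pass around the best coarse candidate."""
    lo = min(-w for w in clause_weights)
    hi = max(-w for w in clause_weights)
    if lo == hi:
        return lo
    step = (hi - lo) / bins
    coarse = min((lo + k * step for k in range(bins + 1)),
                 key=lambda y: strength_sum(y, clause_weights))
    fine = min((coarse - step + 2 * step * k / refine for k in range(refine + 1)),
               key=lambda y: strength_sum(y, clause_weights))
    return fine
```

For small clause weights tanh is nearly linear, so the minimizer is close to the median of {−wα}, matching the heuristic above; for example, choose_y([0.2, 0.5, 1.0]) lands near −0.5.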
For our experiments, we divide the search range for ya into 10 bins, and use fminsearch in Matlab to find a local minimum.

3.3 Update equations and efficiency

While each message in RSP has large cardinality, we show in the supplementary material that, under the settings of Section 3.1, the update equations can be simplified so that each factor passes a single number to each variable. The interaction clause α sends a number ν_{α→(i,x)} to each (i, x) ∈ C(α), and the positivity clause β(i) sends a number µ_{β(i)→(i,x)} to (i, x) for each x ∈ Q. The update equations are as follows (proofs in the supplementary material):

    µ_{β(i)→(i,x)} = ∑_{x'≠x} ∏_{α∈N(i,x')\β(i)} ν_{α→(i,x')} + exp(−wi)    (7)

    ν_{α→(i,x)} = ( µ_{β(j)→(j,x')} + exp(−y_{src(α)} − wα) ∏_{γ∈N(j,x')\{β(j),α}} ν_{γ→(j,x')} ) / ( µ_{β(j)→(j,x')} + ∏_{γ∈N(j,x')\{β(j),α}} ν_{γ→(j,x')} )    (8)

    B(σ(i,x) = 0) ∝ µ_{β(i)→(i,x)};   B(σ(i,x) = 1) ∝ ∏_{α∈N(i,x)\β(i)} ν_{α→(i,x)};   B(σ(i,x) = *) = 0    (9)

We found empirically that the schedule of message updates affects convergence to a large extent. A good schedule is to update all the ν-messages first (by updating the groups of ν-messages belonging to each factor a ∈ F together), and then to update the µ-messages together.
This seems to work better than the schedule defined by residual belief propagation [6] on the relaxed MRF.

In terms of efficiency, for a MRF with N pairwise factors, the sum-product algorithm has 2Nq real numbers in the factor-to-variable messages, and RSP has 2Nq + q. Empirically, we observe that RSP on the relaxed MRF runs as fast as the simple sum-product algorithm on the original MRF, with an overhead for determining the values of y.

4 Experimental Results

While Ising models with attractive couplings are exactly solvable by graph-cut algorithms, general Ising models with mixed couplings on complete graphs are NP-hard [4], and graph-cut algorithms are not applicable to graphs with mixed couplings [12]. In this section, we perform three sets of experiments to show that RSP outperforms other approaches: the first set compares RSP and residual belief propagation on a simple graph, the second set compares the performance of various methods on randomly generated graphs with mixed couplings, and the third set applies RSP to the web person disambiguation task.

A simple example: we use a 4-node complete graph of binary variables, with the two sets of factors defined in Figure 2(a), for ε = +1 and −1. The case ε = −1 was used in [8] to illustrate how the strengths of potentials affect convergence of the sum-product algorithm. We also show the case of ε = +1 (an attractive network) as a case where the sum-product algorithm converges well. Both sets of graphs (ε = +1 or −1) have uniform marginals, and 2 MAP configurations (modes).

(a) 4-node (binary) complete graph   (b) ε = +1   (c) ε = −1

Figure 2: In Figure (a), we define factors under the two settings: ε = ±1. Figures (b) and (c) show the L2 distance between the returned marginals and the nearest mode of the graph. Circles on the lines mean failure to converge, where we take the marginals at the last iteration.

In Figures 2(b) and 2(c), we show experimental results for ε = +1 and −1. In each case, we vary ω from 0 to 12, and for each ω, run residual belief propagation (RBP) damped at 0.5 and RSP (undamped) on the corresponding graph. Both methods are randomly initialized. We plot the L2 distance between the returned marginals and the nearest mode marginals (marginals with probability one on the modes). The correct marginals are uniform, where the L2 distance is √0.5 ≈ 0.7. For small ω, both methods converge to the correct marginals. As ω is increased, for ε = +1 in Figure 2(b), both approaches converge to marginals with probability 1 on one of the modes. For ε = −1, however, RSP converges again to marginals indicating a mode, while RBP faces convergence problems for ω ≥ 8.

Increasing ω corresponds to increasing N(Ψi,j), and the sum-product algorithm fails to converge for large ω when ε = −1. When the algorithms converge for large ω, they converge not to the correct marginals, but to a MAP configuration. Increasing ω has the same effect as decreasing the temperature of the network: the behavior of the sum-product algorithm approaches that of the max-product algorithm, i.e. the max-product algorithm is the sum-product algorithm at the zero temperature limit.

Ising models with mixed couplings: we conduct experiments on complete graphs of size 20 with different percentages of attractive couplings, using the Ising model with the energy function H(s) = −∑_{i,j} θ_{i,j} s_i s_j − ∑_i θ_i s_i, where s_i ∈ {−1, 1}. We draw θi from U[0, 0.1]. To control the percentage of attractive couplings, we draw θi,j from U[0, α], and randomly assign negative signs to the θi,j with probability (1 − ρ), where ρ is the percentage of attractive couplings required.
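The random mixed-coupling instances described above can be generated in a few lines (a sketch of the stated procedure; the function names and the use of a fixed seed are our own choices):

```python
import random

def make_mixed_ising(n=20, alpha=2.0, rho=0.5, seed=0):
    """Complete-graph Ising model: theta_i ~ U[0, 0.1], |theta_ij| ~ U[0, alpha],
    and each pairwise coupling is made repulsive (negative) with probability 1 - rho."""
    rng = random.Random(seed)
    theta_i = [rng.uniform(0.0, 0.1) for _ in range(n)]
    theta_ij = {}
    for i in range(n):
        for j in range(i + 1, n):
            w = rng.uniform(0.0, alpha)
            if rng.random() > rho:       # repulsive with probability 1 - rho
                w = -w
            theta_ij[(i, j)] = w
    return theta_i, theta_ij

def energy(s, theta_i, theta_ij):
    """H(s) = -sum_ij theta_ij s_i s_j - sum_i theta_i s_i, with s_i in {-1, +1}."""
    pair = sum(w * s[i] * s[j] for (i, j), w in theta_ij.items())
    unary = sum(t * si for t, si in zip(theta_i, s))
    return -pair - unary
```

With rho = 1 every coupling is attractive and the all-ones state minimizes H; lowering rho mixes in repulsive couplings and makes the minimization hard.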
We vary α from 1 to 3. In Figure 3, we plot the difference between the optimal energy (obtained with a brute-force search) and the energy returned by each of the following approaches: RSP, max-product belief propagation (MBP), the convergent tree-reweighted max-product belief propagation (TRW-S) [11], residual sum-product belief propagation (RBP) [6], and tree-structured expectation propagation (TEP) [15]. Each point on the graph is the average over 30 randomly generated networks. In Table 1, we compare RSP against these methods. When an algorithm does not converge, we take its result at the last iteration. We damp RBP and TEP with a 0.5 damping factor. For RSP, MBP, TRW-S and RBP, we randomly initialize the initial messages, and take the best result after 5 restarts. For TEP, we use five different trees consisting of a maximal spanning tree and four random stars [19]. For RSP, RBP and TEP, which are variants of the sum-product algorithm, we lower the temperature by a factor of 2 each time the method converges, and stop when the method fails to converge or if the results are not improved over the last temperature. We observe that MBP outperforms TRW-S consistently: this agrees with [11] in that MBP outperforms TRW-S for graphs with mixed couplings. While the performance of TRW-S remains constant from 25% to 75%, the sum-product based algorithms (RBP and TEP) improve as the percentage of attractive potentials is increased. In all three cases, RSP is one of the best performing methods, beaten only by TEP at 2 points on the 50% graph. TEP, being of the class of generalized belief propagation [19], runs significantly slower than RSP.

Supervised clustering: Finley and Joachims [7] formulated SVMcluster, which learns an item-pair similarity measure, Sim(i, j), to minimize a correlation clustering objective on a training set.
In training SVMcluster, they have to minimize E(x) = ∑_{i,j} Sim(i, j) δ(xi, xj), where xi ∈ {1, .., U} are the cluster-ids of the items i, and U is an upper bound on the number of clusters. They tried a greedy and a linear programming approach, and concluded that the two approaches are comparable.

Due to time constraints, we did not implement SVMcluster: instead we test our inference algorithms on the pairwise classification clustering (PCC) baseline in [7]. The PCC baseline trains svmlight [9] on training item-pairs, and runs the classifier through all test pairs. For each test pair (i, j), we apply softmax to the classifier outputs to obtain the probability pi,j that the pair is in the same cluster.

[Figure 2(a) defines the pairwise factors Ψi,j(xi, xj) = exp(Ωi,j/4) if xi = xj and exp(−Ωi,j/4) if xi ≠ xj, where Ω is ω times a fixed 4 × 4 coupling matrix with entries in {0, 1, ε}.]

(a) 75%   (b) 50%   (c) 25%

Figure 3: Experiments on the complete-graph Ising model with mixed couplings (legend in (a)), with different percentages of attractive couplings. The y-axis shows, in log scale, the average energy difference between the configuration found by the algorithm and the optimal solution.

            75% attractive                   50% attractive                   25% attractive
α         1    1.5    2    2.5    3       1    1.5    2    2.5    3       1    1.5    2    2.5    3
mbp      2/0   2/0   0/0   1/0   1/0     7/6  11/5  14/0  10/2   9/6    20/2  13/3  16/0  13/3  15/2
trw-s   26/0  24/0  22/0  25/0  25/0    28/0  29/0  29/0  27/0  28/1    29/0  27/0  30/0  28/1  27/0
rbp      1/0   0/0   0/0   2/0   0/0    22/0  14/2  12/0   9/1  13/5    22/0  16/6  15/2  21/0  17/0
tep      2/0   2/0   0/0   2/0   0/0    14/3   9/3  11/2   6/2   6/5    23/1  15/4  10/2  16/2  15/2
opt      0/0   0/0   0/0   0/1   0/0     0/7   0/8   0/2   0/2   0/7     0/6  0/10   0/4   0/4   0/2

Table 1: Number of trials (out of 30) where RSP does better/worse than various methods.
In particular, the last row (opt) shows the number of times that RSP does worse than the optimal solution. Defining Sim(i, j) = log(pi,j/(1 − pi,j)), we minimize E(x) to cluster the test set. We found that the various inference algorithms perform poorly on the MRF for large U, even when they converge (probably due to a large number of minima in the approximation). We are able to obtain lower-energy configurations by the recursive 2-way partitioning procedure in [5] used for graph cuts (graph cuts do not apply here as weights can be negative). This procedure involves recursively running, e.g., RSP on the MRF for E(x) with U = 2, and applying the Kernighan-Lin algorithm [10] for local refinements among the current partitions. Each time RSP returns a configuration that partitions the data, we run RSP on each of the two partitions. We terminate the recursion when RSP assigns the same value to all variables, placing all remaining items in one cluster.

We use the web person disambiguation task defined in SemEval-2007 [1] as the test application. Training data consists of 49 sets of web pages (we use the 29 sets with more than 50 documents), where each set (or domain) is the set of results from a search query on a person name. The test data contains another 30 domains. Each domain is manually annotated into clusters, with each cluster containing pages referring to a single individual. We use a simple feature-filtering approach to select features that are useful across many domains in the training data. Candidate features include (i) words occurring in only one document of the document-pair, (ii) words co-occurring in both documents, (iii) named entity matches between the documents, and (iv) topic correlation features. For comparison, we replace RSP with MBP and TRW-S as inference algorithms (we did not run RBP and TEP as they are very slow on these problems because they often fail to converge).
We also implemented the greedy algorithm (Greedy) in [7]. We tried the linear programming approach, but free off-the-shelf solvers seem unable to scale to these problems. Results comparing RSP with Greedy, MBP and TRW-S are shown in Table 2. The F-measure attained by RSP on this SemEval task [1] is equal to that of the systems ranked second and third out of 16 participants (official results yet unpublished). We found that although TRW-S is guaranteed to converge, it performs poorly. RSP converges far better than MBP, but due to the Kernighan-Lin corrections that we run at each iteration, results can sometimes be corrected to a large extent by the local refinements.

Method                                                      RSP      MBP      TRW-S     Greedy
Test domains where RSP attains lower/higher energy E(x)     0/0      9/6      16/7      22/5
Percentage of convergence over all runs                     91%      74%      100% *    -
F-measure of purity and inverse purity [1]                  75.08%   74.97%   74.61%    74.78%

Table 2: Results for the web person disambiguation task. (*: TRW-S is guaranteed to converge)

5 Related work and conclusion

In this paper, we formulated RSP, generalizing the formulation of SP-ρ in [14]. SP-ρ is the sum-product interpretation of the survey propagation (SP) algorithm [3]. SP has been shown to work well for hard instances of 3-SAT, near the phase transition where local search algorithms fail. However, its application has been limited to constraint satisfaction problems [3]. In RSP, we took inspiration from the SP-y algorithm [2] in adding a penalty term for violated clauses. SP-y works on MAX-SAT problems, and SP can be considered as SP-y with y taken to ∞, hence disallowing violated constraints. This is analogous to the relation between RSP and SP-ρ [14] (see Theorem 1). RSP is however different from SP-y, since we address weighted MAX-SAT problems.
Even if all weights are equal, RSP still differs from SP-y, which so far has no sum-product formulation on an alternative MRF. We showed that while RSP is the sum-product algorithm on a relaxed MRF, it can be used to solve the energy minimization problem. By tuning the strengths of the factors (based on the convergence criteria in [16]) while keeping the underlying distribution approximately correct, RSP converges well even at low temperatures. This enables it to return low-energy configurations on MRFs where other methods fail. As far as we know, this is the first use of convergence criteria to aid the convergence of belief propagation algorithms, and this mechanism can exploit future work on sufficient conditions for the convergence of belief propagation algorithms.

Acknowledgments
We would like to thank Yee Fan Tan for his help on the web person disambiguation task, and Tomas Lozano-Perez and Leslie Pack Kaelbling for valuable comments on the paper. The research is partially supported by ARF grant R-252-000-240-112.

References
[1] “Web person disambiguation task at SemEval,” 2007. [Online]. Available: http://nlp.uned.es/weps/task-description-2.html
[2] D. Battaglia, M. Kolar, and R. Zecchina, “Minimizing energy below the glass thresholds,” Physical Review E, vol. 70, 2004.
[3] A. Braunstein, M. Mezard, and R. Zecchina, “Survey propagation: An algorithm for satisfiability,” Random Struct. Algorithms, vol. 27, no. 2, 2005.
[4] B. A. Cipra, “The Ising model is NP-complete,” SIAM News, vol. 33, no. 6, 2000.
[5] C. Ding, “Spectral clustering,” ICML ’04 Tutorial, 2004.
[6] G. Elidan, I. McGraw, and D. Koller, “Residual belief propagation: Informed scheduling for asynchronous message passing,” in UAI, 2006.
[7] T. Finley and T. 
Joachims, “Supervised clustering with support vector machines,” in ICML, 2005.
[8] T. Heskes, “On the uniqueness of loopy belief propagation fixed points,” Neural Computation, vol. 16, 2004.
[9] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Norwell, MA, USA: Kluwer Academic Publishers, 2002.
[10] B. Kernighan and S. Lin, “An efficient heuristic procedure for partitioning graphs,” The Bell System Technical Journal, 1970.
[11] V. Kolmogorov, “Convergent tree-reweighted message passing for energy minimization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, 2006.
[12] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, 2004.
[13] F. Kschischang, B. Frey, and H. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, 2001.
[14] E. Maneva, E. Mossel, and M. Wainwright, “A new look at survey propagation and its generalizations,” 2004. [Online]. Available: http://arxiv.org/abs/cs.CC/0409012
[15] T. Minka and Y. Qi, “Tree-structured approximations by expectation propagation,” in NIPS, 2004.
[16] J. M. Mooij and H. J. Kappen, “Sufficient conditions for convergence of loopy belief propagation,” in UAI, 2005.
[17] J. D. Park, “Using weighted MAX-SAT engines to solve MPE,” in AAAI, 2002.
[18] Y. Weiss and W. T. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs,” IEEE Transactions on Information Theory, vol. 47, no. 2, 2001.
[19] M. Welling, T. Minka, and Y. W. Teh, “Structured region graphs: Morphing EP into GBP,” in UAI, 2005.
[20] J. S. Yedidia, W. T. Freeman, and Y. 
Weiss, “Constructing free-energy approximations and generalized belief propagation algorithms,” IEEE Transactions on Information Theory, vol. 51, no. 7, 2005.