{"title": "Generalized Belief Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 689, "page_last": 695, "abstract": null, "full_text": "Generalized Belief Propagation

Jonathan S. Yedidia
MERL
201 Broadway
Cambridge, MA 02139
Phone: 617-621-7544
yedidia@merl.com

William T. Freeman
MERL
201 Broadway
Cambridge, MA 02139
Phone: 617-621-7527
freeman@merl.com

Yair Weiss
Computer Science Division
UC Berkeley, 485 Soda Hall
Berkeley, CA 94720-1776
Phone: 510-642-5029
yweiss@cs.berkeley.edu

Abstract

Belief propagation (BP) was only supposed to work for tree-like networks but works surprisingly well in many applications involving networks with loops, including turbo codes. However, there has been little understanding of the algorithm or the nature of the solutions it finds for general graphs. We show that BP can only converge to a stationary point of an approximate free energy, known as the Bethe free energy in statistical physics. This result characterizes BP fixed-points and makes connections with variational approaches to approximate inference. More importantly, our analysis lets us build on the progress made in statistical physics since Bethe's approximation was introduced in 1935. Kikuchi and others have shown how to construct more accurate free energy approximations, of which Bethe's approximation is the simplest. Exploiting the insights from our analysis, we derive generalized belief propagation (GBP) versions of these Kikuchi approximations. These new message passing algorithms can be significantly more accurate than ordinary BP, at an adjustable increase in complexity. We illustrate such a new GBP algorithm on a grid Markov network and show that it gives much more accurate marginal probabilities than those found using ordinary BP.
1 Introduction

Local "belief propagation" (BP) algorithms such as those introduced by Pearl are guaranteed to converge to the correct marginal posterior probabilities in tree-like graphical models. For general networks with loops, the situation is much less clear. On the one hand, a number of researchers have empirically demonstrated good performance for BP algorithms applied to networks with loops. One dramatic case is the near Shannon-limit performance of "Turbo codes", whose decoding algorithm is equivalent to BP on a loopy network [2, 6]. For some problems in computer vision involving networks with loops, BP has also been shown to be accurate and to converge very quickly [2, 1, 7]. On the other hand, for other networks with loops, BP may give poor results or fail to converge [7].

For a general graph, little has been understood about what approximation BP represents, and how it might be improved. This paper's goal is to provide that understanding and introduce a set of new algorithms resulting from that understanding. We show that BP is the first in a progression of local message-passing algorithms, each giving equivalent results to a corresponding approximation from statistical physics known as the "Kikuchi" approximation to the Gibbs free energy. These algorithms have the attractive property of being user-adjustable: by paying some additional computational cost, one can obtain considerable improvement in the accuracy of one's approximation, and can sometimes obtain a convergent message-passing algorithm when ordinary BP does not converge.

2 Belief propagation fixed-points are zero gradient points of the Bethe free energy

We assume that we are given an undirected graphical model of N nodes with pairwise potentials (a Markov network). Such a model is very general, as essentially any graphical model can be converted into this form.
The state of each node i is denoted by x_i, and the joint probability distribution function is given by

P(x_1, x_2, \ldots, x_N) = \frac{1}{Z} \prod_{ij} \psi_{ij}(x_i, x_j) \prod_i \psi_i(x_i)   (1)

where \psi_i(x_i) is the local "evidence" for node i, \psi_{ij}(x_i, x_j) is the compatibility matrix between nodes i and j, and Z is a normalization constant. Note that we are subsuming any fixed evidence nodes into our definition of \psi_i(x_i).

The standard BP update rules are:

m_{ij}(x_j) \leftarrow \alpha \sum_{x_i} \psi_{ij}(x_i, x_j) \psi_i(x_i) \prod_{k \in N(i) \setminus j} m_{ki}(x_i)   (2)

b_i(x_i) \leftarrow \alpha \psi_i(x_i) \prod_{k \in N(i)} m_{ki}(x_i)   (3)

where \alpha denotes a normalization constant and N(i) \setminus j means all nodes neighboring node i, except j. Here m_{ij} refers to the message that node i sends to node j and b_i is the belief (approximate marginal posterior probability) at node i, obtained by multiplying all incoming messages to that node by the local evidence. Similarly, we can define the belief b_{ij}(x_i, x_j) at the pair of nodes (x_i, x_j) as the product of the local potentials and all messages incoming to the pair of nodes: b_{ij}(x_i, x_j) = \alpha \phi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{ki}(x_i) \prod_{l \in N(j) \setminus i} m_{lj}(x_j), where \phi_{ij}(x_i, x_j) = \psi_{ij}(x_i, x_j) \psi_i(x_i) \psi_j(x_j).

Claim 1: Let {m_{ij}} be a set of BP messages and let {b_{ij}, b_i} be the beliefs calculated from those messages. Then the beliefs are fixed-points of the BP algorithm if and only if they are zero gradient points of the Bethe free energy, F_\beta:

F_\beta = \sum_{ij} \sum_{x_i, x_j} b_{ij}(x_i, x_j) [\ln b_{ij}(x_i, x_j) - \ln \phi_{ij}(x_i, x_j)] - \sum_i (q_i - 1) \sum_{x_i} b_i(x_i) [\ln b_i(x_i) - \ln \psi_i(x_i)]   (4)

subject to the normalization and marginalization constraints: \sum_{x_i} b_i(x_i) = 1, \sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j). (q_i is the number of neighbors of node i.)
To prove this claim we add Lagrange multipliers to form a Lagrangian L: \lambda_{ij}(x_j) is the multiplier corresponding to the constraint that b_{ij}(x_i, x_j) marginalizes down to b_j(x_j), and \gamma_{ij}, \gamma_i are multipliers corresponding to the normalization constraints. The equation \partial L / \partial b_{ij}(x_i, x_j) = 0 gives: \ln b_{ij}(x_i, x_j) = \ln \phi_{ij}(x_i, x_j) + \lambda_{ij}(x_j) + \lambda_{ji}(x_i) + \gamma_{ij} - 1. The equation \partial L / \partial b_i(x_i) = 0 gives: (q_i - 1)(\ln b_i(x_i) + 1) = (q_i - 1) \ln \psi_i(x_i) + \sum_{j \in N(i)} \lambda_{ji}(x_i) + \gamma_i. Setting \lambda_{ij}(x_j) = \ln \prod_{k \in N(j) \setminus i} m_{kj}(x_j) and using the marginalization constraints, we find that the stationary conditions on the Lagrangian are equivalent to the BP fixed-point conditions. (Empirically, we find that stable BP fixed-points correspond to local minima of the Bethe free energy, rather than maxima or saddle-points.)

2.1 Implications

The fact that F_\beta({b_{ij}, b_i}) is bounded below implies that the BP equations always possess a fixed-point (obtained at the global minimum of F_\beta). To our knowledge, this is the first proof of existence of fixed-points for a general graph with arbitrary potentials (see [9] for a complicated proof for a special case).

The free energy formulation clarifies the relationship to variational approaches which also minimize an approximate free energy [3]. For example, the mean field approximation finds a set of {b_i} that minimize:

F_{MF}({b_i}) = - \sum_{ij} \sum_{x_i, x_j} b_i(x_i) b_j(x_j) \ln \psi_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i) [\ln b_i(x_i) - \ln \psi_i(x_i)]   (5)

subject to the constraints \sum_{x_i} b_i(x_i) = 1.

The BP free energy includes first-order terms b_i(x_i) as well as second-order terms b_{ij}(x_i, x_j), while the mean field free energy uses only the first-order ones. It is easy to show that the BP free energy is exact for trees while the mean field one is not.
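As a concrete illustration of the BP updates (Eqs. 2-3) and the fixed-point claim, here is a minimal sketch of loopy BP on a 3-node cycle, the smallest network with a loop. The random potentials, the undamped synchronous schedule, and all variable names are our own illustrative choices, not from the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (0, 2)]                      # a single loop
psi_pair = {e: np.exp(rng.normal(size=(2, 2))) for e in edges}
psi_node = [np.exp(rng.normal(size=2)) for _ in range(3)]

def neighbors(i):
    return [j for e in edges for j in e if i in e and j != i]

def pair_pot(i, j):  # psi_ij indexed as [x_i, x_j]
    return psi_pair[(i, j)] if (i, j) in psi_pair else psi_pair[(j, i)].T

# messages m_ij(x_j), initialized uniform
msg = {(i, j): np.ones(2) / 2 for e in edges for i, j in (e, e[::-1])}
for _ in range(500):
    new = {}
    for (i, j) in msg:
        prod = psi_node[i] * np.prod(
            [msg[(k, i)] for k in neighbors(i) if k != j], axis=0)
        m = pair_pot(i, j).T @ prod                   # sum over x_i (Eq. 2)
        new[(i, j)] = m / m.sum()
    delta = max(np.abs(new[k] - msg[k]).max() for k in msg)
    msg = new

# beliefs at each node (Eq. 3), normalized
beliefs = []
for i in range(3):
    b = psi_node[i] * np.prod([msg[(k, i)] for k in neighbors(i)], axis=0)
    beliefs.append(b / b.sum())

# exact marginals by brute force, feasible on this tiny graph
joint = np.zeros((2, 2, 2))
for x in itertools.product((0, 1), repeat=3):
    p = np.prod([psi_node[i][x[i]] for i in range(3)])
    for (i, j) in edges:
        p *= pair_pot(i, j)[x[i], x[j]]
    joint[x] = p
joint /= joint.sum()
exact = [joint.sum(axis=tuple(k for k in range(3) if k != i)) for i in range(3)]
```

On a single loop with strictly positive potentials the normalized message iteration amounts to a power iteration with positive matrices, so it converges; the resulting beliefs are a stationary point of F_\beta and typically land close to, but not exactly on, the exact marginals.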
Furthermore the optimization methods are different: typically F_{MF} is minimized directly in the primal variables {b_i} while F_\beta is minimized using the messages, which are a combination of the dual variables {\lambda_{ij}(x_j)}.

Kabashima and Saad [4] have previously pointed out the correspondence between BP and the Bethe approximation (expressed using the TAP formalism) for some specific graphical models with random disorder. Our proof answers in the affirmative their question about whether there is a "deep general link between the two methods." [4]

3 Kikuchi Approximations to the Gibbs Free Energy

The Bethe approximation, for which the energy and entropy are approximated by terms that involve at most pairs of nodes, is the simplest version of the Kikuchi "cluster variational method." [5, 10] In a general Kikuchi approximation, the free energy is approximated as a sum of the free energies of basic clusters of nodes, minus the free energy of over-counted cluster intersections, minus the free energy of the over-counted intersections of intersections, and so on.

Let R be a set of regions that include some chosen basic clusters of nodes, their intersections, the intersections of the intersections, and so on. The choice of basic clusters determines the Kikuchi approximation: for the Bethe approximation, the basic clusters consist of all linked pairs of nodes. Let x_r be the state of the nodes in region r and b_r(x_r) be the "belief" in x_r. We define the energy of a region by E_r(x_r) \equiv -\ln \prod_{ij} \psi_{ij}(x_i, x_j) - \ln \prod_i \psi_i(x_i) \equiv -\ln \psi_r(x_r), where the products are over all interactions contained within the region r. For models with higher than pair-wise interactions, the region energy is generalized to include those interactions as well.

The Kikuchi free energy is

F_K = \sum_{r \in R} c_r \left( \sum_{x_r} b_r(x_r) E_r(x_r) + \sum_{x_r} b_r(x_r) \ln b_r(x_r) \right)   (6)
where c_r is the over-counting number of region r, defined by: c_r = 1 - \sum_{s \in super(r)} c_s, where super(r) is the set of all super-regions of r. For the largest regions in R, c_r = 1. The belief b_r(x_r) in region r has several constraints: it must sum to one and be consistent with the beliefs in regions which intersect with r. In general, increasing the size of the basic clusters improves the approximation one obtains by minimizing the Kikuchi free energy.

4 Generalized belief propagation (GBP)

Minimizing the Kikuchi free energy subject to the constraints on the beliefs is not simple. Nearly all applications of the Kikuchi approximation in the physics literature exploit symmetries in the underlying physical system and the choice of clusters to reduce the number of equations that need to be solved from O(N) to O(1). But just as the Bethe free energy can be minimized by the BP algorithm, we introduce a class of analogous generalized belief propagation (GBP) algorithms that minimize an arbitrary Kikuchi free energy. These algorithms represent an advance in physics, in that they open the way to the exploitation of Kikuchi approximations for inhomogeneous physical systems.

There are in fact many possible GBP algorithms which all correspond to the same Kikuchi approximation. We present a "canonical" GBP algorithm which has the nice property of reducing to ordinary BP at the Bethe level. We introduce messages m_{rs}(x_s) between all regions r and their "direct sub-regions" s. (Define the set subd(r) of direct sub-regions of r to be those regions that are sub-regions of r but have no super-regions that are also sub-regions of r, and similarly for the set superd(r) of "direct super-regions.") It is helpful to think of this as a message from those nodes in r but not in s (which we denote by r \setminus s) to the nodes in s.
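The region hierarchy described above (basic clusters, intersections, intersections of intersections), the over-counting numbers c_r, and the direct sub-regions subd(r) are all mechanical to compute. Here is a minimal sketch for a 3x3 grid with the four overlapping 2x2 plaquettes as basic clusters; the grid size and all names are our own illustrative choices:

```python
from itertools import combinations

# Nodes 0..8 on a 3x3 grid; the four overlapping 2x2 plaquettes are
# the basic clusters (the choice used in Section 5).
basic = [frozenset(s) for s in ({0, 1, 3, 4}, {1, 2, 4, 5},
                                {3, 4, 6, 7}, {4, 5, 7, 8})]

# Close under pairwise intersection, repeatedly, to get the region set R.
regions = set(basic)
while True:
    new = {a & b for a, b in combinations(regions, 2)
           if a & b and a & b not in regions}
    if not new:
        break
    regions |= new

def supers(r):
    return [s for s in regions if r < s]  # all strict super-regions

def subd(r):
    # direct sub-regions: sub-regions with no super-region also below r
    below = [s for s in regions if s < r]
    return [s for s in below if not any(s < t for t in below)]

# Over-counting numbers c_r = 1 - sum of c_s over super-regions,
# computed from the largest regions downward (largest regions get c_r = 1).
c = {}
for r in sorted(regions, key=len, reverse=True):
    c[r] = 1 - sum(c[s] for s in supers(r))

# Sanity check: every node is counted exactly once in total.
for i in range(9):
    assert sum(c[r] for r in regions if i in r) == 1
```

Running the same computation with all linked pairs as basic clusters recovers the Bethe numbers: c_r = 1 for each pair and c_i = 1 - q_i for each node, matching the (q_i - 1) factors in F_\beta.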
Intuitively, we want messages to propagate information that lies outside of a region into it. Thus, for a given region r, we want the belief b_r(x_r) to depend on exactly those messages m_{r's'} that start outside of the region r and go into the region r. We define this set of messages M(r) to be those messages m_{r's'}(x_{s'}) such that region r' \setminus s' has no nodes in common with region r, and such that region s' is a sub-region of r or the same as region r. We also define the set M(r, s) of messages to be all those messages that start in a sub-region of r and also belong to M(s), and we define M(r) \setminus M(s) to be those messages that are in M(r) but not in M(s).

The canonical generalized belief propagation update rules are:

m_{rs} \leftarrow \alpha \left[ \sum_{x_{r \setminus s}} \psi_{r \setminus s}(x_{r \setminus s}) \prod_{m_{r''s''} \in M(r) \setminus M(s)} m_{r''s''} \right] / \prod_{m_{r's'} \in M(r, s)} m_{r's'}   (7)

b_r \leftarrow \alpha \psi_r(x_r) \prod_{m_{r's'} \in M(r)} m_{r's'}   (8)

where for brevity we have suppressed the functional dependences of the beliefs and messages. The messages are updated starting with the messages into the smallest regions first. One can then use the newly computed messages in the product over M(r, s) of the message-update rule. Empirically, this helps convergence.

Claim 2: Let {m_{rs}(x_s)} be a set of canonical GBP messages and let {b_r(x_r)} be the beliefs calculated from those messages. Then the beliefs are fixed-points of the canonical GBP algorithm if and only if they are zero gradient points of the constrained Kikuchi free energy F_K.

We prove this claim by adding Lagrange multipliers: \gamma_r to enforce the normalization of b_r and \lambda_{rs}(x_s) to enforce the consistency of each region r with all of its direct sub-regions s. This set of consistency constraints is actually more than sufficient, but there is no harm in adding extra constraints.
We then rotate to another set of Lagrange multipliers \mu_{rs}(x_s) of equal dimensionality which enforce a linear combination of the original constraints: \mu_{rs}(x_s) enforces all those constraints involving marginalizations by all direct super-regions r' of s into s except that of region r itself. The rotation matrix is in a block form which can be guaranteed to be full rank. We can then show that the \mu_{rs}(x_s) constraints can be written in the form \mu_{rs}(x_s) \sum_{r' \in R(\mu_{rs})} c_{r'} \sum_{x_{r'}} b_{r'}(x_{r'}), where R(\mu_{rs}) is the set of all regions which receive the message \mu_{rs} in the belief update rule of the canonical algorithm. We then re-arrange the sum over all \mu's into a sum over all regions, which has the form \sum_{r \in R} c_r \sum_{x_r} b_r(x_r) \sum_{\mu_{rs} \in M(r)} \mu_{rs}(x_s). (M(r) is a set of \mu_{r's'} in one-to-one correspondence with the m_{r's'} in M(r).) Finally, we differentiate the Kikuchi free energy with respect to b_r(x_r), and identify \mu_{rs}(x_s) = \ln m_{rs}(x_s) to obtain the canonical GBP belief update rules, Eq. 8. Using the belief update rules in the marginalization constraints, we obtain the canonical GBP message update rules, Eq. 7.

It is clear from this proof outline that other GBP message passing algorithms which are equivalent to the Kikuchi approximation exist. If one writes any set of constraints which are sufficient to ensure the consistency of all Kikuchi regions, one can associate the exponentiated Lagrange multipliers of those constraints with a set of messages.

The GBP algorithms we have described solve exactly those networks which have the topology of a tree of basic clusters. This is reminiscent of Pearl's method of clustering [8], wherein one groups clusters of nodes into "super-nodes," and then applies a belief propagation method to the equivalent super-node lattice.
We can show that the clustering method, using Kikuchi clusters as super-nodes, also gives results equivalent to the Kikuchi approximation for those lattices and cluster choices where there are no intersections between the intersections of the Kikuchi basic clusters. For those networks and cluster choices which do not obey this condition (a simple example that we discuss below is the square lattice with clusters that consist of all square plaquettes of four nodes), Pearl's clustering method must be modified by adding additional update conditions to agree with GBP algorithms and the Kikuchi approximation.

5 Application to Specific Lattices

We illustrate the canonical GBP algorithm for the Kikuchi approximation of overlapping 4-node clusters on a square lattice of nodes. Figure 1 (a), (b), (c) illustrates the beliefs at a node, pair of nodes, and at a cluster of 4 nodes, in terms of messages propagated in the network. Vectors are the single-index messages also used in ordinary BP. Vectors with line segments indicate the double-indexed messages arising from the Kikuchi approximation used here. These can be thought of as correction terms accounting for correlations between messages that ordinary BP treats as independent. (For comparison, Fig. 1 (d), (e), (f) shows the corresponding marginal computations for the triangular lattice with all triangles chosen as the basic Kikuchi clusters.)

We find the message update rules by equating marginalizations of Fig. 1 (b) and (c) with the beliefs in Fig. 1 (a) and (b), respectively. Figure 2 (a) and (b) show (graphically) the resulting fixed point equations. The update rule (a) is like that for ordinary BP, with the addition of two double-indexed messages. The update rule for the double-indexed messages involves division by the newly-computed single-indexed messages.
Fixed points of these message update equations give beliefs that are stationary points (empirically minima) of the corresponding Kikuchi approximation to the free energy.

[Figure 1: Marginal probabilities in terms of the node links and GBP messages, for (a) a node, (b) a line, (c) a square cluster, using a Kikuchi approximation with 4-node clusters on a square lattice. E.g., (b) depicts b_{ab}(x_a, x_b) as a special case of Eq. 8, written using node labels: the product of \psi_{ab}(x_a, x_b) \psi_a(x_a) \psi_b(x_b) with the incoming single- and double-indexed messages M, where super- and subscripts indicate which nodes each message M goes from and to. (d), (e), (f): Marginal probabilities for the triangular lattice with 3-node Kikuchi clusters.]

[Figure 2: Graphical depiction of the message update equations (Eq. 7; marginalize over the nodes shown unfilled) for GBP using overlapping 4-node Kikuchi clusters. (a) Update equation for the single-index messages: M_a^b(x_a) = \alpha \sum_{x_b} \psi_b(x_b) \psi_{ab}(x_a, x_b) times the remaining incoming messages. (b) Update equation for the double-indexed messages (involves a division by the single-index messages on the left hand side).]

6 Experimental Results

Ordinary BP is expected to perform relatively poorly for networks with many tight loops, conflicting interactions, and weak evidence. We constructed such a network, known in the physics literature as the square lattice Ising spin glass in a random magnetic field. The nodes are on a square lattice, with nearest-neighbor nodes connected by a compatibility matrix of the form \psi_{ij} = \begin{pmatrix} \exp(J_{ij}) & \exp(-J_{ij}) \\ \exp(-J_{ij}) & \exp(J_{ij}) \end{pmatrix} and local evidence vectors of the form \psi_i = (\exp(h_i), \exp(-h_i)).
To instantiate a particular network, the J_{ij} and h_i parameters are chosen randomly and independently from zero-mean Gaussian probability distributions with standard deviations J and h respectively.

The following results are for n by n lattices with toroidal boundary conditions and with J = 1 and h = 0.1. This model is designed to show off the weaknesses of ordinary BP, which performs well for many other networks. Ordinary BP is a special case of canonical GBP, so we exploited this to use the same general-purpose GBP code for both ordinary BP and canonical GBP using overlapping square four-node clusters, thus making computational cost comparisons reasonable. We started with randomized messages and only stepped half-way towards the computed values of the messages at each iteration in order to help convergence. We found that canonical GBP took about twice as long as ordinary BP per iteration, but would typically reach a given level of convergence in many fewer iterations. In fact, for the majority of the dozens of samples that we looked at, BP did not converge at all, while canonical GBP always converged for this model and always to accurate answers. (We found that for the zero-field 3-dimensional spin glass with toroidal boundary conditions, which is an even more difficult model, canonical GBP with 2x2x2 cubic clusters would also fail to converge.)

For n = 20 or larger, it was difficult to make comparisons with any other algorithm, because ordinary BP did not converge and Monte Carlo simulations suffered from extremely slow equilibration. However, generalized belief propagation converged reasonably rapidly to plausible-looking beliefs. For small n, we could compare with exact results, by using Pearl's clustering method on a chain of n by 1 super-nodes.
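The test model and the half-step damping heuristic described above are simple to set up. The sketch below builds only the potentials and the damping rule, not the full GBP code; the function and variable names are our own:

```python
import numpy as np

# An n-by-n Ising spin glass in a random field on a torus, following the
# paper's parameterization: J_ij ~ N(0, J^2), h_i ~ N(0, h^2).
rng = np.random.default_rng(1)
n, J, h = 4, 1.0, 0.1

def edge_potential(Jij):
    # psi_ij = [[e^{J_ij}, e^{-J_ij}], [e^{-J_ij}, e^{J_ij}]]
    return np.array([[np.exp(Jij), np.exp(-Jij)],
                     [np.exp(-Jij), np.exp(Jij)]])

def node_potential(hi):
    # psi_i = (e^{h_i}, e^{-h_i})
    return np.array([np.exp(hi), np.exp(-hi)])

nodes = [(r, c) for r in range(n) for c in range(n)]
edges = {}
for r, c in nodes:  # toroidal boundary: right and down neighbors wrap
    for dr, dc in ((0, 1), (1, 0)):
        u, v = (r, c), ((r + dr) % n, (c + dc) % n)
        edges[(u, v)] = edge_potential(rng.normal(scale=J))
evidence = {u: node_potential(rng.normal(scale=h)) for u in nodes}

def damped(old, proposed, step=0.5):
    """Step only part-way toward the newly computed message, then renormalize
    (the paper's 'step half-way' heuristic, used to help convergence)."""
    m = (1 - step) * old + step * proposed
    return m / m.sum()
```

A message scheduler for either ordinary BP or GBP would then call `damped` on every message at each iteration instead of overwriting messages outright.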
To give a qualitative feel for the results, we compare ordinary BP, canonical GBP, and the exact results for an n = 10 lattice where ordinary BP did converge. Listing the values of the one-node marginal probabilities in one of the rows, we find that ordinary BP gives (.0043807, .74502, .32866, .62190, .37745, .41243, .57842, .74555, .85315, .99632), canonical GBP gives (.40255, .54115, .49184, .54232, .44812, .48014, .51501, .57693, .57710, .59757), and the exact results were (.40131, .54038, .48923, .54506, .44537, .47856, .51686, .58108, .57791, .59881).

References

[1] W. T. Freeman and E. Pasztor. Learning low-level vision. In 7th Intl. Conf. Computer Vision, pages 1182-1189, 1999.

[2] B. J. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.

[3] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan, editor, Learning in Graphical Models. MIT Press, 1998.

[4] Y. Kabashima and D. Saad. Belief propagation vs. TAP for decoding corrupted messages. Euro. Phys. Lett., 44:668, 1998.

[5] R. Kikuchi. Phys. Rev., 81:988, 1951.

[6] R. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instance of Pearl's 'belief propagation' algorithm. IEEE J. on Sel. Areas in Comm., 16(2):140-152, 1998.

[7] K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proc. Uncertainty in AI, 1999.

[8] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.

[9] T. J. Richardson. The geometry of turbo-decoding dynamics. IEEE Trans. Info. Theory, 46(1):9-23, Jan. 2000.

[10] Special issue on Kikuchi methods. Progr. Theor. Phys. Suppl., vol. 115, 1994.
\n\n\f", "award": [], "sourceid": 1832, "authors": [{"given_name": "Jonathan", "family_name": "Yedidia", "institution": null}, {"given_name": "William", "family_name": "Freeman", "institution": null}, {"given_name": "Yair", "family_name": "Weiss", "institution": null}]}