{"title": "Fast Convergence of Belief Propagation to Global Optima: Beyond Correlation Decay", "book": "Advances in Neural Information Processing Systems", "page_first": 8331, "page_last": 8341, "abstract": "Belief propagation is a fundamental message-passing algorithm for probabilistic reasoning and inference in graphical models. While it is known to be exact on trees, in most applications belief propagation is run on graphs with cycles. Understanding the behavior of ``loopy'' belief propagation has been a major challenge for researchers in machine learning, and several positive convergence results for BP are known under strong assumptions which imply the underlying graphical model exhibits decay of correlations. We show that under a natural initialization, BP converges quickly to the global optimum of the Bethe free energy for Ising models on arbitrary graphs, as long as the Ising model is \\emph{ferromagnetic} (i.e. neighbors prefer to be aligned). This holds even though such models can exhibit long range correlations and may have multiple suboptimal BP fixed points. We also show an analogous result for iterating the (naive) mean-field equations; perhaps surprisingly, both results are ``dimension-free'' in the sense that a constant number of iterations already provides a good estimate to the Bethe/mean-field free energy.", "full_text": "Fast Convergence of Belief Propagation to Global\n\nOptima: Beyond Correlation Decay\n\nFrederic Koehler\n\nDepartment of Mathematics\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02141\nfkoehler@mit.edu\n\nAbstract\n\nBelief propagation is a fundamental message-passing algorithm for probabilistic\nreasoning and inference in graphical models. While it is known to be exact on\ntrees, in most applications belief propagation is run on graphs with cycles. 
Understanding the behavior of "loopy" belief propagation has been a major challenge for researchers in machine learning and other fields, and positive convergence results for BP are known under strong assumptions which imply the underlying graphical model exhibits decay of correlations. We show, building on previous work of Dembo and Montanari, that under a natural initialization BP converges quickly to the global optimum of the Bethe free energy for Ising models on arbitrary graphs, as long as the Ising model is ferromagnetic (i.e. neighbors prefer to be aligned). This holds even though such models can exhibit long range correlations and may have multiple suboptimal BP fixed points. We also show an analogous result for iterating the (naive) mean-field equations; perhaps surprisingly, both results are dimension-free in the sense that a constant number of iterations already provides a good estimate to the Bethe/mean-field free energy.

1 Introduction

Undirected graphical models, also known as Markov Random Fields, are a general, powerful, and popular framework for modeling and reasoning about high dimensional distributions. These models explain the dependency structure of a probability distribution in terms of interactions along the edges of a (hyper-)graph, which gives rise to a factorization of the joint probability distribution, and the absence of edges in this graph encodes conditional independence relations.

Ising models are a special class of graphical models with a long history and which are popular in applications; they model the interaction of random variables valued in a binary alphabet ({±1}) with exclusively pairwise interactions.
Explicitly, the joint pmf of an Ising model is

Pr(X = x) = exp((1/2) x^T Jx + h · x − log Z)    (1)

where x ∈ {±1}^n, J is an arbitrary n × n matrix describing the pairwise interactions between nodes (with zero diagonal), h is an arbitrary vector encoding node biases, and Z is a proportionality constant called the partition function. Historically, Ising models originated in the statistical physics community as a way to model and study phase transition phenomena in magnetic materials; since then, they have attracted significant interest in the machine learning community and have been applied in a wide variety of domains including finance, social networks, neuroscience, and computer vision (see e.g. references in [18, 23, 30, 11]).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Performing sampling and inference on an Ising model is a major computational challenge for which a wide variety of approaches have been developed. One family of methods are Markov chain Monte Carlo (MCMC) algorithms, the most popular of which is Gibbs sampling (also known as Glauber dynamics), which resamples the spin of one node at a time from its conditional distribution. When run sufficiently long, MCMC methods will draw samples from the true distribution (1); unfortunately, it is well known in both theory and practice that MCMC methods may become stuck when the probability distribution (1) exhibits multi-modal structure; for example, on an n×n square lattice the Glauber dynamics requires exponential time to mix in the low temperature phase [20].

A popular alternative to Markov chain methods are variational methods, which typically make some approximation to the distribution (1) but often run much faster than MCMC.
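Both families of methods target quantities like Z in (1), which for intuition can be computed exactly by brute-force enumeration when n is tiny. A minimal sketch (the 3-node chain below is a hypothetical example, not a model from the paper):

```python
import itertools
import math

import numpy as np


def log_partition(J, h):
    """Brute-force log Z for the Ising model (1); only feasible for small n."""
    n = len(h)
    energies = [0.5 * x @ J @ x + h @ x
                for x in (np.array(s) for s in itertools.product([-1, 1], repeat=n))]
    m = max(energies)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(e - m) for e in energies))


# Hypothetical ferromagnetic 3-node chain: J_ij >= 0 with zero diagonal,
# consistent field h >= 0.
J = np.array([[0.0, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.0]])
h = np.array([0.1, 0.0, 0.1])
print(log_partition(J, h))
```

The exponential cost of this enumeration is exactly what MCMC and variational methods are designed to avoid.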
These methods usually reduce inference on the Ising model to some (typically non-convex) optimization problem, which is solved either by standard optimization methods (e.g. gradient ascent) or by more specialized methods like message-passing algorithms (e.g. belief propagation). Because of the non-convexity, these methods are typically not guaranteed to return global optimizers of the corresponding variational problem. Indeed, these optimization problems are NP-hard to approximate for general Ising models (see e.g. [12] for the case of mean-field approximation).

Belief propagation (BP) is a celebrated message-passing algorithm which is known to be closely related to the Bethe approximation. It is a fundamental algorithm for probabilistic inference [25] which plays a central role in a variety of applications like phylogenetic reconstruction, coding, constraint satisfaction problems, and community detection in the stochastic block model (see e.g. [21, 23, 6]); it is also closely connected to the "cavity method" in statistical physics [21]. Although BP is observed to work well for many problems, there are few settings on general graphs (i.e. with loops) where it provably works. For example, BP with random initialization is conjectured to achieve optimal reconstruction in the 2-community SBM [6] but no rigorous proof of this result is known.

In this work, we consider two popular variational approximations, the naive mean-field approximation and the Bethe approximation to the Ising model, and the corresponding heuristic message-passing algorithms which are usually used to solve these optimization problems: mean-field iteration and belief propagation. We show that under a natural and popular assumption of ferromagneticity (that J_ij ≥ 0 and h has consistent signs; a.k.a.
as an attractive graphical model) that these methods do indeed converge to global optimizers of their optimization problems, under a natural initialization, and moreover that their convergence rate is fast and dimension-free in the appropriate sense.

1.1 Background: Variational methods and belief propagation

We can describe the variational methods we consider in terms of optimization problems whose goal is to estimate Φ := log Z, the log partition function or free energy of the Ising model. This is natural to consider because other important quantities can be recovered by differentiating log Z in some parameter, and because the ability to construct sufficiently precise estimates for Z is equivalent to approximate sampling for any self-reducible family of models [15]. We note that throughout this section we specialize to Ising models, but all of these notions generalize straightforwardly to general Markov random fields (a.k.a. factor models) — see [21] for a more detailed discussion.

The starting point for these variational methods is the Gibbs variational principle [21], which states

log Z = max_{P ∈ P({±1}^n)} ( E_P[(1/2) x^T Jx + h · x] + H_P(X) )    (2)

where P ranges over probability distributions on {±1}^n and H_P(X) is the entropy of X under P. This formula is derived by observing that the Gibbs measure minimizes the KL divergence to itself and expanding.

The (naive) mean-field approximation is given by restricting (2) to product distributions and finding the maximum of the functional

Φ_MF(x) := (1/2) x^T Jx + h · x + Σ_i H(Ber((1 + x_i)/2))    (3)

where H(Ber(p)) = −p log p − (1 − p) log(1 − p) is the entropy of a Bernoulli random variable.
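As a sanity check, the functional (3) is straightforward to evaluate numerically; a minimal sketch (with a hypothetical small J, h, not taken from the paper):

```python
import numpy as np


def bernoulli_entropy(p):
    """H(Ber(p)) = -p log p - (1-p) log(1-p), with the convention 0 log 0 = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)


def phi_mf(J, h, x):
    """Mean-field free energy functional (3) at a mean vector x in [-1, 1]^n."""
    return 0.5 * x @ J @ x + h @ x + bernoulli_entropy((1 + x) / 2).sum()


# Hypothetical 3-node example: at x = 0 with h = 0 only the entropy term
# survives, so phi_mf equals n log 2.
J = np.array([[0.0, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.0]])
print(phi_mf(J, np.zeros(3), np.zeros(3)))  # prints 3*log(2) ≈ 2.0794
```

By the Gibbs variational principle (2), any such value is a lower bound on log Z.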
Information-theoretically, the optimizer(s) x* of this optimization problem correspond to the marginals of a product distribution ν which minimizes the KL-divergence from the Gibbs measure µ (defined by (1)) to ν. Note that this always gives a lower bound on log Z.

By considering the first-order optimality conditions for (3), one arrives at the mean-field equations

x = tanh^{⊗n}(J · x + h)    (4)

where tanh^{⊗n} denotes entry-wise application of tanh. The mean-field iteration is the natural iterative algorithm which starts with some x^(0) and applies (4) iteratively to search for a fixed point. The error of the mean-field approximation has been extensively studied; the approximation is guaranteed to be accurate when ‖J‖_F = o(n) (informally, in unfrustrated models with large average degree); see e.g. [2, 13, 1]. For example, the recent result of [1] shows that |log Z − max_x Φ_MF(x)| = O(√n ‖J‖_F) and the result of [10] gives even better bounds for some models.

The naive mean-field approximation can be inaccurate on very sparse graphs; the Bethe approximation is a more sophisticated approach which has the benefit of being exact on trees [21], and which is always at least as accurate as the mean-field approximation in ferromagnetic models [26]. The Bethe free energy is the maximum of the (typically non-convex) functional

Φ_Bethe(P) := Σ_{i∼j} J_ij E_{P_ij}[X_i X_j] + Σ_i h_i E_{P_i}[X_i] + Σ_{i∼j} H_{P_ij}(X_i, X_j) − Σ_i (deg(i) − 1) H_{P_i}(X_i)    (5)

where P lies in the polytope of locally consistent distributions (equivalently, SA(2) in the Sherali-Adams hierarchy¹).
Explicitly, this polytope is given by the constraints:

Σ_{x_i} P_ij(x_i, x_j) = P_j(x_j)   for all i, j neighbors
Σ_{x_i} P_i(x_i) = 1   for all i
P_i(x_i) ≥ 0   for all i, x_i

One can derive the Bethe-Peierls (BP) equations from the first-order optimality conditions for this optimization problem; this connection is involved and is discussed further in Section 3.1. Just like the mean-field equations, the BP equations can be iterated to search for a fixed point, in which case one recovers the belief propagation updates for this setting. Explicitly, for edge messages ν_{i→j} with ν_{i→j} ∈ [−1, 1] the consistency equation is

ν_{i→j} = tanh(h_i + Σ_{k∈∂i\j} tanh^{−1}(tanh(J_ik) ν_{k→i}))    (6)

where ∂i denotes the neighborhood of node i. Intuitively, this equation describes the expected marginal of node X_i in the absence of the edge between i and j. Given ν which solves these equations, the BP estimate for E[X_i] can be written as ν_i := tanh(h_i + Σ_{k∈∂i} tanh^{−1}(tanh(J_ik) ν_{k→i})). BP also gives an estimate for the free energy in terms of its messages (see equation (7) in Section 3.1).

The above derivation of ("loopy") belief propagation from the Bethe free energy for general graphical models is due to [36]. Alternatively, belief propagation can also be derived as the exact solution to computing the partition function of a tree graphical model and this can be found in a variety of places — see e.g. [25].

1.2 Our Results

We analyze the behavior of mean-field iteration and belief propagation in ferromagnetic (a.k.a. attractive) models on arbitrary graphs.

Definition 1.1.
An Ising model is ferromagnetic (with consistent field) if J_ij ≥ 0 for all i, j and h_i ≥ 0 for every i. (We can also allow h_i ≤ 0 for all i, but this is equivalent after flipping signs.)

¹The equivalence is given by treating the underlying graph as complete with J_ij = 0 for unconnected nodes, where the optimal coupling of these non-neighbors is when they are independent.

We show that in ferromagnetic Ising models, belief propagation and mean-field iteration always converge to the true global optimum of the Bethe free energy and mean-field free energy respectively, as long as they start from the all-1s initialization. Moreover we show that these algorithms converge quickly, which makes them fast and practical algorithms for estimating the corresponding objective. We note that these results cannot hold for arbitrary Ising models, as even approximating the mean-field free energy is NP-hard in general Ising models with anti-ferromagnetic interactions [12].

Theorem 1.2. Fix an arbitrary ferromagnetic Ising model parameterized by J, h and let x* be a global maximizer of Φ_MF. Initializing with x^(0) = 1 (the all-ones vector) and defining x^(1), x^(2), ... by iterating the mean-field equations, we have that² for every t ≥ 1,

0 ≤ Φ_MF(x*) − Φ_MF(x^(t)) ≤ min{ (‖J‖_1 + ‖h‖_1)/t , 2((‖J‖_1 + ‖h‖_1)/t)^{4/3} }.

²In this theorem and throughout, we use the notation ‖J‖_1, ‖J‖_∞ to refer to the corresponding ℓ_1, ℓ_∞ norms of J when viewed as a vector of entries.

Theorem 1.3. Fix an arbitrary ferromagnetic Ising model parameterized by J, h, and let P* be a global maximizer of Φ_Bethe. Initializing ν^(0)_{i→j} = 1 for all i ∼ j and defining ν^(1), ν^(2), ... by BP iteration on a graph with m edges, we have for every t ≥ 1,

0 ≤ Φ_Bethe(P*) − Φ*_Bethe(ν^(t)) ≤ √(8mn(1 + ‖J‖_∞)/t).

We also give simple lower bound examples showing these bounds are not too far from optimal; for example, for both algorithms we show the optimal asymptotic rate in t is lower bounded by at least Ω(1/t²). We refer to these bounds as dimension-independent because under the typical scaling of the entries of J in a ferromagnetic model, they show that mean-field iteration/BP achieve good estimates to the variational free energy density after a constant number of iterations. We explain this more precisely in the next remark:

Remark 1.4 (Scaling and Dimension-Free Nature). In ferromagnetic models, we usually expect the scaling ‖J‖_1 = Θ(n), ‖h‖_1 = Θ(n) so that all of the terms in the Gibbs variational principle (2) are on the same order, since the entropy scales like Θ(n) (e.g. the entropy of Uni({±1}^n) is n log(2)). Then the free energy log Z and its variational approximations all grow like Θ(n), so when considering the scaling in n one should consider the free energy density (1/n) log Z. Writing the guarantee of Theorem 1.2 for the free energy density when picking the O(1/t) bound, we see

0 ≤ (1/n) Φ_MF(x*) − (1/n) Φ_MF(x^(t)) ≤ (‖J‖_1 + ‖h‖_1)/nt

and the rhs is Θ(1/t) under the assumption ‖J‖_1 = Θ(n), ‖h‖_1 = Θ(n). We get a similar dimension-free guarantee for BP as long as m = Θ(n), i.e. the model is sparse, and ‖J‖_∞ = O(1), which is a very rarely violated assumption.

Remark 1.5 (Importance of initialization). The fast convergence rates we show do not hold for other seemingly natural choices of initialization; e.g. if we start BP with initial messages near zero. For a concrete illustration of this, see Figure 2 in the Appendix.

Finally, we build on ideas developed in our analysis of BP to give a different method, based on convex programming, which has worse dependence on n but converges exponentially fast, i.e. can compute the optimal BP solution to error ε in time poly(n, log(1/ε)). This is described in Appendix H; as we explain there, such a method is useful when we care about computing the optimal BP solution accurately in parameter space, as there can be exponentially flat (in terms of β) directions in the objective.

1.3 Related Work

As mentioned above, the general connection between the Bethe free energy and belief propagation was established in the work of Yedidia, Freeman and Weiss [36]. They showed that in any graphical model, the fixed points of BP correspond to the critical points of the Bethe free energy. However, their theory by itself does not say anything about the behavior of BP when it is not at a fixed point (as is the case during the BP iteration), or which fixed point (if any) it will converge to. In the special case that the edge constraints are relatively weak, e.g. if they satisfy Dobrushin's uniqueness condition [9], one can show that BP converges to a unique fixed point by comparing to what happens on a tree (see [29, 22]).
BP is also known to converge if the graph has at most one cycle [31]. Stronger results are known for BP in Gaussian graphical models, in which case BP can be viewed as an iterative method for solving linear equations [32, 19].

This work builds upon previous work of Dembo and Montanari [7], who studied the convergence of belief propagation in ferromagnetic Ising models with a positive external field strictly bounded away from 0. In their analysis they crucially showed that in all such models, BP converges at an asymptotically exponential rate to a unique fixed point if initialized with non-negative messages (other fixed points may exist, but they have at least one negative coordinate). As discussed in [7], this can be thought of as establishing an "average-case" version of correlation decay which goes well beyond the usual "worst-case" setting (which would require there to be a unique global fixed point). From this they derived analytic results for Ising models on graphs which converge locally to (random) trees: these models exhibit an average-case version of correlation decay, BP correctly estimates the marginals (e.g. E[X_i]) of the Ising model, and from this it follows that the "cavity prediction" for the free energy, which is determined by the tree the graph locally converges to, is correct to leading order. These analytic results were generalized in [8] beyond the Ising case (i.e. to non-binary spins) using some new techniques.

In contrast, in this work we allow for the complete absence of external field, in which case these models may have multiple fixed points, even in the space of nonnegative messages. Furthermore, the optimal BP fixed point often does not correspond to the true marginals of the underlying Ising model³. Despite this, we show that BP converges quickly in objective value⁴ to the optimal fixed point as long as we start from the all-1s initialization.
Another key difference is that we are interested in the behavior of BP on general graphs, where the BP estimate cannot always be related to the true free energy; we get around this issue by building on the connection established in [36] to show that the BP result is always equal to the Bethe free energy on any graph, not necessarily locally tree-like.

A very different line of work studies the dense limit of BP in spin glasses and related models, in the form of the TAP approximation and Approximate Message Passing (AMP); for example, see [4, 3]. These results are more concerned with dense models with random edge weights and are motivated by CLT-type considerations; the models they consider are typically far from ferromagnetic and thus require in the dense limit the TAP approximation instead of the naive mean-field approximation. Since we consider arbitrary graphs instead of (dense) random graphs, these techniques are not applicable.

Outside of message passing algorithms, we note that in ferromagnetic Ising models it is actually possible to sample efficiently from the Boltzmann distribution by using a special Markov chain which performs non-local updates: this was proved in a landmark work of Jerrum and Sinclair [14]. This result can be used to give an algorithm for approximating the mean-field free energy using a graph blow-up reduction [12]. Also, for ferromagnetic models it was shown previously that the Bethe free energy can be computed in polynomial time using discretization with submodularity-based methods in [16, 33]. The polynomials in the runtime guarantees for both methods are fairly large compared to the message-passing algorithms discussed in this work.

2 Convergence of Mean-Field Iteration

In this section, we give the proof of Theorem 1.2 by analyzing the mean-field iteration.
Organizationally, we split this theorem into two (corresponding to the two separate bounds implied by the min): we prove the first bound in the theorem as Theorem 2.4 and the second O(1/t^{4/3}) bound as Theorem 2.6. Omitted proofs are deferred to Appendices A-C corresponding to each subsection.

³For example, if there is no external field then E[X_i] = 0 for all i by symmetry, but e.g. on a 2D lattice at low temperature one can see the optimal BP solution has different marginals [21]. Roughly, this kind of behavior corresponds to the existence of phase transitions at zero external field (for say the random d-regular graph), which are ruled out in the case of strictly positive external field by the Lee-Yang theorem [17].

⁴The convergence in parameter space may be slower due to flat directions of the objective: see Appendix G. This differs from the setting where the external field is lower bounded by a positive constant [7].

2.1 Main convergence bound

In this section we prove the first (O(1/t)) bound appearing in Theorem 1.2, the bound which is better for small t; we consider this to be the more significant bound because it gives a meaningful convergence result even when t = O(1) (see Remark 1.4). A key observation in the proof is that the functional Φ_MF is actually concave on a certain subset of the space of product distributions, and that the iteration stays in this region because the iteration is monotone w.r.t. the partial order structure; this allows us to show progress at each step.

For the analysis of mean-field iteration, it will be very helpful to split the updates up into two steps:

y^(t+1) := Jx^(t) + h
x^(t+1) := tanh^{⊗n}(y^(t+1)).

Lemma 2.1. A global maximizer of Φ_MF is in [0, 1]^n.

Proof.
For any x, if |x| denotes the coordinate-wise absolute value then we observe Φ_MF(x) ≤ Φ_MF(|x|), since J, h are entrywise nonnegative and the entropy term is preserved. Therefore if x is a global maximizer then so is |x|, and by compactness of [−1, 1]^n there exists at least one global maximizer.

Lemma 2.2. There exists at most one critical point of Φ_MF in (0, 1]^n.

Based on these lemmas, we define x* to be the global maximizer of Φ_MF in [0, 1]^n. Define S := {x ∈ (0, 1]^n : x_i ≥ x*_i}.

Lemma 2.3. The mean-field free energy functional Φ_MF is concave on S.

Theorem 2.4 (Main bound in Theorem 1.2). Suppose that x^(0) ∈ S and define (x^(t), y^(t))_{t=1}^∞ by iterating the mean-field equations. Then for every t, x^(t) ∈ S. Furthermore

Φ_MF(x*) − Φ_MF(x^(t)) ≤ (‖J‖_1 + ‖h‖_1)/t.

Proof. To show that x^(t) ∈ S, observe that the mean-field iteration is monotone: if x ≤ x′, then tanh^{⊗n}(Jx + h) ≤ tanh^{⊗n}(Jx′ + h). Therefore, because x* ≤ x^(0), we see that x* = tanh^{⊗n}(Jx* + h) ≤ tanh^{⊗n}(Jx^(0) + h) = x^(1), and so on iteratively.

To prove the convergence bound, first note that (∂/∂x_i) Φ_MF(x) = J_i · x + h_i − tanh^{−1}(x_i), and then observe by Lemma 2.3 and concavity that

Φ_MF(x*) − Φ_MF(x^(t)) ≤ ⟨∇Φ_MF(x^(t)), x* − x^(t)⟩ ≤ ‖∇Φ_MF(x^(t))‖_1 = Σ_i |tanh^{−1}(x^(t)_i) − (Jx^(t) + h)_i| = Σ_i (y^(t)_i − y^(t+1)_i)

where the second inequality was by Hölder's inequality and ‖x* − x^(t)‖_∞ ≤ 1, and the last equality follows from the definition of y^(t) and because y^(t+1) ≤ y^(t) coordinate-wise.
We can also see that Φ_MF(x^(t)) is a monotonically increasing function of t by considering the path between x^(t) and x^(t+1) which updates one coordinate at a time, as the gradient always has non-positive entries along this path. Therefore if we sum over t we find that

Φ_MF(x*) − Φ_MF(x^(T)) ≤ (1/T) Σ_{t=1}^T (Φ_MF(x*) − Φ_MF(x^(t))) ≤ (1/T) Σ_{i=1}^n (y^(1)_i − y^(T+1)_i) ≤ (‖J‖_1 + ‖h‖_1)/T

since y^(T+1)_i ≥ 0 and y^(1)_i ≤ Σ_j J_ij + h_i ≤ ‖J_i‖_1 + h_i.

The following simple example shows that the above result is not too far from optimal, in the sense that an asymptotic rate of o(1/t²) is impossible. We take advantage of the fact that when the model is completely symmetrical, the behavior of the update can be reduced to a 1-dimensional recursion, which is a standard trick (see e.g. [21, 24]).

Example 2.5. Consider any d-regular graph with no external field and edge weight β = 1/d, which corresponds to the naive mean-field prediction for the critical temperature. By symmetry, analyzing the mean-field iteration reduces to the 1d recursion x ↦ tanh(x), which behaves like x ↦ x − x³/3 near the fixed point x = 0. Solving this recurrence, we see that x converges to 0 at rate Θ(1/√t). In terms of x, the estimated mean-field free energy is (n/2)x² + nH((1+x)/2), so by expanding we see that the estimated free energy converges at a Θ(1/t²) rate in this example.

2.2 A Faster Asymptotic Rate

The above theorem and lower bound leave a gap between O(1/t) and Ω(1/t²) for the asymptotic rate of the mean-field iteration.
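The scalar recursion in Example 2.5 is easy to check numerically; a minimal sketch (not from the paper) confirming the Θ(1/√t) decay of the iterates:

```python
import math

# Symmetric d-regular model at beta = 1/d: by symmetry the mean-field update
# collapses to the scalar recursion x <- tanh(x), as in Example 2.5.
x = 1.0  # all-1s initialization
trajectory = [x]
for _ in range(10_000):
    x = math.tanh(x)
    trajectory.append(x)

# Near 0, tanh(x) ~ x - x^3/3 gives 1/x_t^2 ~ 2t/3, i.e. x_t ~ sqrt(3/(2t)).
t = len(trajectory) - 1
print(x, math.sqrt(3 / (2 * t)))  # the two values should be close
```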
This section is devoted to showing that for large t, we can obtain an improved asymptotic rate of O(1/t^{4/3}) for the mean-field iteration using a slightly more involved variant of the argument from the previous section. The key insight is that we can obtain some control of ‖x − x*‖_∞ by considering the behavior of higher-order terms when expanding around x*, and this can be used to get better bounds on the convergence in objective.

Theorem 2.6 (Second bound in Theorem 1.2). Suppose that x^(0) ∈ S and define (x^(t), y^(t))_{t=1}^∞ by iterating the mean-field equations. Then for every t, x^(t) ∈ S. Furthermore for any t ≥ 1,

‖x^(t) − x*‖_∞³ ≤ (‖J‖_1 + ‖h‖_1)/t

and

Φ_MF(x*) − Φ_MF(x^(2t)) ≤ ((‖J‖_1 + ‖h‖_1)/t)^{4/3}.

2.3 Aside: Computing the Mean-Field Optimum given Inconsistent Fields

In this section we describe a polynomial time algorithm to compute the optimal mean-field approximation even in the situation when the external fields have inconsistent signs (i.e. some of the h_i are negative, some are positive). This is by reduction to the following algorithmic result of [27], following the same strategy as [16, 33] for the case of the Bethe free energy. We include this result as we were not aware of it appearing explicitly in the literature, though it is known at least as a "folk-lore" result.

Theorem 2.7. Fix an Ising model with ferromagnetic interactions (J_ij ≥ 0) and arbitrary (not necessarily consistent) external field h.
Then the mean-field free energy max_x Φ_MF(x) can be approximated within error εn in time poly(1/ε, n, ‖J‖_1, ‖h‖_1).

3 Rapid Convergence of Belief Propagation

In this section, we give the proof of our main result Theorem 1.3 by analyzing belief propagation. This proof is considerably more involved than the case of the mean-field iteration; a major conceptual difference between the two iterations is that the mean-field iteration always maintains a valid product distribution, and so can be understood in terms of the landscape of Φ_MF, whereas BP operates on "dual" variables which do not correspond to valid pseudodistributions except at fixed points, so analyzing the landscape of Φ_Bethe by itself does not suffice. We get around this by also considering the behavior of a dual functional Φ*_Bethe which is well-defined for every set of BP messages; however, this functional is poorly behaved in general (it can be unbounded and its critical points are typically saddle points). We are able to handle these difficulties by identifying the special behavior of BP and Φ*_Bethe on two special subsets of the BP messages arising from the partial order structure: the pre-fixpoints and post-fixpoints. Finally, when analyzing BP in these regions we are able to relate in a useful way its behavior at different values of external field, enabling us to use a convexity argument from [7] which cannot be directly applied to our setting.

Omitted proofs are deferred to Appendices D-F corresponding to each subsection.

3.1 Background: BP and the Bethe Free Energy

In this section we recall the necessary facts we need about the relationship derived in [36] between the Bethe free energy and BP.
This relationship and the corresponding formulas are a bit involved, so we sketch the derivation in Appendix D.

Following the correspondence outlined in [36], by writing down the Lagrangian corresponding to the optimization problem (5) over the polytope of locally consistent distributions, one can derive an expression for the Bethe free energy at a critical point of the Lagrangian in terms of the dual variables (Lagrange multipliers); see [21, 35]. After a change of variables to ν, this lets us define, for all ν (not necessarily fixed points of the BP equations), the dual Bethe free energy

Φ∗_Bethe(ν) := ∑_i F_i(ν) − ∑_{i∼j} F_{ij}(ν),    (7)

where

F_i(ν) = log[ e^{h_i} ∏_{j∈∂i} (1 + tanh(J_{ij}) ν_{j→i}) + e^{−h_i} ∏_{j∈∂i} (1 − tanh(J_{ij}) ν_{j→i}) ] + ∑_{j∈∂i} log cosh(J_{ij})

and F_{ij}(ν) = log(1 + tanh(J_{ij}) ν_{i→j} ν_{j→i}) + log cosh(J_{ij}). We remark (see [21]) that in the case the graph is a tree, it is known that the Bethe free energy is a convex function, so Φ∗_Bethe(ν) plus the Lagrange multiplier terms is actually the Lagrangian dual and thus has a natural interpretation for all ν. This is not true in general; however, we will soon see that Φ∗_Bethe does have useful properties on some special subsets of the space of messages.

3.2 Properties of BP and Optimization Landscape

Throughout the remainder of the paper we will use the notation φ(ν) to denote the result of performing a single iteration of belief propagation from ν, i.e.

φ(ν)_{i→j} := tanh( h_i + ∑_{k∈∂i∖j} tanh⁻¹(tanh(J_{ik}) ν_{k→i}) ).

The following lemma implies that φ(ν)_{i→j} is a concave monotone function for nonnegative ν.

Lemma 3.1. Fix h ≥ 0 and let f(x) = tanh(h + ∑_i tanh⁻¹(x_i)). Then f is a concave monotone function on the domain [0, 1)ⁿ. Furthermore ∇²f(x) ≺ 0 unless h = 0 and |supp(x)| ≤ 1.

There are two special subsets of the nonnegative messages which will play key roles in our analysis. They are the sets of pre-fixpoints and post-fixpoints (following standard poset terminology), defined by

S_pre = {ν : 0 ≤ φ(ν)_{i→j} ≤ ν_{i→j}},  S_post = {ν : 0 ≤ ν_{i→j} ≤ φ(ν)_{i→j}}.

Note that both sets contain the nonnegative fixed points; also we note from Lemma 3.1 that S_post is a convex set, whereas S_pre is typically non-convex and even disconnected. The gradient of Φ∗_Bethe is well-behaved on these sets:

Lemma 3.2. For any ν ≥ 0, ‖∇Φ∗_Bethe(ν)‖_∞ ≤ 1. Furthermore, if ν ∈ S_pre then ∇Φ∗_Bethe(ν) ≤ 0, and if ν ∈ S_post then ∇Φ∗_Bethe(ν) ≥ 0.

The Knaster-Tarski theorem [28] shows that the fixed points of φ must form a complete lattice, and in particular shows that a greatest fixed point must exist; the following lemma identifies it explicitly.

Lemma 3.3. Suppose that BP is run from initial messages ν(0)_{i→j} = 1. The messages converge to a fixed point ν∗ of the BP equations such that for any other fixed point µ, µ_{i→j} ≤ ν∗_{i→j} for all i, j. Furthermore

Φ∗_Bethe(ν∗) = max_{ν∈S_post} Φ∗_Bethe(ν).

The following key theorem states that ν∗ is a global optimum; its proof involves an intricate analysis of the Bethe free energy objective and is left to the appendix. The high level idea is similar to the previous situation with the mean-field approximation: since J, h are nonnegative, based on the form of the Bethe free energy we would guess that the optimizing pseudodistribution exhibits only positive correlations, and that this should be reflected in the optimum BP fixed point having nonnegative messages.

Theorem 3.4. The maximal fixed point ν∗ (as defined in Lemma 3.3) corresponds to a global maximizer of the Bethe free energy.

3.3 Convergence rate of belief propagation

A priori, there is no significance to the value of Φ∗_Bethe on a general (non-fixed point) ν. However, we observe that Φ∗_Bethe behaves nicely with respect to BP for messages in S_pre:

Lemma 3.5. Suppose that ν(0) ∈ S_pre and define the BP iterates ν(t+1) := φ(ν(t)). Then for any T ≥ 0 and any µ such that ν(T) ≤ µ ≤ ν(0), it follows that Φ∗_Bethe(µ) ≤ Φ∗_Bethe(ν(T)). In particular, Φ∗_Bethe(ν(0)) ≤ Φ∗_Bethe(ν(T)).

In order to give a quantitative bound on the convergence of BP, we are faced with an important conceptual difficulty: the BP messages may not converge quickly in parameter space, but if the BP messages are far from a fixed point in parameter space it is hard to show anything about the quality of their estimate of the free energy. We overcome this difficulty by relating the behavior of BP at zero external field to its behavior with additional positive external field, which allows us to take advantage of the smoothness of the primal objective Φ_Bethe; Lemma 3.5 and Theorem 3.4 are the key tools needed to make this connection work. This trick is similar in spirit to the use of monotone couplings in the proof of various correlation inequalities for the Ising model. This, combined with a useful concavity argument from [7] for analyzing the positive external field setting, allows us to ultimately prove Theorem 1.3. The detailed proof is in Appendix F.

Lower bounds and examples: In Appendix G, we give examples which illustrate the importance of distinguishing rates in parameter space vs. objective space and the importance of initialization at all-ones, and we also give lower bounds on the asymptotic convergence rate of BP.

A method with exponentially fast convergence: As mentioned in the introduction, we use the insights developed in our analysis of BP (especially the nice structure of the set of post-fixpoints S_post) to give a different method with much faster asymptotic convergence, but with a runtime that depends more significantly on the dimension; this is in Appendix H.

Acknowledgements: The author thanks Vishesh Jain for suggesting the argument in Section 2.2; Elchanan Mossel, Nike Sun, Matthew Brennan, and Enric Boix for helpful discussions related to this work; and Ankur Moitra and Andrej Risteski for useful discussions on related topics.

References

[1] Fanny Augeri.
A transportation approach to the mean-field approximation. arXiv preprint arXiv:1903.08021, 2019.

[2] Anirban Basak and Sumit Mukherjee. Universality of the mean-field for the Potts model. Probability Theory and Related Fields, 168(3-4):557–600, 2017.

[3] Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Transactions on Information Theory, 57(2):764–785, 2011.

[4] Erwin Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington-Kirkpatrick model. Communications in Mathematical Physics, 325(1):333–366, 2014.

[5] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[6] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84:066106, Dec 2011.

[7] Amir Dembo and Andrea Montanari. Ising models on locally tree-like graphs. Ann. Appl. Probab., 20(2):565–592, 2010.

[8] Amir Dembo, Andrea Montanari, Nike Sun, et al. Factor models on locally tree-like graphs. The Annals of Probability, 41(6):4162–4213, 2013.

[9] Roland Lvovich Dobrushin. The description of a random field by means of conditional probabilities and conditions of its regularity. Theor. Prob. Appl., 13:197–224, 1968.

[10] Ronen Eldan. Taming correlations through entropy-efficient measure decompositions with applications to mean-field approximation. arXiv preprint arXiv:1811.11530, 2018.

[11] Geoffrey E. Hinton. A practical guide to training restricted Boltzmann machines. In Neural Networks: Tricks of the Trade, pages 599–619. Springer, 2012.

[12] Vishesh Jain, Frederic Koehler, and Elchanan Mossel.
The mean-field approximation: Information inequalities, algorithms, and complexity. In Conference on Learning Theory (COLT), 2018.

[13] Vishesh Jain, Frederic Koehler, and Andrej Risteski. Mean-field approximation, convex hierarchies, and the optimality of correlation rounding: a unified perspective. In Proceedings of the Symposium on Theory of Computing (STOC), 2019.

[14] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model (extended abstract). In Automata, Languages and Programming, pages 462–475, 1990.

[15] Mark R. Jerrum, Leslie G. Valiant, and Vijay V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science, 43:169–188, 1986.

[16] Filip Korč, Vladimir Kolmogorov, Christoph H. Lampert, et al. Approximating marginals using discrete energy minimization. In ICML Workshop on Inferning: Interactions between Inference and Learning. Citeseer, 2012.

[17] Tsung-Dao Lee and Chen-Ning Yang. Statistical theory of equations of state and phase transitions. II. Lattice gas and Ising model. Physical Review, 87(3):410, 1952.

[18] Stan Z. Li. Markov random field modeling in image analysis. Springer Science & Business Media, 2009.

[19] Dmitry M. Malioutov, Jason K. Johnson, and Alan S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. Journal of Machine Learning Research, 7(Oct):2031–2064, 2006.

[20] Fabio Martinelli. Lectures on Glauber dynamics for discrete spin models. In Lectures on Probability Theory and Statistics, pages 93–191. Springer, 1999.

[21] M. Mézard and A. Montanari. Information, Physics, and Computation. Oxford University Press, USA, 2009.

[22] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422–4437, Dec 2007.

[23] Elchanan Mossel.
Survey: Information flow on trees. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 63:155–170, 2004.

[24] Giorgio Parisi. Statistical Field Theory. Addison-Wesley, New York, 1988.

[25] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.

[26] Nicholas Ruozzi. The Bethe partition function of log-supermodular graphical models. In Advances in Neural Information Processing Systems, pages 117–125, 2012.

[27] Dmitrij Schlesinger and Boris Flach. Transforming an arbitrary minsum problem into a binary one. TU, Fak. Informatik, 2006.

[28] Alfred Tarski et al. A lattice-theoretical fixpoint theorem and its applications. Pacific Journal of Mathematics, 5(2):285–309, 1955.

[29] S. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs measures. In Uncertainty in Artificial Intelligence (UAI), Proceedings of the Eighteenth Conference, 2002.

[30] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

[31] Yair Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1–41, 2000.

[32] Yair Weiss and William T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In Advances in Neural Information Processing Systems, pages 673–679, 2000.

[33] Adrian Weller and Tony Jebara. Bethe bounds and approximating the global optimum. In Artificial Intelligence and Statistics, pages 618–631, 2013.

[34] Max Welling and Yee Whye Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 554–561.
Morgan Kaufmann Publishers Inc., 2001.

[35] Tomáš Werner. Primal view on belief propagation. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 651–657. AUAI Press, 2010.

[36] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, Annals of Mathematics Studies, no. 34, pages 239–326. Science & Technology Books, 2003. Available online at http://www.merl.com/papers/TR2002-35/.