{"title": "Novel iteration schemes for the Cluster Variation Method", "book": "Advances in Neural Information Processing Systems", "page_first": 415, "page_last": 422, "abstract": null, "full_text": "Novel iteration schemes for the Cluster Variation Method\n\nHilbert J. Kappen\nDepartment of Biophysics\nNijmegen University\nNijmegen, the Netherlands\nbert@mbfys.kun.nl\n\nWim Wiegerinck\nDepartment of Biophysics\nNijmegen University\nNijmegen, the Netherlands\nwimw@mbfys.kun.nl\n\nAbstract\n\nThe Cluster Variation Method is a class of approximation methods containing the Bethe and Kikuchi approximations as special cases. We derive two novel iteration schemes for the Cluster Variation Method. One is a fixed point iteration scheme which gives a significant improvement over loopy BP, mean field and TAP methods on directed graphical models. The other is a gradient based method, which is guaranteed to converge and is shown to give useful results on random graphs with mild frustration. We conclude that the methods are of significant practical value for large inference problems.\n\n1 Introduction\n\nBelief Propagation (BP) is a message passing scheme, which is known to yield exact inference in tree structured graphical models [1]. It has been noted by several authors that Belief Propagation can also give impressive results for graphs that are not trees [2].\n\nThe Cluster Variation Method (CVM) is a method that has been developed in the physics community for approximate inference in the Ising model [3]. The CVM approximates the joint probability distribution by a number of (overlapping) marginal distributions (clusters). The quality of the approximation is determined by the size and number of clusters. When the clusters consist of only two variables, the method is known as the Bethe approximation.
Recently, the method has been introduced into the machine learning community by Yedidia et al. [4], who showed that in the Bethe approximation, the CVM solution coincides with the fixed points of the belief propagation algorithm. For clusters consisting of more than two variables, [4] present a message passing scheme called generalized belief propagation (GBP). This approximation to the free energy is often referred to as the Kikuchi approximation. They show that GBP gives a significant improvement over the Bethe approximation for a small two-dimensional Ising lattice with random couplings. However, for larger lattices, both GBP and BP fail to converge [4, 5].\n\nIn [5] the CCCP method is proposed, which is a double loop iteration algorithm that is guaranteed to converge for the general CVM problem. Intuitively, the method consists of iterating a sequence of convex subproblems (outer loop), each of which is solved using a fixed point iteration method (inner loop). In this sense, the method is similar to the UPS algorithm of [6], which identifies trees as subproblems.\n\nIn this paper, we propose two algorithms: one is a fixed point iteration procedure, the other a gradient based method. We show that the fixed point iteration method gives very fast convergence and accurate results for some classical directed graphical models. However, for more challenging cases the fixed point method does not converge, and the gradient based approach, which is guaranteed to converge, is preferable.\n\n2 The Cluster Variation Method\n\nIn this section, we briefly present the cluster variation method. For a more complete treatment see for instance [7]. Let x = (x_1, ..., x_n) be a set of variables, where each x_i can take a finite number of values.
Consider a probability distribution on x of the form\n\np_H(x) = (1/Z(H)) e^{-H(x)},   Z(H) = Σ_x e^{-H(x)}.\n\nIt is well known that p_H can be obtained as the minimum of the free energy, which is a functional over probability distributions of the following form:\n\nF_H(p) = ⟨H⟩ + ⟨log p⟩,   (1)\n\nwhere the expectation value is taken with respect to the distribution p, i.e. ⟨H⟩ = Σ_x p(x)H(x). When one minimizes F_H(p) with respect to p under the constraint of normalization Σ_x p(x) = 1, one obtains p_H.\n\nComputing marginals of p_H such as p_H(x_i) or p_H(x_i, x_j) involves sums over all states, which is intractable for large n. Therefore, one needs tractable approximations to p_H. The cluster variation method replaces the probability distribution p_H(x) by a large number of (possibly overlapping) probability distributions, each describing a subset (cluster) of variables. Due to the one-to-one correspondence between a probability distribution and the minimum of a free energy, we can define approximate probability distributions by constructing approximate free energies and computing their minimum. This is achieved by approximating Eq. 1 in terms of the cluster probabilities. The solution is obtained by minimizing this approximate free energy subject to normalization and consistency constraints.\n\nDefine clusters as subsets of distinct variables: x_α = (x_{i_1}, ..., x_{i_k}), with 1 ≤ i_j ≤ n. Consider the set of clusters P that describe the interactions in H and write H as a sum of these interactions:\n\nH(x) = Σ_{α ∈ P} H_α(x_α).\n\nWe now define a set of clusters B that will determine our approximation in the cluster variation method. For each cluster α ∈ B, we introduce a probability distribution p_α(x_α), which jointly must approximate p(x). B should at least contain the interactions in p(x) in the following way: ∀α ∈ P ⇒ ∃α' ∈ B, α ⊂ α'.
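For concreteness, this covering condition can be illustrated with a small sketch (hypothetical code, not part of the original paper): given the interaction clusters P, one simple choice of B keeps exactly the maximal clusters of P, so that every interaction cluster is contained in some element of B.

```python
# Hypothetical sketch: choose B from the interaction clusters P.
# Clusters are represented as frozensets of variable indices.
# Keeping only the maximal elements of P guarantees that every
# interaction cluster alpha in P is contained in some alpha' in B.

def minimal_B(P):
    P = [frozenset(a) for a in P]
    # drop every cluster that is strictly contained in another one
    return [a for a in P if not any(a < b for b in P)]

# pair and triplet interactions; {0,1} and {1,2} are absorbed by {0,1,2}
P = [{0, 1}, {1, 2}, {0, 1, 2}, {3, 4}]
B = minimal_B(P)
```

With this choice, no cluster of B contains another, which is the additional requirement imposed below.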
In addition, we demand that no two clusters in B contain each other: α, α' ∈ B ⇒ α ⊄ α', α' ⊄ α. The minimal choice for B is to choose clusters from P itself. The maximal choice for B is the set of cliques obtained when constructing the junction tree [8]. In this case, the clusters in B form a tree structure and the CVM is exact.\n\nDefine a set of clusters M consisting of all intersections of clusters of B: M = {β | β = ∩_k α_k, α_k ∈ B}, and define U = B ∪ M. Once U is given, we define numbers a_β recursively by the Moebius formula\n\n1 = Σ_{α ∈ U, α ⊇ β} a_α,   ∀β ∈ U.\n\nIn particular, this shows that a_α = 1 for α ∈ B.\n\nThe Moebius formula allows us to rewrite ⟨H⟩ in terms of the cluster probabilities\n\n⟨H⟩ = Σ_{α ∈ U} a_α Σ_{x_α} p_α(x_α) H_α(x_α),   (2)\n\nwith H_α(x_α) = Σ_{β ∈ P, β ⊂ α} H_β(x_β). Since interactions H_β may appear in more than one H_α, the constants a_α ensure that double counting is compensated for.\n\nWhereas ⟨H⟩ can be written exactly in terms of p_α, this is not the case for the entropy term in Eq. 1. The approach is to decompose the entropy of a cluster α in terms of 'connected entropies' in the following way:¹\n\nS_α = -Σ_{x_α} p_α(x_α) log p_α(x_α) = Σ_{β ⊆ α} S̄_β,   (3)\n\nwhere the sum over β contains all subclusters of α. Such a decomposition can be made for any cluster. In particular, it can be made for the 'cluster' consisting of all variables, so that we obtain\n\nS = -Σ_x p(x) log p(x) = Σ_β S̄_β.   (4)\n\nThe cluster variation method approximates the total entropy by restricting this latter sum to only clusters in U and re-expressing S̄_β in terms of S_α, using the Moebius formula and the definition Eq. 3:\n\nS ≈ Σ_{β ∈ U} S̄_β = Σ_{β ∈ U} Σ_{α ∈ U, α ⊇ β} a_α S̄_β ≈ Σ_{α ∈ U} a_α S_α.   (5)\n\nSince S_α is a function of p_α (Eq. 3), we have expressed the entropy in terms of the cluster probabilities p_α.\n\nThe quality of this approximation is illustrated in Fig. 1 for the SK model.
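As an illustrative sketch (hypothetical code, not from the paper), the sets M and U and the Moebius numbers a_β can be computed directly from B by closing B under intersections and then applying the recursion 1 = Σ_{α ⊇ β} a_α from the largest clusters downward.

```python
from itertools import combinations

# Hypothetical sketch: compute U = B ∪ M and the Moebius numbers a_beta.
# Clusters are frozensets of variable indices.

def moebius_numbers(B):
    U = {frozenset(a) for a in B}
    # M: close the set under non-empty pairwise intersections
    changed = True
    while changed:
        changed = False
        for a, b in combinations(list(U), 2):
            c = a & b
            if c and c not in U:
                U.add(c)
                changed = True
    # 1 = sum_{alpha in U, alpha ⊇ beta} a_alpha, solved largest-first
    nums = {}
    for beta in sorted(U, key=len, reverse=True):
        nums[beta] = 1 - sum(nums[alpha] for alpha in U
                             if alpha != beta and alpha > beta)
    return nums

# Bethe-style example: B = all pairs of three variables
a = moebius_numbers([{0, 1}, {1, 2}, {0, 2}])
# pairs get a = 1; each shared single variable gets a = 1 - 2 = -1
```

The largest-first ordering works because every strict superset of a cluster is strictly larger, so its Moebius number is already known when the recursion reaches that cluster.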
Note that both the Bethe and Kikuchi approximations strongly deteriorate around J = 1, which is where the spin-glass phase starts. For J < 1, the Kikuchi approximation is superior to the Bethe approximation. Note, however, that this figure only illustrates the quality of the truncations in Eq. 5, assuming that the exact marginals are known. It does not say anything about the accuracy of the approximate marginals using the approximate free energy.\n\nSubstituting Eqs. 2 and 5 into the free energy Eq. 1, we obtain the approximate free energy of the Cluster Variation Method. This free energy must be minimized subject to normalization constraints Σ_{x_α} p_α(x_α) = 1 and consistency constraints\n\np_α(x_β) = p_β(x_β),   α, β ∈ U, β ⊂ α,   (6)\n\nwith p_α(x_β) = Σ_{x_{α\\β}} p_α(x_α).\n\n¹ This decomposition is similar to writing a correlation in terms of means and covariances. For instance, when α = (i), S_(i) = S̄_(i) is the usual mean field entropy, and S_(ij) = S̄_(i) + S̄_(j) + S̄_(ij) defines the two node correction S̄_(ij).\n\n[Figure 1 here: exact and approximate entropies as a function of J.]\n\nFigure 1: Exact and approximate entropies for the fully connected Boltzmann-Gibbs distribution on n = 10 variables with random couplings (SK model) as a function of mean coupling strength. Couplings w_ij are chosen from a Gaussian distribution with mean zero and standard deviation J/√n. External fields θ_i are chosen from a Gaussian distribution with mean zero and standard deviation 0.1. The exact entropy is computed from Eq. 4. The Bethe and Kikuchi approximations are computed using the approximate entropy expression Eq. 5 with exact marginals and by choosing B as the set of all pairs and all triplets, respectively.\n\nThe set of consistency constraints can be significantly reduced because some constraints imply others. Let α, α', ...
denote clusters in B and β, β', ... denote clusters in M.\n\n• If β ⊂ β' ⊂ α and p_α(x_β') = p_β'(x_β') and p_α(x_β) = p_β(x_β), then p_β'(x_β) = p_β(x_β). This means that constraints between clusters in M can be removed.\n\n• If β ⊂ β' ⊂ α, α' and p_α(x_β') = p_α'(x_β') and p_α(x_β) = p_β(x_β), then p_α'(x_β) = p_β(x_β). This means that some constraints between clusters in B and M can be removed.\n\nWe denote the remaining necessary constraints by α → β.\n\nAdding Lagrange multipliers for the constraints, we obtain the Cluster Variation free energy:\n\nF_CVM = Σ_{α ∈ U} a_α Σ_{x_α} p_α(x_α) (H_α(x_α) + log p_α(x_α)) - Σ_{α ∈ U} λ_α (Σ_{x_α} p_α(x_α) - 1) - Σ_{β ∈ M} Σ_{α → β} Σ_{x_β} λ_{αβ}(x_β) (p_α(x_β) - p_β(x_β)).   (7)\n\n3 Iterating Lagrange multipliers\n\nBy setting ∂F_CVM/∂p_α(x_α) = 0, α ∈ U, one can express the cluster probabilities in terms of the Lagrange multipliers:\n\np_α(x_α) ∝ exp(-H_α(x_α) + Σ_{β: α → β} λ_{αβ}(x_β)),   α ∈ B,   (8)\n\np_β(x_β) ∝ exp(-H_β(x_β) - (1/a_β) Σ_{α: α → β} λ_{αβ}(x_β)),   β ∈ M.   (9)\n\nThe remaining task is to solve for the Lagrange multipliers such that all constraints (Eq. 6) are satisfied. We present two ways to do this.\n\nWhen one substitutes Eqs. 8-9 into the constraint Eqs. 6, one obtains a system of coupled non-linear equations. In Yedidia et al. [4] a message passing algorithm was proposed to find a solution to this problem. Here, we will present an alternative method that solves directly in terms of the Lagrange multipliers.\n\n3.1 Fixed point iteration\n\nConsider the constraints Eq. 6 for some fixed cluster β and all clusters α → β, and define B_β = {α ∈ B | α → β}. We wish to solve for all constraints α → β, with α ∈ B_β, by adjusting λ_{αβ}, α ∈ B_β. This is a sub-problem with |B_β| |x_β| equations and an equal number of unknowns, where |B_β| is the number of elements of B_β and |x_β| is the number of values that x_β can take.
The probability distribution p_β (Eq. 9) depends only on these Lagrange multipliers. p_α (Eq. 8) also depends on other Lagrange multipliers. However, we consider only its dependence on λ_{αβ}, α ∈ B_β, and treat all other Lagrange multipliers as fixed. Thus,\n\np_α(x_α) = p̄_α(x_α) exp(λ_{αβ}(x_β)),   (10)\n\nwith p̄_α independent of λ_{αβ}, α ∈ B_β.\n\nSubstituting Eqs. 9 and 10 into Eq. 6, we obtain a set of linear equations for λ_{αβ}(x_β), which we can solve in closed form:\n\nλ_{αβ}(x_β) = -(1/(a_β + |B_β|)) H_β(x_β) - Σ_{α'} A_{αα'} log p̄_{α'}(x_β),\n\nwith\n\nA_{αα'} = δ_{αα'} - 1/(a_β + |B_β|).\n\nWe update the probabilities with the new values of the Lagrange multipliers using Eqs. 9 and 10. We repeat the above procedure for all β ∈ M until convergence.\n\n3.2 Gradient descent\n\nWe define an auxiliary cost function\n\nC = Σ_{α → β} Σ_{x_β} p_β(x_β) log (p_β(x_β) / p_α(x_β)) = Σ_{α → β} C_{αβ},   (11)\n\nwhich is zero when all constraints are satisfied and positive otherwise, and we minimize this cost function with respect to the Lagrange multipliers λ_{αβ}(x_β). The gradient of C is given by\n\n∂C/∂λ_{αβ}(x_β) = -(p_β(x_β)/a_β) Σ_{α' → β} ( log(p_β(x_β)/p_{α'}(x_β)) - C_{α'β} ) + Σ_{β' ← α} Σ_{x_{β'}} p_{β'}(x_{β'}) ( p_α(x_β) - p_α(x_β | x_{β'}) ).\n\n4 Numerical results\n\n4.1 Directed graphical models\n\nWe show the performance of the fixed point iteration procedure on several 'real world' directed graphical models. In figure 2a, we plot the exact single node marginals against the approximate marginals for the Asia problem [8]. Clusters in B are defined according to the conditional probability tables. Convergence was reached in 6 iterations using fixed point iteration. The maximal error on the marginals is 0.0033. For comparison, we computed the mean field and TAP approximations, as previously introduced by [9]. Although TAP is significantly better than MF, it is far worse than the CVM method.
This is not surprising, since both the MF and TAP approximations are based on single node approximations, whereas the CVM method uses potentials up to size 3.\n\nIn figure 2b, we plot the exact single node marginals against the approximate CVM marginals for the alarm network [10]. The structure and CPTs were downloaded from www.cs.huji.ac.il/~nir. Clusters in B are defined according to the conditional probability tables and maximally contain 5 variables. Convergence was reached in 15 iterations using fixed point iteration. The maximal error on the marginals is 0.029. Ordinary loopy BP gives an error in the marginals of approximately 0.25 [2]. Mean field and TAP methods did not give reproducible results on this problem.\n\n[Figure 2 here: scatter plots of approximate versus exact single node marginals. (a) Asia problem (n = 8). (b) Alarm problem (n = 37).]\n\nFigure 2: Comparison of single node marginals on two real world problems.\n\nFinally, we tested the cluster variation method on randomly generated directed graphical models. Each node is randomly connected to k parents. The entries of the probability tables are randomly generated between zero and one. Due to the large number of loops in the graph, the exact method requires exponential time in the maximum clique size, which can be seen from Table 1 to scale approximately linearly with the network size. Therefore exact computation is only feasible for small graphs (up to size n = 40 in this case).\n\nFor the CVM, clusters in B are defined according to the conditional probability tables. Therefore, the maximal cluster size is k + 1. On these more challenging cases, the fixed point iteration method does not converge. The results shown are obtained with conjugate gradient descent on the auxiliary cost function Eq. 11.
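As an illustration of this gradient based scheme (a toy numerical sketch under simplifying assumptions, not the paper's conjugate gradient implementation), consider two outer clusters α₁ = (x0, x1) and α₂ = (x1, x2) of binary variables sharing the inner cluster β = (x1), for which a_β = -1. The cluster marginals are parametrized by the multipliers λ_{αβ}(x1) as in Eqs. 8-9, and the auxiliary cost of Eq. 11 is minimized by plain gradient descent with a finite difference gradient:

```python
import numpy as np

# Toy sketch (hypothetical, not the paper's code): two outer clusters
# alpha1 = (x0, x1), alpha2 = (x1, x2) of binary variables, one inner
# cluster beta = (x1) with Moebius number a_beta = -1.

rng = np.random.default_rng(0)
H1 = rng.normal(size=(2, 2))   # H_alpha1(x0, x1)
H2 = rng.normal(size=(2, 2))   # H_alpha2(x1, x2)
a_beta = -1.0

def marginals(lam):
    # Eq. 8: p_alpha ∝ exp(-H_alpha + lambda_{alpha,beta}(x1))
    p1 = np.exp(-H1 + lam[0][None, :]); p1 /= p1.sum()
    p2 = np.exp(-H2 + lam[1][:, None]); p2 /= p2.sum()
    # Eq. 9: p_beta ∝ exp(-(1/a_beta) * sum_alpha lambda_{alpha,beta}(x1))
    pb = np.exp(-(lam[0] + lam[1]) / a_beta); pb /= pb.sum()
    return p1, p2, pb

def cost(lam):
    # Eq. 11: C = sum_alpha KL(p_beta || p_alpha(x1))
    p1, p2, pb = marginals(lam)
    return sum(np.sum(pb * np.log(pb / m))
               for m in (p1.sum(axis=0), p2.sum(axis=1)))

lam = np.zeros((2, 2))         # lam[i] = lambda_{alpha_i, beta}(x1)
eps, lr = 1e-5, 0.2
for _ in range(5000):
    g = np.zeros_like(lam)
    for idx in np.ndindex(*lam.shape):   # central finite differences
        d = np.zeros_like(lam); d[idx] = eps
        g[idx] = (cost(lam + d) - cost(lam - d)) / (2 * eps)
    lam -= lr * g

p1, p2, pb = marginals(lam)
# at convergence the constraints hold: p1(x1) ≈ p2(x1) ≈ p_beta
```

Since C is a sum of KL divergences, it is non-negative and vanishes exactly when all consistency constraints of Eq. 6 are satisfied, which is what makes it a safe objective for descent methods.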
The results are shown in Table 1.\n\n n   Iter   |G|   Potential error   Margin error   C\n10     16     8   0.004             0.018          9.7e-11\n20    189    12   0.029             0.019          2.4e-4\n30    157    16   0.130             0.033          2.1e-3\n40    148    21   0.144             0.048          3.6e-3\n50    132    26   -                 -              4.5e-3\n\nTable 1: Comparison of the CVM method for large directed graphical models. Each node is connected to k = 5 parents. |G| is the tree width of the triangulated graph required for the exact computation. Iter is the number of conjugate gradient descent iterations of the CVM method. Potential error and margin error are the maximum absolute distance (MAD) in any of the cluster probabilities and single variable marginals computed with CVM, respectively. C is given by Eq. 11 after termination of CVM.\n\n4.2 Markov networks\n\nWe compare the Bethe and Kikuchi approximations for the SK model with n = 5 neurons as defined in Fig. 1. We expect that for small J the CVM approximation gives accurate results and deteriorates for larger J.\n\nWe compare the Bethe approximation, where we define clusters for all pairs of nodes, and a Kikuchi approximation, where we define clusters for all subsets of three nodes. The results are given in Table 2. We see that for the Bethe approximation, the results of the fixed point iteration method (FPI) and the gradient based approach agree. For the Kikuchi approximation, the fixed point iteration method does not converge and results are omitted. As expected, the Kikuchi approximation gives more accurate results than the Bethe approximation for small J.\n\n5 Conclusion\n\nWe have presented two iteration schemes for finding the minimum of the constrained problem Eq. 7. One method is a fixed point iteration method that is equivalent to belief propagation for pairwise interactions.
This method is very fast and gives very accurate results for 'not too complex' graphical models, such as real world directed graphical models and frustrated Boltzmann distributions in the Bethe approximation. However, for more complex graphs, such as random directed graphs, or more complex approximations, such as the Kikuchi approximation, the fixed point iteration method does not converge. Empirically, it is found that smoothing may somewhat help, but certainly does not solve this problem. For these more complex problems we propose to minimize an auxiliary cost function using a gradient based method. Clearly, this approach is guaranteed to converge. Empirically, we have found no problems with local minima. However, we have found that obtaining solutions with C close to zero may require many iterations.\n\n       Bethe (FPI)     Bethe (gradient)   Kikuchi (gradient)\n  J    Iter  Error     Iter  Error        Iter  Error\n0.25      7  0.000161     7  0.000548      120  0.000012\n0.50      9  0.001297    11  0.001263      221  0.000355\n0.75     13  0.004325    14  0.004392       86  0.021176\n1.00     17  0.009765    15  0.009827       49  0.036882\n1.50     38  0.027217    16  0.027323      150  0.059977\n2.00     75  0.049955    20  0.049831      137  0.088481\n\nTable 2: Comparison of the Bethe and Kikuchi approximations for Boltzmann distributions. Iter is the number of iterations needed. Error is the MAD of single variable marginals.\n\nAcknowledgments\n\nThis research was supported in part by the Dutch Technology Foundation (STW). I would like to thank Taylan Cemgil for providing his Matlab graphical models toolkit and Sebino Stramaglia (Bari, Italy) for useful discussions.\n\nReferences\n\n[1] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco, California, 1988.\n\n[2] K. P. Murphy, Y. Weiss, and M. I. Jordan.
Loopy belief propagation for approximate inference: an empirical study. In Proceedings of Uncertainty in AI, pages 467-475, 1999.\n\n[3] R. Kikuchi. Physical Review, 81:988, 1951.\n\n[4] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13 (Proceedings of the 2000 Conference), 2001. In press.\n\n[5] A. L. Yuille and A. Rangarajan. The convex-concave principle. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, 2002. In press.\n\n[6] Y. Teh and M. Welling. The unified propagation and scaling algorithm. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, 2002. In press.\n\n[7] H. J. Kappen. The cluster variation method for approximate reasoning in medical diagnosis. In G. Nardulli and S. Stramaglia, editors, Modeling Bio-medical Signals. World-Scientific, 2002. In press.\n\n[8] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50:154-227, 1988.\n\n[9] H. J. Kappen and W. A. J. J. Wiegerinck. Second order approximations for probability models. In T. Leen, T. Dietterich, R. Caruana, and V. de Sa, editors, Advances in Neural Information Processing Systems 13, pages 238-244. MIT Press, 2001.\n\n[10] I. Beinlich, G. Suermondt, R. Chaves, and G. Cooper. The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks. In 2nd European Conference on AI in Medicine, 1989.\n", "award": [], "sourceid": 2135, "authors": [{"given_name": "Hilbert", "family_name": "Kappen", "institution": null}, {"given_name": "Wim", "family_name": "Wiegerinck", "institution": null}]}