{"title": "Gauging Variational Inference", "book": "Advances in Neural Information Processing Systems", "page_first": 2881, "page_last": 2890, "abstract": "Computing partition function is the most important statistical inference task arising in applications of Graphical Models (GM). Since it is computationally intractable, approximate methods have been used in practice, where mean-field (MF) and belief propagation (BP) are arguably the most popular and successful approaches of a variational type. In this paper, we propose two new variational schemes, coined Gauged-MF (G-MF) and Gauged-BP (G-BP), improving MF and BP, respectively. Both provide lower bounds for the partition function by utilizing the so-called gauge transformation which modifies factors of GM while keeping the partition function invariant. Moreover, we prove that both G-MF and G-BP are exact for GMs with a single loop of a special structure, even though the bare MF and BP perform badly in this case. Our extensive experiments indeed confirm that the proposed algorithms outperform and generalize MF and BP.", "full_text": "Gauging Variational Inference\n\nSungsoo Ahn\u2217 Michael Chertkov\u2020\n\n\u2217School of Electrical Engineering,\n\nJinwoo Shin\u2217\n\nKorea Advanced Institute of Science and Technology, Daejeon, Korea\n\n\u20201 Theoretical Division, T-4 & Center for Nonlinear Studies,\n\nLos Alamos National Laboratory, Los Alamos, NM 87545, USA,\n\n\u20202Skolkovo Institute of Science and Technology, 143026 Moscow, Russia\n\u2217{sungsoo.ahn, jinwoos}@kaist.ac.kr\n\u2020chertkov@lanl.gov\n\nAbstract\n\nComputing partition function is the most important statistical inference task arising\nin applications of Graphical Models (GM). Since it is computationally intractable,\napproximate methods have been used in practice, where mean-\ufb01eld (MF) and belief\npropagation (BP) are arguably the most popular and successful approaches of a\nvariational type. 
In this paper, we propose two new variational schemes, coined Gauged-MF (G-MF) and Gauged-BP (G-BP), improving MF and BP, respectively. Both provide lower bounds for the partition function by utilizing the so-called gauge transformation, which modifies factors of a GM while keeping the partition function invariant. Moreover, we prove that both G-MF and G-BP are exact for GMs with a single loop of a special structure, even though the bare MF and BP perform badly in this case. Our extensive experiments indeed confirm that the proposed algorithms outperform and generalize MF and BP.

1 Introduction

Graphical Models (GM) express factorization of joint multivariate probability distributions in statistics via a graph of relations between variables. The concept of GM has been developed and/or used successfully in information theory [1, 2], physics [3, 4, 5, 6, 7], artificial intelligence [8], and machine learning [9, 10]. Of the many inference problems one can formulate using a GM, computing the partition function (normalization), or equivalently computing marginal probability distributions, is the most important and universal inference task of interest. However, this paradigmatic problem is known to be computationally intractable in general, i.e., it is #P-hard even to approximate [11].
Markov chain Monte Carlo (MCMC) [12] is a classical approach to the inference task, but it typically suffers from exponentially slow mixing or large variance. Variational inference is an approach stating the inference task as an optimization. Hence, it does not share these issues of MCMC and is often more favorable. Mean-field (MF) [6] and belief propagation (BP) [13] are arguably the most popular algorithms of the variational type. They are distributed, fast and overall very successful in practical applications, even though they are heuristics lacking systematic error control. 
This has motivated researchers to seek methods with some guarantees, e.g., providing lower bounds [14, 15] and upper bounds [16, 17, 15] for the partition function of a GM.
In another line of research, which this paper extends and contributes to, the so-called re-parametrizations [18], gauge transformations (GT) [19, 20] and holographic transformations [21, 22] were explored. This class of distinct, but related, transformations consists in modifying a GM by changing factors, associated with elements of the graph, continuously such that the partition function stays the same/invariant.¹ In this paper, we choose to work with GT as the most general one among the three approaches. Once applied to a GM, it transforms the original partition function, defined as a weighted series/sum over states, into a new one, dependent on the choice of gauges. In particular, a fixed point of BP minimizes the so-called Bethe free energy [26], and it can also be understood as an optimal GT [19, 20, 27, 28]. Moreover, fixing GT in accordance with BP results in the so-called loop series expression for the partition function [19, 20]. In this paper we generalize [19, 20] and explore a more general class of GT: we develop a new gauge-optimization approach which results in 'better' variational inference schemes than MF, BP and other related methods.

Contribution. The main contribution of this paper consists in developing two novel variational methods, called Gauged-MF (G-MF) and Gauged-BP (G-BP), providing lower bounds on the partition function of a GM. While MF minimizes the (exact) Gibbs free energy under (reduced) product distributions, G-MF does the same task by introducing an additional GT. 

¹See [23, 24, 25] for discussions of relations between the aforementioned techniques.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Due to the additional degree of freedom in optimization, G-MF improves the lower bound of the partition function provided by MF systematically. Similarly, G-BP generalizes BP, extending the interpretation of the latter as an optimization of the Bethe free energy over GT [19, 20, 27, 28], by imposing additional constraints on GT, thus forcing all the terms in the resulting series for the partition function to remain non-negative. Consequently, G-BP results in a provable lower bound for the partition function, while BP does not (except for log-supermodular models [29]).
We prove that both G-MF and G-BP are exact for GMs defined over a single cycle, which we call an 'alternating cycle/loop', as well as over a line graph. The alternating cycle case is surprising as it represents the simplest 'counter-example' from [30], illustrating failures of MF and BP. For general GMs, we also establish that G-MF is better than, or at least as good as, G-BP. However, we also develop novel error-correction schemes for G-BP such that the lower bound of the partition function provided by G-BP is improved systematically/sequentially, eventually outperforming G-MF at the expense of increasing computational complexity. Such error-correction schemes have been studied for improving BP by accounting for the loop series consisting of positive and negative terms [31, 32]. According to our design of G-BP, the corresponding series consists of only non-negative terms, which leads to easier systematic corrections to G-BP.
We also show that the proposed GT-based optimizations can be restated as smooth and unconstrained, thus allowing efficient solutions via algorithms of a gradient descent type or any generic optimization solver, such as IPOPT [33]. We experiment with IPOPT on complete GMs of relatively small size and on large GMs (up to 300 variables) of fixed degree. 
Our experiments indeed confirm that the newly proposed algorithms outperform and generalize MF and BP. Finally, we remark that all statements of the paper are made within the framework of the so-called Forney-style GMs [34], which is general as it allows interactions beyond pair-wise (i.e., high-order GMs) and includes other/alternative GM formulations based on factor graphs [35].

2 Preliminaries

2.1 Graphical model

Factor-graph model. Given an (undirected) bipartite factor graph G = (X, F, E), a joint distribution of (binary) random variables x = [x_v ∈ {0,1} : v ∈ X] is called a factor-graph Graphical Model (GM) if it factorizes as follows:

p(x) = (1/Z) Π_{a∈F} f_a(x_∂a),

where f_a are some non-negative functions called factor functions, ∂a ⊆ X consists of the nodes neighboring factor a, and the normalization constant Z := Σ_{x∈{0,1}^X} Π_{a∈F} f_a(x_∂a) is called the partition function. A factor-graph GM is called pair-wise if |∂a| ≤ 2 for all a ∈ F, and high-order otherwise. It is known that approximating the partition function is #P-hard in general [11].

Forney-style model. In this paper, we primarily use the Forney-style GM [34] instead of the factor-graph GM. Elementary random variables in the Forney-style GM are associated with edges of an undirected graph, G = (V, E). Then the random vector x = [x_ab ∈ {0,1} : {a,b} ∈ E] is realized with the probability distribution

p(x) = (1/Z) Π_{a∈V} f_a(x_a),   (1)

where x_a is associated with the set of edges neighboring node a, i.e., x_a = [x_ab : b ∈ ∂a], and Z := Σ_{x∈{0,1}^E} Π_{a∈V} f_a(x_a). As argued in [19, 20], the Forney-style GM constitutes a more universal/compact description of gauge transformations without any restriction of generality: given any factor-graph GM, one can construct an equivalent Forney-style one (see the supplementary material).

2.2 Mean-field and belief propagation

We now introduce the two most popular methods for approximating the partition function: the mean-field and Bethe (i.e., belief propagation) approximation methods. Given any (Forney-style) GM p(x) defined as in (1) and any distribution q(x) over all variables, the Gibbs free energy is defined as

F_Gibbs(q) := Σ_{x∈{0,1}^E} q(x) log [ q(x) / Π_{a∈V} f_a(x_a) ].   (2)

The partition function is related to the Gibbs free energy according to −log Z = min_q F_Gibbs(q), where the optimum is achieved at q = p [35]. This optimization is over all valid probability distributions from the exponentially large space, and is obviously intractable.
In the case of the mean-field (MF) approximation, we minimize the Gibbs free energy over a family of tractable probability distributions factorized into the following product: q(x) = Π_{{a,b}∈E} q_ab(x_ab), where each independent q_ab(x_ab) is a proper probability distribution, behaving as a (mean-field) proxy to the marginal of q(x) over x_ab. By construction, the MF approximation provides a lower bound for log Z. 
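As a concreteness check (not part of the paper's own code), the Forney-style partition function and the mean-field lower bound can be evaluated by brute force on a toy model; the triangle graph and all variable names below are illustrative assumptions:

```python
import itertools
import numpy as np

# Toy Forney-style GM: a triangle graph. Each of the 3 edges carries a
# binary variable; each node carries a factor over its two incident edges.
edges = [(0, 1), (1, 2), (0, 2)]
incident = {a: [i for i, e in enumerate(edges) if a in e] for a in range(3)}

rng = np.random.default_rng(0)
factors = {a: rng.uniform(0.5, 2.0, size=(2,) * len(incident[a]))
           for a in range(3)}

def log_Z(fs):
    """Exact log-partition function by enumerating all edge configurations."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=len(edges)):
        w = 1.0
        for a, idx in incident.items():
            w *= fs[a][tuple(x[i] for i in idx)]
        total += w
    return np.log(total)

def mf_objective(fs, q):
    """Negated Gibbs free energy under a product distribution q over edges:
    sum_a E_q[log f_a(x_a)] plus the sum of per-edge entropies.
    By the Gibbs inequality this never exceeds log Z."""
    val = 0.0
    for a, idx in incident.items():
        for xa in itertools.product([0, 1], repeat=len(idx)):
            w = np.prod([q[idx[k]][xa[k]] for k in range(len(idx))])
            val += w * np.log(fs[a][xa])
    for e in range(len(edges)):
        val -= sum(q[e][v] * np.log(q[e][v]) for v in (0, 1))
    return val

# A random (non-optimized) product distribution already yields a lower bound.
q = {e: np.array([p, 1 - p]) for e, p in enumerate(rng.uniform(0.1, 0.9, 3))}
assert mf_objective(factors, q) <= log_Z(factors)
```

Optimizing `mf_objective` over `q` recovers the MF estimate; the assertion only illustrates that any product `q` gives a valid lower bound.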
In the case of the Bethe approximation, the so-called Bethe free energy approximates the Gibbs free energy [36]:

F_Bethe(b) = Σ_{a∈V} Σ_{x_a∈{0,1}^∂a} b_a(x_a) log [ b_a(x_a) / f_a(x_a) ] − Σ_{{a,b}∈E} Σ_{x_ab∈{0,1}} b_ab(x_ab) log b_ab(x_ab),   (3)

where the beliefs b = [b_a, b_ab : a ∈ V, {a,b} ∈ E] should satisfy the following 'consistency' constraints:

0 ≤ b_a, b_ab ≤ 1,   Σ_{x_ab∈{0,1}} b_ab(x_ab) = 1,   Σ_{x′_a\x_ab ∈ {0,1}^∂a} b_a(x′_a) = b_ab(x_ab)   ∀{a,b} ∈ E.

Here, x′_a\x_ab denotes a vector x′_a with x′_ab = x_ab fixed, and min_b F_Bethe(b) is the Bethe estimate for −log Z. The popular belief propagation (BP) distributed heuristic solves this optimization iteratively [36]. The Bethe approximation is exact over trees, i.e., −log Z = min_b F_Bethe(b). However, in the case of a general loopy graph, the BP estimate lacks approximation guarantees. It is known, however, that the result of BP optimization lower bounds the log-partition function, log Z, if the factors are log-supermodular [29].

2.3 Gauge transformation

Gauge transformation (GT) [19, 20] is a family of linear transformations of the factor functions in (1) which leaves the partition function Z invariant. 
It is defined with respect to the following set of invertible 2 × 2 matrices G_ab for {a,b} ∈ E, coined gauges:

G_ab = [ G_ab(0,0)  G_ab(0,1) ; G_ab(1,0)  G_ab(1,1) ].

The GM, gauge transformed with respect to G = [G_ab, G_ba : {a,b} ∈ E], consists of factors expressed as:

f_{a,G}(x_a) = Σ_{x′_a∈{0,1}^∂a} f_a(x′_a) Π_{b∈∂a} G_ab(x_ab, x′_ab).

Here one treats independent x_ab and x_ba equivalently for notational convenience, and {G_ab, G_ba} is a conjugated pair of distinct matrices satisfying the gauge constraint G_ab^T G_ba = I, where I is the identity matrix. Then, one can prove invariance of the partition function under the transformation:

Z = Σ_{x∈{0,1}^|E|} Π_{a∈V} f_a(x_a) = Σ_{x∈{0,1}^|E|} Π_{a∈V} f_{a,G}(x_a).   (4)

Consequently, GT results in the gauge transformed distribution p_G(x) = (1/Z) Π_{a∈V} f_{a,G}(x_a). Note that some components of p_G(x) can be negative, in which case it is not a valid distribution.
We remark that the Bethe/BP approximation can be interpreted as a specific choice of GT [19, 20]. Indeed, any fixed point of BP corresponds to a special set of gauges making an arbitrarily picked configuration/state x least sensitive to local variations of the gauge. Formally, the following non-convex optimization is known to be equivalent to the Bethe approximation:

maximize_G   Σ_{a∈V} log f_{a,G}(0, 0, ···)
subject to   G_ab^T G_ba = I,   ∀{a,b} ∈ E,   (5)

and the set of BP-gauges corresponds to stationary points of (5), with the objective equal to the respective (negated) Bethe free energy, i.e., Σ_{a∈V} log f_{a,G}(0, 0, ···) = −F_Bethe.

3 Gauge optimization for approximating partition functions

Now we are ready to describe two novel gauge optimization schemes (different from (5)) providing guaranteed lower-bound approximations for log Z. Our first GT scheme, coined Gauged-MF (G-MF), should be viewed as modifying and improving the MF approximation, while our second GT scheme, coined Gauged-BP (G-BP), modifies and improves the Bethe approximation in a way that it now provides a provable lower bound for log Z, while the bare BP does not have such guarantees. The G-BP scheme also allows further improvement (in terms of output quality) at the expense of making the underlying algorithm/computation more complex.

3.1 Gauged mean-field

We first propose the following optimization inspired by, and also improving, the MF approximation:

maximize_{q,G}   Σ_{a∈V} Σ_{x_a∈{0,1}^∂a} q_a(x_a) log f_{a,G}(x_a) − Σ_{{a,b}∈E} Σ_{x_ab∈{0,1}} q_ab(x_ab) log q_ab(x_ab)
subject to   G_ab^T G_ba = I,   ∀{a,b} ∈ E,
             f_{a,G}(x_a) ≥ 0,   ∀a ∈ V, ∀x_a ∈ {0,1}^∂a,
             q(x) = Π_{{a,b}∈E} q_ab(x_ab),   q_a(x_a) = Π_{b∈∂a} q_ab(x_ab),   ∀a ∈ V.   (6)

Recall that the MF approximation optimizes the Gibbs free energy with respect to q given the original GM, i.e., its factors. On the other hand, (6) jointly optimizes it over q and G. Since the partition function of the gauge transformed GM is equal to that of the original GM, (6) also outputs a lower bound on the (original) partition function, and always outperforms MF due to the additional degree of freedom in G. 
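The invariance (4) that this relies on is easy to verify numerically: draw an invertible G_ab per edge, set G_ba = (G_ab^T)^{-1}, transform the factors, and compare partition functions. A sketch on a hypothetical triangle Forney-style GM (all names are illustrative):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (0, 2)]  # triangle Forney-style GM
incident = {a: [i for i, e in enumerate(edges) if a in e] for a in range(3)}
factors = {a: rng.uniform(0.5, 2.0, size=(2, 2)) for a in range(3)}

def Z(fs):
    """Partition function by enumeration of the 2^|E| edge configurations."""
    return sum(
        np.prod([fs[a][tuple(x[i] for i in idx)]
                 for a, idx in incident.items()])
        for x in itertools.product([0, 1], repeat=len(edges)))

# One 2x2 gauge per (node, edge) slot; the conjugated pair obeys
# G_ab^T G_ba = I, i.e. G_ba = (G_ab^T)^{-1}, which is also how the gauge
# constraint can be eliminated in an unconstrained formulation.
G = {}
for i, (a, b) in enumerate(edges):
    Gab = rng.normal(size=(2, 2)) + 2.0 * np.eye(2)  # invertible for this seed
    G[(a, i)], G[(b, i)] = Gab, np.linalg.inv(Gab.T)

def gauge_transform(fs):
    """f_{a,G}(x_a) = sum_{x'_a} f_a(x'_a) prod_{b in da} G_ab(x_ab, x'_ab)."""
    new = {}
    for a, idx in incident.items():
        fa = np.zeros_like(fs[a])
        for x in itertools.product([0, 1], repeat=len(idx)):
            fa[x] = sum(
                fs[a][xp] * np.prod([G[(a, idx[k])][x[k], xp[k]]
                                     for k in range(len(idx))])
                for xp in itertools.product([0, 1], repeat=len(idx)))
        new[a] = fa
    return new

# Invariance (4): Z is unchanged by the gauge transformation.
assert abs(Z(factors) - Z(gauge_transform(factors))) < 1e-8 * Z(factors)
```

Note that individual transformed factors can be negative here; only their weighted sum, Z, is preserved.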
The non-negative constraints f_{a,G}(x_a) ≥ 0 for each factor enforce that the gauge transformed GM results in a valid probability distribution (all components are non-negative).
To solve (6), we propose a strategy alternating between two optimizations, formally stated in Algorithm 1. The alternation is between updating q, within Step A, and updating G, within Step C. The optimization in Step A is simple as one can apply any solver of the mean-field approximation. On the other hand, Step C requires a new solver and, at first glance, looks complicated due to nonlinear constraints. However, the constraints can actually be eliminated. Indeed, one observes that the non-negative constraint f_{a,G}(x_a) ≥ 0 is redundant, because each term q(x_a) log f_{a,G}(x_a) in the optimization objective already prevents factors from getting close to zero, thus keeping them positive. Equivalently, once the current G satisfies the non-negative constraints, the objective q(x_a) log f_{a,G}(x_a) acts as a log-barrier forcing the constraints to be satisfied at the next step within an iterative optimization procedure. Furthermore, the gauge constraint G_ab^T G_ba = I can also be removed by simply expressing one (of the two) gauges via the other, e.g., G_ba via (G_ab^T)^{-1}. Then, Step C can be resolved by any unconstrained iterative optimization method of a gradient descent type. Next, the additional (intermediate) procedure, Step B, was introduced to handle extreme cases when, for some {a,b}, q_ab(x_ab) = 0 at the optimum. We resolve the singularity by perturbing the distribution, setting zero probabilities to a small value q_ab(x_ab) = δ, where δ > 0 is sufficiently small.

Algorithm 1 Gauged mean-field
1: Input: GM defined over graph G = (V, E) with factors {f_a}_{a∈V}. A sequence of decreasing barrier terms δ_1 > δ_2 > ··· > δ_T > 0 (to handle extreme cases).
2: for t = 1, 2, ··· , T do
3:   Step A. Update q by solving the mean-field approximation, i.e., solve the following optimization:
       maximize_q   Σ_{a∈V} Σ_{x_a∈{0,1}^∂a} q_a(x_a) log f_{a,G}(x_a) − Σ_{{a,b}∈E} Σ_{x_ab∈{0,1}} q_ab(x_ab) log q_ab(x_ab)
       subject to   q(x) = Π_{{a,b}∈E} q_ab(x_ab),   q_a(x_a) = Π_{b∈∂a} q_ab(x_ab),   ∀a ∈ V.
4:   Step B. For marginals with zero values, i.e., q_ab(x_ab) = 0, perturb by setting
       q_ab(x′_ab) = δ_t if x′_ab = x_ab, and 1 − δ_t otherwise.
5:   Step C. Update G by solving the following optimization:
       maximize_G   Σ_{x∈{0,1}^E} q(x) log Π_{a∈V} f_{a,G}(x_a)
       subject to   G_ab^T G_ba = I,   ∀{a,b} ∈ E.
6: end for
7: Output: Set of gauges G and product distribution q.

In summary, it is straightforward to check that Algorithm 1 converges to a local optimum of (6), similarly to other solvers developed for the mean-field and Bethe approximations.
We also provide an important class of GMs where Algorithm 1 provably outperforms both the MF and BP (Bethe) approximations. Specifically, we prove that the optimization (6) is exact in the case when the graph is a line (which is a special case of a tree) and, somewhat surprisingly, a single loop/cycle with an odd number of factors represented by negative definite matrices. In fact, the latter case is the so-called 'alternating cycle' example, which was introduced in [30] as the simplest loopy example where the MF and BP approximations perform quite badly. Formally, we state the following theorem, whose proof is given in the supplementary material.
Theorem 1. 
For a GM defined on any line graph or alternating cycle, the optimal objective of (6) is equal to the exact log-partition function, i.e., log Z.

3.2 Gauged belief propagation

We start the discussion of the G-BP scheme by noticing that, according to [37], the G-MF gauge optimization (6) can be reduced to the BP/Bethe gauge optimization (5) by eliminating the non-negative constraint f_{a,G}(x_a) ≥ 0 for each factor and replacing the product distribution q(x) by:

q(x) = 1 if x = (0, 0, ···), and 0 otherwise.   (7)

Motivated by this observation, we propose the following G-BP optimization:

maximize_G   Σ_{a∈V} log f_{a,G}(0, 0, ···)
subject to   G_ab^T G_ba = I,   ∀{a,b} ∈ E,
             f_{a,G}(x_a) ≥ 0,   ∀a ∈ V, ∀x_a ∈ {0,1}^∂a.   (8)

The only difference between (5) and (8) is the addition of the non-negative constraints for factors in (8). Hence, (8) outputs a lower bound on the partition function, while (5) can be larger or smaller than log Z. It is also easy to verify that (8) (for G-BP) is equivalent to (6) (for G-MF) with q fixed to (7). Hence, we propose the algorithmic procedure for solving (8), formally described in Algorithm 2, which should be viewed as a modification of Algorithm 1 with q replaced by (7) in Step A, and with a properly chosen log-barrier term in Step C. As we discussed for Algorithm 1, it is straightforward to verify that Algorithm 2 also converges to a local optimum of (8), and one can replace G_ba by (G_ab^T)^{-1} for each pair of conjugated matrices in order to build a convergent gradient-descent implementation of the optimization.

Algorithm 2 Gauged belief propagation
1: Input: GM defined over graph G = (V, E) with factors {f_a}_{a∈V}. 
A sequence of decreasing barrier terms δ_1 > δ_2 > ··· > δ_T > 0.
2: for t = 1, 2, ··· do
3:   Update G by solving the following optimization:
       maximize_G   Σ_{a∈V} log f_{a,G}(0, 0, ···) + δ_t Σ_{x∈{0,1}^E} q(x) log Π_{a∈V} f_{a,G}(x_a)
       subject to   G_ab^T G_ba = I,   ∀{a,b} ∈ E.
4: end for
5: Output: Set of gauges G.

Since fixing q(x) eliminates a degree of freedom in (6), G-BP should perform worse than G-MF, i.e., (8) ≤ (6). However, G-BP is still meaningful for the following reasons. First, Theorem 1 still holds for (8), i.e., the optimal q of (6) is achieved at (7) for any line graph or alternating cycle (see the proof of Theorem 1 in the supplementary material). More importantly, G-BP can be corrected systematically. At a high level, the "error-correction" strategy consists in correcting the approximation error of (8) sequentially while maintaining the desired lower-bounding guarantee. The key idea here is to decompose the error of (8) into partition functions of multiple GMs, and then repeatedly lower bound each partition function. Formally, we fix an arbitrary ordering of edges e_1, ···, e_|E| and define the corresponding GM for each e_i as follows: p(x) = (1/Z_i) Π_{a∈V} f_{a,G}(x_a) for x ∈ X_i, where Z_i := Σ_{x∈X_i} Π_{a∈V} f_{a,G}(x_a) and

X_i := {x : x_{e_i} = 1, x_{e_j} = 0, x_{e_k} ∈ {0,1}, ∀ j, k such that 1 ≤ j < i < k ≤ |E|}.

Namely, we consider GMs obtained by sequential conditioning of x_{e_1}, ···, x_{e_i} in the gauge transformed GM. Next, recall that (8) maximizes and outputs a single configuration term Π_a f_{a,G}(0, 0, ···). Then, since X_i ∩ X_j = ∅ and ∪_{i=1}^{|E|} X_i = {0,1}^E \ (0, 0, ···), the error of (8) can be decomposed as follows:

Z − Π_{a∈V} f_{a,G}(0, 0, ···) = Σ_{i=1}^{|E|} Σ_{x∈X_i} Π_{a∈V} f_{a,G}(x) = Σ_{i=1}^{|E|} Z_i.   (9)

Now, one can run G-MF, G-BP or any other method (e.g., MF) again to obtain a lower bound Ẑ_i of Z_i for all i and then output Π_{a∈V} f_{a,G}(0, 0, ···) + Σ_{i=1}^{|E|} Ẑ_i. However, such additional runs of optimization inevitably increase the overall complexity. Instead, one can also pick a single term Π_a f_{a,G}(x^(i)_a) with x^(i) = [x_{e_i} = 1, x_{e_j} = 0, ∀ j ≠ i] from X_i, as a choice of Ẑ_i just after solving (8) initially, and output

Π_{a∈V} f_{a,G}(0, 0, ···) + Σ_{i=1}^{|E|} Π_{a∈V} f_{a,G}(x^(i)_a),   x^(i) = [x_{e_i} = 1, x_{e_j} = 0, ∀ j ≠ i],   (10)

as a better lower bound for log Z than Π_{a∈V} f_{a,G}(0, 0, ···). This choice is based on the intuition that configurations partially different from (0, 0, ···) may be significant too, as they share most of the same factor values with the zero configuration maximized in (8). 
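The decomposition (9) behind this error correction can be sanity-checked by enumeration. An illustrative sketch on a toy triangle Forney-style GM (gauge-transformed factors can be substituted for `factors` without change):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
edges = [(0, 1), (1, 2), (0, 2)]
incident = {a: [i for i, e in enumerate(edges) if a in e] for a in range(3)}
factors = {a: rng.uniform(0.5, 2.0, size=(2, 2)) for a in range(3)}

def weight(fs, x):
    """Product of factor values for one edge configuration x."""
    return np.prod([fs[a][tuple(x[i] for i in idx)]
                    for a, idx in incident.items()])

def Z(fs):
    return sum(weight(fs, x)
               for x in itertools.product([0, 1], repeat=len(edges)))

def Z_i(fs, i):
    """Partition function restricted to X_i: x_{e_j} = 0 for j < i,
    x_{e_i} = 1, and x_{e_k} free for k > i."""
    return sum(weight(fs, (0,) * i + (1,) + tail)
               for tail in itertools.product([0, 1],
                                             repeat=len(edges) - i - 1))

zero_term = weight(factors, (0,) * len(edges))
correction = sum(Z_i(factors, i) for i in range(len(edges)))
# Identity (9): the X_i partition all configurations except the all-zero one.
assert abs(Z(factors) - zero_term - correction) < 1e-9
```

Replacing each `Z_i` by the single term of (10), or by a recursive lower bound, keeps the sum a valid lower bound on Z, since every discarded summand is non-negative under the constraints of (8).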
In fact, one can even choose more configurations (partially different from (0, 0, ···)) by paying more complexity, which is always better as it brings the approximation closer to the true partition function. In our experiments, we consider the additional configurations with at most two edges flipped, i.e., output

Π_{a∈V} f_{a,G}(0, 0, ···) + Σ_{i=1}^{|E|} Σ_{i′=i}^{|E|} Π_{a∈V} f_{a,G}(x^(i,i′)_a),   x^(i,i′) = [x_{e_i} = 1, x_{e_i′} = 1, x_{e_j} = 0, ∀ j ≠ i, i′],   (11)

as a better lower bound of log Z than (10).

4 Experimental results

Figure 1: Averaged log-partition approximation error vs. interaction strength β in the case of generic (non-log-supermodular) GMs on complete graphs of size 4, 5 and 6 (left, middle, right), where the average is taken over 20 random models.

Figure 2: Averaged log-partition approximation error vs. interaction strength β in the case of log-supermodular GMs on complete graphs of size 4, 5 and 6 (left, middle, right), where the average is taken over 20 random models.

We report results of our experiments with G-MF and G-BP introduced in Section 3. We also experiment here with improved G-BPs correcting errors by accounting for single (10) and multiple (11) terms, as well as correcting G-BP by applying it (again) sequentially to each residual partition function Z_i. The error decreases, while the evaluation complexity increases, as we move from G-BP-single to G-BP-multiple and then to G-BP-sequential. To solve the proposed gauge optimizations, e.g., Step C of Algorithm 1, we use the generic optimization solver IPOPT [33]. 
Even though the gauge optimizations can be formulated as unconstrained optimizations, IPOPT runs faster on the original constrained versions in our experiments.² However, the unconstrained formulations have strong future potential for developing fast gradient-descent algorithms. We generate random GMs with factors dependent on the 'interaction strength' parameters {β_a}_{a∈V} (akin to an inverse temperature) according to:

f_a(x_a) = exp(−β_a |h_0(x_a) − h_1(x_a)|),

where h_0 and h_1 count the numbers of 0 and 1 contributions in x_a, respectively. Intuitively, we expect that as |β_a| increases, it becomes more difficult to approximate the partition function. See the supplementary material for additional information on how we generate the random models.
In the first set of experiments, we consider relatively small, complete graphs with two types of factors: random generic (non-log-supermodular) factors and log-supermodular (positive/ferromagnetic) factors. Recall that the bare BP also provides a lower bound in the log-supermodular case [29], thus making the comparison between each proposed algorithm and BP informative. 
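Such random factors follow directly from the stated formula; a minimal sketch (the function name is illustrative, and h_0, h_1 are the 0/1 counts as in the text):

```python
import itertools
import numpy as np

def make_factor(degree, beta):
    """f_a(x_a) = exp(-beta * |h0(x_a) - h1(x_a)|), where h0/h1 count the
    numbers of 0s/1s among the degree edge variables adjacent to factor a."""
    f = np.empty((2,) * degree)
    for x in itertools.product([0, 1], repeat=degree):
        h1 = sum(x)
        h0 = degree - h1
        f[x] = np.exp(-beta * abs(h0 - h1))
    return f

f = make_factor(3, beta=0.5)
# At odd degree a balanced assignment (h0 = h1) is impossible, so every
# entry is strictly below 1; larger beta makes the factor more peaked.
```

For positive β these factors favor balanced local configurations, and increasing |β| sharpens the factors, matching the intuition that stronger interactions make the partition function harder to approximate.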
We use the log-partition approximation error, defined as |log Z − log Z_LB|/|log Z|, where Z_LB is the algorithm output (a lower bound of Z), to quantify each algorithm's performance. In this first set of experiments we deal with relatively small graphs, so the explicit computation of Z (i.e., of the approximation error) is feasible. The results for the small graphs are illustrated in Figure 1 and Figure 2 for the non-log-supermodular and log-supermodular cases, respectively. Figure 1 shows that, as expected, G-MF always outperforms MF. Moreover, we observe that G-MF typically provides the tightest lower bound, unless it is outperformed by G-BP-multiple or G-BP-sequential. We remark that BP is not shown in Figure 1 because, in this non-log-supermodular case, it does not provide a lower bound in general. According to Figure 2, showing the log-supermodular case, both G-MF and G-BP outperform MF, while G-BP-sequential outperforms all other algorithms. Notice that G-BP performs rather similarly to BP in the log-supermodular case, thus suggesting that the constraints distinguishing (8) from (5) are only mildly violated.
In the second set of experiments, we consider sparser, larger graphs of two types: 3-regular and grid graphs with up to 200 factors/300 variables. 

² The running time of the implemented algorithms is reported in the supplementary material.

Figure 3: Averaged ratio of the log partition function compared to MF vs. graph size (i.e., number of factors) in the case of generic (non-log-supermodular) GMs on 3-regular graphs (left) and grid graphs (right), where the average is taken over 20 random models.

Figure 4: Averaged ratio of the log partition function compared to MF vs. interaction strength β in the case of log-supermodular GMs on 3-regular graphs of size 200 (left) and grid graphs of size 100 (right), where the average is taken over 20 random models.
As in the \ufb01rst set of experiments, the\nsame non-log-supermodular/log-supermodular factors are considered. Since computing the exact\napproximation error is not feasible for the large graphs, we instead measure here the ratio of estimation\nby the proposed algorithm to that of MF, i.e., log(ZLB/ZMF) where ZMF is the output of MF. Note\nthat a larger value of the ratio indicates better performance. The results are reported in Figure 3 and\nFigure 4 for the non-log-supermodular and log-supermodular cases, respectively. In Figure 3, we\nobserve that G-MF and G-BP-sequential outperform MF signi\ufb01cantly, e.g., up-to e14 times better in\n3-regular graphs of size 200. We also observe that even the bare G-BP outperforms MF. In Figure 4,\nalgorithms associated with G-BP outperform G-MF and MF (up to e25 times). This is because the\nchoice of q(x) for G-BP is favored by log-supermodular models, i.e., most of con\ufb01gurations are\nconcentrated around (0, 0,\u00b7\u00b7\u00b7 ) similar to the choice (7) of q(x) for G-BP. One observes here (again)\nthat performance of G-BP in this log-supermodular case is almost on par with BP. This implies that\nG-BP generalizes BP well: the former provides a lower bound of Z for any GMs, while the latter\ndoes only for log-supermodular GMs.\n5 Conclusion\n\nWe explore the freedom in gauge transformations of GM and develop novel variational inference\nmethods which result in signi\ufb01cant improvement of the partition function estimation. We note that\nthe GT methodology, applied here to improve MF and BP, can also be used to improve and extend\nutility of other variational methods.\n\n8\n\n\fAcknowledgments\n\nThis work was supported in part by the National Research Council of Science & Technology\n(NST) grant by the Korea government (MSIP) (No. 
CRC-15-05-ETRI), Institute for Information\n& communications Technology Promotion(IITP) grant funded by the Korea government(MSIT)\n(No.2017-0-01778, Development of Explainable Human-level Deep Machine Learning Inference\nFramework) and ICT R&D program of MSIP/IITP [2016-0-00563, Research on Adaptive Machine\nLearning Technology Development for Intelligent Autonomous Digital Companion].\n\nReferences\n[1] Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory,\n\n8(1):21\u201328, 1962.\n\n[2] Frank R. Kschischang and Brendan J. Frey. Iterative decoding of compound codes by proba-\nbility propagation in graphical models. IEEE Journal on Selected Areas in Communications,\n16(2):219\u2013230, 1998.\n\n[3] Hans .A. Bethe. Statistical theory of superlattices. Proceedings of Royal Society of London A,\n\n150:552, 1935.\n\n[4] Rudolf E. Peierls. Ising\u2019s model of ferromagnetism. Proceedings of Cambridge Philosophical\n\nSociety, 32:477\u2013481, 1936.\n\n[5] Marc M\u00e9zard, Georgio Parisi, and M. A. Virasoro. Spin Glass Theory and Beyond. Singapore:\n\nWorld Scienti\ufb01c, 1987.\n\n[6] Giorgio Parisi. Statistical \ufb01eld theory, 1988.\n\n[7] Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford Univer-\n\nsity Press, Inc., New York, NY, USA, 2009.\n\n[8] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference.\n\nMorgan Kaufmann, 2014.\n\n[9] Michael Irwin Jordan. Learning in graphical models, volume 89. Springer Science & Business\n\nMedia, 1998.\n\n[10] William T Freeman, Egon C Pasztor, and Owen T Carmichael. Learning low-level vision.\n\nInternational journal of computer vision, 40(1):25\u201347, 2000.\n\n[11] Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the ising\n\nmodel. SIAM Journal on computing, 22(5):1087\u20131116, 1993.\n\n[12] Ethem Alpaydin. Introduction to machine learning. MIT press, 2014.\n\n[13] Judea Pearl. 
Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles, 1982.

[14] Qiang Liu and Alexander T. Ihler. Negative tree reweighted belief propagation. arXiv preprint arXiv:1203.3494, 2012.

[15] Stefano Ermon, Ashish Sabharwal, Bart Selman, and Carla P. Gomes. Density propagation and improved bounds on the partition function. In Advances in Neural Information Processing Systems, pages 2762–2770, 2012.

[16] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.

[17] Qiang Liu and Alexander T. Ihler. Bounding the partition function using Hölder's inequality. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 849–856, 2011.

[18] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-based reparameterization framework for approximate estimation on graphs with cycles. IEEE Transactions on Information Theory, 49(5):1120–1146, 2003.

[19] Michael Chertkov and Vladimir Chernyak. Loop calculus in statistical physics and information science. Physical Review E, 73:065102(R), 2006.

[20] Michael Chertkov and Vladimir Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009, 2006.

[21] Leslie G. Valiant. Holographic algorithms. SIAM Journal on Computing, 37(5):1565–1594, 2008.

[22] Ali Al-Bashabsheh and Yongyi Mao. Normal factor graphs and holographic transformations. IEEE Transactions on Information Theory, 57(2):752–763, 2011.

[23] Martin J. Wainwright and Michael E. Jordan. Graphical models, exponential families, and variational inference.
Foundations and Trends in Machine Learning, 1(1):1–305, 2008.

[24] G. David Forney Jr. and Pascal O. Vontobel. Partition functions of normal factor graphs. arXiv preprint arXiv:1102.0316, 2011.

[25] Michael Chertkov. Lecture notes on "Statistical inference in structured graphical models: gauge transformations, belief propagation & beyond", 2016.

[26] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282–2312, 2005.

[27] Vladimir Y. Chernyak and Michael Chertkov. Loop calculus and belief propagation for q-ary alphabet: Loop tower. In 2007 IEEE International Symposium on Information Theory (ISIT), pages 316–320. IEEE, 2007.

[28] Ryuhei Mori. Holographic transformation, belief propagation and loop calculus for generalized probabilistic theories. In 2015 IEEE International Symposium on Information Theory (ISIT), pages 1099–1103. IEEE, 2015.

[29] Nicholas Ruozzi. The Bethe partition function of log-supermodular graphical models. In Advances in Neural Information Processing Systems, pages 117–125, 2012.

[30] Adrian Weller, Kui Tang, Tony Jebara, and David Sontag. Understanding the Bethe approximation: when and how can it go wrong? In UAI, pages 868–877, 2014.

[31] Michael Chertkov, Vladimir Y. Chernyak, and Razvan Teodorescu. Belief propagation and loop series on planar graphs. Journal of Statistical Mechanics: Theory and Experiment, 2008(05):P05003, 2008.

[32] Sung-Soo Ahn, Michael Chertkov, and Jinwoo Shin. Synthesis of MCMC and belief propagation. In Advances in Neural Information Processing Systems, pages 1453–1461, 2016.

[33] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming.
Mathematical Programming, 106(1):25–57, 2006.

[34] G. David Forney. Codes on graphs: Normal realizations. IEEE Transactions on Information Theory, 47(2):520–548, 2001.

[35] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, 2003.

[36] Jonathan S. Yedidia, William T. Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems, 13, 2001.

[37] Michael Chertkov and Vladimir Y. Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009, 2006.