{"title": "Clusters and Coarse Partitions in LP Relaxations", "book": "Advances in Neural Information Processing Systems", "page_first": 1537, "page_last": 1544, "abstract": "We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency of the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message-passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly.", "full_text": "Clusters and Coarse Partitions in LP Relaxations\n\nDavid Sontag\nCSAIL, MIT\n\ndsontag@csail.mit.edu\n\nSchool of Computer Science and Engineering\n\nAmir Globerson\n\nThe Hebrew University\n\ngamir@cs.huji.ac.il\n\nTommi Jaakkola\n\nCSAIL, MIT\n\ntommi@csail.mit.edu\n\nAbstract\n\nWe propose a new class of consistency constraints for Linear Programming (LP)\nrelaxations for \ufb01nding the most probable (MAP) con\ufb01guration in graphical mod-\nels. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of\na cluster of variables, with computational cost increasing exponentially with the\nsize of the clusters. By partitioning the state space of a cluster and enforcing con-\nsistency only across partitions, we obtain a class of constraints which, although\nless tight, are computationally feasible for large clusters. 
We show how to solve\nthe cluster selection and partitioning problem monotonically in the dual LP, us-\ning the current beliefs to guide these choices. We obtain a dual message passing\nalgorithm and apply it to protein design problems where the variables have large\nstate spaces and the usual cluster-based relaxations are very costly. The result-\ning method solves many of these problems exactly, and signi\ufb01cantly faster than a\nmethod that does not use partitioning.\n\n1 Introduction\n\nA common inference task in graphical models is \ufb01nding the most likely setting of the values of the\nvariables (the MAP assignment). Indeed, many important practical problems can be formulated as\nMAP problems (e.g., protein-design problems [9]). The complexity of the MAP problem depends\non the structure of the dependencies between the variables (i.e. the graph structure) and is known to\nbe NP-hard in general. Speci\ufb01cally, for problems such as protein-design, the underlying interaction\ngraphs are dense, rendering standard exact inference algorithms useless.\nA great deal of effort has been spent recently on developing approximate algorithms for the MAP\nproblem. One promising approach is based on linear programming relaxations, solved via message\npassing algorithms akin to belief propagation [2, 3]. In this case, the MAP problem is \ufb01rst cast as an\ninteger linear program, and then is relaxed to a linear program by removing the integer constraints\nand adding new constraints on the continuous variables. Whenever the relaxed solution is integral, it\nis guaranteed to be the optimal solution. However, this happens only if the relaxation is suf\ufb01ciently\n\u201ctight\u201d (with respect to a particular objective function).\nRelaxations can be made increasingly tight by introducing LP variables that correspond to clusters\nof variables in the original model. 
In fact, in recent work [6] we have shown that by adding a set of clusters over three variables, complex problems such as protein design and stereo vision may be solved exactly. The problem with adding clusters over variables is that the computational cost scales exponentially with the cluster size. Consider, for example, a problem where each variable has 100 states (cf. protein design). Using clusters of s variables means adding 100^s LP variables, which is computationally demanding even for clusters of size three.

Our goal in the current paper is to design methods that introduce constraints over clusters at a reduced computational cost. We achieve this by representing clusters at a coarser level of granularity. The key observation is that it may not be necessary to represent all the possible joint states of a cluster of variables. Instead, we partition the cluster's assignments at a coarser level, and enforce consistency only across such partitions. This removes the number of states per variable from consideration, and instead focuses on resolving currently ambiguous settings of the variables. Following the approach of [2], we formulate a dual LP for the partition-based LP relaxations and derive a message passing algorithm for optimizing the dual LP based on block coordinate descent. Unlike standard message passing algorithms, the algorithm we derive involves passing messages between coarse and fine representations of the same set of variables.

MAP and its LP relaxation. We consider discrete pairwise Markov random fields on a graph G = (V, E), defined as the following exponential family distribution1

p(x; θ) = (1/Z) exp( Σ_{ij∈E} θ_ij(x_i, x_j) )    (1)

Here θ is a parameter vector specifying how pairs of variables in E interact. 
The MAP problem we consider here is to find the most likely assignment of the variables under p(x; θ) (we assume that the evidence has already been incorporated into the model). This is equivalent to finding the assignment x^M that maximizes the function f(x; θ) = Σ_{ij∈E} θ_ij(x_i, x_j).

The resulting discrete optimization problem may also be cast as a linear program. Define µ to be a vector of marginal probabilities associated with the interacting pairs of variables (edges) {µ_ij(x_i, x_j)}_{ij∈E} as well as {µ_i(x_i)}_{i∈V} for the nodes. The set of µ's that could arise from some joint distribution on G is known as the marginal polytope M(G) [7]. The MAP problem is then equivalent to the following linear program:

max_x f(x; θ) = max_{µ∈M(G)} µ · θ ,    (2)

where µ · θ = Σ_{ij∈E} Σ_{x_i,x_j} θ_ij(x_i, x_j) µ_ij(x_i, x_j). The extreme points of the marginal polytope are integral and correspond one-to-one with assignments x. Thus, there always exists a maximizing µ that is integral and corresponds to x^M. Although the number of variables in this LP is only O(|E| + |V|), the difficulty comes from the exponential number of linear inequalities typically required to describe the marginal polytope M(G).

LP relaxations replace the difficult global constraint that the marginals in µ must arise from some common joint distribution by ensuring only that the marginals are locally consistent with one another. The most common such relaxation, pairwise consistency, enforces that the edge marginals are consistent with the node marginals, {µ | Σ_{x_j} µ_ij(x_i, x_j) = µ_i(x_i)}. The integral extreme points of this local marginal polytope also correspond to assignments. If a solution is obtained at one such extreme point, it is provably the MAP assignment. 
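The pairwise LP relaxation above can be written out explicitly for a tiny model. The following sketch (assuming NumPy and SciPy are available; the two-state, single-edge model and its potentials are made up for illustration) builds the local-consistency and normalization constraints and solves the resulting LP with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Toy pairwise MRF: one edge (i, j), two states per variable.
theta = np.array([[1.0, 0.2],
                  [0.3, 2.0]])  # theta_ij(x_i, x_j)

# LP variables: mu_ij (4 entries, row-major), then mu_i (2), then mu_j (2).
n = 4 + 2 + 2
c = np.concatenate([-theta.ravel(), np.zeros(4)])  # linprog minimizes

A_eq, b_eq = [], []
# sum_{x_j} mu_ij(x_i, x_j) = mu_i(x_i)
for xi in range(2):
    row = np.zeros(n)
    row[2 * xi:2 * xi + 2] = 1.0
    row[4 + xi] = -1.0
    A_eq.append(row); b_eq.append(0.0)
# sum_{x_i} mu_ij(x_i, x_j) = mu_j(x_j)
for xj in range(2):
    row = np.zeros(n)
    row[[xj, 2 + xj]] = 1.0
    row[6 + xj] = -1.0
    A_eq.append(row); b_eq.append(0.0)
# node marginals normalize to 1
for lo in (4, 6):
    row = np.zeros(n)
    row[lo:lo + 2] = 1.0
    A_eq.append(row); b_eq.append(1.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=[(0, None)] * n)
# With a single edge the relaxation is exact: optimum is max theta = 2.0.
print(-res.fun)
```

For a model with a single edge the local polytope coincides with the marginal polytope, so the LP optimum here is integral; fractional vertices only appear once the graph contains cycles.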
However, the local marginal polytope also contains fractional extreme points, and, as a relaxation, will in general not be tight.

We are therefore interested in tightening the relaxation. There are many known ways to do so, including cycle inequalities [5] and semi-definite constraints [8]. Perhaps the most straightforward approach, however, corresponds to lifting the relaxation by adding marginals over clusters of nodes to the model (cf. generalized belief propagation [10]) and constraining them to be consistent with the edge marginals. Each cluster comes with a computational cost that grows as k^s, where s is the number of variables in the cluster and k is the number of states for each variable. We seek to offset this exponential cost by introducing coarsened clusters, as we show next.

2 Coarsened clusters and consistency constraints

We begin with an illustrative example. Suppose we have a graphical model that is a triangle with each variable taking k states. We can recover the exact marginal polytope in this case by forcing the pairwise marginals µ_ij(x_i, x_j) to be consistent with some distribution µ_123(x_1, x_2, x_3). However, when k is large, introducing the corresponding k^3 variables to our LP may be too costly and perhaps unnecessary, if a weaker consistency constraint would already lead to an integral extreme point. To this end, we will use a coarse-grained version of µ_123 where the joint states are partitioned into larger collections, and consistency is enforced over the partitions.

1We do not use potentials on single nodes θ_i(x_i) since these can be folded into θ_ij(x_i, x_j). 
Our algorithm can also be derived with explicit θ_i(x_i), and we omit the details for brevity.

Figure 1: A graphical illustration of the consistency constraint between the original (fine granularity) edge (x_i, x_k) and the coarsened triplet (z_i, z_j, z_k). The two should agree on the marginal of z_i, z_k. For example, the shaded area in all three figures represents the same probability mass.

The simplest partitioning scheme builds on coarse-grained versions of each variable X_i. Let Z_i denote a disjoint collection of sets covering the possible values of X_i. For example, if variable X_i has five states, Z_i might be defined as {{1, 2}, {3, 5}, {4}}. Given such a partitioning scheme, we can introduce a distribution µ_123(z_1, z_2, z_3) over the coarsened variables and constrain it to agree with µ_ik(x_i, x_k), in the sense that they both yield the same marginals for z_i, z_k. This is illustrated graphically in Fig. 1. In the case when Z_i individuates each state, i.e., every set is a singleton, we recover the usual cluster consistency constraint.

We use the above idea to construct tighter outer bounds on the marginal polytope and incorporate them into the MAP-LP relaxation. We assume that we are given a set of clusters C. For each cluster c ∈ C and variable i ∈ c we also have a partition Z_i^c as in the above example2 (the choice of clusters and partitions will be discussed later). 
We introduce marginals µ_c(z_c) over the coarsened clusters and constrain them to agree with the edge variables µ_ij(x_i, x_j) for all edges ij ∈ c:

Σ_{x_i∈z_i^c, x_j∈z_j^c} µ_ij(x_i, x_j) = Σ_{z_c \ {z_i^c, z_j^c}} µ_c(z_c).    (3)

The key idea is that the coarsened cluster represents higher-order marginals, albeit at a lower resolution, whereas the edge variables represent lower-order marginals, but at a finer resolution. The constraint in Eq. 3 implies that these two representations should agree.

We can now state the LP that we set out to solve. Our LP optimizes over the following marginal variables: µ_ij(x_i, x_j), µ_i(x_i) for the edges and nodes of the original graph, and µ_c(z_c) for the coarse-grained clusters. We would like to constrain these variables to belong to the following outer bound on the marginal polytope:

M_C(G) = { µ ≥ 0 |  Σ_{x_j} µ_ij(x_i, x_j) = µ_i(x_i),
                     Σ_{x_i∈z_i^c, x_j∈z_j^c} µ_ij(x_i, x_j) = Σ_{z_c \ {z_i^c, z_j^c}} µ_c(z_c),
                     Σ_{x_i,x_j} µ_ij(x_i, x_j) = 1 }    (4)

Note that Σ_{z_c} µ_c(z_c) = 1 is implied by the above constraints. The corresponding MAP-LP relaxation is then:

max_{µ∈M_C(G)} µ · θ    (5)

This LP could in principle be solved using generic LP optimization tools. However, a more efficient and scalable approach is to solve it via message passing in the dual LP, which we show how to do in the next section. 
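The agreement required by Eq. 3 can be checked numerically: starting from any joint distribution over a triplet, coarsening the fine edge marginal over the partition cells must give the same result as marginalizing the coarsened cluster distribution. A small sketch (the k = 5 state space and the particular partition are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
mu123 = rng.random((k, k, k))
mu123 /= mu123.sum()               # some joint distribution over (x1, x2, x3)

# Partition each variable's 5 states into 3 coarse states: {0,1}, {2,4}, {3}.
part = np.array([0, 0, 1, 2, 1])   # part[x] = coarse state z containing x

# Coarsened cluster marginal mu_c(z1, z2, z3).
mu_c = np.zeros((3, 3, 3))
for x1 in range(k):
    for x2 in range(k):
        for x3 in range(k):
            mu_c[part[x1], part[x2], part[x3]] += mu123[x1, x2, x3]

# Left side of Eq. 3: coarsen the fine edge marginal mu_13 over partition cells.
mu13 = mu123.sum(axis=1)
lhs = np.zeros((3, 3))
for x1 in range(k):
    for x3 in range(k):
        lhs[part[x1], part[x3]] += mu13[x1, x3]

# Right side of Eq. 3: marginalize the coarse cluster distribution over z2.
rhs = mu_c.sum(axis=1)
print(np.allclose(lhs, rhs))       # True: both representations agree on (z1, z3)
```

In the relaxation itself, of course, µ_ij and µ_c are free LP variables rather than marginals of one common joint; Eq. 3 is exactly the constraint that ties them together.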
In addition, for this method to be successful, it is critical that we choose good coarsenings: a coarsening should have few partitions per variable, yet still sufficiently tighten the relaxation. Our approach for choosing the coarsenings is to iteratively solve the LP using an initial relaxation (beginning with the pairwise consistency constraints), then to introduce additional cluster constraints, letting the current solution guide how to coarsen the variables. As we showed in earlier work [6], solving with the dual LP gives us a simple method for "warm starting" the new LP (the tighter relaxation) using the previous solution, and also results in an algorithm for which every step monotonically decreases an upper bound on the MAP assignment. We will give further details of the coarsening scheme in Section 4.

2We use a superscript of c to highlight the fact that different clusters may use different partitionings for Z_i. Also, there can be multiple clusters on the same set of variables, each using a different partitioning.

3 Dual linear program and a message passing algorithm

In this section we give the dual of the partition-based LP from Eq. 5, and use it to obtain a message passing algorithm to efficiently optimize this relaxation. Our approach extends earlier work by Globerson and Jaakkola [2], who gave the generalized max-product linear programming (MPLP) algorithm to solve the usual (non-coarsened) cluster LP relaxation in the dual.

The dual formulation in [2] was derived by adding auxiliary variables to the primal. We followed a similar approach to obtain the LP dual of Eq. 5. The dual variables are as follows: β_{ij→i}(x_i, x_j), β_{ij→j}(x_i, x_j), β_{ij→ij}(x_i, x_j) for every edge ij ∈ E, and β_{c→ij}(z_c) for every coarsened cluster c and edge ij ∈ c. 
As in [2], we define the following functions of β:

λ_{ij→i}(x_i) = max_{x_j} β_{ij→i}(x_i, x_j),    λ_{ij→ij}(x_i, x_j) = β_{ij→ij}(x_i, x_j)    (6)

λ_{c→ij}(z_i^c, z_j^c) = max_{z_c \ {z_i^c, z_j^c}} β_{c→ij}(z_c)    (7)

As we show below, the variables λ correspond to the messages sent in the message passing algorithm that we use for optimizing the dual. Thus λ_{ij→i}(x_i) should be read as the message sent from edge ij to node i, and λ_{c→ij}(z_i^c, z_j^c) is the message from the coarsened cluster to one of its intersection edges. Finally, λ_{ij→ij}(x_i, x_j) is the message sent from an edge to itself. The dual of Eq. 5 is the following constrained minimization problem:

min_β   Σ_{i∈V} max_{x_i} Σ_{k∈N(i)} λ_{ik→i}(x_i) + Σ_{ij∈E} max_{x_i,x_j} [ λ_{ij→ij}(x_i, x_j) + Σ_{c:ij∈c} λ_{c→ij}(z_i^c[x_i], z_j^c[x_j]) ]

s.t.    β_{ij→i}(x_i, x_j) + β_{ij→j}(x_i, x_j) + β_{ij→ij}(x_i, x_j) = θ_ij(x_i, x_j)    ∀ ij ∈ E, x_i, x_j
        Σ_{ij∈c} β_{c→ij}(z_c) = 0    ∀ c, z_c    (8)

The notation z_i^c[x_i] refers to the mapping from x_i ∈ X_i to the coarse state z_i^c ∈ Z_i^c such that x_i ∈ z_i^c. By convex duality, the dual objective evaluated at a dual feasible point upper bounds the primal LP optimum, which in turn upper bounds the value of the MAP assignment. It is illustrative to compare this dual LP with [2], where the cluster dual variables were β_{c→ij}(x_c). Our dual corresponds to introducing the additional constraint that β_{c→ij}(x_c) = β_{c→ij}(x'_c) whenever z_c[x_c] = z_c[x'_c].

The advantage of the above dual is that it can be optimized via a simple message passing algorithm that corresponds to block coordinate descent. 
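As a sanity check on the upper-bound property of Eq. 8, any dual feasible point bounds the MAP value from above. The sketch below uses a made-up 3-cycle model with no coarsened clusters and the trivial feasible point β_{ij→ij} = θ_ij with all other β set to zero, for which the dual objective reduces to the sum of edgewise maxima:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
k = 4
edges = [(0, 1), (1, 2), (0, 2)]
theta = {e: rng.normal(size=(k, k)) for e in edges}

# Brute-force MAP value on the 3-cycle.
map_val = max(
    sum(theta[(i, j)][x[i], x[j]] for (i, j) in edges)
    for x in itertools.product(range(k), repeat=3)
)

# Dual objective at the trivial feasible point: lambda_{ij->ij} = theta_ij and
# all other messages zero, so the objective is the sum of edgewise maxima.
dual_bound = sum(theta[e].max() for e in edges)

assert dual_bound >= map_val   # dual feasible point => upper bound on MAP
```

The block coordinate descent updates derived next can only shrink this gap, since each step monotonically decreases the dual objective.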
The key idea is that it is possible to fix the values of the β variables corresponding to all clusters except one, and to find a closed-form solution for the non-fixed βs. It then turns out that one does not need to work with the β variables directly, but can keep only the λ message variables. Fig. 2 provides the form of the updates for all three message types. S(c) is the set of edges in cluster c (e.g., ij, jk, ik). Importantly, all messages outgoing from a cluster or edge must be sent simultaneously.

Here we derive the cluster-to-edge updates, which differ from [2]. Assume that all values of β are fixed except for β_{c→ij}(z_c) for all ij ∈ c in some cluster c. The term in the dual objective that depends on β_{c→ij}(z_c) can be written equivalently as

max_{x_i,x_j} [ λ_{ij→ij}(x_i, x_j) + Σ_{c'≠c:ij∈c'} λ_{c'→ij}(z_i^{c'}[x_i], z_j^{c'}[x_j]) + λ_{c→ij}(z_i^c[x_i], z_j^c[x_j]) ]
  = max_{z_i^c, z_j^c} [ b_ij(z_i^c, z_j^c) + λ_{c→ij}(z_i^c, z_j^c) ].    (11)

Due to the constraint Σ_{ij∈c} β_{c→ij}(z_c) = 0, all of the β_{c→ij} need to be updated simultaneously. It can be easily shown (using an equalization argument as in [2]) that the β_{c→ij}(z_c) that satisfy the constraint and minimize the objective are given by

β_{c→ij}(z_c) = −b_ij(z_i^c, z_j^c) + (1/|S(c)|) Σ_{st∈c} b_st(z_s^c, z_t^c).    (12)

The message update given in Fig. 2 follows from the definition of λ_{c→ij}. Note that none of the cluster messages involve the original cluster variables x_c, but rather only z_c. 
Thus, we have achieved the goal of both representing higher-order clusters and doing so at a reduced computational cost.

• Edge to Node: For every edge ij ∈ E and node i (or j) in the edge:

λ_{ij→i}(x_i) ← −(2/3) λ_i^{−j}(x_i) + (1/3) max_{x_j} [ Σ_{c:ij∈c} λ_{c→ij}(z_i^c[x_i], z_j^c[x_j]) + λ_{ij→ij}(x_i, x_j) + λ_j^{−i}(x_j) + θ_ij(x_i, x_j) ]

where λ_i^{−j}(x_i) = Σ_{k∈N(i)\j} λ_{ik→i}(x_i).

• Edge to Edge: For every edge ij ∈ E:

λ_{ij→ij}(x_i, x_j) ← −(2/3) Σ_{c:ij∈c} λ_{c→ij}(z_i^c[x_i], z_j^c[x_j]) + (1/3) [ λ_j^{−i}(x_j) + λ_i^{−j}(x_i) + θ_ij(x_i, x_j) ]

• Cluster to Edge: First define

b_ij(z_i^c, z_j^c) = max_{x_i∈z_i^c, x_j∈z_j^c} [ λ_{ij→ij}(x_i, x_j) + Σ_{c'≠c:ij∈c'} λ_{c'→ij}(z_i^{c'}[x_i], z_j^{c'}[x_j]) ]    (9)

The update is then:

λ_{c→ij}(z_i^c, z_j^c) ← −b_ij(z_i^c, z_j^c) + (1/|S(c)|) max_{z_c \ {z_i^c, z_j^c}} Σ_{st∈c} b_st(z_s^c, z_t^c)    (10)

Figure 2: The message passing updates for solving the dual LP given in Eq. 8.

The algorithm in Fig. 2 solves the dual for a given choice of coarsened clusters. As mentioned in Sec. 2, we would like to add such clusters gradually, as in [6]. Our overall algorithm is thus similar in structure to [6] and proceeds as follows (we denote the message passing algorithm from Fig. 2 by MPLP): 1. Run MPLP until convergence using the pairwise relaxation, 2. Find an integral solution x by locally maximizing the single node beliefs b_i(x_i) = Σ_{k∈N(i)} λ_{ki→i}(x_i), 3. If the dual objective given in Eq. 
8 is sufficiently close to the primal objective f(x; θ), terminate, 4. Add a new coarsened cluster c using the strategy given in Sec. 4, 5. Initialize messages going out of the new cluster c to zero, and keep all the previous message values (this will not change the bound value), 6. Run MPLP for N iterations, then return to 2.

4 Choosing coarse partitions

Until now we have not discussed how to choose the clusters to add and their partitionings. Our strategy for doing so closely follows that of our earlier work [6]. Given a set C of candidate clusters to add (e.g., the set of all triplets in the graph as in [6]), we would like to add a cluster that would result in the maximum decrease of the dual bound on the MAP. In principle such a cluster could be found by optimizing the dual for each candidate cluster, then choosing the best one. However, this is computationally costly, so as in [6] we instead use the bound decrease resulting from just once sending messages from the candidate cluster to its intersection edges.

If we were to add the full (un-coarsened) cluster, this bound decrease would be:

d(c) = Σ_{ij∈c} max_{x_i,x_j} b_ij(x_i, x_j) − max_{x_c} Σ_{ij∈c} b_ij(x_i, x_j),    (13)

where b_ij(x_i, x_j) = λ_{ij→ij}(x_i, x_j) + Σ_{c:ij∈c} λ_{c→ij}(z_i^c[x_i], z_j^c[x_j]).

Our strategy now is as follows: we add the cluster c that maximizes d(c), and then choose a partitioning Z_i^c for all i ∈ c that is guaranteed to achieve a decrease that is close to d(c). This can clearly be achieved by using the trivial partition Z_i^c = X_i (which achieves d(c)). However, in many cases it is also possible to achieve it while using much coarser partitionings.

The set of all possible partitionings Z_i^c is too large to optimize over. Instead, we consider just |X_i| candidate partitions that are generated based on the beliefs b_i(x_i). 
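The score d(c) in Eq. 13 compares maximizing each edge belief independently against maximizing them jointly over the cluster; it is always non-negative, and is zero exactly when the edgewise maximizers already agree on a single cluster assignment. A brute-force sketch over a made-up triplet cluster (the beliefs here are random stand-ins for the b_ij computed from the messages):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
k = 6
cluster_edges = [(0, 1), (1, 2), (0, 2)]
b = {e: rng.normal(size=(k, k)) for e in cluster_edges}   # edge beliefs b_ij

# First term of Eq. 13: each edge maximized independently.
independent = sum(b[e].max() for e in cluster_edges)

# Second term: joint maximization over the full cluster assignment x_c.
joint = max(
    sum(b[(i, j)][x[i], x[j]] for (i, j) in cluster_edges)
    for x in itertools.product(range(k), repeat=3)
)

d_c = independent - joint
assert d_c >= 0   # independent maxima always upper bound the joint maximum
print(d_c)
```

In the actual algorithm the joint maximization need only be carried out once per candidate cluster, so scoring candidates this way is far cheaper than re-solving the dual for each one.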
Intuitively, the states with lower belief values b_i(x_i) are less likely to influence the MAP, and can thus be bundled together. We will therefore consider partitions where the k states with lowest belief values are put into the same "catch-all" coarse state s_i^c, and all other states of x_i get their own coarse state. Formally, a partition Z_i^c is characterized by a value κ_i such that s_i^c is the set of all x_i with b_i(x_i) < κ_i. The question next is how big we can make the catch-all state without sacrificing the bound decrease.

We employ a greedy scheme whereby each i ∈ c (in arbitrary order) is partitioned separately, while the other partitions are kept fixed. The process starts with Z_i^c = X_i for all i ∈ c. We would like to choose s_i^c such that it is sufficiently separated from the state that achieves d(c). Formally, given a margin parameter γ we choose κ_i to be as large as possible such that the following constraint still holds3:

max_{z_c \ {z_i^c}, z_i^c = s_i^c} Σ_{st∈c} b_st(z_s^c, z_t^c) ≤ max_{x_c} Σ_{st∈c} b_st(x_s, x_t) − γ,

where the first maximization is over the coarse variables z_c \ {z_i^c}, with z_i^c fixed to the catch-all state s_i^c (note that the partitioning for Z_i^c is a function of κ_i). We can find the optimal κ_i in time O(|X_i||c|) by starting with κ_i = −∞ and increasing it until the constraint is violated. Since each subsequent value of s_i^c differs by one additional state x_i, we can re-use the maximizations over z_c \ {z_i^c} for the previous value of s_i^c in evaluating the constraint for the current s_i^c.

It can be shown by induction that this results in a coarsening that has a guaranteed bound decrease of at least d(c) + min(0, γ). Setting γ < 0 would give a partitioning with fewer coarse states at the cost of a smaller guaranteed bound decrease. 
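The greedy construction of the catch-all state can be sketched as follows. This toy version uses made-up edge beliefs on a triplet, leaves the other two variables un-coarsened, and (as an assumption of the sketch) uses max-marginals of the edge beliefs as the single-node beliefs that order the states:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
k, gamma = 6, 0.5
cluster_edges = [(0, 1), (1, 2), (0, 2)]
b = {e: rng.normal(size=(k, k)) for e in cluster_edges}

def joint_max(allowed_x0):
    """Max of the cluster belief with x_0 restricted to the given set."""
    return max(
        sum(b[(i, j)][x[i], x[j]] for (i, j) in cluster_edges)
        for x in itertools.product(range(k), repeat=3)
        if x[0] in allowed_x0
    )

full_max = joint_max(set(range(k)))

# Order states of variable 0 by a stand-in single-node belief (max-marginal).
bel0 = np.maximum(b[(0, 1)].max(axis=1), b[(0, 2)].max(axis=1))
order = np.argsort(bel0)          # lowest-belief states first

catch_all = set()
for s in order:
    trial = catch_all | {int(s)}
    # Margin constraint: fixing z_0 to the catch-all state must stay
    # at least gamma below the unconstrained maximum.
    if joint_max(trial) > full_max - gamma:
        break
    catch_all = trial

# With gamma > 0 the maximizing state is never bundled away, so the
# coarsened cluster still realizes the full bound decrease.
rest = set(range(k)) - catch_all
assert abs(joint_max(rest) - full_max) < 1e-12
```

The real algorithm avoids recomputing `joint_max` from scratch for each candidate κ_i by re-using the inner maximizations, which is what gives the O(|X_i||c|) running time quoted above; the brute-force version here is only meant to make the constraint explicit.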
On the other hand, setting γ > 0 results in a margin between the value of the dual objective (after sending the coarsened cluster message) and its value if we were to fix x_i in the max terms of Eq. 11 to a value in s_i^c. This makes it less likely that a state in s_i^c will become important again in subsequent message passing iterations. For the experiments in this paper we use γ = 3d(c), scaling γ with the value of the guaranteed bound decrease for the full cluster. Note that this greedy algorithm does not necessarily find the partitioning with the fewest coarse states that achieves the bound decrease.

5 Experiments

We report results on the protein design problem, originally described in [9]. The protein design problem is the inverse of the protein folding problem: given a desired backbone structure for the protein, the goal is to construct the sequence of amino-acids that results in a low-energy, and thus stable, configuration. We can use an approximate energy function to guide us towards finding a set of amino-acids and rotamer configurations with minimal energy. In [9] the design problem was posed as finding a MAP configuration in a pairwise MRF. The models used there (which are also available online) have between 2 and 158 states per variable, and contain up to 180 variables per model. The models are also quite dense, so exact calculation is not feasible.

Recently we showed [6] that all but one of the problems described in [9] can be solved exactly by using an LP relaxation with clusters on three variables. However, since each individual variable has roughly 100 possible states, processing triplets required 10^6 operations, making the optimization costly. 
In what follows we describe two sets of experiments that show that, by coarsening, we can both significantly reduce the computation time and achieve similar performance as if we had used un-coarsened triplets [6]. The experiments differ in the strategy for adding triplets, and illustrate two performance regimes. In both experimental setups we first run the standard edge-based message passing algorithm for 1000 iterations.

In the first experiment, we add all triplets that correspond to variables whose single node beliefs are tied (within 10^−5) at the maximum after running the edge-based algorithm. Since tied beliefs correspond to fractional LP solutions, it is natural to consider these in tighter relaxations. The triplets correspond to partitioned variables, as explained in Sec. 2. The partitioning is guided by the ties in the single node beliefs. Specifically, for each variable X_i we find states whose single node beliefs are tied at the maximum. Denote the number of states maximizing the belief by r. Then, we partition the states into r subsets, each containing a different maximizing state. The other (non-maximizing) states are split randomly among the r subsets. The triplets are then constructed over the coarsened variables Z_i^c and the message passing algorithm of Sec. 3 is applied to the resulting structure. After convergence of the algorithm, we recalculate the single node beliefs. These may result in a different partition scheme, and hence new variables Z_i^c. We add new triplets corresponding to the new variables and re-run. 

3The constraint may be infeasible for γ > 0, in which case we simply choose Z_i^c = X_i.

Figure 3: Comparison with the algorithm from [6] for the protein "1aac", after the first 1000 iterations. Left: Dual objective as a function of time. Right: The cost per one iteration over the entire graph.
We repeat until the dual-LP bound is sufficiently close to the value of the integral assignment obtained from the messages (note that these values would not coincide if the relaxation were not tight; in these experiments they do, so the final relaxation is tight).

We applied the above scheme to the ten smallest proteins in the dataset used in [6] (for the larger proteins we used a different strategy, described next). We were able to solve all ten exactly, as in [6]. The mean running time was six minutes. The gain in computational efficiency from using coarsened triplets was considerable: the state space of a coarsened triplet was on average 3000 times smaller than the original triplet state space, resulting in a factor-3000 speed gain over a scheme that uses the complete (un-coarsened) triplets.4 This large factor comes about because very few states are tied per variable, which directly benefits our method since the number of partitions equals the number of tied states. While running on full triplets was completely impractical, the coarsened message passing algorithm is very practical and achieves the exact MAP assignments.

Our second set of experiments follows the setup of [6] (see Sec. 3), alternating between adding 5 triplets to the relaxation and running MPLP for 20 more iterations. The only difference is that, after deciding to add a cluster, we use the algorithm from Sec. 4 to partition the variables. We tried various settings of γ, including γ = 0 and γ = .01, and found that γ = 3d(c) gave the best overall runtimes.

We applied this second scheme to the 15 largest proteins in the dataset.5 Of these, we found the exact MAP in 47% of the cases (according to the criterion used in [6]), and in the rest of the cases were within 10^−2 of the known optimal value. 
For the cases that were solved exactly, the mean running time was 1.5 hours, and on average the proteins were solved 8.1 times faster than with [6].6 To compare the running times on all 15 proteins, we checked how long it took for the difference between the dual and primal objectives to be less than .01 f(x^M; θ), where x^M is the MAP assignment. This revealed that our method is faster by an average factor of 4.3. The reason these factors are smaller than the factor of 3000 in the previous setup is that, for the larger proteins, the number of tied states is typically much higher than for the small ones.

Results for one of the proteins that we solved exactly are shown in Fig. 3. The cost per iteration increases very little after adding each triplet, showing that our algorithm significantly coarsened the clusters. The total number of iterations and number of triplets added were roughly the same. Two triplet clusters were added twice using different coarsenings, but otherwise each triplet only needed to be added once, demonstrating that our algorithm chose the right coarsenings.

4These timing comparisons do not apply to [6] since that algorithm did not use all the triplets.
5We do not run on the protein 1fpo, which was not solved in [6].
6We made sure that differences were not due to different processing powers or CPU loads.

6 Discussion

We presented an algorithm that enforces higher-order consistency constraints on LP relaxations, but at a reduced computational cost. Our technique further explores the trade-offs of representing complex constraints on the marginal polytope while keeping the optimization tractable. 
In applying the method, we chose to cluster variables' states based on a bound minimization criterion after solving with a looser constraint on the polytope.
A class of approaches related to ours are the "coarse-to-fine" applications of belief propagation [1, 4]. There, one solves low-resolution versions of an MRF and uses the resulting beliefs to initialize finer-resolution versions. Although these methods share the element of coarsening with our approach, their goal is very different from ours. Specifically, the low-resolution MRFs serve only to speed up convergence of the full-resolution MRF via better initialization; thus, one typically should not expect them to perform better than the finest-granularity MRF. In contrast, our approach is designed to strictly improve the performance of the original MRF by introducing additional (coarse) clusters. One of the key technical differences is that in our formulation the settings of the coarse and fine variables are refined iteratively, whereas in [1], once a coarse MRF has been solved, it is not revisited.
There are a number of interesting directions to explore. Using the same ideas as in this paper, one can introduce coarsened pairwise consistency constraints in addition to the full pairwise consistency constraints. Although this would not tighten the relaxation, passing messages more frequently in the coarsened space and only occasionally revisiting the full edges could give significant computational benefits when the nodes have large numbers of states. This would be much more similar to the coarse-to-fine approach described above.
With the coarsening strategy used here, the number of variables still grows exponentially with the cluster size, albeit at a lower rate.
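The growth rates can be seen in a toy calculation (illustrative only; the counts are invented): with k states per variable coarsened into p < k blocks, a cluster of c variables has p^c coarse configurations instead of k^c, so coarsening lowers the base of the exponential but not the exponent.

```python
# Toy calculation: coarsening shrinks the base of the exponential, not the exponent.
K, P = 100, 4  # states per variable vs. partition blocks per variable (invented)

for c in (3, 5, 8):
    full, coarse = K ** c, P ** c
    print(f"cluster size {c}: full {full:.1e}, coarse {coarse}")
```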
One way to avoid the exponential growth is to partition the states of a cluster into a fixed number of blocks (e.g., two), and then constrain such partitions to be consistent with each other. This process may be repeated recursively, generating a hierarchy of coarsened variables. The key advantage of this approach is that it represents progressively larger clusters with no exponential growth. An interesting open question is to understand how these hierarchies should be constructed.
Our techniques may also be helpful for finding the MAP assignment in MRFs with structured potentials, such as context-specific Bayesian networks. Finally, these constraints can also be used when calculating marginals.

References
[1] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient belief propagation for early vision. Int. J. Comput. Vision, 70(1):41–54, 2006.
[2] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.
[3] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568–1583, 2006.
[4] C. Raphael. Coarse-to-fine dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(12):1379–1390, 2001.
[5] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.
[6] D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, and T. Jaakkola. Tightening LP relaxations for MAP using message-passing. In UAI, 2008.
[7] M. Wainwright and M. I. Jordan. Graphical models, exponential families and variational inference. Technical report, UC Berkeley, Dept. of Statistics, 2003.
[8] M. Wainwright and M. I. Jordan.
Log-determinant relaxation for approximate inference in discrete Markov random fields. IEEE Transactions on Signal Processing, 54(6):2099–2109, June 2006.
[9] C. Yanover, T. Meltzer, and Y. Weiss. Linear programming relaxations and belief propagation – an empirical study. JMLR, 7:1887–1907, 2006.
[10] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory, 51(7):2282–2312, 2005.