{"title": "Constraints Based Convex Belief Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 2532, "page_last": 2540, "abstract": "Inference in Markov random fields subject to consistency structure is a fundamental problem that arises in many real-life applications. In order to enforce consistency, classical approaches utilize consistency potentials or encode constraints over feasible instances. Unfortunately this comes at the price of a serious computational bottleneck. In this paper we suggest to tackle consistency by incorporating constraints on beliefs. This permits derivation of a closed-form message-passing algorithm which we refer to as the Constraints Based Convex Belief Propagation (CBCBP). Experiments show that CBCBP outperforms the standard approach while being at least an order of magnitude faster.", "full_text": "Constraints Based Convex Belief Propagation\n\nYaniv Tenzer\n\nDepartment of Statistics\nThe Hebrew University\n\nAlexander Schwing\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of Illinois at Urbana-Champaign\n\nKevin Gimpel\n\nToyota Technological Institute at Chicago\n\nTamir Hazan\n\nFaculty of Industrial Engineering and Management\n\nTechnion - Israel Institute of Technology\n\nAbstract\n\nInference in Markov random \ufb01elds subject to consistency structure is a fundamental\nproblem that arises in many real-life applications. In order to enforce consistency,\nclassical approaches utilize consistency potentials or encode constraints over feasi-\nble instances. Unfortunately this comes at the price of a tremendous computational\nburden. In this paper we suggest to tackle consistency by incorporating constraints\non beliefs. 
This permits derivation of a closed-form message-passing algorithm which we refer to as the Constraints Based Convex Belief Propagation (CBCBP). Experiments show that CBCBP outperforms the conventional consistency potential based approach, while being at least an order of magnitude faster.

1 Introduction

Markov random fields (MRFs) [10] are widely used across different domains from computer vision and natural language processing to computational biology, because they are a general tool to describe distributions that involve multiple variables. The dependencies between variables are conveniently encoded via potentials that define the structure of a graph.

Besides encoding dependencies, in a variety of real-world applications we often want consistent solutions that are physically plausible, e.g., when jointly reasoning about multiple tasks or when enforcing geometric constraints in 3D indoor scene understanding applications [18]. Therefore, various methods [22, 13, 16, 12, 1] enforce consistency structure during inference by imposing constraints on the feasible instances. This was shown to be effective in practice. However, for each new constraint we may need to design a specifically tailored algorithm. Therefore, the most common approach to impose consistency is usage of PN-consistency potentials [9]. This permits reuse of existing message passing solvers, however, at the expense of an additional computational burden, as real-world applications may involve hundreds of additional factors.

Our goal in this work is to bypass this computational burden while being generally applicable. To do so, we consider the problem of inference when probabilistic equalities are imposed over the beliefs of the model rather than its feasible instances. As we show in Sec. 3, the adaptive nature of message passing algorithms conveniently allows for such probabilistic equality constraints within its framework.
Since our method eliminates potentially many multivariate factors, inference is much more scalable than using PN-consistency potentials [9].

In this paper, for notational simplicity, we illustrate the belief constraints based message passing rules using a framework known as convex belief propagation (CBP). We refer to the illustrated algorithm as constraints based CBP (CBCBP). However we note that the same derivation can be used to obtain, e.g., a constraints based tree-reweighted message passing algorithm.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We evaluate the benefits of our algorithm on semantic image segmentation and machine translation tasks. Our results indicate that CBCBP improves accuracy while being at least an order of magnitude faster than CBP.

2 Background

In this section we review the standard CBP algorithm. To this end we consider joint distributions defined over a set of discrete random variables X = (X_1, ..., X_n). The distribution p(x_1, ..., x_n) is assumed to factor into a product of non-negative potential functions, i.e., p(x_1, ..., x_n) ∝ exp(∑_r θ_r(x_r)), where r ⊂ {1, ..., n} is a subset of variable indices, which we use to restrict the domain via x_r = (x_i)_{i∈r}. The real-valued functions θ_r(x_r) assign a preference to each of the variables in the subset r. To visualize the factorization structure we use a region graph, i.e., a generalization of factor graphs. In this graph, each real-valued function θ_r(x_r) corresponds to a node. Nodes θ_r and θ_p can be connected if r ⊂ p. Hence the parent set P(r) of a region r contains index sets p ∈ P(r) if r ⊂ p.
Conversely we define the set of children of region r as C(r) = {c : r ∈ P(c)}.

An important inference task is computation of the marginal probabilities p(x_r) = ∑_{x\x_r} p(x). Whenever the region graph has no cycles, marginals are easily computed using belief propagation. Unfortunately, this algorithm may not converge in the presence of cycles. To fix convergence a variety of approximations have been suggested, one of which is known as convex belief propagation (CBP). CBP performs block-coordinate descent over the dual function of the following program:

max_b ∑_{r,x_r} b_r(x_r) θ_r(x_r) + ∑_r H(b_r)  s.t.  ∀r: b_r(x_r) ≥ 0, ∑_{x_r} b_r(x_r) = 1;  ∀r, p ∈ P(r), x_r: ∑_{x_p\x_r} b_p(x_p) = b_r(x_r).   (1)

This program is defined over marginal distributions b_r(x_r) and incorporates their entropy H(b_r) in addition to the potential function θ_r.

In many real world applications we require the solution to be consistent, i.e., hard constraints between some of the involved variables exist. For example, consider the case where X_1, X_2 are two binary variables such that for every feasible joint assignment, x_1 = x_2. To encourage consistency while reusing general purpose solvers, a PN-consistency potential [9] is often incorporated into the model:

θ_{1,2}(x_1, x_2) = 0 if x_1 = x_2, and −c otherwise.   (2)

Hereby c is a positive constant that is tuned to penalize for the violation of consistency.
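For illustration, the PN-consistency potential of Eq. (2) can be tabulated as a small matrix over joint label assignments. The following minimal Python sketch is our own; the function name and the dense-table representation are assumptions for exposition, not part of the paper:

```python
def pn_consistency_potential(num_labels, c):
    """Tabulate the PN-consistency potential of Eq. (2):
    0 when x1 == x2, and -c otherwise."""
    return [[0.0 if x1 == x2 else -c
             for x2 in range(num_labels)]
            for x1 in range(num_labels)]

# agreeing assignments incur no penalty, disagreements pay -c
theta = pn_consistency_potential(3, 2.0)
```

As c grows, the off-diagonal entries become increasingly prohibitive, which is exactly the soft-consistency behavior the paper contrasts with hard belief constraints.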
As c increases, the beliefs are increasingly forced to satisfy the following constraint:

b_1(X_1 = x_1) = b_2(X_2 = x_2).   (3)

However, usage of PN-potentials raises concerns: (i) increasing the number of pairwise constraints decreases computational efficiency, (ii) enforcing consistency in a soft manner requires tuning of an additional parameter c, (iii) large values of c slow down convergence, and (iv) large values of c result in corresponding beliefs being assigned zero probability mass, which is not desirable.

To alleviate these issues we suggest to enforce the equality constraints given in Eq. (3) directly during optimization of the program given in Eq. (1). We refer to the additionally introduced constraints as consistency constraints.

At this point two notes are in place. First we emphasize that utilizing consistency constraints instead of PN-consistency potentials has a computational advantage, since it omits all pairwise beliefs that correspond to consistency potentials. Therefore it results in an optimization problem with fewer functions, which is expected to be more efficiently solvable.

Second we highlight that the two approaches are not equivalent. Intuitively, as c increases, we expect consistency constraints to yield better results than usage of PN-potentials. Indeed, as c increases, the PN-consistency potential enforces the joint distribution to be diagonal, i.e., b(X_1 = i, X_2 = j) = 0 ∀i ≠ j. However, the consistency constraint as specified in Eq. (3) only requires the univariate marginals to agree. The latter is a considerably weaker requirement, as a diagonal pairwise distribution implies agreement of the univariate marginals, but the opposite direction does not hold.
Consequently, using consistency constraints results in a larger search space, which is desirable.

Algorithm 1 Constraints Based Convex Belief Propagation (CBCBP)

Repeat until convergence:

Update λ messages - for each r update for all p ∈ P(r), x_r:

μ_{p→r}(x_r) = ln ∑_{x_p\x_r} exp( θ_p(x_p) − ∑_{p'∈P(p)} λ_{p→p'}(x_p) + ∑_{r'∈C(p)\r} λ_{r'→p}(x_{r'}) − ∑_{k∈K_p} ν_{p→k}(x_p) )

λ_{r→p}(x_r) ∝ (1 / (1 + |P(r)|)) ( θ_r(x_r) + ∑_{c∈C(r)} λ_{c→r}(x_c) + ∑_{p'∈P(r)} μ_{p'→r}(x_r) − ∑_{k∈K_r} ν_{r→k}(x_r) ) − μ_{p→r}(x_r)

Update ν messages - for each k ∈ K update for all r ∈ N(k), using α_{r,k} as defined in Eq. (6):

ν_{r→k}(x^k_r) = log α_{r,k} − (1 / |N(k)|) ∑_{r'∈N(k)} log α_{r',k}

Figure 1: The CBCBP algorithm. Shown are the update rules for the λ and ν messages.

Next we derive a general message-passing algorithm that aims at solving the optimization problem given in Eq. (1) subject to consistency constraints of the form given in Eq. (3).

3 Constraints Based Convex Belief Propagation (CBCBP)

To enforce consistency of beliefs we want to incorporate constraints of the form b_{r_1}(x_{r_1}) = ... = b_{r_m}(x_{r_m}). Each constraint involves a set of regions r_i and some of their assignments x_{r_i}. If this constraint involves more than two regions, i.e., if m > 2, it is easier to formulate the constraint as a series of constraints b_{r_i}(x_{r_i}) = v, i ∈ {1, ..., m}, for some constant v that eventually cancels.

Generally, given a constraint k, we define the set of its neighbours N(k) to be the involved regions r_i as well as the involved assignments x^k_{r_i}, i.e., N(k) = {r^k_i, x^k_{r_i}}_{i=1}^{m_k}. To simplify notation we subsequently use r ∈ N(k) instead of (r, x_r) ∈ N(k). However, it should be clear from the context that each region r is matched with a value x^k_r. We subsume all constraints within the set K. Additionally, we let K_r denote the set of all those constraints k which depend on region r, i.e., K_r = {k : r ∈ N(k)}.

Using the aforementioned notation we are now ready to augment the conventional CBP program given in Eq. (1) with one additional set of constraints. The CBCBP program then reads as follows:

max_b ∑_{r,x_r} b_r(x_r) θ_r(x_r) + ∑_r H(b_r)  s.t.  ∀r: b_r(x_r) ≥ 0, ∑_{x_r} b_r(x_r) = 1;  ∀r, p ∈ P(r), x_r: ∑_{x_p\x_r} b_p(x_p) = b_r(x_r);  ∀k ∈ K, r ∈ N(k): b_r(x^k_r) = v_k.   (4)

To solve this program we observe that its constraint space exhibits a rich structure, defined on the one hand by the parent set P, and on the other hand by the neighborhood of the constraints subsumed in the set K. To exploit this structure, we aim at deriving the dual, which is possible because the program is strictly convex. Importantly we can subsequently derive block-coordinate updates for the dual variables, which are efficiently computable in closed form. Hence solving the program given in Eq. (4) via its dual is much more effective. In the following we first present the dual before discussing how to efficiently solve it.

Derivation of the dual program: The dual program of the task given in Eq. (4) is obtained by using the Lagrangian as shown in the following lemma.

Lemma 3.1: The dual problem associated with the primal program given in Eq.
(4) is:

min_{λ,ν} ∑_r log ∑_{x_r} exp( θ_r(x_r, λ) − ∑_{k∈K_r} ν_{r→k}(x_r) )  s.t.  ∀k ∈ K: ∑_{r∈N(k)} ν_{r→k}(x^k_r) = 0,

where we introduced θ_r(x_r, λ) = θ_r(x_r) − ∑_{p∈P(r)} λ_{r→p}(x_r) + ∑_{c∈C(r)} λ_{c→r}(x_c), and where we set ν_{r→k}(x_r) = 0 for all k ∈ K, r ∈ N(k), x_r ≠ x^k_r.

Proof: We begin by defining a Lagrange multiplier for each of the constraints given in Eq. (4). Concretely, for all r, p ∈ P(r), x_r we let λ_{r→p}(x_r) be the Lagrange multiplier associated with the marginalization constraint ∑_{x_p\x_r} b_p(x_p) = b_r(x_r). Similarly, for all k ∈ K, r ∈ N(k), we let ν_{r→k}(x^k_r) be the Lagrange multiplier that is associated with the constraint b_r(x^k_r) = v_k. The corresponding Lagrangian is then given by

L(b, λ, ν) = ∑_{r,x_r} b_r(x_r) ( θ_r(x_r, λ) − ∑_{k∈K_r} ν_{r→k}(x_r) ) + ∑_r H(b_r) + ∑_{k∈K, r∈N(k)} ν_{r→k}(x^k_r) v_k,

where θ_r(x_r, λ) = θ_r(x_r) − ∑_{p∈P(r)} λ_{r→p}(x_r) + ∑_{c∈C(r)} λ_{c→r}(x_c) and ν_{r→k}(x_r) = 0 for all k ∈ K, r ∈ N(k), x_r ≠ x^k_r. Due to conjugate duality between the entropy and the log-sum-exp function [25], the dual function is:

D(λ, ν) = max_b L(b, λ, ν) = ∑_r log ∑_{x_r} exp( θ_r(x_r, λ) − ∑_{k∈K_r} ν_{r→k}(x_r) ) + ∑_k v_k ∑_{r∈N(k)} ν_{r→k}(x^k_r).

The result follows since the dual function is unbounded with respect to the Lagrange multipliers ν_{r→k}(x^k_r) unless ∑_{r∈N(k)} ν_{r→k}(x^k_r) = 0, which yields the constraints.

Derivation of message passing update rules: As mentioned before, we can derive block-coordinate descent update rules for the dual which are computable in closed form. Hence the dual given in Lemma 3.1 can be solved efficiently, which is summarized in the following theorem:

Theorem 3.2: A block coordinate descent over the dual problem given in Lemma 3.1 results in a message passing algorithm whose details are given in Fig. 1 and which we refer to as the CBCBP algorithm. It is guaranteed to converge.

Before proving this result, we provide intuition for the update rules: as in the standard and distributed [19] CBP algorithms, each region r sends a message to its parents via the dual variable λ_{r→p}. Differently from CBP but similar to distributed variants [19], our algorithm has another type of messages, i.e., the ν messages. Conceptually, think of the constraints as new nodes. A constraint node k is connected to a region r if r ∈ N(k). Hence, a region r 'informs' the constraint node using the dual variable ν_{r→k}. We now show how to derive the message passing rules to optimize the dual.

Proof: First we note that convergence is guaranteed by the strict convexity of the primal problem [6]. Next we begin by optimizing the dual function given in Lemma 3.1 with respect to the λ parameters. Specifically, for a chosen region r we optimize the dual w.r.t. a block of Lagrange multipliers λ_{r→p}(x_r), ∀p ∈ P(r), x_r. To this end we differentiate the dual with respect to λ_{r→p}(x_r) while keeping all other variables fixed. The technique for solving the optimality conditions follows existing literature, augmented by the messages ν_{r→k}. It yields the update rules given in Fig. 1.

Next we turn to optimizing the dual with respect to the Lagrange multipliers ν.
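For intuition, the closed-form ν update of Fig. 1 subtracts the mean of log α_{r,k} over the neighbours of constraint k, so the dual constraint ∑_{r∈N(k)} ν_{r→k}(x^k_r) = 0 holds by construction. A minimal sketch of this one step (the α values below are made up for illustration; the paper defines α_{r,k} in Eq. (6)):

```python
import math

def nu_update(alpha):
    """Closed-form nu messages for one constraint k (Fig. 1):
    nu_r = log(alpha_r) - mean over r' of log(alpha_r')."""
    log_alpha = [math.log(a) for a in alpha]
    mean_log = sum(log_alpha) / len(log_alpha)
    return [la - mean_log for la in log_alpha]

nu = nu_update([0.2, 0.5, 0.8])
# by construction the messages sum to zero, matching the dual constraint
assert abs(sum(nu)) < 1e-12
```

Regions whose α exceeds the geometric mean receive a positive message, pulling their belief at x^k_r down toward the common value; the others are pulled up.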
Recall that each constraint k ∈ K in the dual function given in Lemma 3.1 is associated with the linear constraint ∑_{r∈N(k)} ν_{r→k}(x^k_r) = 0. Therefore we employ a Lagrange multiplier γ_k for each k. For compact exposition, we introduce the Lagrangian that is associated with a constraint k, denoted by L_k:

L_k(λ, ν) = ∑_{r∈N(k)} log ∑_{x_r} exp( θ_r(x_r, λ) − ∑_{k'∈K_r} ν_{r→k'}(x_r) ) + γ_k ∑_{r∈N(k)} ν_{r→k}(x^k_r).

Differentiating L_k with respect to ν_{r→k}, ∀r ∈ N(k), and using the optimality conditions, we then arrive at:

ν_{r→k}(x^k_r) = log( α_{r,k} · (1 − γ_k)/γ_k )   (5)

for all r ∈ N(k), where

α_{r,k} = exp( θ_r(x^k_r, λ) − ∑_{k'∈K_r\k} ν_{r→k'}(x^k_r) ) / ∑_{x_r\x^k_r} exp( θ_r(x_r, λ) − ∑_{k'∈K_r} ν_{r→k'}(x_r) ).   (6)

         n = 100       n = 200       n = 300       n = 400
CBP      1.47 ± 2e−4   2.7 ± 1e−4    5.95 ± 3e−3   13.42 ± 2e−3
CBCBP    0.05 ± 3e−4   0.11 ± 1e−4   0.23 ± 2e−3   0.43 ± 1e−3
Table 1: Average running time and standard deviation, over 10 repetitions, of CBCBP and CBP. Both infer the parameters of MRFs that consist of n variables.

                 c = 2          c = 4          c = 6          c = 8          c = 10
CBCBP speedup    31.40 ± 0.74   42.05 ± 1.02   49.17 ± 1.27   53.35 ± 0.93   58.01 ± 0.82
Table 2: Average speedup factor and standard deviation, over 10 repetitions, of CBCBP compared to CBP, for different values of c. Both infer the beliefs of MRFs that consist of 200 variables.

Summing the right hand side of Eq.
(5) over r ∈ N(k) and using the constraint ∑_{r∈N(k)} ν_{r→k}(x^k_r) = 0 results in

(1 − γ_k)/γ_k = ( ∏_{r∈N(k)} 1/α_{r,k} )^{1/|N(k)|}.

Finally, substituting this result back into Eq. (5) yields the desired update rule. We summarized the resulting algorithm in Fig. 1 and now turn our attention to its evaluation.

4 Experiments

We first demonstrate the applicability of the procedure using synthetic data. We then turn to image segmentation and machine translation, using real-world datasets. As our work directly improves the standard CBP approach, we use it as a baseline.

4.1 Synthetic Evaluation

Consider a binary variable Y and a variable X whose support consists of L levels, {1, ..., L}. Assume we are given the following PN-consistency potential:

θ_{x,y}(x, y) = 0 if (y = 1 ∧ x = 1) ∨ (y = 0 ∧ x ≠ 1), and −c otherwise,   (7)

where c is some positive parameter. This potential encourages the assignment y = 1 to agree with the assignment x = 1, and y = 0 to agree with x ∈ {2, ..., L}. Phrased differently, this potential favours beliefs such that:

b_y(y = 1) = b_x(x = 1),  b_y(y = 0) = b_x(x ≠ 1).   (8)

Therefore, one may replace the above potential using a single consistency constraint. Note that the above two constraints complement each other, hence, it suffices to include one of them. We use the left consistency constraint since it fits our derivation.

We test this hypothesis by constructing four networks that consist of n = 2v, v = 50, 100, 150, 200 variables, where v variables are binary, denoted by Y, and the other v variables are multi-level, subsumed within X. Note that the support of variable X_i, 1 ≤ i ≤ v, consists of i states. Each multi-level variable is matched with a binary one.
For each variable we randomly generate unary potentials according to the standard Gaussian distribution. We then run the standard CBP algorithm using the aforementioned PN-consistency potential given in Eq. (7) with c = 1. In a next step we replace each such potential by its corresponding consistency constraint following Eq. (8). For each network we repeat this process 10 times and report the mean running time and standard deviation in Tab. 1.

As expected, CBCBP is significantly faster than the standard CBP. Quantitatively, CBCBP was approximately 25 times faster for the smallest, and more than 31 times faster for the largest graphs. Obviously, different values of c affect the convexity of the problem and therefore also the running time of both CBP and CBCBP. To quantify its impact we repeat the experiment with n = 200 for distinct values of c ∈ {2, 4, 6, 8, 10}. In Tab. 2 we report the mean speedup factor over 10 repetitions, for each value of c. As clearly evident, the speedup factors substantially increase with c.

         global accuracy   average accuracy   mean running time
CBP      84.2              74.3               1.41 ± 5e−3
CBCBP    85.4              76.1               0.02 ± 2e−3
Table 3: Global accuracy, average accuracy and mean running time as well as standard deviation for the 256 images of the MSRC-21 dataset.

CBP     0.79 0.99 0.84 0.68 0.67 0.92 0.78 0.83 0.82 0.79 0.90 0.92 0.56 0.42 0.94 0.48 0.87 0.81 0.51 0.63 0
CBCBP   0.72 0.97 0.89 0.77 0.84 0.95 0.83 0.83 0.82 0.80 0.92 0.96 0.69 0.49 0.95 0.58 0.89 0.81 0.53 0.65 0
Table 4: Segmentation accuracy per class of CBCBP and CBP, for the MSRC-21 dataset.

4.2 Image Segmentation

We evaluate our approach on the task of semantic segmentation using the MSRC-21 dataset [21] as well as the PascalVOC 2012 [4]
dataset. Both contain 21 foreground classes. Each variable X_i in our model corresponds to a super-pixel in an image. In addition, each super-pixel is associated with a binary variable Y_i that indicates whether the super-pixel belongs to the foreground, i.e., y_i = 1, or to the background, i.e., y_i = 0. The model potentials are:

Super-pixel unary potentials: For MSRC-21 these potentials are computed by averaging the TextonBoost [11] pixel-potentials inside each super-pixel. For the PascalVOC 2012 dataset we train a convolutional neural network following the VGG16 architecture.

Foreground/background unary potentials: For MSRC-21 we let the value of the potential at y_i = 1 equal the value of the super-pixel unary potential that corresponds to the 'void' label, and for y_i = 0 we define it to be the maximum value of the super-pixel unary potential among the other labels. For PascalVOC 2012 we obtain the foreground/background potential by training another convolutional neural network, again following the VGG16 architecture.

Super-pixel - foreground/background consistency: We define pairwise potentials between super-pixels and the foreground/background labels using Eq. (7) and set c = 1. Naturally, these consistency potentials encourage CBP to favour beliefs where pixels that are labeled as 'void' are also labeled as 'background' and vice versa. This can also be formulated using the constraints b_i(X_i = 0) = b_i(Y_i = 0) and b_i(X_i ≠ 1) = b_i(Y_i = 1).

We compare the CBCBP algorithm with the standard CBP approach. For MSRC-21 we use the standard error measures of average per class accuracy and average per pixel accuracy, denoted as global. Performance results are provided in Tab. 3. Appealingly, our results indicate that CBCBP outperforms the standard CBP across both metrics. Moreover, as summarized in Tab. 4, in 19 out of 21 classes CBCBP achieves an accuracy that is equal to or higher than CBP.
Finally, CBCBP is more than 65 times faster than CBP.

In Tab. 5 we present the average pixel accuracy as well as the Intersection over Union (IoU) metric for the VOC2012 data. We observe CBCBP to perform better since it is able to transfer information between the foreground-background classification and the semantic segmentation.

         average accuracy   IoU
CBP      90.6               62.69
CBCBP    91.6               62.97
Table 5: Average accuracy and IoU accuracy for the 1449 images of the VOC2012 dataset.

4.3 Machine Translation

We now consider the task of machine translation. We define a phrase-based translation model as a factor graph with many large constraints and use CBCBP to efficiently incorporate them during inference. Our model is inspired by the widely-used approach of [8]. Given a sentence in a source language, the output of our phrase-based model consists of a segmentation of the source sentence into phrases (subsequences of words), a phrase translation for each source phrase, and an ordering of the phrase translations. See Fig. 2 for an illustration.

Figure 2: An example German sentence with a derivation of its translation. On the right we show the x_sw variables, which assign source words to source phrases, the x_sp variables, which assign source phrases to translation phrase slots, and the x_tp variables, which fill slots with actual words in the translation. Dotted lines highlight how x_sw values correspond to indices of x_sp variables, and x_sp values correspond to indices of x_tp variables. The x_sp variables for unused source spans (e.g., x_(1,1), x_(2,4), etc.) are not shown.

We index variables in our model by i = 1, ..., n; they include source words (sw), source phrases (sp), and translation phrase slots (tp). The sequence of source words is first segmented into source phrases. The possible values for source word sw are X_sw = {(sw1, sw2) : (sw1 ≤ sw ≤ sw2) ∧ (sw2 − sw1 < m)}, where m is the maximum phrase length.

If source phrase sp is used in the derivation, we say that sp aligns to a translation phrase slot tp. If sp is not used, it aligns to ∅. We define variables X_sp to indicate what sp aligns to: X_sp = {tp : sw1 − d ≤ tp ≤ sw2 + d} ∪ {∅}, i.e., all translation phrase slots tp (numbered from left to right in the translation) such that the slot number is at most distance d from an edge of sp.¹

Each translation phrase slot tp generates actual target-language words which comprise the translation. We define variables X_tp ranging over the possible target-language word sequences (translation phrases) that can be generated at slot tp. However, not all translation phrase slots must be filled in with translations. Beyond some value of tp (equaling the number of source phrases used in the derivation), they must all be empty. To enforce this, we also permit a null (∅) translation.

Consistency constraints: Many derivations defined by the discrete product space X_1 × ... × X_n are semantically inconsistent. For example, a derivation may place the first source word into the source phrase (1, 2) and the second source word into (2, 3). This is problematic because the phrases overlap; each source word must be placed into exactly one source phrase. We introduce source word consistency constraints:

∀sp, ∀sw ∈ sp : b_sw(sp) = b(sp).

These constraints force the source word beliefs b_sw(x_sw) to agree on their span. There are other consistencies we wish to enforce in our model.
Specifically, we must match a source phrase to a translation phrase slot if and only if the source phrase is consistently chosen by all of its source words. Formally,

∀ sp : b(sp) = ∑_{x_sp ≠ ∅} b_sp(x_sp).

Phrase translation potentials: We use pairwise potential functions between source phrases sp = (sw1, sw2) and their aligned translation phrase slots tp. We include a factor ⟨sp, tp⟩ ∈ E if sw1 − d ≤ tp ≤ sw2 + d. Letting π_sp be the actual words in sp, the potentials θ_{sp,tp}(x_sp, x_tp) determine the preference of the phrase translation ⟨π_sp, x_tp⟩ using a phrase table feature function τ : ⟨π, π'⟩ → R^k. In particular, θ_{sp,tp}(x_sp, x_tp) = γ_p^T τ(⟨π_sp, x_tp⟩) if x_sp = tp, and a large negative value otherwise, where γ_p is the weight vector for the Moses phrase table feature vector.

Language model potentials: To include n-gram language models, we add potentials that score pairs of consecutive target phrases, i.e., θ_{tp−1,tp}(x_{tp−1}, x_tp) = γ_ℓ ∑_{i=1}^{|x_tp|} log Pr(x_tp^(i) | x_{tp−1} · x_tp^(1) · ... · x_tp^(i−1)), where |x_tp| is the number of words in x_tp, x_tp^(i) is the i-th word in x_tp, · denotes string concatenation, and γ_ℓ is the feature weight. This potential sums n-gram log-probabilities of words in the second of the two target phrases. Internal n-gram features and the standard word penalty feature [7] are computed in the θ_tp potentials, since they depend only on the words in x_tp.

Source phrase separation potentials: We use pairwise potentials between source phrases to prevent them from aligning to the same translation slot. We also prevent two overlapping source phrases from both aligning to non-null slots (i.e., one must align to ∅). We include a factor between two source phrases if there is a translation phrase that may relate to both, namely ⟨sp1, sp2⟩ ∈ E if ∃ tp : ⟨sp1, tp⟩ ∈ E ∧ ⟨sp2, tp⟩ ∈ E. The source phrase separation potential θ_{sp1,sp2}(x_sp1, x_sp2) is −∞ if either x_sp1 = x_sp2 ≠ ∅, or sp1 ∩ sp2 ≠ ∅ ∧ x_sp1 ≠ ∅ ∧ x_sp2 ≠ ∅. Otherwise, it is −γ_d |δ(sp1, sp2) − |x_sp1 − x_sp2||, where δ(sp1, sp2) returns the number of source words between the spans sp1 and sp2. This favors similar distances between source phrases and their aligned slots.

                              %BLEU
no sw constraints             13.12
sw constraints with CBCBP     16.73
Table 6: %BLEU on test set, showing the contribution of the source word consistency constraints.

Experimental Setup: We consider German-to-English translation. As training data for constructing the phrase table, we use the WMT2011 parallel data [2], which contains 1.9M sentence pairs. We use the phrase table to compute θ_{sp,tp} and to fill X_tp. We use a bigram language model estimated from the English side of the parallel data along with 601M tokens of randomly-selected sentences from the Linguistic Data Consortium's Gigaword corpus. This is used when computing the θ_{tp−1,tp} potentials. As our test set, we use the first 150 sentences from the WMT2009 test set. Results below are (uncased) %BLEU scores [17] on this 150-sentence set. We use maximum phrase length m = 3 and distortion limit d = 3.

¹ Our distortion limit d is based on distances from source words to translation slots, rather than distances between source words as in the Moses system [7].
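The distortion-limited slot domain X_sp = {tp : sw1 − d ≤ tp ≤ sw2 + d} ∪ {∅} described for the model above can be enumerated directly. A sketch under our own assumptions (1-based slot numbering and None as the encoding of the null alignment ∅ are illustrative choices, not taken from the paper):

```python
def slot_domain(sw1, sw2, d, num_slots):
    """Possible translation phrase slots for source phrase (sw1, sw2):
    all slots within distance d of the phrase edges, plus the null
    alignment (encoded here as None)."""
    slots = [tp for tp in range(1, num_slots + 1)
             if sw1 - d <= tp <= sw2 + d]
    return slots + [None]

# with d = 3 as in the experiments, the source phrase spanning words
# 2..3 may align to slots 1..6 of a 10-slot translation, or stay unused
dom = slot_domain(2, 3, 3, 10)
```

Enumerating the domain this way makes clear why a small d keeps the X_sp variables, and hence the factors touching them, tractable.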
We run 250 iterations of CBCBP for each sentence. For the feature weights (γ), we use the default weights in Moses, since our features are analogous to theirs. Learning the weights is left to future work.

Results: We compare to a simplified version of our model that omits the sw variables and all constraints and terms pertaining to them. This variation still contains all sp and tp variables and their factors. This comparison shows the contribution of our novel handling of consistency constraints. Tab. 6 shows our results. The consistency constraints lead to a large improvement for our model at a negligible increase in runtime due to our closed-form update rules. We found it impractical to attempt to obtain these results using the standard CBP algorithm for any source sentences of typical length.

For comparison to a standard benchmark, we also trained a Moses system [7], a state-of-the-art phrase-based system, on the same data. We used default settings and feature weights, except we used max phrase length 3 and no lexicalized reordering model, in order to more closely match the setting of our model. The Moses %BLEU on this dataset is 17.88. When using the source word consistency constraints, we are within 1.2% of Moses. Our model has the virtue of being able to compute marginals for downstream applications and also permits us to study particular forms of constraints in phrase-based translation modeling. Future work can add or remove constraints like we did in our experiments here in order to determine the most effective constraints for phrase-based translation. Our efficient inference framework makes such exploration possible.

5 Related Work

Variational approaches to inference have been extensively studied in the past. We address approximate inference using the entropy barrier function and there has been extensive work in this direction, e.g., [24, 14, 23, 5, 19, 20] to name a few.
Our work differs in that we incorporate consistency constraints within the inference engine, and we show that closed-form update rules remain available. Consistency constraints are implied when using PN-potentials [9]; however, a pairwise function is included for every constraint, which is expensive when many constraints are involved. Constraints over the feasible instances are considered in [22, 13, 16, 12, 1]. While impressive results have been shown, each different restriction of the feasible set may require a tailored algorithm. In contrast, we propose to include probabilistic equalities among the model beliefs, which permits derivation of an algorithm that is generally applicable.

6 Conclusions

In this work we tackled the problem of inference with belief-based equality constraints, which arises when consistency among variables in the network is required. We introduced the CBCBP algorithm, which directly incorporates constraints into the CBP framework and results in closed-form update rules. We demonstrated the merit of CBCBP both on synthetic data and on two real-world tasks. Our experiments indicate that CBCBP outperforms PN-potentials in both speed and accuracy. In the future we intend to incorporate our approximate inference with consistency constraints into learning frameworks, e.g., [15, 3].

References

[1] S. Bach, M. Broecheler, L. Getoor, and D. O'Leary. Scaling MPE inference for constrained continuous Markov random fields with consensus optimization. In Advances in Neural Information Processing Systems, pages 2654–2662, 2012.

[2] C. Callison-Burch, P. Koehn, C. Monz, and O. Zaidan. Findings of the 2011 Workshop on Statistical Machine Translation. In Proc. of WMT, 2011.

[3] L.-C. Chen*, A. G. Schwing*, A. L. Yuille, and R. Urtasun. Learning deep structured models. In Proc. ICML, 2015. (* equal contribution.)

[4] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman.
The PASCAL Visual Object Classes Challenge 2012 (VOC2012). http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html, 2012.

[5] T. Hazan, J. Peng, and A. Shashua. Tightening fractional covering upper bounds on the partition function for high-order region graphs. In Proc. UAI, 2012.

[6] T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11):2379–2413, 2004.

[7] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open source toolkit for statistical machine translation. In Proc. of ACL, 2007.

[8] P. Koehn, F. J. Och, and D. Marcu. Statistical phrase-based translation. In Proc. of NAACL-HLT, pages 48–54, 2003.

[9] P. Kohli, P. H. S. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.

[10] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

[11] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Graph cut based inference with co-occurrence statistics. In Proc. ECCV, pages 239–253. Springer, 2010.

[12] A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. An augmented Lagrangian approach to constrained MAP inference. In Proc. ICML, 2011.

[13] A. F. T. Martins. The geometry of constrained structured prediction: applications to inference and learning of natural language syntax. PhD thesis, Columbia University, 2012.

[14] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing algorithms: a unifying view. In Proc. UAI, 2009.

[15] O. Meshi, D. Sontag, T. Jaakkola, and A. Globerson. Learning efficiently with approximate inference via dual losses. In Proc. ICML, 2010.

[16] S. Nowozin and C. H. Lampert. Global connectivity potentials for random field models. In Proc. CVPR, pages 818–825. IEEE, 2009.

[17] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL, 2002.

[18] A. G. Schwing, S. Fidler, M. Pollefeys, and R. Urtasun. Box in the Box: joint 3D layout and object reasoning from single images. In Proc. ICCV, 2013.

[19] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Distributed message passing for large scale graphical models. In Proc. CVPR, 2011.

[20] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun. Globally convergent dual MAP LP relaxation solvers using Fenchel-Young margins. In Proc. NIPS, 2012.

[21] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In Proc. CVPR, pages 1–8. IEEE, 2008.

[22] D. A. Smith and J. Eisner. Dependency parsing by belief propagation. In Proc. of EMNLP, pages 145–156, 2008.

[23] D. Tarlow, D. Batra, P. Kohli, and V. Kolmogorov. Dynamic tree block coordinate ascent. In Proc. ICML, pages 113–120, 2011.

[24] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, 51(7):2313–2335, 2005.

[25] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.