{"title": "Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation", "book": "Advances in Neural Information Processing Systems", "page_first": 1950, "page_last": 1958, "abstract": "We consider energy minimization for undirected graphical models, also known as MAP-inference problem for Markov random fields. Although combinatorial methods, which return a provably optimal integral solution of the problem, made a big progress in the past decade, they are still typically unable to cope with large-scale datasets. On the other hand, large scale datasets are typically defined on sparse graphs, and convex relaxation methods, such as linear programming relaxations often provide good approximations to integral solutions.   We propose a novel method of combining combinatorial and convex programming techniques to obtain a global solution of the initial combinatorial problem. Based on the information obtained from the solution of the convex relaxation, our method confines application of the combinatorial solver to a small fraction of the initial graphical model, which allows to optimally solve big problems.   We demonstrate the power of our approach on a computer vision energy minimization benchmark.", "full_text": "Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation\n\nBogdan Savchynskyy1\n\nJ\u00f6rg Kappes2\n\nPaul Swoboda2\n\nChristoph Schn\u00f6rr1,2\n\n1Heidelberg Collaboratory for Image Processing, Heidelberg University, Germany\n\nbogdan.savchynskyy@iwr.uni-heidelberg.de\n\n2Image and Pattern Analysis Group, Heidelberg University, Germany\n{kappes,swoboda,schnoerr}@math.uni-heidelberg.de\n\nAbstract\n\nWe consider energy minimization for undirected graphical models, also known as\nthe MAP-inference problem for Markov random fields. 
Although combinatorial\nmethods, which return a provably optimal integral solution of the problem, have made\nsignificant progress in the past decade, they are still typically unable to cope with\nlarge-scale datasets. On the other hand, large-scale datasets are often defined on\nsparse graphs, and convex relaxation methods, such as linear programming relaxations,\nthen provide good approximations to integral solutions.\nWe propose a novel method of combining combinatorial and convex programming\ntechniques to obtain a global solution of the initial combinatorial problem.\nBased on the information obtained from the solution of the convex relaxation, our\nmethod confines application of the combinatorial solver to a small fraction of the\ninitial graphical model, which allows much larger problems to be solved optimally.\nWe demonstrate the efficacy of our approach on a computer vision energy minimization\nbenchmark.\n\n1 Introduction\n\nThe focus of this paper is energy minimization for Markov random fields. In the most common\npairwise case this problem reads\n\n\\[ \\min_{x \\in X_G} E_{G,\\theta}(x) := \\min_{x \\in X_G} \\sum_{v \\in V_G} \\theta_v(x_v) + \\sum_{uv \\in E_G} \\theta_{uv}(x_u, x_v), \\tag{1} \\]\n\nwhere G = (VG,EG) denotes an undirected graph with the set of nodes VG \u220b v and the set of\nedges EG \u220b uv; variables xv belong to the finite label sets Xv, v \u2208 VG; potentials \u03b8v : Xv \u2192 R,\n\u03b8uv : Xu\u00d7Xv \u2192 R, v \u2208 VG, uv \u2208 EG, are associated with the nodes and the edges of G respectively.\nWe denote by XG the Cartesian product \u2297v\u2208VGXv.\nProblem (1) is known to be NP-hard in general, hence existing methods either consider its convex\nrelaxations and/or apply combinatorial techniques such as branch-and-bound, combinatorial search,\ncutting plane etc. on top of convex relaxations. 
The main contribution of this paper is a novel\nmethod to combine convex and combinatorial approaches to compute a provably optimal solution.\nThe method is very general in the sense that it is not restricted to a specific convex programming\nor combinatorial algorithm, although some algorithms are preferable to others. The main\nrestriction of the method is the neighborhood structure of the graph G: it has to be sparse. Basic grid\ngraphs of image data provide examples satisfying this requirement. The method is also applicable to\nhigher-order problems, defined on so-called factor graphs [1]; however, we will concentrate mainly\non the pairwise case to keep our exposition simple.\n\n[Figure 1 panels: Solve A and B separately; Check consistency on \u2202A; Increase B. Legend: A\\\u2202A, B\\\u2202A, \u2202A \u2261 \u2202B, label mismatch.]\n\nFigure 1: Underlying idea of the proposed method: the initial graph is split into two subgraphs A\n(blue+yellow) and B (red+yellow), assigned to a convex and a combinatorial solver respectively. If\nthe integral solutions provided by both solvers do not coincide on the common border \u2202A (yellow)\nof the two subgraphs, the subgraph B is increased by appending mismatching nodes (green) and the\nborder is adjusted accordingly.\n\nUnderlying idea. Fig. 1 demonstrates the main idea of our method. Let A and B be two subgraphs\ncovering G. Select them so that the only common nodes of these subgraphs lie on their mutual border\n\u2202A (\u2261 \u2202B) defined in terms of the master graph G. Let x\u2217_A and x\u2217_B be optimal labelings computed\nindependently on A and B. 
If these labelings coincide on the border \u2202A, then under some additional\nconditions the concatenation of x\u2217_A and x\u2217_B is an optimal labeling for the initial problem (1), as we\nshow in Section 3 (see Theorem 1).\nWe select the subgraph A such that it contains a \u201csimple\u201d part of the problem, for which the convex\nrelaxation is tight. This part is assigned to the respective convex program solver. The subgraph\nB contains in contrast the difficult, combinatorial subproblem and is assigned to a combinatorial\nsolver. If the labelings x\u2217_A and x\u2217_B do not coincide on some border node v \u2208 \u2202A, we (i) increase the\nsubgraph B by appending the node v and edges from v to B, (ii) correspondingly decrease A and\n(iii) recompute x\u2217_A and x\u2217_B. This process is repeated until either the labelings x\u2217_A and x\u2217_B coincide on\nthe border or B equals G. The sparsity of G is required to avoid fast growth of the subgraph B.\nWe refer to Section 3 for a detailed description of the algorithm, where we in particular specify the\ninitial selection of the subgraphs A and B and the methods for (i) encouraging consistency of x\u2217_A\nand x\u2217_B on the boundary \u2202A and (ii) providing equivalent results with just a single run of the convex\nrelaxation solver. These techniques will be described for the local polytope relaxation, known also\nas a linear programming relaxation of (1) [2, 3].\nRelated work. The literature on problem (1) is very broad, both regarding convex programming and\ncombinatorial methods. Here we will concentrate on the local polytope relaxation, which is essential\nto our approach.\nThe local polytope relaxation (LP) of (1) was proposed and analyzed in [4] (see also the recent\nreview [2]). An alternative view on the same relaxation was proposed in [5]. This view appeared to\nbe very close to the idea of the Lagrangian or dual decomposition technique (see [6] for applications\nto (1)). This idea stimulated development of efficient solvers for convex relaxations of (1). 
Scalable\nsolvers for the LP relaxation became a hot topic in recent years [7\u201314]. The algorithms however,\nwhich guarantee attainment of the optimum of the convex relaxation at least theoretically, are quite\nslow in practice, see e.g. comparisons in [11, 15]. Remarkably, the fastest scalable algorithms\nfor convex relaxations are based on coordinate descent: the diffusion algorithm [2] known from\nthe seventies and especially its dual decomposition based variant TRW-S [16]. There are other\nclosely related methods [17, 18] based on the same principle. Although these algorithms do not\nguarantee attainment of the optimum, they converge [19] to points ful\ufb01lling a condition known as\narc consistency [2] or weak tree agreement [16]. We show in Section 3 that this condition plays a\nsigni\ufb01cant role for our approach. It is a common observation that in the case of sparse graphs and/or\nstrong evidence of the unary terms \u03b8v, v \u2208 VG, the approximate solutions delivered by such solvers\nare quite good from the practical viewpoint. The belief, that these solutions are close to optimal\nones is evidenced by numerical bounds, which these solvers provide as a byproduct.\nThe techniques used in combinatorial solvers specialized to problem (1) include most of the clas-\nsical tools: cutting plane, combinatorial search and branch-and-bound methods were adapted to the\nproblem (1). The ideas of the cutting plane method form the basis for tightening the LP relaxation\nwithin the dual decomposition framework (see the recent review [20] and references therein) and\nfor \ufb01nding an exact solution for Potts models [21], which is a special class of problem (1). Com-\nbinatorial search methods with dynamic programming based heuristics were successfully applied\n\n2\n\n\fto problems de\ufb01ned on dense and fully connected but small graphs [22]. 
The specialized branch-\nand-bound solvers [23, 24] also use convex (mostly LP) relaxations and/or a dynamic programming\ntechnique to produce bounds in the course of the combinatorial search [25]. However the reported\napplicability of most combinatorial solvers nowadays is limited to small graphs. Specialized solvers\nlike [21] scale much better, but are focused on a certain narrow class of problems.\nThe goal of this work is to employ the fact, that local polytope solvers provide good approximate\nsolutions and to restrict computational efforts of combinatorial solvers to a relatively small, and\nhence tractable part of the initial problem.\nContribution. We propose a novel method for obtaining a globally optimal solution of the energy\nminimization problem (1) for sparse graphs and demonstrate its performance on a series of large-\nscale benchmark datasets. We were able to\n\n\u2022 solve previously unsolved large-scale problems of several different types, and\n\u2022 attain optimal solutions of hard instances of Potts models an order of magnitude faster than\n\nspecialized state of the art algorithms [21].\n\nFor an evaluation of our method we use datasets from the very recent benchmark [15].\nPaper structure. In Section 2 we provide the de\ufb01nitions for the local polytope relaxation and arc\nconsistency. Section 3 is devoted to the speci\ufb01cation of our algorithm. In Sections 4 and 5 we\nprovide results of the experimental evaluation and conclusions.\n\n2 Preliminaries\nNotation. A vector x with coordinates xv, v \u2208 VG, will be called labeling and its coordinates\nxv \u2208 Xv \u2013 labels. The notation x|W ,W \u2282 VG stands for the restriction of x to the subset W, i.e.\nfor the subvector (xv, v \u2208 W). To shorten notation we will sometimes write xuv \u2208 Xuv in place\nof (xv, xu) \u2208 Xu \u00d7 Xv for (v, u) \u2208 EG. 
Let also nb(v), v \u2208 VG, denote the set of neighbors of\nnode v, that is the set {u \u2208 VG : uv \u2208 EG}.\nLP relaxation. The local polytope relaxation of (1) reads (see e.g. [2])\n\n\\[ \\begin{aligned} \\min_{\\mu \\ge 0} \\quad & \\sum_{v \\in V_G} \\sum_{x_v \\in X_v} \\theta_v(x_v) \\mu_v(x_v) + \\sum_{uv \\in E_G} \\sum_{(x_u, x_v) \\in X_{uv}} \\theta_{uv}(x_u, x_v) \\mu_{uv}(x_u, x_v) \\\\ \\text{s.t.} \\quad & \\sum_{x_v \\in X_v} \\mu_v(x_v) = 1, \\quad v \\in V_G, \\\\ & \\sum_{x_v \\in X_v} \\mu_{uv}(x_u, x_v) = \\mu_u(x_u), \\quad x_u \\in X_u,\\ uv \\in E_G, \\\\ & \\sum_{x_u \\in X_u} \\mu_{uv}(x_u, x_v) = \\mu_v(x_v), \\quad x_v \\in X_v,\\ uv \\in E_G. \\end{aligned} \\tag{2} \\]\n\nThis formulation is based on the overcomplete representation of indicator vectors \u00b5 constrained\nto the local polytope commonly used for discrete graphical models [3]. It is well-known that the\nlocal polytope constitutes an outer bound (relaxation) of the convex hull of all indicator vectors of\nlabelings (marginal polytope; cf. [3]).\nThe Lagrange dual of (2) reads\n\n\\[ \\begin{aligned} \\max_{\\phi, \\gamma} \\quad & \\sum_{v \\in V_G} \\gamma_v + \\sum_{uv \\in E_G} \\gamma_{uv} \\\\ \\text{s.t.} \\quad & \\gamma_v \\le \\tilde\\theta^\\phi_v(x_v) := \\theta_v(x_v) - \\sum_{u \\in \\mathrm{nb}(v)} \\phi_{v,u}(x_v), \\quad v \\in V_G,\\ x_v \\in X_v, \\\\ & \\gamma_{uv} \\le \\tilde\\theta^\\phi_{uv}(x_u, x_v) := \\theta_{uv}(x_u, x_v) + \\phi_{v,u}(x_v) + \\phi_{u,v}(x_u), \\quad uv \\in E_G,\\ (x_u, x_v) \\in X_{uv}. \\end{aligned} \\tag{3} \\]\n\nIn the constraints of (3) we introduced the reparametrized potentials \u02dc\u03b8\u03c6. One can see that for any\nvalues of the dual variables \u03c6 the reparametrized energy E\u02dc\u03b8\u03c6,G(x) is equal to the non-parametrized\none E\u03b8,G(x) for any labeling x \u2208 XG. The objective function of the dual problem is equal to\n\n\\[ D(\\phi) := \\sum_{v \\in V_G} \\tilde\\theta^\\phi_v(x'_v) + \\sum_{uv \\in E_G} \\tilde\\theta^\\phi_{uv}(x'_{uv}), \\quad \\text{where } x'_w \\in \\arg\\min_{x_w \\in X_v \\cup X_{uv}} \\tilde\\theta^\\phi_w(x_w). \\]\n\nA reparametrization, that is reparametrized potentials \u02dc\u03b8\u03c6, will be called optimal, if the corresponding\n\u03c6 is the solution of the dual problem (3). In general, neither the optimal \u03c6 nor the optimal\nreparametrization is unique.\n\nDefinition 1 (Strict arc consistency). We will call the node v \u2208 VG strictly arc consistent w.r.t.\npotentials \u03b8 if there exist labels x\u2032_v \u2208 Xv and x\u2032_u \u2208 Xu for all u \u2208 nb(v), such that \u03b8v(x\u2032_v) < \u03b8v(xv)\nfor all xv \u2208 Xv\\{x\u2032_v} and \u03b8vu(x\u2032_v, x\u2032_u) < \u03b8vu(xv, xu) for all (xv, xu) \u2208 Xvu\\{(x\u2032_v, x\u2032_u)}. The label\nx\u2032_v will be called locally optimal.\nIf all nodes v \u2208 VG are strictly arc consistent w.r.t. the potentials \u02dc\u03b8\u03c6, the dual objective value D(\u03c6)\nbecomes equal to the energy\n\nD(\u03c6) = EG,\u02dc\u03b8\u03c6(x\u2032) = EG,\u03b8(x\u2032)   (4)\n\nof the labeling x\u2032 constructed by the corresponding locally optimal labels. From duality it follows\nthat D(\u03c6) is a lower bound for energies of all labelings EG,\u03b8(x), x \u2208 XG. Hence attainment of\nequality (4) shows that (i) \u03c6 is the solution of the dual problem (3) and (ii) x\u2032 is the solution of both\nthe energy minimization problem (1) and its relaxation (2).\nStrict arc consistency of all nodes is sufficient, but not necessary for attaining the optimum of the\ndual objective (3). Its fulfillment means that our LP relaxation is tight, which is not always the\ncase. However, in many practical cases the optimal reparametrization \u03c6 corresponds to strict arc\nconsistency of a significant portion of, but not all graph nodes. The remaining non-consistent part is\noften much smaller and consists of many separate \u201cislands\u201d. 
The strict arc consistency of a certain\nnode v, even for the optimally reparametrized potentials \u02dc\u03b8\u03c6, does not guarantee global optimality\nof the corresponding locally optimal label xv (unless it holds for all nodes), though it is a good and\nwidely used heuristic to obtain an approximate solution of the non-relaxed problem (1). In this work\nwe provide an algorithm, which is able to prove this optimality or discard it. The algorithm applies\ncombinatorial optimization techniques only to the arc inconsistent part of the model, which is often\nmuch smaller than the whole model in applications.\nRemark 1. Ef\ufb01cient dual decomposition based algorithms optimize dual functions, which differ\nfrom (4) (see e.g. [6, 13, 16]), but are equivalent to it in the sense of equal optimal values. Getting\nreparametrizations \u02dc\u03b8\u03c6 is less straightforward in these cases, but can be ef\ufb01ciently computed (see\ne.g. [16, Sec. 2.2]).\n\n3 Algorithm description\nThe graph A = (VA,EA) will be called an (induced) subgraph of the graph G = (VG,EG), if\nVA \u2282 VG and EA = {uv \u2208 EG : u, v \u2208 VA}. The graph G will be called supergraph of A. The\nsubgraph \u2202A induced by a set of nodes V\u2202A of the graph A, which are connected to VG\\VA, is\ncalled its boundary w.r.t. G, i.e. V\u2202A = {v \u2208 VA : \u2203uv \u2208 EG : u \u2208 VG\\VA}. The complement B\nto A\\\u2202A, given by VB = {v \u2208 VG : v \u2208 \u2202A \u222a (VG\\VA)}, EB = {uv \u2208 EG : u, v \u2208 VB}, is called\nboundary complement to A w.r.t. the graph G. Let A be a subgraph of G and potentials \u03b8v, v \u2208 VG,\nand \u03b8uv \u2208 EG be associated with nodes and edges of G respectively. We assume, that \u03b8v, v \u2208 VA,\nand \u03b8uv \u2208 EA are associated with the subgraph A. 
Hence we consider the energy function EA,\u03b8 to\nbe defined on A together with an optimal labeling on A, which is the one that minimizes EA,\u03b8.\nThe following theorem formulates conditions under which an optimal labeling x\u2217 on the\ngraph G can be produced from the optimal labelings on its mutually boundary complement subgraphs A and B.\nTheorem 1. Let A be a subgraph of G and B be its boundary complement w.r.t. G. Let x\u2217_A and\nx\u2217_B be labelings minimizing EA,\u03b8 and EB,\u03b8 respectively and let all nodes v \u2208 VA be strictly arc\nconsistent w.r.t. potentials \u03b8. Then from\n\n\\[ x^*_{A,v} = x^*_{B,v} \\quad \\text{for all } v \\in V_{\\partial A} \\tag{5} \\]\n\nfollows that the labeling x\u2217 with coordinates\n\n\\[ x^*_v = \\begin{cases} x^*_{A,v}, & v \\in A \\\\ x^*_{B,v}, & v \\in B \\setminus A \\end{cases}, \\quad v \\in V_G, \\]\n\nis optimal on G.\n\nProof. Let \u03b8 denote potentials of the problem. Let us define other potentials \u03b8\u2032 as\n\n\\[ \\theta'_w(x_w) := \\begin{cases} 0, & w \\in V_{\\partial A} \\cup E_{\\partial A} \\\\ \\theta_w(x_w), & w \\notin V_{\\partial A} \\cup E_{\\partial A} \\end{cases}. \\]\n\nThen EG,\u03b8(x) = EA,\u03b8\u2032(x|A) + EB,\u03b8(x|B). From strict\narc consistency of \u03b8 over A directly follows that EA,\u03b8\u2032(x\u2217_A) = min_{xA} EA,\u03b8\u2032(xA). From this follows\n\n\\[ \\begin{aligned} \\min_x E_{G,\\theta}(x) &= \\min_{x_A, x_B} \\{ E_{A,\\theta'}(x_A) + E_{B,\\theta}(x_B) \\ \\text{s.t.}\\ x_A|_{\\partial A} = x_B|_{\\partial A} \\} \\\\ &= \\min_{x'_{\\partial A}} \\Big[ \\min_{x_A :\\, x_A|_{\\partial A} = x'_{\\partial A}} E_{A,\\theta'}(x_A) + \\min_{x_B :\\, x_B|_{\\partial A} = x'_{\\partial A}} E_{B,\\theta}(x_B) \\Big] \\\\ &\\ge \\min_{x_A} E_{A,\\theta'}(x_A) + \\min_{x_B} E_{B,\\theta}(x_B) \\\\ &= E_{A,\\theta'}(x^*_A) + E_{B,\\theta}(x^*_B) = E_{G,\\theta}(x^*). \\end{aligned} \\]\n\nAlgorithm 1\n(1) Solve LP and reparametrize (G, \u03b8) \u2192 (G, \u02dc\u03b8\u03c6).\n(2) Initialize: (A, \u02dc\u03b8\u03c6) and x\u2217_{A,v} from arc consistent nodes.\n(3) repeat\n    Set B as a boundary complement to A.\n    Compute an optimal labeling x\u2217_B on B.\n    If x\u2217_A|\u2202A = x\u2217_B|\u2202A return.\n    Else set C := {v \u2208 V\u2202A : x\u2217_{A,v} \u2260 x\u2217_{B,v}}, A := A\\C\n    until C = \u2205\n\nNow we are ready to transform the idea described in the introduction into Algorithm 1.\nStep (1). As a first step of the algorithm we run an LP solver for the dual problem (3) on the\nwhole graph G. The output of this step is the reparametrization \u02dc\u03b8\u03c6 of the initial problem.\nSince well-scalable algorithms for the dual problem (3) attain the optimum only in the limit after a\npotentially infinite number of iterations, we cannot afford to solve it exactly. Fortunately, this is not\nneeded: it is enough to get only a sufficiently good approximation. We will return to\nthis point at the end of this section.\nStep (2). We assign to the set VA the nodes of the graph G, which satisfy the strict arc consistency\ncondition. The optimal labeling on A can be trivially computed from the reparametrized unary\npotentials \u02dc\u03b8\u03c6_v by x\u2217_{A,v} := arg min_{xv} \u02dc\u03b8\u03c6_v(xv), v \u2208 A.\nStep (3). We define B as the boundary complement to A w.r.t. the master graph G and find an\noptimal labeling x\u2217_B on the subgraph B with a combinatorial solver. If the boundary condition (5)\nholds we have found the optimal labeling according to Theorem 1. Otherwise we remove the nodes\nwhere this condition fails from A and repeat the whole step until either (5) holds or B = G.\n\n3.1 Remarks on Algorithm 1\n\nEncouraging boundary consistency condition. It is quite unlikely that the optimal boundary\nlabeling x\u2217_A|\u2202A obtained based only on the subgraph A coincides with the boundary labeling x\u2217_B|\u2202A\nobtained for the subgraph B. To satisfy this condition the unary potentials should be quite strong on\nthe border. In other words, they should be at least strictly arc consistent. 
Indeed they are so, since\nwe consider the reparametrized potentials \u02dc\u03b8\u03c6, obtained at the LP presolve step of the algorithm.\nSingle run of LP solver. Reparametrization also allows us to perform only a single run of the LP\nsolver, keeping the results as if the subproblem over A has been solved at each iteration. The\nfollowing theorem states this property formally.\nTheorem 2. Let all nodes of a graph A be strictly arc consistent w.r.t. potentials \u02dc\u03b8\u03c6, x be the\noptimum of EA,\u02dc\u03b8\u03c6 and A\u2032 be a subgraph of A. Then x|A\u2032 optimizes EA\u2032,\u02dc\u03b8\u03c6.\nProof. The proof follows directly from Definition 1. Equation (4) holds for the labeling x|A\u2032\nplugged in place of x\u2032 and graph A\u2032 in place of G. Hence x|A\u2032 provides a minimum of EA\u2032,\u02dc\u03b8\u03c6.\nPresolving B for combinatorial solver. Many combinatorial solvers use linear programming relaxations\nas a presolving step. Reparametrization of the subproblem over the subgraph B plays the\nrole of such a presolver, since the optimal reparametrization corresponds to the solution of the dual\nproblem and makes solving the primal one easier.\nConnected components analysis. It is often the case that the subgraph B consists of several connected\ncomponents. We apply the combinatorial solver to each of them independently.\n\nDataset  |VG|    |Xv|  Step (1) LP (TRWS)          Step (3) ILP (CPLEX)       |B|\nname                   # it   time, s  E           # it  time, s  E           min    max\ntsukuba  110592  16    250    186      369537      24    36       369218      130    656\nvenus    166222  20    2000   3083     3048296     10    69       3048043     66     233\nteddy    168750  60    10000  14763    1345214     1     \u2212        \u2212           2062   \u2212\nfamily   425632  5     10000  20156    184825      18    2        184813      11     109\npano     514080  7     10000  34092    169224      1     \u2212        \u2212           24474  \u2212\n\nTable 1: Results on Middlebury datasets. 
The column Dataset contains the dataset name, numbers\n|VG| of nodes and |Xv| of labels. Columns Step (1) and Step (3) contain number of iterations, time\nand attained energy at steps (1) and (3) of Algorithm 1, corresponding to solving the LP relaxation\nand use of a combinatorial solver respectively. The column |B| presents starting and \ufb01nal sizes\nof the \u201dcombinatorial\u201c subgraph B. Dash \u201d-\u201d stands for failure of CPLEX, due to the size of the\ncombinatorial subproblem.\n\nSubgraph B growing strategy. One can consider different strategies for increasing the subgraph B,\nif the boundary condition (5) does not hold. Our greedy strategy is just one possible option.\nOptimality of reparametrization. As one can see, the reparametrization plays a signi\ufb01cant role\nfor our algorithm: it (i) is required for Theorem 1 to hold; (ii) serves as a criterion for the initial\nsplitting of G into A and B; (iii) makes the local potentials on the border \u2202A stronger; (iv) allows\nto avoid multiple runs of the LP solver, when the subgraph A shrinks; (v) can speed-up some com-\nbinatorial solvers by serving as a presolve result. However, there is no real reason to search for an\noptimal reparametrization: all its mentioned functionality remains valid also if it is non-optimal. Of\ncourse, one pays a certain price for the non-optimality: (i) the initial subgraph B becomes larger;\n(ii) the local potentials \u2013 weaker; (iii) the presolve results for the combinatorial solver become less\nprecise. Note that even for non-optimal reparametrizations Theorem 2 holds and we need to run the\nLP solver only once.\n\n4 Experimental evaluation\n\nWe tested our approach on problems from the Middlebury energy minimization benchmark [26] and\nthe recently published discrete energy minimization benchmark [15], which includes the datasets\nfrom the \ufb01rst one. 
We have selected computer vision benchmarks intentionally, because many problems\nin this area fulfill our requirements: the underlying graph is sparse (typically it has a grid\nstructure) and the LP relaxation delivers good practical results.\nSince our experiments serve mainly as proof of concept we used general, though not always the\nmost efficient solvers: TRW-S [16] as the LP-solver and CPLEX [27] as the combinatorial one\nwithin the OpenGM framework [28]. Unfortunately the original version of TRW-S does not provide\ninformation about strict arc consistency and does not output a reparametrization. Therefore we used\nour own implementation in the experiments. Depending on the type of the pairwise factors (Potts,\ntruncated \u21132 or \u21131-norm) we found our implementation up to an order of magnitude slower than the\nfreely available code of V. Kolmogorov. This fact suggests that the provided processing time can be\nsignificantly improved in more efficient future implementations.\nIn the first round of our experiments we considered problems (i.e. graphical models with the specified\nunary and pairwise factors) of the Middlebury MRF benchmark, most of which remained unsolved,\nto the best of our knowledge.\nMRF stereo dataset consists of 3 models: tsukuba, venus and teddy. Since the optimal integral\nsolution of tsukuba was recently obtained by LP-solvers [11,13], we used this dataset to show\nhow our approach performs for clearly non-optimal reparametrizations. For this we ran TRW-S for\n250 iterations only. The size of the subgraph B grew from 130 to 656 nodes out of more than 100000\nnodes of the original problem (see Table 1). On venus we obtained an optimal labeling after 10\niterations of our algorithm. During these iterations the size of the set B grew from 66 to 233 nodes,\nwhich is only 0.14% of the original problem size. 
The dataset teddy remains unsolved: though\nthe size of the problem was reduced from the original 168750 to 2062 nodes, they constituted a\nnon-manageable task for CPLEX, presumably because of the big number of labels, 60 in each node.\n\nDataset     EG,\u03b8(x\u2217)   Step (1) LP       Step (3) ILP      MCA        MPLP\n                       # it   time, s    # it   time, s    time, s    # LP it  LP time, s  ILP time, s\npfau        24010.44   1000   276        14     14         > 55496    10000    > 15000\npalm        12253.75   200    65         17     93         561        700      1579        3701\nclownfish   14794.18   100    32         8      10         328        350      790         181\ncrops       11853.12   100    32         6      6          355        350      797         1601\nstrawberry  11766.34   100    29         8      31         483        350      697         1114\n\nTable 2: Exemplary Potts model comparison. Datasets taken from the Color segmentation (N8)\nset. Column EG,\u03b8(x\u2217) shows the optimal energy value, columns Step (1) LP and Step (3) ILP\ncontain number of iterations and time spent at the steps (1) and (3) of Algorithm 1, corresponding to\nsolving the LP relaxation and use of a combinatorial solver respectively. The column MCA stands\nfor the time of the multiway-cut solver reported in [21]. The MPLP [17] column provides number\nof iterations and time of the LP presolve and the time of the tightening cutting plane phase (ILP).\n\nMRF photomontage models are difficult for dual solvers like TRW-S because their range of values\nin pairwise factors is quite large and varies from 0 to more than 500000 in a factor. Hence we used\n10000 iterations of TRW-S at the first step of Algorithm 1. For the family dataset the algorithm\ndecreased the size of the problem for CPLEX from originally over 400000 nodes to slightly more\nthan 100 and found a solution of the whole problem. In contrast to family the initial subgraph B\nfor the panorama dataset is much larger (about 25000 nodes) and CPLEX gave up.\nMRF inpainting. 
Though applying TRW-S to both datasets penguin and house allows to de-\ncrease the problem to about 0.5% of its original size, the resulting subgraphs B of respectively 141\nand 856 nodes were too large for CPLEX, presumably because of the big number (256) of labels.\n\n(a) Original image\n\n(b) Kovtun\u2019s method\n\n(c) Our approach\n\n(d) Optimal Labeling\n\nFigure 2: Results for the pfau-instance from [15]. Gray pixels in (b) and (c) mark nodes that\nneed to be labeled by the combinatorial solver. Our approach (c) leads to much smaller combina-\ntorial problem instances than Kovtun\u2019s method [29] (b) used in [30]. While Kovtun\u2019s method gets\npartial optimality for 5% of the nodes only, our approach requires to solve only tiny problems by a\ncombinatorial solver.\n\nPotts models. Our approach appeared to be especially ef\ufb01cient for Potts models. We tested it on\nthe following datasets from the benchmark [15]: Color segmentation (N4), Color segmentation\n(N8), Color segmentation, Brain and managed to solve all 26 problem instances to optimality.\nSolving Potts models to optimality is not a big issue anymore due to the recent work [21], which\nrelated this problems to the multiway-cut problem [31] and adopted a quite ef\ufb01cient solver based on\nthe cutting plane technique. However, we were able to outperform even this specialized solver on\nhard instances, which we collected in Table 2. There is indeed a simple explanation for this phe-\nnomenon: the dif\ufb01cult instances are those, for which the optimal labeling contains many small areas\ncorresponding to different labels, see e.g. Fig. 2. This is not very typical for Potts models, where an\noptimal labeling typically consists of a small number of large segments. Since the number of cutting\nplanes, which have to be processed by the multiway-cut solver, grows with the total length of the\nsegment borders, the overall performance signi\ufb01cantly drops on such instances. 
Our approach is\nable to correctly label most of the borders when solving the LP relaxation. Since the resulting subgraph\nB, passed to the combinatorial solver, is quite small, the corresponding subproblems appear\neasy to solve even for a general-purpose solver like CPLEX. Indeed, we expect an increase in the\noverall performance of our method if the multiway-cut solver were used in place of CPLEX.\nFor Potts models there exist methods [29,32] providing part of an optimal solution, known as partial\noptimality. Often they allow the problem to be simplified drastically, so that it can be solved to global\noptimality on the remaining variables very fast, see [30]. However for hard instances like pfau these\nmethods can label only a small fraction of graph nodes persistently, hence combinatorial solvers\ncannot solve the rest, or require a lot of time. Our method does not provide partially optimal variables:\nif it cannot solve the whole problem no node can be labelled as optimal at all. On the upside\nthe subgraph B which is given to a combinatorial solver is typically much smaller, see Fig. 2.\nFor comparison we tested the MPLP solver [17], which is based on coordinate descent\nLP iterations and tightens the LP relaxation with the cutting plane approach described\nin [33]. We used its publicly available code [34]. However, this solver did\nnot manage to solve any of the considered difficult problems (marked as unsolved in\nthe OpenGM Benchmark [15]), such as color-seg-n8/pfau, mrf stereo/{venus,\nteddy}, mrf photomontage/{family, pano}. For easier instances of the Potts model,\nwe found our solver an order of magnitude faster than MPLP (see Table 2 for the exemplary comparison),\nthough we tried different numbers of LP presolve iterations to speed up the MPLP.\nSummary. Our experiments show that our method, used even with quite general and not always the\nmost efficient solvers like TRW-S and CPLEX, allows us to (i) find globally optimal solutions of large\nscale problem instances, which were previously unsolvable; (ii) solve hard instances of Potts models\nan order of magnitude faster than with a modern specialized combinatorial multiway-cut method;\n(iii) outperform the cutting-plane based MPLP method on the tested datasets.\n\n5 Conclusions and future work\n\nThe method proposed in this paper provides a novel way of combining convex and combinatorial\nalgorithms to solve large scale optimization problems to a global optimum.\nIt efficiently extracts the subgraph where the LP relaxation is not tight and combinatorial algorithms have\nto be applied. Since this subgraph often corresponds to only a tiny fraction of the initial problem, the\ncombinatorial search becomes feasible. The method is very generic: any linear programming and\ncombinatorial solvers can be used to carry out the respective steps of Algorithm 1. It is particularly\nefficient for sparse graphs and when the LP relaxation is almost tight.\nIn the future we plan to generalize the method to higher order models, tighter convex relaxations for\nthe convex part of our solver and apply alternative and specialized solvers both for the convex and\nthe combinatorial parts of our approach.\nAcknowledgement. This work has been supported by the German Research Foundation (DFG) within the\nprogram Spatio-/Temporal Graphical Models and Applications in Image Analysis, grant GRK 1653. Authors\nthank A. Shekhovtsov, B. Flach, T. Werner, K. Antoniuk and V. Franc from the Center for Machine Perception\nof the Czech Technical University in Prague for fruitful discussions.\n\nReferences\n[1] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.\n[2] T. Werner. 
A linear programming approach to max-sum problem: A review. IEEE Trans. on PAMI, 29(7), July 2007.\n[3] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1\u2013305, 2008.\n[4] M. Schlesinger. Syntactic analysis of two-dimensional visual signals in the presence of noise. Kibernetika, (4):113\u2013130, 1976.\n[5] M. Wainwright, T. Jaakkola, and A. Willsky. MAP estimation via agreement on (hyper)trees: message passing and linear programming approaches. IEEE Trans. on Inf. Th., 51(11), 2005.\n[6] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Trans. on PAMI, 33(3):531\u2013552, March 2011.\n[7] B. Savchynskyy, J. H. Kappes, S. Schmidt, and C. Schn\u00f6rr. A study of Nesterov\u2019s scheme for Lagrangian decomposition and MAP labeling. In CVPR 2011, 2011.\n[8] S. Schmidt, B. Savchynskyy, J. H. Kappes, and C. Schn\u00f6rr. Evaluation of a first-order primal-dual algorithm for MRF energy minimization. In EMMCVPR, pages 89\u2013103, 2011.\n[9] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In ECML/PKDD (2), pages 470\u2013483, 2011.\n[10] A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. An augmented Lagrangian approach to constrained MAP inference. In ICML, 2011.\n[11] B. Savchynskyy, S. Schmidt, J. H. Kappes, and C. Schn\u00f6rr. Efficient MRF energy minimization via adaptive diminishing smoothing. In UAI-2012, pages 746\u2013755.\n[12] D. V. N. Luong, P. Parpas, D. Rueckert, and B. Rustem. Solving MRF minimization by mirror descent. In Advances in Visual Computing, volume 7431, pages 587\u2013598. Springer Berlin Heidelberg, 2012.\n[13] J. H. Kappes, B. Savchynskyy, and C. Schn\u00f6rr. A bundle approach to efficient MAP-inference by Lagrangian relaxation. 
In CVPR 2012, 2012.\n[14] B. Savchynskyy and S. Schmidt. Getting feasible variable estimates from infeasible ones: MRF local polytope study. Technical report, arXiv:1210.4081, 2012.\n[15] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schn\u00f6rr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, J. Lellmann, N. Komodakis, and C. Rother. A comparative study of modern inference techniques for discrete energy minimization problems. In CVPR, 2013.\n[16] V. Kolmogorov. Convergent tree-reweighted message passing for energy minimization. IEEE Trans. on PAMI, 28(10):1568\u20131583, 2006.\n[17] A. Globerson and T. Jaakkola. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2007.\n[18] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Trans. on Inf. Theory, 56(12):6294\u20136316, 2010.\n[19] M. I. Schlesinger and K. V. Antoniuk. Diffusion algorithms and structural recognition optimization problems. Cybernetics and Systems Analysis, 47(2):175\u2013192, 2011.\n[20] V. Franc, S. Sonnenburg, and T. Werner. Cutting-Plane Methods in Machine Learning, chapter 7, pages 185\u2013218. The MIT Press, Cambridge, USA, 2012.\n[21] J. H. Kappes, M. Speth, B. Andres, G. Reinelt, and C. Schn\u00f6rr. Globally optimal image partitioning by multicuts. In EMMCVPR, 2011.\n[22] M. Bergtholdt, J. H. Kappes, S. Schmidt, and C. Schn\u00f6rr. A study of parts-based object class detection using complete graphs. IJCV, 87(1-2):93\u2013117, 2010.\n[23] M. Sun, M. Telaprolu, H. Lee, and S. Savarese. Efficient and exact MAP-MRF inference using branch and bound. In AISTATS-2012.\n[24] L. Otten and R. Dechter. Anytime AND/OR depth-first search for combinatorial optimization. In Proceedings of the Annual Symposium on Combinatorial Search (SOCS), 2011.\n[25] M. C. Cooper, S. de Givry, M. Sanchez, T. Schiex, M. Zytnicki, and T. Werner. 
Soft arc consistency revisited. Artificial Intelligence, 174(7-8):449\u2013478, May 2010.\n[26] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. on PAMI, 30:1068\u20131080, June 2008.\n[27] ILOG, Inc. ILOG CPLEX: High-performance software for mathematical programming and optimization. See http://www.ilog.com/products/cplex/.\n[28] B. Andres, T. Beier, and J. H. Kappes. OpenGM: A C++ library for discrete graphical models. ArXiv e-prints, 2012. Project page: http://hci.iwr.uni-heidelberg.de/opengm2/.\n[29] I. Kovtun. Partial optimal labeling search for a NP-hard subclass of (max, +) problems. In Proceedings of the DAGM Symposium, 2003.\n[30] J. H. Kappes, M. Speth, G. Reinelt, and C. Schn\u00f6rr. Towards efficient and exact MAP-inference for large scale discrete computer vision problems via combinatorial optimization. In CVPR, 2013.\n[31] S. Chopra and M. R. Rao. On the multiway cut polyhedron. Networks, 21(1):51\u201389, 1991.\n[32] P. Swoboda, B. Savchynskyy, J. H. Kappes, and C. Schn\u00f6rr. Partial optimality via iterative pruning for the Potts model. In SSVM, 2013.\n[33] D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, and T. Jaakkola. Tightening LP relaxations for MAP using message-passing. In UAI-2008, pages 503\u2013510.\n[34] D. Sontag. 
C++ code for MAP inference in graphical models. See http://cs.nyu.edu/~dsontag/code/mplp_ver2.tgz.", "award": [], "sourceid": 991, "authors": [{"given_name": "Bogdan", "family_name": "Savchynskyy", "institution": "University of Heidelberg"}, {"given_name": "J\u00f6rg Hendrik", "family_name": "Kappes", "institution": "University of Heidelberg"}, {"given_name": "Paul", "family_name": "Swoboda", "institution": "University of Heidelberg"}, {"given_name": "Christoph", "family_name": "Schn\u00f6rr", "institution": "University of Heidelberg"}]}