{"title": "Decomposable Submodular Function Minimization: Discrete and Continuous", "book": "Advances in Neural Information Processing Systems", "page_first": 2870, "page_last": 2880, "abstract": "This paper investigates connections between discrete and continuous approaches for decomposable submodular function minimization. We provide improved running time estimates for the state-of-the-art continuous algorithms for the problem using combinatorial arguments. We also provide a systematic experimental comparison of the two types of methods, based on a clear distinction between level-0 and level-1 algorithms.", "full_text": "Decomposable Submodular Function Minimization\n\nDiscrete and Continuous\n\nAlina Ene\u2217\n\nHuy L. Nguy\u02dc\u00ean\u2020\n\nL\u00e1szl\u00f3 A. V\u00e9gh\u2021\n\nAbstract\n\nThis paper investigates connections between discrete and continuous approaches for\ndecomposable submodular function minimization. We provide improved running\ntime estimates for the state-of-the-art continuous algorithms for the problem using\ncombinatorial arguments. We also provide a systematic experimental comparison\nof the two types of methods, based on a clear distinction between level-0 and\nlevel-1 algorithms.\n\n1\n\nIntroduction\n\nSubmodular functions arise in a wide range of applications: graph theory, optimization, economics,\ngame theory, to name a few. A function f : 2V \u2192 R on a ground set V is submodular if f (X) +\nf (Y ) \u2265 f (X \u2229 Y ) + f (X \u222a Y ) for all sets X, Y \u2286 V . Submodularity can also be interpreted as a\ndiminishing returns property.\nThere has been signi\ufb01cant interest in submodular optimization in the machine learning and computer\nvision communities. The submodular function minimization (SFM) problem arises in problems\nin image segmentation or MAP inference tasks in Markov Random Fields. Landmark results in\ncombinatorial optimization give polynomial-time exact algorithms for SFM. 
However, the high-degree polynomial dependence in the running time is prohibitive for large-scale problem instances. The main objective in this context is to develop fast and scalable SFM algorithms.

Instead of minimizing arbitrary submodular functions, several recent papers aim to exploit special structural properties of submodular functions arising in practical applications. This paper focuses on the popular model of decomposable submodular functions. These are functions that can be written as sums of several “simple” submodular functions defined on small supports.

Some definitions are needed to introduce our problem setting. Let f : 2^V → ℝ be a submodular function, and let n := |V|. We can assume w.l.o.g. that f(∅) = 0. We are interested in solving the submodular function minimization problem:

    min_{S ⊆ V} f(S).    (SFM)

For a vector y ∈ ℝ^V and a set S ⊆ V, we use the notation y(S) := Σ_{v ∈ S} y(v). The base polytope of a submodular function is defined as

    B(f) := {y ∈ ℝ^V : y(S) ≤ f(S) ∀S ⊆ V, y(V) = f(V)}.

One can optimize linear functions over B(f) using the greedy algorithm. The SFM problem can be reduced to finding the minimum norm point of the base polytope B(f) [10]:

    min { (1/2)‖y‖₂² : y ∈ B(f) }.    (Min-Norm)

∗Department of Computer Science, Boston University, aene@bu.edu
†College of Computer and Information Science, Northeastern University, hu.nguyen@northeastern.edu
‡Department of Mathematics, London School of Economics, L.Vegh@lse.ac.uk

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

This reduction is the starting point of convex optimization approaches for SFM.
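To make these definitions concrete, here is a small self-contained sketch (our own illustration, not code from the paper) that checks the submodular inequality for a graph cut function on four vertices and solves (SFM) by exhaustive search:

```python
from itertools import combinations

def cut_value(edges, S):
    """Cut function f(S) = number of edges crossing the set S."""
    return sum(1 for (u, v) in edges if (u in S) != (v in S))

V = range(4)
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]  # a small example graph
subsets = [frozenset(c) for k in range(len(V) + 1)
           for c in combinations(V, k)]

# submodularity: f(X) + f(Y) >= f(X ∩ Y) + f(X ∪ Y) for all X, Y ⊆ V
assert all(cut_value(edges, X) + cut_value(edges, Y)
           >= cut_value(edges, X & Y) + cut_value(edges, X | Y)
           for X in subsets for Y in subsets)

# brute-force SFM; for a nonnegative cut function with f(∅) = 0,
# the empty set is a minimizer
best = min(subsets, key=lambda S: cut_value(edges, S))
print(cut_value(edges, best))  # -> 0
```

Exhaustive search takes 2^n evaluations and is only meant to illustrate the problem; the polynomial-time algorithms discussed in the paper avoid this enumeration entirely.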
We refer the reader to Sections 44–45 in [28] for concepts and results in submodular optimization, and to [2] on machine learning applications.

We assume that f is given in the decomposition f(S) = Σ_{i=1}^r f_i(S), where each f_i : 2^V → ℝ is a submodular function. Such functions are called decomposable or Sum-of-Submodular (SoS) in the literature. In the decomposable submodular function minimization (DSFM) problem, we aim to minimize a function given in such a decomposition. We will make the following assumptions.

For each i ∈ [r], we assume that two oracles are provided: (i) a value oracle that returns f_i(S) for any set S ⊆ V in time EO_i; and (ii) a quadratic minimization oracle O_i(w). For any input vector w ∈ ℝ^n, this oracle returns an optimal solution to (Min-Norm) for the function f_i + w, or equivalently, an optimal solution to min_{y ∈ B(f_i)} ‖y + w‖₂². We let Θ_i denote the running time of a single call to the oracle O_i, Θ_max := max_{i ∈ [r]} Θ_i denote the maximum time of an oracle call, and Θ_avg := (1/r) Σ_{i ∈ [r]} Θ_i denote the average time of an oracle call.⁴ We let F_{i,max} := max_{S ⊆ V} |f_i(S)| and F_max := max_{S ⊆ V} |f(S)| denote the maximum function values. For each i ∈ [r], the function f_i has an effective support C_i such that f_i(S) = f_i(S ∩ C_i) for every S ⊆ V.

DSFM thus requires algorithms on two levels. The level-0 algorithms are the subroutines used to evaluate the oracles O_i for every i ∈ [r]. The level-1 algorithm minimizes the function f using the level-0 algorithms as black boxes.

1.1 Prior work

SFM has had a long history in combinatorial optimization since the early 1970s, following the influential work of Edmonds [4]. The first polynomial-time algorithm was obtained via the ellipsoid method [14]; recent work presented substantial improvements using this approach [22].
Substantial work focused on designing strongly polynomial combinatorial algorithms [9, 15, 16, 25, 17, 27]. Still, designing practical algorithms for SFM that can be applied to large-scale problem instances remains an open problem.

Let us now turn to DSFM. Previous work mainly focused on level-1 algorithms. These can be classified as discrete and continuous optimization methods. The discrete approach builds on techniques of classical discrete algorithms for network flows and for submodular flows. Kolmogorov [21] showed that the problem can be reduced to submodular flow maximization, and also presented a more efficient augmenting path algorithm. Subsequent discrete approaches were given in [1, 7, 8]. Continuous approaches start with the convex programming formulation (Min-Norm). Gradient methods were applied for the decomposable setting in [5, 24, 30].

Less attention has been given to the level-0 algorithms. Some papers mainly focus on theoretical guarantees on the running time of level-1 algorithms, and treat the level-0 subroutines as black boxes (e.g. [5, 24, 21]). In other papers (e.g. [18, 30]), the model is restricted to functions f_i of a simple specific type that are easy to minimize. An alternative assumption is that all C_i's are small, of size at most k, and thus these oracles can be evaluated by exhaustive search, in 2^k value oracle calls (e.g. [1, 7]). Shanu et al. [29] use a block coordinate descent method for level-1, and make no assumptions on the functions f_i. The oracles are evaluated via the Fujishige-Wolfe minimum norm point algorithm [11, 31] for level-0.

Let us note that these experimental studies considered the level-0 and level-1 algorithms as a single “package”. For example, Shanu et al. [29] compare the performance of their SoS Min-Norm algorithm to the continuous approach of Jegelka et al. [18] and the combinatorial approach of Arora et al.
[1]. However, these implementations cannot be directly compared, since they use three different level-0 algorithms: Fujishige-Wolfe in SoS Min-Norm, a general QP solver for the algorithm of [18], and exhaustive search for [1]. For potentials of large support, Fujishige-Wolfe outperforms these other level-0 subroutines; hence the level-1 algorithms in [18, 1] could have compared more favorably using the same Fujishige-Wolfe subroutine.

⁴For flow-type algorithms for DSFM, a slightly weaker oracle assumption suffices, returning a minimizer of min_{S ⊆ C_i} f_i(S) + w(S) for any given w ∈ ℝ^{C_i}. This oracle and the quadratic minimization oracle are reducible to each other: the former reduces to a single call to the latter, and one can implement the latter using O(|C_i|) calls to the former (see e.g. [2]).

1.2 Our contributions

Our paper establishes connections between discrete and continuous methods for DSFM, as well as provides a systematic experimental comparison of these approaches. Our main theoretical contribution improves the worst-case complexity bound of the most recent continuous optimization methods [5, 24] by a factor of r, the number of functions in the decomposition. This is achieved by improving the bounds on the relevant condition numbers. Our proof exploits ideas from the discrete optimization approach. This provides not only better, but also considerably simpler arguments than the algebraic proof in [24].

The guiding principle of our experimental work is the clean conceptual distinction between the level-0 and level-1 algorithms, and to compare different level-1 algorithms by using the same level-0 subroutines. We compare the state-of-the-art continuous and discrete algorithms: RCDM and ACDM from [5] with Submodular IBFS from [7].
We consider multiple options for the level-0 subroutines. For certain potential types, we use tailored subroutines exploiting the specific form of the problem. We also consider a variant of the Fujishige-Wolfe algorithm as a subroutine applicable for arbitrary potentials.

Our experimental results reveal the following tradeoff. Discrete algorithms on level-1 require more calls to the level-0 oracle, but less overhead computation. Hence using algorithms such as IBFS on level-1 can be significantly faster than gradient descent, as long as the potentials have fairly small supports. However, as the size of the potentials grows, or if we need to work with a generic level-0 algorithm, gradient methods are preferable. Gradient methods can also perform better for larger potentials due to weaker requirements on the level-0 subroutines: approximate level-0 subroutines suffice for them, whereas discrete algorithms require exact optimal solutions on level-0.

Paper outline. The rest of the paper is structured as follows. The level-1 algorithmic frameworks using discrete and convex optimization are described in Sections 2 and 3, respectively. Section 4 gives improved convergence guarantees for the gradient descent algorithms outlined in Section 3. Section 5 discusses the different types of level-0 algorithms and how they can be used together with the level-1 frameworks. Section 6 presents a brief overview of our experimental results.

This is an extended abstract. The full paper is available at http://arxiv.org/abs/1703.01830.

2 Discrete optimization algorithms on Level-1

In this section, we outline a level-1 algorithmic framework for DSFM that is based on a combinatorial framework first studied by Fujishige and Zhang [12] for submodular intersection.
The submodular intersection problem is equivalent to DSFM for the sum of two functions, and the approach can be adapted and extended to the general DSFM problem with an arbitrary decomposition. We now give a brief description of the algorithmic framework. The full version exhibits submodular versions of the Edmonds-Karp and preflow-push algorithms.

Algorithmic framework. For a decomposable function f, every x ∈ B(f) can be written as x = Σ_{i=1}^r x_i, where supp(x_i) ⊆ C_i and x_i ∈ B(f_i) (see e.g. Theorem 44.6 in [28]). A natural algorithmic approach is to maintain an x ∈ B(f) in such a representation, and iteratively update it using the combinatorial framework described below. DSFM can be cast as a maximum network flow problem in a network that is suitably defined based on the current point x. This can be viewed as an analogue of the residual graph in the maxflow/mincut setting, and it is precisely the residual graph if the DSFM instance is a minimum cut instance.

The auxiliary graph. For an x ∈ B(f) of the form x = Σ_{i=1}^r x_i, we construct the following directed auxiliary graph G = (V, E), with E = ∪_{i=1}^r E_i and capacities c : E → ℝ₊. E is a multiset union: we include parallel copies if the same arc occurs in multiple E_i. The arc sets E_i are complete directed graphs (cliques) on C_i, and for an arc (u, v) ∈ E_i, we define c(u, v) := min{f_i(S) − x_i(S) : S ⊆ C_i, u ∈ S, v ∉ S}. This is the maximum value ε such that x′_i ∈ B(f_i), where x′_i(u) = x_i(u) + ε, x′_i(v) = x_i(v) − ε, and x′_i(z) = x_i(z) for z ∉ {u, v}.

Let N := {v ∈ V : x(v) < 0} and P := {v ∈ V : x(v) > 0}. The algorithm aims to improve the current x by updating along shortest directed paths from N to P with positive capacity; there are several ways to update the solution, and we discuss specific approaches (derived from maximum flow algorithms) in the full version.
If there exists no such directed path, then we let S denote the set reachable from N on directed paths with positive capacity; thus, S ∩ P = ∅. One can show that S is a minimizer of the function f.

Updating along a shortest path Q from N to P amounts to the following. Let ε denote the minimum capacity of an arc on Q. If (u, v) ∈ Q ∩ E_i, then we increase x_i(u) by ε and decrease x_i(v) by ε. The crucial technical claim is the following. Let d(u) denote the shortest path distance on positive capacity arcs from u to the set P. Then an update along a shortest directed path from N to P results in a feasible x ∈ B(f), and further, all distance labels d(u) are non-decreasing. We refer the reader to Fujishige and Zhang [12] for a proof of this claim.

Level-1 algorithms based on the network flow approach. Using the auxiliary graph described above, and updating on shortest augmenting paths, one can generalize several maximum flow algorithms to a level-1 algorithm of DSFM. In particular, based on the preflow-push algorithm [13], one can obtain a strongly polynomial DSFM algorithm with running time O(n²Θ_max Σ_{i=1}^r |C_i|²). A scaling variant provides a weakly polynomial running time O(n²Θ_max log F_max + n Σ_{i=1}^r |C_i|³ Θ_i). We defer the details to the full version of the paper.

In our experiments, we use the submodular IBFS algorithm [7] as the main discrete level-1 algorithm; the same running time estimate as for preflow-push is applicable. If all C_i's are small, O(1), the running time is O(n²rΘ_max); note that r = Ω(n) in this case.
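To make the capacity definition concrete, the following brute-force sketch (our own illustration, not the paper's code; exhaustive search over S ⊆ C_i is only viable for small supports) computes c(u, v) directly from its definition:

```python
from itertools import combinations

def exchange_capacity(f_i, x_i, C_i, u, v):
    """c(u, v) = min{ f_i(S) - x_i(S) : S ⊆ C_i, u ∈ S, v ∉ S },
    evaluated by exhaustive search over all admissible sets S."""
    rest = [w for w in C_i if w not in (u, v)]
    best = float("inf")
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            S = frozenset({u} | set(extra))  # u ∈ S, v excluded by construction
            best = min(best, f_i(S) - sum(x_i[w] for w in S))
    return best

# tiny example: f_i is the cut function of the single edge (0, 1), and x_i = 0
C_i = [0, 1]
f_i = lambda S: 1 if len(S & {0, 1}) == 1 else 0
x_i = {0: 0.0, 1: 0.0}
print(exchange_capacity(f_i, x_i, C_i, 0, 1))  # -> 1.0
```

The level-1 algorithms would query this quantity through the oracle O_i rather than by enumeration; the sketch only spells out what the oracle is asked to compute.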
This program has a unique optimal solution s∗, and the set S = {v ∈ V : s∗(v) < 0} is the unique smallest minimizer to the SFM problem. We will refer to this optimal solution s∗ throughout the section.

In the DSFM setting, one can write (Min-Norm) in multiple equivalent forms [18]. For the first formulation, we let P := Π_{i=1}^r B(f_i) ⊆ ℝ^{rn}, and let A ∈ ℝ^{n×(rn)} denote the matrix A := [I_n I_n … I_n], with I_n repeated r times. Note that, for every y ∈ P, Ay = Σ_{i=1}^r y_i, where y_i is the i-th block of y, and thus Ay ∈ B(f). The problem (Min-Norm) can be reformulated for DSFM as follows:

    min { (1/2)‖Ay‖₂² : y ∈ P }.    (Prox-DSFM)

The second formulation is the following. Let us define the subspace 𝒜 := {a ∈ ℝ^{nr} : Aa = 0}, and minimize its distance from P:

    min { ‖a − y‖₂² : a ∈ 𝒜, y ∈ P }.    (Best-Approx)

The set of optimal solutions for both formulations (Prox-DSFM) and (Best-Approx) is the set E := {y ∈ P : Ay = s∗}, where s∗ is the optimum of (Min-Norm). We note that, even though the solutions to (Best-Approx) are pairs of points (a, y) ∈ 𝒜 × P, the optimal solutions are uniquely determined by y ∈ P, since the corresponding a is the projection of y to 𝒜.

3.2 Level-1 algorithms based on gradient descent

The gradient descent algorithms of [24, 5] provide level-1 algorithms for DSFM. We provide a brief overview of these algorithms and refer the reader to the respective papers for more details.

The alternating projections algorithm. Nishihara et al. [24] minimize (Best-Approx) using alternating projections.
The algorithm starts with a point a(0) ∈ 𝒜 and iteratively constructs a sequence {(a(k), x(k))}_{k≥0} by projecting onto 𝒜 and P: x(k) = argmin_{x ∈ P} ‖a(k) − x‖₂ and a(k+1) = argmin_{a ∈ 𝒜} ‖a − x(k)‖₂.

Random coordinate descent algorithms. Ene and Nguyen [5] minimize (Prox-DSFM) using random coordinate descent. The RCDM algorithm adapts the random coordinate descent algorithm of Nesterov [23] to (Prox-DSFM). In each iteration, the algorithm samples a block i ∈ [r] uniformly at random and updates x_i via a standard gradient descent step for smooth functions. ACDM, the accelerated version of the algorithm, presents a further enhancement using techniques from [6].

3.3 Rates of convergence and condition numbers

The algorithms mentioned above enjoy a linear convergence rate despite the fact that the objective functions of (Best-Approx) and (Prox-DSFM) are not strongly convex. Instead, the works [24, 5] show that there are certain parameters that one can associate with the objective functions such that the convergence is at the rate (1 − α)^k, where α ∈ (0, 1) is a quantity that depends on the appropriate parameter. Let us now define these parameters.

Let 𝒜′ be the affine subspace 𝒜′ := {a ∈ ℝ^{nr} : Aa = s∗}. Note that the set E of optimal solutions to (Prox-DSFM) and (Best-Approx) is E = P ∩ 𝒜′. For y ∈ ℝ^{nr} and a closed set K ⊆ ℝ^{nr}, we let d(y, K) = min{‖y − z‖₂ : z ∈ K} denote the distance between y and K. The relevant parameter for the alternating projections algorithm is defined as follows.

Definition 3.1 ([24]).
For every y ∈ (P ∪ 𝒜′) \ E, let

    κ(y) := d(y, E) / max{d(y, P), d(y, 𝒜′)},   and   κ∗ := sup{κ(y) : y ∈ (P ∪ 𝒜′) \ E}.

The relevant parameter for the random coordinate descent algorithms is the following.

Definition 3.2 ([5]). For every y ∈ P, let y∗ := argmin_p {‖p − y‖₂ : p ∈ E} be the optimal solution to (Prox-DSFM) that is closest to y. We say that the objective function (1/2)‖Ay‖₂² of (Prox-DSFM) is restricted ℓ-strongly convex if, for all y ∈ P, we have

    ‖A(y − y∗)‖₂² ≥ ℓ‖y − y∗‖₂².

We define

    ℓ∗ := sup{ℓ : (1/2)‖Ay‖₂² is restricted ℓ-strongly convex}.

The running time dependence of the algorithms on these parameters is given in the following theorems.

Theorem 3.3 ([24]). Let (a(0), x(0) = argmin_{x ∈ P} ‖a(0) − x‖₂) be the initial solution and let (a∗, x∗) be an optimal solution to (Best-Approx). The alternating projections algorithm produces in

    k = Θ(κ∗² ln(‖x(0) − x∗‖₂ / ε))

iterations a pair of points a(k) ∈ 𝒜 and x(k) ∈ P that is ε-optimal, i.e., ‖a(k) − x(k)‖₂² ≤ ‖a∗ − x∗‖₂² + ε.

Theorem 3.4 ([5]). Let x(0) ∈ P be the initial solution and let x∗ be an optimal solution to (Prox-DSFM) that minimizes ‖x(0) − x∗‖₂. The random coordinate descent algorithm produces in

    k = Θ((r/ℓ∗) ln(‖x(0) − x∗‖₂ / ε))

iterations a solution x(k) that is ε-optimal in expectation, i.e., E[(1/2)‖Ax(k)‖₂²] ≤ (1/2)‖Ax∗‖₂² + ε. The accelerated coordinate descent algorithm produces in

    k = Θ(r √(1/ℓ∗) ln(‖x(0) − x∗‖₂ / ε))

iterations (specifically, Θ(√(1/ℓ∗) ln(‖x(0) − x∗‖₂ / ε)) epochs with Θ(r) iterations in each epoch) a solution x(k) that is ε-optimal in expectation, i.e., E[(1/2)‖Ax(k)‖₂²] ≤ (1/2)‖Ax∗‖₂² + ε.

3.4 Tight analysis for the condition numbers and running times

We provide a tight analysis for the condition numbers (the parameters κ∗ and ℓ∗ defined above). This leads to improved upper bounds on the running times of the gradient descent algorithms.

Theorem 3.5. Let κ∗ and ℓ∗ be the parameters defined in Definition 3.1 and Definition 3.2. We have κ∗ = Θ(n√r) and ℓ∗ = Θ(1/n²).

Using our improved convergence guarantees, we obtain the following improved running time analyses.

Corollary 3.6.
The total running time for obtaining an ε-approximate solution⁵ is as follows.

• Alternating projections (AP): O(n²r²Θ_avg ln(‖x(0) − x∗‖₂ / ε)).

• Random coordinate descent (RCDM): O(n²rΘ_avg ln(‖x(0) − x∗‖₂ / ε)).

• Accelerated random coordinate descent (ACDM): O(nrΘ_avg ln(‖x(0) − x∗‖₂ / ε)).

We can upper bound the diameter of the base polytope by O(√n F_max) [19], and thus ‖x(0) − x∗‖₂ = O(√n F_max). For integer-valued functions, an ε-approximate solution can be converted to an exact optimum if ε = O(1/n) [2].

The upper bound on κ∗ and the lower bound on ℓ∗ are shown in Theorem 4.2. The lower bound on κ∗ and the upper bound on ℓ∗ in Theorem 3.5 follow by constructions in previous work, as explained next. Nishihara et al. showed that κ∗ ≤ nr, and they give a family of minimum cut instances for which κ∗ = Ω(n√r). Namely, consider a graph with n vertices and m edges, and suppose for simplicity that the edges have integer capacities at most C. The cut function of the graph can be decomposed into functions corresponding to the individual edges, and thus r = m and Θ_avg = O(1).
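The per-edge decomposition used in this construction is easy to write out explicitly; the sketch below (with our own helper names, not the paper's code) builds one submodular function per edge of a cycle graph and verifies f(S) = Σ_i f_i(S):

```python
def edge_potential(u, v):
    """Submodular function of a single edge: 1 if the edge crosses S, else 0.
    Its effective support is C_i = {u, v}, so evaluation is O(1)."""
    return lambda S: 1 if (u in S) != (v in S) else 0

n = 5
edges = [(i, (i + 1) % n) for i in range(n)]       # simple cycle graph
fis = [edge_potential(u, v) for (u, v) in edges]   # r = m functions

def cut(S):
    """The graph cut function as the sum of the per-edge functions."""
    return sum(fi(S) for fi in fis)

# the set {0, 1} is separated from the rest by exactly two cycle edges
print(cut({0, 1}))  # -> 2
```

Since every f_i here depends on only two elements, each oracle call is constant time, matching the remark that Θ_avg = O(1) for this family of instances.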
Already on simple cycle graphs, they show that the running time of AP is Ω(n²m² ln(nC)), which implies κ∗ = Ω(n√r). Using the same construction, it is easy to obtain the upper bound ℓ∗ = O(1/n²).

4 Tight convergence bounds for the convex optimization algorithms

In this section, we show that the combinatorial approach introduced in Section 2 can be applied to obtain better bounds on the parameters κ∗ and ℓ∗ defined in Section 3. Besides giving a stronger bound, our proof is considerably simpler than the algebraic one using Cheeger's inequality in [24]. The key is the following lemma.

Lemma 4.1. Let y ∈ P and s∗ ∈ B(f). Then there exists a point x ∈ P such that Ax = s∗ and ‖x − y‖₂ ≤ (√n/2)‖Ay − s∗‖₁.

Before proving this lemma, we show how it can be used to derive the bounds.

Theorem 4.2. We have κ∗ ≤ n√r/2 + 1 and ℓ∗ ≥ 4/n².

Proof: We start with the bound on κ∗. In order to bound κ∗, we need to upper bound κ(y) for any y ∈ (P ∪ 𝒜′) \ E. We distinguish between two cases: y ∈ P \ E and y ∈ 𝒜′ \ E.

Case I: y ∈ P \ E. The denominator in the definition of κ(y) is equal to d(y, 𝒜′) = ‖Ay − s∗‖₂/√r. This follows since the closest point a = (a₁, …, a_r) to y in 𝒜′ can be obtained as a_i = y_i + (s∗ − Ay)/r for each i ∈ [r]. Lemma 4.1 gives an x ∈ P such that Ax = s∗ and ‖x − y‖₂ ≤ (√n/2)‖Ay − s∗‖₁ ≤ (n/2)‖Ay − s∗‖₂. Since Ax = s∗, we have x ∈ E and thus the numerator of κ(y) is at most ‖x − y‖₂. Thus κ(y) ≤ ‖x − y‖₂ / (‖Ay − s∗‖₂/√r) ≤ n√r/2.

Case II: y ∈ 𝒜′ \ E. This means that Ay = s∗. The denominator of κ(y) is equal to d(y, P). For each i ∈ [r], let q_i ∈ B(f_i) be the point that minimizes ‖y_i − q_i‖₂. Let q = (q₁, …, q_r) ∈ P. Then d(y, P) = ‖y − q‖₂. Lemma 4.1 with q in place of y gives a point x ∈ E such that ‖q − x‖₂ ≤ (√n/2)‖Aq − s∗‖₁. We have ‖Aq − s∗‖₁ = ‖Aq − Ay‖₁ ≤ Σ_{i=1}^r ‖q_i − y_i‖₁ = ‖q − y‖₁ ≤ √(nr)‖q − y‖₂. Thus ‖q − x‖₂ ≤ (n√r/2)‖q − y‖₂. Since x ∈ E, we have d(y, E) ≤ ‖x − y‖₂ ≤ ‖x − q‖₂ + ‖q − y‖₂ ≤ (1 + n√r/2)‖q − y‖₂ = (1 + n√r/2) d(y, P). Therefore κ(y) ≤ 1 + n√r/2, as desired.

Let us now prove the bound on ℓ∗. Let y ∈ P and let y∗ := argmin_p {‖p − y‖₂ : p ∈ E}. We need to verify that ‖A(y − y∗)‖₂² ≥ (4/n²)‖y − y∗‖₂². Again, we apply Lemma 4.1 to obtain a point x ∈ P such that Ax = s∗ and ‖x − y‖₂² ≤ (n²/4)‖Ax − Ay‖₂². Since Ax = s∗, the definition of y∗ gives ‖y − y∗‖₂² ≤ ‖x − y‖₂². Using that Ax = Ay∗ = s∗, we have ‖Ax − Ay‖₂ = ‖Ay − Ay∗‖₂, and the claim follows. □

⁵The algorithms considered here solve the optimization problem (Prox-DSFM). An ε-approximate solution to an optimization problem min{f(x) : x ∈ P} is a solution x ∈ P satisfying f(x) ≤ f(x∗) + ε, where x∗ ∈ argmin_{x ∈ P} f(x) is an optimal solution.

Proof of Lemma 4.1: We give an algorithm that transforms y to a vector x ∈ P as in the statement through a sequence of path augmentations in the auxiliary graph defined in Section 2. We initialize x = y and maintain x ∈ P (and thus Ax ∈ B(f)) throughout. We now define the sets of source and sink nodes as N := {v ∈ V : (Ax)(v) < s∗(v)} and P := {v ∈ V : (Ax)(v) > s∗(v)}. Once N = P = ∅, we have Ax = s∗ and terminate. Note that since Ax, s∗ ∈ B(f), we have Σ_v (Ax)(v) = Σ_v s∗(v) = f(V), and therefore N = ∅ is equivalent to P = ∅. The blocks of x are denoted as x = (x₁, x₂, …, x_r), with x_i ∈ B(f_i).

Claim 4.3. If N ≠ ∅, then there exists a directed path of positive capacity in the auxiliary graph between the sets N and P.

Proof: We say that a set T is i-tight if x_i(T) = f_i(T). It is a simple consequence of submodularity that the intersection and union of two i-tight sets are also i-tight. For every i ∈ [r] and every u ∈ V, we define T_i(u) as the unique minimal i-tight set containing u. It is easy to see that for an arc (u, v) ∈ E_i, c(u, v) > 0 if and only if v ∈ T_i(u). We note that if u ∉ C_i, then x_i(u) = f_i({u}) = 0 and thus T_i(u) = {u}.

Let S be the set of vertices reachable from N on a directed path of positive capacity in the auxiliary graph. For a contradiction, assume S ∩ P = ∅. By the definition of S, we must have T_i(u) ⊆ S for every u ∈ S and every i ∈ [r]. Since the union of i-tight sets is also i-tight, we see that S is i-tight for every i ∈ [r], and consequently, x(S) = f(S). On the other hand, since N ⊆ S, S ∩ P = ∅, and N ≠ ∅, we have x(S) < s∗(S). Since s∗ ∈ B(f), we have f(S) = x(S) < s∗(S) ≤ f(S), a contradiction. We conclude that S ∩ P ≠ ∅. □

In every step of the algorithm, we take a shortest directed path Q of positive capacity from N to P, and update x along this path. That is, if (u, v) ∈ Q ∩ E_i, then we increase x_i(u) by ε and decrease x_i(v) by ε, where ε is the minimum capacity of an arc on Q. Note that this is the same as running the Edmonds-Karp-Dinitz algorithm in the submodular auxiliary graph. Using the analysis of [12], one can show that this change maintains x ∈ P, and that the algorithm terminates in finite (in fact, strongly polynomial) time. We defer the details to the full version of the paper.

It remains to bound ‖x − y‖₂. At every path update, the change in the ℓ∞-norm of x is at most ε, and the change in the ℓ₁-norm is at most nε, since the length of the path is at most n. At the same time, Σ_{v ∈ N} (s∗(v) − (Ax)(v)) decreases by ε. Thus, ‖x − y‖∞ ≤ ‖Ay − s∗‖₁/2 and ‖x − y‖₁ ≤ n‖Ay − s∗‖₁/2. Using the inequality ‖p‖₂ ≤ √(‖p‖₁‖p‖∞), we obtain ‖x − y‖₂ ≤ (√n/2)‖Ay − s∗‖₁, completing the proof. □

5 The level-0 algorithms

In this section, we briefly discuss the level-0 algorithms and the interface between the level-1 and level-0 algorithms.

Two-level frameworks via quadratic minimization oracles. Recall from the Introduction the assumption on the subroutines O_i(w): for each i ∈ [r] and any input vector w ∈ ℝⁿ, O_i(w) finds the minimum norm point in B(f_i + w). The continuous methods in Section 3 directly use the subroutines O_i(w) for the alternating projection or coordinate descent steps. For the flow-based algorithms in Section 2, the main oracle query is to find the auxiliary graph capacity c(u, v) of an arc (u, v) ∈ E_i for some i ∈ [r]. This can be easily formulated as minimizing the function f_i + w for an appropriate w with supp(w) ⊆ C_i. As explained at the beginning of Section 3, an optimal solution to (Min-Norm) immediately gives an optimal solution to the SFM problem for the same submodular function. Hence, the auxiliary graph capacity queries can be implemented via single calls to the subroutines O_i(w). Let us also remark that, while the functions f_i are formally defined on the entire ground set V, their effective support is C_i, and thus it suffices to solve the quadratic minimization problems on the ground set C_i.

Whereas discrete and continuous algorithms require the same type of oracles, there is an important difference between the two in terms of the exactness of the oracle solutions. The discrete algorithms require exact values of the auxiliary graph capacities c(u, v), as they must maintain x_i ∈ B(f_i) throughout. Thus, the oracle must always return an optimal solution.
The continuous algorithms are more robust: they converge to a solution with the required accuracy even if the oracle only returns approximate solutions. As discussed in Section 6, this difference leads to the continuous methods being applicable in settings where the combinatorial algorithms are prohibitively slow.

Level-0 algorithms. We now discuss specific algorithms for quadratic minimization over the base polytopes of the functions fi. Several functions that arise in applications are “simple”, meaning that there is a function-specific quadratic minimization subroutine that is very efficient. If a function-specific subroutine is not available, one can use a general-purpose submodular minimization algorithm. The works [1, 7] use a brute force search as the subroutine for each fi, with running time 2^{|Ci|} EOi. However, this is applicable only for small Ci's and is not suitable for our experiments, where the maximum clique size is quite large. As a general-purpose algorithm, we used the Fujishige-Wolfe minimum norm point algorithm [11, 31]. This provides an ε-approximate solution in O(|Ci| F_{i,max}^2 / ε) iterations, with an overall running time bound of O((|Ci|^4 + |Ci|^2 EOi) F_{i,max}^2 / ε) [3]. The experimental running time of the Fujishige-Wolfe algorithm can be prohibitively large [20]. As we discuss in Section 6, by warm-starting the algorithm and performing only a small number of iterations, we were able to use it in conjunction with the gradient descent level-1 algorithms.

6 Experimental results

We evaluate the algorithms on energy minimization problems that arise in image segmentation. We follow the standard approach and model the task of segmenting an object from the background as finding a minimum cost 0/1 labeling of the pixels. The total labeling cost is the sum of labeling costs corresponding to cliques, where a clique is a set of pixels.
We refer to the labeling cost functions as clique potentials.

The main focus of our experimental analysis is to compare the running times of the decomposable submodular minimization algorithms. Therefore we have chosen to use the simple hand-tuned potentials that were used in previous work: the edge-based costs [1] and the count-based costs defined by [29, 30]. Specifically, we used the following clique potentials in our experiments, all of which are submodular:

• Unary potentials for each pixel. The unary potentials are derived from Gaussian Mixture Models of color features [26].
• Pairwise potentials for each edge of the 8-neighbor grid graph. For each graph edge (i, j) between pixels i and j, the cost of a labeling equals 0 if the two pixels have the same label, and exp(−‖vi − vj‖^2) for different labels, where vi is the RGB color vector of pixel i.
• Square potentials for each 2 × 2 square of pixels. The cost of a labeling is the square root of the number of neighboring pixels that have different labels, as in [1].
• Region potentials. We use the algorithm from [30] to identify regions. For each region Ci, the labeling cost is fi(S) = |S||Ci \ S|, where S and Ci \ S are the subsets of Ci labeled 0 and 1, respectively; see [29, 30].

We used five image segmentation instances to evaluate the algorithms.⁶ The experiments were carried out on a single computer with a 3.3 GHz Intel Core i5 processor and 8 GB of memory; we reported times averaged over 10 trials.

We performed several experiments with various combinations of potentials and parameters.
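The non-unary clique potentials above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are ours, and the convention that a 2 × 2 square contributes its four horizontally/vertically neighboring pixel pairs is our assumption about the count in [1]:

```python
import math

def pairwise_potential(lab_i, lab_j, v_i, v_j):
    # Edge cost on the 8-neighbor grid: 0 for equal labels,
    # exp(-||v_i - v_j||^2) otherwise, where v_i, v_j are RGB color vectors.
    if lab_i == lab_j:
        return 0.0
    return math.exp(-sum((a - b) ** 2 for a, b in zip(v_i, v_j)))

def square_potential(labels):
    # labels is a 2x2 tuple of 0/1 labels; cost is the square root of the
    # number of horizontally/vertically neighboring pairs with different
    # labels (this neighbor convention is our assumption, after [1]).
    (a, b), (c, d) = labels
    return math.sqrt(sum(x != y for x, y in ((a, b), (c, d), (a, c), (b, d))))

def region_potential(S, C):
    # fi(S) = |S| * |Ci \ S|, where S is the subset of region C labeled 0.
    s = len(S & C)
    return s * (len(C) - s)
```

For instance, a region C of four pixels with two pixels labeled 0 has cost 2 · 2 = 4; the region potential is largest for balanced splits and vanishes when the region is labeled uniformly, which is what makes it a useful consistency term.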
In the minimum cut experiments, we evaluated the algorithms on instances containing only unary and pairwise potentials; in the small cliques experiments, we used unary, pairwise, and square potentials. Finally, the large cliques experiments used all of the potentials above. Here, we used two different level-0 algorithms for the region potentials. Firstly, we used an algorithm specific to this particular potential, with running time O(|Ci| log(|Ci|) + |Ci| EOi). Secondly, we used the general Fujishige-Wolfe algorithm at level-0. This turned out to be significantly slower: it was prohibitive to run the algorithm to near-convergence. Hence, we could not run IBFS in this setting, as it requires an exact solution.

We were able to run the coordinate descent methods with the following modification of Fujishige-Wolfe at level-0. At every iteration, we ran Fujishige-Wolfe for 10 iterations only, but we warm-started it with the current solution xi ∈ B(fi) for each i ∈ [r]. Interestingly, this turned out to be sufficient for the level-1 algorithm to make progress.

⁶ The data is available at http://melodi.ee.washington.edu/~jegelka/cc/index.html and http://research.microsoft.com/en-us/um/cambridge/projects/visionimagevideoediting/segmentation/grabcut.htm

Figure 1: Running times in seconds on a subset of the instances. The results for the other instances are very similar and are deferred to the full version of the paper. The x-axis shows the number of iterations for the continuous algorithms. The IBFS algorithm is exact, and we display its running time as a flat line. In the first three plots, the running time of IBFS on the small cliques instances nearly coincides with its running time on the minimum cut instances. In the last plot, the running time of IBFS is missing, since it is computationally prohibitive to run it on those instances.

Summary of results.
Figure 1 shows the running times for some of the instances; we defer the full experimental results to the full version of the paper. The IBFS algorithm is significantly faster than the gradient descent algorithms on all of the instances with small cliques. On all of the instances with larger cliques, IBFS (as well as the other combinatorial algorithms) is no longer suitable if the only choice for the level-0 algorithm is a generic method such as the Fujishige-Wolfe algorithm. The experimental results suggest that in such cases, the coordinate descent methods, together with a suitably modified Fujishige-Wolfe algorithm, provide an approach for obtaining an approximate solution.

[Figure 1 panels: plant, octopus, and penguin (all experiments; series IBFS, RCDM, and ACDM on the mincut, small cliques, and large cliques instances), and plant (large cliques with Fujishige-Wolfe; series RCDM and ACDM). x-axis: #iterations / #functions; y-axis: running time in seconds.]

References

[1] C. Arora, S. Banerjee, P. Kalra, and S. Maheshwari. Generic cuts: An efficient algorithm for optimal inference in higher order MRF-MAP. In European Conference on Computer Vision, pages 17–30. Springer, 2012.

[2] F. Bach. Learning with submodular functions: A convex optimization perspective.
Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.

[3] D. Chakrabarty, P. Jain, and P. Kothari. Provable submodular minimization using Wolfe's algorithm. In Advances in Neural Information Processing Systems, pages 802–809, 2014.

[4] J. Edmonds. Submodular functions, matroids, and certain polyhedra. Combinatorial structures and their applications, pages 69–87, 1970.

[5] A. R. Ene and H. L. Nguyen. Random coordinate descent methods for minimizing decomposable submodular functions. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

[6] O. Fercoq and P. Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.

[7] A. Fix, T. Joachims, S. Min Park, and R. Zabih. Structured learning of sum-of-submodular higher order energy functions. In Proceedings of the IEEE International Conference on Computer Vision, pages 3104–3111, 2013.

[8] A. Fix, C. Wang, and R. Zabih. A primal-dual algorithm for higher-order multilabel Markov random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1138–1145, 2014.

[9] L. Fleischer and S. Iwata. A push-relabel framework for submodular function minimization and applications to parametric optimization. Discrete Applied Mathematics, 131(2):311–322, 2003.

[10] S. Fujishige. Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research, 5(2):186–196, 1980.

[11] S. Fujishige and S. Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.

[12] S. Fujishige and X. Zhang. New algorithms for the intersection problem of submodular systems. Japan Journal of Industrial and Applied Mathematics, 9(3):369, 1992.

[13] A. V. Goldberg and R. E.
Tarjan. A new approach to the maximum-flow problem. Journal of the ACM (JACM), 35(4):921–940, 1988.

[14] M. Grötschel, L. Lovász, and A. Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.

[15] S. Iwata. A faster scaling algorithm for minimizing submodular functions. SIAM Journal on Computing, 32(4):833–840, 2003.

[16] S. Iwata, L. Fleischer, and S. Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM (JACM), 48(4):761–777, 2001.

[17] S. Iwata and J. B. Orlin. A simple combinatorial algorithm for submodular function minimization. In ACM-SIAM Symposium on Discrete Algorithms (SODA), 2009.

[18] S. Jegelka, F. Bach, and S. Sra. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.

[19] S. Jegelka and J. A. Bilmes. Online submodular minimization for combinatorial structures. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 345–352, 2011.

[20] S. Jegelka, H. Lin, and J. A. Bilmes. On fast approximate submodular minimization. In Advances in Neural Information Processing Systems, pages 460–468, 2011.

[21] V. Kolmogorov. Minimizing a sum of submodular functions. Discrete Applied Mathematics, 160(15):2246–2258, 2012.

[22] Y. T. Lee, A. Sidford, and S. C.-w. Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In IEEE Foundations of Computer Science (FOCS), 2015.

[23] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[24] R. Nishihara, S. Jegelka, and M. I. Jordan. On the convergence rate of decomposable submodular function minimization.
In Advances in Neural Information Processing Systems (NIPS), pages 640–648, 2014.

[25] J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.

[26] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.

[27] A. Schrijver. A combinatorial algorithm minimizing submodular functions in strongly polynomial time. Journal of Combinatorial Theory, Series B, 80(2):346–355, 2000.

[28] A. Schrijver. Combinatorial Optimization - Polyhedra and Efficiency. Springer, 2003.

[29] I. Shanu, C. Arora, and P. Singla. Min norm point algorithm for higher order MRF-MAP inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5365–5374, 2016.

[30] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In Advances in Neural Information Processing Systems (NIPS), 2010.

[31] P. Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.