{"title": "Fast Decomposable Submodular Function Minimization using Constrained Total Variation", "book": "Advances in Neural Information Processing Systems", "page_first": 8185, "page_last": 8195, "abstract": "We consider the problem of minimizing the sum of submodular set functions assuming minimization oracles of each summand function. Most existing approaches reformulate the problem as the convex minimization of the sum of the corresponding Lov\\'asz extensions and the squared Euclidean norm, leading to algorithms requiring total variation oracles of the summand functions; without further assumptions, these more complex oracles require many calls to the simpler minimization oracles often available in practice. In this paper, we consider a modified convex problem requiring constrained version of the total variation oracles that can be solved with significantly fewer calls to the simple minimization oracles. We support our claims by showing results on graph cuts for 2D and 3D graphs.", "full_text": "Fast Decomposable Submodular Function\n\nMinimization using Constrained Total Variation\n\nK S Sesh Kumar\n\nData Science Institute\n\nImperial College London, UK\ns.karri@imperial.ac.uk\n\nFrancis Bach\n\nINRIA and Ecole normale superieure\nPSL Research University, Paris France.\n\nfrancis.bach@inria.fr\n\nThomas Pock\n\nInstitute of Computer Graphics and Vision,\n\nGraz University of Technology, Graz, Austria.\n\npock@icg.tugraz.at\n\nAbstract\n\nWe consider the problem of minimizing the sum of submodular set functions as-\nsuming minimization oracles of each summand function. Most existing approaches\nreformulate the problem as the convex minimization of the sum of the correspond-\ning Lov\u00e1sz extensions and the squared Euclidean norm, leading to algorithms\nrequiring total variation oracles of the summand functions; without further assump-\ntions, these more complex oracles require many calls to the simpler minimization\noracles often available in practice. 
In this paper, we consider a modified convex
problem requiring a constrained version of the total variation oracles that can be
solved with significantly fewer calls to the simple minimization oracles. We support
our claims by showing results on graph cuts for 2D and 3D graphs.

1

Introduction

A discrete function F defined on a finite ground set V of n objects is said to be submodular if the
marginal cost of each object reduces with the increase in size of the set it is conditioned on, i.e.,
F : 2^V → R is submodular if and only if the marginal cost of an object x ∈ V conditioned on the
set A ⊆ V \ {x}, denoted by F({x}|A) = F({x} ∪ A) − F(A), reduces as the set A becomes bigger.
The diminishing returns property of submodular functions has been central to solving several machine
learning problems such as document summarization [1], sensor placement [2] and graph cuts [3]
(see [4] for more applications). Without loss of generality, we consider normalized submodular
functions, i.e., F(∅) = 0.
Submodular function minimization (SFM) can be solved exactly by polynomial-time algorithms, but with
high computational complexity. One of the standard algorithms is the Fujishige-Wolfe algorithm [5,
6]; more recently, SFM has been tackled using cutting-plane methods [7] and geometric rescaling [8].
All of the above algorithms rely on value function oracles, that is, access to F(A) for arbitrary subsets
A of V, and solve SFM with high running-time complexities, e.g., O(n^4 log^{O(1)} n) and more. These
algorithms are typically not trivial to implement and do not scale to problems with large ground sets
(such as n = 10^6 in computer vision applications). For scalable practical solutions, it is imperative to
exploit the structural properties of the function to minimize.
Submodularity is closed under addition [5, 4]. 
We make a structural assumption that is practically
useful in many machine learning problems [9, 10] (e.g., when a 2D-grid graph is seen as the
concatenation of vertical and horizontal chains) and consider the problem of minimizing a sum of

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

\fsubmodular functions [11], i.e.,

    min_{A ⊂ V} F(A) := ∑_{i=1}^r F_i(A),    (1)

assuming each summand F_i, i = 1, . . . , r, is “simple”, i.e., with available efficient oracles which
are more complex than plain function evaluations. The simplest of these oracles is being able to
minimize the submodular function F_i plus some modular function, and we will consider these oracles
in this paper. This is weaker than the usual “total variation” oracles detailed below.
One of the standard approaches to solve the discrete optimization problem in Eq. (1) is to consider
an equivalent continuous optimization problem that minimizes the Lovász extension [12] f of the
submodular function over the n-dimensional unit hypercube (see a definition in Section 2). This
approach uses a well-known result in the submodularity literature that the minimizers of the set
function F can be directly obtained from the minimizers of its Lovász extension f; the continuous
optimization problem is given by

    min_{w ∈ [0,1]^n} f(w) := ∑_{i=1}^r f_i(w),    (2)

where f_i is the Lovász extension of the submodular function F_i, for each i ∈ [r]. Lovász extensions of
submodular functions are convex, non-smooth, piecewise-linear functions. Therefore, we can
use subgradients, which can be calculated using a greedy algorithm in O(n log(n)) time and O(n)
calls to the value function oracle per iteration [13]. However, this is slow, with an O(1/√t) convergence
rate, where t is the number of iterations of the optimization algorithm. 
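To make this baseline concrete (it is the slow approach the paper improves on, not the proposed method), the following sketch runs projected subgradient descent on the Lovász extension of a small toy function; the chain-cut-plus-unary function F and all constants are hypothetical choices for the example.

```python
import numpy as np

UNARY = np.array([2.0, -2.0, 0.5, -1.0])  # illustrative modular (unary) costs

def F(A):
    """Toy submodular function on V = {0, 1, 2, 3}: cut of A in a 4-node chain
    graph plus modular unary terms; F(set()) = 0."""
    ind = np.zeros(4)
    ind[list(A)] = 1.0
    return float(np.abs(np.diff(ind)).sum() + UNARY @ ind)

def greedy_subgradient(w):
    """Greedy algorithm [13]: sort w in decreasing order and accumulate
    marginal gains; returns s in B(F) with f(w) = s @ w."""
    s, prefix, prev = np.zeros(len(w)), set(), 0.0
    for k in np.argsort(-w, kind="stable"):
        prefix.add(int(k))
        s[k] = F(prefix) - prev
        prev += s[k]
    return s

# projected subgradient descent on min_{w in [0,1]^n} f(w), step size 1/sqrt(t)
w = 0.5 * np.ones(4)
best_A, best_val = set(), F(set())
for t in range(1, 201):
    w = np.clip(w - greedy_subgradient(w) / np.sqrt(t), 0.0, 1.0)
    A = {j for j in range(4) if w[j] > 0.5}   # threshold the iterate
    if F(A) < best_val:
        best_A, best_val = A, F(A)
```

Tracking the best thresholded iterate does recover the minimizer of this toy function ({1, 2, 3}, value −1.5), but the O(1/√t) rate is what motivates the faster schemes developed below.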
Moreover, in signal processing
applications, high precision is needed, hence the need for faster algorithms.
An alternative approach is to consider a continuous optimization problem [4, Chapter 8] of the form

    min_{w ∈ R^n} f(w) + ∑_{j=1}^n ψ(w_j),    (3)

where ψ : R → R is a convex function whose Fenchel conjugate [14] is defined everywhere (in order
to have a well-defined dual, as later shown in Eq. (9)). This is equivalent to solving all the following
discrete optimization problems parameterized by α ∈ R,

    min_{A ⊂ V} F(A) + |A| ψ′(α).    (4)

Given the solutions A_α for all α ∈ R in Eq. (4), we may obtain the optimal solution w* of
Eq. (3) using w*_j = sup{α ∈ R, j ∈ A_α}. Conversely, given the optimal solution w* of Eq. (3), we
may obtain the solutions A_α of the discrete optimization problems in Eq. (4) by thresholding at α,
i.e., {w* ≥ α}. As a consequence, we can obtain the solution of Eq. (1) when α is chosen
so that ψ′(α) = 0 (typically α = 0 because ψ is even). Note that this algorithmic scheme seems
wasteful because we take a continuous solution of Eq. (3) and only keep the signs of its solution. One
contribution of this paper is to propose a function ψ that focuses only on values of w close to zero.

Our problem setting and approach. We assume the availability of SFM oracles for the individual
summand functions, which we refer to as SFMD_i for each i ∈ [r], giving the optimal solution of

    SFMD_i : argmin_{A ⊂ V} F_i(A) − u^T 1_A,    (5)

where u ∈ R^n is any n-dimensional vector and 1_A ∈ {0, 1}^n is the indicator vector of the set A.
Note that the complexity of the oracle does not typically depend on the vector u. We consider the
following continuous optimization problem, which we refer to as SFMC_i for each i ∈ [r],

    SFMC_i : argmin_{w ∈ R^n} f_i(w) − t^T w + ∑_{j=1}^n ψ(w_j),    (6)

where t ∈ R^n is any n-dimensional vector. In our setting, we consider ψ as the following convex
function,

    ψ(w) = (1/2) w^2 if |w| ≤ ε, and +∞ otherwise,    (7)

\fwhere ε ∈ R_+. In Section 3.1, we show that the continuous optimization problem SFMC_i can be
optimized using the discrete oracles SFMD_i through a modified divide-and-conquer algorithm [15]. In
Section 3.2, we use various optimization algorithms that use SFMC_i as the inner loop to solve the
continuous optimization problem in Eq. (3), consequently solving the SFM problem in Eq. (1).

Related work. Most of the earlier works have considered quadratic functions for the choice of
ψ [15, 16, 17, 18, 19, 20], i.e., ψ(v) = (1/2) v^2. As a result, SFMC_i in Eq. (6) is referred to as a total
variation or TV oracle, as it solves problems of the form min_{w ∈ R^n} f(w) + (1/2)||t − w||_2^2 [21]. These
oracles are efficient for cut functions defined on chain graphs [22, 23], with O(n) complexity. However,
this does not hold for general submodular functions. One way to solve continuous optimization
problems like SFMC_i is to use a sequence of at most n discrete minimization oracles like SFMD_i
through divide-and-conquer algorithms (see Section 3.1).
Recent work has also focused on using discrete minimization oracles of the form SFMD_i directly,
such as [16], which considers a total variation problem with active set methods; [24] used discrete
minimization oracles SFMD_i but solved a different convex optimization problem. [25] also considers the
total variation problem but uses incidence relations and oblique projections for quicker convergence.
[26] reduces the search space for the SFM problem, i.e., V, using heuristics. 
Our choice of ψ results
in a similar reduction of the search space, which leads to a more efficient solution.

Contributions. Our main contribution is to propose a new convex optimization problem that can be
used to find the minimum of a sum of submodular set-functions. For graph cuts, this new problem can
be seen as a constrained total variation problem that is more efficient than the regular total variation
(requiring fewer discrete minimization oracle calls). This is beneficial when minimizing the sum
of constrained total variation problems, and consequently beneficial for the corresponding discrete
minimization problem, i.e., minimizing the sum of submodular functions. For the case of a sum of
two functions, we show that recent acceleration techniques from [27] can be highly beneficial in
our case. This is validated using experiments on segmentation of two-dimensional images and three-
dimensional volumetric surfaces.
Note that we use cuts mainly due to easy access to minimization oracles of cut functions [28], but our
result applies to all submodular functions.

2 Review of Submodular Function Minimization (SFM)

In this section, we review the relevant concepts from submodular analysis (for more details, see
[4, 5]). All possible subsets of the ground set V can be considered as the vertices {0, 1}^n of the
hypercube in n dimensions (going from A ⊆ V to 1_A ∈ {0, 1}^n). Thus, any set-function may be
seen as a function F defined on the vertices of the hypercube {0, 1}^n. It turns out that F may be
extended to the full hypercube [0, 1]^n by piecewise-linear interpolation, and then to the whole vector
space R^n [4].
This extension f is piecewise linear for any set-function F. It turns out that it is convex if and only
if F is submodular [12]. Any piecewise-linear convex function may be represented as the support
function of a certain polytope K, i.e., as f(w) = max_{s ∈ K} w^T s [14]. 
For the Lovász extension of a
submodular function, there is an explicit description of K, which we now review.

Base polytope. We define the base polytope as B(F) = {s ∈ R^n, s(V) = F(V), ∀A ⊂ V, s(A) ≤ F(A)},
where we use the classical notation s(A) = s^T 1_A. A key result in submodular
analysis is that the Lovász extension is the support function of B(F), that is, for any w ∈ R^n,

    f(w) = sup_{s ∈ B(F)} w^T s.    (8)

The maximizers above may be computed in closed form from an ordered level-set representation
of w using a “greedy algorithm”, which (a) first sorts the elements of w in decreasing order such
that w_{σ(1)} ≥ · · · ≥ w_{σ(n)}, where σ represents the order of the elements in V; and (b) computes
s_{σ(k)} = F({σ(1), . . . , σ(k)}) − F({σ(1), . . . , σ(k − 1)}). This leads to a closed-form formula
for f(w) and a subgradient.

3

\fSFM as a convex optimization problem. A key result from submodular analysis [12] is the
equivalence between the SFM problem min_{A ⊆ V} F(A) and the convex optimization problem
min_{w ∈ [0,1]^n} f(w). One can then obtain an optimal A from level sets of an optimal w. Moreover,
this leads to the dual problem max_{s ∈ B(F)} ∑_{i=1}^n (s_i)_−. Note that for our algorithm to work, we
need oracles SFMD_i that output both the primal variable (A or w) and the dual variable s ∈ B(F).

Convex optimization and its dual. We consider the continuous optimization problem in Eq. (3).
Its dual problem, derived using Eq. (8), is given by

    max_{s ∈ B(F)} − ∑_{j=1}^n ψ*(−s_j).    (9)

In this paper, we consider the convex function ψ : R → R defined in Eq. (7). 
Its Fenchel conjugate
ψ* is given by

    ψ*(s) = (1/2) s^2 if |s| ≤ ε, and ε|s| − ε^2/2 otherwise.    (10)

3 Fast Submodular Function Minimization with Constrained Total Variation

In this section, we propose an algorithm to optimize the continuous optimization problem in Eq. (3)
using minimization oracles of the individual discrete functions, SFMD_i in Eq. (5). As a first step, we
propose a modified divide-and-conquer algorithm to solve the continuous optimization problem
SFMC_i, for each i ∈ [r], in Section 3.1. In Section 3.2, we use the optimization problems SFMC_i as
black boxes to solve the continuous optimization problem in Eq. (3). The overview is provided in
Algorithm 1.

Algorithm 1 From SFMD_i to SFMC
1: Input Discrete function minimization oracles for F_i : 2^V → R and ε ∈ R_+.
2: Output Optimal primal/dual solutions (w*, s*) for Eq. (3) and Eq. (9) respectively
3: for all i ∈ [r] do
4:   Optimize SFMC using SFMC_i by applying the algorithms in Section 3.3 and Section 3.4. This
     requires optimal primal-dual solutions (w*_i, s*_i) of SFMC_i, which may be obtained using
     Algorithm 2 in Section 3.1 assuming oracles SFMD_i.
5: end for

3.1 Single submodular function

For brevity, we drop the subscript i and consider the following primal optimization problem.
Algorithm 2 below is an extension of the classical divide-and-conquer algorithm from [29]. Note that it
requires access to dual certificates for the SFM problems.

Algorithm 2 From SFMD_i to SFMC_i
1: Input Discrete function minimization oracle for F : 2^V → R and ε ∈ R_+.
2: Output Optimal primal/dual solutions for Eq. (3) and Eq. 
(9) respectively (w*, s*)
3: A+ = argmin_{A ⊂ V} F(A) + ε|A|, with a dual certificate s+ ∈ B(F).
4: A− = argmin_{A ⊂ V} F(A) − ε|A|, with a dual certificate s− ∈ B(F) (we must have A+ ⊆ A−).
5: w*(A+) = −ε, s*(A+) = s+, w*(V \ A−) = ε, s*(V \ A−) = s−.
6: U := A− \ A+ and a discrete function G : 2^U → R s.t. G(B) = F(A+ ∪ B) − F(A+), with Lovász
extension g : R^{|U|} → R.
7: Solve for optimal solutions of min_{w ∈ R^{|U|}} g(w) + (1/2)||w||_2^2 and its dual using the
divide-and-conquer algorithm [16] to obtain (w*_U, s*_U).
8: (w*(U), s*(U)) = (w*_U, s*_U)

4

\fProposition 1 Algorithm 2 gives an optimal primal-dual pair for the optimization problems in Eq. (3)
and Eq. (9) respectively.

See the proof in the supplementary material (Section A). Note that the number of steps is at most
the number of different values that w may take (the solution w is known to have many equal
components [30]). In the worst case, this is still n, but in practice many components are equal to −ε
or ε, thus reducing the number of SFM calls (for ε very close to zero, only two calls are necessary). In
Section 5, we show empirically that this is indeed the case: the number of SFM calls decreases
significantly when ε tends to zero.

3.2 Sum of submodular functions

In this section, we consider the optimization problem in Eq. (3) with the function ψ from Eq. 
(7).
The primal optimization problem is given by

    min_{w ∈ [−ε,ε]^n} ∑_{i=1}^r f_i(w) + (1/2)||w||_2^2.    (11)

In order to derive a dual problem with the appropriate structure, we consider the functions g_i defined
as follows: g_i(w) = f_i(w) if |w| ≤ ε, and +∞ otherwise, with the Fenchel conjugate

    g*_i(s_i) = sup_{w ∈ [−ε,ε]^n} w^T s_i − f_i(w) = sup_{w ∈ [−ε,ε]^n} inf_{t_i ∈ B(F_i)} w^T (s_i − t_i)
              = inf_{t_i ∈ B(F_i)} sup_{w ∈ [−ε,ε]^n} w^T (s_i − t_i) = ε inf_{t_i ∈ B(F_i)} ||s_i − t_i||_1.

Therefore, we can derive the following dual:

    min_{w ∈ [−ε,ε]^n} ∑_{i=1}^r f_i(w) + (1/2)||w||_2^2 = min_{w ∈ R^n} ∑_{i=1}^r g_i(w) + (1/2)||w||_2^2
        = min_{w ∈ R^n} ∑_{i=1}^r max_{s_i ∈ R^n} { w^T s_i − g*_i(s_i) } + (1/2)||w||_2^2
        = max_{(s_1,...,s_r) ∈ R^{n×r}} − ∑_{i=1}^r g*_i(s_i) − (1/2)|| ∑_{i=1}^r s_i ||_2^2.    (12)

We are now faced with a similar optimization problem to previous work [15], where the primal
problem is equivalent to computing the proximity operator of the sum of functions g_1 + g_2. The
main difference is that when ε is infinite (i.e., with no constraints), the dual functions g*_i are
indicator functions of the base polytopes B(F_i), and the dual problem in Eq. (12) can be seen as
finding the distance between two polytopes.
This is not the case for our constrained functions. This limits the choice of algorithms. In this paper,
we consider block-coordinate ascent (which was already considered in [15], leading to alternate
projection algorithms), and a novel recent accelerated coordinate descent algorithm [27]. 
We could
also consider (accelerated) proximal gradient descent on the dual problem in Eq. (12), but it was
shown empirically to be worse than alternating reflections [15] (which we compare to in experiments,
but which we cannot readily extend without adding a new hyperparameter).

3.3 Optimization algorithms for all r

All of our algorithms will rely on computing the proximity operator of the functions g*_i, which
we now consider.

Proximity operator. The key component we will need is the so-called proximal operator of g*_i,
that is, being able to compute efficiently, for a certain η,

    min_{s_i ∈ R^n} g*_i(s_i) + (1/(2η))||s_i − t_i||_2^2.

Using the classical Moreau identity [31], this is equivalent to solving

    min_{s_i ∈ R^n} g*_i(s_i) + (1/(2η))||s_i − t_i||_2^2
        = max_{w_i ∈ R^n} −g_i(w_i) − (η/2)||w_i − (1/η) t_i||_2^2 + (1/(2η))||t_i||_2^2
        = max_{w_i ∈ R^n} −g_i(w_i) + w_i^T t_i − (η/2)||w_i||_2^2.

5

\f
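For the specific ψ of Eq. (7), both proximal operators have closed forms, and the Moreau identity used above can be checked numerically. The following sketch (with step size η = 1 and an arbitrary ε) is only an illustration of the identity, not of the full SFMC oracle; the closed-form expressions are derived directly from Eqs. (7) and (10).

```python
import numpy as np

def prox_psi(x, eps):
    """prox of psi(w) = w^2/2 + indicator(|w| <= eps), Eq. (7):
    the unconstrained minimizer x/2 projected onto [-eps, eps]."""
    return np.clip(x / 2.0, -eps, eps)

def prox_psi_star(x, eps):
    """prox of the conjugate psi*(s) of Eq. (10), a Huber-type function:
    quadratic region for |x| <= 2*eps, shift by eps towards zero outside."""
    return np.where(np.abs(x) <= 2.0 * eps, x / 2.0, x - eps * np.sign(x))

eps = 0.5
x = np.linspace(-3.0, 3.0, 13)
# Moreau identity with step size 1: prox_psi(x) + prox_psi_star(x) = x
assert np.allclose(prox_psi(x, eps) + prox_psi_star(x, eps), x)
```

The same identity with a general η is what allows the prox of g*_i to be computed from the SFMC_i oracle only.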
Since the non-smooth\ni (si) is separable, it is globally convergent, with a convergence rate at least equal to\n\ni=1 g\u2217\n\nO(1/t), where t is the number of iterations (see, e.g., Theorem-6.7 of [32]).\n\nfunction(cid:80)r\n\n3.4 Acceleration for the special case r = 2\n\nWhen there are only two functions, following [27], the problem in Eq. (12) can be written as:\n\n\u2212g\u2217\n\n1(s1) \u2212 h1(s1),\n\nmax\ns1\u2208Rn\n\n(13)\n\nwith\n\n2(cid:107)s1 + s2(cid:107)2\n\n2(cid:107)s1(cid:107)2\n2(cid:107)w2(cid:107)2\n\n2 \u2212 1\n2 \u2212 w(cid:62)\n\ng\u2217\n2(s2) + 1\n\u2212g2(w2) + 1\n\u2212g2(w2) \u2212 1\n\nh1(s1) = inf\ns2\u2208Rn\n= sup\nw2\u2208Rn\n= sup\nw2\u2208Rn\n\n2 s2 \u2212 g2(w2) + 1\nw(cid:62)\n2, with s2 = w2 + s1,\n\u2212f2(w2) \u2212 w(cid:62)\n\n2(cid:107)s1 + s2(cid:107)2 = sup\ninf\ns2\u2208Rn\nw2\u2208Rn\n2(cid:107)w2 + s1(cid:107)2\n2 s1 =\nw2\u2208[\u2212\u03b5,\u03b5]n\n1(s1) = s1 + s\u2217\nThe function h1 is 1-smooth with gradient equal to h(cid:48)\n1(s1) \u2212 h1(s1) leads to the iteration\ngradient to the problem of maximizing maxs1\u2208Rn \u2212g\u2217\n2(cid:107)s1 + snew\n(cid:107)2\n1 \u2212 s1(cid:107)2\n2(cid:107)snew\n2 + h(cid:48)\n2(cid:107)snew\n1 \u2212 s1(cid:107)2\n2(cid:107)snew\n\ng\u2217\n2(snew\ng\u2217\n1(snew\ng\u2217\n1(snew\ng\u2217\n1(snew\n\n2 + (s1 + snew\n(cid:107)2\n2,\n\n= argmin\n2 \u2208Rn\nsnew\n= argmin\n1 \u2208Rn\nsnew\n= argmin\n1 \u2208Rn\nsnew\n= argmin\n1 \u2208Rn\nsnew\n\n1 \u2212 s1)\n)(cid:62)(snew\n\n1(s1)(cid:62)(snew\n\n1 \u2212 s1)\n\n1 + snew\n\nsnew\n1\n\nsnew\n2\n\n) + 1\n\n) + 1\n\n) + 1\n\n) + 1\n\nsup\n\n2\n\n1\n\n1\n\n1\n\n2\n\n2\n\n2\n\n2 s1 \u2212 1\n\n2(cid:107)w2(cid:107)2\n2.\n\n2(s1). Applying proximal\n\nwhich is exactly block coordinate descent. Each of these steps are exactly using the same oracle as\nbefore. We can now accelerate it using FISTA [33] with the step size from the smoothness constant\nwhich is equal to 1. 
Starting from a pair of iterates (s_1, t_1), this leads to the iteration:

    s_2^new = argmin_{s_2^new ∈ R^n} g*_2(s_2^new) + (1/2)||t_1 + s_2^new||_2^2,
    s_1^new = argmin_{s_1^new ∈ R^n} g*_1(s_1^new) + (1/2)||s_1^new + s_2^new||_2^2,
    t_1^new = s_1^new + β(s_1^new − s_1),

with β = (t − 1)/(t + 2) at iteration t. This algorithm converges in O(1/t^2).
This acceleration can also be used for the case r > 2 by using the product space trick (see, e.g., [15,
Section 3.2]). However, this requires a correction in the product space that leads to inefficiencies of
the algorithm in practice. See [34] for more details.

4 Theoretical Analysis

In this section, we provide a convergence analysis for the methods above. For simplicity of the results,
we consider the following primal-dual formulation (where both primal and dual variables live in
bounded sets):

    min_{w ∈ [−ε,ε]^n} ∑_{i=1}^r f_i(w) + (1/2)||w||_2^2 = min_{w ∈ R^n} max_{t_i ∈ B(F_i)} ∑_{i=1}^r w^T t_i + ∑_{j=1}^n ψ(w_j)
        = max_{(t_1,...,t_r) ∈ B(F_1)×···×B(F_r)} − ∑_{j=1}^n ψ*( ∑_{i=1}^r t_{ij} ).    (14)

6

\f(a)  (b)  (c)

Figure 1: Comparison to state-of-the-art algorithms for 2D and 3D SFM.

We assume that we have a pair (w, t_1, . . . , t_r) of approximate primal-dual solutions for Eq. (14), with
a duality gap η_C. This leads to a pair (w, u) of primal-dual approximate solutions for

    min_{w ∈ [−ε,ε]^n} f(w) + (1/2)||w||_2^2 = max_{u ∈ B(F)} − ∑_{j=1}^n ψ*(u_j),    (15)

for which we can get an approximate subset of V (see proof in supplementary material, Section B):

Proposition 2 Given a feasible primal candidate w for Eq. 
(15) with suboptimality η_C, one of the
suplevel sets {w ≥ α} of w is an η_D-optimal minimizer of F, with η_D = η_C/(4ε) + √(η_C n/2).

Since our dual problems are all O(1)-smooth (using the traditional definitions of smoothness [35]),
their guarantees will always be of the form η_C = Δ^2/t^α, where Δ is a notion of diameter of the base
polytopes and α = 2 for accelerated algorithms and α = 1 for plain algorithms. The overall discrete
gap is thus, up to constant terms,

    η_D = Δ√n / t^{α/2} + Δ^2 / (ε t^α).

We see clearly that the final bound on the (discrete) gap is decreasing with ε. This suggests using ε
proportional to Δ/√(n t^α) to take it as small as possible while only losing a factor of 2 in the
convergence bound.

Guarantees for FISTA applied to the dual of Eq. (14). The function ψ* is O(1)-smooth, and the
objective in Eq. (14) is r-smooth. Each B(F_i) has a squared diameter less than Δ_i^2 = ∑_{j=1}^n [F_i({j}) +
F_i(V \ {j}) − F_i(V)]^2. Thus, in the result above, we have Δ^2 = r ∑_{i=1}^r Δ_i^2 and α = 2. Owing
to [36, Cor. 2(b)], these guarantees extend to the corresponding primal iterate w.

Guarantees for primal-dual algorithms applied to Eq. (14). We consider the primal-dual formulation

    min_{w ∈ [−ε,ε]^n} max_{(t_1,...,t_r) ∈ B(F_1)×···×B(F_r)} w^T ( ∑_{i=1}^r t_i ) + ∑_{j=1}^n ψ(w_j).

The primal set has squared diameter nε^2; the dual set has squared diameter less than ∑_{i=1}^r Δ_i^2; the
bilinear function has a largest singular value equal to √r. Thus, from [37], we get a guarantee from
a primal-dual algorithm, of the form Δ^2 = r ∑_{i=1}^r Δ_i^2 + ε √(nr) (∑_{i=1}^r Δ_i^2)^{1/2}. 
We thus get overall a
guarantee of the same form as above, with the same dependency in ε.

5 Experiments

In this section, we consider the minimization of cut functions [3], which are important examples
of submodular functions. In our experiments, we consider the problem of minimizing cuts on 2D
images and 3D volumetric surfaces for segmentation. We consider a two-dimensional image of size
n = 2400 × 2400 = 5.8 × 10^6 pixels, and a 3D volumetric surface of size n = 102 × 100 × 79 =
8.1 × 10^5 voxels. The SFM oracles are obtained using max-flow codes, which solve the dual of the
min-cut problem. We compare our results to the standard block coordinate descent (BCD) [15] and

7

\f
For the case of the sum of two functions, the block coordinate descent can be\n\u221a\naccelerated [27] as shown in Section 3.4. We refer to their accelerated versions as acc BCD, acc\n\u2206, acc \u2206/t and acc \u2206/\nt respectively. Therefore, BCD, acc BCD, AAR use quadratic \u03c8 and\nthe rest use \u03c8 as de\ufb01ned in Eq. (7).\nFigure 1 shows the performance of various algorithms on different problems which we detail below.\nThe horizontal axes represents the number of discrete minimization oracles, i.e., SFMDi required to\nsolve the SFM and the vertical axes represents the discrete duality gap given by\n\nwhere A \u2282 V , s \u2208 B(F ) are the discrete primal-dual pairs and s\u2212(V ) = (cid:80)n\n\ngap(A, s) = F (A) \u2212 s\u2212(V ),\n\ni=1 min(si, 0). We\nconsider three experiments that may be broadly classi\ufb01ed into sum of two functions and sum of three\nfunctions.\n\nSum of two functions (r = 2).\nIn this case, we consider minimization of the submodular function\nthat can be written as sum of two submodular functions, i.e., F = F1 + F2. We consider the problem\nof mininiming graph cuts on 2D grid that can be written as the sum of horizontal and vertical chain\ngraphs in Figure 1-(a). In this case, the SFMDi orcale represents the min-cut on a chain graph while\nthe SFM problem represents min-cut on a 2D grid. We can observe that the constrained total variation\nformulation reduces the number of min-cut/ max-\ufb02ow calls when compared to full total variation.\nHere, we explicitly calculate the diameter of the base polytope \u2206 for choosing \u03b5.\nFigure 2-(a) shows the total number of SFMDi oracle calls required to solve the SFM problem for\ndifferent values of \u03b5. Figure 2-(b) shows the total number of constrained TV SFMCi calls required\nto solve the SFM problem. Figure 2-(c) shows the average number of SFMDi oracle calls required to\nsolve a single SFMCi problem. 
The algorithms considered in these graphs are BCD and accelerated
BCD algorithms using constrained total variation, represented by ε and acc ε respectively. We clearly
see the trade-off in the choice of ε: the number of SFM calls per TV call increases with ε, while the
number of TV calls decreases, leading to intermediate values of ε which give significant gains in
the total number of SFM calls in Figure 2-(a).
We also consider a 3D grid that can be decomposed into 2D frames and chain graphs. In Figure 1-(b),
we show the performance of our algorithm compared to other state-of-the-art algorithms for this
decomposition. In this case we use two different discrete oracles SFMD_i, i.e., min-cut on a chain
and min-cut on a 2D grid, to solve SFM, i.e., min-cut on the 3D grid. We show only the number of
oracle calls to min-cut on 2D grids for analysis, as they are more expensive than min-cuts on chains.

Sum of three functions (r = 3). In this case, we consider minimization of a submodular function
that can be written as the sum of three submodular functions, i.e., F = F_1 + F_2 + F_3. Min-cut on the 3D
grid can also be seen as a sum of chain graphs in three directions, thereby using discrete minimization
oracles only of the chain graphs. Figure 1-(c) shows the number of calls to 1D min-cut to solve the
3D min-cut problem using various continuous optimization problems and algorithms. Our approach
considerably reduces the number of calls to 1D min-cut (SFMD_i) oracles. Figure 2-(d) shows the
total number of 1D min-cuts (SFMD_i) to solve 3D SFM for various values of ε. Figure 2-(e) shows
the total number of constrained total variation SFMC_i calls required to solve the SFM problem.
Figure 2-(f) shows the average number of SFMD_i oracle calls required to solve SFMC_i for this
problem. 
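The trade-off discussed here can be made concrete on a toy function: for small ε, the first two oracle calls of Algorithm 2 (computing A+ and A−) already pin down the solution, so very few SFM calls are needed per constrained-TV problem. In the sketch below, the brute-force enumeration stands in for a real SFM oracle (e.g., max-flow), and the chain-cut-plus-unary function F is a hypothetical choice for the example.

```python
from itertools import chain, combinations
import numpy as np

UNARY = np.array([2.0, -2.0, 0.5, -1.0])  # illustrative unary costs

def F(A):
    """Toy submodular function: cut of A in a 4-node chain plus unary terms."""
    ind = np.zeros(4)
    ind[list(A)] = 1.0
    return float(np.abs(np.diff(ind)).sum() + UNARY @ ind)

def sfmd(shift):
    """Brute-force stand-in for an SFM oracle: argmin_{A} F(A) + shift*|A|."""
    subsets = chain.from_iterable(combinations(range(4), k) for k in range(5))
    return min((set(A) for A in subsets), key=lambda A: F(A) + shift * len(A))

eps = 0.01
A_plus = sfmd(+eps)    # step 3 of Algorithm 2: argmin F(A) + eps|A|
A_minus = sfmd(-eps)   # step 4 of Algorithm 2: argmin F(A) - eps|A|
assert A_plus <= A_minus   # nested, as Algorithm 2 requires
# here A_plus == A_minus, so w* is -eps on A_plus and +eps elsewhere:
# the recursion set U = A_minus \ A_plus is empty and two SFM calls suffice
```

Larger ε separates A+ from A− and triggers the recursion on U, which is exactly the "average SFM calls per TV call" growth seen in Figure 2.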
We observe a behavior similar to the r = 2 case, with the best values of ε being neither very small nor very large.

6 Conclusion

In this paper, we have proposed a simple modification of state-of-the-art algorithms for decomposable submodular function minimization. Adding box constraints to the continuous optimization problems allows for a significant reduction in the number of individual submodular function minimization calls.

Figure 2: BCD and acc BCD for the 2D function F = F1 + F2 and the 3D function F = F1 + F2 + F3.

The application of accelerated block coordinate ascent techniques makes the speed-up stronger. Further speed-ups may be achieved by extending the proposed algorithms to [18, 25]. These techniques are easily parallelizable, and it would be interesting to compare them to dedicated parallel algorithms for graph cuts [38]. Moreover, these speed-ups could be extended to more general submodular optimization problems [39].

Acknowledgments. This research was funded by the Leverhulme Centre for the Future of Intelligence, Cambridge, and the Data Science Institute, Imperial College London. We acknowledge support from the European Research Council (SEQUOIA project 724063 and HOMOVIS project 640156).

References

[1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proceedings of NAACL/HLT, 2011.

[2] A. Krause and C. Guestrin. Submodularity and its applications in optimized information gathering. ACM Transactions on Intelligent Systems and Technology, 2011.

[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.

[4] F. Bach. Learning with Submodular Functions: A Convex Optimization Perspective, volume 6 of Foundations and Trends in Machine Learning. NOW, 2013.

[5] S. Fujishige.
Submodular Functions and Optimization. Elsevier, 2005.

[6] D. Chakrabarty, P. Jain, and P. Kothari. Provable submodular minimization using Wolfe's algorithm. In Advances in Neural Information Processing Systems, 2014.

[7] Y. T. Lee, A. Sidford, and S. C. Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In Annual Symposium on Foundations of Computer Science, 2015.

[8] Daniel Dadush, László A. Végh, and Giacomo Zambelli. Geometric rescaling algorithms for submodular function minimization. In Proceedings of the Symposium on Discrete Algorithms (SODA), 2018.

[9] P. Stobbe. Convex Analysis for Minimizing and Learning Submodular Set Functions. PhD thesis, California Institute of Technology, 2013.

[10] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):531–552, 2011.

[11] V. Kolmogorov. Minimizing a sum of submodular functions. Discrete Applied Mathematics, 2012.

[12] L. Lovász. Submodular functions and convexity. Mathematical Programming: the State of the Art, Bonn, pages 235–257, 1982.

[13] J. Edmonds. Submodular functions, matroids, and certain polyhedra. In Combinatorial Optimization - Eureka, You Shrink!, pages 11–26. Springer, 1970.

[14] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1997.

[15] S. Jegelka, F. Bach, and S. Sra. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems, 2013.

[16] K. S. Sesh Kumar and Francis Bach.
Active-set methods for submodular minimization problems. Journal of Machine Learning Research, 18(1):4809–4839, 2017.

[17] Alina Ene and Huy Nguyen. Random coordinate descent methods for minimizing decomposable submodular functions. In International Conference on Machine Learning, pages 787–795, 2015.

[18] Alina Ene, Huy Nguyen, and László A. Végh. Decomposable submodular function minimization: discrete and continuous. In Advances in Neural Information Processing Systems, 2017.

[19] R. Nishihara, S. Jegelka, and M. I. Jordan. On the convergence rate of decomposable submodular function minimization. In Advances in Neural Information Processing Systems 27, pages 640–648, 2014.

[20] Chetan Arora, Subhashis Banerjee, Prem Kalra, and S. N. Maheshwari. Generic cuts: An efficient algorithm for optimal inference in higher order MRF-MAP. In ECCV, 2012.

[21] A. Chambolle. Total variation minimization and a class of binary MRF models. In Energy Minimization Methods in Computer Vision and Pattern Recognition, 2005.

[22] Laurent Condat. A direct algorithm for 1D total variation denoising. Technical report, GREYC laboratory, CNRS-ENSICAEN-Univ. of Caen, 2012.

[23] Álvaro Barbero and Suvrit Sra. Modular proximal optimization for multidimensional total-variation regularization. Technical Report 1411.0589, ArXiv, 2014.

[24] P. Stobbe and A. Krause. Efficient minimization of decomposable submodular functions. In Advances in Neural Information Processing Systems, 2010.

[25] P. Li and O. Milenkovic. Revisiting decomposable submodular function minimization with incidence relations. In Advances in Neural Information Processing Systems, 2018.

[26] Weizhong Zhang, Bin Hong, Lin Ma, Wei Liu, and Tong Zhang. Safe element screening for submodular function minimization. In International Conference on Machine Learning, 2018.

[27] Antonin Chambolle and Thomas Pock.
A remark on accelerated block coordinate descent for computing the proximity operators of a sum of convex functions. Journal of Computational Mathematics, 1:29–54, 2015.

[28] V. Kolmogorov and R. Zabih. What energy functions can be minimized by graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.

[29] H. Groenevelt. Two algorithms for maximizing a separable concave function over a polymatroid feasible region. European Journal of Operational Research, 54(2):227–236, 1991.

[30] Francis Bach. Shaping level sets with submodular functions. In Advances in Neural Information Processing Systems, pages 10–18, 2011.

[31] Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.

[32] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

[33] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[34] K. S. Sesh Kumar, A. Barbero, S. Jegelka, S. Sra, and F. Bach. Convex optimization for parallel energy minimization. Technical Report 01123492, HAL, 2015.

[35] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

[36] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization, 2008.

[37] Antonin Chambolle and Thomas Pock. On the ergodic convergence rates of a first-order primal-dual algorithm. Mathematical Programming, 159(1-2):253–287, 2016.

[38] A. Shekhovtsov and V. Hlaváč. A distributed mincut/maxflow algorithm combining path augmentation and push-relabel.
In Energy Minimization Methods in Computer Vision and Pattern Recognition, 2011.

[39] Francis Bach. Submodular functions: from discrete to continuous domains. Mathematical Programming, pages 1–41, 2016.