{"title": "Revisiting Decomposable Submodular Function Minimization with Incidence Relations", "book": "Advances in Neural Information Processing Systems", "page_first": 2237, "page_last": 2247, "abstract": "We introduce a new approach to decomposable submodular function minimization (DSFM) that exploits incidence relations. Incidence relations describe which variables effectively influence the component functions, and when properly utilized, they allow for improving the convergence rates of DSFM solvers. Our main results include the precise parametrization of the DSFM problem based on incidence relations, the development of new scalable alternative projections and parallel coordinate descent methods and an accompanying rigorous analysis of their convergence rates.", "full_text": "Revisiting Decomposable Submodular Function\n\nMinimization with Incidence Relations\n\nPan Li\nUIUC\n\npanli2@illinois.edu\n\nAbstract\n\nOlgica Milenkovic\n\nUIUC\n\nmilenkov@illinois.edu\n\nWe introduce a new approach to decomposable submodular function minimiza-\ntion (DSFM) that exploits incidence relations. Incidence relations describe which\nvariables effectively in\ufb02uence the component functions, and when properly uti-\nlized, they allow for improving the convergence rates of DSFM solvers. Our\nmain results include the precise parametrization of the DSFM problem based on\nincidence relations, the development of new scalable alternative projections and\nparallel coordinate descent methods and an accompanying rigorous analysis of\ntheir convergence rates.\n\nIntroduction\n\n1\nA set function F : 2[N ] \u2192 R over a ground set [N ] is termed submodular if for all pairs of sets\nS1, S2 \u2286 [N ], one has F (S1) + F (S2) \u2265 F (S1 \u2229 S2) + F (S1 \u222a S2). 
Submodular functions capture the ubiquitous phenomenon of diminishing marginal costs [1] and they frequently arise as part of the objective function of various machine learning optimization problems [2, 3, 4, 5, 6, 7].

Among the various submodular function optimization problems, submodular function minimization (SFM), which may be stated as min_{S⊆[N]} F(S), is one of the most important and commonly studied questions. The current fastest known SFM algorithm has complexity O(N^4 log^{O(1)} N + τ N^3), where τ denotes the time needed to evaluate the submodular function [8]. Although SFM solvers operate in time polynomial in N, the high degree of the underlying polynomial prohibits their use in practical large-scale settings. For this reason, a recent line of work has focused on developing scalable and parallelizable algorithms for solving the SFM problem by leveraging the property of decomposability [9]. Decomposability asserts that the submodular function may be written as a sum of "simpler" submodular functions that may be optimized sequentially or in parallel. Formally, the underlying problem, referred to as decomposable SFM (DSFM), may be stated as:

DSFM:  min_S Σ_{r∈[R]} F_r(S),    (1)

where F_r : 2^[N] → R is a submodular function for all r ∈ [R]. Algorithmic solutions for the DSFM problem fall into two categories, combinatorial optimization approaches [10, 11] and continuous function optimization methods [12]. In the latter setting, a crucial concept is the Lovász extension of the submodular function, which is convex [13] and lends itself to a norm-regularized convex optimization framework. Prior work in continuous DSFM has focused on devising efficient algorithms for solving the convex problem and deriving matching convergence results.
The best known approaches include the alternating projection (AP) methods [14, 15] and the coordinate descent (CD) methods [16].

Despite some simplifications offered through decomposability, DSFM algorithms still suffer from scalability issues and have convergence guarantees that are suboptimal. To address the first issue, one needs to identify additional problem constraints that allow for parallel implementations. To resolve the second issue and more precisely characterize and improve the convergence rates, one needs to better understand how the individual submodular components jointly govern the global optimal solution. In both cases, it is crucial to utilize incidence relations that describe which subsets of variables

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

directly affect the value of any given component function. Often, incidences involve relatively small subsets of elements, which leads to desirable sparsity constraints. This is especially the case for min-cut problems on graphs and hypergraphs (where each submodular component involves two or several vertices) [17, 18] and MAP inference with higher-order potentials (where each submodular component involves variables corresponding to adjacent pixels) [9]. Although incidence relations have been used to parametrize the algorithmic complexity of combinatorial optimization methods for solving DSFM problems [10], they have been largely overlooked in continuous optimization methods. Some prior work considered merging decomposable parts with nonoverlapping support into one submodular function, thereby creating a coarser decomposition that may be processed more efficiently [14, 15, 16], but the accompanying algorithms were neither designed in a form that can optimally use this information nor analyzed precisely with respect to their convergence rates and merging strategies.
In an independent work, Djolonga and Krause found that the variational inference problem in L-FIELD could be reduced to a DSFM problem with sparse incidence relations [19], although their analysis only applies to regular cases.

Here, we revisit two benchmark algorithms for continuous DSFM – AP and CD – and describe how to modify them to exploit incidence relations that allow for significantly improved computational complexity. Furthermore, we provide a complete theoretical analysis of the algorithms parametrized by incidence relations with respect to their convergence rates. AP-based methods that leverage incidence relations achieve better convergence rates than classical AP algorithms in both the sequential and the parallel optimization scenario. The random CD method (RCDM) and accelerated CD method (ACDM) that incorporate incidence information can be parallelized. The complexity of sequential CD methods cannot be improved using incidence relations, but the convergence rate of parallel CD methods strongly depends on how the incidence relations are used for coordinate sampling: while a new specialized combinatorial sampling based on equitable coloring [20] is optimal, sampling uniformly at random produces a 2-approximation. It also leads to a greedy method that empirically outperforms random sampling.
A summary of these and other findings is presented in Table 1.

           Prior work                      This work
           Sequential    Parallel          Sequential      Parallel
AP         O(N^2 R^2)    O(N^2 R^2 / K)    O(N ‖µ‖_1 R)    O(N ‖µ‖_1 R / K)
RCDM       O(N^2 R)      -                 O(N^2 R)        O(((R−K)/(R−1) N^2 + (K−1)/(R−1) N ‖µ‖_1) R/K)
ACDM       O(N R)        -                 O(N R)          O(((R−K)/(R−1) N^2 + (K−1)/(R−1) N ‖µ‖_1)^{1/2} R/K)

Table 1: Overview of known and new results: each entry contains the required number of iterations to achieve an ε-optimal solution (the dependence on ε is the same for all algorithms and hence omitted). Here, ‖µ‖_1 = Σ_{i∈[N]} µ_i, where for all i ∈ [N], µ_i equals the number of submodular functions that involve element i; K is a parallelization parameter that equals the number of min-norm point problems that have to be solved within each iteration.

2 Background, Notation and Problem Formulation

We start our exposition by reviewing several recent lines of work for solving the DSFM problem, and focus on approaches that transform the DSFM problem into a continuous optimization problem. Such approaches exploit the fact that the Lovász extension of a submodular function is convex. Without loss of generality, we tacitly assume that all submodular functions F_r are normalized, i.e., that F_r(∅) = 0 for all r ∈ [R].
Also, given a vector z ∈ R^N and S ⊆ [N], we define z(S) = Σ_{i∈S} z_i. Then, the base polytope of the r-th submodular function F_r is defined as

B_r := {y_r ∈ R^N | y_r(S) ≤ F_r(S) for any S ⊂ [N], and y_r([N]) = F_r([N])}.

The Lovász extension [13] f_r(·) : R^N → R of a submodular function F_r is defined as f_r(x) = max_{y_r∈B_r} ⟨y_r, x⟩, where ⟨·,·⟩ denotes the inner product of two vectors. The DSFM problem can be solved through continuous optimization, min_{x∈[0,1]^N} Σ_r f_r(x). To counter the nonsmoothness of the objective function, a proximal formulation of a generalization of the above optimization problem is considered instead [14],

min_{x∈R^N} Σ_{r∈[R]} f_r(x) + (1/2)‖x‖_2^2.    (2)

As the problem (2) is strongly convex, it has a unique optimal solution, denoted by x*. The exact discrete solution to the DSFM problem equals S* = {i ∈ [N] | x*_i > 0}.

For convenience, we denote the product of base polytopes as B = ⊗_{r=1}^R B_r, and write y = (y_1, y_2, ..., y_R) ∈ B. Also, we let A be a simple linear mapping ⊗_{r=1}^R R^N → R^N which, given a point a = (a_1, a_2, ..., a_R) ∈ ⊗_{r=1}^R R^N, outputs Aa = Σ_{r∈[R]} a_r. The AP and CD algorithms for solving (2) use the dual form of the problem, described in the next lemma.

Lemma 2.1 ([14]). The dual problem of (2) reads as

min_{a,y} ‖a − y‖_2^2   s.t.  Aa = 0, y ∈ B.    (3)

Moreover, problem (3) may be written in the more compact form

min_y ‖Ay‖_2^2   s.t.  y ∈ B.    (4)

For both problems, the primal and dual variables are related according to x = −Ay.
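For intuition, the maximization defining f_r can be evaluated without explicitly representing the base polytope: Edmonds' greedy algorithm attains max_{y_r∈B_r} ⟨y_r, x⟩ by visiting the coordinates of x in decreasing order and accumulating the marginal gains of F_r. A minimal sketch (our illustration; the edge-like component in the usage example is a toy of ours):

```python
def lovasz_extension(F, x):
    """Evaluate f(x) = max_{y in B_F} <y, x> by Edmonds' greedy algorithm:
    visit coordinates in decreasing order of x and accumulate the marginal
    gains of F (F is assumed normalized, F(empty) = 0)."""
    order = sorted(range(len(x)), key=lambda i: -x[i])
    f, prefix, prev = 0.0, [], 0.0
    for i in order:
        prefix.append(i)
        cur = F(prefix)
        f += x[i] * (cur - prev)
        prev = cur
    return f

# Edge-like component: F({i}) = F({j}) = 1, F(empty) = F({i, j}) = 0.
# Its Lovász extension is |x_i - x_j|.
F = lambda S: 1 if len(set(S) & {0, 1}) == 1 else 0
print(round(lovasz_extension(F, [0.7, 0.2]), 10))   # 0.5 = |0.7 - 0.2|
```

Note that evaluating the extension at an indicator vector recovers the set-function value, f(1_S) = F(S), which is a quick consistency check for any component one implements.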
In what follows, for notational simplicity, we write g(y) = (1/2)‖Ay‖_2^2.

The AP [15] and RCD algorithms [16] described below provide solutions to the problems (3) and (4), respectively. They both rely on repeated projections Π_{B_r}(·) onto the base polytopes B_r, r ∈ [R]. These projections are typically less computationally intense than projections onto the complete base polytope of F as they involve fewer data dimensions. The projection operation Π_{B_r}(·) requires one to solve a min-norm problem by either exploiting the special form of F_r or by using the general purpose algorithm of Wolfe [21]. The complexity of the method is typically characterized by the number of required projections Π_{B_r}(·).

The AP algorithm. Starting with y = y^(0), iteratively compute a sequence (a^(k), y^(k))_{k=1,2,...} such that for all r ∈ [R], a_r^(k) = y_r^(k−1) − Ay^(k−1)/R and y_r^(k) = Π_{B_r}(a_r^(k)), until a stopping criterion is met.

The RCDM algorithm. In each iteration k, choose uniformly at random a subset of elements in y associated with one atomic function in the decomposition (1), say the one with index r_k. Then, compute the sequence (y^(k))_{k=1,2,...} according to y_{r_k}^(k) = Π_{B_{r_k}}(−Σ_{r≠r_k} y_r^(k−1)) and y_r^(k) = y_r^(k−1) for r ≠ r_k.

Finding an ε-optimal solution for both the AP and RCD methods requires O(N^2 R log(1/ε)) iterations. In each iteration, the AP algorithm computes the projections onto all R base polytopes, while the RCDM only computes one projection.
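As an illustration only (not the implementation evaluated in the experiments), the RCDM update can be sketched with per-block projection oracles. The toy components, helper names, and instance below are ours; the edge-like base polytope {(t, −t) : |t| ≤ 1} and its clipped projection are standard.

```python
import random

def rcdm(blocks, N, iters=3000, seed=0):
    """Sketch of the RCDM for problem (4), min (1/2)||A y||^2 over y in B:
    pick one block r_k uniformly at random and replace y_{r_k} by the
    projection of -(sum of all other blocks) onto its base polytope."""
    rng = random.Random(seed)
    y = [{v: 0.0 for v in sup} for sup, _ in blocks]
    for _ in range(iters):
        k = rng.randrange(len(blocks))
        sup, proj = blocks[k]
        a = {v: -sum(y[r].get(v, 0.0)
                     for r in range(len(blocks)) if r != k) for v in sup}
        y[k] = proj(a)
    x = [0.0] * N                        # primal solution x = -A y
    for yr in y:
        for v, val in yr.items():
            x[v] -= val
    return x

def edge_block(i, j):
    """Base polytope of F({i}) = F({j}) = 1, F({}) = F({i,j}) = 0 is the
    segment {(t, -t) : -1 <= t <= 1}; its Euclidean projection is a clip."""
    def proj(a):
        t = max(-1.0, min(1.0, (a[i] - a[j]) / 2.0))
        return {i: t, j: -t}
    return ([i, j], proj)

def modular_block(i, c):
    """The base polytope of the modular function F(S) = c * 1[i in S] is the
    single point {c}, so its projection is constant."""
    return ([i], lambda a: {i: c})

# Toy instance: one edge {0, 1} plus a unary potential of weight -2 on element 0.
x = rcdm([edge_block(0, 1), modular_block(0, -2.0)], N=2)
print([round(v, 6) for v in x])          # [1.0, 1.0]
```

On this instance the discrete minimizer of F(S) = cut(S) − 2·1[0∈S] is S* = {0, 1}, consistent with both coordinates of x being positive.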
Therefore, as may be seen from Table 1, the sequential AP solver, which computes one projection in each iteration, requires O(N^2 R^2 log(1/ε)) iterations. However, the projections within one iteration of the AP method can be generated in parallel, while the projections performed in the RCDM have to be generated sequentially.

2.1 Incidence Relations and Related Notations

We next formally introduce one of the key concepts used in this work: incidence relations between elements of the ground set and the component submodular functions.

We say that an element i ∈ [N] is incident to a submodular function F iff there exists a set S ⊆ [N]\{i} such that F(S ∪ {i}) ≠ F(S); similarly, we say that the submodular function F is incident to an element i iff i is incident to F. To verify whether an element i is incident to a submodular function F, one only needs to check whether F({i}) = 0 and F([N]) = F([N]\{i}), since for any S ⊆ [N]\{i}

F({i}) ≥ F(S ∪ {i}) − F(S) ≥ F([N]) − F([N]\{i}).

Furthermore, note that if i ∈ [N] is not incident to F_r, then for any y_r ∈ B_r, one has y_{r,i} = 0. Let S_r be the set of all elements incident to F_r. For each element i, denote the number of submodular functions that are incident to i by µ_i = |{r ∈ [R] : i ∈ S_r}|. We also refer to µ_i as the degree of element i. We find it useful to partition the set of submodular functions into different groups. Given a group C ⊆ [R] of submodular functions, we define the degree of the element i within C as µ^C_i = |{r ∈ C : i ∈ S_r}|.

We also define a skewed norm involving two vectors w ∈ R^N_{>0} and z ∈ R^N according to ‖z‖_{2,w} := √(Σ_{i∈[N]} w_i z_i^2).
With a slight abuse of notation, for two vectors θ = (θ_1, θ_2, ..., θ_R) ∈ ⊗_{r=1}^R R^N_{>0} and y ∈ ⊗_{r=1}^R R^N, we also define the norm ‖y‖_{2,θ} := √(Σ_{r∈[R]} ‖y_r‖^2_{2,θ_r}). Which of the norms we refer to should be clear from the context. In addition, we let ‖θ‖_{1,∞} = Σ_{i∈[N]} max_{r∈[R]: i∈S_r} θ_{r,i}. For a closed set K ⊆ ⊗_{r=1}^R R^N and a positive vector θ ∈ ⊗_{r=1}^R R^N_{>0}, the distance between y and K is defined as d_θ(y, K) = min{‖y − z‖_{2,θ} | z ∈ K}. Also, given a set Ω ⊆ R^N, we let Π_{Ω,w}(·) denote the projection operation onto Ω with respect to the norm ‖·‖_{2,w}.

Given a vector w ∈ R^N_{>0}, we also make use of an induced vector I(w) ∈ ⊗_{r=1}^R R^N whose r-th entry satisfies (I(w))_r = w. It is easy to check that ‖I(w)‖_{1,∞} = ‖w‖_1. Of special interest are induced vectors based on pairs of N-dimensional vectors, µ = (µ_1, µ_2, ..., µ_N) and µ^C = (µ^C_1, µ^C_2, ..., µ^C_N). Finally, for w, w′ ∈ R^N, we denote the element-wise power of w by w^α = (w_1^α, w_2^α, ..., w_N^α), for some α ∈ R, and the element-wise product of w and w′ by w ⊙ w′ = (w_1 w′_1, w_2 w′_2, ..., w_N w′_N).

Next, recall that x* is the unique optimal solution of the problem (2) and let Z = {ξ ∈ ⊗_{r=1}^R R^N | Aξ = −x*, ξ_{r,i} = 0 ∀ i ∉ S_r, ∀ r ∈ [R]}.
Then, due to the duality relationship of Lemma 2.1, Ξ = Z ∩ B is the set of optimal solutions {y}.

3 Continuous DSFM Algorithms with Incidence Relations

In what follows, we revisit the AP and CD algorithms and describe how to improve their performance and analytically establish their convergence rates. Our first result introduces a modification of the AP algorithm (3) that exploits incidence relations so as to decrease the required number of iterations from O(N^2 R) to O(N ‖µ‖_1). Our second result is an example that shows that the convergence rates of CD algorithms [11] cannot be directly improved by exploiting the functions' incidence relations even when the incidence matrix is extremely sparse. Our third result is a new algorithm that relies on coordinate descent steps but can be parallelized. In this setting, incidence relations are essential to the parallelization process.

To analyze solvers for the continuous optimization problem (2) that exploit the incidence structure of the functions, we make use of the skewed norm ‖·‖_{2,w} with respect to some positive vector w that accounts for the fact that incidences are, in general, nonuniformly distributed. In this context, the projection Π_{B_r,w}(·) reduces to solving a classical min-norm problem after a simple transformation of the underlying space, which does not incur significant complexity overheads. To see this, note that in order to solve a generic min-norm point problem, one typically uses either Wolfe's algorithm (continuous) or a divide-and-conquer procedure (combinatorial).
The complexity of the former is at most quadratic in F_{r,max} := max_{v,S} |F_r(S ∪ {v}) − F_r(S)| [22], while the complexity of the latter merely depends on log F_{r,max} [14] (see Section A in the Supplement). It is unclear whether including the weight vector w in the projection procedure increases or decreases F_{r,max}. In either case, given that in our derivations all elements of w are contained in [1, max_{i∈[N]} µ_i] instead of N or R, we do not expect to see significant changes in the complexity of the projection operation. Hence, throughout the remainder of our exposition, we regard the projection operation as an oracle and measure the complexity of all algorithms in terms of the number of projections performed.

Also, observe that one may avoid computing projections in skewed-norm spaces by introducing in (2) a weighted rather than an unweighted proximal term. This gives another continuous objective that still provides a solution to the discrete problem (1). Even in this case, we can prove that the numbers of iterations used by the different methods listed in Table 1 remain the same. Furthermore, by combining projections in skewed-norm spaces and weighted proximal terms, it is possible to actually reduce the number of iterations given in Table 1. However, for simplicity, we focus on the objective (2) and projections in skewed-norm spaces. Methods using weighted proximal terms with and without skewed-norm projections are analyzed in a similar manner in Section L of the Supplement.

We make frequent use of the following result, which generalizes Lemma 4.1 of [11].

Lemma 3.1. Let θ ∈ ⊗_{r=1}^R R^N_{>0} and w ∈ R^N_{>0} be two positive vectors. Let y ∈ B and let z be in the base polytope of the submodular function F. Then, there exists a point ξ ∈ B such that Aξ = z and

‖ξ − y‖_{2,θ} ≤ (√‖θ‖_{1,∞} / 2) ‖Ay − z‖_1.
Moreover, ‖ξ − y‖_{2,θ} ≤ (√(‖θ‖_{1,∞} ‖w^{−1}‖_1) / 2) ‖Ay − z‖_{2,w}.

3.1 The Incidence Relation AP (IAP)

The following result establishes the basis of our improved AP method leveraging incidence structures.

Lemma 3.2. The following problem is equivalent to problem (3):

min_{a,y} ‖a − y‖^2_{2,I(µ)}   s.t.  y ∈ B, Aa = 0, and a_{r,i} = 0 ∀ (r, i) : i ∉ S_r, r ∈ [R].    (5)

Let A = {a ∈ ⊗_{r=1}^R R^N | Aa = 0, a_{r,i} = 0 ∀ (r, i) : i ∉ S_r} and A′ = {a ∈ ⊗_{r=1}^R R^N | Aa = 0}. The AP algorithm for problem (5) consists of alternately computing projections between A and B, as opposed to those between A′ and B used in the problem (3). However, as already pointed out, unlike for the classical AP problem (3), the distance in (5) is not Euclidean, and hence the projections may not be orthogonal.

The IAP method for solving (5) proceeds as follows. We begin with a = a^(0) ∈ A, and iteratively compute a sequence (a^(k), y^(k))_{k=1,2,...} as follows: for all r ∈ [R], y_r^(k) = Π_{B_r,µ}(a_r^(k)), where a_{r,i}^(k) = y_{r,i}^(k−1) − µ_i^{−1}(Ay^(k−1))_i for all i ∈ S_r. The key difference between the AP and IAP algorithms is that the latter effectively removes "irrelevant" components of y_r by fixing the irrelevant components of a to 0. In the AP method of Nishihara [15], these components are never zero as they may be "corrupted" by other components during AP iterations.
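For edge-like components, the skewed projection Π_{B_r,µ}(·) admits a closed form, since the base polytope is a one-dimensional segment. The sketch below is our toy illustration of this fact; the component, weights, and names are assumptions of ours, not the paper's implementation.

```python
def project_edge_weighted(a_i, a_j, w_i, w_j):
    """Projection onto the segment {(t, -t) : -1 <= t <= 1} (base polytope of
    an edge-like component) in the skewed norm ||.||_{2,w}: minimize
    w_i (t - a_i)^2 + w_j (-t - a_j)^2 over t in [-1, 1]. The unconstrained
    minimizer is t = (w_i a_i - w_j a_j) / (w_i + w_j); since the feasible set
    is one-dimensional and the objective is convex in t, clipping t is exact."""
    t = (w_i * a_i - w_j * a_j) / (w_i + w_j)
    return max(-1.0, min(1.0, t))

print(round(project_edge_weighted(0.8, 0.2, 1.0, 1.0), 6))   # 0.3  (Euclidean case)
print(round(project_edge_weighted(0.8, 0.2, 3.0, 1.0), 6))   # 0.55 (skewed toward i)
```

With equal weights this reduces to the Euclidean projection t = (a_i − a_j)/2, which illustrates how the skewed norm merely rescales the underlying min-norm computation.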
Removing irrelevant components results in projecting y onto a subspace of lower dimension, which significantly accelerates the convergence of IAP.

Figure 1: Illustration of the IAP method for solving problem (5): The space A is a subspace of A′, which leads to faster convergence of the IAP method when compared to AP.

The analysis of the convergence rate of the IAP method follows a similar outline as that used to analyze (3) in [15]. Following Nishihara et al. [15], we define the following parameter that plays a key role in determining the rate of convergence of the AP algorithm:

κ* := sup_{y ∈ Z∪B\Ξ} d_{I(µ)}(y, Ξ) / max{d_{I(µ)}(y, Z), d_{I(µ)}(y, B)}.

Lemma 3.3 ([15]). If κ* < ∞, the AP algorithm converges linearly with rate 1 − 1/κ*^2. At the k-th iteration, the algorithm outputs a value y^(k) that satisfies

d_{I(µ)}(y^(k), Ξ) ≤ 2 d_{I(µ)}(y^(0), Ξ) (1 − 1/κ*^2)^k.

To apply the above lemma in the IAP setting, one first needs to establish an upper bound on κ*. This bound is given in Lemma 3.4 below.

Lemma 3.4. The parameter κ* is upper bounded as κ* ≤ √(N‖µ‖_1/2) + 1.

By using the above lemma and the bound on κ*, one can establish the following convergence rate for the IAP method.

Theorem 3.5.
After O(N‖µ‖_1 log(1/ε)) iterations, the IAP algorithm for solving problem (5) outputs a pair of points (a, y) that satisfies d_{I(µ)}(y, Ξ) ≤ ε.

Note that in practice, one often has ‖µ‖_1 ≪ NR, which shows that the convergence rate of the AP method for solving the DSFM problem may be significantly improved.

3.2 Sequential Coordinate Descent Algorithms

Unlike the AP algorithm, the CD algorithms by Ene et al. [16] remain unchanged given (4). Our first goal is to establish whether the convergence rate of the CD algorithms can be improved using a parameterization that exploits incidence relations.

The convergence rate of CD algorithms is linear if the objective function is component-wise smooth and ℓ-strongly convex. In our case, g(y) is component-wise smooth, as for any y, z ∈ B that only differ in the r-th block (i.e., y_r ≠ z_r, y_{r′} = z_{r′} for r′ ≠ r), one has

‖∇_r g(y) − ∇_r g(z)‖_2 ≤ ‖y − z‖_2.

Here, ∇_r g denotes the gradient vector associated with the r-th block.

Definition 3.6. We say that the function g(y) is ℓ-strongly convex in ‖·‖_2 if for any y ∈ B,

g(y*) ≥ g(y) + ⟨∇g(y), y* − y⟩ + (ℓ/2)‖y* − y‖_2^2,    (6)

where y* = arg min_{z∈Ξ} ‖z − y‖_2^2, or equivalently, ‖Ay − Ay*‖_2^2 ≥ ℓ‖y* − y‖_2^2. Moreover, we let ℓ* = sup{ℓ : g(y) is ℓ-strongly convex in ‖·‖_2}.

Note that the above definition essentially establishes a form of weak-strong convexity [23]. Then, using standard analytical tools for CD algorithms [24], we can prove the following result [16].

Theorem 3.7.
The RCDM for problem (4) outputs a point y that satisfies E[g(y)] ≤ g(y*) + ε after O((R/ℓ*) log(1/ε)) iterations. The ACDM applied to the problem (4) outputs a point y that satisfies E[g(y)] ≤ g(y*) + ε after O((R/√ℓ*) log(1/ε)) iterations.

To precisely characterize the convergence rate, we need to find an accurate estimate of ℓ*. Ene et al. [11] derived ℓ* ≥ 1/N^2 without taking into account the incidence structure. As sparse incidence side information improves the performance of the AP method, it is of interest to determine if the same can be accomplished for the CD algorithms. Example 3.1 establishes that this is not possible in general if one only relies on ℓ*.

Example 3.1. Consider a DSFM problem with an extremely sparse incidence structure with |S_r| = 2. More precisely, let N = 2n + 1, R = 2n, and ‖µ‖_1 = Σ_{r∈[R]} |S_r| = 4n ≪ NR. Let F_r be incident to the elements {r, r + 1}, for all r ∈ [R], and be such that F_r({r}) = F_r({r + 1}) = 1 and F_r(∅) = F_r({r, r + 1}) = 0. Then, ℓ* < 7/N^2.

Note that the optimal solution of problem (4) for this particular setting equals y* = 0. Let us consider a point y ∈ B specified as follows. First, due to the given incidence relations, the block y_r has two components corresponding to the elements indexed by r and r + 1.
For any r ∈ [R], set

y_{r,r} = −y_{r,r+1} = r/n for r ≤ n,  and  y_{r,r} = −y_{r,r+1} = (2n + 1 − r)/n for r ≥ n + 1.    (7)

Therefore, g(y) = 1/n and ‖y‖_2^2 > (4/3)n, which results in ℓ* < 3/(2n^2) ≤ 7/N^2 for all n ≥ 3.

Example 3.1 only illustrates that an important parameter of CDMs cannot be improved using incidence information; but this does not necessarily imply that a sequential RCDM that uses incidence structures cannot offer better convergence rates than O(N^2 R). In Section E of the Supplement, we present additional experimental evidence that supports our observation, using the setting of Example 3.1. As a final remark, note that Nishihara et al. [15] also proposed a lower bound that does not make use of sparse incidence structures and only works for the AP method.

3.3 New Parallel CD Methods

In what follows, we propose two CDMs which rely on parallel projections and incidence relations. The following observation is key to understanding the proposed approach. Suppose that we have a nonempty group of blocks C ⊆ [R]. Let y, h ∈ ⊗_{r=1}^R R^N. If h_{r,i} is nonzero only for blocks r ∈ C and elements i ∈ S_r, then

g(y + h) = g(y) + ⟨∇g(y), h⟩ + (1/2)‖Ah‖_2^2 ≤ g(y) + Σ_{r∈C} ⟨∇_r g(y), h_r⟩ + (1/2) Σ_{r∈C} ‖h_r‖^2_{2,µ^C}.    (8)

Hence, for all r ∈ C, if we perform projections onto B_r with respect to the norm ‖·‖_{2,µ^C} simultaneously in each iteration of the CDM, convergence is guaranteed as the value of the objective function remains bounded. The smaller the components of µ^C, the faster the convergence. Note that the components of µ^C are the numbers of incidence relations of elements restricted to the set C.
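The inequality (8) can be verified numerically: for g(y) = (1/2)‖Ay‖_2^2 the first-order term equals ⟨Ay, Ah⟩, and the quadratic term (1/2)‖Ah‖_2^2 is bounded via Cauchy–Schwarz by the µ^C-weighted norms. A small sketch of ours, on a hypothetical set of supports:

```python
import random

def check_parallel_bound(supports, N, C, trials=200, seed=1):
    """Check g(y + h) <= g(y) + sum_{r in C} <grad_r g(y), h_r>
                      + (1/2) sum_{r in C} ||h_r||^2_{2, mu^C}
    for g(y) = (1/2)||A y||^2 and random y, h with h supported on the
    blocks in C; mu^C_i counts the blocks of C whose support contains i."""
    rng = random.Random(seed)
    R = len(supports)
    muC = [sum(1 for r in C if i in supports[r]) for i in range(N)]
    for _ in range(trials):
        y = [[rng.uniform(-1, 1) if i in supports[r] else 0.0
              for i in range(N)] for r in range(R)]
        h = [[rng.uniform(-1, 1) if (r in C and i in supports[r]) else 0.0
              for i in range(N)] for r in range(R)]
        Ay = [sum(y[r][i] for r in range(R)) for i in range(N)]
        Ah = [sum(h[r][i] for r in range(R)) for i in range(N)]
        lhs = 0.5 * sum((Ay[i] + Ah[i]) ** 2 for i in range(N))
        # grad_r g(y) is (A y) restricted to S_r, so the linear term is <Ay, Ah>
        rhs = (0.5 * sum(v * v for v in Ay)
               + sum(Ay[i] * Ah[i] for i in range(N))
               + 0.5 * sum(muC[i] * h[r][i] ** 2 for r in C for i in range(N)))
        if lhs > rhs + 1e-9:
            return False
    return True

supports = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]
print(check_parallel_bound(supports, N=4, C=[0, 2]))   # True
```

When the supports of the blocks in C are disjoint, every µ^C_i is at most 1 and the upper bound in (8) is tight, which is exactly why blocks with small support intersections are the right ones to update together.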
Hence, in each iteration, blocks that ought to be updated in parallel are those that correspond to submodular functions whose supports have the smallest possible intersections.

One can select blocks that are to be updated in parallel in a combinatorially specified fashion or in a randomized fashion, as dictated by what we call an α-proper distribution. To describe our parallel RCDM, we first introduce the notion of an α-proper distribution.

Definition 3.8. Let P be a distribution used to sample a group C of blocks. Define θ^P = (θ^P_1, θ^P_2, ..., θ^P_R) such that for r ∈ [R], θ^P_r := E_{C∼P}[µ^C | r ∈ C]. We say that P is an α-proper distribution if for any r ∈ [R] and a given α ∈ (0, 1), we have P(r ∈ C) = α.

We are now ready to describe the parallel RCDM algorithm – Algorithm 1; the description of the parallel ACDM is postponed to Section J of the Supplement.

Algorithm 1: Parallel RCDM for Solving (4)
Input: B, α
0: Initialize y^(0) ∈ B, k ← 0
1: Do the following steps iteratively until the dual gap < ε:
2:   Sample C_{i_k} using some α-proper distribution P
3:   For r ∈ C_{i_k}:
4:     y_r^(k+1) ← Π_{B_r,θ^P_r}(y_r^(k) − (θ^P_r)^{−1} ⊙ ∇_r g(y^(k)))
5:   Set y_r^(k+1) ← y_r^(k) for r ∉ C_{i_k}, k ← k + 1
6: Output y^(k)

Next, we establish strong convexity results for the space ‖·‖_{2,θ^P} by invoking Lemma 3.1.

Lemma 3.9. For any y ∈ B, let y* = arg min_{ξ∈Ξ} ‖ξ − y‖^2_{2,θ^P}.
Then,

‖Ay − Ay*‖_2^2 ≥ (2 / (N‖θ^P‖_{1,∞})) ‖y − y*‖^2_{2,θ^P}.

The convergence rate of Algorithm 1 is established in the next theorem.

Theorem 3.10. At each iteration of Algorithm 1, y^(k) satisfies

E[g(y^(k)) − g(y*) + (1/2) d^2_{θ^P}(y^(k), Ξ)] ≤ [1 − 4α/(N‖θ^P‖_{1,∞} + 2)]^k [g(y^(0)) − g(y*) + (1/2) d^2_{θ^P}(y^(0), Ξ)].

The parameter N‖θ^P‖_{1,∞} is obtained by combining the strong convexity constant and the properties of the sampling distribution P. Small values of ‖θ^P‖_{1,∞} ensure better convergence rates, and we next bound this value.

Lemma 3.11. For any α-proper distribution P and an element i ∈ [N], max_{r∈[R]: i∈S_r} θ^P_{r,i} ≥ max{αµ_i, 1}. Consequently, ‖θ^P‖_{1,∞} ≥ max{α‖µ‖_1, N}.

Without considering incidence relations, i.e., by setting ‖µ‖_1 = NR, one always has ‖θ^P‖_{1,∞} ≥ αNR, which shows that parallelization cannot improve the convergence rate of the RCDM.

The next lemma characterizes an achievable ‖θ^P‖_{1,∞} obtained by choosing P to be a uniform distribution, which, when combined with Theorem 3.10, proves the result in the last column of Table 1.

Lemma 3.12. If C is a set of size 0 < K ≤ R obtained by sampling the K-subsets of [R] uniformly at random, then θ^P_r = ((K−1)/(R−1)) µ + ((R−K)/(R−1)) 1. Moreover, ‖θ^P‖_{1,∞} = ((K−1)/(R−1)) ‖µ‖_1 + ((R−K)/(R−1)) N.

Comparing Lemma 3.11 and Lemma 3.12, we see that the ‖θ^P‖_{1,∞} achieved by sampling uniformly at random is within a factor of two of the lower bound, since α = K/R. A natural question is whether it is possible to devise a better sampling strategy. This question is addressed in Section K of the Supplement, where we relate the sampling problem to equitable coloring [20].
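The closed form of Lemma 3.12 can be sanity-checked by exhaustive enumeration on a tiny instance (ours, purely illustrative); we compare only on the coordinates i ∈ S_r, the ones the skewed projection actually uses.

```python
from itertools import combinations

def theta_uniform(supports, N, K):
    """Compute theta^P_r = E[mu^C | r in C] of Definition 3.8 exactly, for P
    uniform over the K-subsets of [R], by enumerating all groups (small R only)."""
    R = len(supports)
    theta = []
    for r in range(R):
        groups = [C for C in combinations(range(R), K) if r in C]
        avg = [0.0] * N
        for C in groups:
            for i in range(N):
                avg[i] += sum(1 for rr in C if i in supports[rr])
        theta.append([a / len(groups) for a in avg])
    return theta

# Toy instance: four pairwise supports arranged on a cycle, so mu_i = 2 for all i.
supports = [{0, 1}, {1, 2}, {2, 3}, {3, 0}]
N, R, K = 4, 4, 2
mu = [sum(1 for S in supports if i in S) for i in range(N)]
theta = theta_uniform(supports, N, K)
for r in range(R):
    for i in supports[r]:
        predicted = (K - 1) / (R - 1) * mu[i] + (R - K) / (R - 1)
        assert abs(theta[r][i] - predicted) < 1e-9
print(theta[0][0], (K - 1) / (R - 1) * mu[0] + (R - K) / (R - 1))
```

The enumeration matches the stated formula on every support coordinate of this instance, consistent with the conditional-expectation argument behind the lemma.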
By using the Hajnal–Szemerédi Theorem [25], we derived a sufficient condition under which an α-proper distribution P that achieves the lower bound in Lemma 3.11 can be found in polynomial time. We also described a greedy algorithm for minimizing ‖θ^P‖_{1,∞} that empirically converges faster than sampling uniformly at random.

4 Experiments

In what follows, we illustrate the performance of the newly proposed DSFM algorithms on benchmark datasets used for MAP inference in image segmentation [9] and for semi-supervised learning over graphs.^1 More experiments on semi-supervised learning over hypergraphs can be found in Section M of the Supplement.

Figure 2: Image segmentation example. First row: Gap vs the number of iterations × α. Second row: The number of iterations × α vs α. Here, α is the parallelization parameter, while K = αR equals the number of projections that have to be computed in each iteration.

In all the experiments, we evaluated the convergence rate of the algorithms by using the smooth duality gap ν_s and the discrete duality gap ν_d. The primal problem solution equals x = −Ay, so the smooth duality gap can be computed according to ν_s = Σ_r f_r(x) + (1/2)‖x‖^2 − (−(1/2)‖Ay‖^2). Moreover, as the level set S_λ = {v ∈ [N] | x_v > λ} can be easily found based on x, the discrete duality gap can be written as ν_d = min_λ F(S_λ) − Σ_{v∈[N]} min{−x_v, 0}.

MAP inference. We used two images – oct and smallplant – adopted from [14].^2 The images comprise 640 × 427 pixels, so that N = 273,280.
The decomposable submodular functions are constructed following a standard procedure. The first class of functions arises from the 4-neighbor grid graph over the pixels. Each edge corresponds to a pairwise potential between two adjacent pixels i, j of the form exp(−‖v_i − v_j‖²), where v_i is the RGB color vector of pixel i. We split the vertical and horizontal edges into rows and columns, which results in 639 + 426 = 1065 components in the decomposition. Note that within each row or each column, the edges have no overlapping pixels, so the projections of these submodular functions onto the base polytopes reduce to projections onto the base polytopes of edge-like submodular functions. The second class of submodular functions contains clique potentials corresponding to the superpixel regions; specifically, for region r, F_r(S) = |S|(|S_r| − |S|) [26]. These functions give another 500 decomposition components. We applied the divide-and-conquer method of [14] to compute the projections required for this type of submodular function. Note that in each experiment, all components of the submodular function are of nearly the same size, and thus the projections performed for different components incur similar computational costs. As the projections represent the primary computational units, for comparative purposes we use the number of iterations (as in [14, 16]).

We compared five algorithms: RCDM with a sampling distribution P found by the greedy algorithm (RCDM-G), RCDM with uniform sampling (RCDM-U), ACDM with uniform sampling (ACDM-U), AP based on (5) (IAP), and AP based on (3) (AP). Figure 2 depicts the results. In the first row, we compared the convergence rates of the different algorithms for a fixed parallelization parameter α = 0.1. The values on the horizontal axis correspond to # iterations × α, the total number of projections performed divided by R.
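To make the edge grouping described above concrete: grouping the horizontal edges by their left-column index and the vertical edges by their top-row index yields (width − 1) + (height − 1) components, and no two edges in a component touch the same pixel. This is a toy sketch of that grouping (the function name is illustrative; the paper does not prescribe an implementation):

```python
def grid_decomposition(width, height):
    """Group 4-neighbor grid edges into components with pairwise-disjoint
    pixel supports, as in the 639 + 426 = 1065 grouping in the text."""
    components = []
    for i in range(width - 1):   # horizontal edges between columns i, i+1
        components.append([((i, j), (i + 1, j)) for j in range(height)])
    for j in range(height - 1):  # vertical edges between rows j, j+1
        components.append([((i, j), (i, j + 1)) for i in range(width)])
    return components

# For the 640 x 427 images used here: 639 + 426 = 1065 components.
assert len(grid_decomposition(640, 427)) == 1065
```

Because the edges inside each component are pixel-disjoint, projecting onto the component's base polytope decomposes into independent edge-like projections, which is exactly the property exploited above.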
The results are averaged over 10 independent experiments. We observe that the CD-based methods outperform the AP-based methods, and that ACDM-U is the best-performing CD-based method. IAP significantly outperforms AP. Similarly, RCDM-G outperforms RCDM-U.

We also investigated the relationship between the number of iterations and the parameter α. We recorded the number of iterations needed to achieve a smooth and a discrete gap below a given threshold. The results are shown in the second row of Figure 2. We did not plot the curves for the AP-based methods, as they are essentially horizontal lines. Among the CD-based methods, ACDM-U performs best. RCDM-G offers a much better convergence rate than RCDM-U, since the sampling probability P produced by the greedy algorithm leads to a smaller value of ‖θ_P‖_{1,∞} compared to uniform sampling. The reason behind this finding is that the supports of the components in the decomposition are localized, which makes the sampling distribution P obtained from the greedy algorithm highly effective. For RCDM-U, the total number of iterations increases almost linearly with α (= K/R), which confirms the results of Lemma 3.12.

Note that in the above MAP inference examples, another way to decompose the submodular function is available: as there are three natural layers of non-overlapping incidence sets, one may merge all vertical edges, all horizontal edges, and all superpixel regions into three components, respectively. Each of these components is then incident to all pixels, and the results derived in this work reduce to those of the earlier works [14, 16]. However, such a decomposition strongly depends on the particular problem structure and is thus not general for DSFM. The following example of semi-supervised learning over graphs does not contain natural layers for decomposition.

¹The code for this work can be found at https://github.com/lipan00123/DSFM-with-incidence-relations.
²Downloaded from the website of Professor Stefanie Jegelka: http://people.csail.mit.edu/stefje/code.html

Figure 3: Zachary's Karate Club. Left two: Gap vs the number of iterations × α. Right two: The number of iterations × α vs α. Here, α is the parallelization parameter, while K = αR equals the number of projections that have to be computed in each iteration.

Semi-supervised learning. We tested our algorithms on the Zachary's karate club dataset [27]. This dataset is used as a benchmark example for evaluating semi-supervised learning algorithms over graphs [28]. It includes N = 34 vertices and R = 78 submodular functions in the decomposition, each corresponding to one edge in the network.
The objective function of both semi-supervised learning problems may be written as

min_x τ Σ_{r∈[R]} f_r(x) + (1/2) ‖x − x_0‖²,    (9)

where τ is a parameter that needs to be tuned, and x_0 ∈ {−1, 0, 1}^N, so that the nonzero components correspond to the labels known a priori. In our case, as we are only concerned with the convergence rate of the algorithms, we fix τ = 0.1. In the experiments for Zachary's karate club, we set x_0(1) = 1, x_0(34) = −1, and let all other components of x_0 equal zero.

Figure 3 shows the results of the experiments pertaining to Zachary's karate club. In the left two subfigures, we compared the convergence rates of the different algorithms for a fixed parallelization parameter α = 0.1. The values on the horizontal axis correspond to # iterations × α, the total number of projections performed divided by R. In the right two subfigures, we controlled the number of projections executed within one iteration by tuning the parameter α and recorded the number of iterations needed to achieve smooth/discrete gaps below 10⁻³. The values depicted on the vertical axis correspond to # iterations × α, the total number of projections needed to achieve the given accuracy. In all cases, we observe a tendency similar to that of the MAP inference experiments. AP-based methods require more projections than CD-based methods, but IAP consistently outperforms AP, which is consistent with our theoretical results. Among the CD-based methods, ACDM-U offers the best performance in general, and RCDM-G slightly outperforms RCDM-U, since the greedy algorithm used for sampling produces a smaller ‖θ_P‖_{1,∞} than uniform sampling.
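For the graph case, each component F_r is a single-edge cut, so its Lovász extension is f_r(x) = |x_u − x_v| (unit edge weights assumed here), and the objective (9) is cheap to evaluate directly. A minimal sketch, with an illustrative function name and a toy 3-vertex path in place of the karate-club graph:

```python
def ssl_objective(x, x0, edges, tau=0.1):
    # tau * sum_r f_r(x) + (1/2) * ||x - x0||^2, where f_r(x) = |x_u - x_v|
    # is the Lovasz extension of a unit-weight single-edge cut
    tv = sum(abs(x[u] - x[v]) for u, v in edges)
    return tau * tv + 0.5 * sum((a - b) ** 2 for a, b in zip(x, x0))

# endpoints labeled +1 / -1, as in the karate-club setup (tau = 0.1)
value = ssl_objective([1.0, 0.0, -1.0], [1.0, 0.0, -1.0], [(0, 1), (1, 2)])
```

Here the data-fidelity term vanishes (x = x_0) and the total-variation term is 2, so the objective equals 0.2.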
Since the AP-based methods are completely parallelizable, increasing the parameter α does not increase the total number of projections. However, for RCDM-U, the total number of iterations required increases almost linearly with α, which is supported by the result of Lemma 3.12. The performance curve for RCDM-G exhibits large oscillations due to the discrete problem component needed for finding a balanced partition.

5 Acknowledgement

The authors gratefully acknowledge many useful suggestions by the reviewers. This work was supported in part by the NSF grant CCF 15-27636, the NSF Purdue 4101-38050 grant, and the NSF STC center for Science of Information.

References

[1] S. Fujishige, Submodular Functions and Optimization. Elsevier, 2005, vol. 58.

[2] K. Wei, R. Iyer, and J. Bilmes, "Submodularity in data subset selection and active learning," in Proceedings of the International Conference on Machine Learning, 2015, pp. 1954–1963.

[3] P. Li and O. Milenkovic, "Inhomogeneous hypergraph clustering with applications," in Advances in Neural Information Processing Systems, 2017, pp. 2305–2315.

[4] P. Li and O. Milenkovic, "Submodular hypergraphs: p-Laplacians, Cheeger inequalities and spectral clustering," in Proceedings of the International Conference on Machine Learning, 2018, pp. 3014–3023.

[5] P. Kohli, P. H. Torr et al., "Robust higher order potentials for enforcing label consistency," International Journal of Computer Vision, vol. 82, no. 3, pp.
302–324, 2009.

[6] H. Lin and J. Bilmes, "A class of submodular functions for document summarization," in Proceedings of the Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 2011, pp. 510–520.

[7] A. Krause and C. Guestrin, "Near-optimal observation selection using submodular functions," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 7, 2007, pp. 1650–1654.

[8] Y. T. Lee, A. Sidford, and S. C.-w. Wong, "A faster cutting plane method and its implications for combinatorial and convex optimization," in 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2015, pp. 1049–1065.

[9] P. Stobbe and A. Krause, "Efficient minimization of decomposable submodular functions," in Advances in Neural Information Processing Systems, 2010, pp. 2208–2216.

[10] V. Kolmogorov, "Minimizing a sum of submodular functions," Discrete Applied Mathematics, vol. 160, no. 15, pp. 2246–2258, 2012.

[11] A. Ene, H. Nguyen, and L. A. Végh, "Decomposable submodular function minimization: discrete and continuous," in Advances in Neural Information Processing Systems, 2017, pp. 2874–2884.

[12] F. Bach et al., "Learning with submodular functions: A convex optimization perspective," Foundations and Trends in Machine Learning, vol. 6, no. 2-3, pp. 145–373, 2013.

[13] L. Lovász, "Submodular functions and convexity," in Mathematical Programming: The State of the Art. Springer, 1983, pp. 235–257.

[14] S. Jegelka, F. Bach, and S. Sra, "Reflection methods for user-friendly submodular optimization," in Advances in Neural Information Processing Systems, 2013, pp. 1313–1321.

[15] R. Nishihara, S. Jegelka, and M. I. Jordan, "On the convergence rate of decomposable submodular function minimization," in Advances in Neural Information Processing Systems, 2014, pp. 640–648.

[16] A. Ene and H. Nguyen, "Random coordinate descent methods for minimizing decomposable submodular functions," in Proceedings of the International Conference on Machine Learning, 2015, pp. 787–795.

[17] D. R. Karger, "Global min-cuts in RNC, and other ramifications of a simple min-cut algorithm," in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, vol. 93, 1993, pp. 21–30.

[18] C. Chekuri and C. Xu, "Computing minimum cuts in hypergraphs," in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2017, pp. 1085–1100.

[19] J. Djolonga and A. Krause, "Scalable variational inference in log-supermodular models," in Proceedings of the International Conference on Machine Learning, 2015, pp. 1804–1813.

[20] W. Meyer, "Equitable coloring," The American Mathematical Monthly, vol. 80, no. 8, pp. 920–922, 1973.

[21] P. Wolfe, "Finding the nearest point in a polytope," Mathematical Programming, vol. 11, no. 1, pp. 128–149, 1976.

[22] D. Chakrabarty, P. Jain, and P. Kothari, "Provable submodular minimization using Wolfe's algorithm," in Advances in Neural Information Processing Systems, 2014, pp. 802–809.

[23] H. Karimi, J. Nutini, and M. Schmidt, "Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.

[24] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol. 22, no. 2, pp. 341–362, 2012.

[25] A. Hajnal and E. Szemerédi, "Proof of a conjecture of Erdős," Combinatorial Theory and Its Applications, vol. 2, pp. 601–623, 1970.

[26] A. Levinshtein, A. Stere, K. N. Kutulakos, D. J. Fleet, S. J. Dickinson, and K. Siddiqi, "TurboPixels: fast superpixels using geometric flows," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2290–2297, 2009.

[27] W. W. Zachary, "An information flow model for conflict and fission in small groups," Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, 1977.

[28] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.

[29] S. Fujishige and X. Zhang, "New algorithms for the intersection problem of submodular systems," Japan Journal of Industrial and Applied Mathematics, vol. 9, no. 3, p. 369, 1992.

[30] O. Fercoq and P. Richtárik, "Accelerated, parallel, and proximal coordinate descent," SIAM Journal on Optimization, vol. 25, no. 4, pp. 1997–2023, 2015.

[31] H. A. Kierstead, A. V. Kostochka, M. Mydlarz, and E. Szemerédi, "A fast algorithm for equitable coloring," Combinatorica, vol. 30, no. 2, pp. 217–224, 2010.

[32] A. Chambolle and J. Darbon, "On total variation minimization and surface evolution using parametric maximum flows," International Journal of Computer Vision, vol. 84, no. 3, p. 288, 2009.

[33] R. Albert and A.-L. Barabási, "Statistical mechanics of complex networks," Reviews of Modern Physics, vol. 74, no. 1, p. 47, 2002.

[34] M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram, "The total variation on hypergraphs: learning on hypergraphs revisited," in Advances in Neural Information Processing Systems, 2013, pp. 2427–2435.

[35] N. Yadati, M. Nimishakavi, P. Yadav, A. Louis, and P. Talukdar, "HyperGCN: Hypergraph convolutional networks for semi-supervised classification," arXiv preprint arXiv:1809.02589, 2018.