{"title": "Linear-Memory and Decomposition-Invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1009, "abstract": "Recently, several works have shown that natural modifications of the classical conditional gradient method (aka Frank-Wolfe algorithm) for constrained convex optimization, provably converge with a linear rate when the feasible set is a polytope, and the objective is smooth and strongly-convex. However, all of these results suffer from two significant shortcomings: i) large memory requirement due to the need to store an explicit convex decomposition of the current iterate, and as a consequence, large running-time overhead per iteration ii) the worst case convergence rate depends unfavorably on the dimension In this work we present a new conditional gradient variant and a corresponding analysis that improves on both of the above shortcomings. In particular, both memory and computation overheads are only linear in the dimension, and in addition, in case the optimal solution is sparse, the new convergence rate replaces a factor which is at least linear in the dimension in previous works, with a linear dependence on the number of non-zeros in the optimal solution At the heart of our method, and corresponding analysis, is a novel way to compute decomposition-invariant away-steps. While our theoretical guarantees do not apply to any polytope, they apply to several important structured polytopes that capture central concepts such as paths in graphs, perfect matchings in bipartite graphs, marginal distributions that arise in structured prediction tasks, and more. 
Our theoretical findings are complemented by empirical evidence that shows that our method delivers state-of-the-art performance.", "full_text": "Linear-Memory and Decomposition-Invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes

Dan Garber
Toyota Technological Institute at Chicago
dgarber@ttic.edu

Ofer Meshi
Google
meshi@google.com

Abstract

Recently, several works have shown that natural modifications of the classical conditional gradient method (aka Frank-Wolfe algorithm) for constrained convex optimization, provably converge with a linear rate when: i) the feasible set is a polytope, and ii) the objective is smooth and strongly-convex. However, all of these results suffer from two significant shortcomings:

1. large memory requirement due to the need to store an explicit convex decomposition of the current iterate, and as a consequence, large running-time overhead per iteration
2. the worst-case convergence rate depends unfavorably on the dimension

In this work we present a new conditional gradient variant and a corresponding analysis that improves on both of the above shortcomings. In particular:

1. both memory and computation overheads are only linear in the dimension
2. in case the optimal solution is sparse, the new convergence rate replaces a factor which is at least linear in the dimension in previous work, with a linear dependence on the number of non-zeros in the optimal solution

At the heart of our method and corresponding analysis is a novel way to compute decomposition-invariant away-steps. While our theoretical guarantees do not apply to arbitrary polytopes, they apply to several important structured polytopes that capture central concepts such as paths in graphs, perfect matchings in bipartite graphs, marginal distributions that arise in structured prediction tasks, and more.
Our theoretical findings are complemented by empirical evidence which shows that our method delivers state-of-the-art performance.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 Introduction

The efficient reduction of a constrained convex optimization problem to a constrained linear optimization problem is an appealing algorithmic concept, in particular for large-scale problems. The reason is that for many feasible sets of interest, the problem of minimizing a linear function over the set admits much more efficient methods than its non-linear convex counterpart. Prime examples of this phenomenon include various structured polytopes that arise in combinatorial optimization, such as the path polytope of a graph (aka the unit flow polytope), the perfect matching polytope of a bipartite graph, and the base polyhedron of a matroid, for which we have highly efficient combinatorial algorithms for linear minimization that rely heavily on the specific rich structure of the polytope [21]. At the same time, minimizing a non-linear convex function over these sets usually requires the use of generic interior-point solvers that are oblivious to the specific combinatorial structure of the underlying set, and as a result, are often much less efficient. Indeed, it is for this reason that the conditional gradient (CG) method (aka Frank-Wolfe algorithm), a method for constrained convex optimization that is based on solving linear subproblems over the feasible domain, has regained much interest in recent years in the machine learning, signal processing and optimization communities.
It has been recently shown that the method delivers state-of-the-art performance on many problems of interest; see for instance [14, 17, 4, 10, 11, 22, 19, 25, 12, 15].

As part of the regained interest in the conditional gradient method, there is also a recent effort to understand the convergence rates and associated complexities of conditional gradient-based methods, which are in general far less understood than other first-order methods, e.g., the projected gradient method. It is known, already from the first introduction of the method by Frank and Wolfe in the 1950's [5], that the method converges with a rate of roughly O(1/t) for minimizing a smooth convex function over a convex and compact set. However, it is not clear if this convergence rate improves under an additional standard strong-convexity assumption. In fact, certain lower bounds, such as in [18, 8], suggest that such an improvement, even if possible, should come with a worse dependence on the problem's parameters (e.g., the dimension). Nevertheless, over the past years, various works have tried to design natural variants of the CG method that converge provably faster under the strong-convexity assumption, without dramatically increasing the per-iteration complexity of the method. For instance, Guélat and Marcotte [9] showed that a CG variant which uses the concept of away-steps converges exponentially fast in case the objective function is strongly convex, the feasible set is a polytope, and the optimal solution is located in the interior of the set. A similar result was presented by Beck and Teboulle [3], who considered a specific problem they refer to as the convex feasibility problem over an arbitrary convex set. They also obtained a linear convergence rate under the assumption that an optimal solution that is far enough from the boundary of the set exists.
In both of these works, the exponent depends on the distance of the optimal solution from the boundary of the set, which in general can be arbitrarily small. Later, Ahipasaoglu et al. [1] showed that in the specific case of minimizing a smooth and strongly convex function over the unit simplex, a variant of the CG method which also uses away-steps converges with a linear rate. Unfortunately, it is not clear from their analysis how this rate depends on natural parameters of the problem, such as the dimension and the condition number of the objective function.

Recently, Garber and Hazan presented a linearly converging CG variant for polytopes without any restrictions on the location of the optimum [8]. In a later work, Lacoste-Julien and Jaggi [16] gave a refined affine-invariant analysis of an algorithm presented in [9] which also uses away-steps, and showed that it also converges exponentially fast in the same setting as the Garber-Hazan result. More recently, Beck and Shtern [2] gave a different, duality-based, analysis for the algorithm of [9], and showed that it can be applied to a wider class of functions than purely strongly convex functions. However, the explicit dependency of their convergence rate on the dimension is suboptimal compared to [8, 16]. Aside from the polytope case, Garber and Hazan [7] have shown that in case the feasible set is strongly convex and the objective function satisfies certain strong convexity-like properties, the standard CG method converges with an accelerated rate of O(1/t²). Finally, in [6] Garber showed a similar improvement (roughly quadratic) for the spectrahedron -- the set of unit-trace positive semidefinite matrices.

Despite the exponential improvement in convergence rate for polytopes obtained in recent results, they all suffer from two major drawbacks.
First, while in terms of the number of calls per iteration to the linear optimization oracle these methods match the standard CG method, i.e., a single call per iteration, the overhead of other operations, both in terms of running time and memory requirements, is significantly worse. The reason is that in order to apply the so-called away-steps, which all these methods use, they must maintain at all times an explicit decomposition of the current iterate into vertices of the polytope. In the worst case, maintaining such a decomposition and computing the away-steps require both memory and per-iteration runtime overheads that are at least quadratic in the dimension. This is much worse than the standard CG method, whose memory and runtime overheads are only linear in the dimension. Second, the convergence rate of all previous linearly convergent CG methods depends explicitly on the dimension. While it is known that this dependency is unavoidable in certain cases, e.g., when the optimal solution is, informally speaking, dense (see for instance the lower bound in [8]), it is not clear that such an unfavorable dependence is mandatory when the optimum is sparse.

In this paper, we revisit the application of CG variants to smooth and strongly-convex optimization over polytopes. We introduce a new variant which overcomes both of the above shortcomings from which all previous linearly converging variants suffer.
The main novelty of our method, which is the key to its improved performance, is that unlike previous variants, it is decomposition-invariant, i.e., it does not require maintaining an explicit convex decomposition of the current iterate. This principle proves to be crucial both for eliminating the memory and runtime overheads, as well as for obtaining sharper convergence rates for instances that admit a sparse optimal solution.

A detailed comparison of our method to previous art is shown in Table 1. We also provide in Section 5 empirical evidence that the proposed method delivers state-of-the-art performance on several tasks of interest.

Paper                        | #iterations to ε error       | #LOO calls | runtime     | memory
Frank & Wolfe [5]            | βD²/ε                        | 1          | n           | n
Garber & Hazan [8]           | κnD² log(1/ε)                | 1          | n·min(n, t) | n·min(n, t)
Lacoste-Julien & Jaggi [16]  | κnD² log(1/ε)                | 1          | n·min(n, t) | n·min(n, t)
Beck & Shtern [2]            | κn²D² log(1/ε)               | 1          | n·min(n, t) | n·min(n, t)
This paper                   | κ·card(x*)·D² log(1/ε)       | 2          | n           | n

Table 1: Comparison with previous work. We define κ := β/α, we let n denote the dimension, and we let D denote the Euclidean diameter of the polytope. The third column gives the number of calls to the linear optimization oracle per iteration, the fourth column gives the additional arithmetic complexity at iteration t, and the fifth column gives the worst-case memory requirement at iteration t. The bounds for the algorithms of [8, 16, 2], which are independent of t, assume an algorithmic version of Carathéodory's theorem, as fully detailed in [2]. The bound on the number of iterations of [16] depends on the squared inverse pyramidal width of P, which is difficult to evaluate; however, this quantity is at least proportional to n.
While our method is less general than previous ones, i.e., our theoretical guarantees do not hold for arbitrary polytopes, they readily apply to many structured polytopes that capture important concepts such as paths in graphs, perfect matchings in bipartite graphs, Markov random fields, and more.

2 Preliminaries

Throughout this work we let ‖·‖ denote the standard Euclidean norm. Given a point x ∈ R^n, we let card(x) denote the number of non-zero entries of x.

Definition 1. We say that a function f(x): R^n → R is α-strongly convex w.r.t. a norm ‖·‖ if for all x, y ∈ R^n it holds that f(y) ≥ f(x) + ∇f(x)·(y − x) + (α/2)‖x − y‖².

Definition 2. We say that a function f(x): R^n → R is β-smooth w.r.t. a norm ‖·‖ if for all x, y ∈ R^n it holds that f(y) ≤ f(x) + ∇f(x)·(y − x) + (β/2)‖x − y‖².

The first-order optimality condition implies that for an α-strongly convex f, if x* is the unique minimizer of f over a convex and compact set K ⊂ R^n, then for all x ∈ K it holds that

f(x) − f(x*) ≥ (α/2)‖x − x*‖².  (1)

2.1 Setting

In this work we consider the optimization problem min_{x∈P} f(x), where we assume that:

• f(x) is α-strongly convex and β-smooth with respect to the Euclidean norm.
• P is a polytope which satisfies the following two properties:
  1. P can be described algebraically as P = {x ∈ R^n | x ≥ 0, Ax = b}.
  2. All vertices of P lie on the hypercube {0, 1}^n.

We let x* denote the (unique) minimizer of f over P, and we let D denote the Euclidean diameter of P, namely, D = max_{x,y∈P} ‖x − y‖. We let V denote the set of vertices of P, where according to our assumptions it holds that V ⊆ {0, 1}^n.

While the polytopes that satisfy the above assumptions are not completely general, these assumptions already capture several important concepts such as paths in graphs, perfect matchings, Markov random fields, and more.
Indeed, a surprisingly large number of applications from machine learning, signal processing and other domains are formulated as optimization problems in this category (e.g., [13, 15, 16]). We give detailed examples of such polytopes in Section A in the appendix. Importantly, the above assumptions allow us to get rid of the dependency of the convergence rate on certain geometric parameters of the polytope (such as those appearing in [8], or the pyramidal width in [16]), which can be polynomial in the dimension, and hence result in an impractical convergence rate. Finally, for many of these polytopes the vertices are sparse, i.e., for any vertex v ∈ V, card(v) ≪ n. In this case, when the optimum x* can be decomposed as a convex combination of only a few vertices (and is thus sparse itself), we get a sharper convergence rate that depends on the sparsity of x* and not explicitly on the dimension, as in previous work. We believe that our theoretical guarantees could be extended to more general polytopes, as suggested in Section C in the appendix; we leave this extension for future work.

3 Our Approach

In order to better communicate our ideas, we begin by first briefly introducing the standard conditional gradient method and its accelerated away-steps-based variants. We discuss both the blessings and shortcomings of these away-steps-based variants in Subsection 3.1. Then, in Subsection 3.2, we present our new method, a decomposition-invariant away-steps-based conditional gradient algorithm, and discuss how it addresses the shortcomings of previous variants.

3.1 The conditional gradient method and acceleration via away-steps

The standard conditional gradient algorithm is given below (Algorithm 1). It is well known that when setting the step-size η_t in an appropriate way, the worst-case convergence rate of the method is O(βD²/t) [13].
This convergence rate is tight for the method in general; see for instance [18].

Algorithm 1 Conditional Gradient
1: let x_1 be some vertex in V
2: for t = 1, 2, ... do
3:   v_t ← argmin_{v∈V} v·∇f(x_t)
4:   choose a step-size η_t ∈ (0, 1]
5:   x_{t+1} ← x_t + η_t(v_t − x_t)
6: end for

Algorithm 2 Pairwise Conditional Gradient
1: let x_1 be some vertex in V
2: for t = 1, 2, ... do
3:   let Σ_{i=1}^{k_t} λ_t^{(i)} v_t^{(i)} be an explicitly maintained convex decomposition of x_t
4:   v_t^+ ← argmin_{v∈V} v·∇f(x_t)
5:   j_t ← argmin_{j∈[k_t]} v_t^{(j)}·(−∇f(x_t))
6:   choose a step-size η_t ∈ (0, λ_t^{(j_t)}]
7:   x_{t+1} ← x_t + η_t(v_t^+ − v_t^{(j_t)})
8:   update the convex decomposition of x_{t+1}
9: end for

Consider the iterate of Algorithm 1 on iteration t, and let x_t = Σ_{i=1}^k λ_i v_i be its convex decomposition into vertices of the polytope P. Note that Algorithm 1 implicitly discounts each coefficient λ_i by a factor (1 − η_t), in favor of the newly added vertex v_t. A different approach is not to decrease all vertices in the decomposition of x_t uniformly, but to more aggressively decrease vertices that are worse than others with respect to some measure, such as their product with the gradient direction. This key principle proves to be crucial to breaking the 1/t rate of the standard method, and to achieving a linear convergence rate under certain strong-convexity assumptions, as described in the recent works [8, 16, 2]. For instance, in [8] it has been shown, via the introduction of the concept of a Local Linear Optimization Oracle, that using such a non-uniform reweighting rule in fact approximates a certain proximal problem, which, together with the shrinking effect of strong convexity, as captured by Eq. (1), yields a linear convergence rate.
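To make the template concrete, the following is a minimal Python sketch of Algorithm 1 (our illustration, not code from the paper). The linear-minimization oracle `lmo` and the gradient `grad` are assumed to be supplied by the user; for the usage example we instantiate them for the probability simplex, whose LMO simply returns the standard basis vector at the smallest gradient coordinate, and we use the common step-size η_t = 2/(t + 2).

```python
import numpy as np

def conditional_gradient(grad, lmo, x1, num_iters):
    """Algorithm 1 (standard CG): x_{t+1} = x_t + eta_t * (v_t - x_t)."""
    x = x1.astype(float)
    for t in range(1, num_iters + 1):
        v = lmo(grad(x))        # v_t <- argmin_{v in V} v . grad f(x_t)
        eta = 2.0 / (t + 2.0)   # a standard O(1/t) step-size choice
        x = x + eta * (v - x)   # convex combination, so x stays feasible
    return x

# Example instantiation (ours): minimize f(x) = ||x - y||^2 over the
# probability simplex, whose LMO returns the vertex e_i with i = argmin_i g_i.
def simplex_lmo(g):
    v = np.zeros_like(g)
    v[np.argmin(g)] = 1.0
    return v

y = np.array([0.1, 0.6, 0.3])        # y lies in the simplex, so x* = y
grad = lambda x: 2.0 * (x - y)
x1 = np.array([1.0, 0.0, 0.0])       # start at a vertex
x = conditional_gradient(grad, simplex_lmo, x1, 500)
```

Every iterate is a convex combination of vertices, so feasibility is maintained automatically; the O(1/t) rate means the final iterate is close to, but not exactly at, the interior optimum y.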
We refer to these methods as away-step-based CG methods. As a concrete example, which will also serve as a basis for our new method, we describe the pairwise variant recently studied in [16], which applies this principle in Algorithm 2.¹ Note that Algorithm 2 decreases the weight of exactly one vertex in the decomposition: the one with the largest product with the gradient.

¹While the convergence rate of this pairwise variant, established in [16], is significantly worse than that of other away-step-based variants, here we show that a proper analysis yields state-of-the-art performance guarantees.

It is important to note that since previous away-step-based CG variants do not decrease the coefficients in the convex decomposition of the current iterate uniformly, they all must explicitly store and maintain a convex decomposition of the current iterate. This issue raises two main disadvantages:

Superlinear memory and running-time overheads. Storing a decomposition of the current iterate as a convex combination of vertices generally requires O(n²) memory in the worst case. While the away-step-based variants increase the size of the decomposition by at most a single vertex per iteration, they also typically exhibit linear convergence only after performing at least Ω(n) steps [8, 16, 2], and thus this O(n²) estimate still holds. Moreover, since these methods require i) finding the worst vertex in the decomposition, in terms of dot-product with the current gradient direction, and ii) updating this decomposition at each iteration (even when using sophisticated update techniques such as in [2]), the worst-case per-iteration computational overhead is also Ω(n²).

Decomposition-specific performance. The choice of away-step depends on the specific decomposition that is maintained by the algorithm.
Since the feasible point x_t may admit several different convex decompositions, committing to one such decomposition might result in sub-optimal away-steps. As observable in Table 1, for certain problems in which the optimal solution is sparse, all analyses of previous away-step-based variants are significantly suboptimal, since they all depend explicitly on the dimension. This seems to be an unavoidable side-effect of being decomposition-dependent. On the other hand, the fact that our new approach is decomposition-invariant allows us to obtain sharper convergence rates for such instances.

3.2 A new decomposition-invariant pairwise conditional gradient method

Our main observation is that in many cases of interest, given a feasible iterate x_t, one can in fact compute an optimal away-step from x_t without relying on any single specific decomposition. This observation allows us to overcome both of the main disadvantages of previous away-step-based CG variants. Our algorithm, which we refer to as decomposition-invariant pairwise conditional gradient (DICG), is given below in Algorithm 3.

Algorithm 3 Decomposition-invariant Pairwise Conditional Gradient
1: input: sequence of step-sizes {η_t}_{t≥1}
2: let x_0 be an arbitrary point in P
3: x_1 ← argmin_{v∈V} v·∇f(x_0)
4: for t = 1, 2, ... do
5:   v_t^+ ← argmin_{v∈V} v·∇f(x_t)
6:   define the vector ∇̃f(x_t) ∈ R^n as follows: [∇̃f(x_t)]_i := [∇f(x_t)]_i if x_t(i) > 0, and −∞ if x_t(i) = 0
7:   v_t^− ← argmin_{v∈V} v·(−∇̃f(x_t))
8:   choose a new step-size η̃_t using one of the following two options:
     Option 1: predefined step-size -- let δ_t be the smallest natural number such that 2^{−δ_t} ≤ η_t, and set the new step-size η̃_t ← 2^{−δ_t}
     Option 2: line-search -- γ_t ← max{γ ∈ [0, 1] | x_t + γ(v_t^+ − v_t^−) ≥ 0}, η̃_t ← argmin_{η∈(0,γ_t]} f(x_t + η(v_t^+ − v_t^−))
9:   x_{t+1} ← x_t + η̃_t(v_t^+ − v_t^−)
10: end for

The following observation shows the optimality of the away-steps taken by Algorithm 3.

Observation 1 (optimal away-steps in Algorithm 3). Consider an iteration t of Algorithm 3 and suppose that the iterate x_t is feasible. Let x_t = Σ_{i=1}^k λ_i v_i, for some integer k, be an irreducible way of writing x_t as a convex sum of vertices of P, i.e., λ_i > 0 for all i ∈ [k]. Then it holds that ∀i ∈ [k]: v_i·∇f(x_t) ≤ v_t^−·∇f(x_t), and γ_t ≥ min{x_t(i) | i ∈ [n], x_t(i) > 0}.

Proof. Let x_t = Σ_{i=1}^k λ_i v_i be a convex decomposition of x_t into vertices of P, for some integer k, where each λ_i is positive. Note that it must hold that for any j ∈ [n] and any i ∈ [k], x_t(j) = 0 ⇒ v_i(j) = 0, since by our assumption V ⊂ R^n_+. The observation then follows directly from the definition of v_t^−.

We next state the main theorem of this paper, which bounds the convergence rate of Algorithm 3. The proof is provided in Section B.3 in the appendix.
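To make the decomposition-invariant away-step concrete, the following sketch (our illustration, not the authors' code) implements steps 5-7 of Algorithm 3, together with Option 2's feasibility cap γ_t, for the special case of the probability simplex, whose vertices are the standard basis vectors, so both argmins over V reduce to coordinate selections. The masked gradient ∇̃f places −∞ on coordinates where x_t(i) = 0, so the away vertex v_t^− is necessarily supported on the support of x_t.

```python
import numpy as np

def dicg_step_simplex(x, g):
    """Steps 5-7 of DICG (Algorithm 3), specialized to the unit simplex.

    x : current feasible iterate (x >= 0, sum(x) = 1)
    g : gradient of f at x
    Returns the forward vertex v_plus, the away vertex v_minus, and
    gamma, the largest feasible step-size (Option 2's upper bound).
    """
    n = x.shape[0]
    # Step 5: forward vertex, argmin over vertices of v . g
    v_plus = np.zeros(n)
    v_plus[np.argmin(g)] = 1.0
    # Step 6: masked gradient, -inf outside the support of x
    g_masked = np.where(x > 0, g, -np.inf)
    # Step 7: away vertex, argmax of v . g_masked over vertices, i.e., the
    # worst vertex that any irreducible decomposition of x must contain
    v_minus = np.zeros(n)
    v_minus[np.argmax(g_masked)] = 1.0
    # Option 2's cap: largest gamma with x + gamma*(v_plus - v_minus) >= 0
    shrink = (v_minus - v_plus) > 0
    gamma = x[shrink].min() if shrink.any() else 1.0
    return v_plus, v_minus, gamma

# Toy usage (our example): f(x) = ||x - y||^2 with y = (0.2, 0.5, 0.3)
y = np.array([0.2, 0.5, 0.3])
x = np.array([0.5, 0.5, 0.0])
g = 2.0 * (x - y)
v_plus, v_minus, gamma = dicg_step_simplex(x, g)
```

Here the step is computed from x alone; no convex decomposition is stored, yet by Observation 1 the selected v_t^− dominates every vertex appearing in any irreducible decomposition of x_t.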
Consider running Algorithm 3 with\nOption 1 as the step-size, and suppose that 8t 1 : \u2318t =M1/(2pM2)1 M 2\n1 /(4M2) t1\nThen, the iterates of Algorithm 3 are always feasible, and 8t 1:\n\u21b5\n\nexp\u2713\nWe now turn to make several remarks regarding Algorithm 3 and Theorem 1:\nThe so-called dual gap, de\ufb01ned as gt := (xt v+\nt ) \u00b7r f (xt), which serves as a certi\ufb01cate for the\nsub-optimality of the iterates of Algorithm 3, also converges with a linear rate, as we prove in Section\nB.4 in the appendix.\nNote that despite the different parameters of the problem at hand (e.g., \u21b5, , D, card(x\u21e4)), running\nthe algorithm with Option 1 for choosing the step-size, for which the guarantee of Theorem 1\nholds, requires knowing a single parameter, i.e., M1/pM2. In particular, it is an easy consequence\nthat running the algorithm with an estimate M 2 [0.5M1/pM2, M1/pM2], will only affect the\nleading constant in the convergence rate listed in the theorem. Hence, M1/pM2 could be ef\ufb01ciently\nestimated via a logarithmic-scale search.\nTheorem 1 improves signi\ufb01cantly over the convergence rate established for the pairwise conditional\ngradient variant in [16]. In particular, the number of iterations to reach an \u270f error in the analysis of\n[16] depends linearly on |V|!, where |V| is the number of vertices of P.\n4 Analysis\n\nThroughout this section we let ht denote the approximation error of Algorithm 3 on iteration t, for\nany t 1, i.e., ht = f (xt) f (x\u21e4).\n4.1 Feasibility of the iterates generated by Algorithm 3\nWe start by proving that the iterates of Algorithm 3 are always feasible. While feasibility is straight-\nforward when using the the line-search option to set the step-size (Option 2), it is less obvious when\nusing Option 1. We will make use of the following observation, which is a simple consequence of the\noptimal choice of vt and our assumptions on P. A proof is given in Section B.1 in the appendix.\nObservation 2. 
Suppose that on some iteration t of Algorithm 3, the iterate x_t is feasible, and that the step-size is chosen using Option 1. Then, if for all i ∈ [n] for which x_t(i) ≠ 0 it holds that x_t(i) ≥ η̃_t, the following iterate x_{t+1} is also feasible.

Lemma 1 (feasibility of iterates under Option 1). Suppose that the sequence of step-sizes {η_t}_{t≥1} is monotonically non-increasing and contained in the interval [0, 1]. Then, the iterates generated by Algorithm 3, using Option 1 for setting the step-size, are always feasible.

Proof. We are going to prove by induction that on each iteration t there exists a non-negative integer-valued vector s_t ∈ N^n such that for any i ∈ [n] it holds that x_t(i) = 2^{−δ_t}·s_t(i). The lemma then follows since, by definition, η̃_t = 2^{−δ_t}, and by applying Observation 2. The base case t = 1 holds since x_1 is a vertex of P, and thus for any i ∈ [n] we have that x_1(i) ∈ {0, 1} (recall that V ⊆ {0, 1}^n). On the other hand, since η_1 ≤ 1, it follows that δ_1 ≥ 0. Thus, there indeed exists a non-negative integer-valued vector s_1 such that x_1 = 2^{−δ_1}·s_1.

Suppose now that the induction holds for some t ≥ 1. Since, by definition of v_t^−, subtracting η̃_t·v_t^− from x_t can only decrease positive entries of x_t (see the proof of Observation 2), since both v_t^−, v_t^+ are vertices of P (and thus in {0, 1}^n), and since η̃_t = 2^{−δ_t}, it follows that each entry i of x_{t+1} is given by:

x_{t+1}(i) = 2^{−δ_t} · { s_t(i) if v_t^−(i) = v_t^+(i);  s_t(i) − 1 if s_t(i) ≥ 1 and v_t^−(i) = 1 and v_t^+(i) = 0;  s_t(i) + 1 if v_t^−(i) = 0 and v_t^+(i) = 1 }

Thus, x_{t+1} can also be written in the form 2^{−δ_t}·s̃_{t+1} for some s̃_{t+1} ∈ N^n. By definition of δ_t and the monotonicity of {η_t}_{t≥1}, we have that 2^{−δ_t}/2^{−δ_{t+1}} is a positive integer. Thus, setting s_{t+1} = (2^{−δ_t}/2^{−δ_{t+1}})·s̃_{t+1}, the induction holds also for t + 1.

4.2 Bounding the per-iteration error reduction of Algorithm 3

The following technical lemma is the key to deriving the linear convergence rate of our method, and in particular, to deriving the improved dependence on the sparsity of x* instead of the dimension. At a high level, the lemma translates the ℓ2 distance between two feasible points into an ℓ1 distance in a simplex defined over the set of vertices of the polytope.

Lemma 2. Let x, y ∈ P. There exists a way to write x as a convex combination of vertices of P, x = Σ_{i=1}^k λ_i v_i for some integer k, such that y can be written as y = Σ_{i=1}^k (λ_i − Δ_i)v_i + (Σ_{i=1}^k Δ_i)z, with Δ_i ∈ [0, λ_i] ∀i ∈ [k], z ∈ P, and Σ_{i=1}^k Δ_i ≤ √card(y)·‖x − y‖.

The proof is given in Section B.2 in the appendix. The next lemma bounds the per-iteration improvement of Algorithm 3 and is the key step in proving Theorem 1. We defer the rest of the proof of Theorem 1 to Section B.3 in the appendix.

Lemma 3. Consider the iterates of Algorithm 3, when the step-sizes are chosen using Option 1. Let M1 = √(α/(8·card(x*))) and M2 = βD²/2. For any t ≥ 1 it holds that h_{t+1} ≤ h_t − η_t·M1·h_t^{1/2} + η_t²·M2.

Proof. Define Δ_t = √(2·card(x*)·h_t/α), and note that from Eq. (1) we have that Δ_t ≥ √card(x*)·‖x_t − x*‖. As a first step, we are going to show that the point y_t := x_t + Δ_t(v_t^+ − v_t^−) satisfies: y_t·∇f(x_t) ≤ x*·∇f(x_t). From Lemma 2 it follows that we can write x_t as a convex combination x_t = Σ_{i=1}^k λ_i v_i and write x* as x* = Σ_{i=1}^k (λ_i − Δ_i)v_i + (Σ_{i=1}^k Δ_i)z, where Δ_i ∈ [0, λ_i], z ∈ P, and Σ_{i=1}^k Δ_i ≤ Δ_t. It holds that

(y_t − x_t)·∇f(x_t) = Δ_t(v_t^+ − v_t^−)·∇f(x_t) ≤ (Σ_{i=1}^k Δ_i)·(v_t^+ − v_t^−)·∇f(x_t) ≤ Σ_{i=1}^k Δ_i·(z − v_i)·∇f(x_t) = (x* − x_t)·∇f(x_t),

where the first inequality follows since (v_t^+ − v_t^−)·∇f(x_t) ≤ 0, and the second inequality follows from the optimality of v_t^+ and v_t^− (Observation 1). Rearranging, we have that indeed

(x_t + Δ_t(v_t^+ − v_t^−))·∇f(x_t) ≤ x*·∇f(x_t).  (2)

Observe now that from the definition of η̃_t it follows for any t ≥ 1 that η_t/2 ≤ η̃_t ≤ η_t. Using the smoothness of f(x) we have that

h_{t+1} = f(x_t + η̃_t(v_t^+ − v_t^−)) − f(x*)
  ≤ h_t + η̃_t(v_t^+ − v_t^−)·∇f(x_t) + (η̃_t²·β/2)·‖v_t^+ − v_t^−‖²
  ≤ h_t + (η_t/2)·(v_t^+ − v_t^−)·∇f(x_t) + (η_t²·β·D²)/2
  = h_t + (η_t/(2Δ_t))·((x_t + Δ_t(v_t^+ − v_t^−)) − x_t)·∇f(x_t) + (η_t²·β·D²)/2
  ≤ h_t + (η_t/(2Δ_t))·(x* − x_t)·∇f(x_t) + (η_t²·β·D²)/2
  ≤ h_t − (η_t/(2Δ_t))·h_t + (η_t²·β·D²)/2,

where the first inequality follows from the β-smoothness of f, the second inequality follows since (v_t^+ − v_t^−)·∇f(x_t) ≤ 0 and η_t/2 ≤ η̃_t ≤ η_t, the third inequality follows from Eq. (2), and the last inequality follows from the convexity of f(x). Finally, plugging in the value of Δ_t gives (η_t/(2Δ_t))·h_t = η_t·√(α/(8·card(x*)))·h_t^{1/2} = η_t·M1·h_t^{1/2}, which completes the proof.

5 Experiments

In this section we illustrate the performance of our algorithm in numerical experiments. We use the two experimental settings from [16], which include a constrained Lasso problem and a video co-localization problem. In addition, we test our algorithm on a learning problem related to an optical character recognition (OCR) task from [23]. In each setting we compare the performance of our algorithm (DICG) to standard conditional gradient (CG), as well as to the fast away (ACG) and pairwise (PCG) variants [16]. For the baselines in the first two settings we use the publicly available code from [16], to which we add our own implementation of Algorithm 3. Similarly, for the OCR problem we extend code from [20], kindly provided by the authors.
For all algorithms we use line-search to set the step size.

[Figure 1 appears here: duality gap (log scale) for CG, ACG, PCG and DICG in three settings (Lasso, video co-localization, OCR), plotted versus iterations / effective passes (top row) and versus wall-clock time (bottom row).]

Figure 1: Duality gap g_t vs. iterations (top) and time (bottom) in various settings.

Lasso In the first example the goal is to solve the problem min_{x∈M} ‖Āx − b̄‖², where M is a scaled ℓ1 ball. Notice that the constraint set M does not match the required structure of P; however, with a simple change of variables we can obtain an equivalent optimization problem over the simplex. We generate the random matrix Ā and vector b̄ as in [16]. In Figure 1 (left, top) we observe that our algorithm (DICG) converges similarly to the pairwise variant PCG and faster than the other baselines. This is expected, since the away direction v_t^− in DICG (Algorithm 3) is equivalent to the away direction in PCG (Algorithm 2) in the case of simplex constraints.

Video co-localization The second example is a quadratic program over the flow polytope, originally proposed in [15]. This is an instance of P that is mentioned in Section A in the appendix.
As can be seen in Figure 1 (middle, top), in this setting our proposed algorithm significantly outperforms the baselines, as a result of finding a better away direction v_t^−. Figure 1 (middle, bottom) shows convergence on a time scale, where the difference between the algorithms is even larger. One reason for this difference is the costly search over the history of vertices maintained by the baseline algorithms. Specifically, the number of stored vertices grows quickly with the number of iterations, reaching 1222 for away steps and 1438 for pairwise steps (out of 2000 iterations).

OCR We next conduct experiments on a structured SVM learning problem resulting from an OCR task. The constraints in this setting are the marginal polytope corresponding to a chain graph over the letters of a word (see [23]), and the objective function is quadratic. Notice that the marginal polytope has a concise characterization in this case and also satisfies our assumptions (see Section A in the appendix for more details). For this problem we actually run Algorithm 3 in a block-coordinate fashion, where blocks correspond to training examples in the dual SVM formulation [17, 20]. In Figure 1 (right, top) we see that our DICG algorithm is comparable to the PCG algorithm and faster than the other baselines on the iteration scale. Figure 1 (right, bottom) demonstrates that in terms of actual running time we get a noticeable speedup compared to all baselines. We point out that for this OCR problem, ACG and PCG each require about 5GB of memory to store the explicit decomposition in the implementation of [20]. In comparison, our algorithm requires 220MB of memory to store the current iterate, and the other variables in the code require 430MB (common to all algorithms), so using DICG results in significant memory savings.

6 Extensions

Our results are readily extendable in two important directions.
First, we can relax the strong convexity requirement on f(x) and handle a broader class of functions, namely the class considered in [2]. Second, we extend the line-search variant of Algorithm 3 to handle arbitrary polytopes, although without convergence guarantees, which are left as future work. Both extensions are presented in full detail in Section C of the appendix.

References

[1] S. Damla Ahipasaoglu, Peng Sun, and Michael J. Todd. Linear convergence of a modified Frank-Wolfe algorithm for computing minimum-volume enclosing ellipsoids. Optimization Methods and Software, 23(1):5–19, 2008.

[2] Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. arXiv preprint arXiv:1504.05002, 2015.

[3] Amir Beck and Marc Teboulle. A conditional gradient method with linear rate of convergence for solving convex linear systems. Math. Meth. of OR, 59(2):235–247, 2004.

[4] Miroslav Dudík, Zaïd Harchaoui, and Jérôme Malick. Lifted coordinate descent for learning with trace-norm regularization. Journal of Machine Learning Research - Proceedings Track, 22:327–336, 2012.

[5] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:149–154, 1956.

[6] Dan Garber. Faster projection-free convex optimization over the spectrahedron. arXiv preprint arXiv:1605.06203, 2016.

[7] Dan Garber and Elad Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pages 541–549, 2015.

[8] Dan Garber and Elad Hazan. A linearly convergent variant of the conditional gradient algorithm under strong convexity, with applications to online and stochastic optimization. SIAM Journal on Optimization, 26(3):1493–1528, 2016.

[9] Jacques GuéLat and Patrice Marcotte.
Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1), 1986.

[10] Zaïd Harchaoui, Matthijs Douze, Mattis Paulin, Miroslav Dudík, and Jérôme Malick. Large-scale image classification with trace-norm regularization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2012.

[11] Elad Hazan and Satyen Kale. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

[12] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. CoRR, abs/1602.02101, 2016.

[13] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013.

[14] Martin Jaggi and Marek Sulovský. A simple algorithm for nuclear norm regularized problems. In Proceedings of the 27th International Conference on Machine Learning, ICML, 2010.

[15] Armand Joulin, Kevin Tang, and Li Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In Computer Vision–ECCV 2014, pages 253–268. Springer, 2014.

[16] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504, 2015.

[17] Simon Lacoste-Julien, Martin Jaggi, Mark W. Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013.

[18] Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. CoRR, abs/1309.5550, 2013.

[19] Sören Laue.
A hybrid algorithm for convex semidefinite optimization. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.

[20] Anton Osokin, Jean-Baptiste Alayrac, Puneet K. Dokania, and Simon Lacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In International Conference on Machine Learning (ICML), 2016.

[21] A. Schrijver. Combinatorial Optimization - Polyhedra and Efficiency. Springer, 2003.

[22] Shai Shalev-Shwartz, Alon Gonen, and Ohad Shamir. Large-scale convex minimization with a low-rank constraint. In Proceedings of the 28th International Conference on Machine Learning, ICML, 2011.

[23] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems. MIT Press, 2003.

[24] M. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., Hanover, MA, USA, 2008.

[25] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. J. Mach. Learn. Res., 13(1):1–26, January 2012.