{"title": "Greedy Algorithms for Cone Constrained Optimization with Convergence Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 773, "page_last": 784, "abstract": "Greedy optimization methods such as Matching Pursuit (MP) and Frank-Wolfe (FW) algorithms regained popularity in recent years due to their simplicity, effectiveness and theoretical guarantees. MP and FW address optimization over the linear span and the convex hull of a set of atoms, respectively. In this paper, we consider the intermediate case of optimization over the convex cone, parametrized as the conic hull of a generic atom set, leading to the first principled definitions of non-negative MP algorithms for which we give explicit convergence rates and demonstrate excellent empirical performance. In particular, we derive sublinear (O(1/t)) convergence on general smooth and convex objectives, and linear convergence (O(e^{-t})) on strongly convex objectives, in both cases for general sets of atoms. Furthermore, we establish a clear correspondence of our algorithms to known algorithms from the MP and FW literature. Our novel algorithms and analyses target general atom sets and general objective functions, and hence are directly applicable to a large variety of learning settings.", "full_text": "Greedy Algorithms for Cone Constrained\nOptimization with Convergence Guarantees\n\nFrancesco Locatello\n\nMPI for Intelligent Systems - ETH Zurich\n\nMichael Tschannen\n\nETH Zurich\n\nlocatelf@ethz.ch\n\nmichaelt@nari.ee.ethz.ch\n\nGunnar R\u00e4tsch\n\nETH Zurich\n\nMartin Jaggi\n\nEPFL\n\nraetsch@inf.ethz.ch\n\nmartin.jaggi@epfl.ch\n\nAbstract\n\nGreedy optimization methods such as Matching Pursuit (MP) and Frank-Wolfe\n(FW) algorithms regained popularity in recent years due to their simplicity, effec-\ntiveness and theoretical guarantees. MP and FW address optimization over the\nlinear span and the convex hull of a set of atoms, respectively. 
In this paper, we\nconsider the intermediate case of optimization over the convex cone, parametrized\nas the conic hull of a generic atom set, leading to the \ufb01rst principled de\ufb01nitions\nof non-negative MP algorithms for which we give explicit convergence rates and\ndemonstrate excellent empirical performance. In particular, we derive sublinear\n(O(1/t)) convergence on general smooth and convex objectives, and linear con-\nvergence (O(e\u2212t)) on strongly convex objectives, in both cases for general sets\nof atoms. Furthermore, we establish a clear correspondence of our algorithms\nto known algorithms from the MP and FW literature. Our novel algorithms and\nanalyses target general atom sets and general objective functions, and hence are\ndirectly applicable to a large variety of learning settings.\n\n1\n\nIntroduction\n\nIn recent years, greedy optimization algorithms have attracted signi\ufb01cant interest in the domains\nof signal processing and machine learning thanks to their ability to process very large data sets.\nArguably two of the most popular representatives are Frank-Wolfe (FW) [12, 21] and Matching\nPursuit (MP) algorithms [34], in particular Orthogonal MP (OMP) [9, 49]. While the former targets\nminimization of a convex function over bounded convex sets, the latter apply to minimization over a\nlinear subspace. In both cases, the domain is commonly parametrized by a set of atoms or dictionary\nelements, and in each iteration, both algorithms rely on querying a so-called linear minimization\noracle (LMO) to \ufb01nd the direction of steepest descent in the set of atoms. The iterate is then updated\nas a linear or convex combination, respectively, of previous iterates and the newly obtained atom\nfrom the LMO. The particular choice of the atom set allows to encode structure such as sparsity and\nnon-negativity (of the atoms) into the solution. 
This enables control of the trade-off between the\namount of structure in the solution and approximation quality via the number of iterations, which\nwas found useful in a large variety of use cases including structured matrix and tensor factorizations\n[50, 53, 54, 18].\nIn this paper, we target an important \u201cintermediate case\u201d between the two domain parameterizations\ngiven by the linear span and the convex hull of an atom set, namely the parameterization of the\noptimization domain as the conic hull of a possibly in\ufb01nite atom set. In this case, the solution\ncan be represented as a non-negative linear combination of the atoms, which is desirable in many\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fapplications, e.g., due to the physics underlying the problem at hand, or for the sake of interpretability.\nConcrete examples include unmixing problems [11, 16, 3], model selection [33], and matrix and\ntensor factorizations [4, 24]. However, existing convergence analyses do not apply to the currently\nused greedy algorithms. In particular, all existing MP variants for the conic hull case [5, 38, 52] are\nnot guaranteed to converge and may get stuck far away from the optimum (this can be observed in\nthe experiments in Section 6). From a theoretical perspective, this intermediate case is of paramount\ninterest in the context of MP and FW algorithms. Indeed, the atom set is not guaranteed to contain\nan atom aligned with a descent direction for all possible suboptimal iterates, as is the case when the\noptimization domain is the linear span or the convex hull of the atom set [39, 32]. 
Hence, while conic\nconstraints have been widely studied in the context of a manifold of different applications, none of\nthe existing greedy algorithms enjoys explicit convergence rates.\nWe propose and analyze new MP algorithms tailored for the minimization of smooth convex functions\nover the conic hull of an atom set. Speci\ufb01cally, our key contributions are:\n\n\u2022 We propose the \ufb01rst (non-orthogonal) MP algorithm for optimization over conic hulls\nguaranteed to converge, and prove a corresponding sublinear convergence rate with ex-\nplicit constants. Surprisingly, convergence is achieved without increasing computational\ncomplexity compared to ordinary MP.\n\u2022 We propose new away-step, pairwise, and fully corrective MP variants, inspired by variants\nof FW [28] and generalized MP [32], respectively, that allow for different degrees of weight\ncorrections for previously selected atoms. We derive corresponding sublinear and linear (for\nstrongly convex objectives) convergence rates that solely depend on the geometry of the\natom set.\n\u2022 All our algorithms apply to general smooth convex functions. This is in contrast to all prior\nwork on non-negative MP, which targets quadratic objectives [5, 38, 52]. Furthermore, if\nthe conic hull of the atom set equals its linear span, we recover both algorithms and rates\nderived in [32] for generalized MP variants.\n\u2022 We make no assumptions on the atom set which is simply a subset of a Hilbert space, in\n\nparticular we do not assume the atom set to be \ufb01nite.\n\nBefore presenting our algorithms (Section 3) along with the corresponding convergence guarantees\n(Section 4), we brie\ufb02y review generalized MP variants. 
A detailed discussion of related work can be found in Section 5, followed by illustrative experiments on a least squares problem on synthetic data, and on non-negative matrix factorization as well as non-negative garrote logistic regression as application examples on real data (numerical evaluations of more applications and of the dependency between the constants in the rates and empirical convergence can be found in the supplementary material).

Notation. Given a non-empty subset A of some Hilbert space, let conv(A) be the convex hull of A, and let lin(A) denote its linear span. Given a closed set A, we call its diameter diam(A) = max_{z1,z2 ∈ A} ‖z1 − z2‖ and its radius radius(A) = max_{z ∈ A} ‖z‖. ‖x‖A := inf{c > 0 : x ∈ c · conv(A)} is the atomic norm of x over a set A (also known as the gauge function of conv(A)). We call a subset A of a Hilbert space symmetric if it is closed under negation.

2 Review of Matching Pursuit Variants

Let H be a Hilbert space with associated inner product ⟨x, y⟩, ∀ x, y ∈ H. The inner product induces the norm ‖x‖² := ⟨x, x⟩, ∀ x ∈ H. Let A ⊂ H be a compact set (the "set of atoms" or dictionary) and let f : H → R be convex and L-smooth (L-Lipschitz gradient in the finite-dimensional case). If H is an infinite-dimensional Hilbert space, then f is assumed to be Fréchet differentiable. The generalized MP algorithm studied in [32], presented in Algorithm 1, solves the following optimization problem:

min_{x ∈ lin(A)} f(x).   (1)

In each iteration, MP queries a linear minimization oracle (LMO) solving the following linear problem:

LMO_A(y) := arg min_{z ∈ A} ⟨y, z⟩   (2)

for a given query y ∈ H. 
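For a finite dictionary, the LMO in (2) reduces to one matrix-vector product followed by an argmin. The following is a minimal numpy sketch of this step (the array layout, names, and random example are our own, not from the paper):

```python
import numpy as np

def lmo(atoms, y):
    """LMO_A(y) = argmin_{z in A} <y, z> for a finite dictionary whose
    atoms are the columns of `atoms` (a d x n array)."""
    scores = atoms.T @ y                 # <y, a_i> for every atom a_i
    return atoms[:, int(np.argmin(scores))]

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 12))
A = np.hstack([A, -A])                   # a symmetric atom set (closed under negation)
g = rng.standard_normal(5)
z = lmo(A, g)
# For a symmetric atom set the LMO applied to a gradient always returns a
# (non-ascent) direction, i.e. <g, z> <= 0 -- the alignment property discussed below.
assert np.dot(g, z) <= 1e-12
```

Note that for a symmetric set the minimum of the scores is never positive, which is exactly why MP over lin(A) never stalls; the conic case studied in this paper loses this property.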
The MP update step minimizes a quadratic upper bound g_{xt}(x) = f(xt) + ⟨∇f(xt), x − xt⟩ + (L/2)‖x − xt‖² of f at xt, where L is an upper bound on the smoothness constant of f with respect to a chosen norm ‖ · ‖. Optimizing this norm problem instead of f directly allows for substantial efficiency gains in the case of complicated f. For symmetric A and for f(x) = ½‖y − x‖², y ∈ H, Algorithm 1 recovers MP (Variant 0) [34] and OMP (Variant 1) [9, 49]; see [32] for details.

Algorithm 1 Norm-Corrective Generalized Matching Pursuit
1: init x0 ∈ lin(A), and S := {x0}
2: for t = 0 . . . T
3:   Find zt := (Approx-)LMO_A(∇f(xt))
4:   S := S ∪ {zt}
5:   Let b := xt − (1/L)∇f(xt)
6:   Variant 0: Update xt+1 := arg min_{z := xt + γzt, γ ∈ R} ‖z − b‖²
7:   Variant 1: Update xt+1 := arg min_{z ∈ lin(S)} ‖z − b‖²
8:   Optional: Correction of some/all atoms z0...t
9: end for

Approximate linear oracles. Solving the LMO defined in (2) exactly is often hard in practice, in particular when applied to matrix (or tensor) factorization problems, while approximate versions can be much more efficient. Algorithm 1 allows for an approximate LMO. For a given quality parameter δ ∈ (0, 1] and a given direction d ∈ H, the approximate LMO for Algorithm 1 returns a vector z̃ ∈ A such that

⟨d, z̃⟩ ≤ δ⟨d, z⟩,   (3)

relative to z = LMO_A(d) being an exact solution.

Discussion and limitations of MP. The analysis of the convergence of Algorithm 1 in [32] critically relies on the assumption that the origin is in the relative interior of conv(A) with respect to its linear span. 
This assumption originates from the fact that the convergence of MP- and FW-type algorithms fundamentally depends on an alignment assumption between the search direction returned by the LMO (i.e., zt in Algorithm 1) and the gradient of the objective at the current iterate (see the third premise in [39]). Specifically, for Algorithm 1, the LMO is assumed to select a descent direction, i.e., ⟨∇f(xt), zt⟩ < 0, so that the resulting weight (i.e., γ for Variant 0) is always positive. In this spirit, Algorithm 1 is a natural candidate to minimize f over the conic hull of A. However, if the optimization domain is a cone, the alignment assumption does not hold, as there may be non-stationary points x in the conic hull of A for which min_{z∈A}⟨∇f(x), z⟩ = 0. Algorithm 1 is therefore not guaranteed to converge when applied to conic problems. The same issue arises for essentially all existing non-negative variants of MP, see, e.g., Alg. 2 in [38] and Alg. 2 in [52]. We now present modifications addressing this issue, along with the resulting MP-type algorithms for conic problems and corresponding convergence guarantees.

3 Greedy Algorithms on Conic Hulls

The cone cone(A − y) tangent to the convex set conv(A) at a point y is formed by the half-lines emanating from y and intersecting conv(A) in at least one point distinct from y. Without loss of generality we consider 0 ∈ A and assume the set cone(A) (i.e., y = 0) to be closed. If A is finite, the cone constraint can be written as cone(A) := {x : x = Σ_{i=1}^{|A|} αi ai s.t. ai ∈ A, αi ≥ 0 ∀ i}. We consider conic optimization problems of the form:

min_{x ∈ cone(A)} f(x).   (4)

Note that if the set A is symmetric, or if the origin is in the relative interior of conv(A) w.r.t. its linear span, then cone(A) = lin(A). 
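For finite A, testing membership in cone(A) is itself a small non-negative least-squares problem. A sketch follows; the helper name and the projected-gradient solver are our own illustration, not one of the paper's algorithms:

```python
import numpy as np

def in_cone(atoms, x, iters=500, tol=1e-10):
    """Feasibility test x in cone(A) for a finite atom matrix (columns are atoms),
    via projected gradient on min_{alpha >= 0} 0.5*||atoms @ alpha - x||^2."""
    alpha = np.zeros(atoms.shape[1])
    step = 1.0 / np.linalg.norm(atoms, 2) ** 2      # 1/L for this quadratic
    for _ in range(iters):
        grad = atoms.T @ (atoms @ alpha - x)
        alpha = np.maximum(alpha - step * grad, 0.0)  # project onto alpha >= 0
    return float(np.sum((atoms @ alpha - x) ** 2)) < tol

E = np.eye(2)                                  # atoms e1, e2: cone(A) is the first quadrant
assert in_cone(E, np.array([2.0, 3.0]))        # a non-negative combination exists
assert not in_cone(E, np.array([-1.0, 0.5]))   # negative coordinate: outside the cone
```

The same non-negative least-squares structure reappears later as the inner problem of the fully corrective variant.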
We will show later how our results recover known MP rates when the origin is in the relative interior of conv(A).

As a first algorithm to solve problems of the form (4), we present the Non-Negative Generalized Matching Pursuit (NNMP) in Algorithm 2, which is an extension of MP to general f and non-negative weights.

Algorithm 2 Non-Negative Matching Pursuit
1: init x0 = 0 ∈ A
2: for t = 0 . . . T
3:   Find z̄t := (Approx-)LMO_A(∇f(xt))
4:   zt = arg min_{z ∈ {z̄t, −xt/‖xt‖A}} ⟨∇f(xt), z⟩
5:   γ := ⟨−∇f(xt), zt⟩ / (L‖zt‖²)
6:   Update xt+1 := xt + γzt
7: end for

Figure 1: Two-dimensional example for TA(xt), where A = {a1, a2}, for three different iterates x0, x1 and x2. The shaded area corresponds to TA(xt) and the white area to lin(A) \ TA(xt).

Discussion: Algorithm 2 differs from Algorithm 1 (Variant 0) in line 4, adding the iteration-dependent atom −xt/‖xt‖A to the set of possible search directions.¹ We use the atomic norm for the normalization because it yields the best constant in the convergence rate. In practice, one can replace it with the Euclidean norm, which is often much less expensive to compute. This iteration-dependent additional search direction allows the algorithm to reduce the weights of previously selected atoms, letting it "move back" towards the origin while maintaining the cone constraint. This idea is informally explained here and formally studied in Section 4.1.

¹This additional direction makes sense only if xt ≠ 0. Therefore, we set −xt/‖xt‖A = 0 if xt = 0, i.e., no direction is added.

Recall the alignment assumption on the search direction and the gradient of the objective at the current iterate discussed in Section 2 (see also [39]). Algorithm 2 obeys this assumption. The intuition behind this is the following. Whenever xt is not a minimizer of (4) and min_{z∈A}⟨∇f(xt), z⟩ = 0, the vector −xt/‖xt‖A is aligned with ∇f(xt) (i.e., ⟨∇f(xt), −xt/‖xt‖A⟩ < 0), preventing the algorithm from stopping at a suboptimal iterate. To make this intuition more formal, let us define the set of feasible descent directions of Algorithm 2 at a point x ∈ cone(A) as:

TA(x) := { d ∈ H : ∃ z ∈ A ∪ {−x/‖x‖A} s.t. ⟨d, z⟩ < 0 }.   (5)

If at some iteration t = 0, 1, . . . the gradient ∇f(xt) is not in TA(xt), Algorithm 2 terminates, as min_{z∈A}⟨d, z⟩ = 0 and ⟨d, −xt⟩ ≥ 0 (which yields zt = 0). Even though, in general, not every direction in H is a feasible descent direction, ∇f(xt) ∉ TA only occurs if xt is a constrained minimum of Equation (4):

Lemma 1. If x̃ ∈ cone(A) and ∇f(x̃) ∉ TA, then x̃ is a solution to min_{x∈cone(A)} f(x).

Initializing Algorithm 2 with x0 = 0 guarantees that the iterates xt always remain inside cone(A) even though this is not enforced explicitly (by convexity of f; see the proof of Theorem 2 in Appendix D for details).

Limitations of Algorithm 2: Let us call active the atoms which have nonzero weights in the representation xt = Σ_{i=0}^{t−1} αi zi computed by Algorithm 2. Formally, the set of active atoms is defined as S := {zi : αi > 0, i = 0, 1, . . . , t − 1}. The main drawback of Algorithm 2 is that when the direction −xt/‖xt‖A is selected, the weight of all active atoms is reduced. 
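To make Algorithm 2 concrete, the following is our own minimal numpy sketch for a finite atom set and a least-squares objective, using the Euclidean norm in place of the atomic norm for the extra direction, as the discussion above permits; names and problem sizes are our choices:

```python
import numpy as np

def nnmp_least_squares(atoms, y, L=1.0, T=500):
    """Sketch of Algorithm 2 (NNMP) for f(x) = 0.5*||y - x||^2 over cone(A),
    where the columns of `atoms` form the atom set A."""
    x = np.zeros(atoms.shape[0])
    for _ in range(T):
        g = x - y                                          # gradient of the objective
        cands = [atoms[:, int(np.argmin(atoms.T @ g))]]    # LMO on A
        nx = np.linalg.norm(x)
        if nx > 0:
            cands.append(-x / nx)                          # extra direction -x_t/||x_t||
        z = min(cands, key=lambda c: float(np.dot(g, c)))
        if np.dot(g, z) >= 0:                              # no feasible descent direction
            break
        gamma = -np.dot(g, z) / (L * np.dot(z, z))         # closed-form line search
        x = x + gamma * z
    return x

rng = np.random.default_rng(1)
A = np.abs(rng.standard_normal((10, 30)))                  # atoms in the first orthant
A /= np.linalg.norm(A, axis=0)                             # unit-norm atoms
y = A @ np.maximum(rng.standard_normal(30), 0.0)           # y in cone(A) by construction
x = nnmp_least_squares(A, y)
assert 0.5 * np.sum((x - y) ** 2) < 0.05 * 0.5 * np.sum(y ** 2)
```

Observe that a step along the extra direction rescales xt toward the origin, i.e., it shrinks the weights of all previously selected atoms at once.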
This can lead to the algorithm alternately selecting −xt/‖xt‖A and an atom from A, thereby slowing down convergence in a similar manner as the zig-zagging phenomenon well known in the Frank-Wolfe framework [28]. In order to achieve faster convergence, we introduce the corrective variants of Algorithm 2.

3.1 Corrective Variants

To achieve faster (linear) convergence (see Section 4.2), we introduce variants of Algorithm 2, termed Away-steps MP (AMP) and Pairwise MP (PWMP), presented in Algorithm 3. Here, inspired by the away-steps and pairwise variants of FW [12, 28], instead of reducing the weights of the active atoms uniformly as in Algorithm 2, the LMO is queried a second time on the active set S to identify the direction of steepest ascent in S. This allows, at each iteration, to reduce the weight of a previously selected atom (AMP) or to swap weight between atoms (PWMP). This selective "reduction" or "swap" of weight helps to avoid the zig-zagging phenomenon which prevents Algorithm 2 from converging linearly.

At each iteration, Algorithm 3 updates the weights of zt and vt as αzt = αzt + γ and αvt = αvt − γ, respectively. To ensure that xt+1 ∈ cone(A), γ has to be clipped according to the weight currently on vt, i.e., γmax = αvt. If γ = γmax, we set αvt = 0 and remove vt from S, as the atom vt is no longer active. If dt ∈ A (i.e., we take a regular MP step and not an away step), the line search is unconstrained (i.e., γmax = ∞).

For both algorithm variants, the second LMO query increases the computational complexity. Note that an exact search on S is feasible in practice, as |S| has at most t elements at iteration t. Taking on an additional computational burden allows updating the weights of all active atoms in the spirit of OMP. 
This approach is implemented in the Fully Corrective MP (FCMP), Algorithm 4.

Algorithm 3 Away-steps (AMP) and Pairwise (PWMP) Non-Negative Matching Pursuit
1: init x0 = 0 ∈ A, and S := {x0}
2: for t = 0 . . . T
3:   Find zt := (Approx-)LMO_A(∇f(xt))
4:   Find vt := (Approx-)LMO_S(−∇f(xt))
5:   S := S ∪ {zt}
6:   AMP: dt = arg min_{d ∈ {zt, −vt}} ⟨∇f(xt), d⟩
7:   PWMP: dt = zt − vt
8:   γ := min{⟨−∇f(xt), dt⟩/(L‖dt‖²), γmax}  (γmax: see text)
9:   Update αzt, αvt and S according to γ  (see text)
10:  Update xt+1 := xt + γdt
11: end for

Algorithm 4 Fully Corrective Non-Negative Matching Pursuit (FCMP)
1: init x0 = 0 ∈ A, S := {x0}
2: for t = 0 . . . T
3:   Find zt := (Approx-)LMO_A(∇f(xt))
4:   S := S ∪ {zt}
5:   Variant 0: xt+1 = arg min_{x ∈ cone(S)} ‖x − (xt − (1/L)∇f(xt))‖²
6:   Variant 1: xt+1 = arg min_{x ∈ cone(S)} f(x)
7:   Remove atoms with zero weights from S
8: end for

At each iteration, Algorithm 4 maintains the set of active atoms S by adding zt and removing atoms with zero weights after the update. In Variant 0, the algorithm minimizes the quadratic upper bound g_{xt}(x) on f at xt (see Section 2), imitating a gradient descent step with projection onto a "varying" target, i.e., cone(S). In Variant 1, the original objective f is minimized over cone(S) at each iteration, which is in general more efficient than minimizing f over cone(A) using a generic solver for cone constrained problems. For f(x) = ½‖y − x‖², y ∈ H, Variant 1 recovers Algorithm 1 in [52] and the OMP variant in [5], which both only apply to this specific objective f.

3.2 Computational Complexity

We briefly discuss the computational complexity of the algorithms we introduced. For H = R^d, sums and inner products have cost O(d). Let us assume that each call of the LMO has cost C on the set A and O(td) on S. Variants 0 and 1 of FCMP solve a cone problem at each iteration with cost h0 and h1, respectively. In general, h0 can be much smaller than h1. In Table 1 we report the cost per iteration for every algorithm along with the asymptotic convergence rates derived in Section 4.

Table 1: Computational complexity versus convergence rate (see Section 4) for strongly convex objectives

algorithm    cost per iteration    convergence          good steps k(t)
NNMP         C + O(d)              O(1/t)               -
PWMP         C + O(d + td)         O(e^{−β k(t)})       k(t) ≥ t/(3|A|! + 1)
AMP          C + O(d + td)         O(e^{−(β/2) k(t)})   k(t) ≥ t/2
FCMP v. 0    C + O(d) + h0         O(e^{−β k(t)})       k(t) ≥ t/(3|A|! + 1)
FCMP v. 1    C + O(d) + h1         O(e^{−β k(t)})       k(t) = t

4 Convergence Rates

In this section, we present convergence guarantees for Algorithms 2, 3, and 4. All proofs are deferred to the Appendix in the supplementary material. We write x⋆ ∈ arg min_{x ∈ cone(A)} f(x) for an optimal solution. Our rates will depend on the atomic norm of the solution and the iterates of the respective algorithm variant:

ρ = max{‖x⋆‖A, ‖x0‖A, . . . , ‖xT‖A}.   (6)

If the optimum is not unique, we consider x⋆ to be one of largest atomic norm. A more intuitive and looser notion is to simply upper-bound ρ by the diameter of the level set of the initial iterate x0 measured by the atomic norm. 
Then, boundedness follows since the presented method is a descent method (due to Lemma 1 and line search on the quadratic upper bound, each iteration strictly decreases the objective, and our method stops only at the optimum). This justifies the statement f(xt) ≤ f(x0). Hence, ρ must be bounded for any sequence of iterates produced by the algorithm, and the convergence rates presented in this section are valid as T goes to infinity. A similar notion to measure the convergence of MP was established in [32]. All of our algorithms and rates can be made affine invariant. We defer this discussion to Appendix B.

4.1 Sublinear Convergence

We now present the convergence results for the non-negative and Fully-Corrective Matching Pursuit algorithms. Sublinear convergence of Algorithm 3 is addressed in Theorem 3.

Theorem 2. Let A ⊂ H be a bounded set with 0 ∈ A, ρ := max{‖x⋆‖A, ‖x0‖A, . . . , ‖xT‖A} and let f be L-smooth over ρ conv(A ∪ −A). Then, Algorithms 2 and 4 converge for t ≥ 0 as

f(xt) − f(x⋆) ≤ 4((2/δ) Lρ² radius(A)² + ε0) / (δt + 4),

where δ ∈ (0, 1] is the relative accuracy parameter of the employed approximate LMO (see Equation (3)).

Relation to FW rates. By rescaling A by a large enough factor τ > 0, FW with τA as atom set could in principle be used to solve (4). In fact, for large enough τ, only the constraints of (4) become active when minimizing f over conv(τA). The sublinear convergence rate obtained with this approach is, up to constants, identical to that in Theorem 2 for our MP variants, see [21]. However, as the correct scaling is unknown, one has to either take the risk of choosing τ too small and hence failing to recover an optimal solution of (4), or rely on a too large τ, which can result in slow convergence. 
In contrast, knowledge of ρ is not required to run our MP variants.

Relation to MP rates. If A is symmetric, we have that lin(A) = cone(A), and it is easy to show that the additional direction −xt/‖xt‖ in Algorithm 2 is never selected. Therefore, Algorithm 2 becomes equivalent to Variant 0 of Algorithm 1, while Variant 1 of Algorithm 1 is equivalent to Variant 0 of Algorithm 4. The rate specified in Theorem 2 hence generalizes the sublinear rate in [32, Theorem 2] for symmetric A.

4.2 Linear Convergence

We start by recalling some of the geometric complexity quantities that were introduced in the context of FW and are adapted here to the optimization problem we aim to solve (minimization over cone(A) instead of conv(A)).

Directional Width. The directional width of a set A w.r.t. a direction r ∈ H is defined as:

dirW(A, r) := max_{s,v ∈ A} ⟨r/‖r‖, s − v⟩.   (7)

Pyramidal Directional Width [28]. The pyramidal directional width of a set A with respect to a direction r and a reference point x ∈ conv(A) is defined as:

PdirW(A, r, x) := min_{S ∈ Sx} dirW(S ∪ {s(A, r)}, r),   (8)

where Sx := {S | S ⊂ A and x is a proper convex combination of all the elements in S} and s(A, r) := arg max_{s ∈ A} ⟨r/‖r‖, s⟩.

Inspired by the notion of pyramidal width in [28], which is the minimal pyramidal directional width computed over the set of feasible directions, we now define the cone width of a set A, where only the generating faces (g-faces) of cone(A) (instead of the faces of conv(A)) are considered. Before doing so, we introduce the notions of face, generating face, and feasible direction.

Face of a convex set. Let us consider a set K with a k-dimensional affine hull along with a point x ∈ K. Then, K is a k-dimensional face of conv(A) if K = conv(A) ∩ {y : ⟨r, y − x⟩ = 0} for some normal vector r, and conv(A) is contained in the half-space determined by r, i.e., ⟨r, y − x⟩ ≤ 0, ∀ y ∈ conv(A). Intuitively, given a set conv(A), one can think of conv(A) as a dim(conv(A))-dimensional face of itself, of an edge on the border of the set as a 1-dimensional face, and of a vertex as a 0-dimensional face.

Face of a cone and g-faces. Similarly, a k-dimensional face of a cone is an open and unbounded set cone(A) ∩ {y : ⟨r, y − x⟩ = 0} for some normal vector r such that cone(A) is contained in the half-space determined by r. We define the generating faces of a cone as:

g-faces(cone(A)) := {B ∩ conv(A) : B ∈ faces(cone(A))}.

Note that g-faces(cone(A)) ⊂ faces(conv(A)) and conv(A) ∈ g-faces(cone(A)). Furthermore, for each K ∈ g-faces(cone(A)), cone(K) is a k-dimensional face of cone(A).

We now introduce the notion of feasible directions. A direction d is feasible from x ∈ cone(A) if it points inwards cone(A), i.e., if ∃ ε > 0 s.t. x + εd ∈ cone(A). Since a face of the cone is itself a cone, if a direction is feasible from x ∈ cone(K) \ 0, it is feasible from every positive rescaling of x. We can therefore consider only the feasible directions on the generating faces (which are closed and bounded sets). Finally, we define the cone width of A.

Cone Width.

CWidth(A) := min_{K ∈ g-faces(cone(A)), x ∈ K, r ∈ cone(K−x)\{0}} PdirW(K ∩ A, r, x).   (9)

We are now ready to show the linear convergence of Algorithms 3 and 4.

Theorem 3. Let A ⊂ H be a bounded set with 0 ∈ A and let the objective function f : H → R be both L-smooth and µ-strongly convex over ρ conv(A ∪ −A). 
Then, the suboptimality of the iterates of Algorithms 3 and 4 decreases geometrically at each step in which γ < αvt (henceforth referred to as "good steps") as:

εt+1 ≤ (1 − β) εt,   (10)

where β := δ² µ CWidth(A)² / (L diam(A)²) ∈ (0, 1], εt := f(xt) − f(x⋆) is the suboptimality at step t, and δ ∈ (0, 1] is the relative accuracy parameter of the employed approximate LMO (3). For AMP (Algorithm 3), βAMP = β/2. If µ = 0, Algorithm 3 converges with rate O(1/k(t)), where k(t) is the number of "good steps" up to iteration t.

Discussion. To obtain a linear convergence rate, one needs to upper-bound the number of "bad steps" t − k(t) (i.e., steps with γ ≥ αvt). We have that k(t) = t for Variant 1 of FCMP (Algorithm 4), k(t) ≥ t/2 for AMP (Algorithm 3), and k(t) ≥ t/(3|A|! + 1) for PWMP (Algorithm 3) and Variant 0 of FCMP (Algorithm 4). This yields a global linear convergence rate of εt ≤ ε0 exp(−β k(t)). The bound for PWMP is very loose and only meaningful for finite sets A. However, it can be observed in the experiments in the supplementary material (Appendix A) that only a very small fraction of iterations result in bad PWMP steps in practice. Further note that Variant 1 of FCMP (Algorithm 4) does not produce bad steps. Also note that the bounds on the number of good steps given above are the same as for the corresponding FW variants and are obtained using the same (purely combinatorial) arguments as in [28].

Relation to previous MP rates. 
The linear convergence of the generalized (not non-negative) MP\nvariants studied in [32] crucially depends on the geometry of the set which is characterized by the\nMinimal Directional Width mDW(A):\n\nmDW(A) := min\nd\u2208lin(A)\nd(cid:54)=0\n\nz\u2208A(cid:104) d\n\n(cid:107)d(cid:107) , z(cid:105) .\n\nmax\n\n(11)\n\nThe following Lemma relates the Cone Width with the minimal directional width.\nLemma 4. If the origin is in the relative interior of conv(A) with respect to its linear span, then\ncone(A) = lin(A) and CWidth(A) = mDW(A).\nNow, if the set A is symmetric or, more generally, if cone(A) spans the linear space lin(A) (which\nimplies that the origin is in the relative interior of conv(A)), there are no bad steps. Hence, by\nLemma 4, the linear rate obtained in Theorem 3 for non-negative MP variants generalizes the one\npresented in [32, Theorem 7] for generalized MP variants.\n\n7\n\n\fRelation to FW rates. Optimization over conic hulls with non-negative MP is more similar to FW\nthan to MP itself in the following sense. For MP, every direction in lin(A) allows for unconstrained\nsteps, from any iterate xt. In contrast, for our non-negative MPs, while some directions allow for\nunconstrained steps from some iterate xt, others are constrained, thereby leading to the dependence\nof the linear convergence rate on the cone width, a geometric constant which is very similar in spirit\nto the Pyramidal Width appearing in the linear convergence bound in [28] for FW. Furthermore, as\nfor Algorithm 3, the linear rate of Away-steps and Pairwise FW holds only for good steps. We \ufb01nally\nrelate the cone width with the Pyramidal Width [28]. The Pyramidal Width is de\ufb01ned as\n\nPWidth(A) :=\n\nP dirW (K \u2229 A, r, x).\n\nmin\nx\u2208K\n\nK\u2208faces(conv(A))\nr\u2208cone(K\u2212x)\\{0}\n\nWe have CWidth(A) \u2265 PWidth(A) as the minimization in the de\ufb01nition (9) of CWidth(A) is only\nover the subset g-faces(cone(A)) of faces(conv(A)). 
As a consequence, the decrease per iteration\ncharacterized in Theorem 3 is larger than what one could obtain with FW on the rescaled convex set\n\u03c4A (see Section 4.1 for details about the rescaling). Furthermore, the decrease characterized in [28]\nscales as 1/\u03c4 2 due to the dependence on 1/ diam(conv(A))2.\n\n5 Related Work\n\nThe line of recent works by [44, 46, 47, 48, 37, 32] targets the generalization of MP from the\nleast-squares objective to general smooth objectives and derives corresponding convergence rates\n(see [32] for a more in-depth discussion). However, only little prior work targets MP variants with\nnon-negativity constraint [5, 38, 52]. In particular, the least-squares objective was addressed and\nno rigorous convergence analysis was carried out. [5, 52] proposed an algorithm equivalent to our\nAlgorithm 4 for the least-squares case. More speci\ufb01cally, [52] then developed an acceleration heuristic,\nwhereas [5] derived a coherence-based recovery guarantee for sparse linear combinations of atoms.\nApart from MP-type algorithms, there is a large variety of non-negative least-squares algorithms,\ne.g., [30], in particular also for matrix and tensor spaces. The gold standard in factorization problems\nis projected gradient descent with alternating minimization, see [43, 4, 45, 23]. Other related works\nare [40], which is concerned with the feasibility problem on symmetric cones, and [19], which\nintroduces a norm-regularized variant of problem (4) and solves it using FW on a rescaled convex\nset. To the best of our knowledge, in the context of MP-type algorithms, we are the \ufb01rst to combine\ngeneral convex objectives with conic constraints and to derive corresponding convergence guarantees.\n\nBoosting:\nIn an earlier line of work, a \ufb02avor of the generalized MP became popular in the context\nof boosting, see [35]. 
The literature on boosting is vast; we refer to [42, 35, 7] for a general overview. Taking the optimization perspective given in [42], boosting is an iterative greedy algorithm minimizing a (strongly) convex objective over the linear span of a possibly infinite set called the hypothesis class. The convergence analysis crucially relies on the assumption that the origin is in the relative interior of the hypothesis class, see Theorem 1 in [17]. Indeed, Algorithm 5.2 of [35] might not converge if the alignment assumption of [39] is violated. Here, we managed to relax this assumption while preserving essentially the same asymptotic rates as in [35, 17]. Our work is therefore also relevant in the context of (non-negative) boosting.

6 Illustrative Experiments

We illustrate the performance of the presented algorithms on three different exemplary tasks, showing that our algorithms are competitive with established baselines across a wide range of objective functions, domains, and data sets while not being specifically tailored to any of these tasks (see Section 3.2 for a discussion of the computational complexity of the algorithms). Additional experiments targeting KL divergence NMF, non-negative tensor factorization, and hyperspectral image unmixing can be found in the appendix.

Synthetic data. We consider minimizing the least-squares objective over the conic hull of 100 unit-norm vectors sampled at random in the first orthant of R^50. We compare the convergence of Algorithms 2, 3, and 4 with the Fast Non-Negative MP (FNNOMP) of [52], and Variant 3 (line search) of the FW algorithm in [32] on the atom set rescaled by τ = 10‖y‖ (see Section 4.1), observing linear convergence for our corrective variants. Figure 2 shows the suboptimality ε_t, averaged over 20 realizations of A and y, as a function of the iteration t. As expected, FCMP achieves the fastest convergence, followed by PWMP, AMP, and NNMP.
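This synthetic setup is easy to reproduce. The sketch below runs a plain forward non-negative MP update (pick the atom best aligned with the negative gradient, then take a non-negative line-search step) on a least-squares instance over a conic hull; it is only a simplified flavor of Algorithm 2, without the away or corrective steps of the paper's PWMP/AMP/FCMP variants, and the function name is ours.

```python
import numpy as np

def nnmp_least_squares(A, y, iters=200):
    """Simplified forward non-negative MP for f(x) = 0.5*||y - x||^2
    over cone(A), where A is a (d, n) matrix of unit-norm atom columns.
    """
    x = np.zeros(A.shape[0])
    for _ in range(iters):
        grad = x - y                    # gradient of the LS objective
        z = A[:, np.argmax(-grad @ A)]  # LMO: atom most aligned with -grad
        gamma = max(0.0, z @ (y - x))   # exact line search, clipped at 0
        if gamma == 0.0:                # no descent direction in the cone
            break
        x = x + gamma * z
    return x

# y lies in the conic hull of 100 random non-negative unit-norm atoms
# in R^50 (as in the experiment), so the residual shrinks steadily.
rng = np.random.default_rng(1)
A = np.abs(rng.normal(size=(50, 100)))
A /= np.linalg.norm(A, axis=0)
y = A @ np.maximum(rng.normal(size=100), 0)
x = nnmp_least_squares(A, y)
print(np.linalg.norm(y - x) / np.linalg.norm(y))  # relative residual
```

Each accepted step decreases the residual norm monotonically, since the exact line-search update gives ‖r′‖² = ‖r‖² − ⟨r, z⟩²; the corrective variants in the paper additionally reweight or remove previously selected atoms.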
FNNOMP, in contrast, gets stuck: [52] only show that the algorithm terminates, not that it converges to the optimum.

Figure 2: Synthetic data experiment.

Non-negative matrix factorization. The second task consists of decomposing a given matrix into the product of two non-negative matrices as in Equation (1) of [20]. We consider the intersection of the positive semidefinite cone and the positive orthant. We parametrize the set A as the set of matrices obtained as an outer product of vectors from A1 = {z ∈ R^k : z_i ≥ 0 ∀i} and A2 = {z ∈ R^d : z_i ≥ 0 ∀i}. The LMO is approximated using a truncated power method [55], and we perform atom correction with greedy coordinate descent (see, e.g., [29, 18]) to obtain a better objective value while maintaining the same (small) number of atoms. We consider three different datasets: the Reuters Corpus², the CBCL face dataset³, and the KNIX dataset⁴. The subsample of the Reuters corpus we used is a term-frequency matrix of 7,769 documents and 26,001 words. The CBCL face dataset is composed of 2,492 images of 361 pixels each, arranged into a matrix. The KNIX dataset contains 24 MRI slices of a knee, arranged in a matrix of size 262,144 × 24. Pixels are divided by their overall mean intensity. For interpretability reasons, there is interest in decomposing MRI data into non-negative factorizations [25]. We compare PWMP and FCMP against the multiplicative (mult) and the alternating (als) algorithms of [4], and the greedy coordinate descent (GCD) of [20]. Since the Reuters corpus is much larger than the CBCL and KNIX datasets, we only used the GCD, for which a fast implementation in C is available. We report the objective value for fixed values of the rank in Table 2, showing that FCMP outperforms all the baselines across all the datasets. PWMP achieves the smallest error on the Reuters corpus.

Non-negative garrote.
We consider the non-negative garrote, a common approach to model order selection [6]. We evaluate NNMP, PWMP, and FCMP in the experiment described in [33], where the non-negative garrote is used to perform model order selection for logistic regression (i.e., for a non-quadratic objective function). We evaluated training and test accuracy on 100 random splits of the sonar dataset from the UCI machine learning repository. In Table 3 we compare the median classification accuracy of our algorithms with that of the cyclic coordinate descent algorithm (NNG) from [33].

Table 2: Objective value for least-squares non-negative matrix factorization with rank K.

algorithm | Reuters (K = 10) | CBCL (K = 10) | CBCL (K = 50) | KNIX (K = 10)
mult      | -                | 2.4241e3      | 1.1405e3      | 2.4471e03
als       | -                | 2.73e3        | 3.84e3        | 2.7292e03
GCD       | 5.9799e5         | 2.2372e3      | 806           | 2.2372e03
PWMP      | 5.9591e5         | 2.2494e3      | 789.901       | 2.2494e03
FCMP      | 5.9762e5         | 2.2364e3      | 786.15        | 2.2364e03

Table 3: Logistic regression with non-negative garrote, median ± std. dev.

algorithm | training accuracy | test accuracy
NNMP      | 0.8345 ± 0.0242   | 0.7419 ± 0.0389
PWMP      | 0.8379 ± 0.0240   | 0.7419 ± 0.0392
FCMP      | 0.8345 ± 0.0238   | 0.7419 ± 0.0403
NNG       | 0.8069 ± 0.0518   | 0.7258 ± 0.0602

7 Conclusion

In this paper, we considered greedy algorithms for optimization over a convex cone, parametrized as the conic hull of a generic atom set. We presented a novel formulation of NNMP along with a comprehensive convergence analysis. Furthermore, we introduced corrective variants with linear convergence guarantees, and verified this convergence rate in numerical applications.
We believe that the generality of our novel analysis will be useful to design new, fast algorithms with convergence guarantees, and to study the convergence of existing heuristics, in particular in the context of non-negative matrix and tensor factorization.

² http://www.nltk.org/book/ch02.html
³ http://cbcl.mit.edu/software-datasets/FaceData2.html
⁴ http://www.osirix-viewer.com/resources/dicom-image-library/

References

[1] Animashree Anandkumar, Rong Ge, Daniel J Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.

[2] Mário César Ugulino Araújo, Teresa Cristina Bezerra Saldanha, Roberto Kawakami Harrop Galvao, Takashi Yoneyama, Henrique Caldas Chame, and Valeria Visani. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemometrics and Intelligent Laboratory Systems, 57(2):65–73, 2001.

[3] Jonas Behr, André Kahles, Yi Zhong, Vipin T Sreedharan, Philipp Drewe, and Gunnar Rätsch. MITIE: Simultaneous RNA-seq-based transcript identification and quantification in multiple samples. Bioinformatics, 29(20):2529–2538, 2013.

[4] Michael W Berry, Murray Browne, Amy N Langville, V Paul Pauca, and Robert J Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, 2007.

[5] Alfred M Bruckstein, Michael Elad, and Michael Zibulevsky. On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Transactions on Information Theory, 54(11):4813–4820, 2008.

[6] Peter Bühlmann and Bin Yu. Boosting, model selection, lasso and nonnegative garrote.
Technical Report 127, Seminar für Statistik, ETH Zürich, 2005.

[7] Peter Bühlmann and Bin Yu. Boosting. Wiley Interdisciplinary Reviews: Computational Statistics, 2(1):69–74, 2010.

[8] Martin Burger. Infinite-dimensional optimization and optimal design. 2003.

[9] Sheng Chen, Stephen A Billings, and Wan Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.

[10] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 92(3):708–721, 2009.

[11] Ernie Esser, Yifei Lou, and Jack Xin. A method for finding structured sparse solutions to nonnegative least squares problems with applications. SIAM Journal on Imaging Sciences, 6(4):2010–2046, 2013.

[12] M Frank and P Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 1956.

[13] Nicolas Gillis. Successive nonnegative projection algorithm for robust nonnegative blind source separation. SIAM Journal on Imaging Sciences, 7(2):1420–1450, 2014.

[14] Nicolas Gillis and François Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105, 2012.

[15] Nicolas Gillis, Da Kuang, and Haesun Park. Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 53(4):2066–2078, 2015.

[16] Nicolas Gillis and Robert Luce. A fast gradient method for nonnegative sparse regression with self dictionary. arXiv preprint arXiv:1610.01349, 2016.

[17] Alexander Grubb and J Andrew Bagnell. Generalized boosting algorithms for convex optimization.
arXiv preprint arXiv:1105.2054, 2011.

[18] Xiawei Guo, Quanming Yao, and James T Kwok. Efficient sparse low-rank tensor completion using the Frank-Wolfe algorithm. In AAAI Conference on Artificial Intelligence, 2017.

[19] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.

[20] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1064–1072. ACM, 2011.

[21] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML 2013 - Proceedings of the 30th International Conference on Machine Learning, 2013.

[22] Hyunsoo Kim, Haesun Park, and Lars Elden. Non-negative tensor factorization based on alternating large-scale non-negativity-constrained least squares. In Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2007), pages 1147–1151. IEEE, 2007.

[23] Jingu Kim, Yunlong He, and Haesun Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014.

[24] Jingu Kim and Haesun Park. Fast nonnegative tensor factorization with an active-set-like method. In High-Performance Scientific Computing, pages 311–326. Springer, 2012.

[25] Ivica Kopriva and Andrzej Cichocki. Nonlinear band expansion and 3D nonnegative tensor factorization for blind decomposition of magnetic resonance image of the brain. In International Conference on Latent Variable Analysis and Signal Separation, pages 490–497. Springer, 2010.

[26] Abhishek Kumar, Vikas Sindhwani, and Prabhanjan Kambadur.
Fast conical hull algorithms for near-separable non-negative matrix factorization. In ICML (1), pages 231–239, 2013.

[27] Simon Lacoste-Julien and Martin Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. In NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends, December 2013.

[28] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In NIPS 2015, pages 496–504, 2015.

[29] Sören Laue. A hybrid algorithm for convex semidefinite optimization. In ICML, 2012.

[30] Charles L Lawson and Richard J Hanson. Solving Least Squares Problems, volume 15. SIAM, 1995.

[31] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.

[32] Francesco Locatello, Rajiv Khanna, Michael Tschannen, and Martin Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[33] Enes Makalic and Daniel F Schmidt. Logistic regression with the nonnegative garrote. In Australasian Joint Conference on Artificial Intelligence, pages 82–91. Springer, 2011.

[34] Stéphane Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[35] Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning, pages 118–183. Springer, 2003.

[36] José M P Nascimento and José M B Dias. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE Transactions on Geoscience and Remote Sensing, 43(4):898–910, 2005.

[37] Hao Nguyen and Guergana Petrova. Greedy strategies for convex optimization.
Calcolo, pages 1–18, 2014.

[38] Robert Peharz, Michael Stark, and Franz Pernkopf. Sparse nonnegative matrix factorization using l0-constraints. In Proceedings of MLSP, pages 83–88. IEEE, August 2010.

[39] Javier Pena and Daniel Rodriguez. Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. arXiv preprint arXiv:1512.06142, 2015.

[40] Javier Pena and Negar Soheili. Solving conic systems via projection and rescaling. Mathematical Programming, pages 1–25, 2016.

[41] Aleksei Pogorelov. Extrinsic Geometry of Convex Surfaces, volume 35. American Mathematical Society, 1973.

[42] Gunnar Rätsch, Sebastian Mika, Manfred K Warmuth, et al. On the convergence of leveraging. In NIPS, pages 487–494, 2001.

[43] F Sha, L K Saul, and Daniel D Lee. Multiplicative updates for nonnegative quadratic programming in support vector machines. Advances in Neural Information Processing Systems, 15, 2002.

[44] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010.

[45] Amnon Shashua and Tamir Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning, pages 792–799. ACM, 2005.

[46] Vladimir Temlyakov. Chebyshev Greedy Algorithm in convex optimization. arXiv.org, December 2013.

[47] Vladimir Temlyakov. Greedy algorithms in convex optimization on Banach spaces. In 48th Asilomar Conference on Signals, Systems and Computers, pages 1331–1335. IEEE, 2014.

[48] V N Temlyakov. Greedy approximation in convex optimization. Constructive Approximation, 41(2):269–296, 2015.

[49] Joel A Tropp. Greed is good: Algorithmic results for sparse approximation.
IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

[50] Zheng Wang, Ming-Jun Lai, Zhaosong Lu, Wei Fan, Hasan Davulcu, and Jieping Ye. Rank-one matrix pursuit for matrix completion. In ICML, pages 91–99, 2014.

[51] Max Welling and Markus Weber. Positive tensor factorization. Pattern Recognition Letters, 22(12):1255–1261, 2001.

[52] Mehrdad Yaghoobi, Di Wu, and Mike E Davies. Fast non-negative orthogonal matching pursuit. IEEE Signal Processing Letters, 22(9):1229–1233, 2015.

[53] Yuning Yang, Siamak Mehrkanoon, and Johan A K Suykens. Higher order matching pursuit for low rank tensor learning. arXiv.org, March 2015.

[54] Quanming Yao and James T Kwok. Greedy learning of generalized low-rank models. In IJCAI, 2016.

[55] Xiao-Tong Yuan and Tong Zhang. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14(1):899–925, April 2013.