{"title": "Blended Matching Pursuit", "book": "Advances in Neural Information Processing Systems", "page_first": 2044, "page_last": 2054, "abstract": "Matching pursuit algorithms are an important class of algorithms in signal processing and machine learning. We present a blended matching pursuit algorithm, combining coordinate descent-like steps with stronger gradient descent steps, for minimizing a smooth convex function over a linear space spanned by a set of atoms. We derive sublinear to linear convergence rates according to the smoothness and sharpness orders of the function and demonstrate computational superiority of our approach. In particular, we derive linear rates for a large class of non-strongly convex functions, and we demonstrate in experiments that our algorithm enjoys very fast rates of convergence and wall-clock speed while maintaining a sparsity of iterates very comparable to that of the (much slower) orthogonal matching pursuit.", "full_text": "Blended Matching Pursuit\n\nCyrille W. Combettes\n\nGeorgia Institute of Technology\n\nAtlanta, GA, USA\n\ncyrille@gatech.edu\n\nAbstract\n\nSebastian Pokutta\n\nZuse Institute Berlin and TU Berlin\n\nBerlin, Germany\npokutta@zib.de\n\nMatching pursuit algorithms are an important class of algorithms in signal pro-\ncessing and machine learning. We present a blended matching pursuit algorithm,\ncombining coordinate descent-like steps with stronger gradient descent steps, for\nminimizing a smooth convex function over a linear space spanned by a set of atoms.\nWe derive sublinear to linear convergence rates according to the smoothness and\nsharpness orders of the function and demonstrate computational superiority of our\napproach. 
In particular, we derive linear rates for a large class of non-strongly convex functions, and we demonstrate in experiments that our algorithm enjoys very fast rates of convergence and wall-clock speed while maintaining a sparsity of iterates very comparable to that of the (much slower) orthogonal matching pursuit.\n\n1 Introduction\n\nLet H be a separable real Hilbert space, D \u2282 H be a dictionary, and f : H \u2192 R be a smooth convex function. In this paper, we aim at solving the problem:\n\nFind a solution to min_{x\u2208H} f(x) which is sparse relative to D. (1)\n\nTogether with fast convergence, achieving high sparsity, i.e., keeping the iterates as linear combinations of a small number of atoms in the dictionary D, is a primary objective and leads to better generalization, interpretability, and decision-making in machine learning. In signal processing, Problem (1) encompasses a wide range of applications, including compressed sensing, signal denoising, and information retrieval, and is often solved with the Matching Pursuit algorithm [Mallat and Zhang, 1993]. Our approach is inspired by the Blended Conditional Gradients algorithm [Braun et al., 2019], which solves the constrained setting of Problem (1), i.e., minimizing f over the convex hull conv(D) of the dictionary, and is ultimately based on the Frank-Wolfe algorithm [Frank and Wolfe, 1956], a.k.a. Conditional Gradient algorithm [Levitin and Polyak, 1966]. It enhances the vanilla Frank-Wolfe algorithm by replacing the linear minimization oracle with a weak-separation oracle [Braun et al., 2017] and by blending the traditional Frank-Wolfe steps with lazified Frank-Wolfe steps and projected gradient steps, while still avoiding projections. 
Frank-Wolfe algorithms are particularly well-suited for problems with a desired sparsity in the solution (see, e.g., Jaggi [2013] and the references therein); however, from an optimization perspective, although they approximate the optimal descent direction \u2212\u2207f(xt) via the linear minimization oracle vFW_t \u2190 arg min_{v\u2208D} \u27e8\u2207f(xt), v\u27e9, they move in the direction vFW_t \u2212 xt in order to ensure feasibility, which provides less progress.\n\nAn analogy between Frank-Wolfe algorithms and the unconstrained Problem (1) was proposed by Locatello et al. [2017]. They unified the Frank-Wolfe and Matching Pursuit algorithms, and proposed a Generalized Matching Pursuit algorithm (GMP) and an Orthogonal Matching Pursuit algorithm (OMP) for solving Problem (1), which descend in the directions vFW_t. Essentially, Locatello et al. [2017] established that GMP corresponds to the vanilla Frank-Wolfe algorithm and OMP corresponds to the Fully-Corrective Frank-Wolfe algorithm.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nGMP and OMP converge with similar rates in the various regimes, namely with a sublinear rate for smooth convex functions and with a linear rate for smooth strongly convex functions; however, they have different advantages: GMP converges (much) faster in wall-clock time while OMP offers (much) sparser iterates. The interest in these algorithms stems from the fact that they work in the general setting of smooth convex functions in Hilbert spaces and that their convergence analyses do not require incoherence or restricted isometry properties (RIP, Cand\u00e8s and Tao [2005]) of the dictionary, which are quite strong assumptions from an optimization standpoint. For an in-depth discussion of the advantages of GMP and OMP over other methods, e.g., in Tropp [2004], Gribonval and Vandergheynst [2006], Davenport and Wakin [2010], Shalev-Shwartz et al. 
[2010], Temlyakov [2013, 2014, 2015], Tibshirani [2015], Yao and Kwok [2016], and Nguyen and Petrova [2017], we refer the interested reader to Locatello et al. [2017]. In a follow-up work, Locatello et al. [2018] presented an Accelerated Matching Pursuit algorithm.\n\nWe aim at unifying the best of GMP (speed) and OMP (sparsity) into a single algorithm by blending them strategically. However, while the overall idea is reasonably natural, we face considerable challenges as many important features of Frank-Wolfe methods do not apply anymore in the Matching Pursuit setting and cannot be as easily overcome as in Locatello et al. [2017], requiring a different analysis. For example, Frank-Wolfe (duality) gaps are not readily available but they are crucial in monitoring the blending, and further key components, such as the weak-separation oracle, require modifications.\n\nContributions. We propose a Blended Matching Pursuit algorithm (BMP), a fast and sparse first-order method for solving Problem (1). Our method unifies the best of GMP (speed) and OMP (sparsity) into one algorithm, which is of fundamental interest for practitioners. We establish a continuous range of convergence rates between O(1/\u03b5^p) and O(ln(1/\u03b5)), where \u03b5 > 0 is the desired accuracy and p > 0 depends on the properties of the function. In particular, we derive linear rates of convergence for a large class of smooth convex but non-strongly convex functions. Lastly, we demonstrate the computational superiority of BMP over state-of-the-art methods, with BMP converging the fastest in wall-clock time while maintaining its iterates at close-to-optimal sparsity, and this without requiring sparsity-inducing constraints.\n\nOutline. We introduce notions and notation in Section 2. We present the Blended Matching Pursuit algorithm in Section 3 with the convergence analyses in Section 3.1. Computational experiments are provided in Section 4. 
Additional experiments and the proofs can be found in the Appendix.\n\n2 Preliminaries\n\nWe work in a separable real Hilbert space (H, \u27e8\u00b7,\u00b7\u27e9) with induced norm \u2016\u00b7\u2016. A set D \u2282 H of normalized vectors is a dictionary if it is at most countable and cl(span(D)) = H, and in this case its elements are referred to as atoms. For any set S \u2286 H, let S' := S \u222a \u2212S denote the symmetrization of S and DS := sup_{u,v\u2208S} \u2016u \u2212 v\u2016 denote the diameter of S. If S is closed and convex, let projS denote the orthogonal projection onto S and dist(\u00b7, S) := \u2016id \u2212 projS\u2016 denote the distance to S. For Problem (1) to be feasible, we will assume f to be coercive, i.e., lim_{\u2016x\u2016\u2192+\u221e} f(x) = +\u221e. Since f is convex, this is actually a mild assumption when arg minH f \u2260 \u2205.\n\nLet f : H \u2192 R be a Fr\u00e9chet differentiable function. In the following, we use extended notions of smoothness and strong convexity by introducing orders, and we weaken and generalize the notion of strong convexity to that of sharpness (see, e.g., Roulet and d\u2019Aspremont [2017] and Kerdreux et al. [2019] for recent work). 
We say that f is:\n\n(i) smooth of order \u2113 > 1 if there exists L > 0 such that for all x, y \u2208 H,\nf(y) \u2212 f(x) \u2212 \u27e8\u2207f(x), y \u2212 x\u27e9 \u2264 (L/\u2113) \u2016y \u2212 x\u2016^\u2113,\n\n(ii) strongly convex of order s > 1 if there exists S > 0 such that for all x, y \u2208 H,\nf(y) \u2212 f(x) \u2212 \u27e8\u2207f(x), y \u2212 x\u27e9 \u2265 (S/s) \u2016y \u2212 x\u2016^s,\n\n(iii) sharp of order \u03b8 \u2208 ]0, 1[ on K if K \u2282 H is a bounded set, \u2205 \u2260 arg minH f \u2282 int(K), and there exists C > 0 such that for all x \u2208 K,\ndist(x, arg minH f) \u2264 C (f(x) \u2212 minH f)^\u03b8.\n\nIf needed, we may specify the constants by introducing f as L-smooth, S-strongly convex, or C-sharp. The following fact, whose result was already used in Nemirovskii and Nesterov [1985], provides a bound on the sharpness order of a smooth function. A proof is available in Appendix D.\n\nFact 2.1. Let f : H \u2192 R be smooth of order \u2113 > 1, convex, and sharp of order \u03b8 \u2208 ]0, 1[ on K. Then \u03b8 \u2208 ]0, 1/\u2113].\n\n2.1 On sharpness and strong convexity\n\nNotice that if f : H \u2192 R is Fr\u00e9chet differentiable and strongly convex of order s > 1, then card(arg minH f) = 1. Let {x\u2217} := arg minH f. It follows directly from \u2207f(x\u2217) = 0 that for any bounded set K \u2282 H such that x\u2217 \u2208 int(K), f is sharp of order \u03b8 = 1/s on K. Thus, strong convexity implies sharpness. However, not every sharp function is strongly convex; moreover, the next example shows that not every sharp and convex function is strongly convex.\n\nExample 2.2 (Distance to a convex set). Let C \u2282 H be a nonempty, closed, and bounded convex set, and K \u2282 H be a bounded set such that C \u2282 int(K). 
The function f : x \u2208 H \u21a6 dist(x, C)\u00b2 = \u2016x \u2212 projC(x)\u2016\u00b2 is convex, and it is sharp of order \u03b8 = 1/2 on K. Indeed, since arg minH f = C and minH f = 0, we have for all x \u2208 K,\n\ndist(x, arg minH f) = \u2016x \u2212 projC(x)\u2016 = (f(x) \u2212 minH f)^{1/2}.\n\nNow, suppose C contains more than one element. Then, f has more than one minimizer. However, a function that is strongly convex of order s > 1 has no more than one minimizer. Therefore, f cannot be strongly convex of order s, for all s > 1. Notice that f is also a smooth function, of order \u2113 = 2.\n\nHence, sharpness is a more general notion than strong convexity. It is a local condition around the optimal solutions while strong convexity is a global condition. In fact, building on the \u0141ojasiewicz inequality of \u0141ojasiewicz [1963], [Bolte et al., 2007, Equation (15)] showed that sharpness always holds in finite dimensional spaces for reasonably well-behaved convex functions; see Lemma 2.3. Polynomial convex functions, the \u2113p-norms, the Huber loss (see Appendix A.4), and the rectifier ReLU are simple examples of such functions.\n\nLemma 2.3. Let f : Rn \u2192 ]\u2212\u221e, +\u221e] be a lower semicontinuous, convex, and subanalytic function with {x \u2208 Rn | 0 \u2208 \u2202f(x)} \u2260 \u2205. Then for any bounded set K \u2282 Rn, there exist \u03b8 \u2208 ]0, 1[ and C > 0 such that for all x \u2208 K,\n\ndist(x, arg minRn f) \u2264 C (f(x) \u2212 minRn f)^\u03b8.\n\nStrong convexity is a standard requirement to prove linear convergence rates on smooth convex objectives but, regrettably, this considerably restricts the set of candidate functions. 
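Example 2.2 can be checked numerically. Below is a minimal sketch (our own illustration, not from the paper): we take C to be the unit Euclidean ball in R^3 and verify that the sharpness inequality of order \u03b8 = 1/2 holds, here with equality and constant C = 1.

```python
import math
import random

# Example 2.2: f(x) = dist(x, C)^2 is sharp of order theta = 1/2 (constant C = 1).
# For illustration we take C to be the unit Euclidean ball in R^3 (our own choice).

def proj_ball(x, radius=1.0):
    """Orthogonal projection onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm <= radius:
        return list(x)
    return [radius * c / norm for c in x]

def f(x):
    """Squared distance to the ball: arg min f = C and min f = 0."""
    p = proj_ball(x)
    return sum((a - b) ** 2 for a, b in zip(x, p))

random.seed(0)
for _ in range(100):
    x = [random.gauss(0.0, 3.0) for _ in range(3)]
    p = proj_ball(x)
    dist_to_minimizers = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))
    # The sharpness inequality holds with equality here:
    # dist(x, arg min f) = (f(x) - min f)^(1/2)
    assert abs(dist_to_minimizers - math.sqrt(f(x))) < 1e-9
```

As the example argues, f has the whole ball as its minimizer set, so it cannot be strongly convex of any order, yet it is sharp.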
For our Blended Matching Pursuit algorithm, we will only require sharpness to establish linear convergence rates, thus including a larger class of functions.\n\n2.2 Matching Pursuit algorithms\n\nFor y \u2208 H and f : x \u2208 H \u21a6 \u2016y \u2212 x\u2016\u00b2/2, Problem (1) falls in the area of sparse recovery and is often solved with the Matching Pursuit algorithm [Mallat and Zhang, 1993]. The algorithm recovers a sparse representation of the signal y from the dictionary D by sequentially pursuing the best matching atom. At each iteration, it searches for the atom vt \u2208 D most correlated with the residual y \u2212 xt, i.e., vt := arg max_{v\u2208D} |\u27e8y \u2212 xt, v\u27e9|, and adds it to the linear decomposition of the current iterate xt to form the new iterate xt+1, keeping track of the active set St+1 = St \u222a {vt}. However, this does not prevent the algorithm from selecting atoms that have already been added in earlier iterations or that are redundant, hence affecting sparsity. The Orthogonal Matching Pursuit variant [Pati et al., 1993, Davis et al., 1994] overcomes this by computing the new iterate as the projection of the signal y onto span(St \u222a {vt}); see Chen et al. [1989] and Tropp [2004] for analyses and Zhang [2009] for an extension to the stochastic case. Thus, y \u2212 xt+1 becomes orthogonal to the active set.\n\nIn order to solve Problem (1) for any smooth convex objective, Locatello et al. [2017] proposed the Generalized Matching Pursuit (GMP) and Generalized Orthogonal Matching Pursuit (GOMP) algorithms (Algorithm 1); slightly abusing notation, we will refer to the latter simply as Orthogonal Matching Pursuit (OMP). The atom selection subroutine is implemented with a Frank-Wolfe linear minimization oracle arg min_{v\u2208D'} \u27e8\u2207f(xt), v\u27e9 (Line 3). 
The solution vt to this oracle is guaranteed to be a descent direction as it satisfies \u27e8\u2207f(xt), vt\u27e9 \u2264 0 by symmetry of D', and \u27e8\u2207f(xt), vt\u27e9 = 0 if and only if xt \u2208 arg minH f. Notice that for y \u2208 H and f : x \u2208 H \u21a6 \u2016y \u2212 x\u2016\u00b2/2, the GMP and OMP variants of Algorithm 1 recover the original Matching Pursuit and Orthogonal Matching Pursuit algorithms respectively. In particular, up to a sign which does not affect the sequence of iterates, arg max_{v\u2208D} |\u27e8y \u2212 xt, v\u27e9| \u21d4 arg min_{v\u2208D'} \u27e8\u2207f(xt), v\u27e9. In practice, the main difference in the case of general smooth convex functions is that the OMP variant (Line 6) is much more expensive, as a closed-form solution to this projection step is not available anymore. Hence, Line 6 is typically a sequence of projected gradient steps and OMP is significantly slower than GMP to converge.\n\nAlgorithm 1 Generalized/Orthogonal Matching Pursuit (GMP/OMP)\nInput: Start atom x0 \u2208 D, number of iterations T \u2208 N\u2217.\nOutput: Iterates x1, . . . , xT \u2208 span(D).\n1: S0 \u2190 {x0}\n2: for t = 0 to T \u2212 1 do\n3:   vt \u2190 arg min_{v\u2208D'} \u27e8\u2207f(xt), v\u27e9\n4:   St+1 \u2190 St \u222a {vt}\n5:   GMP variant: xt+1 \u2190 arg min_{xt+Rvt} f\n6:   OMP variant: xt+1 \u2190 arg min_{span(St+1)} f\n7: end for\n\n2.3 Weak-separation oracle\n\nWe present in Oracle 2 the weak-separation oracle, a modified version of the one first introduced in Braun et al. [2017] and used in, e.g., Lan et al. [2017], Braun et al. [2019]. Note that the modification asks for an unconstrained improvement, whereas the original weak-separation oracle required an improvement relative to a reference point. As such, our variant here is even simpler than the original weak-separation oracle. 
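For concreteness, Algorithm 1 can be sketched for the quadratic objective f(x) = \u2016y \u2212 x\u2016\u00b2/2 over the signed canonical dictionary; this toy instantiation (dictionary, data, and solvers are our own illustrative choices, not the paper's implementation) shows how the two variants differ only in the update step.

```python
import random

def gmp_omp(y, T, variant="GMP"):
    """Algorithm 1 sketch for f(x) = ||y - x||^2 / 2 with D = {+-e_1, ..., +-e_n}.

    The residual r = y - x equals -grad f(x), so the linear minimization
    oracle over D' picks the coordinate of r with the largest magnitude.
    The GMP step is an exact line search along that atom; the OMP step
    re-fits x on the whole active set (here simply x_j = y_j on S).
    """
    n = len(y)
    x = [0.0] * n
    active = set()
    for _ in range(T):
        r = [yi - xi for yi, xi in zip(y, x)]        # residual = -grad f(x)
        i = max(range(n), key=lambda j: abs(r[j]))   # oracle over D' = {+-e_j}
        active.add(i)
        if variant == "GMP":
            x[i] += r[i]                             # exact line search
        else:  # OMP: project y onto the span of the active atoms
            for j in active:
                x[j] = y[j]
    return x, active

random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(20)]
x_gmp, s_gmp = gmp_omp(y, T=50, variant="GMP")
x_omp, s_omp = gmp_omp(y, T=20, variant="OMP")
```

For this orthonormal dictionary the OMP projection has a closed form; for general dictionaries and objectives, Line 6 has no closed form and becomes a sequence of projected gradient steps, which is exactly what makes OMP slower in practice.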
The oracle is called in Line 11 by the Blended Matching Pursuit algorithm.\n\nOracle 2 Weak-separation LPsepD(c, \u03c6, \u03ba)\nInput: Linear objective c \u2208 H, objective value \u03c6 \u2264 0, accuracy \u03ba \u2265 1.\nOutput: Either an atom v \u2208 D such that \u27e8c, v\u27e9 \u2264 \u03c6/\u03ba (positive call), or false, ensuring \u27e8c, z\u27e9 \u2265 \u03c6 for all z \u2208 conv(D) (negative call).\n\nThe weak-separation oracle determines whether there exists an atom v \u2208 D such that \u27e8c, v\u27e9 \u2264 \u03c6/\u03ba, and thereby relaxes the Frank-Wolfe linear minimization oracle. If not, then this implies that conv(D) can be separated from the ambient space by c and \u03c6 with the linear inequality \u27e8c, z\u27e9 \u2265 \u03c6 for all z \u2208 conv(D). In practice, the oracle can be efficiently implemented using caching, i.e., first testing atoms that were already returned during previous calls as they may satisfy the condition here again. In this case, caching also preserves sparsity. If no active atom satisfies the condition, the oracle can be solved, e.g., by means of a call to a linear optimization oracle; see Braun et al. [2017] for an in-depth discussion. Lastly, we would like to briefly note that the parameter \u03ba can be used to further promote positive calls over negative calls, by weakening the improvement requirement and therefore speeding up the oracle. Indeed, only negative calls need a full scan of the dictionary.\n\n3 The Blended Matching Pursuit algorithm\n\nWe now present our Blended Matching Pursuit algorithm (BMP) in Algorithm 3. Note that although we blend steps, we maintain the explicit decomposition of the iterates xt = \u2211_{j=1}^{nt} \u03bbt,ij aij as linear combinations of the atoms.\n\nRemark 3.1 (Algorithm design). 
BMP actually does not require the atoms to have exactly the same norm and only needs the dictionary to be bounded, whether it be for ensuring the convergence rates or for computations; one could further take advantage of this to add weights to certain atoms. Line 6 is simply taking the component gt of \u2207f(xt) parallel to span(St), which can be achieved by basic linear algebra and costs O(n card(St)\u00b2) when H = Rn. The line searches in Lines 7 and 17 can be replaced with explicit step sizes using the smoothness of f (see Fact B.2 in the Appendix). The purpose of (the optional) Line 22 is to reoptimize the active set St+1, e.g., by reducing it to a subset that forms a basis for its linear span.\n\nAlgorithm 3 Blended Matching Pursuit (BMP)\nInput: Start atom x0 \u2208 D, parameters \u03b7 > 0, \u03ba \u2265 1, and \u03c4 > 1, number of iterations T \u2208 N\u2217.\nOutput: Iterates x1, . . . , xT \u2208 span(D).\n1: S0 \u2190 {x0}\n2: \u03c60 \u2190 min_{v\u2208D'} \u27e8\u2207f(x0), v\u27e9/\u03c4\n3: for t = 0 to T \u2212 1 do\n4:   vFW-S_t \u2190 arg min_{v\u2208S't} \u27e8\u2207f(xt), v\u27e9\n5:   if \u27e8\u2207f(xt), vFW-S_t\u27e9 \u2264 \u03c6t/\u03b7 then\n6:     gt \u2190 projspan(St)(\u2207f(xt))\n7:     xt+1 \u2190 arg min_{xt+Rgt} f   {constrained step}\n8:     St+1 \u2190 St\n9:     \u03c6t+1 \u2190 \u03c6t\n10:  else\n11:    vt \u2190 LPsepD'(\u2207f(xt), \u03c6t, \u03ba)\n12:    if vt = false then\n13:      xt+1 \u2190 xt   {dual step}\n14:      St+1 \u2190 St\n15:      \u03c6t+1 \u2190 \u03c6t/\u03c4\n16:    else\n17:      xt+1 \u2190 arg min_{xt+Rvt} f   {full step}\n18:      St+1 \u2190 St \u222a {vt}\n19:      \u03c6t+1 \u2190 \u03c6t\n20:    end if\n21:  end if\n22:  Optional: Correct St+1\n23: end for\n
One could also obtain further sparsity by removing atoms whose coefficient in the decomposition of the iterate is smaller than some threshold \u03b4 > 0.\n\nBlending. BMP aims at unifying the best of GMP and OMP. As seen in Section 2.2, an OMP iteration is a sequence of projected gradient (PG) steps. The idea is that the sequence of PG steps constituting an OMP iteration is actually overkill: there is a sweet spot where further optimizing over span(St) is less effective than adding a new atom and taking a GMP step into a (possibly) new space. However, PG steps have the benefit of preserving sparsity, since no new atom is added. Furthermore, GMP steps require an expensive scan of the dictionary to output the descent direction vFW_t \u2190 arg min_{v\u2208D'} \u27e8\u2207f(xt), v\u27e9. To remedy this, BMP blends constrained steps (PG steps, Line 7) with full steps (lazified GMP steps, Line 17) by promoting constrained steps as long as the progress in function value is comparable to that of a GMP step, else by taking a full step in an approximate direction vt (with cheap computation via Oracle 2) such that the progress is comparable to that of a GMP step. Therefore, to monitor this blending of steps, we wish to compare \u27e8\u2207f(xt), vFW-S_t\u27e9 and \u27e8\u2207f(xt), vt\u27e9 to \u27e8\u2207f(xt), vFW_t\u27e9, which quantities measure the progress in function value offered by a constrained step, a full step, and a GMP step respectively (see proofs in the Appendix).\n\nDual gap estimates. The aforementioned comparisons however cannot be made directly as the quantity \u27e8\u2207f(xt), vFW_t\u27e9 is (deliberately) not computed; computing it requires an expensive complete scan of the dictionary. Instead, we use an estimation of this quantity, by introducing the dual gap estimate |\u03c6t|. 
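Remark 3.1 notes that the line searches in Lines 7 and 17 can be replaced with explicit step sizes using the smoothness of f. For the common case \u2113 = 2, a short sketch of such a step (our own illustration; Fact B.2 itself is in the Appendix and not reproduced here):

```python
def smoothness_step(grad_dot_d, d_norm_sq, L):
    """Explicit step size from the order-2 smoothness upper bound.

    For L-smooth f (order 2), f(x + g*d) <= f(x) + g*<grad f(x), d>
    + (L/2) * g^2 * ||d||^2; minimizing the right-hand side over g gives
    g = -<grad f(x), d> / (L * ||d||^2), which can replace the line search
    whenever <grad f(x), d> <= 0 (a descent direction).
    """
    return -grad_dot_d / (L * d_norm_sq)

# Toy check on f(x) = x^2 (L = 2): at x = 3 with direction d = -1,
# grad f(3) = 6, <grad, d> = -6, so g = 3 and x + g*d = 0, the minimizer.
gamma = smoothness_step(grad_dot_d=-6.0, d_norm_sq=1.0, L=2.0)
assert gamma == 3.0
```

The step from the smoothness bound is never longer than the exact line-search step, so it guarantees at least the progress used in the proofs while avoiding a one-dimensional minimization.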
This designation comes from the fact that \u2212\u27e8\u2207f(xt), vFW_t\u27e9 is our equivalent of the duality gap from the constrained setting (see, e.g., Jaggi [2013]), and this will guide how we build our estimation. Indeed, since D' is symmetric and assuming 0 \u2208 int(conv(D')), there exists (an unknown) \u03c1 > 0 such that {x0, . . . , xT} \u222a arg minH f \u2282 \u03c1 conv(D'). Then for all x\u2217 \u2208 arg minH f,\n\n\u03b5t := f(xt) \u2212 f(x\u2217) \u2264 \u27e8\u2207f(xt), xt \u2212 x\u2217\u27e9 \u2264 max_{u,v\u2208\u03c1 conv(D')} \u27e8\u2207f(xt), u \u2212 v\u27e9 = \u22122\u03c1\u27e8\u2207f(xt), vFW_t\u27e9,   (2)\n\nwhich is our desired inequality. We set \u03c60 \u2190 \u27e8\u2207f(x0), vFW_0\u27e9/\u03c4 (Line 2) so \u03b50 \u2264 2\u03c4\u03c1|\u03c60| by (2). The criterion in Line 5 compares \u27e8\u2207f(xt), vFW-S_t\u27e9 to the threshold \u03c6t/\u03b7. If this quantity does not fall below the threshold, then a constrained step is not taken and the weak-separation oracle (Line 11, Oracle 2) is called to search for an atom vt satisfying \u27e8\u2207f(xt), vt\u27e9 \u2264 \u03c6t/\u03ba. If the oracle cannot find such an atom, then a full step is not taken and it returns a negative call with the certificate \u27e8\u2207f(xt), vFW_t\u27e9 \u2265 \u03c6t. In this case, BMP has detected an improved dual gap estimate and takes a dual step (Line 13): by (2), this implies that \u03b5t \u2264 2\u03c1|\u03c6t|, so with \u03c6t+1 \u2190 \u03c6t/\u03c4 and xt+1 \u2190 xt, we recover \u03b5t+1 \u2264 2\u03c4\u03c1|\u03c6t+1|. Furthermore, observe that this update is a geometric rescaling which ensures that BMP requires only Ndual = O(ln(1/\u03b5)) dual steps (see proofs). Thus, the total number of negative calls, i.e., the number of iterations requiring a complete scan of the dictionary, is only O(ln(1/\u03b5)). 
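The blending and dual-step logic of Algorithm 3 can be sketched compactly for f(x) = \u2016y \u2212 x\u2016\u00b2/2 over D = {\u00b1e1, . . . , \u00b1en} (a toy sketch with our own parameter defaults; the weak-separation scan below is exact rather than cached):

```python
def bmp_sketch(y, T, eta=5.0, kappa=2.0, tau=2.0):
    """Blended Matching Pursuit sketch for f(x) = ||y - x||^2 / 2, D = {+-e_i}.

    Blends constrained steps over span(S_t) with lazified full steps found by
    a weak-separation scan, and rescales the dual gap estimate (phi <- phi/tau)
    whenever no atom improves enough (a "dual step").
    """
    n = len(y)
    x, active = [0.0] * n, set()
    grad = lambda: [xi - yi for xi, yi in zip(x, y)]   # grad f(x) = x - y
    g = grad()
    phi = -max(abs(gi) for gi in g) / tau              # phi_0 = min_{D'}<grad, v>/tau
    steps = {"constrained": 0, "full": 0, "dual": 0}
    for _ in range(T):
        g = grad()
        best_active = min((-abs(g[i]) for i in active), default=0.0)
        if best_active <= phi / eta:                   # Line 5: constrained step
            for i in active:                           # exact PG step on span(S_t)
                x[i] -= g[i]
            steps["constrained"] += 1
        else:
            i = max(range(n), key=lambda j: abs(g[j]))  # weak-separation scan
            if -abs(g[i]) <= phi / kappa:              # positive call: full step
                active.add(i)
                x[i] -= g[i]
                steps["full"] += 1
            else:                                      # negative call: dual step
                phi /= tau
                steps["dual"] += 1
    return x, active, steps
```

On such orthonormal toy data a full step already solves its coordinate exactly, so runs consist of full steps followed by dual steps; with correlated dictionaries the constrained branch fires whenever the active set still offers enough progress, which is where the sparsity benefit comes from.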
Therefore, for this and for the blending of steps, the dual gap estimates |\u03c6t| are the key to the speed-up realized by BMP.\n\nParameters. BMP involves three (hyper-)parameters \u03b7 > 0, \u03ba \u2265 1, and \u03c4 > 1 to be set before running the algorithm. The parameter \u03b7 needs to be tuned carefully, as its value affects the criterion in Line 5 to promote either speed of convergence (e.g., \u03b7 \u223c 0.1, promoting full steps) or sparsity of the iterates (e.g., \u03b7 \u223c 1000, promoting constrained steps). In our experiments (see Section 4 and the Appendix), we found that setting \u03b7 \u223c 5 leads to close to both maximal speed of convergence and sparsity of the iterates, with the default choices \u03ba = \u03c4 = 2. In this setting, BMP converges (much) faster than GMP and has iterates with sparsity very comparable to that of OMP, and therefore it is possible to enjoy both properties of speed and sparsity simultaneously. Note that the value of \u03ba also impacts the range of values of \u03b7 to which BMP is sensitive, since the criterion (Line 5) tests min_{v\u2208S't} \u27e8\u2207f(xt), v\u27e9 \u2264 \u03c6t/\u03b7 while the weak-separation oracle asks for v \u2208 D' such that \u27e8\u2207f(xt), v\u27e9 \u2264 \u03c6t/\u03ba. In specific experiments, parameter tuning might further improve performance.\n\n3.1 Convergence analyses\n\nWe start with the simpler case of smooth convex functions of order \u2113 > 1 (Theorem 3.2). Our main result is Theorem 3.3, which subsumes the case of strongly convex functions. To establish the convergence rates of GMP and OMP, Locatello et al. [2017] assume knowledge of an upper bound on sup{\u2016x\u2217\u2016D', \u2016x0\u2016D', . . . , \u2016xT\u2016D'} where \u2016\u00b7\u2016D' : x \u2208 H \u21a6 inf{\u03c1 > 0 | x \u2208 \u03c1 conv(D')} is the atomic norm. In Locatello et al. 
[2018], this is resolved by working with the atomic norm \u2016\u00b7\u2016D' instead of the Hilbert space induced norm \u2016\u00b7\u2016 to, e.g., define smoothness and strong convexity of f and derive the proofs, but \u2016\u00b7\u2016D' itself can be difficult to derive in many applications. In contrast, we need neither the finiteness assumption nor to change the norm; however, we assume f to be coercive to ensure feasibility of Problem (1), a reasonably mild assumption.\n\nTheorem 3.2 (Smooth convex case). Let D \u2282 H be a dictionary such that 0 \u2208 int(conv(D')) and let f : H \u2192 R be smooth of order \u2113 > 1, convex, and coercive. Then the Blended Matching Pursuit algorithm (Algorithm 3) ensures that f(xt) \u2212 minH f \u2264 \u03b5 for all t \u2265 T where T = O((L/\u03b5)^{1/(\u2113\u22121)}).\n\nWe now present our main result in its full generality. We provide the general convergence rates of BMP (Algorithm 3) in Theorem 3.3. Recall that sharpness is implied by strong convexity and that it is a very mild assumption in finite dimensional spaces as it is satisfied by all well-behaved convex functions (Lemma 2.3).\n\nTheorem 3.3 (Smooth convex sharp case). Let D \u2282 H be a dictionary such that 0 \u2208 int(conv(D')) and let f : H \u2192 R be L-smooth of order \u2113 > 1, convex, coercive, and C-sharp of order \u03b8 \u2208 ]0, 1/\u2113] on K. 
Then the Blended Matching Pursuit algorithm (Algorithm 3) ensures that f(xt) \u2212 minH f \u2264 \u03b5 for all t \u2265 T where\n\nT = O(C^{1/(1\u2212\u03b8)} L^{1/(\u2113\u22121)} ln(C|\u03c60|/\u03b5^{1\u2212\u03b8})) if \u2113\u03b8 = 1,\nT = O((C^\u2113 L/\u03b5^{1\u2212\u2113\u03b8})^{1/(\u2113\u22121)}) if \u2113\u03b8 < 1.\n\nMoreover, dist(xt, arg minH f) \u2192 0 as t \u2192 +\u221e at the same rate.\n\nIf f is not strongly convex then Locatello et al. [2017] only guarantee a sublinear convergence rate O(1/\u03b5) for GMP and OMP, while Theorem 3.3 can still guarantee higher convergence rates, up to linear convergence O(ln(1/\u03b5)) if \u2113\u03b8 = 1, using sharpness. Note that in the popular case of smooth strongly convex functions of orders \u2113 = 2 and s = 2, Theorem 3.3 guarantees a linear convergence rate as these functions are sharp of order \u03b8 = 1/2 (with constant C = \u221a(2/S)) and thus satisfy \u2113\u03b8 = 1. For completeness, we also study this special case in Appendix C, with a simpler proof. In conclusion, Theorem 3.3 extends linear convergence rates to a large class of non-strongly convex functions solving Problem (1).\n\nRemark 3.4 (Optimality of the convergence rates). Let n \u2264 +\u221e be the dimension of H. Nemirovskii and Nesterov [1985] provided unimprovable rates when solving Problem (1) in different cases. These optimal rates are reported in Table 1, where we compare them to those of BMP proved in this paper (Theorems 3.2 and 3.3). The third column gives the lower bounds on complexity stated in Nemirovskii and Nesterov [1985, Equations (1.20), (1.21\u2019), and (1.21)]. Note that our rates are dimension independent and hold globally across iterations. 
It remains an open question to determine whether the gap in the exponent can be closed by accelerating BMP.\n\nTable 1: Comparison of the rates of BMP vs. the lower bounds on complexity.\n\nProperties of f | BMP rate | Lower bound on complexity\nSmooth convex | T(\u03b5) = O(1/\u03b5^{1/(\u2113\u22121)}) | T(\u03b5) = \u03a9(min{n, 1/\u03b5^{1/(1.5\u2113\u22121)}})\nSmooth convex sharp with \u2113 = 2, \u03b8 = 1/2 | T(\u03b5) = O(ln(1/\u03b5)) | T(\u03b5) = \u03a9(min{n, ln(1/\u03b5)})\nSmooth convex sharp with \u2113\u03b8 < 1 | T(\u03b5) = O(1/\u03b5^{(1\u2212\u2113\u03b8)/(\u2113\u22121)}) | T(\u03b5) = \u03a9(min{n, 1/\u03b5^{(1\u2212\u2113\u03b8)/(1.5\u2113\u22121)}})\n\n4 Computational experiments\n\nWe implemented BMP in Python 3 along with GMP and OMP [Locatello et al., 2017], the Accelerated Matching Pursuit algorithm (accMP) [Locatello et al., 2018], and the Blended Conditional Gradients (BCG) [Braun et al., 2019] and Conditional Gradient with Enhancement and Truncation (CoGEnT) [Rao et al., 2015] algorithms for completeness. All algorithms share the same code framework to ensure fair comparison. No enhancement beyond basic coding was performed. We ran the experiments on a laptop under Linux Ubuntu 18.04 with an Intel Core i7 3.5GHz CPU and 8GB RAM. The random data are drawn from Gaussian distributions. For GMP, OMP, BCG, and CoGEnT, we represented the dual gaps by \u2212min_{v\u2208D'} \u27e8\u2207f(xt), v\u27e9, yielding a zig-zag plot dissimilar to the stair-like plot of the dual gap estimates |\u03c6t| of BMP. The Appendix contains additional experiments.\n\n4.1 Comparison of BMP vs. 
GMP, OMP, BCG, and CoGEnT\n\nLet H be the Euclidean space (Rn, \u27e8\u00b7,\u00b7\u27e9) and D be the set of signed canonical vectors {\u00b1e1, . . . , \u00b1en}. Suppose we want to learn the (sparse) source x\u2217 from observed data y := Ax\u2217 + w, where A \u2208 Rm\u00d7n and where w \u223c N(0, \u03c3\u00b2Im) is the noise in the observed y. The general and most intuitive formulation of the problem is min_{x\u2208Rn} \u2016y \u2212 Ax\u2016\u00b2 s.t. \u2016x\u20160 \u2264 \u2016x\u2217\u20160 =: s, but the \u21130-pseudo-norm constraint is nonconvex and makes the problem NP-hard and therefore intractable in many situations [Natarajan, 1995]. To remedy this, one can handle the sparsity constraint in various ways, either by completely removing it and relying on an algorithm inherently promoting sparsity (as done in BMP, GMP, and OMP), or through a convex relaxation of the constraint, often via the \u21131-norm, and then solving the new constrained convex problem min_{x\u2208Rn} \u2016y \u2212 Ax\u2016\u00b2 s.t. \u2016x\u20161 \u2264 \u2016x\u2217\u20161 (as done in BCG and CoGEnT). We ran a comparison of these methods, where we favorably provided the constraint \u2016x\u20161 \u2264 \u2016x\u2217\u20161 for BCG and CoGEnT although x\u2217 is unknown. We set m = 500, n = 2000, s = 100, and \u03c3 = 0.05. In BMP, we set \u03ba = \u03c4 = 2 and we chose \u03b7 = 5; see Appendix A.1 for an in-depth sensitivity analysis of BMP with respect to \u03b7. We did not perform any additional correction of the active sets (Line 22). Note that [Rao et al., 2015, Table III] demonstrated the superiority of CoGEnT over CoSaMP [Needell and Tropp, 2009], Subspace Pursuit [Dai and Milenkovic, 2009], and Gradient Descent with Sparsification [Garg and Khandekar, 2009] on an equivalent experiment and we therefore do not compare to those methods.\n\nFigure 1: Comparison of BMP vs. 
GMP, OMP, BCG, and CoGEnT, with η = 5.

Figure 1 shows that BMP is the fastest algorithm in wall-clock time and has close-to-optimal sparsity. It is important to stress that, unlike BCG and CoGEnT, BMP achieves this with no explicit sparsity-promoting constraint, no regularization, and no information on x*. Thus, when ‖x*‖_1 is not provided, which is the case in most applications, BCG and CoGEnT would require hyper-parameter tuning of the sparsity-inducing constraint (or, equivalently, of the Lagrangian penalty parameters), such as the radius of the ℓ1-ball [Tibshirani, 1996], as used here, or of the trace-norm ball [Fazel et al., 2001]. OMP and CoGEnT converge faster per iteration, as expected, given that they solve a reoptimization problem at each iteration; however, this is very costly, and the disadvantage becomes evident in wall-clock time performance. Note that another "obvious" choice of algorithm would be projected gradient descent; however, the sparsity it provides is far from sufficient (see Appendix A.2).

Figure 2: Comparison in NMSE of BMP vs. GMP, OMP, BCG, and CoGEnT, with η = 5.

In Figure 2, we compare the Normalized Mean Squared Error (NMSE) of the different methods. The NMSE at iterate x_t is defined as ‖x_t − x*‖_2^2 / ‖x*‖_2^2. The plots show a rebound occurring once the NMSE reaches ∼10^{−4}, which is due to the algorithms overfitting to the noisy measurements y. A post-processing step can mitigate the rebound via early stopping or by removing atoms whose coefficients in the decomposition of the iterate are smaller than some threshold δ > 0. We used early stopping on a validation set and present the test error ‖y_test − A_test x_T‖_2^2 / m_test on a test set in Table 2, where x_T is the solution iterate for each algorithm.
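For concreteness, the NMSE and the coefficient-thresholding post-processing can be sketched as follows; this is a minimal illustration with made-up vectors, and the function names and the threshold value δ = 0.05 are our own, not taken from the BMP implementation:

```python
import numpy as np

def nmse(x, x_star):
    """Normalized Mean Squared Error: ||x - x*||_2^2 / ||x*||_2^2."""
    return np.linalg.norm(x - x_star) ** 2 / np.linalg.norm(x_star) ** 2

def threshold_atoms(x, delta):
    """Post-processing: zero out atoms whose coefficient magnitude is below delta."""
    out = x.copy()
    out[np.abs(out) < delta] = 0.0
    return out

# Toy iterate close to a sparse source (values are illustrative)
x_star = np.array([0.0, 1.0, 0.0, -2.0])
x_t = np.array([0.01, 0.98, -0.02, -1.97])
err = nmse(x_t, x_star)                       # small: x_t is near x_star
x_clean = threshold_atoms(x_t, delta=0.05)    # spurious small coefficients removed
```

Early stopping plays the same role as thresholding here: both discard the late, small corrections that fit the noise w rather than the source x*.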
For completeness, we also report the results for the Gradient Hard Thresholding Pursuit (GraHTP) and Fast Gradient Hard Thresholding Pursuit (Fast GraHTP) algorithms [Yuan et al., 2018], for which we favorably set k = ‖x*‖_0. As expected, GMP performs the worst on the test set because its NMSE does not achieve sufficient convergence (see Figure 2), highlighting the importance of a clean, i.e., sparse, decomposition into the dictionary D.

Table 2: Test error achieved using early stopping on a validation set.

Algorithm    GMP      OMP      BMP      BCG      CoGEnT   GraHTP   Fast GraHTP
Test error   0.1917   0.0036   0.0037   0.0068   0.0036   0.0043   0.0037

The Appendix contains additional experiments on different objective functions: an arbitrarily chosen norm (Appendix A.3), the Huber loss (Appendix A.4), the distance to a convex set (Appendix A.5), and a logistic regression loss (Appendix A.6). The conclusions are identical.

4.2 Comparison of BMP vs. accMP

Locatello et al. [2018] recently provided an Accelerated Matching Pursuit algorithm (accMP) for solving Problem (1). We implemented the same code as theirs, using the exact same parametrization. The code framework matches the one we used for BMP. We ran BMP on their toy data example and compared the results against accMP (which they labeled accelerated steepest in their plot); notice that we recovered their (per-iteration) plot exactly. The experiment is to minimize f : x ∈ R^100 ↦ ‖x − b‖_2^2 / 2 over the linear span of D, where D is a dictionary of 200 randomly chosen atoms in R^100 and b ∈ R^100 is also randomly chosen. The parameters for accMP, kindly provided by the authors of Locatello et al. [2018], were L = 1000 and ν = 1. We report the results in Figure 3.

Figure 3: Comparison of BMP vs. accMP, with η = 3.

We see that BMP outperforms accMP in both speed of convergence and sparsity of the iterates.
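As a reference point for this toy setting, a plain (non-accelerated, non-blended) matching pursuit with a steepest-atom rule and exact line search can be sketched as follows; the dimensions match the experiment, but the seed, iteration budget, and tolerance are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_atoms = 100, 200                   # dimensions of the toy experiment
D = rng.standard_normal((num_atoms, n))   # 200 randomly chosen atoms in R^100
b = rng.standard_normal(n)                # randomly chosen target

def f(x):
    return 0.5 * np.linalg.norm(x - b) ** 2

x = np.zeros(n)
for _ in range(50_000):
    g = x - b                             # gradient of f at x
    scores = D @ g                        # correlation of each atom with the gradient
    i = int(np.argmax(np.abs(scores)))    # steepest atom
    z = D[i]
    x -= (scores[i] / (z @ z)) * z        # exact line search along z for this quadratic
# With 200 Gaussian atoms in R^100, span(D) = R^100 almost surely, so f(x) -> 0.
```

Each step decreases f by ⟨∇f(x_t), z⟩² / (2‖z‖²), which is exactly the coordinate descent-like progress that BMP blends with stronger gradient descent steps.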
In fact, in terms of sparsity, accMP needs to use all available atoms to converge while BMP needs only half as many. Furthermore, accMP needs 75% of all available atoms to start converging significantly, while BMP starts to converge immediately. We suspect that this is due to the following: accMP accelerates coordinate descent-like directions, which might be relatively bad approximations of the actual descent direction −∇f(x_t), whereas BMP works directly with (the projection of) −∇f(x_t), achieving much more progress and offsetting the effect of acceleration.

5 Final remarks

We presented a Blended Matching Pursuit algorithm (BMP) which enjoys both fast convergence rates and sparsity of the iterates. More specifically, we derived linear convergence rates for a large class of non-strongly convex functions when solving Problem (1), and we showed that our blending approach outperforms state-of-the-art methods in speed of convergence while achieving close-to-optimal sparsity, all without requiring sparsity-inducing constraints or regularization. Although BMP already outperforms the Accelerated Matching Pursuit algorithm [Locatello et al., 2018] in our experiments, we believe it is also amenable to acceleration.

Acknowledgments

Research reported in this paper was partially supported by NSF CAREER award CMMI-1452463.

References

J. Bolte, A. Daniilidis, and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.

G. Braun, S. Pokutta, and D. Zink. Lazifying conditional gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 566–575, 2017.

G. Braun, S. Pokutta, D. Tu, and S. Wright. Blended conditional gradients: the unconditioning of conditional gradients.
In Proceedings of the 36th International Conference on Machine Learning, pages 735–743, 2019.

E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

S. Chen, S. A. Billings, and W. Luo. Orthogonal least squares methods and their application to non-linear system identification. International Journal of Control, 50(5):1873–1896, 1989.

L. Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1):575–585, 2016.

W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.

M. A. Davenport and M. B. Wakin. Analysis of orthogonal matching pursuit using the restricted isometry property. IEEE Transactions on Information Theory, 56(9):4395–4401, 2010.

G. Davis, S. Mallat, and Z. Zhang. Adaptive time-frequency decompositions with matching pursuits. Optical Engineering, 33(7):2183–2191, 1994.

M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, pages 4734–4739, 2001.

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

R. Garg and R. Khandekar. Gradient descent with sparsification: An iterative algorithm for sparse recovery with restricted isometry property. In Proceedings of the 26th International Conference on Machine Learning, pages 337–344, 2009.

R. Gribonval and P. Vandergheynst. On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Transactions on Information Theory, 52(1):255–261, 2006.

I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17, pages 545–552, 2005.

P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73–101, 1964.

M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435, 2013.

T. Kerdreux, A. d'Aspremont, and S. Pokutta. Restarting Frank-Wolfe. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pages 1275–1283, 2019.

G. Lan, S. Pokutta, Y. Zhou, and D. Zink. Conditional accelerated lazy stochastic gradient descent. In Proceedings of the 34th International Conference on Machine Learning, pages 1965–1974, 2017.

E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.

F. Locatello, R. Khanna, M. Tschannen, and M. Jaggi. A unified optimization view on generalized matching pursuit and Frank-Wolfe. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 860–868, 2017.

F. Locatello, A. Raj, S. P. Karimireddy, G. Rätsch, B. Schölkopf, S. U. Stich, and M. Jaggi. On matching pursuit and coordinate descent. In Proceedings of the 35th International Conference on Machine Learning, pages 3198–3207, 2018.

S. Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In Les Équations aux Dérivées Partielles, 117, pages 87–89. Colloques Internationaux du CNRS, 1963.

S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

A. S. Nemirovskii and Y. E. Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics, 25(2):21–30, 1985.

H. Nguyen and G. Petrova. Greedy strategies for convex optimization. Calcolo, 54(1):207–224, 2017.

Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems, and Computers, pages 40–44, 1993.

N. Rao, S. Shah, and S. Wright. Forward-backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing, 63(21):5798–5811, 2015.

V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In Advances in Neural Information Processing Systems 30, pages 1119–1129, 2017.

S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010.

V. Temlyakov. Chebushev greedy algorithm in convex optimization. arXiv preprint arXiv:1312.1244, 2013.

V. Temlyakov. Greedy algorithms in convex optimization on Banach spaces. In Proceedings of the 48th Asilomar Conference on Signals, Systems, and Computers, pages 1331–1335, 2014.

V. Temlyakov. Greedy approximation in convex optimization. Constructive Approximation, 41(2):269–296, 2015.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

R. J. Tibshirani. A general framework for fast stagewise algorithms. Journal of Machine Learning Research, 16(1):2543–2588, 2015.

J. A. Tropp. Greed is good: algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

Q. Yao and J. T. Kwok. Greedy learning of generalized low-rank model. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 2294–2300, 2016.

X.-T. Yuan, P. Li, and T. Zhang. Gradient hard thresholding pursuit. Journal of Machine Learning Research, 18(166):1–43, 2018.

T. Zhang. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555–568, 2009.