{"title": "On Iterative Hard Thresholding Methods for High-dimensional M-Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 685, "page_last": 693, "abstract": "The use of M-estimators in generalized linear regression models in high dimensional settings requires risk minimization with hard L_0 constraints. Of the known methods, the class of projected gradient descent (also known as iterative hard thresholding (IHT)) methods is known to offer the fastest and most scalable solutions. However, the current state-of-the-art is only able to analyze these methods in extremely restrictive settings which do not hold in high dimensional statistical models. In this work we bridge this gap by providing the first analysis for IHT-style methods in the high dimensional statistical setting. Our bounds are tight and match known minimax lower bounds. Our results rely on a general analysis framework that enables us to analyze several popular hard thresholding style algorithms (such as HTP, CoSaMP, SP) in the high dimensional regression setting. Finally, we extend our analysis to the problem of low-rank matrix recovery.", "full_text": "On Iterative Hard Thresholding Methods for\n\nHigh-dimensional M-Estimation\n\nPrateek Jain\u2217\n\nAmbuj Tewari\u2020\n\nPurushottam Kar\u2217\n\n\u2217Microsoft Research, INDIA\n\n\u2020University of Michigan, Ann Arbor, USA\n\n{prajain,t-purkar}@microsoft.com, tewaria@umich.edu\n\nAbstract\n\nThe use of M-estimators in generalized linear regression models in high dimen-\nsional settings requires risk minimization with hard L0 constraints. Of the known\nmethods, the class of projected gradient descent (also known as iterative hard\nthresholding (IHT)) methods is known to offer the fastest and most scalable solu-\ntions. However, the current state-of-the-art is only able to analyze these methods\nin extremely restrictive settings which do not hold in high dimensional statisti-\ncal models. 
In this work we bridge this gap by providing the first analysis for IHT-style methods in the high dimensional statistical setting. Our bounds are tight and match known minimax lower bounds. Our results rely on a general analysis framework that enables us to analyze several popular hard thresholding style algorithms (such as HTP, CoSaMP, SP) in the high dimensional regression setting. Finally, we extend our analysis to the problem of low-rank matrix recovery.

1 Introduction

Modern statistical estimation is routinely faced with real world problems where the number of parameters p handily outnumbers the number of observations n. In general, consistent estimation of parameters is not possible in such a situation. Consequently, a rich line of work has focused on models that satisfy special structural assumptions such as sparsity or low-rank structure. Under these assumptions, several works (for example, see [1, 2, 3, 4, 5]) have established that consistent estimation is information theoretically possible in the "n ≪ p" regime as well.
The question of efficient estimation, however, is faced with feasibility issues since consistent estimation routines often end up solving NP-hard problems. Examples include sparse regression, which requires loss minimization with sparsity constraints, and low-rank regression, which requires dealing with rank constraints that are not efficiently solvable in general [6].
Interestingly, recent works have demonstrated that these hardness results can be avoided by assuming certain natural conditions on the loss function being minimized, such as restricted strong convexity (RSC) and restricted strong smoothness (RSS). The estimation routines proposed in these works typically make use of convex relaxations [5] or greedy methods [7, 8, 9] which do not suffer from infeasibility issues.
Despite this, certain limitations have precluded widespread use of these techniques.
Convex relaxation-based methods typically suffer from slow rates as they solve non-smooth optimization problems, apart from being hard to analyze in terms of global guarantees. Greedy methods, on the other hand, are slow in situations with non-negligible sparsity or relatively high rank, owing to their incremental approach of adding/removing individual support elements.
Instead, the methods of choice for practical applications are actually projected gradient descent (PGD) methods, also referred to as iterative hard thresholding (IHT) methods. These methods directly project the gradient descent update onto the underlying (non-convex) feasible set. This projection can be performed efficiently for several interesting structures such as sparsity and low rank. However, traditional PGD analyses for convex problems, viz. [10], do not apply to these techniques due to the non-convex structure of the problem.
An exception to this is the recent work [11] that demonstrates that PGD with non-convex regularization can offer consistent estimates for certain high-dimensional problems. However, the work in [11] is only able to analyze penalties such as SCAD, MCP and capped L1. Moreover, their framework cannot handle commonly used penalties such as L0 or low-rank constraints.
Insufficiency of RIP-based Guarantees for M-estimation. As noted above, PGD/IHT-style methods have been very popular in the literature for sparse recovery, and several algorithms including Iterative Hard Thresholding (IHT) [12] or GraDeS [13], Hard Thresholding Pursuit (HTP) [14], CoSaMP [15], Subspace Pursuit (SP) [16], and OMPR(ℓ) [17] have been proposed. However, the analysis of these algorithms has traditionally been restricted to settings that satisfy the Restricted Isometry Property (RIP) or incoherence property.
As the discussion below demonstrates, this renders these analyses inaccessible to high-dimensional statistical estimation problems.
All existing results analyzing these methods require the condition number of the loss function, restricted to sparse vectors, to be smaller than a universal constant. The best known such constant is due to the work of [17], which requires a bound on the RIP constant δ_2k ≤ 0.5 (or equivalently a bound (1 + δ_2k)/(1 − δ_2k) ≤ 3 on the condition number). In contrast, real-life high dimensional statistical settings, wherein pairs of variables can be arbitrarily correlated, routinely require estimation methods to perform under arbitrarily large condition numbers. In particular, if two variates have a covariance matrix like
[ 1        1 − ε ]
[ 1 − ε    1     ]
then the restricted condition number (on a support set of size just 2) of the sample matrix cannot be brought down below 1/ε even with infinitely many samples. In particular, when ε < 1/6, none of the existing results for hard thresholding methods offer any guarantees. Moreover, most of these analyses consider only the least squares objective. Although recent attempts have been made to extend this to general differentiable objectives [18, 19], the results continue to require that the restricted condition number be less than a universal constant and remain unsatisfactory in a statistical setting.
Overview of Results. Our main contribution in this work is an analysis of PGD/IHT-style methods in statistical settings. Our bounds are tight, achieve known minimax lower bounds [20], and hold for arbitrary differentiable, possibly even non-convex functions. Our results hold even when the underlying condition number is arbitrarily large and only require the function to satisfy RSC/RSS conditions.
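To make this concrete, a quick numerical check (a small NumPy sketch; the value ε = 0.01 is an arbitrary illustration, not from the paper) confirms that the condition number of a 2×2 covariance with off-diagonal entries 1 − ε grows like 2/ε:

```python
import numpy as np

# Covariance of two heavily correlated variates: [[1, 1-eps], [1-eps, 1]]
# has eigenvalues 2 - eps and eps, so its condition number is (2 - eps)/eps.
eps = 0.01
Sigma = np.array([[1.0, 1.0 - eps], [1.0 - eps, 1.0]])
kappa = np.linalg.cond(Sigma)
print(kappa)  # ~199: far beyond any universal-constant RIP-style bound
```

As ε → 0 this blows up, so no analysis that caps the restricted condition number at a universal constant can cover such instances.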
In particular, this reveals that these iterative methods are indeed applicable to statistical\nsettings, a result that escaped all previous works.\nOur \ufb01rst result shows that the PGD/IHT methods achieve global convergence if used with a relaxed\nprojection step. More formally, if the optimal parameter is s\u2217-sparse and the problem satis\ufb01es\nRSC and RSS constraints \u03b1 and L respectively (see Section 2), then PGD methods offer global\nconvergence so long as they employ projection to an s-sparse set where s \u2265 4(L/\u03b1)2s\u2217. This\ngives convergence rates that are identical to those of convex relaxation and greedy methods for the\nGaussian sparse linear model. We then move to a family of ef\ufb01cient \u201cfully corrective\u201d methods and\nshow as before, that for arbitrary functions satisfying the RSC/RSS properties, these methods offer\nglobal convergence.\nNext, we show that these results allow PGD-style methods to offer global convergence in a variety\nof statistical estimation problems such as sparse linear regression and low rank matrix regression.\nOur results effortlessly extend to the noisy setting as a corollary and give bounds similar to those of\n[21] that relies on solving an L1 regularized problem.\nOur proofs are able to exploit that even though hard-thresholding is not the prox-operator for any\nconvex prox function, it still provides strong contraction when projection is performed onto sets of\nsparsity s (cid:29) s\u2217. This crucial observation allows us to provide the \ufb01rst uni\ufb01ed analysis for hard\nthresholding based gradient descent algorithms. Our empirical results con\ufb01rm our predictions with\nrespect to the recovery properties of IHT-style algorithms on badly-conditioned sparse recovery\nproblems, as well as demonstrate that these methods can be orders of magnitudes faster than their\nL1 and greedy counterparts.\n\n2\n\n\fOrganization. Section 2 sets the notation and the problem statement. 
Section 3 introduces the PGD/IHT algorithm that we study and proves that the method guarantees recovery assuming the RSC/RSS property. We also generalize our guarantees to the problem of low-rank matrix regression. Section 4 then provides crisp sample complexity bounds and statistical guarantees for the PGD/IHT estimators. Section 5 extends our analysis to a broad family of compressive sensing algorithms that includes the so-called fully-corrective hard thresholding methods and provides similar results for them as well. We present some empirical results in Section 6 and conclude in Section 7.

2 Problem Setup and Notations

High-dimensional Sparse Estimation. Given data points X = [X1, …, Xn]ᵀ, where Xi ∈ Rᵖ, and the target Y = [Y1, …, Yn]ᵀ, where Yi ∈ R, the goal is to compute an s*-sparse θ* s.t.

θ* = arg min_{θ : ∥θ∥₀ ≤ s*} f(θ).   (1)

Typically, f can be thought of as an empirical risk function, i.e. f(θ) = (1/n) Σᵢ ℓ(⟨Xi, θ⟩, Yi) for some loss function ℓ (see examples in Section 4). However, for our analysis of PGD and other algorithms, we need not assume any property of f other than differentiability and the following two RSC and RSS properties.

Definition 1 (RSC Property). A differentiable function f : Rᵖ → R is said to satisfy restricted strong convexity (RSC) at sparsity level s = s1 + s2 with strong convexity parameter α_s if the following holds for all θ1, θ2 s.t. ∥θ1∥₀ ≤ s1 and ∥θ2∥₀ ≤ s2:

f(θ1) − f(θ2) ≥ ⟨θ1 − θ2, ∇_θ f(θ2)⟩ + (α_s/2) ∥θ1 − θ2∥₂².

Definition 2 (RSS Property). A differentiable function f : Rᵖ → R is said to satisfy restricted strong smoothness (RSS) at sparsity level s = s1 + s2 with strong smoothness parameter L_s if the following holds for all θ1, θ2 s.t. ∥θ1∥₀ ≤ s1 and ∥θ2∥₀ ≤ s2:

f(θ1) − f(θ2) ≤ ⟨θ1 − θ2, ∇_θ f(θ2)⟩ + (L_s/2) ∥θ1 − θ2∥₂².

Low-rank Matrix Regression. Low-rank matrix regression is similar to sparse estimation as presented above, except that each data point is now a matrix, i.e. Xi ∈ R^{p1×p2}, the goal being to estimate a low-rank matrix W ∈ R^{p1×p2} that minimizes the empirical loss function on the given data:

W* = arg min_{W : rank(W) ≤ r} f(W).   (2)

For this problem the RSC and RSS properties for f are defined similarly to Definitions 1 and 2, except that the L0 norm is replaced by the rank function.

3 Iterative Hard-thresholding Method

In this section we study the popular projected gradient descent (a.k.a. iterative hard thresholding) method for the case of the feasible set being the set of sparse vectors (see Algorithm 1 for pseudocode). The projection operator Ps(z) can be implemented efficiently in this case by projecting z onto the set of s-sparse vectors, i.e. by retaining the s largest elements (in magnitude) of z. The standard projection property implies that ∥Ps(z) − z∥₂² ≤ ∥θ′ − z∥₂² for all ∥θ′∥₀ ≤ s. However, it turns out that we can prove a significantly stronger property of hard thresholding for the case when ∥θ′∥₀ ≤ s* and s* ≪ s. This property is key to analyzing IHT and is formalized below.
Lemma 1. For any index set I, any z ∈ R^I, let θ = Ps(z).
Then for any \u03b8\u2217 \u2208 RI such that\n(cid:107)\u03b8\u2217(cid:107)0 \u2264 s\u2217, we have\n\n2 \u2264 (cid:107)\u03b8(cid:48) \u2212 z(cid:107)2\n\n(cid:107)\u03b8 \u2212 z(cid:107)2\n\n2 \u2264 |I| \u2212 s\n\n|I| \u2212 s\u2217(cid:107)\u03b8\u2217 \u2212 z(cid:107)2\n\n2.\n\nSee Appendix A for a detailed proof.\nOur analysis combines the above observation with the RSC/RSS properties of f to provide geometric\nconvergence rates for the IHT procedure below.\n\n3\n\n\fAlgorithm 1 Iterative Hard-thresholding\n1: Input: Function f with gradient oracle, sparsity level s, step-size \u03b7\n2: \u03b81 = 0, t = 1\n3: while not converged do\n4:\n5: end while\n6: Output: \u03b8t\n\n\u03b8t+1 = Ps(\u03b8t \u2212 \u03b7\u2207\u03b8f (\u03b8t)), t = t + 1\n\nrespectively. Let Algorithm 1 be invoked with f, s \u2265 32(cid:0) L\n\nTheorem 1. Let f have RSC and RSS parameters given by L2s+s\u2217 (f ) = L and \u03b12s+s\u2217 (f ) = \u03b1\n3L . Also let \u03b8\u2217 =\n)) satis\ufb01es:\n\narg min\u03b8,(cid:107)\u03b8(cid:107)0\u2264s\u2217 f (\u03b8). Then, the \u03c4-th iterate of Algorithm 1, for \u03c4 = O( L\n\ns\u2217 and \u03b7 = 2\n\n\u03b1 \u00b7 log( f (\u03b80)\n\n(cid:1)2\n\n\u03b1\n\n\u0001\n\nf (\u03b8\u03c4 ) \u2212 f (\u03b8\u2217) \u2264 \u0001.\n\nProof. 
(Sketch) Let St = supp(\u03b8t), S\u2217 = supp(\u03b8\u2217), St+1 = supp(\u03b8t+1) and I t = S\u2217\u222aSt\u222aSt+1.\nUsing the RSS property and the fact that supp(\u03b8t) \u2286 I t and supp(\u03b8t+1) \u2286 I t, we have:\nf (\u03b8t+1) \u2212 f (\u03b8t) \u2264 (cid:104)\u03b8t+1 \u2212 \u03b8t, gt(cid:105) +\n\nL\n=\n2\n\u03b61\u2264 L\n2\n\nI t +\n\nI t \u2212 \u03b8t\n(cid:107)\u03b8t+1\n|I t| \u2212 s\n|I t| \u2212 s\u2217 \u00b7 (cid:107)\u03b8\u2217\n\u00b7\n\n(cid:107)\u03b8t+1 \u2212 \u03b8t(cid:107)2\nL\n2,\n2\n2 \u2212 1\n\u00b7 gt\n(cid:107)gt\n2\n2L\n3L\n\u00b7 gt\nI t(cid:107)2\nI t \u2212 \u03b8t\n1\nL\n\nI t(cid:107)2\n\nI t +\n\nI t(cid:107)2\n2,\n2 \u2212 1\n2L\n\n((cid:107)gt\n\nI t\\(St\u222aS\u2217)(cid:107)2\n\nSt\u222aS\u2217(cid:107)2\n2),\n(3)\nwhere \u03b61 follows from an application of Lemma 1 with I = I t and the Pythagoras theorem. The\nabove equation has three critical terms. The \ufb01rst term can be bounded using the RSS condition.\nUsing f (\u03b8t) \u2212 f (\u03b8\u2217) \u2264 (cid:104)gt\n2 bounds the third term\nS\u2217 can be arbitrarily small.\nin (3). The second term is more interesting as in general elements of gt\nS\u2217\\St+1 as they are selected by\nHowever, elements of gt\nhard-thresholding. Combining this insight with bounds for gt\nS\u2217\\St+1 and with (3), we obtain the\ntheorem. See Appendix A for a detailed proof.\n\nI t\\(St\u222aS\u2217) should be at least as large as gt\n\nSt\u222aS\u2217 , \u03b8t \u2212 \u03b8\u2217(cid:105) \u2212 \u03b1\n\n2 (cid:107)\u03b8t \u2212 \u03b8\u2217(cid:107)2\n\nSt\u222aS\u2217(cid:107)2\n\n2 + (cid:107)gt\n\n2 \u2264 1\n\n2\u03b1(cid:107)gt\n\n3.1 Low-rank Matrix Regression\n\nWe now generalize our previous analysis to a projected gradient descent (PGD) method for low-rank\nmatrix regression. 
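Before moving to the matrix case, the vector version of Algorithm 1 can be sketched as follows (a minimal NumPy sketch; the noiseless least-squares instance, the step size, the iteration count, and the modest projection level s = 2s* are illustrative choices — the theory above asks for the more conservative s ≥ 32(L/α)² s*):

```python
import numpy as np

def hard_threshold(z, s):
    """P_s(z): keep the s largest-magnitude entries of z, zero out the rest."""
    theta = np.zeros_like(z)
    top = np.argsort(np.abs(z))[-s:]
    theta[top] = z[top]
    return theta

def iht(grad_f, p, s, eta, n_iters=100):
    """Algorithm 1: theta_{t+1} = P_s(theta_t - eta * grad f(theta_t))."""
    theta = np.zeros(p)
    for _ in range(n_iters):
        theta = hard_threshold(theta - eta * grad_f(theta), s)
    return theta

# Illustration on a noiseless, well-conditioned least-squares instance.
rng = np.random.default_rng(0)
n, p, s_star = 400, 50, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s_star] = 1.0
y = X @ theta_star
grad = lambda th: X.T @ (X @ th - y) / n  # gradient of (1/2n)||y - X theta||^2
theta_hat = iht(grad, p, s=2 * s_star, eta=0.5)
print(np.linalg.norm(theta_hat - theta_star))  # essentially zero
```

Note that the per-iteration cost is one gradient evaluation plus a top-s selection, which is what makes these methods so scalable compared to solving a convex program.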
Formally, we study the following problem:\n\nf (W ), s.t., rank(W ) \u2264 s.\n\nmin\nW\n\nThe hard-thresholding projection step for low-rank matrices can be solved using SVD i.e.\n\nP Ms(W ) = Us\u03a3sV T\ns ,\n\nwhere W = U \u03a3V T is the singular value decomposition of W . Us, Vs are the top-s singular vectors\n(left and right, respectively) of W and \u03a3s is the diagonal matrix of the top-s singular values of W .\nTo proceed, we \ufb01rst note a property of the above projection similar to Lemma 1.\nLemma 2. Let W \u2208 Rp1\u00d7p2 be a rank-|I t| matrix and let p1 \u2265 p2. Then for any rank-s\u2217 matrix\nW \u2217 \u2208 Rp1\u00d7p2 we have\n\n(cid:107)P Ms(W ) \u2212 W(cid:107)2\n\nF \u2264 |I t| \u2212 s\n\n|I t| \u2212 s\u2217(cid:107)W \u2217 \u2212 W(cid:107)2\n\nF .\n\nProof. Let W = U \u03a3V T be the singular value decomposition of W . Now, (cid:107)P Ms(W ) \u2212 W(cid:107)2\n\ni=s+1 \u03c32\n\n(cid:80)|I t|\n2 \u2264 |I t| \u2212 s\nwhere the last step uses the von Neumann\u2019s trace inequality (T r(A \u00b7 B) \u2264(cid:80)\n\ni = (cid:107)Ps(diag(\u03a3)) \u2212 diag(\u03a3)(cid:107)2\nF \u2264 |I t| \u2212 s\n\nW . Using Lemma 1, we get:\n(cid:107)P Ms(W ) \u2212 W(cid:107)2\n\n|I t| \u2212 s\u2217(cid:107)\u03a3\u2217 \u2212 diag(\u03a3)(cid:107)2\n\nF =\n2, where \u03c31 \u2265 \u00b7\u00b7\u00b7 \u2265 \u03c3|I t| \u2265 0 are the singular values of\n\n|I t| \u2212 s\u2217(cid:107)W \u2217 \u2212 W(cid:107)2\n\n(6)\n\nF ,\n\ni \u03c3i(A)\u03c3i(B)).\n\n(4)\n\n(5)\n\n4\n\n\fSuppose we invoke it with f, s \u2265 32(cid:0) L\n\nThe following result for low-rank matrix regression immediately follows from Lemma 4.\nTheorem 2. Let f have RSC and RSS parameters given by L2s+s\u2217 (f ) = L and \u03b12s+s\u2217 (f ) = \u03b1.\nReplace the projection operator Ps in Algorithm 1 with its matrix counterpart P Ms as de\ufb01ned in (5).\n3L . 
Also let W \u2217 = arg minW,rank(W )\u2264s\u2217 f (W ).\n\u03b1 \u00b7 log( f (W 0)\nf (W \u03c4 ) \u2212 f (W \u2217) \u2264 \u0001.\n\nThen the \u03c4-th iterate of Algorithm 1, for \u03c4 = O( L\n\ns\u2217, \u03b7 = 2\n\n) satis\ufb01es:\n\n(cid:1)2\n\n\u03b1\n\n\u0001\n\nProof. A proof progression similar to that of Theorem 1 suf\ufb01ces. The only changes that need to be\nmade are: \ufb01rstly Lemma 2 has to be invoked in place of Lemma 1. Secondly, in place of consid-\nering vectors restricted to a subset of coordinates viz. \u03b8S, gt\nI, we would need to consider matrices\nrestricted to subspaces i.e. WS = USU T\nS W where US is a set of singular vectors spanning the\nrange-space of S.\n\n4 High Dimensional Statistical Estimation\n\nThis section elaborates on how the results of the previous section can be used to give guarantees for\nIHT-style techniques in a variety of statistical estimation problems. We will \ufb01rst present a generic\nconvergence result and then specialize it to various settings. Suppose we have a sample of data\npoints Z1:n and a loss function L(\u03b8; Z1:n) that depends on a parameter \u03b8 and the sample. Then we\ncan show the following result. (See Appendix B for a proof.)\nSuppose L(\u03b8; Z1:n) is differentiable and satis-\nTheorem 3. Let \u00af\u03b8 be any s\u2217-sparse vector.\n\ufb01es RSC and RSS at sparsity level s + s\u2217 with parameters \u03b1s+s\u2217 and Ls+s\u2217 respectively, for\ns \u2265 32\ns\u2217. Let \u03b8\u03c4 be the \u03c4-th iterate of Algorithm 1 for \u03c4 chosen as in Theorem 1\nand \u03b5 be the function value error incurred by Algorithm 1. Then we have\n\n(cid:16) L2s+s\u2217\n\n(cid:17)2\n\n\u03b12s+s\u2217\n\n\u221a\n(cid:107) \u00af\u03b8 \u2212 \u03b8\u03c4(cid:107)2 \u2264 2\n\ns + s\u2217(cid:107)\u2207L( \u00af\u03b8; Z1:n)(cid:107)\u221e\n\n\u03b1s+s\u2217\n\n+\n\n2\u0001\n\n\u03b1s+s\u2217\n\n.\n\n(cid:115)\n\n1\n\nNote that the result does not require the loss function to be convex. 
This fact will be crucially used\nlater. We now apply the above result to several statistical estimation scenarios.\nSparse Linear Regression. Here Zi = (Xi, Yi) \u2208 Rp \u00d7 R and Yi = (cid:104) \u00af\u03b8, Xi(cid:105) + \u03bei where\n\u03bei \u223c N (0, \u03c32) is label noise. The empirical loss is the usual least squares loss i.e. L(\u03b8; Z1:n) =\nn(cid:107)Y \u2212 X\u03b8(cid:107)2\n2. Suppose X1:n are drawn i.i.d. from a sub-Gaussian distribution with covariance\n\u03a3 with \u03a3jj \u2264 1 for all j. Then [22, Lemma 6] immediately implies that RSC and RSS at\nsparsity level k hold, with probability at least 1 \u2212 e\u2212c0n, with \u03b1k = 1\nand\n(c0, c1 are universal constants). So we can set k = 2s + s\u2217 and if\nLk = 2\u03c3max(\u03a3) + c1\n4 \u03c3min(\u03a3) and Lk \u2264 2.25\u03c3max(\u03a3) which means that\nn > 4c1k log p/\u03c3min(\u03a3) then we have \u03b1k \u2265 1\nLk/9\u03b1k \u2264 \u03ba(\u03a3) := \u03c3max(\u03a3)/\u03c3min(\u03a3). Thus it is enough to choose s = 2592\u03ba(\u03a3)2s\u2217 and ap-\nply Theorem 3. Note that (cid:107)\u2207L( \u00af\u03b8; Z1:n)(cid:107)\u221e = (cid:107)X T \u03be/n(cid:107)\u221e \u2264 2\u03c3\nn with probability at least\n1\u2212c2p\u2212c3 (c2, c3 are universal constants). Putting everything together, we have the following bound\nwith high probability:\n\n2 \u03c3min(\u03a3) \u2212 c1\n\nk log p\n\nk log p\n\nn\n\nn\n\n(cid:113) log p\n(cid:114) \u0001\n\n(cid:114)\n\n(cid:107) \u00af\u03b8 \u2212 \u03b8\u03c4(cid:107)2 \u2264 145\n\n\u03ba(\u03a3)\n\n\u03c3min(\u03a3)\n\n\u03c3\n\ns\u2217 log p\n\nn\n\n+ 2\n\n,\n\n\u03c3min(\u03a3)\n\nwhere \u0001 is the function value error incurred by Algorithm 1.\nNoisy and Missing Data. We now look at cases with feature noise as well. More speci\ufb01cally,\nassume that we only have access to \u02dcXi\u2019s that are corrupted versions of Xi\u2019s. 
Two models of noise are\npopular in literature [21]: a) (additive noise) \u02dcXi = Xi+Wi where Wi \u223c N (0, \u03a3W ), and b) (missing\ndata) \u02dcX is an R\u222a{(cid:63)}-valued matrix obtained by independently, with probability \u03bd \u2208 [0, 1), replacing\neach entry in X with (cid:63). For the case of additive noise (missing data can be handled similarly),\n2 \u03b8T \u02c6\u0393\u03b8 \u2212 \u02c6\u03b3T \u03b8 where \u02c6\u0393 = \u02dcX T \u02dcX/n \u2212 \u03a3W and \u02c6\u03b3 = \u02dcX T Y /n are\nZi = ( \u02dcXi, Yi) and L(\u03b8; Z1:n) = 1\n\n5\n\n\fAlgorithm 2 Two-stage Hard-thresholding\n1: Input: function f with gradient oracle, sparsity level s, sparsity expansion level (cid:96)\n2: \u03b81 = 0, t = 1\n3: while not converged do\n4:\n5:\n6:\n7:\n8:\n9: end while\n10: Output: \u03b8t\n\n(cid:101)\u03b8t = Ps(\u03b2t)\n\u03b8t+1 = arg min\u03b8,supp(\u03b8)\u2286supp((cid:101)\u03b8t) f (\u03b8), t = t + 1\n\ngt = \u2207\u03b8f (\u03b8t), St = supp(\u03b8t)\nZ t = St \u222a (largest (cid:96) elements of |gt\n\u03b2t = arg min\u03b2,supp(\u03b2)\u2286Zt f (\u03b2)\n\nSt|)\n\n// fully corrective step\n\n// fully corrective step\n\nunbiased estimators of \u03a3 and \u03a3T \u00af\u03b8 respectively. [21, Appendix A, Lemma 1] implies that RSC, RSS\n2 \u03c3min(\u03a3) \u2212\nat sparsity level k hold, with failure probability exponentially small in n, with \u03b1k = 1\nop+(cid:107)\u03a3W (cid:107)2\n, 1) log p.\nk\u03c4 (p)/n and Lk = 1.5\u03c3max(\u03a3) + k\u03c4 (p)/n for \u03c4 (p) = c0\u03c3min(\u03a3) max(\n\u03c32\nThus for k = 2s + s\u2217 and n \u2265 4k\u03c4 (p)/\u03c3min(\u03a3) we have Lk/\u03b1k \u2264 7\u03ba(\u03a3). 
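A sketch of this corrected loss for the additive-noise model makes the non-convexity concrete (a minimal NumPy sketch; the problem sizes and the choice Σ_W = 0.1·I are assumptions for illustration, and Σ_W is taken as known, as in [21]):

```python
import numpy as np

def corrected_loss_and_grad(theta, X_tilde, y, Sigma_W):
    """L(theta) = 0.5 theta^T Gamma_hat theta - gamma_hat^T theta, with the
    unbiased surrogates Gamma_hat = X~^T X~ / n - Sigma_W (estimates Sigma)
    and gamma_hat = X~^T y / n (estimates Sigma theta_bar)."""
    n = X_tilde.shape[0]
    Gamma_hat = X_tilde.T @ X_tilde / n - Sigma_W
    gamma_hat = X_tilde.T @ y / n
    loss = 0.5 * theta @ Gamma_hat @ theta - gamma_hat @ theta
    return loss, Gamma_hat @ theta - gamma_hat

# In the high-dimensional regime n < p, X~^T X~ / n is rank-deficient, so
# subtracting the PSD matrix Sigma_W gives Gamma_hat a negative eigenvalue:
# the corrected loss L is genuinely non-convex.
rng = np.random.default_rng(0)
n, p = 5, 8  # illustrative sizes with n < p
X_tilde = rng.standard_normal((n, p))
y = rng.standard_normal(n)
Sigma_W = 0.1 * np.eye(p)
Gamma_hat = X_tilde.T @ X_tilde / n - Sigma_W
min_eig = np.linalg.eigvalsh(Gamma_hat).min()
print(min_eig)  # negative
```

This is exactly the situation the RSC/RSS-based analysis is built for: the loss is non-convex globally, yet well behaved on sparse directions.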
Note that L(\u00b7; Z1:n)\nis non-convex but we can still apply Theorem 3 with s = 1568\u03ba(\u03a3)2s\u2217 because RSC, RSS hold.\nUsing the high probability upper bound (see [21, Appendix A, Lemma 2]) (cid:107)\u2207L( \u00af\u03b8; Z1:n)(cid:107)\u221e \u2264\nc1 \u02dc\u03c3(cid:107) \u00af\u03b8(cid:107)2\n\n((cid:107)\u03a3(cid:107)2\n\nmin(\u03a3)\n\nop)2\n\n(cid:114)\n\n\u03ba(\u03a3)\n\n\u03c3min(\u03a3)\n\n\u02dc\u03c3(cid:107) \u00af\u03b8(cid:107)2\n\ns\u2217 log p\n\nn\n\n+ 2\n\n(cid:114) \u0001\n\n\u03c3min(\u03a3)\n\n(cid:112)log p/n gives us the following\n(cid:113)(cid:107)\u03a3W(cid:107)2\n\n(cid:107) \u00af\u03b8 \u2212 \u03b8\u03c4(cid:107)2 \u2264 c2\nop + (cid:107)\u03a3(cid:107)2\n\nwhere \u02dc\u03c3 =\n\nop((cid:107)\u03a3W(cid:107)op + \u03c3) and \u0001 is the function value error in Algorithm 1.\n\n5 Fully-corrective Methods\n\nIn this section, we study a variety of \u201cfully-corrective\u201d methods. These methods keep the optimiza-\ntion objective fully minimized over the support of the current iterate. To this end, we \ufb01rst prove a\nfundamental theorem for fully-corrective methods that formalizes the intuition that for such meth-\nods, a large function value should imply a large gradient at any sparse \u03b8 as well. This result is similar\nto Lemma 1 of [17] but holds under RSC/RSS conditions (rather than the RIP condition as in [17]),\nas well as for the general loss functions. See Appendix C for a detailed proof.\nLemma 3. Consider a function f with RSC parameter given by L2s+s\u2217 (f ) = L and RSS parameter\ngiven by \u03b12s+s\u2217 (f ) = \u03b1. Let \u03b8\u2217 = arg min\u03b8,(cid:107)\u03b8(cid:107)0\u2264s\u2217 f (\u03b8) with S\u2217 = supp(\u03b8\u2217). Let St \u2286 [p] be\nany subset of co-ordinates s.t. |St| \u2264 s. Let \u03b8t = arg min\u03b8,supp(\u03b8)\u2286St f (\u03b8). 
Then, we have:\n\n2\u03b1(f (\u03b8t) \u2212 f (\u03b8\u2217)) \u2264 (cid:107)gt\n\nSt\u222aS\u2217(cid:107)2\n\n2 \u2212 \u03b12(cid:107)\u03b8t\n\nSt\\S\u2217(cid:107)2\n\n2\n\nTwo-stage Methods. We will, for now, concentrate on a family of two-stage fully corrective meth-\nods that contains popular compressive sensing algorithms like CoSaMP and Subspace Pursuit (see\nAlgorithm 2 for pseudocode). These algorithms have thus far been analyzed only under RIP con-\nditions for the least squares objective. Using our analysis framework developed in the previous\nsections, we present a generic RSC/RSS-based analysis for general two-stage methods for arbitrary\nloss functions. Our analysis shall use the following key observation that the the hard thresholding\nstep in two stage methods does not increase the objective function a lot.\nWe defer the analysis of partial hard thresholding methods to a later version of the paper. This family\nincludes the OMPR((cid:96)) method [17], which is known to provide the best known RIP guarantees in\nthe compressive sensing setting. Using our proof techniques, we can show that this method offers\ngeometric convergence rates in the statistical setting as well.\n\nLemma 4. Let Zt \u2286 [n] and |Zt| \u2264 q. Let \u03b2t = arg min\u03b2,supp(\u03b2)\u2286Zt f (\u03b2) and (cid:98)\u03b8t = Pq(\u03b2t).\n\nThen, the following holds:\n\nf ((cid:98)\u03b8t) \u2212 f (\u03b2t) \u2264 L\n\n\u03b1\n\n\u00b7\n\n(cid:96)\n\ns + (cid:96) \u2212 s\u2217 \u00b7 (f (\u03b2t) \u2212 f (\u03b8\u2217)).\n\n6\n\n\f(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 1: A comparison of hard thresholding techniques (HTP) and projected gradient methods\n(GraDeS) with L1 and greedy methods (FoBa) on sparse noisy linear regression tasks. 1(a) gives\nthe number of undiscovered elements from supp(\u03b8\u2217) as label noise levels are increased. 1(b) shows\nthe variation in running times with increasing dimensionality p. 
1(c) gives the variation in running\ntimes (in logscale) when the true sparsity level s\u2217 is increased keeping p \ufb01xed. HTP and GraDeS are\nclearly much more scalable than L1 and FoBa. 1(d) shows the recovery properties of different IHT\nmethods under large condition number (\u03ba = 50) setting as the size of projected set is increased.\n\n(cid:107)(cid:98)\u03b8t \u2212 \u03b2t(cid:107)2\n\n2\n\nProof. Let vt = \u2207\u03b8f (\u03b2t). Then, using the RSS property we get:\n\nf ((cid:98)\u03b8t) \u2212 f (\u03b2t) \u2264 (cid:104)(cid:98)\u03b8t \u2212 \u03b2t, vt(cid:105) +\n|s + (cid:96) \u2212 s\u2217|(cid:107)w \u2212 \u03b2t(cid:107)2\nnoting that supp((cid:98)\u03b8t) \u2286 Zt. \u03b62 follows by Lemma 1 and the fact that (cid:107)w(cid:107)0 \u2264 s\u2217. Now, using the\n\nwhere w is any vector such that wZt\nRSC property and the fact that \u2207\u03b8f (\u03b2t) = 0, we have:\n\n= 0 and (cid:107)w(cid:107)0 \u2264 s\u2217. \u03b61 follows by observing vt\n\n2,\n(7)\n= 0 and by\n\n(cid:107)(cid:98)\u03b8t \u2212 \u03b2t(cid:107)2\n\n\u03b62\u2264 L\n2\n\n2\n\n|(cid:96)|\n\nL\n2\n\n\u03b61=\n\nL\n2\n\nZt\n\n(cid:107)w \u2212 \u03b2t(cid:107)2\n\n2 \u2264 f (\u03b2t) \u2212 f (w) \u2264 f (\u03b2t) \u2212 f (\u03b8\u2217).\n\n\u03b1\n2\n\n(8)\n\nThe result now follows by combining (7) and (8).\nTheorem 4. Let f have RSC and RSS parameters given by \u03b12s+s\u2217 (f ) = \u03b1 and L2s+(cid:96)(f ) =\nL resp. Call Algorithm 2 with f, (cid:96) \u2265 s\u2217 and s \u2265 4 L2\n\u03b12 s\u2217. Also let \u03b8\u2217 =\narg min\u03b8,(cid:107)\u03b8(cid:107)0\u2264s\u2217 f (\u03b8). 
Then, the \u03c4-th iterate of Algorithm 2, for \u03c4 = O( L\n) satis\ufb01es:\n\n\u03b12 (cid:96) + s\u2217 \u2212 (cid:96) \u2265 4 L2\n\n\u03b1 \u00b7 log( f (\u03b80)\n\n\u0001\n\nf (\u03b8\u03c4 ) \u2212 f (\u03b8\u2217) \u2264 \u0001.\n\nSee Appendix C for a detailed proof.\n\n6 Experiments\n\nWe conducted simulations on high dimensional sparse linear regression problems to verify our pre-\ndictions. Our experiments demonstrate that hard thresholding and projected gradient techniques can\nnot only offer recovery in stochastic setting, but offer much more scalable routines for the same.\nData: Our problem setting is identical to the one described in the previous section. We \ufb01xed a\nparameter vector \u00af\u03b8 by choosing s\u2217 random coordinates and setting them randomly to \u00b11 values.\nData samples were generated as Zi = (Xi, Yi) where Xi \u223c N (0, Ip) and Yi = (cid:104) \u00af\u03b8, Xi(cid:105) + \u03bei where\n\u03bei \u223c N (0, \u03c32). We studied the effect of varying dimensionality p, sparsity s\u2217, sample size n and\nlabel noise level \u03c3 on the recovery properties of the various algorithms as well as their run times.\nWe chose baseline values of p = 20000, s\u2217 = 100, \u03c3 = 0.1, n = fo \u00b7 s\u2217 log p where fo is the\noversampling factor with default value fo = 2. Keeping all other quantities \ufb01xed, we varied one of\nthe quantities and generated independent data samples for the experiments.\nAlgorithms: We studied a variety of hard-thresholding style algorithms including HTP [14],\nGraDeS [13] (or IHT [12]), CoSaMP [15], OMPR [17] and SP [16]. 
We compared them with a\nstandard implementation of the L1 projected scaled sub-gradient technique [23] for the lasso prob-\nlem and a greedy method FoBa [24] for the same.\n\n7\n\n00.10.20.30.4020406080Noise level (sigma)Support Recovery Error HTPGraDeSL1FoBa0.511.522.5x 104050100150200Dimensionality (p)Runtime (sec) HTPGraDeSL1FoBa010020030040050010\u2212310\u22122100102104Sparsity (s*)Runtime (sec) HTPGraDeSL1FoBa80100120140160010203040Projected Sparsity (s)Support Recovery Error CoSaMPHTPGraDeS\fEvaluation Metrics: For the baseline noise level \u03c3 = 0.1, we found that all the algorithms were\nable to recover the support set within an error of 2%. Consequently, our focus shifted to running\ntimes for these experiments. In the experiments where noise levels were varied, we recorded, for\neach method, the number of undiscovered support set elements.\nResults: Figure1 describes the results of our experiments in graphical form. For sake of clarity\nwe included only HTP, GraDeS, L1 and FoBa results in these graphs. Graphs for the other algo-\nrithms CoSaMP, SP and OMPR can be seen in the supplementary material. The graphs indicate that\nwhereas hard thresholding techniques are equally effective as L1 and greedy techniques for recov-\nery in noisy settings, as indicated by Figure1(a), the former can be much more ef\ufb01cient and scalable\nthan the latter. For instance, as Figure1(b), for the base level of p = 20000, HTP was 150\u00d7 faster\nthan the L1 method. For higher values of p, the runtime gap widened to more than 350\u00d7. We also\nnote that in both these cases, HTP actually offered exact support recovery whereas L1 was unable to\nrecover 2 and 4 support elements respectively.\nAlthough FoBa was faster than L1 on Figure1(b) experiments, it was still slower than HTP by 50\u00d7\nand 90\u00d7 for p = 20000 and 25000 respectively. Moreover, due to its greedy and incremental\nnature, FoBa was found to suffer badly in settings with larger true sparsity levels. 
As Figure 1(c) indicates, even at moderate sparsity levels of s* = 300 and 500, FoBa is 60–75× slower than HTP. As mentioned before, the reason for this slowdown is the greedy approach followed by FoBa: whereas HTP took fewer than 5 iterations to converge on these two problems, FoBa spent 300 and 500 iterations respectively. GraDeS offered much better run times in comparison, though it was still slower than HTP, by 30–40× for larger values of p and 2–5× for larger values of s*.
Experiments on badly conditioned problems. We also ran experiments to verify the performance of IHT algorithms in the high condition number setting. The values of p, s* and σ were kept at baseline levels. After selecting the optimal parameter vector θ̄, we selected s*/2 random coordinates from its support and s*/2 random coordinates outside its support, and constructed a covariance matrix with heavy correlations between these chosen coordinates. The condition number of the resulting matrix was close to 50. Samples were drawn from this distribution, and the recovery properties of the different IHT-style algorithms were observed as the projected sparsity level s was increased. Our results (see Figure 1(d)) corroborate our theoretical observation that, with an enlarged projection size, these algorithms show a remarkable improvement in recovery properties on ill-conditioned problems.

7 Discussion and Conclusions

In our work we studied iterative hard thresholding algorithms and showed that these techniques can offer global convergence guarantees for arbitrary, possibly non-convex, differentiable objective functions that nevertheless satisfy Restricted Strong Convexity/Smoothness (RSC/RSM) conditions. Our results apply to a large family of algorithms that includes existing algorithms such as IHT, GraDeS, CoSaMP, SP and OMPR.
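As a concrete illustration, the basic IHT template for the least-squares loss can be sketched as follows: a full gradient step followed by hard thresholding onto the s largest-magnitude coordinates, run with an enlarged projection size s > s*. This is a minimal sketch for exposition only; the step-size choice, objective, and toy problem sizes are illustrative defaults, not the settings used in our experiments.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    top = np.argsort(np.abs(v))[-s:]
    out[top] = v[top]
    return out

def iht(X, y, s, eta=None, iters=300):
    """Projected gradient descent (IHT template) for the least-squares loss
    0.5 * ||y - X @ theta||^2 under ||theta||_0 <= s.
    The step size eta = 1 / ||X||_2^2 is a conventional default here,
    not a prescription from our analysis."""
    p = X.shape[1]
    if eta is None:
        eta = 1.0 / np.linalg.norm(X, 2) ** 2  # spectral norm of X
    theta = np.zeros(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)           # gradient of the squared loss
        theta = hard_threshold(theta - eta * grad, s)
    return theta

# Toy noiseless recovery problem, projecting onto an enlarged support s = 2 * s_star.
rng = np.random.default_rng(0)
n, p, s_star = 200, 400, 5
X = rng.standard_normal((n, p)) / np.sqrt(n)
theta_star = np.zeros(p)
theta_star[:s_star] = 1.0
y = X @ theta_star
theta_hat = iht(X, y, s=2 * s_star)
```

Replacing the hard-thresholding projection with top singular-vector truncation yields the analogous template for the low-rank setting.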
Previously, the analyses of these algorithms required stringent RIP conditions that did not allow the (restricted) condition number to exceed universal constants specific to these algorithms.
Our basic insight was to relax this stringent requirement by running these iterative algorithms with an enlarged support size. We showed that guarantees for high-dimensional M-estimation follow seamlessly from our results by invoking results on RSC/RSM conditions that have already been established in the literature for a variety of statistical settings. Our theoretical results put hard thresholding methods on par with those based on convex relaxation or greedy algorithms. Our experimental results demonstrate that hard thresholding methods outperform convex relaxation and greedy methods in terms of running time, sometimes by orders of magnitude, all the while offering competitive or better recovery properties.
Our results apply to sparsity and low-rank structure, arguably two of the most commonly used structures in high-dimensional statistical learning problems. In future work, it would be interesting to generalize our algorithms and their analyses to more general structures. A unified analysis for general structures will probably create interesting connections with existing unified frameworks such as those based on decomposability [5] and atomic norms [25].

References

[1] Peter Bühlmann and Sara Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.

[2] Sahand Negahban, Martin J. Wainwright, et al. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.

[3] Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls.
IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

[4] Angelika Rohde and Alexandre B. Tsybakov. Estimation of high-dimensional low-rank matrices. The Annals of Statistics, 39(2):887–930, 2011.

[5] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, Bin Yu, et al. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

[6] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

[7] Ji Liu, Ryohei Fujimaki, and Jieping Ye. Forward-backward greedy algorithms for general convex smooth functions over a cardinality constraint. In Proceedings of The 31st International Conference on Machine Learning, pages 503–511, 2014.

[8] Ali Jalali, Christopher C. Johnson, and Pradeep D. Ravikumar. On learning discrete graphical models using greedy methods. In NIPS, pages 1935–1943, 2011.

[9] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.

[10] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Springer, 2004.

[11] P. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima, 2013. arXiv:1305.2436 [math.ST].

[12] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.

[13] Rahul Garg and Rohit Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, 2009.

[14] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM J. on Num.
Anal., 49(6):2543–2563, 2011.

[15] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal., 26:301–321, 2008.

[16] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inf. Theory, 55(5):2230–2249, 2009.

[17] Prateek Jain, Ambuj Tewari, and Inderjit S. Dhillon. Orthogonal matching pursuit with replacement. In Annual Conference on Neural Information Processing Systems, 2011.

[18] Sohail Bahmani, Bhiksha Raj, and Petros T. Boufounos. Greedy sparsity-constrained optimization. The Journal of Machine Learning Research, 14(1):807–841, 2013.

[19] Xiaotong Yuan, Ping Li, and Tong Zhang. Gradient hard thresholding pursuit for sparsity-constrained optimization. In Proceedings of The 31st International Conference on Machine Learning, 2014.

[20] Yuchen Zhang, Martin J. Wainwright, and Michael I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv:1402.1918, 2014.

[21] P. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40(3):1637–1664, 2012.

[22] Alekh Agarwal, Sahand N. Negahban, and Martin J. Wainwright. Fast global convergence of gradient methods for high-dimensional statistical recovery. Annals of Statistics, 40(5):2452–2482, 2012.

[23] Mark Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, University of British Columbia, 2010.

[24] Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations. IEEE Trans. Inf. Theory, 57:4689–4708, 2011.

[25] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems.
Foundations of Computational Mathematics, 12(6):805–849, 2012.