{"title": "Implicit Regularization for Optimal Sparse Recovery", "book": "Advances in Neural Information Processing Systems", "page_first": 2972, "page_last": 2983, "abstract": "We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under the restricted isometry assumption. For a given parametrization yielding a non-convex optimization problem, we show that prescribed choices of initialization, step size and stopping time yield a statistically and computationally optimal algorithm that achieves the minimax rate with the same cost required to read the data up to poly-logarithmic factors. Beyond minimax optimality, we show that our algorithm adapts to instance difficulty and yields a dimension-independent rate when the signal-to-noise ratio is high enough. Key to the computational efficiency of our method is an increasing step size scheme that adapts to refined estimates of the true solution. We validate our findings with numerical experiments and compare our algorithm against explicit $\\ell_{1}$ penalization. Going from hard instances to easy ones, our algorithm is seen to undergo a phase transition, eventually matching least squares with an oracle knowledge of the true support.", "full_text": "Implicit Regularization for Optimal Sparse Recovery\n\nTomas Va\u0161kevi\u02c7cius1, Varun Kanade2, Patrick Rebeschini1\n1 Department of Statistics, 2 Department of Computer Science\n\n{tomas.vaskevicius, patrick.rebeschini}@stats.ox.ac.uk\n\nUniversity of Oxford\n\nvarunk@cs.ox.ac.uk\n\nAbstract\n\nWe investigate implicit regularization schemes for gradient descent methods applied\nto unpenalized least squares regression to solve the problem of reconstructing a\nsparse signal from an underdetermined system of linear measurements under the\nrestricted isometry assumption. For a given parametrization yielding a non-convex\noptimization problem, we show that prescribed choices of initialization, step size\nand stopping time yield a statistically and computationally optimal algorithm that\nachieves the minimax rate with the same cost required to read the data up to\npoly-logarithmic factors. Beyond minimax optimality, we show that our algorithm\nadapts to instance dif\ufb01culty and yields a dimension-independent rate when the\nsignal-to-noise ratio is high enough. Key to the computational ef\ufb01ciency of our\nmethod is an increasing step size scheme that adapts to re\ufb01ned estimates of the\ntrue solution. We validate our \ufb01ndings with numerical experiments and compare\nour algorithm against explicit \ufffd1 penalization. Going from hard instances to easy\nones, our algorithm is seen to undergo a phase transition, eventually matching least\nsquares with an oracle knowledge of the true support.\n\n1\n\nIntroduction\n\nMany problems in machine learning, science and engineering involve high-dimensional datasets where\nthe dimensionality of the data d is greater than the number of data points n. Linear regression with\nsparsity constraints is an archetypal problem in this setting. The goal is to estimate a d-dimensional\nvector w\ufffd \u2208 Rd with k non-zero components from n data points (xi, yi) \u2208 Rd \u00d7 R, i \u2208 {1, . . . , n},\nlinked by the linear relationship yi = \ufffdxi, w\ufffd\ufffd + \u03bei, where \u03bei is a possible perturbation to the ith\nobservation. In matrix-vector form the model reads y = Xw\ufffd + \u03be, where xi corresponds to the ith\nrow of the n \u00d7 d design matrix X. Over the past couple of decades, sparse linear regression has been\nextensively investigated from the point of view of both statistics and optimization.\nIn statistics, sparsity has been enforced by designing estimators with explicit regularization schemes\nbased on the \ufffd1 norm, such as the lasso [46] and the closely related basis pursuit [15] and Dantzig\nselector [13]. In the noiseless setting (\u03be = 0), exact recovery is possible if and only if the design\nmatrix satis\ufb01es the restricted nullspace property [16, 17, 19]. In the noisy setting (\u03be \ufffd= 0), exact\n2 in the case of i.i.d.\nsub-Gaussian noise with variance proxy \u03c32 when the design matrix satis\ufb01es restricted eigenvalue\nconditions [9, 48]. The lasso estimator, de\ufb01ned as any vector w that minimizes the objective\n2 + \u03bb\ufffdw\ufffd1, achieves the minimax-optimal rate upon proper tuning of the regularization\n\ufffdXw \u2212 y\ufffd2\nparameter \u03bb. The restricted isometry property (RIP) [14] has been largely considered in the literature,\nas it implies both the restricted nullspace and eigenvalue conditions [16, 49], and as it is satis\ufb01ed\nwhen the entries of X are i.i.d. sub-Gaussian and subexponential with sample size n = \u03a9(k log(d/k))\nand n = \u03a9(k log2(d/k)) respectively [32, 1], or when the columns are unitary, e.g. [23, 24, 39, 41].\n\nrecovery is not feasible and a natural criterion involves designing estimators \ufffdw that can recover\nthe minimax-optimal rate k\u03c32 log(d/k)/n for the squared \ufffd2 error \ufffd\ufffdw \u2212 w\ufffd\ufffd2\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIn optimization, computationally ef\ufb01cient iterative algorithms have been designed to solve convex\nproblems based on \ufffd1 constraints and penalties, such as composite/proximal methods [4, 35]. Under\nrestricted eigenvalue conditions, such as restricted strong convexity and restricted smoothness, various\niterative methods have been shown to yield exponential convergence to the problem solution globally\nup to the statistical precision of the model [2], or locally once the iterates are close enough to the\noptimum and the support of the solution is identi\ufb01ed [10, 28, 45]. In some regimes, for a prescribed\nchoice of the regularization parameter, these algorithms are computationally ef\ufb01cient. They require\n\n\ufffdO(1) iterations, where the notation \ufffdO hides poly-logarithmic terms, and each iteration costs O(nd).\nHence the total running cost is \ufffdO(nd), which is the cost to store/read the data in/from memory.\n\nThese results attest that there are regimes where optimal methods for sparse linear regression exist.\nHowever, these results reply upon tuning the hyperparameters for optimization, such as the step\nsize, carefully, which in turn depends on identifying the correct hyperparameters, such as \u03bb, for\nregularization. In practice, one has to resort to cross-validation techniques to tune the regularization\nparameter. Cross-validation adds an additional burden from a computational point of view, as the\noptimization algorithms need to be run for different choices of the regularization terms. In the context\nof linear regression with \ufffd2 penalty, a.k.a. ridge regression, potential computational savings have\nmotivated research on the design of implicit regularization schemes where model complexity is\ndirectly controlled by tuning the hyper-parameters of solvers applied to unpenalized/unconstrained\nprograms, such as choice of initialization, step-size, iteration/training time. There has been increasing\ninterest in understanding the effects of implicit regularization (sometimes referred to as implicit\nbias) of machine learning algorithms.\nIt is widely acknowledged that the choice of algorithm,\nparametrization, and parameter-tuning, all affect the learning performance of models derived from\ntraining data. While implicit regularization has been extensively investigated in connection to the \ufffd2\nnorm, there seem to be no results for sparse regression, which is surprising considering the importance\nof the problem.\n\n1.1 Our Contributions\n\nIn this work, we merge statistics with optimization, and propose the \ufb01rst statistically and computa-\ntionally optimal algorithm based on implicit regularization (initialization/step-size tuning and early\nstopping) for sparse linear regression under the RIP.\nThe algorithm that we propose is based on gradient descent applied to the unregularized, underdeter-\nmined objective function \ufffdXw\u2212y\ufffd2\n2 where w is parametrized as w = u\ufffdu\u2212v\ufffdv, with u, v \u2208 Rd\nand \ufffd denotes the coordinate-wise multiplication operator for vectors. This parametrization yields a\nnon-convex problem in u and v. We treat this optimization problem as a proxy to design a sequence\nof statistical estimators that correspond to the iterates of gradient descent applied to solve the sparse\nregression problem, and hence are cheap to compute iteratively. The matrix formulation of the\nsame type of parametrization that we adopt has been recently considered in the setting of low-rank\nmatrix recovery where it leads to exact recovery via implicit regularization in the noiseless setting\nunder the RIP [25, 30]. In our case, this choice of parametrization yields an iterative algorithm that\nperforms multiplicative updates on the coordinates of u and v, in contrast to the additive updates\nobtained when gradient descent is run directly on the parameter w, as in proximal methods. This\nfeature allows us to reduce the convergence analysis to one-dimensional iterates and to differentiate\nthe convergence on the support set S = {i \u2208 {1, . . . , d} : w\ufffd\n\ufffd= 0} from the convergence on its\ncomplement Sc = {1, . . . , d} \\ S.\nWe consider gradient descent initialized with u0 = v0 = \u03b11, where 1 is the all-one vector. We show\nthat with a suf\ufb01ciently small initialization size \u03b1 > 0 and early stopping, our method achieves exact\nreconstruction with precision controlled by \u03b1 in the noiseless setting, and minimax-optimal rates in\nthe noisy setting. To the best of our knowledge, our results are the \ufb01rst to establish non-\ufffd2 implicit\nregularization for a gradient descent method in a general noisy setting.1 These results rely on a\nconstant choice of step size \u03b7 that satis\ufb01es a bound related to the unknown parameter w\ufffd\nmax = \ufffdw\ufffd\ufffd\u221e.\nWe show how this choice of \u03b7 can be derived from the data itself, i.e. only based on known quantities.\nIf the noise vector \u03be is made up of i.i.d. sub-Gaussian components with variance proxy \u03c32, this choice\nmax\u221an)/(\u03c3\u221alog d) log \u03b1\u22121) iteration complexity to achieve minimax rates. In\nof \u03b7 yields O((w\ufffd\norder to achieve computational optimality, we design a preconditioned version of gradient descent (on\n\ni\n\n1However, see Remark 1 for work concurrent to ours that achieves similar goals.\n\n2\n\n\fthe parameters u and v) that uses increasing step-sizes and has running time \ufffdO(nd). The iteration-\ndependent preconditioner relates to the statistical nature of the problem. It is made by a sequence\nof diagonal matrices that implement a coordinate-wise increasing step-size scheme that allows\ndifferent coordinates to accelerate convergence by taking larger steps based on re\ufb01ned estimates of\nmax\u221an)/(\u03c3\u221alog d)) log \u03b1\u22121)\nthe corresponding coordinates of w\ufffd. This algorithm yields O(log((w\ufffd\niteration complexity to achieve minimax rates in the noisy setting. Since each iteration costs O(nd),\nthe total computation complexity is, up to poly-logarithmic factors, the same as simply storing/reading\nthe data. This algorithm is minimax-optimal and, up to logarithmic factors, computationally optimal.\nIn contrast, we are not aware of any work on implicit \ufffd2 regularization that exploits an increasing\nstep sizes scheme in order to attain computational optimality.\nTo support our theoretical results we present a simulation study of our methods and comparisons with\nthe lasso estimator and with the gold standard oracle least squares estimator, which performs least\nsquares regression on S assuming oracle knowledge of it. We show that the number of iterations t in\nour method plays a role similar to the lasso regularization parameter \u03bb. Despite both algorithms being\nminimax-optimal with the right choice of t and \u03bb respectively, the gradient descent optimization\npath\u2014which is cheaper to compute as each iteration of gradient descent yields a new model\u2014exhibits\nqualitative and quantitative differences from the lasso regularization path\u2014which is more expensive\nto compute as each model requires solving a new lasso optimization program. In particular, the\nsimulations emphasize how the multiplicative updates allow gradient descent to \ufb01t one coordinate\nof w\ufffd at a time, as opposed to the lasso estimator that tends to \ufb01t all coordinates at once. Beyond\nminimax results, we prove that our methods adapt to instance dif\ufb01culty: for \u201ceasy\u201d problems where\nthe signal is greater than the noise, i.e. w\ufffd\ni |, our estimators\nachieve the statistical rate k\u03c32 log(k)/n, which does not depend on d. The experiments con\ufb01rm this\nbehavior and further attest that our estimators undergo a phase transition that is not observed for\nthe lasso. Going from hard instances to easy ones, the learning capacity of implicitly-regularized\ngradient descent exhibits a qualitative transition and eventually matches the performance of oracle\nleast squares.\n\nmin \ufffd \ufffdXT\u03be\ufffd\u221e/n with w\ufffd\n\nmin = mini\u2208S |w\ufffd\n\n1.2 Related Work\n\nSparse Recovery. The statistical properties of explicit \ufffd1 penalization techniques are well studied\n[48, 13, 9, 31, 33]. Minimax rates for regression under sparsity constraints are derived in [37].\nComputing the whole lasso regularization path can be done via the lars algorithm [18]. Another\nwidely used approach is the glmnet which uses cyclic coordinate-descent with warm starts to compute\nregularization paths for generalized linear models with convex penalties on a pre-speci\ufb01ed grid of\nregularization parameters [22]. [4] reviews various optimization techniques used in solving empirical\nrisk minimization problems with sparsity inducing penalties. Using recent advances in mixed integer\noptimization, [8] shows that the best subset selection problem can be tackled for problems of moderate\nsize. For such problem sizes, comparisons between the lasso and best subset selection problem (\ufffd0\nregularization) were recently made, suggesting that the best subset selection performs better in high\nsignal-to-noise ratio regimes whereas the lasso performs better when the signal-to-noise ratio is\nlow [29]. In this sense, our empirical study in Section 5 suggests that implicitly-regularized gradient\ndescent is more similar to \ufffd0 regularization than \ufffd1 regularization. Several other techniques related\nto \ufffd1 regularization and extensions to the lasso exist. We refer the interested reader to the books\n[11, 47].\nImplicit Regularization/Bias. Connections between \ufffd2 regularization and gradient descent opti-\nmization paths have been known for a long time and are well studied [12, 20, 52, 7, 38, 51, 34, 44, 3].\nIn contrast, the literature on implicit regularization inducing sparsity is scarce. Coordinate-descent\noptimization paths have been shown to be related to \ufffd1 regularization paths in some regimes\n[21, 18, 40, 54]. Understanding such connections can potentially allow transferring the now well-\nunderstood theory developed for penalized forms of regularization to early-stopping-based regular-\nization which can result in lower computational complexity. Recently, [53] have shown that neural\nnetworks generalize well even without explicit regularization despite the capacity to \ufb01t unstructured\nnoise. This suggests that some implicit regularization effect is limiting the capacity of the obtained\nmodels along the optimization path and thus explaining generalization on structured data. Under-\nstanding such effects has recently drawn a lot of attention in the machine learning community. In\nparticular, it is now well understood that the optimization algorithm itself can be biased towards a\nparticular set of solutions for underdetermined problems with many global minima where, in contrast\n\n3\n\n\fto the work cited above, the bias of optimization algorithm is investigated at or near convergence,\nusually in a noiseless setting [43, 27, 26, 25, 30]. We compare our assumptions with the ones made\nin [30] in Appendix G.\nRemark 1 (Concurrent Work). After completing this work we became aware of independent con-\ncurrent work [56] which considers Hadamard product reparametrization wt = ut \ufffd vt in order to\nimplicitly induce sparsity for linear regression under the RIP assumption. Our work is signi\ufb01cantly\ndifferent in many aspects discussed in Appendix H. In particular, we obtain computational optimality\nand can properly handle the general noisy setting.\n\n2 Model and Algorithms\n\nWe consider the model de\ufb01ned in the introduction. We denote vectors with boldface letters and real\nnumbers with normal font; thus, w denotes a vector and wi denotes the ith coordinate of w. For\nany index set A we let 1A denote a vector that has a 1 entry in all coordinates i \u2208 A and a 0 entry\nelsewhere. We denote coordinate-wise inequalities by \ufffd. With a slight abuse of notation we write\nw2 to mean the vector obtained by squaring each component of w. Finally, we denote inequalities\nup to multiplicative absolute constants, meaning that they do not depend on any parameters of the\nproblem, by \ufffd. A table of notation can be found in Appendix J.\nWe now de\ufb01ne the restricted isometry property which is the key assumption in our main theorems.\nDe\ufb01nition 1 (Restricted Isometry Property (RIP)). A n \u00d7 d matrix X/\u221an satis\ufb01es the (\u03b4, k)-(RIP) if\nfor any k-sparse vector w \u2208 Rd we have (1 \u2212 \u03b4)\ufffdw\ufffd2\nThe RIP assumption was introduced in [14] and is standard in the compressed sensing literature. It\nrequires that all n \u00d7 k sub-matrices of X/\u221an are approximately orthonormal where \u03b4 controls extent\nto which this approximation holds. Checking if a given matrix satis\ufb01es the RIP is NP-hard [5]. In\ncompressed sensing applications the matrix X/\u221an corresponds to how we measure signals and it can\nbe chosen by the designer of a sparse-measurement device. Random matrices are known to satisfy\nthe RIP with high probability, with \u03b4 decreasing to 0 as n increases for a \ufb01xed k [6].\nWe consider the following problem setting. Let u, v \u2208 Rd and de\ufb01ne the mean squared loss as\n\n2 \u2264 \ufffdXw/\u221an\ufffd2\n\n2 \u2264 (1 + \u03b4)\ufffdw\ufffd2\n2 .\n\nL(u, v) =\n\n1\nn \ufffdX (u \ufffd u \u2212 v \ufffd v) \u2212 y\ufffd2\n2 .\n\nLetting w = u \ufffd u \u2212 v \ufffd v and performing gradient descent updates on w, we recover the original\nparametrization of mean squared error loss which does not implicitly induce sparsity. Instead, we\nperform gradient descent updates on (u, v) treating it as a vector in R2d and we show that the\ncorresponding optimization path contains sparse solutions.\nLet \u03b7 > 0 be the learning rate, (mt)t\u22650 be a sequence of vectors in Rd and diag(mt) be a d \u00d7 d\ndiagonal matrix with mt on its diagonal. We consider the following general form of gradient descent:\n\n(ut+1, vt+1) = (ut, vt) \u2212 \u03b7 diag(mt, mt)\n\n\u2202L(ut, vt)\n\u2202(ut, vt)\n\n.\n\n(1)\n\nWe analyze two different choices of sequences (mt)t\u22650 yielding two separate algorithms.\nAlgorithm 1. Let \u03b1, \u03b7 > 0 be two given parameters. Let u0 = v0 = \u03b1 and for all t \u2265 0 we let\nmt = 1. Perform the updates given in (1).\nAlgorithm 2. Let \u03b1, \u03c4 \u2208 N and w\ufffd\n20\u02c6z and\nmax be three given parameters. Set \u03b7 = 1\nu0 = v0 = \u03b1. Perform the updates in (1) with m0 = 1 and mt adaptively de\ufb01ned as follows:\n\nmax \u2264 \u02c6z \u2264 2w\ufffd\n\n1. Set mt = mt\u22121.\n\n2. If t = m\u03c4\ufffdlog \u03b1\u22121\ufffd for some natural number m \u2265 2 then let mt,j = 2mt\u22121,j for all j\n\nt,j \u2264 2\u2212m\u22121 \u02c6z.\n\nsuch that u2\n\nt,j \u2228 v2\n\nAlgorithm 1 corresponds to gradient descent with a constant step size, whereas Algorithm 2 doubles\n\nthe step-sizes for small enough coordinates after every \u03c4\ufffdlog \u03b1\u22121\ufffd iterations.\n\nBefore stating the main results we de\ufb01ne some key quantities. First, our results are sensitive to the\ncondition number \u03ba = \u03ba(w\ufffd) = w\ufffd\nmin of the true parameter vector w\ufffd. Since we are not able\n\nmax/w\ufffd\n\n4\n\n\fto recover coordinates below the maximum noise term \ufffdXT\u03be\ufffd\u221e/n, for a desired precision \u03b5 we can\ntreat all coordinates of w\ufffd below \u03b5 \u2228 (\ufffdXT\u03be\ufffd\u221e/n) as 0. This motivates the following de\ufb01nition of\nan effective condition number for given w\ufffd, X, \u03be and \u03b5:\n\n\u03baeff = \u03baeff(w\ufffd, X, \u03be, \u03b5) = w\ufffd\n\nmax/(w\ufffd\n\nmin \u2228 \u03b5 \u2228 (\ufffdXT\u03be\ufffd\u221e/n)).\n\nWe remark that \u03baeff(w\ufffd, X, \u03be, \u03b5) \u2264 \u03ba(w\ufffd). Second, we need to put restrictions on the RIP constant \u03b4\nand initialization size \u03b1. These restrictions are given by the following:\nmax)2 \u2227\ufffdw\ufffd\n\u03b4(k, w\ufffd, X, \u03be, \u03b5) = 1/(\u221ak(1 \u2228 log \u03baeff(w\ufffd))), \u03b1(w\ufffd, \u03b5, d) :=\n\n\u03b52 \u2227 \u03b5 \u2227 1\n(2d + 1)2 \u2228 (w\ufffd\n\nmin\n2\n\n.\n\n3 Main Results\n\n|wt,i \u2212 w\ufffd\n\ni | \ufffd\uf8f1\uf8f2\uf8f3\n\nn XT\u03be\ufffd\ufffd\u221e \u2228 \u03b5\n\ufffd\ufffd 1\nn\ufffdXT\u03be\ufffdi\ufffd\ufffd \u2228 \u03b4\u221ak\ufffd\ufffd 1\n\ufffd\ufffd 1\n\n\u221a\u03b1\n\nThe following result is the backbone of our contributions. It establishes rates for Algorithm 1 in the\n\ufffd\u221e norm as opposed to the typical rates for the lasso that are often only derived for the \ufffd2 norm.\nSuppose that X/\u221an satis\ufb01es the (k + 1, \u03b4)-RIP with \u03b4 \ufffd\nTheorem 1. Fix any \u03b5 > 0.\n\u03b4 (k, w\ufffd, X, \u03be, \u03b5) and let the initialization \u03b1 satisfy \u03b1 \u2264 \u03b1(w\ufffd, \u03b5, d). Then, Algorithm 1 with\n\u03b7 \u2264 1/(20w\ufffd\n\nmax) and t = O((\u03baeff(w\ufffd))/(\u03b7w\ufffd\n\nmax) log \u03b1\u22121) iterations satis\ufb01es\nif i \u2208 S and w\ufffd\nif i \u2208 S and w\ufffd\nif i /\u2208 S.\n\nn XT\u03be \ufffd 1S\ufffd\ufffd\u221e \u2228 \u03b5\n\nmin \ufffd\ufffd\ufffd 1\nmin \ufffd\ufffd\ufffd 1\n\nn XT\u03be\ufffd\ufffd\u221e \u2228 \u03b5,\nn XT\u03be\ufffd\ufffd\u221e \u2228 \u03b5,\n\nmax, w\ufffd\n\nmax \ufffd \u03b7 \u2264 1/(20w\ufffd\n\nThis result shows how the parameters \u03b1, \u03b7 and t affect the learning performance of gradient descent.\nThe size of \u03b1 controls the size of the coordinates outside the true support S at the stopping time. We\ndiscuss the role and also the necessity of small initialization size to achieve the desired statistical\nperformance in Section 5. A different role is played by the step size \u03b7 whose size affects the optimal\nstopping time t. In particular, (\u03b7t)/ log \u03b1\u22121 can be seen as a regularization parameter closely related\nto \u03bb\u22121 for the lasso. To see this, suppose that the noise \u03be is \u03c32-sub-Gaussian with independent\ncomponents. Then with high probability \ufffdXT\u03be\ufffd\u221e/n \ufffd (\u03c3\u221alog d)/\u221an). In such a setting an optimal\nchoice of \u03bb for the lasso is \u0398((\u03c3\u221alog d)/\u221an). On the other hand, letting t\ufffd be the optimal stopping\nmin(X, \u03be, \u03b5)) = O(\u221an/(\u03c3\u221alog d)).\ntime given in Theorem 1, we have (\u03b7t\ufffd)/ log \u03b1\u22121 = O(1/w\ufffd\nThe condition \u03b7 \u2264 1/(20w\ufffd\nIf we can set 1/w\ufffd\nto O(\u03baeff(w\ufffd) log \u03b1\u22121). The magnitude of w\ufffd\nsetting the proper initialization size \u03b1 depends on w\ufffd\n\nmax) is also necessary up to constant factors in order to prevent explosion.\nmax) then the iteration complexity of Theorem 1 reduces\nmax is, however, an unknown quantity. Similarly,\nmin, d and the desired precision \u03b5. The\nrequirement that \u03b1 \u2264 \ufffdw\ufffd\nmin/2 is an artifact of our proof technique and tighter analysis could\nreplace this condition by simply \u03b1 \u2264 \u03b5. Hence the only unknown quantity for selecting a proper\ninitialization size is w\ufffd\nThe next theorem shows how w\ufffd\nmax can be estimated from the data up to a multiplicative factor 2\nat the cost of one gradient descent iteration. Once this estimate is computed, we can properly set\nwhich satis\ufb01es our theory and is tight up to\nthe initialization size and the learning rate \u03b7 \ufffd 1\nconstant multiplicative factors. We remark that \u02dc\u03b7 used in Theorem 2 can be set arbitrarily small (e.g.,\nset \u02dc\u03b7 = 10\u221210) and is only used for one gradient descent step in order to estimate w\ufffd\nmax). Set \u03b1 = 1 and suppose that X/\u221an satis\ufb01es the (k + 1, \u03b4)-RIP with\nTheorem 2 (Estimating w\ufffd\n\u03b4 \u2264 1/(20\u221ak). Let the step size \u02dc\u03b7 be any number satisfying 0 < \u02dc\u03b7 \u2264 1/(5w\ufffd\nmax) and suppose that\nmax \u2265 5\ufffdXT\u03be\ufffd\u221e/n. Perform one step of gradient descent and for each i \u2208 {1, . . . , d} compute\nw\ufffd\nthe update factors de\ufb01ned as f +\n. Then\nw\ufffd\nmax \u2264 (fmax \u2212 1)/(3\u02dc\u03b7) < 2w\ufffd\nWe present three main corollaries of Theorem 1. The \ufb01rst one shows that in the noiseless setting exact\nrecovery is possible and is controlled by the desired precision \u03b5 and hence by the initialization size \u03b1.\nCorollary 1 (Noiseless Recovery). Let \u03be = 0. Under the assumptions of Theorem 1, the choice of \u03b7\ngiven by Theorem 2 and t = O(\u03baeff(w\ufffd) log \u03b1\u22121), Algorithm 1 yields \ufffdwt \u2212 w\ufffd\ufffd2\n\ni = (u1)i and f\u2212i = (v1)i. Let fmax = \ufffdf +\ufffd\u221e \u2228 \ufffdf\u2212\ufffd\u221e\nmax.\n\n2 \ufffd k\u03b52.\n\nmax.\n\nmax.\n\nw\ufffd\n\nmax\n\n5\n\n\f2 \ufffd (k\u03c32 log d)/n with probability at least 1 \u2212 1/(8d3).\n\nIn the general noisy setting exact reconstruction of w\ufffd is not possible. In fact, the bounds in Theorem 1\ndo not improve with \u03b5 chosen below the maximum noise term \ufffdXT\u03be\ufffd\u221e/n. In the following corollary\nwe show that with a small enough \u03b5 if the design matrix X is \ufb01xed and the noise vector \u03be is sub-\nGaussian, we recover minimax-optimal rates for \ufffd2 error. Our error bound is minimax-optimal in the\nsetting of sub-linear sparsity, meaning that there exists a constant \u03b3 > 1 such that k\u03b3 \u2264 d.\nCorollary 2 (Minimax Rates in the Noisy Setting). Let the noise vector \u03be be made of independent\n\u03c32-sub-Gaussian entries. Let \u03b5 = 4\ufffd\u03c32 log(2d)/\u221an. Under the assumptions of Theorem 1, the\nmax\u221an)/(\u03c3\u221alog d) log \u03b1\u22121),\nchoice of \u03b7 given by Theorem 2 and t = O(\u03baeff(w\ufffd) log \u03b1\u22121) = O((w\ufffd\nAlgorithm 1 yields \ufffdwt \u2212 w\ufffd\ufffd2\nThe next corollary states that gradient descent automatically adapts to the dif\ufb01culty of the problem.\nmin \ufffd\nThe statement of Theorem 1 suggests that our bounds undergo a phase-transition when w\ufffd\n\ufffdXT\u03be\ufffd\u221e/n which is also supported by our empirical \ufb01ndings in Section 5. In the \u03c32-sub-Gaussian\nnoise setting the transition occurs as soon as n \ufffd (\u03c32 log d)/(w\ufffd\nmin)2. As a result, the statistical\nbounds achieved by our algorithm are independent of d in such a setting. To see that, note that\nwhile the term \ufffdXT\u03be\ufffd\u221e/n grows as O(log d), the term \ufffdXT\u03be \ufffd 1S\ufffd\u221e/n grows only as O(log k).\nIn contrast, performance of the lasso deteriorates with d regardless of the dif\ufb01culty of the problem.\nWe illustrate this graphically and give a theoretical explanation in Section 5. We remark that the\nfollowing result does not contradict minimax optimality because we now treat the true parameter w\ufffd\nas \ufb01xed.\nCorollary 3 (Instance Adaptivity). Let the noise vector \u03be be made of independent \u03c32-sub-Gaussian\nentries. Let \u03b5 = 4\ufffd\u03c32 log(2k)/\u221an. Under the assumptions of Theorem 1, the choice of \u03b7 given by\nmax\u221an)/(\u03c3\u221alog k) log \u03b1\u22121), Algorithm 1 yields\nTheorem 2 and t = O(\u03baeff(w\ufffd) log \u03b1\u22121) = O((w\ufffd\n\ufffdwt \u2212 w\ufffd\ufffd2\nThe \ufb01nal theorem we present shows that the same statistical bounds achieved by Algorithm 1 are\nalso attained by Algorithm 2. This algorithm is not only optimal in a statistical sense, but it is also\noptimal computationally up to poly-logarithmic factors.\nTheorem 3. Compute \u02c6z using Theorem 2. Under the setting of Theorem 1 there exists a large enough\nabsolute constant \u03c4 so that Algorithm 2 parameterized with \u03b1, \u03c4 and \u02c6z satis\ufb01es the result of Theorem 1\nand t = O(log \u03baeff log \u03b1\u22121) iterations.\n\n2 \ufffd (k\u03c32 log k)/n. with probability at least 1 \u2212 1/(8k3).\n\nCorollaries 1, 2 and 3 also hold for Algorithm 2 with stopping time equal to O(log \u03baeff log \u03b1\u22121). We\nemphasize that both Theorem 1 and 3 use gradient-based updates to obtain a sequence of models\nwith optimal statistical properties instead of optimizing the objective function L. In fact, if we let\nt \u2192 \u221e for Algorithm 2 the iterates would explode.\n4 Proof Sketch\n\nIn this section we prove a simpli\ufb01ed version of Theorem 1 under the assumption XTX/n = I. We\nfurther highlight the intricacies involved in the general setting and present the intuition behind the\nkey ideas there. The gradient descent updates on ut and vt as given in (1) can be written as\n\ni > 0, vt,i if w\ufffd\n\n6\n\nut+1 = ut \ufffd\ufffd1 \u2212 (4\u03b7/n)XT(Xwt \u2212 y)\ufffd , vt+1 = vt \ufffd\ufffd1 + (4\u03b7/n)XT(Xwt \u2212 y)\ufffd .\n(2)\nThe updates can be succinctly represented as ut+1 = ut \ufffd (1 \u2212 r) and vt+1 = vt \ufffd (1 + r), where\nby our choice of \u03b7, \ufffdr\ufffd\u221e \u2264 1. Thus, (1 \u2212 r)\ufffd (1 + r) \ufffd 1 and we have ut \ufffd vt \ufffd u0 \ufffd v0 = \u03b121.\nHence for any i, only one of |ut,i| and |vt,i| can be larger then the initialization size while the other is\neffectively equal to 0. Intuitively, ut,i is used if w\ufffd\ni < 0 and hence one of these terms\ncan be merged into an error term bt,i as de\ufb01ned below. The details appear in Appendix B.4. To avoid\ngetting lost in cumbersome notation, in this section we will assume that w\ufffd \ufffd 0 and w = u \ufffd u.\nTheorem 4. Assume that w\ufffd \ufffd 0, 1\nn XTX = I, and that there is no noise (\u03be = 0). Parameterize\nw = u \ufffd u with u0 = \u03b11 for some 0 < \u03b1 < \ufffdw\ufffd\nmax) and t =\nProof. As XTX/n = I, y = Xw\ufffd, and vt = 0, the updates given in equation (2) reduce component-\ni = 0, wt,i\nwise to updates on wt given by wt+1,i = wt,i \u00b7 (1 \u2212 4\u03b7(wt,i \u2212 w\ufffd\n\nmin)), Algorithm 1 yields \ufffdwt \u2212 w\ufffd\ufffd\u221e \u2264 \u03b12.\n\nmin. Letting \u03b7 \u2264 1/(10w\ufffd\n\ni ))2. For i such that w\ufffd\n\nO(log(w\ufffd\n\nmax/\u03b12)/(\u03b7w\ufffd\n\n\fis non-increasing and hence stays below \u03b12. For i such that w\ufffd\ni > \u03b12, the update rule given\nabove ensures that as long as wt,i < w\ufffd\ni /2, wt,i increases at an exponential rate with base at least\ni )2. As w0,i = \u03b12, in O(log(w\ufffd\ni /2. Subsequently,\n(1+2\u03b7w\ufffd\ni /\u03b12)/(\u03b7w\ufffd\nthe gap (w\ufffd\nmin)) steps we\nhave \ufffdwt \u2212 w\ufffd\ufffd\u221e \u2264 \u03b12. The exact details are an exercise in calculus, albeit a rather tedious one,\nand appear in Appendix B.1.\n\ni \u2212 wt,i) halves every O(1/\u03b7w\ufffd\n\ni )) steps, it holds that wt,i \u2265 w\ufffd\n\ni ) steps; thus, in O(log(w\ufffd\n\nmax/\u03b12)/(\u03b7w\ufffd\n\nThe proof of Theorem 4 contains the key ideas of the proof of Theorem 1. However, the presence\nof noise (\u03be \ufffd= 0) and only having restricted isometry of XTX rather than isometry requires a subtle\nand involved analysis. We remark that we can prove tighter bounds in Theorem 4 than the ones in\nTheorem 1 because we are working in a simpli\ufb01ed setting.\nError Decompositions. We decompose wt into st := wt \ufffd 1S and et := wt \ufffd 1Sc so that\nwt = st + et. We de\ufb01ne the following error sequences:\n\nbt = XTXet/n + XT\u03be/n,\n\nwhich allows us to write updates on st and et as\n\npt =\ufffdXTX/n \u2212 I\ufffd (st \u2212 w\ufffd),\n\nst+1 = st \ufffd (1 \u2212 4\u03b7(st \u2212 w\u2217 + pt + bt))2,\n\net+1 = et \ufffd (1 \u2212 4\u03b7(pt + bt))2.\n\nexceeds \u221a\u03b1, the term\nError Sequence bt. Since our theorems require stopping before \ufffdet\ufffd\u221e\nXTXet/n can be controlled entirely by the initialization size. Hence bt \u2248 XT\u03be/n and it represents\nan irreducible error arising due to the noise on the labels. For any i \u2208 S at stopping time t we cannot\nexpect the error on the ith coordinate |wt,i\u2212w\ufffd\ni | to be smaller than |(XT\u03be)/n)i|. If we assume pt = 0\nand \u03be \ufffd= 0 then in light of our simpli\ufb01ed Theorem 4 we see that the terms in et grow exponentially\ni | \ufffd \ufffdXT\u03be\ufffd\u221e/n\nwith base at most (1 + 4\u03b7\ufffdXT\u03be\ufffd\u221e/n). We can \ufb01t all the terms in st such that |w\ufffd\nmin \ufffd \ufffdXT\u03be\ufffd\u221e/n then all the elements in st\nwhich leads to minimax-optimal rates. Moreover, if w\ufffd\ngrow exponentially at a faster rate than all of the error terms. This corresponds to the easy setting\nwhere the resulting error depends only on \ufffdXT\u03be/n \ufffd 1S\ufffd\u221e yielding dimension-independent error\nbounds. For more details see Appendix B.2.\nError Sequence pt. Since st \u2212 w\ufffd is a k-sparse vector using the RIP we can upper-bound \ufffdpt\ufffd\u221e \u2264\n\u221ak\u03b4 \ufffdst \u2212 w\ufffd\ufffd\u221e\n. Note that for small t we have \ufffds0\ufffd\u221e \u2248 \u03b12 \u2248 0 and hence, ignoring the\nlogarithmic factor in the de\ufb01nition of \u03b4 in the worst case we have Cw\ufffd\nmax \ufffd \ufffdXT\u03be\ufffd\u221e/n then the error terms can grow\nfor some absolute constant 0 < C < 1. If w\ufffd\nexponentially with base (1 + 4\u03b7 \u00b7 Cw\ufffd\nmax) whereas the signal terms such that |w\ufffd\nmax can\nmax). On the other hand, in the light of Theorem 4 the signal\nshrink exponentially at rate (1\u2212 4\u03b7 \u00b7 Cw\ufffd\nelements converge exponentially fast to the true parameters w\ufffd\nmax and hence the error sequence pt\nshould be exponentially decreasing. For small enough C and a careful choice of initialization size \u03b1\nwe can ensure that elements of pt decrease before the error components in et get too large or the\nsignal components in st get too small. For more details see Appendix A.2 and B.3.\nmax \ufffd\nTuning Learning Rates. The proof of Theorem 2 is given in Appendix D. If we choose 1/w\ufffd\nmax/\u03b12)) iterations. The\n\u03b7 \u2264 1/(10w\ufffd\nreason the factor \u03ba appears is the need to ensure that the convergence of the component w\ufffd\nmax\nis stable. However, this conservative setting of the learning rate unnecessarily slows down the\nmax. In Theorem 4, oracle knowledge of w\ufffd would allow\nconvergence for components with w\ufffd\ni ) yielding the total\nto set an individual step size for each coordinate i \u2208 S equal to \u03b7i = 1/(10w\ufffd\nnumber of iterations equal to O(log(w\ufffd\nmax/\u03b12)). In the setting where XTX/n \ufffd= I this would not\nbe possible even with the knowledge of w\ufffd, since the error sequence pt can be initially too large\nmax. Instead, we need to wait for\nwhich would result in explosion of the coordinates i with |w\ufffd\npt to get small enough before we increase the step size for some of the coordinates as described in\nAlgorithm 2. The analysis is considerably involved and the full proof can be found in Appendix E.\nWe illustrate effects of increasing step sizes in Section 5.\n\nmax) in Theorem 4 then all coordinates converge in O(\u03ba log(w\ufffd\n\nmax \u2264 \ufffdpt\ufffd\u221e\n\ni | \ufffd w\ufffd\n\ni | \ufffd w\ufffd\n\ni \ufffd w\ufffd\n\n< w\ufffd\n\nmax\n\ni = w\ufffd\n\n5 Simulations\n\nUnless otherwise speci\ufb01ed, the default simulation set up is as follows. We let w\ufffd = \u03b31S for some\nconstant \u03b3. For each run the entries of X are sampled as i.i.d. Rademacher random variables and\nthe noise vector \u03be follows i.i.d. N (0, \u03c32) distribution. For \ufffd2 plots each simulation is repeated a\n\n7\n\n\ftotal of 30 times and the median \ufffd2 error is depicted. The error bars in all the plots denote the 25th\nand 75th percentiles. Unless otherwise speci\ufb01ed, the default values for simulation parameters are\nn = 500, d = 104, k = 25, \u03b1 = 10\u221212, \u03b3 = 1, \u03c3 = 1 and for Algorithm 2 we set \u03c4 = 10.\nEffects of Initialization Size. As discussed in Section 4 each coordinate grows exponentially at\na different rate. In Figure 1 we illustrate the necessity of small initialization for bringing out the\nexponential nature of coordinate paths allowing to effectively \ufb01t them one at a time. For more\nintuition, suppose that coordinates outside the true support grow at most as fast as (1 + \u03b5)t while the\ncoordinates on the true support grow at least as fast as (1 + 2\u03b5)t. Since exponential function is very\nsensitive to its base, for large enough t we have (1 + \u03b5)t \ufffd (1 + 2\u03b5)t. The role of the initialization\nsize \u03b1 is then \ufb01nding a small enough \u03b1 such that for large enough t we have \u03b12(1 + \u03b5)t \u2248 0 while\n\u03b12(1 + 2\u03b5)t is large enough to ensure convergence of the coordinates on the true support.\n\nFigure 1: Effects of initialization size. We set k = 5, n = 100, \u03b7 = 0.05, \u03c3 = 0.5 and run\nAlgorithm 1. We remark that the X axes in the two \ufb01gures on the right differ due to different choices\nof \u03b1.\n\nExponential Convergence with Increasing Step Sizes. We illustrate the effects of Algorithm 2 on\nan ill-conditioned target with \u03ba = 64. Algorithm 1 spends approximately twice the time to \ufb01t each\ncoordinate that the previous one, which is expected, since the coordinate sizes decrease by half. On\nthe other hand, as soon as we increase the corresponding step size, Algorithm 2 \ufb01ts each coordinate at\napproximately the same number of iterations, resulting in O(log \u03ba log \u03b1\u22121) total iterations. Figure 2\ncon\ufb01rms this behavior in simulations.\n\nFigure 2: Comparison of Algorithms 1 and 2. Let n = 250, k = 7 and the non-zero coordinates of\nw\ufffd be {2i : i = 0, . . . , 6}. For both algorithms let \u03b7 = 1/(20 \u00b7 26). The vertical lines in the \ufb01gure\non the right are equally spaced at \u03c4 log \u03b1\u22121 iterations. The scale of x axes differ by a factor of 16.\nThe shaded region corresponds to 25th and 75th percentiles over 30 runs.\n\nPhase Transitions. As suggested by our main results, we present empirical evidence that when\nmin \ufffd \ufffdXT\u03be\ufffd\u221e/n our algorithms undergo a phase transition with dimension-independent error\nw\ufffd\nbounds. We plot results for three different estimators. First we run Algorithm 2 for 2000 iterations\nand save every 10th model. Among the 200 obtained models we choose the one with the smallest\nerror on a validation dataset of size n/4. We run the lasso for 200 choices of \u03bb equally spaced on a\nlogarithmic scale and for each run we select a model with the smallest \ufffd2 parameter estimation error\nusing an oracle knowledge of w\ufffd. Finally, we perform a least squares \ufb01t using an oracle knowledge\nof the true support S. Figure 3 illustrates, that with varying \u03b3, \u03c3 and n we can satisfy the condition\nmin \ufffd \ufffdXT\u03be\ufffd\u221e/n at which point our method approaches an oracle-like performance. Given\nw\ufffd\nexponential nature of the coordinate-wise convergence, all coordinates of the true support grow at a\nstrictly larger exponential rate than all of the coordinates on Sc as soon as w\ufffd\nmin \u2212 \ufffdXT\u03be\ufffd\u221e/n >\n\ufffdXT\u03be\ufffd\u221e/n. An approximate solution of this equation is shown in Figure 3 using vertical red lines.\n\n8\n\n050100150200Numberofiterationst\u22124\u2212202log2||wt\u2212w\ufffd||22E\ufb00ectofinitiliaztionsize\u03b1\u03b1=10\u22122\u03b1=10\u22123\u03b1=10\u22124020406080100Numberofiterationst0.00.51.01.5wt,iCoordinatespathswith\u03b1=10\u22123i\u2208Si\u2208Sc0100200300Numberofiterationst0.00.51.01.5wt,iCoordinatespathswith\u03b1=10\u221212i\u2208Si\u2208Sc010000200003000040000Numberofiterationst\u221250510log2||wt\u2212w\ufffd||22Constantstepsize05001000150020002500Numberofiterationst\u221250510log2||wt\u2212w\ufffd||22Increasingstepsizes\fFigure 3: Phase transitions. The \ufb01gure on the right uses \u03b3 = 1/4. The red vertical lines show\nsolutions of the equation w\ufffd\n\nmin = \u03b3 = 2 \u00b7 E\ufffd\ufffdXT\u03be\ufffd\u221e/n\ufffd \u2264 2 \u00b7 \u03c3\ufffd2 log(2d)/\u221an.\n\nmin \ufffd \ufffdXT\u03be\ufffd\u221e/n\nDimension Free Bounds in the Easy Setting. Figure 4 shows that when w\ufffd\nour algorithm matches the performance of oracle least squares which is independent of d.\nIn\ncontrast, the performance of the lasso deteriorates as d increases. To see why this is the case, in\nthe setting where XTX/n = I, the lasso solution with parameter \u03bb has a closed form solution\n| \u2212 \u03bb)+, where wLS is the least squares solution. In the sub-Gaussian noise\ni = sign(wLS\nw\u03bb\nsetting, the minimax rates are achieved by the choice \u03bb = \u0398(\ufffd\u03c32 log(d)/n) introducing a bias\nwhich depends on log d. Such a bias is illustrated in Figure 4 and is not present at the optimal stopping\ntime of our algorithm.\n\ni )(|wLS\n\ni\n\nFigure 4: Dimension dependent bias for the lasso. In contrast, in a high signal-to-noise ratio setting\ngradient descent is able to recover coordinates on S without a visible bias.\n\n6 Summary and Further Improvements\n\nIn this paper, we have provided the \ufb01rst statistical and computational guarantees for two algorithms\nbased on implicit regularization applied to a sparse recovery problem under the RIP assumption in the\ngeneral noisy setting. We show that Algorithms 1 and 2 yield optimal statistical rates and, in contrast\nto \ufffd1-penalization, adapt to the problem dif\ufb01culty. In particular, given enough data both algorithms\nyield dimension-independent rates. While our algorithms are parametrized by step-size, initialization-\nsize and the number of iterations, we show that a suitable choice of step-size and initialization-size\ncan be computed from the data at the cost of one gradient-descent iteration. Consequently, the\nonly tuneable hyper-parameter in practice is the number of iterations, which can be done using\ncross-validation as we have done in our experiments. With the provided choice of hyper-parameters,\n\nyielding a computationally sub-optimal algorithm. To circumvent this issue we propose a novel\n\nwe show that Algorithm 1 attains statistical optimality after \ufffdO(\u221an) gradient-descent iterations,\nincreasing step-sizes scheme (Algorithm 2) under which only \ufffdO(1) iterations are required, resulting\n\nin a computationally optimal algorithm up to poly-logarithmic factors.\nOur results can be improved in two different aspects. First, our constraints on the RIP parameter \u03b4\nresult in sub-optimal sample complexity with respect to the sparsity parameter k. Second, the RIP\ncondition could potentially be replaced by the restricted eigenvalue (RE) condition which allows\ncorrelated designs. We expand on both of these points in Appendix I and provide empirical evidence\nsuggesting that both inef\ufb01ciencies are artifacts of our analysis and not inherent limitations of our\nalgorithms.\n\n9\n\n0.00.20.40.60.81.0\u03b30.000.250.500.75||\ufffdw\u2212w\ufffd||22Comparing\ufffd2errorsgradientdescentlassoleastsquaresoracle0.02.55.07.510.012.5\u03c301020||\ufffdw\u2212w\ufffd||22Comparing\ufffd2errorsgradientdescentlassoleastsquaresoracle89101112log2n\u22127.5\u22125.0\u22122.50.0log2||\ufffdw\u2212w\ufffd||22Comparing\ufffd2errorsgradientdescentlassoleastsquaresoracle0102030||wt||10.00.51.01.5wt,iGradientdescenti\u2208Si\u2208Sc0102030||w\u03bb||10.00.51.01.5wt,iLassoi\u2208Si\u2208Sc789logd0.00.20.40.60.8||\ufffdw\u2212w\ufffd||22Comparing\ufffd2errorsgradientdescentlassoleastsquaresoracle\fAcknowledgments\n\nTomas Va\u0161kevi\u02c7cius is supported by the EPSRC and MRC through the OxWaSP CDT programme\n(EP/L016710/1). Varun Kanade and Patrick Rebeschini are supported in part by the Alan Turing\nInstitute under the EPSRC grant EP/N510129/1.\n\nReferences\n[1] Radoslaw Adamczak, Alexander E Litvak, Alain Pajor, and Nicole Tomczak-Jaegermann.\nRestricted isometry property of matrices with independent columns and neighborly polytopes\nby random sampling. Constructive Approximation, 34(1):61\u201388, 2011.\n\n[2] Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Fast global convergence rates of\ngradient methods for high-dimensional statistical recovery. In Advances in Neural Information\nProcessing Systems, pages 37\u201345, 2010.\n\n[3] Alnur Ali, J Zico Kolter, and Ryan J Tibshirani. A continuous-time view of early stopping for\n\nleast squares regression. arXiv preprint arXiv:1810.10082, 2018.\n\n[4] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with\nsparsity-inducing penalties. Foundations and Trends R\ufffd in Machine Learning, 4(1):1\u2013106, 2012.\n[5] Afonso S Bandeira, Edgar Dobriban, Dustin G Mixon, and William F Sawin. Certifying the\nrestricted isometry property is hard. IEEE transactions on information theory, 59(6):3448\u20133450,\n2013.\n\n[6] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A simple proof of the\nrestricted isometry property for random matrices. Constructive Approximation, 28(3):253\u2013263,\n2008.\n\n[7] Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning\n\ntheory. Journal of complexity, 23(1):52\u201372, 2007.\n\n[8] Dimitris Bertsimas, Angela King, and Rahul Mazumder. Best subset selection via a modern\n\noptimization lens. The Annals of Statistics, 44(2):813\u2013852, 2016.\n\n[9] Peter J Bickel, Ya\u2019acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of Lasso and\n\nDantzig selector. The Annals of Statistics, 37(4):1705\u20131732, 2009.\n\n[10] Kristian Bredies and Dirk A. Lorenz. Linear convergence of iterative soft-thresholding. Journal\nISSN 1531-5851. doi:\n\nof Fourier Analysis and Applications, 14(5):813\u2013837, Dec 2008.\n10.1007/s00041-008-9041-1. URL https://doi.org/10.1007/s00041-008-9041-1.\n\n[11] Peter B\u00fchlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory\n\nand applications. Springer Science & Business Media, 2011.\n\n[12] Peter B\u00fchlmann and Bin Yu. Boosting with the \ufffd2 loss: Regression and classi\ufb01cation. Journal\n\nof the American Statistical Association, 98(462):324\u2013339, 2003.\n\n[13] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is\n\nmuch larger than n. The Annals of Statistics, 35(6):2313\u20132351, 2007.\n\n[14] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on\n\nInformation Theory, 51(12):4203\u20134215, 2005.\n\n[15] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by\n\nbasis pursuit. SIAM Journal on Scienti\ufb01c Computing, 20(1):33, 1998.\n\n[16] Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Compressed sensing and best k-term\n\napproximation. Journal of the American mathematical society, 22(1):211\u2013231, 2009.\n\n[17] David L Donoho and Xiaoming Huo. Uncertainty principles and ideal atomic decomposition.\n\nIEEE transactions on information theory, 47(7):2845\u20132862, 2001.\n\n10\n\n\f[18] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression.\n\nThe Annals of Statistics, 32(2):407\u2013499, 2004.\n\n[19] Arie Feuer and Arkadi Nemirovski. On sparse representation in pairs of bases. IEEE Transac-\n\ntions on Information Theory, 49(6):1579\u20131581, 2003.\n\n[20] Jerome Friedman and Bogdan E Popescu. Gradient directed regularization. Technical report,\n\n2004.\n\n[21] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning,\n\nvolume 1. Springer series in statistics New York, 2001.\n\n[22] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized\n\nlinear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.\n\n[23] Olivier Gu\u00e9don, Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Subspaces\nand orthogonal decompositions generated by bounded orthogonal systems. Positivity, 11(2):\n269\u2013283, 2007.\n\n[24] Olivier Gu\u00e9don, Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Majorizing\nmeasures and proportional subsets of bounded orthonormal systems. Revista matem\u00e1tica\niberoamericana, 24(3):1075\u20131095, 2008.\n\n[25] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati\nSrebro. Implicit regularization in matrix factorization. In Advances in Neural Information\nProcessing Systems, pages 6151\u20136159, 2017.\n\n[26] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias\nin terms of optimization geometry. In International Conference on Machine Learning, pages\n1827\u20131836, 2018.\n\n[27] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient\nIn Advances in Neural Information Processing\n\ndescent on linear convolutional networks.\nSystems, pages 9461\u20139471, 2018.\n\n[28] Elaine T Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for \ufffd_1-minimization:\n\nMethodology and convergence. SIAM Journal on Optimization, 19(3):1107\u20131130, 2008.\n\n[29] Trevor Hastie, Robert Tibshirani, and Ryan J Tibshirani. Extended comparisons of best subset\nselection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692, 2017.\n\n[30] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized\nmatrix sensing and neural networks with quadratic activations. In Conference On Learning\nTheory, pages 2\u201347, 2018.\n\n[31] Nicolai Meinshausen and Bin Yu. Lasso-type recovery of sparse representations for high-\n\ndimensional data. The Annals of Statistics, 37(1):246\u2013270, 2009.\n\n[32] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Uniform uncertainty prin-\nciple for bernoulli and subgaussian ensembles. Constructive Approximation, 28(3):277\u2013289,\n2008.\n\n[33] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A uni\ufb01ed\nframework for high-dimensional analysis of M-estimators with decomposable regularizers.\nStatistical Science, 27(4):538\u2013557, 2012.\n\n[34] Gergely Neu and Lorenzo Rosasco. Iterate averaging as regularization for stochastic gradient\n\ndescent. Proceedings of Machine Learning Research vol, 75:1\u201321, 2018.\n\n[35] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends R\ufffd in Optimization,\n\n1(3):127\u2013239, 2014.\n\n[36] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for\ncorrelated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241\u20132259,\n2010.\n\n11\n\n\f[37] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax rates of estimation for high-\ndimensional linear regression over \ufffdq-balls. IEEE transactions on information theory, 57(10):\n6976\u20136994, 2011.\n\n[38] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric\nregression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research,\n15(1):335\u2013366, 2014.\n\n[39] Justin Romberg. Compressive sensing by random convolution. SIAM Journal on Imaging\n\nSciences, 2(4):1098\u20131128, 2009.\n\n[40] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a regularized path to a maximum margin\n\nclassi\ufb01er. Journal of Machine Learning Research, 5(Aug):941\u2013973, 2004.\n\n[41] Mark Rudelson and Roman Vershynin. On sparse reconstruction from fourier and gaussian\nmeasurements. Communications on Pure and Applied Mathematics: A Journal Issued by the\nCourant Institute of Mathematical Sciences, 61(8):1025\u20131045, 2008.\n\n[42] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. In\n\nConference on Learning Theory, pages 10\u20131, 2012.\n\n[43] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The\nimplicit bias of gradient descent on separable data. The Journal of Machine Learning Research,\n19(1):2822\u20132878, 2018.\n\n[44] Arun Suggala, Adarsh Prasad, and Pradeep K Ravikumar. Connecting optimization and\nregularization paths. In Advances in Neural Information Processing Systems, pages 10608\u2013\n10619, 2018.\n\n[45] Shaozhe Tao, Daniel Boley, and Shuzhong Zhang. Local linear convergence of ISTA and FISTA\n\non the LASSO problem. SIAM Journal on Optimization, 26(1):313\u2013336, 2016.\n\n[46] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal\n\nStatistical Society: Series B (Methodological), 58(1):267\u2013288, 1996.\n\n[47] Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical learning with sparsity: the\n\nlasso and generalizations. Chapman and Hall/CRC, 2015.\n\n[48] Sara van de Geer. The deterministic Lasso. Research Report, 140, 2007.\n\n[49] Sara A Van De Geer and Peter B\u00fchlmann. On the conditions used to prove oracle results for the\n\nlasso. Electronic Journal of Statistics, 3:1360\u20131392, 2009.\n\n[50] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48.\n\nCambridge University Press, 2019.\n\n[51] Yuting Wei, Fanny Yang, and Martin J Wainwright. Early stopping for kernel boosting algo-\nrithms: A general analysis with localized complexities. In Advances in Neural Information\nProcessing Systems, pages 6065\u20136075, 2017.\n\n[52] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent\n\nlearning. Constructive Approximation, 26(2):289\u2013315, 2007.\n\n[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\n\ndeep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.\n\n[54] Tong Zhang and Bin Yu. Boosting with early stopping: Convergence and consistency. The\n\nAnnals of Statistics, 33(4):1538\u20131579, 2005.\n\n[55] Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performance\nof polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory,\npages 921\u2013948, 2014.\n\n[56] Peng Zhao, Yun Yang, and Qiao-Chu He. Implicit regularization via hadamard product over-\nparametrization in high-dimensional linear regression. arXiv preprint arXiv:1903.09367, 2019.\n\n12\n\n\f", "award": [], "sourceid": 1696, "authors": [{"given_name": "Tomas", "family_name": "Vaskevicius", "institution": "University of Oxford"}, {"given_name": "Varun", "family_name": "Kanade", "institution": "University of Oxford"}, {"given_name": "Patrick", "family_name": "Rebeschini", "institution": "University of Oxford"}]}