NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1696 Implicit Regularization for Optimal Sparse Recovery

### Reviewer 1

The paper is well written, with emphasis to follow the main results and the implications. Some questions for clarification below: 1. What is the intuition for the choice of this reparametrization ? Is it the reparametrization which leads to sparse of the solutions or the choice of the free parameters ? 2. Though the authors mention the existing works on implicit L2 regularization, it would be interesting (and self-contained) to contrast the implicit L2 vs the L1 regularisation in terms of the algorithms and the results. 3. What are the possibilities of extensions of these sparse recovery towards structured sparse ones ? Minor comments / typos: 1. Algorithm 2, step 2. Duplicate notations for ‘m’, being used as a scaling factor (scalar) as well as a vector. 2. Line 181. Maximum noise term w*_max ? 3. Line 243. Constat -> constant

### Reviewer 2

The paper's introduction is well written, the structure is also good. However, some important points could be made more clear, for example: 1. line 164-165: Where did you show the corresponding optimization path contains sparse solutions? Is the entire path sparse or it is only sparse at some iterations? It will help the reader if some more intuitions behind the algorithm could be provided. 2. Some notations in the main results should be clearly defined. For example, w*_{max} is the largest entry of w*. 3. The growth condition for dimension d with respect to n?

### Reviewer 3

The results are very interesting, technically sound, and the paper is well written, in particular the proof sketch is very useful. The main critical comment I have is that the results require the RIP constant to satisfy \delta = 1/sqrt{k}, where k is the sparsity. A sub-Gaussian random matrix satisfies the RIP of oder k with high probability provided that the number of measurements (or rows of the matrix A \in \reals^{n\times d}), n, obey: n >= c \delta^{-2}(k log(d/s)) With \delta = 1/sqrt{k} as required by the theory, the number of measurements/rows of A needs to satisfy: n >= c k^2 log(d/s) I.e., the measurements need to be quadratic in the sparsity in contrast to l1 minimization, which only requires the RIP constant to be a fixed constant, independent of the sparsity, to succeed, and thus only requires the number of measurements to be slightly larger than linear. Thus, the algorithm presented (or at least its guarantees) are highly suboptimal in the sample complexity. It would be important to prominently mention that (otherwise the title `optimal sparse recovery' is misleading). Also it would be interesting to add some numerical studies that show what happens when this condition is violated, to investigate whether the condition is an artifact of the theory or a property of the algorithm. For example, one could do phase transition curves (different to the ones in the paper, relating recovery performance k versus n) and compare them to lasso.