Paper ID: | 7240 |
---|---|

Title: | Outlier-robust estimation of a sparse linear model using $\ell_1$-penalized Huber's $M$-estimator |

This is a very good paper. I really enjoyed reading the paper. The result is very strong and the paper is very readable. I strongly recommend accepting the paper, if the proof is correct. (I could not check the supplementary material due to time constraints.) This paper solves an important open problem in robust estimation. The open problem is described in the beginning: "Is it possible to attain optimal rates of estimation in outlier-robust sparse regression using penalized empirical risk minimization (PERM) with convex loss and convex penalties?" The answer was negative in the past researches, but the present paper gives the answer "YES". The estimate is very simple, because it is obtained from the minimization of the simple Huber-loss with the $l_1$ penalty with some devised tuning parameter $\lambda_o$, which is an interesting point. To show the optimal convergence rate, we need some additional assumptions that the authors honestly mention on l.26-27. However, they are not strong. The introduction is very clear. The authors give a clear history of the study of convergence rate in robust estimation. The key sentences are: "this result (which means the result obtained in this paper) is not valid in the most general situation, but we demonstrate its validity under the assumptions that the design matrix satisfies some incoherence condition and only the response is subject to contamination." and "The main goal of the present paper is to show that this sub-optimality is not an intrinsic property of the estimator (3), but rather an artefact of previous proof techniques. By using a refined argument, we prove that $\hat{\beta}$ defined by (3) does attain the optimal rate under very mild assumptions." Section 2 gives a key theorem. In particular, the authors illustrate why some complicated assumptions are necessary in the first paragraph with interesting examples, and they also give the main point of the proof which is to treat the extra parameter $\theta$ as a nuisance parameter when we obtain the bound, but not in the past. Section 3 focuses on the Gaussian design. Section 4 gives a very clear survey on prior work. The authors give future works very clearly in Section 5, including technical details amazingly. Section 6 is a numerical experiment, which verifies the order of $o/n$. This is a very small experiment, but this is enough, because the main purpose of this paper is a theoretical one. Major Comment I have only one comment. I think the largest feature is that the tuning parameter $\lambda_o$ is incorporated into the Huber loss in a different way. The usual $l_1$ penalized Huber loss function with $\lambda_o=1$ will not give the optimal convergence rate. What is the role of $\lambda_o$ on eq.(5)? (although the role is clear on eq.(3).) $\Phi(u)=(1/2)u^2 \cap (|u|-1/2)$. The effect of $\lambda_o$ vanishes for $u^2$, but it remains for (|u|-1/2), which gives the loss function $\lambda_o\sqrt{n} ( (1/n)\sum_{i=1}^n |y_i-X_i^\top \beta| -1/2 )$. When $\lambda_o$ has the order used in Theorem 3, the factor $\lambda_o\sqrt{n}$ has the order $(\log(n))^(1/2)$, which converges to infinity. This implies that the effect of large error of $|y_i-X_i^\top \beta|$ is not admitted at all. As a result, the main loss is the squared error only. Is this an appropriate point of view? If you have a clear point of view on the role of $\lambda_o$ on eq.(5), please give related comments. -------------------------------------------- Thank you for your reply.

I thank the authors for their detailed response. There is another NeuIPS submission whose results supersedes this one (this paper's result is only a corollary of that submission), and I will leave the decision to the AC. ===== - originality This paper only deals with the setting where response variables (y), and has to assume that x are sub-Gaussian, and similar results were already established in (Bhatia et al., 2017). However, (Liu et al., 2019) can handle corruptions in both x and y, although the bound is slightly worse. - quality The proofs are sound. - clarity The paper is well written. - significance There are many papers in this area, and this work has not differentiated itself from other papers.

I enjoyed reading this paper and support its publication in NeurIPS. The primary contribution of this work is technical and theoretical, but it provides a sharpened analysis of a very practical algorithm for robust regression. This work shows that under certain natural conditions, ell-1 penalized least-squares achieves the minimax rate of convergence for this problem, up to a logarithmic factor. I am not an expert in this area, but according to the authors, previous analyses of this same algorithm failed to show this sharp rate, and previous algorithms achieving this rate were complicated and difficult to use in practice. The technical innovation applies the KKT condition for beta, at the estimated outlier-contamination vector theta, to derive a recursive bound for the squared-error of the beta estimate. A main term of this bound is v'X'u for (u,v) the errors in (theta,beta), and a main technical insight is that this can be controlled by ||u||_2*||v||_1/sqrt(n) + ||u||_1*||v||_2/sqrt(n), rather than the naive operator-norm bound ||v||_2*||u||_2, when the design X satisfies an incoherence property with the standard basis. The improved bound then applies this insight and an a priori bound on (u,v) from a more standard Lasso-regression analysis. The authors demonstrate that the required incoherence property holds for (correlated) multivariate Gaussian designs, using a Gaussian-process and peeling argument. In my view, this insight is non-trivial, and both the paper and proof are also well-written. A few comments/minor typos: (1) I think some more explanation is needed after Definition 1, to explain how this captures the notions of restricted invertibility and incoherence in the previous paragraph. In particular, the role of the transfer principles in restricted invertibility is a bit confusing, as the RE condition for Sigma is only discussed later on the page. Explaining why (ii) captures some notion of incoherence would also be helpful. (2) It took me a while to find where X^{(n)} and \xi^{(n)} are defined---perhaps this can be clarified more explicitly. (3) Line 101, p-th largest should be p-th smallest (4) Line 476, J should be S and beta_j should be beta_j^* (5) Is there a minus sign missing in the first term on the right of Lemma 1, and its application in line (525)? (6) I'm not sure what trace means above line 564 for W_{b,v}. ------------- Post-rebuttal: Thanks to the authors for the response, clarification, and discussion.