Review for NeurIPS paper: On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

NeurIPS 2020

On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression

Review 1

Summary and Contributions: This paper studies the linear regression within an over-parameterized regime. Both the covariates and the coefficients are assumed to have some nontrivial covariances. Overparameterisation has received much interest in recent years, since typical deep learning models are indeed over-parameterized. The main result of the work is a precise characterization of prediction risk E(y-x^T\hat \beta_\lambda)^2 in the proportional limit p/n\to \gamma\in (1,\infty), and analyzed the sign of the optimal parameter \lambda. In particular, optimal parameter can be negative. This differs drastically from the standard regression case.

Strengths: This work is theoretically sound, with some empirical evaluations. Over-parameterization is important for understanding many deep learning techniques. The model under study is still linear, so not directly applicable to deep learning, but it is more general than many existing studies, and hence it presents an interesting contribution to the community.

Weaknesses: The main limitation is that the study only presents the results in an asymptotic regime. It is not directly clear to the referee that how much insight can be shedded into practically important issues, either in a finite data / model, or beyond the linear model. Nonetheless, I am convince that the work is still important enough. A few minor comments that can be improved. 1) Figure 1 is actually not referenced in the main text. 2) page 2, it would be nice to give a reference to the statement on line 41. 3) I find Definition 2 hard to understand. Especially what is "the same order" ? Does it refer to entrywise of the same order or the vector norm of the same order. It would be better to have a more precise definition, preferrably with a forma. 4) line 241, change "conjuncture" to "conjecture". One last point is that some numerical results on more complicated models would greatly strengthen the study, e.g., on simple two-layer neural networks.

Correctness: the claims and method seem correct

Clarity: the paper is well written

Relation to Prior Work: The discussion with respect to prior work is sufficient.

Reproducibility: Yes

Additional Feedback: I thank the authors for the detailed response, which has addressed all my concerns satisfactorily.

Review 2

Summary and Contributions: The paper analyzes the generalization error in overparametrized linear regression with general quadratic norm regularization. The authors provide exact characterization of risk in the proportional asymptotic limit. They also discuss the optimal choice of regularization parameter and thoretically explain when $\lambda$ should be negative. Furthermore, they discuss the optimal choice of PD matrix that defines the regularization norm.

Strengths: Analyzing different problems in the proportional asymptotic limit has gained a lot of interest recently. Looking at different problems in this limit can theoretically explain interesting phenomena that we observe such as double descent or the need for negative regularization parameter. This paper extends the results of recent works to a more general setting with fewer restrictive assumptions, by allowing the data to have general covariance and the regularzation norm be defined with a general PD matrix. They build on the risk characterization to determine the optimal choice of $\lambda$ (or its sign) and optimal choice of matrix for the norm in regularization.

Weaknesses: The main issue I have with the paper is about the novelty of the results. The authors mention that previous work on linear regression is not as general as current work. In particular, they either only allow isotropic features or signal. This paper which is arXived about a month before the NeurIPS deadline seems to do both: [1] Emami, Melikasadat, et al. "Generalization error of generalized linear models in high dimensions." arXiv preprint arXiv:2005.00180 (2020). (I would refer to this paper as [1]). The results of this paper allow to characterize the exact generalization error in the same asymptotic limit for Guassian data with general covariance and any regularization, which includes the $\ell_2$ type regularzations considered here, as well as more general regularizations like general $\ell_p$ norms. Here are my understanding of the differences of the results of the two papers: - In [1] the authors allow for a Gaussian feature with any covariance matrix, whereas your paper allow non-Gaussina features so long as they have bounded 12th centered-moment. - In [1] the authors allow any regularization whereas your paper considers only $\ell_2$-type regularizations. - [1] the authors consider generalized linear models which have linear regression as a special case, whereas this work only considers linear regression. - Your paper provides a closed form formula for the generalization error, but in [1] the generalization error is given via a set of recursive equations which they call state evolution (SE). But it seems like getting exact formulas from SE is possible in certain simple cases like linear ridge regression as they do in the appendix. I guess they could do it for a general $\ell_2$ norm (but they have not). As such, I think the main novelty of this paper are the results about optimal choice of $\lambda$ and $\Sigma_w$. But I they might not be significant enough contribution for a NeurIPS paper. I believe at least a good comparison to the results of [1] is needed in this work.

Correctness: I did not check all the equations in the proof carefully, but at a glance, they seem to be correct. Furthermore, the theoretical results match the empirical results which corroborates correctness of the theory.

Clarity: The paper is very well written and clear. It is easy to follow

Relation to Prior Work: Yes. The author makes it very clear what the main contributions of the paper are, and in what sense the results are more general compared to previous results.

Reproducibility: Yes

Additional Feedback: Minor issue: - Reference [EM75] is missing the authors. Post author feedback: After reading other reviews and the author feedback I still believe that my initial rating was correct.

Review 3

Summary and Contributions: This paper studies a weighted version of ridge regularization in overparameterized linear regression. Instead of penalizing the empirical risk by the usual $\ell_2$ norm, authors consider a general setup where the penalization belongs to a class of norms where it turns out that in several setups the euclidian norm is not optimal. Among other findings, this setup allows to elucidate the mystery behind negative optimal regularization that was observed lately in Neural Networks. This can be seen as one of the highlights of this paper.

Strengths: This paper considers a general setup of regularized linear regression where, unlike prior work, the design, signal and weighting matrix may be anisotropic. Using the notion of misalignment between signal and design, authors were moreover able to give theoretical justification to recent empirical phenomena in Neural Networks. For all these reasons I think that the present submission is relevant to the NeurIPS community.

Weaknesses: * Assumption 2 of codiagonalizability is pretty restrictive in terms of the class of allowed weighting matrices. * Optimal choices of the weighting matrix $\Sigma_w$ depend either on the unknown signal or the population covariance. While authors try to address this point in section 6 claiming that $\Sigma_x$ can be estimated from data, I think this does not entirely solve the problem. First what they propose requires a prior knowledge of the limiting spectral distributions. Second the limiting distribution of the empirical covariance of the design is singular and hence does not satisfy Assumption 1 in the overparameterized regime. It would be interesting to understand to which extent this theory is valid when using covariance estimators as weighting matrices.

Correctness: The claims and their proofs seem correct to me.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Comment: * In my opinion the main concern of the present paper is the adaptive choice of the optimal weighting matrix. I understand that it is not necessarily easy to achieve this task. I think a first step could be just to come up with a decision rule (independent from the signal) choosing the optimal weighting matrix between identity matrix and the covariance $\Sigma_x$. Another option is proving that interpolating between those two matrices may outperform both of them. Typos: * line -1 in the caption of Figure 1: "compared" ----------------------------------------------------------------------- Given the page restriction, I think the reply to my concerns was OK. I decided to keep my initial score which I believe is good for this submission.