NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2794
Title: On the number of variables to use in principal component regression

The authors study the "double-descent" phenomenon in high-dimensional principal components regression. The results are generally interesting and the reviews mostly positive. My own inclination is positive as well, but I have several nontrivial concerns. The authors should be careful to address these in a camera-ready version of the paper.

- Title: the title is misleading. There is no practical tool being offered here for PCR, just an interesting analysis of its performance. So the title should be changed.

- Presentation: several pages of the paper are taken up by lengthy proof sketches that are essentially only interesting/useful to those fluent in random matrix theory. This is a definite waste of space, because most readers will not get anything from it. It would be much better to use the space to explain the **significance** of the results, the consequences for practical use and general consumption, the intuition, etc. And, of course, experiments.

- Experiments: these are totally lacking from the main paper. There should be enough space for at least a few convincing experiments once the proof sketches are cut.

- Comparison to existing work: the significance of the lower bound on the eigenvalues of \Sigma in previous papers on ridge regression/double descent is not at all clear to me. As far as I understand, this condition is typically used to invoke some kind of uniform convergence argument that allows us to interchange limits (and effectively take \lambda \to 0 before taking n, p \to \infty). But this is a rather precise/technical use of such a condition, not a fundamental reliance on a particular regime. And of course it is entirely possible to just keep the \lambda \to 0 "on the outside" and still interpret the results as making a statement in the proper order. So I feel the authors need to motivate much better what is new about their analysis if they want to claim that the assumption of Gaussianity plus new techniques actually makes a difference.
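To spell out the limit-interchange point (notation here is my own, borrowed from the standard ridgeless-regression literature, not from the paper): writing $R_{n,p}(\lambda)$ for the risk of the ridge estimator with penalty $\lambda$, prior work effectively studies

$$\lim_{\lambda \to 0^+} \; \lim_{n,p \to \infty} R_{n,p}(\lambda),$$

whereas the min-norm interpolator corresponds to the other order,

$$\lim_{n,p \to \infty} \; \lim_{\lambda \to 0^+} R_{n,p}(\lambda).$$

A lower bound on the eigenvalues of $\Sigma$ gives convergence that is uniform in $\lambda$ near zero, and that uniformity is what licenses swapping the two limits; it is a technical device, not a structural restriction on the regime.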
Finally, I would like to see a more explicit comparison to the misspecified model in Hastie et al., which seems very similar.
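The kind of experiment requested above is cheap to run. Below is a minimal sketch (all parameter choices — dimensions, covariance spectrum, noise level — are illustrative, not taken from the paper) of the double-descent curve for min-norm least squares on the top-k highest-variance variables, averaged over a few random draws:

```python
import numpy as np

n, d, sigma, trials = 100, 400, 0.5, 10   # illustrative: train size, variables, noise sd
eigs = np.linspace(1.0, 0.05, d)          # decreasing variances: variables come pre-ordered
ks = [10, 50, 100, 150, 400]              # number of variables used; k = n is the threshold
errs = np.zeros(len(ks))

for t in range(trials):
    rng = np.random.default_rng(t)
    beta = rng.normal(size=d) / np.sqrt(d)  # dense signal

    def sample(m):
        X = rng.normal(size=(m, d)) * np.sqrt(eigs)  # diagonal covariance Sigma
        return X, X @ beta + sigma * rng.normal(size=m)

    Xtr, ytr = sample(n)
    Xte, yte = sample(1000)
    for i, k in enumerate(ks):
        # min-norm least squares on the k highest-variance variables
        theta = np.linalg.pinv(Xtr[:, :k]) @ ytr
        errs[i] += np.mean((Xte[:, :k] @ theta - yte) ** 2) / trials

for k, e in zip(ks, errs):
    print(f"k={k:4d}  mean test MSE={e:.3f}")
```

With any reasonable parameter settings of this kind, the test error spikes near the interpolation threshold k = n and descends again as k grows toward d, which is exactly the shape the paper analyzes and could display in a single figure.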