Review for NeurIPS paper: Triple descent and the two kinds of overfitting: where & why do they appear?

NeurIPS 2020

Meta Review

The paper discusses the existence of a triple descent phenomenon in the test loss as a function of the dataset size. The reviewers unanimously appreciated the conceptual novelty of the paper, in which the authors separate the two potential phenomena causing non-monotonic test error behavior in terms of the number of samples. This is very relevant work for the conference, and as such the reviewers have provided extensive feedback. I urge the authors to take the detailed feedback into account in their revision.

Additionally, below is the anonymized transcript of some interesting discussion points which I believe highlight some confusions in the paper, and I strongly encourage the authors to address them. Most importantly, please address, with a mathematical proof or extensive empirical evidence, the following concern raised by R1 regarding one of the main claims in the paper: the claim that the linear peak is exhibited only in the presence of noise is not justified in the paper (the authors cite [6], but [6] is only for linear models). I believe that with non-linear RF models there might still be variance terms from initialization and training data; in other words, it is not clear whether the total variance can exhibit a linear peak even when SNR=\infty (no noise).

In addition, following R2 and R3's suggestion, I would highly recommend adding illustrative experiments and discussion that demonstrate the presence or absence of the proposed phenomenon in practical networks.

Select quotes from the discussion.

------

R1

I'm also confused by one of the paper's main contributions, that the linear peak is solely due to overfitting noise. From playing with the RF model myself, it seems that the total variance (i.e. not the loss) can exhibit a linear peak even when SNR=\infty, which seems to cut against their conclusion.

AC (based on quick reading)

1. The definition of the linear and non-linear peaks in lines 38-42 is confusing: reading the lines literally, they define the linear peak only for linear models. But my understanding is that they are not focusing on linear models at all; instead they consider only non-linear models and denote the phenomenon at N=D as the linear peak and the one at N=P as the non-linear peak, where D is the input dimension and P is the interpolation threshold (min number of samples which can interpolate any y values on the training dataset). I immediately see the significance of P, but I don't know what the real significance of D is.
---- Is the N=D peak a consequence of when the network behaves like a "linear" model (i.e. \hat{f}(x) = <x, w> for some w)?
---- Or is it a consequence of the fact that the ground truth is linear (i.e., f^*(x) = <x, \beta> + \epsilon)?
---- (Maybe out of scope, but) in the definition of r, which seems to be crucial, the properties of the nonlinearity around 0 are considered. I don't see why 0. How will the analysis change if we allow a bias term?
---- What are \psi and \phi in eq. 4 used for? Beyond [19], where else is the asymptotic in eq. 4 used?

2. I think another major concern with the paper is the discussion of related work and the statement of relevant results. This also makes me confused about separating what is rigorous from what is speculative in the explanation they provide. Specifically:
- The claim for non-linear models that the N=D peak is implicitly regularized by the nonlinearity: I couldn't figure out what it means or how they justify it. Can anyone clarify?
- The decomposition and attribution of different variance terms to different peaks is super confusing to me. Sec 2.3 is, I think, important but poorly explained. The claim is that the linear peak N=D is purely due to noise \epsilon variance.
It is not obvious to me why: they cite [6], but [6] is only for linear models, and it is not clear to me why, for non-linear models, the initialization variance should not affect the variance at D (with, say, ReLU activation, or even linear activation where the first layer is a fixed overcomplete representation with P>>D). Overall, in combination with not understanding what they call implicit regularization from non-linearity, I don't really get what the phenomenon happening at N=D is. Although they cite [19], I think that to be self-contained the components of the variance should be written out clearly here, and they should mention which computation of these they plot in Fig 6.
- Minor: In L146 onward they mention [6] for showing how small non-zero eigenvalues are bad for generalization, but [6] only studies linear models. Also they mention "the norm of the interpolator needs to become very large to fit small eigenvalues according to [3]"; this is true, but only for vanishing \gamma (which is the setting they work in for the most part, but they do not clarify this upfront).

R1

I think I can answer some of your questions.

1.
--- My understanding is that the linear peak and nonlinear peak are defined through their locations, i.e. N=D and N=P. The reason for the terminology is that, to analyze the RF model, one projects the activation function into a linear and a nonlinear component; more specifically, into the first two Hermite polynomials and their remainder, i.e. into a linear function and a function that has been called "purely nonlinear." The constants \eta and \zeta govern the weights of these two components. When \zeta=\eta, the activation function is linear, yielding a model of the form a*<\Theta, x>. When \zeta=0, the linear component of the activation function is 0. After this projection, one can see that these peaks coincide with different rank constraints on the kernel, the size of each of which is related to the constants \eta and \zeta.
--- Adding biases to the random feature model significantly complicates the analysis. Rather than being a simple polynomial, the self-consistent equation that determines the asymptotics becomes a coupled integral equation. See Adlam et al. 2019, arXiv:1912.00827.
--- The constants \psi and \phi are crucial to the model. They determine the high-dimensionality of the data and the parameterization of the model. The results for the RMT problem are all asymptotic, as N,D,P -> \infty. The constants \psi and \phi fix their ratios in the limit, and changing them will alter the limiting behavior. When the authors plot a theoretical prediction as a function of N, they are adjusting the constants \psi and \phi.

2.
--- Previous work has noted that the spectral properties of the RF model are equivalent (in some sense) to those of a different matrix model, see Eq. (6). Since the linear peak ultimately comes from the term with the \sqrt{\zeta} prefactor, reducing this constant reduces the size of the linear peak. I believe this is what the authors mean by implicit regularization.
--- I agree that the bias-variance decomposition is confusing, but this is not the authors' fault. There has been a huge proliferation of different bias-variance decompositions recently, and it has generated a lot of confusion. The bias-variance decomposition the authors use includes the randomness from three sources: initialization of the random features, sampling of the training points, and label noise. They then apply the law of total variance twice to the variance term, which produces three terms that they attribute to initialization of the random features, sampling of the training points, and label noise. One can argue about whether the interpretation of these terms is correct. My main concern is that while the test loss does not have a peak in the SNR=\infty case (i.e. no label noise), I believe the total variance can. This cuts against one of the main claims of the paper, that the linear peak is due to label noise.
I agree that there is still variance due to the initialization of the random features (see blue curves in Fig. 6), but I do not think it has a peak at N=D, so in that sense it does not cause the linear peak. I think if the authors included a mathematical statement of the claim that the linear peak is due to label noise, it would help clarify our discussion.
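
To make the decomposition under discussion concrete, here is a minimal Python sketch of applying the law of total variance twice to a predictor that depends on three independent sources of randomness. It is an illustration only. `predictor` is a hypothetical scalar toy function, not the RF model; the grids are arbitrary; and the conditioning order (label noise first, then training sample, then initialization) is an assumption, since the discussion does not state which order the paper uses. A different order generally changes how the total is split among the three terms.

```python
from itertools import product
from statistics import fmean, pvariance


# Hypothetical toy predictor, NOT the random-features model from the paper.
# theta stands in for the initialization, x for the training sample,
# eps for the label noise.
def predictor(theta, x, eps):
    return theta * x + (x + eps) ** 2 + theta * eps


# Arbitrary illustrative grids; each source is uniform on its grid
# and the three sources are independent.
THETAS = [-1.0, 0.5, 2.0]
XS = [-2.0, 0.0, 1.0, 3.0]
EPSS = [-0.5, 0.5]


def decompose(f, thetas, xs, epss):
    """Return (v_noise, v_sampling, v_init, total) for f(theta, x, eps).

    Assumed conditioning order: eps, then x, then theta.
    """
    # First application: condition on (theta, x) and look at eps only.
    mean_eps = {
        (t, x): fmean(f(t, x, e) for e in epss)
        for t, x in product(thetas, xs)
    }
    v_noise = fmean(
        pvariance([f(t, x, e) for e in epss])
        for t, x in product(thetas, xs)
    )
    # Second application: condition on theta, variance over x of the
    # eps-averaged predictor.
    v_sampling = fmean(
        pvariance([mean_eps[t, x] for x in xs]) for t in thetas
    )
    # What is left: variance over theta of the fully averaged predictor.
    v_init = pvariance(
        [fmean(mean_eps[t, x] for x in xs) for t in thetas]
    )
    # Total variance over all three sources jointly, for comparison.
    total = pvariance(
        [f(t, x, e) for t, x, e in product(thetas, xs, epss)]
    )
    return v_noise, v_sampling, v_init, total
```

On this toy, the three terms add up to the total variance, and replacing the noise grid with the single value `[0.0]` makes the noise term exactly zero while the other two terms remain. That is the kind of statement R1 asks the authors to make precise for the RF model; the sketch says nothing about whether the remaining terms peak at N=D.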
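
The linear/nonlinear split of the activation function that R1 describes can also be illustrated numerically. The sketch below assumes the definitions eta = E[sigma(z)^2] and zeta = (E[z sigma(z)])^2 for z ~ N(0,1). These formulas are not given in the discussion above; they are an assumption, chosen because they reproduce the two limiting cases R1 states (zeta = eta for a linear activation, zeta = 0 when the linear component vanishes). The function names and the trapezoid-rule integrator are likewise illustrative.

```python
import math


def gaussian_expectation(g, half_width=10.0, steps=20001):
    """Approximate E[g(z)] for z ~ N(0, 1) with the trapezoid rule
    on [-half_width, half_width]."""
    h = 2.0 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        z = -half_width + i * h
        # Endpoints get half weight in the trapezoid rule.
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * g(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def activation_constants(sigma):
    """Return (eta, zeta) under the ASSUMED definitions
    eta = E[sigma(z)^2] and zeta = (E[z * sigma(z)])^2."""
    eta = gaussian_expectation(lambda z: sigma(z) ** 2)
    zeta = gaussian_expectation(lambda z: z * sigma(z)) ** 2
    return eta, zeta
```

For example, `activation_constants(lambda z: 3.0 * z)` returns two nearly equal values, while `activation_constants(abs)` returns a zeta that is numerically zero, because |z| is an even function and so has no linear Hermite component.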