NeurIPS 2020

### Review 1

Summary and Contributions: This paper examines self-distillation, a popular method to improve the generalization of deep networks by re-training the network on its own predictions. Despite its popularity, self-distillation is poorly understood, and this paper attempts to provide a theoretical underpinning by examining self-distillation in Hilbert spaces, in the hope that some of the findings will provide insights into existing practices in the deep learning community.

Strengths: The problem / topic is very timely and quite necessary. As far as I know, there exists very little prior work that tries to understand the inner workings of self-distillation.

Weaknesses: There is an obvious gap between the sub-domain where the authors can make their theoretical contributions (Hilbert spaces) and where practitioners use self-distillation. It is not clear to me how relevant the findings are to the deep learning community (who arguably are the main people using self-distillation). This gap is also present between sections 1-4 and section 5. The authors bridge this gap with a reference to the popular Neural Tangent Kernel work, but it is probably fair to say that this is a weak link. The paper falls short on its conclusion. A topic like this could really shine if there were some interesting discussion that linked the rigorous results obtained in Hilbert space back to the ultimate domain of interest (deep learning). I am afraid the complete lack thereof diminishes the relevance of the findings quite a bit.

Correctness: I didn’t go into the nitty gritty details of the derivations, but the findings are probably correct.

Clarity: It is OK. The motivation is well stated, and the approach and techniques are similarly well communicated; however, a discussion of the relevance of the findings is completely missing.

Relation to Prior Work: I was a little shocked that the authors attributed model distillation to Hinton et al.’15. The well-known work on “Model Compression” [Bucilua et al. 2006] precedes it by a decade.

Reproducibility: Yes

Additional Feedback: Let me state clearly that I am no learning theorist. Still, I am quite active in the deep learning community and have worked in this particular area for a while, so it is fair to expect that this paper should speak to me. --- After rebuttal: Thanks for the clarifications. I updated my review. Ultimately, I am worried that this paper is showing something inherently obvious. If I were to take a training data set and fit it with kernelized ridge regression (very much akin to eq. (11)), then the remaining residual can be attributed to the regularization. If I repeat this process on the predictions, the residual from the true labels will increase and eventually I will over-regularize my data. There could of course be an initial benefit because my first classifier was under-regularized. This is clear, although in practice it is of course unlikely that anybody would do it, as it is easier to just set the regularization parameter better in the first place. I am sure the same applies to deep networks. What I didn’t understand is whether there are implications that, for deep nets, go beyond these rather obvious findings. Although the topic is very interesting and the approach promising, all in all, I don’t think the paper should be accepted in its current form, because I believe this kind of discussion is very important in this context and very much underdeveloped in this manuscript. What I don’t know is whether this could easily be added or whether the theoretical contribution in itself (even without a link to deep learning) is interesting enough to merit publication. I will therefore mark this as reject with high uncertainty and am looking forward to the rebuttal and discussion with the other reviewers. Given my uncertainty I would be happy to be convinced in either direction.
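The kernel ridge regression argument above can be checked numerically. The following is a minimal sketch under assumptions that are not taken from the paper: a Gaussian kernel, a fixed regularization strength `lam`, and synthetic 1-D data. The names `rbf_kernel` and `self_distill_residuals` are hypothetical. Each round refits the model on the previous round's training predictions and records the distance of the new predictions from the true labels.

```python
import numpy as np

def rbf_kernel(a, b, gamma=10.0):
    # Gaussian (RBF) kernel matrix between two sets of 1-D points (assumed kernel).
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def self_distill_residuals(x, y, lam=0.1, rounds=10):
    """Refit kernel ridge regression on its own training predictions.

    Returns the norm of (true labels - predictions) after each round.
    """
    K = rbf_kernel(x, x)
    # Fitted values of kernel ridge regression are K (K + lam I)^{-1} targets.
    # K and (K + lam I)^{-1} commute, so solving (K + lam I) S = K gives the same matrix.
    smoother = np.linalg.solve(K + lam * np.eye(len(x)), K)
    targets = y.copy()
    residuals = []
    for _ in range(rounds):
        targets = smoother @ targets  # this round's predictions become the next targets
        residuals.append(float(np.linalg.norm(y - targets)))
    return residuals

# Synthetic data (illustrative only): a noisy sine wave.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(30)

residuals = self_distill_residuals(x, y)
for t, r in enumerate(residuals, start=1):
    print(f"round {t}: residual norm to true labels = {r:.4f}")
```

With a fixed `lam`, the residual to the true labels cannot decrease from one round to the next, which matches the over-regularization effect described in the review. Whether test error first improves depends on how under-regularized the first fit was, and this sketch does not measure that.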

### Review 2

Summary and Contributions: This paper analyzes the effects of self-distillation in training neural networks. They show that self-distillation progressively reduces the number of basis functions that can be used to represent the desired solution.

Strengths: The paper shows an interesting relation between self-distillation and regularization. By limiting the number of basis functions used to represent the solution, networks trained using self-distillation can avoid over-fitting for a few rounds of training. However, if self-distillation is carried out for a "large" number of rounds, the networks can under-fit; a phenomenon also verified empirically.
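One way to see this mechanism is a simplified setting with a fixed regularization strength. This is an assumption made here for illustration and is not the paper's exact formulation. The symbols $K$ (kernel matrix on the training points), $\lambda$ (ridge parameter), and $d_i$, $v_i$ (eigenvalues and eigenvectors of $K$) are introduced only for this sketch:

```latex
\hat{y}_0 = y, \qquad
\hat{y}_t = K (K + \lambda I)^{-1} \hat{y}_{t-1}, \qquad
K = \sum_i d_i \, v_i v_i^\top
\;\Longrightarrow\;
\hat{y}_t = \sum_i \left( \frac{d_i}{d_i + \lambda} \right)^{t} (v_i^\top y) \, v_i .
```

Every factor $d_i / (d_i + \lambda)$ is below 1, and the factors belonging to small eigenvalues shrink fastest as $t$ grows. After a few rounds only the leading directions contribute noticeably, which acts as regularization. After many rounds all components approach zero, which corresponds to under-fitting.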

Weaknesses: The paper does a very good job. This reviewer is not able to suggest any major weaknesses other than minor typos.

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

### Review 3

Summary and Contributions: The paper analyses self-distillation (knowledge distillation when the teacher and student have the same model) in the context of kernel regression, showing that self-distillation can regularize (by sparsifying the basis functions), which explains the performance improvement. The analysis also shows that more steps of self-distillation can lead to underfitting and thus a decline in performance. These theoretical insights are validated empirically on a toy example and on CIFAR10.

Strengths: 1. Novel theoretical analysis of self-distillation that can explain why it improves performance. This has been a curious observed phenomenon, and this analysis (in the context of kernel regression) yields insights on how self-distillation can sparsify the basis functions and thus regularize the model. The paper deals with optimally tuning the regularization strength lambda in each step by formulating the problem in terms of a fixed desired loss tolerance epsilon (equation 1). This formulation makes it easier to analyze the solution of the optimization problem in closed form. 2. Experiments validate the theory and yield interesting observations. The synthetic experiment (section 4) illustrates the theory in section 3. The experiment on CIFAR10 (section 5) also supports the theory: as one performs more steps of self-distillation, train accuracy goes down, while test accuracy goes up then down. The performance decline for a large number of self-distillation steps is interesting.

Weaknesses: 1. Unclear what argument some sections are making. Sections 3.1 and 3.2 lower bound the number of rounds of self-distillation, but it is not clear what insight this gives. The motivation for this seems to be lacking. Section 3.5 similarly bounds the sparsity level S_B, but I'm not sure for what purpose. 2. The objective function does not correspond to distillation as done in practice. In particular, in practice every step uses both the loss with respect to the original labels y and the loss with respect to the teacher labels. The paper only analyzes the objective with the teacher labels. If one uses the original labels as well, I'm not sure the same phenomena (solution collapse to zero, performance decline) will happen.

Correctness: The theoretical claims appear to be correct. The empirical methodology seems correct as well.

Clarity: The terms "dense" and "sparse" in Sections 3.3 and 3.4 are a little misleading, since the entries aren't actually zero (just small). It might be better to define them clearly and relate them to, for example, the intrinsic dimension of the matrix (trace divided by spectral norm).
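For reference, the intrinsic dimension mentioned above can be computed as follows. This is a minimal sketch, assuming a symmetric positive semidefinite input. The helper name `intrinsic_dimension` and the diagonal example matrices are hypothetical and are not taken from the paper.

```python
import numpy as np

def intrinsic_dimension(A):
    """Trace divided by spectral norm, for a symmetric positive semidefinite matrix A."""
    # np.linalg.norm with ord=2 returns the largest singular value of a 2-D array.
    return float(np.trace(A) / np.linalg.norm(A, 2))

# Illustrative diagonal matrices: all entries equal versus one dominant entry.
evenly_spread = np.eye(4)
one_dominant = np.diag([1.0, 1e-3, 1e-3, 1e-3])

print(intrinsic_dimension(evenly_spread))  # equals the matrix size when all entries are equal
print(intrinsic_dimension(one_dominant))   # close to 1 when a single entry dominates
```

Because the quantity varies continuously with the entries, it distinguishes "small but nonzero" from "exactly zero" without requiring a hard sparsity threshold.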

Relation to Prior Work: Differences from previous work are sufficiently discussed.

Reproducibility: Yes

Additional Feedback: Typos:
- Equation (15): $\equiv$ should be $\implies$.
- line 207: nThe -> The
- line 249: the closer $a$ becomes to 1 -> the closer $s$ becomes to 1?
- Figure 3 caption, third line: "For Right" -> "Four Right"

Section 3.7 mentioned "more details in appendix" but I can't find details about generalization bounds in appendix.

===== Update after authors' rebuttal: Thanks for clarifying why the self-distillation objective in the paper uses predictions and not original labels. This has addressed my concerns above, and I have updated my score.