NeurIPS 2020

### Review 1

Summary and Contributions: This paper computes the Gaussian process kernel (GPK) and the neural tangent kernel (NTK) of a specific ResNet-like architecture, which characterize the limiting behavior of NNs when the size m of their hidden layers becomes infinite, when training only the last network layer (GPK case) or when training all layers but the last and first (NTK case). The authors then compute the limit NTK of ResNets when their number of layers L grows indefinitely, and compare it to the limit NTK of classical feedforward nets (FFNets). They find that, while the limit NTK of FFNets is a degenerate discrete kernel (constant + nugget at 0) --which can fit any data but cannot generalize--, the limit NTK of ResNets is smooth and close --and sometimes equal-- to the NTK of a ResNet of depth 1. To the authors, this explains why learning with FFNets is limited to a few dozen layers only while ResNets can have thousands, and why adding ResNet layers changes the accuracy only marginally. The authors however admit (and confirm empirically) that in practice, even for infinitely wide ResNets, the number of layers does slightly influence performance. Finite-sample guarantees are provided for all convergence results.

Strengths: The article provides an interesting analysis and comparison of the limiting behavior of infinitely wide FFNets and ResNets as their depth increases, which brings us a step closer towards understanding the behavior and comparative generalization abilities of some SOTA network architectures (FFNets vs. ResNets).

Weaknesses: The essential parts of the proofs are in the appendix, which reviewers are not required to check, and which would in any case be too long (17 pages) for me to review in such a short period. Therefore, all propositions and theorems should at least be validated/double-checked by empirical simulations included in the paper, e.g. by letting neural networks become sufficiently large and deep and checking that they behave as predicted by the theorems. Please note that these experiments would be different from those of Figs. 2 & 3. What I ask for is a simple yet strong safety check that is easy to review and confirms the validity of the theorems. I do understand that, mathematically speaking, a proven result does not need confirmation. Still, in practice, and given the reviewing circumstances, such safety checks would be useful/critical both for the authors and for the NeurIPS reviewing committee, which, by accepting the paper, would endorse its correctness and engage its own credibility.

### Minor points:

- The paper only analyzes fully connected ResNets, whereas ResNets typically use convolutional layers in practice. The paper should at least clearly state this obvious limitation and include it in the Broader Impact section.
- The paper only analyzes a specific scaling of the bottleneck weight $\alpha = L^\gamma$, without further justification. I don't recall seeing such a scaling in any applied work, so a brief justification would be the least.
- Practitioners systematically include batchnorms in ResNets (and FFNets), which are not taken into account by this paper. It would therefore be interesting to add experiments studying whether these batchnorms actually alter the predictions of the theorems. (These batchnorm experiments could be coupled with or compared to the safety-check experiments mentioned above.)

Correctness: Results seem plausible, but since I did not check the proofs (17 pages in the appendix), and since the paper does not include any "safety check" experiments to confirm the theorems, it is impossible to say for sure. See "Weaknesses".

Clarity: The paper is very clear and well written.

Relation to Prior Work: Yes, well discussed.

Reproducibility: Yes

### Review 2

Summary and Contributions: This paper calculates the equivalent Gaussian process kernel and the equivalent neural tangent kernel of any ResNet with infinite width. It then proves that when the depth of a ResNet goes to infinity, the equivalent kernels of this ResNet converge to those of a one-layer network. The authors then conduct experiments to verify these results.

Strengths: The strengths are listed below. 1. This result is impressive: when the depth of a ResNet goes to infinity, the equivalent kernels of this ResNet converge to those of a one-layer network. It may shed light on a full understanding of ResNets' behaviour. It also suggests a degenerative nature of deep learning models. 2. The proofs are solid and given in detail.

Weaknesses: My concerns are listed as follows. 1. The assumptions are still quite strong: the width needs to be infinite. There still exists a gap between this paper and practice. It would be desirable to give an "error analysis" of the "distance" between networks in such an overparameterized regime and those used in practice. For example, how significantly would the results change if the width of the network were bounded? 2. The authors suggest that their theoretical results ("infinite-depth networks are equivalent to one-layer networks") support ResNets' good generalizability. A direct measure of generalizability would be a generalization bound (or hypothesis complexity here). Would the current result help deliver a smaller generalization bound? Or does the current form of equivalence (in the view of the NTK) straightforwardly lead to equivalence in the view of complexity or generalizability? ====== Thanks for the response. My concerns are cleared. It would be great to add the explanation regarding generalization in the next version.

Correctness: The claims, method, and empirical methodology are correct.

Clarity: This paper is well-written.

Relation to Prior Work: The differences with previous contributions are clearly discussed.

Reproducibility: Yes

Additional Feedback: It would be desirable to clarify the two aforementioned weaknesses in the rebuttal.

### Review 3

Summary and Contributions: The authors show differences between the limiting kernels for neural networks as depth goes to infinity, for the NTK induced by a standard feed-forward network and by a residual network. In particular, within their framework they show that the NTK of an infinite-depth feed-forward network is degenerate, whereas that of a residual network in the infinite-depth limit is not.

Strengths: The relationship between kernels and neural networks is a timely and important problem. Showing that the infinite-depth NTK of a feed-forward network is degenerate is somewhat interesting.

Weaknesses: Post-Rebuttal Update: While I think the paper is still below the bar for acceptance (as I think the motivation is weak), the rebuttal and the discussion with the other reviewers on the novelty of the theoretical work have encouraged me to increase my score to a 5. Overall this paper is poorly motivated and written. It is unclear why taking depth to infinity is the correct regime to study, when networks that achieve state-of-the-art accuracy on ImageNet or CIFAR are not necessarily the deepest (50 layers for CIFAR, a few hundred for ImageNet). Thus the whole motivation for the paper is somewhat questionable. The experiments are very weak and use non-standard 2-class versions of MNIST and CIFAR. It is unclear why the authors do not compare to published kernel results on CIFAR-10 and MNIST. I think a cleaner rewrite of the theoretical results and a stronger experimental section, with rigorous experimentation that shows *exactly at what point* deeper feedforward kernels degrade and compares this to when deeper feedforward neural networks degrade, could make a significantly stronger paper.

Correctness: I did not verify the correctness of the theorems, but I can believe they are true; the experiments seem weak but correct.

Clarity: No: there are 5 theorems in the paper and it is unclear what purpose they serve. The entire paper reads like a wall of lemmas and theorems, and it is utterly unclear what the motivation for each mathematical statement is.

Relation to Prior Work: The paper doesn't cite much of the relevant prior work. Greg Yang has derivations of neural tangent kernels for various architectures (https://arxiv.org/abs/2006.14548). Shankar et al. show (contrary to the results in this paper) that for moderate depth increases, test accuracy does improve for neural kernels (NNGP, not NTK): https://arxiv.org/abs/2003.02237

Reproducibility: Yes

### Review 4

Summary and Contributions: Consider a (variant of the) ResNet, and draw its weights independently from a Gaussian with zero mean and an appropriate variance inversely proportional to the width. This paper proves that the squared L2-norm of the pre-activations of each layer of a finite ResNet, with finite (but large enough) width, is close to the ResNet's corresponding GP kernel in a probably-approximately sense. This is stronger than what was shown by [Y19], and similar to what was shown by [27, 14]. Theorem 3 tells us how close to the NN-GP kernel networks of finite width are, as opposed to stating almost-sure convergence in the limit (which is [Y19]). The authors prove the same for the inner product of the gradients over all parameters and the ResNet's corresponding NTK (Theorem 4). The authors then use these proofs to link the following statement about the NTKs of a feed-forward network and a ResNet to the behaviour of the actual networks: as the depth of the network goes to infinity, the feed-forward network's NTK forgets what the two inputs were. With some Tikhonov regularization, the kernel can fit any finite training set; however, the predictions outside of the training set are always 0. It follows that the concept class of this kernel is not learnable, that is, it is not the case that for every learning distribution and function x -> y, the kernel machine can identify x -> y from a finite training set. However, the ResNet kernel does not forget the input at infinite depth, and so its concept class is learnable. This partially explains why ResNets generalize better than FFNets in deep settings.

[Y19] Yang, Greg. https://papers.nips.cc/paper/9186-wide-feedforward-or-recurrent-neural-networks-of-any-architecture-are-gaussian-processes
[27] Daniely, A., Frostig, R., & Singer, Y., Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity.
[14] Allen-Zhu, Z., Li, Y., & Song, Z., A convergence theory for deep learning via over-parameterization.
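A toy sketch can make this degeneracy concrete. The following is a minimal illustration (not the paper's exact kernel), assuming a hypothetical degenerate kernel of the form k(x, x') = c + 1[x = x'] -- a constant plus a nugget on the diagonal. Kernel ridge regression with such a kernel interpolates any training labels, yet assigns the same prediction to every unseen input:

```python
import numpy as np

rng = np.random.default_rng(0)

def k(X1, X2, c=0.1):
    """Degenerate kernel: constant c, plus a nugget where inputs coincide."""
    eq = np.isclose(X1[:, None], X2[None, :]).astype(float)
    return c + eq

# Training data: arbitrary labels on distinct random inputs.
X = rng.normal(size=8)
y = rng.normal(size=8)

lam = 1e-6                     # small Tikhonov regularization
K = k(X, X)                    # Gram matrix = c * ones + identity
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# The kernel machine fits the training set almost exactly...
train_err = np.max(np.abs(K @ alpha - y))
print(train_err)               # tiny: pure interpolation

# ...but every unseen point gets an identical prediction, since its
# kernel row against the training set is the constant vector (c, ..., c).
X_test = rng.normal(size=5)
test_pred = k(X_test, X) @ alpha
print(test_pred)               # all entries equal, independent of the input
```

In this sketch the off-sample prediction is a constant that carries no information about the test input; in the centered setting the review describes, that constant is 0, which is exactly why the concept class of this kernel is not learnable.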

Strengths: The paper provides a mathematically rigorous and sound treatment of (a variant of) ResNets. The proofs presented here are novel. That infinitely deep ResNets do not forget the input, while infinitely deep feed-forward networks do and therefore cannot learn, is an important result, and it is rigorously proven here. The last result presented is verified numerically. A related result (with gamma = 0.5) is also numerically tested, though not proven.