NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4827 What Can ResNet Learn Efficiently, Going Beyond Kernels?

### Reviewer 1

Dear Authors: I read your rebuttal. I do indeed understand the point of your paper. I also agree that solving linear equations is not a good example of something kernel methods can't do. The 'isotonic regression' algorithm for learning a ReLU is a simple SGD algorithm that uses a straight-through estimator. The high-level message of your paper is that there are problems that gradient-based methods can solve, but kernel methods cannot. Learning a ReLU is a fine example. Your paper has many other interesting contributions such as working with vanilla SGD and also formally ruling out a variety of kernels. That's very nice, and I continue to recommend acceptance. The construction in this paper seems interesting and noteworthy. It is unfortunate that high level description of the lower bound is left until supplemental. Also, the paper leaves out many relevant works on this topic. I do not understand this submission's description of "Regularization Matters: Generalization and Optimization of Neural Nets vs their Induced Kernel." [37] does give a separation between neural nets and NTK (the submission here is better in the sense that they get a separation for all correlational kernels). Another recent paper by Yehudai and Shamir shows that a ReLU cannot be learned by random features. This implies that a ReLU cannot be learned by NTK, recursive kernel, etc. But a ReLU can be learned for all distributions by gradient descent on a surrogate (convex) loss see "Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression." So ReLU already seems to be a separation for many classes of kernels (though it is not proved for all correlational kernels). Building on this work "Learning Neural Networks with Two Nonlinear Layers" there is an algorithm that shows how to learn neural networks that cannot be embedded into known kernels (NTK, recursive) because they can learn ReLU, so again this goes beyond current kernel methods.

### Reviewer 2

The results themselves form a nice study. The biggest problem I find is the impact. The studied network is neither convex nor concave but yet it is special (single-skip three-layer ResNet). It begs the question if the result can be extended to other networks, in particular without the skip connection, i.e., non ResNet. I do not see any practical relevance. The result says in layman words "use neural networks instead of kernel methods," which these days everybody takes as granted. Another big issues I have is with the exposition of the paper and the numerous grammar mistakes. The exposition of the paper is weird. It is impossible to read the paper without continuously going back and forth. For example, Theorem 1 is using complexity before it is actually formally defined. In the same theorem, probability is used but it is unclear what really is random (the samples are random but this is not the origin of this probability). I think a much better flow would be to define the quantities before being used. The two overview sections are kind of out-of-kilter. I'm not sure what a better flow would be but the current one definitely doesn't work well. My 'below accept' rate is due to unclear impact and relevance of the results, and very poor exposition of the work with numerous grammar mistakes. Minor comments (these are only a few of them; there are many additional grammar mistakes) 21: understood how 26: "from them" is incorrect 27: "to the optimization side" ??? 32: are kernel methods ... are defined 40: by random etc

### Reviewer 3

(1) I have a serious concern about the parameter alpha: In Theorem 1, the prediction error delta needs to be larger than alpha^4. This means that if we require a high accuracy prediction, i.e., delta is small, alpha also needs to be small. Increasing the sample size/number of iterations T does not reduce th error. In another word, if we require the learnt network to be consistent, i.e., risk->0, then the composite signal in (2.2) needs to diminish. The scaling alpha->0 makes this regime not very reasonable. In contrast, if we consider alpha to be small fix constant, Theorem 1 will not give a meaningful bound. (2) The authors should highlight the difference between this paper and Allen-Zhu et al. 2018 for analyzing feedforward NN in terms of the proof techniques, as a large proportion of the proof technique in this paper are adapted from Allen-Zhu et al. 2018.