NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 4444
Title: On Exact Computation with an Infinitely Wide Neural Net


This paper has two main contributions. First, it proposes the convolutional extension of the neural tangent kernel (CNTK) and designs an algorithm for "exact computation with an infinitely wide neural net". The algorithm enables squared-loss kernel regression with the CNTKs corresponding to infinitely wide vanilla CNNs with ReLU activations, both with and without global average pooling. Its time complexity is linear in the depth and quadratic in the amount of data and in the height and width of the images, a complexity that previous papers believed to be unattainable. Second, the paper gives the first non-asymptotic proof of the equivalence between a fully trained, sufficiently wide neural net and the kernel regression predictor using the NTK.

The clarity, the novelty, and the significance are all above the corresponding NeurIPS thresholds, so the paper should clearly be accepted. I personally think it could even make a nice oral or spotlight presentation. The paper looks very theoretical at first glance, but in fact it is more than pure theory and is definitely of practical interest. The authors and the expert reviewers all ranked the first contribution higher; the experimental observations, and the discussion of them in lines 273--293, should interest the majority of machine learning researchers. So, at least to me, this paper is half theoretical and half empirical, and the second, empirical half is much more important. This paper (and the papers it relies on) shows that neural nets are not the only models that can go deep: kernels can go deep too, as in Eq. (9) and the equation below line 259, and they perform reasonably well, as in Table 1. I do think that understanding kernel methods better would help us understand deep methods better. I am therefore recommending an oral or spotlight in order to draw the attention of DL practitioners to DL theory and to the question of what our next step should be (note that I am not from any theory area and I do not often stand on the theory side).
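To make the summary above concrete for non-theory readers (myself included), here is a minimal sketch of the underlying construction: the fully connected ReLU NTK recursion plugged into a kernel regression predictor. This is not the paper's CNTK algorithm (which additionally tracks covariances between pixel pairs and handles pooling); the function names, the chosen depth, and the small ridge term are my own illustrative choices, not the authors'.

```python
# A minimal sketch (not the authors' CNTK algorithm): the fully-connected
# ReLU NTK recursion of Jacot et al. (2018) and a kernel regression
# predictor built on top of it. All names here are illustrative only.
import numpy as np


def relu_ntk_gram(X1, X2, depth):
    """NTK Gram matrix between the rows of X1 and X2 for a fully-connected
    ReLU net with `depth` hidden layers (closed-form arc-cosine expectations)."""
    sig = X1 @ X2.T                           # Sigma^(0)(x, x') = <x, x'>
    # With the c_sigma = 2 normalization for ReLU, the diagonal terms
    # Sigma^(h)(x, x) = ||x||^2 stay constant across layers.
    sq = np.sqrt(np.outer(np.sum(X1 * X1, axis=1),
                          np.sum(X2 * X2, axis=1))) + 1e-12
    ntk = sig.copy()                          # Theta^(0) = Sigma^(0)
    for _ in range(depth):
        cos_t = np.clip(sig / sq, -1.0, 1.0)
        t = np.arccos(cos_t)
        # ReLU closed forms:
        #   Sigma^(h)     = sqrt(ab)/pi * (sin t + (pi - t) cos t)
        #   Sigma_dot^(h) = (pi - t) / pi
        sig_dot = (np.pi - t) / np.pi
        sig = sq * (np.sin(t) + (np.pi - t) * cos_t) / np.pi
        # NTK recursion: Theta^(h) = Theta^(h-1) * Sigma_dot^(h) + Sigma^(h)
        ntk = ntk * sig_dot + sig
    return ntk


# Kernel regression: f(x) = Theta(x, X) (Theta(X, X) + eps I)^{-1} y.
# A tiny ridge eps is added only for numerical stability; the paper's
# predictor is ridgeless.
rng = np.random.default_rng(0)
X_train, y_train = rng.standard_normal((100, 20)), rng.standard_normal(100)
X_test = rng.standard_normal((10, 20))
K_train = relu_ntk_gram(X_train, X_train, depth=3)
K_cross = relu_ntk_gram(X_test, X_train, depth=3)
alpha = np.linalg.solve(K_train + 1e-6 * np.eye(len(y_train)), y_train)
y_pred = K_cross @ alpha
```

As I understand it, the CNTK replaces this per-pair scalar recursion with a recursion over covariances between all pixel pairs, which is where the quadratic dependence on the image height and width comes from.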
In order to address the broader NeurIPS audience, I offer my quick thoughts on the paper (I did not check the appendices due to limited time):

A. The first impression (i.e., that this is a theory paper) may stop many readers from going through the paper and reduce its impact. One possible reason is that the paper has no figure and only one table. Since the current version uses itemize environments a lot, you may consider making some "colorful" figures instead of nested lists.

B. I feel that the messages in lines 273--293 are not fully conveyed in the introduction, where only a subset of them is given. You may consider separating the experimental observations from "our contributions", compressing those observations, and inserting all of them right after "our contributions".

C. Up to now I am not sure who first proposed the concept of the CNTK, due to some writing issues. It reads strangely to say "one may also generalize the NTK to convolutional neural nets, and we call the corresponding kernel Convolutional Neural Tangent Kernel (CNTK)", because "one" suggests it is someone else. Moreover, you wrote "the random feature methods for approximating CNTK in earlier work", which again suggests the CNTK already exists in earlier work. When I evaluated the novelty of the paper, I assumed this concept was proposed here, but this point should be clarified.

D. Some claims are not supported by citations, for example in the introduction: that it has long been known that weakly-trained convolutional nets have reasonable performance on MNIST and CIFAR-10, and that weakly-trained nets that are fully connected instead of convolutional can also be thought of as "multi-layer random kitchen sinks", which also have a long history. You are responsible for substantiating "long been known" and "long history". The final version should be as friendly as possible to everyone attending the conference.

Last but not least, there are still two concerns from the reviewers: one about the "specific bound on m" and one about the "positivity of H*". For the second concern, the reviewer thinks this theoretical guarantee should really be provided. Please address both in the final version.