NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at Vancouver Convention Center
Paper ID: 1145
Title: Random deep neural networks are biased towards simple functions

Reviewer 1

The manuscript is clear, original, well written, and contributes to the theoretical understanding of an important problem. The only weakness appears to be the overreliance on the Gaussian assumption for the features and on the assumed regularity of the covariance function, which is satisfied by ReLU but not by other activation functions such as tanh.
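To make the covariance-regularity point concrete: for a single ReLU layer with i.i.d. standard Gaussian weights, the induced covariance function has the well-known closed form of the order-1 arc-cosine kernel (Cho & Saul, 2009). The following is my own minimal numerical check of that identity, not code from the paper; the dimensions, vectors, and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 3, 500_000

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.6, 0.8, 0.0])          # unit vectors, angle t = arccos(0.6)

W = rng.standard_normal((n_features, d))   # i.i.d. N(0, 1) weight rows
relu = lambda v: np.maximum(v, 0.0)

# Monte Carlo estimate of the ReLU covariance E[relu(w.x) * relu(w.y)], w ~ N(0, I)
emp = relu(W @ x) @ relu(W @ y) / n_features

# closed form for unit x, y: (sin t + (pi - t) cos t) / (2 pi)
t = np.arccos(np.clip(x @ y, -1.0, 1.0))
closed = (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)
```

The smoothness of this kernel as a function of the angle t is exactly the regularity the proofs exploit; for tanh no comparably simple closed form is available.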

Reviewer 2

Summary: If we assume the inputs of DNNs to be binary (e.g. black-and-white MNIST images), then the authors show that the Hamming distance between data points with different classes (e.g. the number of pixels that need to be flipped) grows at least at a \sqrt{N / log N} rate. The proof utilizes the (now standard) Gaussian process approximation results for DNNs. This theoretical result suggests that DNNs produce functions with low Kolmogorov complexity, which is useful for studying generalization bounds of DNNs. Some experiments on random data and on tiny nets on MNIST (in the supplement) are presented, empirically verifying the bounds. I tend to weakly reject this paper due to the weakness of the empirical and theoretical results, and due to the organization of the paper (the MNIST results belong in the main text).

Given result 2), I'd strongly suggest that the authors think about what might be possible in an adversarial setting. Connecting the average number of features that must be changed to flip a classification to adversarial robustness seems like a potential way to provide some sort of guarantee against adversarial attacks. This could provide another theoretical result if done properly.

Originality: This paper seems to be a natural theoretical extension of the work of [27]. In particular, the authors of that paper compute the marginal likelihood of the GP representation of DNNs and derive a generalization bound based on it, with extensive empirical results. Similarly, they also hypothesize and empirically study the string complexity of small functions like the one studied here, empirically finding that lower-complexity functions are produced with higher probability. As such, I don't think that the theoretical results and limited empirical results (only on MNIST and toy data) are significant or original enough for a full paper at this time. I'd strongly suggest utilizing the Hamming distance bound proved in this paper in an experimental set-up like that of Section 4 and Appendix F of [27]. In particular, the results in Figure 1 ought to be replicable on MNIST and on small portions of (binarized?) CIFAR for the Hamming-distance-type bounds shown in this paper.

Clarity: Overall, the paper is mostly nice to read and free of grammatical errors. One major organizational issue is that the MNIST results are heavily mentioned in the abstract and conclusion but appear nowhere in the main text; they are found only in the Appendix. This is somewhat misleading, as a reader of the paper might be confused and not see them. I'd strongly suggest that these results be moved into the main text, either by swapping out the toy-problem experiments or by moving one of the detailed proofs of Theorem 1 or 3 to the Appendix instead. Lines 73-82: "A fundamental consequence ... provided by the Kolmogorov complexity." This sentence and the rest of the paragraph could be rephrased as "DNNs produce a class of functions that are a subset of all functions with low Kolmogorov complexity." The rest is unclear logically and somewhat pedantic. Lines 54-82: This paragraph seems to run on. To further emphasize the contributions, it would be nicer to have bullet points, or at least to split the paragraph here.

Significance: To make this quite interesting from a theoretical perspective, can the Hamming distance be shown to a) align with some sort of PAC-Bayes bound and/or b) empirically correlate with generalization error? That is, can the connection between simplicity of functions and generalization error be made rigorous from a learning-theoretic perspective? It's possible that this is somewhere in the references, but it should be specifically mentioned in the main text.

Quality: Lines 135-142: Assuming that the inputs "l[ie] on a sphere" is not quite the same as assuming that the inputs are binarized on a grid. Additionally, the binarized-on-a-grid assumption seems like it would only hold for binary images. How would the authors extend this to either tabular (continuous) data on a sphere or standard color-image data in [0, 1]^d x [0, 1]^d x [0, 1]^d? Line 150: "from the central limit theorem, [the post-activation] has a Gaussian distribution". While this assumption is standard in the literature [39, 41, 42, and others], the rate of convergence here could be arbitrarily slow (we'd actually expect the post-activations to have much heavier tails; see Vladimirova et al., 2019). This ought to be mentioned.

Appendix: In my understanding, the relationship between Hamming distances on random data, the MNIST training data, and the MNIST test data is quite under-explored. While the proofs in the paper seem to be mathematically interesting, converting these bounds into practically useful tools seems to require explaining the inflation of the distances observed on real data. Why do the authors think that the Hamming distances on the real image datasets are inflated?

Line 152: Could the authors explain how the function F(.) should be understood throughout the proof? From skimming the supplementary material, it seems intimately related to the basis expansion of the kernel corresponding to the activation function. If this is the explanation, a sentence saying so would seriously help in understanding the proofs.

References: Vladimirova et al., 2019. Understanding Priors in Bayesian Neural Networks at the Unit Level. ICML.

--------- Post-Rebuttal

I'm inclined to raise my score to a weak accept, as the authors have convinced me a) that they will move the MNIST results to the main text as necessary and b) that the bounds on the Hamming distance can be extended to different data types. My concern about the potential weakness of this work in comparison to [27] has also been alleviated somewhat by the other reviewers. I'd also like to thank the authors for providing a further experiment demonstrating that the Hamming distance computation seems to be empirically interesting on real datasets. Also, please adjust the citation of [27] to include the OpenReview link (the published ICLR version).
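To illustrate the heavier-tails concern numerically, here is a small sketch of my own (widths, depth, and sample counts are arbitrary choices, and this is not the authors' setup): marginally over the weights, the first-layer pre-activation of a finite-width ReLU network is exactly Gaussian, while deeper pre-activations are scale mixtures of Gaussians and show positive excess kurtosis, i.e. heavier tails than the CLT argument suggests.

```python
import numpy as np

rng = np.random.default_rng(1)
width, depth, n_draws = 50, 4, 10_000
x = np.ones(width) / np.sqrt(width)        # fixed unit-norm input

def excess_kurtosis(s):
    s = (s - s.mean()) / s.std()
    return float((s**4).mean() - 3.0)      # 0 for an exact Gaussian

first, last = [], []
for _ in range(n_draws):                   # fresh random weights each draw
    h = x
    for layer in range(1, depth + 1):
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)  # He scaling
        z = W @ h                          # pre-activations of this layer
        if layer == 1:
            first.append(z[0])             # exactly Gaussian over weight draws
        if layer == depth:
            last.append(z[0])              # scale mixture: heavier tails
        h = np.maximum(z, 0.0)

k1 = excess_kurtosis(np.array(first))      # ~0
kd = excess_kurtosis(np.array(last))       # noticeably positive
```

The gap between kd and k1 is the finite-width effect the review asks the authors to acknowledge; it shrinks as the width grows.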

Reviewer 3

After Rebuttal

After reading through the author response: the authors have not only agreed to make the organizational changes requested but have also added a new experiment. We maintain that this paper should be accepted. One reviewer disagreed on the basis that this is an incremental theoretical result relative to [27]; however, the theoretical result of this paper is a new one that expands on just one of the empirical observations made in [27]. The theoretical result here is still a significant one, in our opinion.

-------------------------- Before Rebuttal

The authors present a novel proof regarding the Hamming distance between random bit-string inputs with different binary classifications under random neural networks. Their proof helps explain some experimental results observed in prior work [27], and notably, their predicted theoretical results are experimentally confirmed even after a network has been trained. The mathematical setup of the Gaussian process approximation for their proof is, by the authors' admission, not novel, coming from [41]. However, they expand on this setup to show that the classification outputs of nearby (short-distance) inputs are almost perfectly correlated, i.e. they have the same classification. This expansion on [41] allows them to construct their proof of Theorem 1.

The major drawbacks of this paper are due to presentation style and organization. The immediately obvious mistake is that the authors label Theorems 1 and 3, but there is no Theorem 2. More importantly, the paper requires the reader to bounce between the proof logic in the body of the paper and the proof setup in the supplement. The result on MNIST is also buried in the supplement. While it is understandable that the proof setup must stay in the supplement due to space restrictions, it might be more prudent to move the heuristic argument into the supplement and the MNIST result into the main paper. Another option is to move the proof entirely into the supplement, leaving only the theorem statements, the heuristic argument for intuition, and all experimental results in the main paper. Section 1.1, while very clear in explaining how the authors' work fits into the research landscape, could at least be broken into paragraphs and made a bit more succinct. It would also be nice to see some standard-deviation error bars in the figures. The supplement also mentions the greedy algorithm for finding a nearest-neighbor bit string as being in Section 6.1 of the main paper, when there is no Section 6 in the main paper.
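Since the greedy nearest-neighbor search is only described in the supplement, a sketch of how such a search might look may be useful. The following is my own reconstruction, not the authors' code: the architecture, scaling, and stopping rule are all guessed details. Starting from a random bit string, it repeatedly flips the single bit that moves a random network's output furthest toward the opposite sign; the flip count at the first sign change upper-bounds the Hamming distance from the start point to a differently classified string.

```python
import numpy as np

rng = np.random.default_rng(2)
N, width = 100, 512                        # input bits, hidden width (arbitrary)

# random two-layer ReLU network, He-initialized
W1 = rng.standard_normal((width, N)) * np.sqrt(2.0 / N)
w2 = rng.standard_normal(width) * np.sqrt(2.0 / width)

def f(x):
    s = 2 * x - 1                          # map bits {0,1} -> {-1,+1}
    return w2 @ np.maximum(W1 @ s, 0.0)

def greedy_flip_distance(x):
    """Flips until the sign of f changes; returns the flip count, or None."""
    x = x.copy()
    sign0 = np.sign(f(x))
    for flips in range(1, len(x) + 1):
        # all single-bit-flip neighbors of the current string
        candidates = np.tile(x, (len(x), 1))
        candidates[np.arange(len(x)), np.arange(len(x))] ^= 1
        vals = np.array([f(c) for c in candidates])
        x = candidates[np.argmin(sign0 * vals)]   # most adversarial flip
        if np.sign(f(x)) != sign0:
            return flips
    return None                            # no sign change within N flips

x0 = rng.integers(0, 2, N)
d = greedy_flip_distance(x0)
```

Averaging d over many draws of the network and x0, and comparing against the \sqrt{N / log N} prediction, is roughly the experiment whose location in the paper is mislabeled.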