Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
- Their analysis of the dependence of the plateau on the spectral density is based on an unrealistic spectrum in which the eigenvalues are highly degenerate. It would make more sense to show results for data with low-dimensional structure, in which the first one or two eigenvalues are non-zero and the rest are either zero or of order epsilon. Do the conclusions for the two-eigenvalue case still hold in this example? It is hard for me to see what I should learn from Figures 5 and 6.
- The dependence of the learning dynamics on the spectral properties of the input data is not new; it was previously studied by Saxe et al. (arXiv, 2013) for simple linear networks. It would be appropriate to mention or discuss these results in the text.
- In their analysis, the authors do not consider the importance of the initial conditions of the weights. It has previously been shown that the initial conditions have a big impact on the trainability and learning dynamics of these networks. In this case, they would be defined as the initial conditions on the order parameters Q, R, and D.
- The analysis here seems tractable only for networks with a small number of hidden units. This regime is different from that of many recent mean-field studies, which assume that the number of hidden units is large in the thermodynamic limit. It is not clear to me whether the plateau phenomenon, in this case, is the result of the bottleneck structure in which a very high-dimensional input (N) is projected onto an O(1) number of non-linear units. Are the plateaus a result of a bottleneck in the network? If we were to look at a deep network in which the number of hidden units is also taken to the thermodynamic limit (Schoenholz, Gilmer, Ganguli & Sohl-Dickstein, 2016), would the plateaus arise from the final layer?
- Some of the mean-field derivations are cumbersome and unclear. For example, it is not clear to me why the authors needed to define the variable Omega and its dynamics. It does not seem to serve any conclusion, and it was hard to understand. As another example, the authors define further functions f, g, and h without stating what they are. Furthermore, the reuse of g, which was used above for the transfer function, is confusing.

***update***
After the authors' response, I am still not convinced about the meaningfulness of the simple spectrum. Furthermore, my comments on the initial conditions and the width of the hidden layer were not addressed properly. However, I agree with reviewer #2 that the work is interesting and original and should be published. I am recommending publication of this work but strongly urge the authors to address some of my concerns. First and foremost, the clarity and readability of the paper should be improved to make it accessible to a wider audience.
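
The low-dimensional spectrum asked for in the first comment of this review (one or two O(1) eigenvalues, the rest of order epsilon) could be set up along the following lines. This is only a minimal sketch, not the authors' setup: it assumes a diagonal input covariance, and the names `low_rank_spectrum`, `sample_input`, and `empirical_variances`, as well as the default values of `top` and `eps`, are hypothetical choices made for illustration.

```python
import random


def low_rank_spectrum(n, top=(1.0, 0.5), eps=1e-3):
    """Return n eigenvalues: the few O(1) values in `top`, then copies of eps.

    The values of `top` and `eps` are illustrative assumptions.
    """
    if len(top) > n:
        raise ValueError("more leading eigenvalues than dimensions")
    return list(top) + [eps] * (n - len(top))


def sample_input(spectrum, rng):
    """Draw one input x ~ N(0, diag(spectrum)).

    A diagonal covariance is assumed, so each coordinate is an independent
    Gaussian whose variance is the corresponding eigenvalue.
    """
    return [rng.gauss(0.0, lam ** 0.5) for lam in spectrum]


def empirical_variances(spectrum, n_samples, seed=0):
    """Estimate the per-coordinate variance from n_samples draws."""
    rng = random.Random(seed)
    sums = [0.0] * len(spectrum)
    for _ in range(n_samples):
        x = sample_input(spectrum, rng)
        for i, xi in enumerate(x):
            sums[i] += xi * xi
    return [s / n_samples for s in sums]
```

Feeding inputs drawn this way to the student network, instead of inputs drawn from the degenerate spectrum, would show directly whether the two-eigenvalue conclusions carry over.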
I think this is a very nice and original contribution that tries to bridge properties of simple models of neural networks, the teacher-student soft committee machine in the present case, with empirical observations available in deep learning. Adjusting the models to match the observed behaviour seems to be a very valuable way to proceed towards a better understanding of deep learning phenomena. The paper is well written.

Comments and questions:
** Can the authors define the relation between the number of epochs and the number of normalized steps? In particular, in Fig. 1, did the system see only 10*Epoch samples, or was the whole of MNIST passed through several times by the SGD?
** To give a better idea of how good the neural network the authors are using is, can they state the accuracy on the test set corresponding to Fig. 1?
** The authors summarize one of their contributions by saying: "By analyzing the macroscopic system we derived, we showed that the dynamics of learning depends only on the eigenvalue distribution of the covariance matrix of the input data, provided that the learning rate is sufficiently small." This should be stated more precisely. Surely the dynamics of learning also depends on the way the labels were generated, which is not considered in this sentence.
** The plateau phenomenon is intimately related to the specialization of hidden student units to the teacher units. I think it would be valuable if the authors discussed this connection quantitatively and evaluated their theory in this respect. In particular, the authors conclude: "Considering this, the claim that the plateau phenomenon does not occur in learning of MNIST is controversy; this suggests the possibility that we are observing quite long and low plateaus." Shouldn't the specialization of hidden units, or the lack thereof, be a good measure to resolve this "controversy"?

------ post-feedback
I have read the other reviews and the authors' feedback. I maintain my score. The problem this paper addresses is, in my opinion, important. At the same time, I urge the authors to consider all comments from the reviews to make their paper clearer to the NeurIPS audience.
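
One way to make the specialization question in this review quantitative, in a teacher-student setting where the teacher weights are known, is to compare each student unit's overlaps with the teacher units. The sketch below is a hypothetical illustration, not a measure taken from the paper: it uses plain cosine overlaps between weight vectors (the paper's order parameters may weight them by the input covariance), and the names `cosine_overlaps` and `specialization_gaps` are invented here.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_overlaps(students, teachers):
    """Return the matrix of cosine overlaps, one row per student unit.

    Assumes every weight vector is non-zero.
    """
    rows = []
    for j in students:
        norm_j = math.sqrt(dot(j, j))
        row = []
        for b in teachers:
            norm_b = math.sqrt(dot(b, b))
            row.append(dot(j, b) / (norm_j * norm_b))
        rows.append(row)
    return rows


def specialization_gaps(overlaps):
    """For each student: largest |overlap| minus the mean of the other |overlaps|.

    A gap near 0 means the unit overlaps equally with all teacher units (the
    symmetric state associated with a plateau); a gap near 1 means it has
    locked onto a single teacher unit.
    """
    gaps = []
    for row in overlaps:
        if len(row) < 2:
            raise ValueError("need at least two teacher units")
        mags = sorted((abs(r) for r in row), reverse=True)
        rest = mags[1:]
        gaps.append(mags[0] - sum(rest) / len(rest))
    return gaps
```

Tracking these gaps over training would show whether a flat stretch of the loss coincides with unspecialized units. For MNIST, where no teacher exists, some proxy for the teacher units would be needed, which this sketch does not attempt.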
I read the paper, but did not check the mathematical details. I think the paper is interesting and it seems that it relies on standard mathematical analysis. I think the results are important and advance the knowledge about neural nets.