Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper analyzes the dynamics of online learning in two-layer neural networks under the teacher-student scenario. The analysis extends that of Saad and Solla (1995) by considering an input covariance matrix that need not be proportional to the identity matrix. The main contribution is the finding that the plateau phenomenon observed in the learning dynamics of nonlinear neural networks depends on the statistics of the input data.

The three reviewers rated this paper above the acceptance threshold, citing the originality and importance of its contribution. At the same time, two reviewers raised concerns about the clarity of the presentation, and these concerns remain after the authors' rebuttal and the subsequent discussion among the reviewers. I recommend acceptance of this paper in view of its originality and potential importance. I strongly encourage the authors to take the review comments seriously in order to improve the clarity of their presentation.

I would also like to supplement some of the review comments as follows:

- On initial condition of weights: As Reviewer #1 mentioned, the initial condition of the weights in the numerical experiments in Section 4 should be specified explicitly for the sake of reproducibility. In the rebuttal the authors simply stated their belief that the effect of the initial conditions should be limited, but showing numerical results for several different initial conditions would provide direct and explicit evidence.

- On number of hidden units: In this study the numbers of hidden units are K and M for the student and teacher networks, respectively, so the total number of order parameters is roughly O(d(K+M)^2). As Reviewer #1 mentioned, on the other hand, several papers have studied macroscopic descriptions of deep neural networks with random weights, where the limit in which the numbers of units in the hidden layers go to infinity is typically considered. A natural question, which is in fact one of Reviewer #1's comments, is whether the plateau phenomenon persists in that limit as well. It would be nice if some comments on such extensions were given.

- On specialization of student hidden units: I feel that little insight has been provided as to why the occurrence of the plateau phenomenon depends on the input covariance. As commented by Reviewer #2, specialization of student hidden units, or equivalently, as the authors briefly mention in lines 20-24, breaking of the intrinsic symmetry with respect to exchange of student hidden units, should be responsible for the plateau phenomenon. The authors nevertheless did not discuss this issue any further, even in their rebuttal. It is important to investigate the mechanism causing the plateau phenomenon rather than just to demonstrate that a non-identity covariance can cause plateaux.

- On reference list: The References section currently lists only 8 papers, which is very few compared with typical NeurIPS submissions. In fact, in the rebuttal the authors cited 4 more papers in their additional discussion, and I suspect that there are even more relevant to the points mentioned above. These papers should be included in the reference list, as references are exempt from the page limit.

Minor points:

- Line 74: N in 1 \le n \le N should read M. In the displayed equation that follows the line, y_N should be y_M.
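
For concreteness, the multi-seed check requested in the comment on initial conditions could look something like the sketch below. It is not the authors' experimental setup. It assumes a soft committee machine with the erf activation used by Saad and Solla (1995), and a hypothetical diagonal input covariance with two distinct eigenvalues. The function name, network sizes, learning rate, and initialization scale are likewise illustrative assumptions. The teacher is held fixed, so only the student initialization and the sample sequence change with the seed.

```python
import numpy as np
from math import erf, sqrt, pi

_erf = np.vectorize(erf)


def g(h):
    # Activation g(h) = erf(h / sqrt(2)), as in Saad and Solla (1995).
    return _erf(h / sqrt(2.0))


def g_prime(h):
    # Derivative of erf(h / sqrt(2)) with respect to h.
    return sqrt(2.0 / pi) * np.exp(-0.5 * h ** 2)


def run_online_sgd(seed, n_in=50, k_student=2, m_teacher=2, eta=0.5,
                   steps=2000, init_scale=1e-3, eval_every=200):
    """Return the test-error curve of one online-learning run (all sizes hypothetical)."""
    rng = np.random.default_rng(seed)

    # Assumed non-identity covariance: diagonal, two distinct eigenvalues, mean 1.
    eigvals = np.where(np.arange(n_in) < n_in // 2, 1.6, 0.4)
    std = np.sqrt(eigvals)

    # The teacher and the test set use a fixed generator, so they are the same for every seed.
    fixed_rng = np.random.default_rng(0)
    teacher = fixed_rng.standard_normal((m_teacher, n_in)) / sqrt(n_in)
    x_test = fixed_rng.standard_normal((2000, n_in)) * std
    y_test = g(x_test @ teacher.T).sum(axis=1)

    # Student initial condition: the quantity varied across seeds.
    student = init_scale * rng.standard_normal((k_student, n_in))

    errors = []
    for t in range(steps + 1):
        if t % eval_every == 0:
            pred = g(x_test @ student.T).sum(axis=1)
            errors.append(0.5 * np.mean((pred - y_test) ** 2))
        # One online SGD step on a fresh sample x ~ N(0, diag(eigvals)).
        x = rng.standard_normal(n_in) * std
        h = student @ x
        delta = (g(teacher @ x).sum() - g(h).sum()) * g_prime(h)
        student += (eta / n_in) * np.outer(delta, x)
    return np.array(errors)


if __name__ == "__main__":
    curves = np.stack([run_online_sgd(seed) for seed in range(5)])
    print("mean error per checkpoint:", curves.mean(axis=0))
    print("std across seeds:", curves.std(axis=0))
```

Plotting the individual curves, or reporting the spread of plateau lengths across seeds, would show directly how far the conclusions depend on the initial weights.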