Paper ID: | 3462 |
---|---|
Title: | The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks |

This well-written paper is the latest in a series of works that analyze how signals propagate in random neural networks by studying the mean and variance of activations and gradients given random inputs and weights. The technical accomplishment can be considered incremental with respect to this series of works. However, while the techniques used are not new, the analysis leads to new insights into the use of batch/layer normalization. In particular, it takes a close look at the mechanisms that lead to pathological sharpness in DNNs, showing that mean subtraction is the main ingredient for countering these mechanisms. While these claims would have to be verified in more complicated settings (e.g. with more complicated distributions on inputs and weights), it is an important first step to know that they hold for such simple networks.

Minor points follow:

- In the sum in E(\theta) (end of page 3), the upper limit is defined but the lower one is not.
- \lambda_max seems to concentrate; is it possible to compute an average value rather than a bound?
- It is worth adding a comment about the difference between (20)/(21) and (23)/(24).
- The normalization techniques are studied considering the whole dataset, and not batches as is typically done in SGD; is there any way of adapting the analysis?
- I wonder if the authors have any comment on Kunstner et al., "Limitations of the Empirical Fisher Approximation", who show that the empirical FIM is not necessarily representative of the geometry of the model.

# Post-rebuttal addendum

I was previously unaware of the publication of [21] at AISTATS, and as such I share the other reviewers' concern regarding the originality of this work. I recommend that the clarifications given in the response be included in an eventual updated version of this manuscript. Otherwise, I am quite pleased by the effort the authors have put into the rebuttal, and in particular with the experiments on trained networks. I think this is further evidence that this work might be useful in practice. Following these two points, one negative and one positive, I have decided to keep my score the same.
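For readers who want a concrete picture of the mean-subtraction claim discussed in this review, the following is a minimal toy sketch, not the authors' setup. It assumes a fixed random ReLU feature layer and looks only at the readout weights, where the per-sample gradient of a linear output is the feature vector itself. The sizes `n_samples`, `d_in`, `width` and the helper `top_eigenvalue` are hypothetical choices made for illustration.

```python
import numpy as np

def top_eigenvalue(grads):
    """Largest eigenvalue of F = grads.T @ grads / N, computed via the N x N dual."""
    n = grads.shape[0]
    return np.linalg.eigvalsh(grads @ grads.T / n)[-1]

rng = np.random.default_rng(0)
n_samples, d_in, width = 128, 32, 256  # hypothetical toy sizes

X = rng.standard_normal((n_samples, d_in))
W = rng.standard_normal((d_in, width)) / np.sqrt(d_in)
H = np.maximum(X @ W, 0.0)  # ReLU features; every unit has a positive mean

# For a linear readout f(x) = w . h(x), the gradient w.r.t. w is h(x),
# so the readout block of the empirical FIM is the second moment of h.
lam_raw = top_eigenvalue(H)

# Subtracting the dataset mean of the output, f(x) - mean_x f(x),
# turns the gradient into h(x) - mean_x h(x), i.e. the FIM block
# becomes the covariance of h instead of its second moment.
H_centered = H - H.mean(axis=0, keepdims=True)
lam_centered = top_eigenvalue(H_centered)

print(f"lambda_max without mean subtraction: {lam_raw:.2f}")
print(f"lambda_max with mean subtraction:    {lam_centered:.2f}")
```

In this toy setting the second-moment matrix equals the covariance plus the outer product of the feature mean with itself, so removing the mean can only lower the largest eigenvalue. Whether the same holds for all parameters of a deep network is what the paper's analysis addresses.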

Originality: In terms of overall technique, this paper uses techniques that have already been developed in previous mean-field papers. However, the application of these techniques to the Fisher information is a novel, challenging, and worthwhile contribution. Moreover, looking at the Fisher information in combination with normalization techniques is a sufficiently novel perspective to warrant publication.

Quality: I have somewhat mixed opinions about this paper in terms of quality (which overlaps with its significance). On the one hand, the authors have carefully considered a number of different approaches to normalization theoretically. Indeed, I think the distinction between the different normalization settings applied to the spectrum of the Fisher is the key takeaway message of the paper. Having said that, the experimental portion of the paper could use work. In particular, the authors remark early on that one should expect the maximum achievable learning rate to scale as the inverse of the maximum eigenvalue of the Fisher. Moreover, batch normalization in the final layer is shown to reduce the maximum eigenvalue. It would be nice to combine these results to show that this improved conditioning does indeed translate to a higher maximum achievable learning rate. Incidentally, it would also be nice to confirm the maximum-eigenvalue behavior of the various normalization types.

Clarity: I thought the clarity of the exposition was good. I was able to follow the results and understand the important aspects of the paper clearly. I also feel that, given the exposition, if I were to sit down I could probably reproduce most of the results in a reasonable amount of time. One minor point: as I mentioned above, I think an important technique introduced in the paper is the duality between the Fisher and the NTK (F vs. F*). I would recommend that the authors move a discussion of this into the main text.

Significance: I think the results _could_ be significant. In particular, I think the notion of normalizing the last layer is quite novel (especially since one need only perform mean subtraction). I also think that the theoretical techniques introduced here may end up being impactful. As elaborated above, however, I think that to maximize impact the authors should add significant experimental work, either to this paper or in a follow-up work.

---------

Update: After reading the authors' rebuttal and the subsequent discussion, I have decided to raise my score to a 7. I think there is sufficient novelty w.r.t. [21], and the new experiment seems interesting!
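The two technical points raised in this review, the F vs. F* duality and the link between the largest eigenvalue and the maximum learning rate, can be illustrated with a small sketch. It is an illustration under stated assumptions rather than the paper's experiment: `J` is a random stand-in for per-sample output gradients of a scalar-output model, the loss is the quadratic form 0.5 * theta^T F theta, and the names `n_samples`, `n_params`, and `run_gd` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_params = 8, 50  # hypothetical sizes, scalar output
J = rng.standard_normal((n_samples, n_params))  # stand-in for per-sample gradients

# FIM-like matrix (P x P) and its dual, NTK-like Gram matrix (N x N).
F = J.T @ J / n_samples
F_star = J @ J.T / n_samples

# Both matrices share their nonzero eigenvalues, so the largest one agrees.
lam_max = np.linalg.eigvalsh(F)[-1]
lam_max_dual = np.linalg.eigvalsh(F_star)[-1]

def loss(theta):
    """Quadratic loss whose Hessian is exactly F."""
    return 0.5 * theta @ F @ theta

def run_gd(theta_init, lr, steps=200):
    """Plain gradient descent on the quadratic loss; returns the final loss."""
    theta = theta_init.copy()
    for _ in range(steps):
        theta = theta - lr * (F @ theta)
    return loss(theta)

theta0 = rng.standard_normal(n_params)
loss_start = loss(theta0)
loss_below = run_gd(theta0, 1.9 / lam_max)  # just under the 2 / lambda_max threshold
loss_above = run_gd(theta0, 2.1 / lam_max)  # just over the threshold

print(f"lambda_max (F) = {lam_max:.4f}, lambda_max (F*) = {lam_max_dual:.4f}")
print(f"loss: start {loss_start:.3e}, lr below {loss_below:.3e}, lr above {loss_above:.3e}")
```

For this quadratic loss, gradient descent contracts along every eigendirection only when the step size is below 2 / lambda_max, which is why a smaller maximum eigenvalue permits a larger learning rate. Working with the N x N dual is the cheaper route whenever the number of samples is much smaller than the number of parameters.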

This paper is well written and properly cited. The interplay between batch normalization and the behavior of the Fisher information is interesting. However, it is questionable whether this calculation explains why batch normalization helps optimization. The authors calculate the FIM of the training loss at initialization, which is not the same as the Hessian of the training loss. Therefore, it is questionable to say that the FIM reflects the "landscape" of the training loss function. It is also not clear whether suppressing the extreme eigenvalues of the FIM makes the Hessian well-conditioned (at initialization and along the whole trajectory). Hence, there is a gap in showing that batch normalization permits a larger learning rate. Is it due to a technical restriction that one cannot calculate the behavior of the eigenvalues of the Hessian at initialization? In terms of technical contribution, are there any new ingredients in the calculations for the batch normalization settings, or is it simply a re-application of the calculations in [1]?

[1] Ryo Karakida, Shotaro Akaho, and Shunichi Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach.

--------

I am satisfied with the authors' response to my questions. I am also glad to see that the authors ran more experiments for the response. I think I can increase my score to 7. However, I think the "landscape" terminology used in this paper is not accurate. The FIM is dual to the NTK. From the NTK point of view, the dynamics of neural networks are roughly linear, and what matters for convergence is the condition number of the NTK (which shares its nonzero eigenvalues with the FIM). The "local shape of landscape" terminology may confuse readers. I think the authors could explain this more clearly in the paper.