NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 2446
Title: Understanding and Improving Layer Normalization

Reviewer 1

Originality: Though the analysis is new, the proposed algorithm is not. Quality: To motivate AdaNorm, DetachNorm is first used for an ablation study. Significance: Given the minor empirical improvements, it seems unlikely that practitioners will widely adopt AdaNorm.

Reviewer 2

The paper presents a simple yet effective idea, based on a rigorous analysis of the effects of the bias and gain used in LayerNorm. The paper is written very clearly. The experimental setup is excellent, and all the details on model size and hyper-parameters are described in the supplementary material. There is a typo in Table 2: imporvement -> improvement
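For reference, the bias and gain mentioned here are the learned elementwise shift and scale applied after normalization in standard LayerNorm. The following is a minimal pure-Python sketch of that usual formulation, not the paper's code; the function name `layer_norm` and the `eps` default are illustrative assumptions.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Standard LayerNorm over one feature vector (illustrative sketch)."""
    h = len(x)
    mean = sum(x) / h
    # Biased (population) variance over the feature dimension.
    var = sum((v - mean) ** 2 for v in x) / h
    std = math.sqrt(var + eps)  # eps is an assumed stabilizer
    # Normalize, then apply the learned elementwise gain and bias.
    return [g * (v - mean) / std + b for v, g, b in zip(x, gain, bias)]

x = [1.0, 2.0, 3.0, 4.0]
# With gain = 1 and bias = 0 the output is just the normalized input.
y = layer_norm(x, gain=[1.0] * 4, bias=[0.0] * 4)
```

With unit gain and zero bias, `y` has zero mean and approximately unit variance; any other gain and bias rescale and shift each feature independently.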

Reviewer 3

The authors explore two ideas in this work: 1) an analysis of the different parts of LayerNorm (the normalization itself and its gradients); 2) a new adaptive LayerNorm in which the bias and gain of LayerNorm are replaced by a new adaptive scaling approach. The authors provide a theoretical analysis of both parts, along with an empirical analysis that explains and justifies the two topics explored.

Detailed comments are given below:
- The title is very vague and should be modified or expanded to better convey the topics being explored.
- As mentioned above, the paper consists of two somewhat unrelated topics. Although the authors attempt to connect them, it is not clear that they belong under one roof.
- The equation for the variance in (1) seems wrong. Is that the equation that was actually used, or is it a typo? I tend to think the former, as it is repeated several times throughout the paper.
- DetachNorm does seem interesting, but the poor results could be due, at least in part, to the fact that the model parameters are not being trained to optimize the intended loss, because the gradient is simply wrong (parts of the gradient are detached, and essentially random noise is added to the model through the special copy function). Can you please elaborate on this? Maybe I am missing something.
- I am not sure that (3) is necessary, as it is most likely already clear to readers. But it is OK to leave it in if the authors see fit.
- Table 2 could and should be consolidated, as it basically repeats the same information twice.
- At the start of Section 4 the authors say "The reason for overfitting is probably ...", but the reason has not been shown earlier in the paper. This seems like a guess.
- Also, the proposed AdaNorm does not directly address the items discussed in the first part of the paper. The two topics seem to be glued together.
- "Unlike bias and gain being fixed ...": bias and gain have not actually been shown anywhere in the paper. It would be good to remove some of the unnecessary equations and add some that would actually help a reader follow the text.
- In Theorem 2 and above, should the absolute value be around each z_i only, and not around the entire sum? What if the z_i's are large but cancel each other out?
- Some notation is used without being introduced, such as E below Theorem 2. This should be fixed.
- "To prevent ... dismissing the feature of gradient": what does this mean? This is mentioned only very briefly, and it should be explained much better.

All in all, it is an interesting paper, but a number of critical parts are not well explained and should be elaborated much further.

### After reading the authors' feedback ###
Thank you for the feedback. Not all of the comments were addressed, which is not ideal, but the response could be sufficient given the limited space. I am increasing my recommendation to "weak accept", as I don't believe that the paper in its current state warrants higher than that.
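The DetachNorm comment in Reviewer 3's list can be made concrete with a small example. This is a minimal pure-Python sketch, and it assumes that DetachNorm keeps the forward pass of the normalization unchanged while treating the mean and standard deviation as constants in the backward pass. That reading follows the description in the review, not the paper's code. Gain and bias are omitted, and the upstream gradient `g` and all function names are hypothetical.

```python
import math

def normalize(x):
    """Forward pass: y = (x - mu) / sigma over one feature vector."""
    h = len(x)
    mu = sum(x) / h
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / h)
    return [(v - mu) / sigma for v in x], sigma

def grad_full(x, g):
    """dL/dx when mu and sigma are differentiated as functions of x."""
    y, sigma = normalize(x)
    h = len(x)
    g_mean = sum(g) / h
    gy_mean = sum(gi * yi for gi, yi in zip(g, y)) / h
    # Chain rule through mu and sigma: the mean of g and its projection
    # onto y are both subtracted before dividing by sigma.
    return [(gi - g_mean - yi * gy_mean) / sigma for gi, yi in zip(g, y)]

def grad_detached(x, g):
    """dL/dx when mu and sigma are treated as constants (assumed DetachNorm)."""
    _, sigma = normalize(x)
    return [gi / sigma for gi in g]

x = [0.5, -1.0, 2.0, 3.5]
g = [1.0, 0.2, -0.7, 0.4]  # hypothetical upstream gradient dL/dy
full = grad_full(x, g)
detached = grad_detached(x, g)
```

Under these assumptions, `full` sums to zero and is orthogonal to the normalized output, while `detached` is simply `g` divided by sigma. The two share the same forward pass, and in this sketch the detached gradient differs from the true gradient of the loss systematically rather than randomly.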