Review for NeurIPS paper: Deep Transformers with Latent Depth

NeurIPS 2020

Deep Transformers with Latent Depth

Review 1

Summary and Contributions: In order to train much deeper Transformer-based architectures, the authors propose jointly learning the parameters of the model along with a posterior distribution over the layers of that model, representing which ones to select or drop out for a given task. This distribution is approximated with a Gumbel-Softmax and is used to softly weight layers during training and to prune them at inference time, significantly reducing the runtime / memory usage. The authors show that this proposed method allows them to train deeper models (up to 100 layers) without divergence and to do marginally better on masked language modeling and multilingual machine translation.
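To make the mechanism in this summary concrete, here is a minimal NumPy sketch of relaxed layer selection. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy layers, the gated residual form x + z * layer(x), and the 0.5 pruning threshold are all hypothetical choices made for this example.

```python
import numpy as np

def sample_relaxed_gates(logits, temperature, rng):
    """Draw one relaxed (Gumbel-Softmax) selection weight per layer."""
    # Two classes per layer: "use" (logit = logits) and "skip" (logit = 0).
    noise = rng.gumbel(size=(2, logits.size))
    scores = (np.stack([logits, np.zeros_like(logits)]) + noise) / temperature
    scores -= scores.max(axis=0)  # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=0)
    return probs[0]  # soft weight of the "use" class for each layer

def forward(x, layers, gates):
    """Residual stack in which each layer's output is scaled by its gate."""
    for layer, z in zip(layers, gates):
        x = x + z * layer(x)
    return x

def prune(logits, threshold=0.5):
    """Keep layers whose selection probability sigmoid(logit) exceeds the threshold."""
    return 1.0 / (1.0 + np.exp(-logits)) > threshold

# Hypothetical toy setup: four "layers" and their selection logits.
rng = np.random.default_rng(0)
logits = np.array([3.0, -3.0, 0.5, -0.5])
layers = [lambda h, s=s: np.tanh(s * h) for s in (1.0, 2.0, 3.0, 4.0)]
x = np.ones(3)

soft = sample_relaxed_gates(logits, temperature=1.0, rng=rng)
train_out = forward(x, layers, soft)             # training: soft weighting

keep = prune(logits)                             # inference: hard selection
kept_layers = [l for l, k in zip(layers, keep) if k]
infer_out = forward(x, kept_layers, np.ones(len(kept_layers)))
```

In this sketch the pruned stack runs only the kept layers, which is where the inference-time savings described above would come from.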

Strengths: Novel technique for training deeper models that do better on MLM and multilingual machine translation tasks. Interesting exploration of multitask machine translation, where different subsequences of layers are used for different language pairs. Motivates layer pruning at inference time with the observation that the runtime at this stage is directly proportional to the number of layers. The auxiliary loss that they added to encourage an effective utilization of k layers seemed clever and effective. Explores the role of latent layers in the decoder, encoder, or in both.

Weaknesses: Doesn't compare against simpler alternatives, such as one described by the authors themselves: instead of learning a separate distribution over layers, pass in the language as an embedding and have the model implicitly learn to weight layers. The authors suggest that this would require additional (Nxd) parameters, but could allow for greater cross-lingual learning. Although the authors are able to train transformer models with up to 100 layers, it's not clear that this is providing any benefit, either in terms of individual task performance or in terms of being a better multitask model (e.g. for the MLM task, LL is the only method that doesn't diverge for the 96 layer model, but its performance is worse than a static 48 layer model; for O2M multilingual translation, LL-D 24/24 outperforms 12/100 in all but en-bul). Very few ablation studies were performed (only one exploring the impact of different loss terms).

Correctness: The claims and methods seem correct and the tasks on which they evaluated their methods (MLM and multilingual MT) are relevant.

Clarity: The paper is well written overall, but there were a number of grammatical errors and typos, including:
On page 2: "WMT’16 English-German machine translation task task, masked language modeling ," ['task task', and super nit, extra space before comma]
On page 3: "Coupled Variaitonal"; "each of which has appealing property to achieve" -> "each of which has the appealing property of achieving"; "we let each language to learn" -> "we let each language learn"
On page 4: "to demonstrate the effectiveness of the proposed approach at enabling training deeper Transformers [and?] whether this increased depth improves model performance."; "WMT'16 Engligh-German sentence pairs"; "from averaging [the] last 5 checkpoint[s]"
On page 5: "Attenton" is misspelled five times; "Next, we compared the learning curves when training deeper models model" (probably don't mean to have the second "model")
On page 6: "In Table 1 we evaluating.." (we are evaluating or we evaluated?); "summerized" -> "summarized"
On page 7: "an uniform prior Beta" -> "a uniform prior Beta"; "In order to understand how the D_KL loss term affect[s] layer selection policies..."; "In Table 8 we compare [a] deeper model..."
On page 8: "to prevent explode or vanishing gradient" -> "exploding"

Relation to Prior Work: Yes, and the authors compare against prior methods, including DLCL, LayerDrop, and ReZero. However, it would be good to see how deep a network can be trained with the initialization technique of Zhang et al. (2019) and/or how that technique interacts with the proposed method.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper describes an architecture for building multilingual machine translation models in which a transformer network is shared by the language pairs, but a different layer depth may be learned for each language pair at training time. The paper claims that this approach solves the vanishing gradient problem encountered in training very deep networks. This allows low-resource languages to benefit from deeper network architectures than would otherwise be possible. The main drawback of this paper is the lack of discussion of existing work in multilingual machine translation that might permit comparison. [I have read the author rebuttal. Note my comment below on Zhang 2020.]

Strengths: What is presented in this paper is a respectable idea, one that is probably applicable in other areas (e.g., multi-task learning), namely that a large number of transformer layers can be learned without running into the problem of vanishing gradients. The idea that the number of layers can be learned is not itself entirely new; however, the idea that this can be differentially learned for different language pairs in multilingual translation does appear to be novel. The analysis is generally good, and the discussion useful.

Weaknesses: The biggest weakness of this work lies in its lack of comparison with existing machine translation models. This makes it very difficult to assess the contribution of this work relative to existing state-of-the-art systems.

Correctness: Reporting of average BLEU scores in Tables 3, 4, 5 and 6 should probably be accompanied by a confidence interval, or some other measure of statistical variation. (Ditto Tables 7 and 8.) The average scores are for the most part very close, and the differences may not be statistically significant. Adding or removing a random language pair or two might well change the rankings. In Table 1 some BLEU scores are annotated with an unspecified measure of variance. It is good that this is done, but what is it? Is this a confidence interval? A standard error?
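One way to attach an interval to a difference in average BLEU is a paired bootstrap over language pairs. The sketch below is only an illustration of that idea: the function name and the scores are hypothetical and are not taken from the paper, and resampling language pairs captures variation across pairs rather than sentence-level test-set variance.

```python
import numpy as np

def bootstrap_mean_diff(scores_a, scores_b, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a - scores_b) over language pairs."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample language pairs with replacement, keeping each pair's two scores together.
    idx = rng.integers(0, diffs.size, size=(n_resamples, diffs.size))
    means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi

# Hypothetical per-language-pair BLEU scores for two systems (made up for illustration).
latent = [20.4, 25.1, 19.0, 30.6, 22.3, 28.3]
baseline = [20.1, 25.3, 18.7, 30.2, 22.4, 27.9]

mean_diff, lo, hi = bootstrap_mean_diff(latent, baseline)
print(f"mean difference = {mean_diff:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

If the interval includes zero, the ranking of the two systems by average score is not well supported by the set of language pairs evaluated.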

Clarity: The paper is mostly reasonably clear and well written.

Relation to Prior Work: The following paper should probably be addressed in that it offers a different approach to the vanishing gradient problem in deep transformers: Zhang et al. 2019. Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention. EMNLP. I understand that the present paper is couched in terms of the contribution of the algorithm to very deep transformer models. I am surprised, however, at what appears to be little awareness of previous work on multilingual and zero-shot machine translation. The authors probably should refer to Zhang et al. 2020. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation for the references therein. [As the authors point out correctly in their rebuttal, this paper came out after the submission deadline. It was my intention that they should consult the bibliography of this paper for recent work.]

Reproducibility: Yes

Additional Feedback: The "broader impact statement" doesn't really capture what is intended by "broader impact." A statement of what benefits it brings to machine translation quality, for example, would be more appropriate. All papers listed as being on arXiv need to be checked for acceptance at conferences.

Review 3

Summary and Contributions: The authors present a probabilistic setup to learn which layers in a very deep transformer (64-100 layers) to use at test time, per language, for multilingual machine translation. They test this method on single-language-pair translation (WMT16 En-De), masked language modeling, and multilingual many-to-one and one-to-many translation. The impact of latent weights on gradient scaling is investigated, demonstrating that deeper models can be trained without layer normalization.

Strengths: Training deep transformers is a difficult problem, for which multiple methods have been proposed (e.g. LayerDrop, ReZero). The authors introduce another method, aimed particularly at multilingual translation. The method allows for learning to share or not share layers across languages. The method allows for targeting a latent depth (compute budget). Empirical evaluations demonstrate consistent, significant improvement on multilingual MT, as well as inconsistent improvements on masked language modeling and single-language translation (particularly w.r.t. parameter count). A detailed analysis of the learned latent layer selection coefficients is conducted.

Weaknesses: Masked Language Modeling: While the method does stabilize training at larger depths, it achieves slightly worse performance. It's unclear whether this is due to dataset size vs. model capacity. An experiment on layer sizes 24, 48, 96 with different dataset sizes might help to illuminate this.

Correctness: Yes. The authors also run several ablations investigating where latent layer selection is applied (encoder vs. decoder vs. encoder+decoder), the effect of the prior, and the effects of the various sub-losses.

Clarity: Yes. The setup is well explained, and the figures in the analysis are helpful for understanding.

Relation to Prior Work: Yes, to my knowledge.

Reproducibility: Yes

Additional Feedback: Including some measure of compute time/# of layers selected/# of parameters used in inference in Table 1 could help readers understand the performance differences between models. In the current setup, layer selection is dependent only on the language. Could making it dependent on the input as well improve performance? It would be interesting to see a quantitative evaluation of the extent to which different languages use the same layers, for both related and unrelated languages. Would it be possible to compare the performance of latent-depth multilingual models to non-latent-depth models of the same depth on single high-resource language pairs? L196: "is also shown effective" -> "is also shown to be effective" The results in the rebuttal on latent depth vs. multilingual and on Hamming distance between layer selections for similar and dissimilar languages further support the authors' motivation.
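As a sketch of the kind of quantitative comparison meant here, layer sharing between languages could be measured as the Hamming distance between binary layer-selection vectors. The language labels and selection vectors below are hypothetical placeholders, not results from the paper or the rebuttal.

```python
from itertools import combinations

def hamming(a, b):
    """Number of layers on which two binary selection vectors disagree."""
    if len(a) != len(b):
        raise ValueError("selection vectors must have the same number of layers")
    return sum(x != y for x, y in zip(a, b))

def pairwise_hamming(selections):
    """Hamming distance for every unordered pair of languages."""
    return {
        (p, q): hamming(selections[p], selections[q])
        for p, q in combinations(sorted(selections), 2)
    }

# Hypothetical selections over six layers (1 = layer used, 0 = layer skipped).
selections = {
    "lang_a": [1, 1, 0, 1, 0, 1],
    "lang_b": [1, 1, 0, 1, 1, 1],
    "lang_c": [0, 1, 1, 0, 1, 1],
}

distances = pairwise_hamming(selections)
for pair, d in distances.items():
    print(pair, d)
```

A small distance means two languages share most of their selected layers; comparing distances for related and unrelated languages would give the evaluation asked for above.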

Review 4

Summary and Contributions: The paper introduces a probabilistic framework to select which layers to use per language pair in a multilingual machine translation setup. This is done by introducing a binary latent variable z per layer that indicates whether the corresponding layer is used in multilingual NMT. When z is continuous in (0, 1), it acts as gradient scaling that enables training a deep transformer for bilingual NMT. The model is trained end to end by optimizing the ELBO. The paper shows that a very deep NMT model (up to 100 layers in the decoder) can be trained.
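For readers who want the gradient-scaling remark spelled out, the following is a sketch in generic notation. The gated residual form and the symbols F_l, q, and p are assumptions made for illustration and may differ from the paper's exact formulation; the last line is the standard evidence lower bound for a latent-variable model.

```latex
% Assumed gated residual layer: z_l scales the contribution of layer l.
x_{l+1} = x_l + z_l \, F_l(x_l), \qquad z_l \in [0, 1]

% Treating z_l as independent of x_l, the layer Jacobian is scaled by z_l,
% so z_l = 0 reduces the layer to the identity.
\frac{\partial x_{l+1}}{\partial x_l} = I + z_l \, \frac{\partial F_l(x_l)}{\partial x_l}

% Standard ELBO with approximate posterior q(z) and prior p(z).
\log p(y \mid x) \;\ge\; \mathbb{E}_{q(z)}\big[\log p(y \mid x, z)\big]
  - D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big)
```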

Strengths:
- The proposed approach of using a latent variable z to select Transformer layers or to scale gradients is intuitive and principled.
- The experiments are comprehensive, covering both bilingual NMT and multilingual NMT.
- Analysis of priors and coefficient parameters is provided to help understand several choices of the model design.

Weaknesses: While increasing the depth of the transformer brings extra gains in BLEU for bilingual NMT, deeper models also increase the inference time.

Correctness: The claim and method presented in the paper are correct.

Clarity: The paper is well written.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I wonder how the training time and inference time of the latent depth transformer compare to those of the baselines used in the paper. == Post-rebuttal comment == I keep my score after reading the author response.