NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7830
Title: Are Sixteen Heads Really Better than One?

Reviewer 1

Originality: The work is fairly original, and certainly garnered a fair amount of attention when released on arXiv. I don’t think most MT researchers would have thought to try dropping attention heads after training is complete. But this is obviously a very familiar idea to those working in network pruning.

Quality: The experimental work here is top-notch: the experiments are well designed and clearly described, with statistical significance clearly indicated when appropriate. However, it is a little disappointing that one of the “soundbite” results from this paper, that “some layers can be reduced to a single head”, was (a) an oracle result where the most important head was selected by looking at the test set and (b) obtained one layer at a time. My main takeaway from the paper, that MT/BERT Transformers can safely prune 20/40% of their heads, is much less surprising and exciting. The second technical contribution mentioned above is fairly minor and not particularly novel; the paper should be seen as mostly experimental, with much of its novelty derived from the fact that it covers ground that is as yet under-explored in NMT/NLP.

Clarity: Aside from the above-mentioned disconnect between the abstract/conclusion and the rest of the content, the paper is exceptionally clear. I also didn’t find Figure 5 to be as clear and helpful as the authors found it: it is hard to see linear relationships between results at a given timestep, and it is not clear why the authors choose to call out Epoch 10 as being particularly important (as opposed to Epoch 6, for example).

Significance: This paper could potentially start a trend of pruning NMT and BERT systems. But I’m not sure any major players will be rushing to prune their attention heads based on these results as they stand.

Specific questions: Equation (1) shows attention heads being combined in a sum. Aren’t they usually combined by concatenation?

======

Thanks to the authors for their response. I greatly appreciate how much space and attention was devoted to my review. The clarifications to Figure 5 and to Equation (1) are helpful. The additional experiments showing greater reductions from pruning certain BERT models, and the generalizability of the "one head" claim across dev and test, are very comforting. I have adjusted my score accordingly.

Having now read the other reviews, I sympathize with R2's concerns about how good a fit this work is for a NeurIPS audience. Thinking about this from an ML (as opposed to NLP) standpoint, it would have been interesting to see how much more efficient pruning attention heads is than pruning network nodes without structural constraints. I imagine it is certainly easier to get speed-ups with this structured pruning.
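On the Equation (1) question raised above, one way to see why a sum and a concatenation can describe the same computation: if the concatenated heads are multiplied by a single output projection, that product equals a sum in which each head is multiplied by its own block of rows of the projection. The following is a minimal sketch of that identity in plain Python. All sizes, names, and the random inputs are illustrative assumptions, not values or code from the paper.

```python
import math
import random

random.seed(0)
n_heads, seq_len, d_head, d_model = 4, 3, 2, 6  # illustrative sizes (assumed)

def rand_matrix(rows, cols):
    """Random rows x cols matrix as a list of lists."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def weighted_sum(terms, weights):
    """Elementwise sum of equally shaped matrices, each scaled by its weight."""
    return [[sum(w * t[i][j] for t, w in zip(terms, weights))
             for j in range(len(terms[0][0]))] for i in range(len(terms[0]))]

def all_close(a, b):
    """True if two equally shaped matrices agree up to a small tolerance."""
    return all(math.isclose(x, y, abs_tol=1e-9)
               for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

# Hypothetical per-head attention outputs, each of shape (seq_len, d_head).
heads = [rand_matrix(seq_len, d_head) for _ in range(n_heads)]
# One shared output projection of shape (n_heads * d_head, d_model).
w_o = rand_matrix(n_heads * d_head, d_model)

# Concatenation form: Concat(head_1, ..., head_n) @ W_o
concat = [[v for h in heads for v in h[t]] for t in range(seq_len)]
concat_form = matmul(concat, w_o)

# Sum form: head i is multiplied by its own block of d_head rows of W_o.
blocks = [w_o[i * d_head:(i + 1) * d_head] for i in range(n_heads)]
terms = [matmul(h, w) for h, w in zip(heads, blocks)]
sum_form = weighted_sum(terms, [1.0] * n_heads)
forms_agree = all_close(concat_form, sum_form)

# Masking a head (here index 1) drops its term from the sum, which matches
# zeroing that head's slice of the concatenation before the projection.
mask = [1.0, 0.0, 1.0, 1.0]
masked_sum = weighted_sum(terms, mask)
zeroed = [[v * mask[i] for i, h in enumerate(heads) for v in h[t]]
          for t in range(seq_len)]
masked_agree = all_close(masked_sum, matmul(zeroed, w_o))

print(forms_agree, masked_agree)
```

Both checks are expected to hold up to floating-point tolerance. Written as a sum, removing a head is simply removing one term, which is presumably why the sum notation is convenient in a pruning paper.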

Reviewer 2

This paper prunes attention heads in Transformer models and asks empirically whether they are all needed. The answer seems to be that we don't need many heads for good accuracy. This is an interesting empirical result, but I think a few more experiments should be run to convince the reader that the conclusions are general:

(a) Repeating the analysis with Transformer models trained with different numbers of heads.
(b) Repeating the analysis on more datasets, e.g. a Transformer trained on a different dataset. I understand there was an IWSLT experiment in Sec 6, but it addresses slightly different questions than Sec 2-5.

Clarification questions:
- Fig 1. The y-axis is the number of heads, which is a bit confusing. Is it supposed to be the frequency of models with a given BLEU/accuracy instead? Or is this plot really binned by number of heads?
- Fig 3. What is the green line? I don't understand why pruning based on the end metric (BLEU or accuracy) would do worse than the blue line (I_h).

Reviewer 3

This paper offers a solid discussion of pruning attention heads in models using multi-headed attention mechanisms. The provided 'heuristic' strategies to do so seem effective, yet one could imagine additional variants worth evaluating. The analysis is solid, the findings somewhat surprising and practically highly relevant as they improve inference speed considerably.