Summary and Contributions: This paper presents DynaBERT, which adapts the size of a BERT or RoBERTa model in both width and depth. While the depth adaptation is well known, the width adaptation uses importance scores for the attention heads to rewire the network so that the most useful heads are kept. The authors show that this approach, combined with a procedure to distill knowledge from a vanilla large pre-trained and fine-tuned model into the smaller adapted one, performs very well compared to both the original large model and other methods used to compress BERT-like models. This work introduces a novel rewiring approach to reduce the width of the transformer model by a chosen multiplier and combines it with several previously known techniques (adaptive depth, distillation) to achieve superior results.
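For concreteness, a minimal sketch of the head rewiring as I understand it (the function name, tensor shapes, and rounding rule below are my own illustration, not the authors' code):

    import torch

    def rank_heads_by_importance(importance, width_mult):
        # importance: tensor of shape (num_layers, num_heads), e.g. an accumulated
        # sensitivity score of the task loss with respect to each head.
        # width_mult: fraction of heads to keep per layer, e.g. 0.5.
        _, num_heads = importance.shape
        keep = max(1, int(round(num_heads * width_mult)))
        kept_heads = []
        for layer_scores in importance:
            # Rank heads from most to least important and keep the top `keep`,
            # which amounts to "rewiring" so the useful heads survive shrinking.
            order = torch.argsort(layer_scores, descending=True)
            kept_heads.append(order[:keep].tolist())
        return kept_heads

    # Example: 12 layers x 12 heads with random scores, keeping half the width.
    scores = torch.rand(12, 12)
    print(rank_heads_by_importance(scores, width_mult=0.5)[0])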
Strengths: The main strength of the work is in the empirical evaluation of the results and the various ablation studies to delineate contributions of the different techniques proposed in the paper that together make up DynaBERT. The width-shrinking technique appears to be novel and works very well, which will hopefully inspire additional research. I believe this work will be useful to the NeurIPS community.
Weaknesses: One weakness is that the procedure to obtain a compressed model is quite involved, and must be done separately for each downstream task. It would be interesting to see whether this approach can be adapted to work during the pre-training phase.
Correctness: The claims and methodology seem correct from what is described in the paper.
Clarity: The paper is very well written and is quite easy to read!
Relation to Prior Work: This work builds on many techniques introduced in other works. I believe it did a good job delineating its contribution.
Reproducibility: Yes
Additional Feedback: I've read the author rebuttal and thank the authors for their clarifications. I believe my rating is still appropriate for this work.
Summary and Contributions: Various works have proposed decreasing the depth of BERT-style transformer models. This work proposes a way to dynamically decrease both depth and width: first a model with dynamic width is trained, and then that model is distilled into a model with both dynamic width and depth. In terms of parameter efficiency and inference speed, this approach outperforms other ways of distilling BERT.
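Roughly, the two-stage flow I take away is something like the following (the set_active_subnet API, the multiplier values, and the loss are placeholders of mine, not the authors' code; only a soft-label term is shown):

    import torch
    import torch.nn.functional as F

    def distill_step(teacher, student, batch, optimizer,
                     width_mults=(1.0, 0.75, 0.5, 0.25), depth_mults=(1.0,)):
        # One step: a fixed teacher's logits supervise every (width, depth)
        # sub-network of the adaptive student. The paper's full objective
        # has additional terms beyond the soft-label loss shown here.
        teacher.eval()
        with torch.no_grad():
            t_logits = teacher(batch)
        loss = 0.0
        for mw in width_mults:
            for md in depth_mults:
                # Hypothetical API for activating one sub-network of the student.
                student.set_active_subnet(width_mult=mw, depth_mult=md)
                s_logits = student(batch)
                loss = loss + F.kl_div(F.log_softmax(s_logits, dim=-1),
                                       F.softmax(t_logits, dim=-1),
                                       reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Stage 1: teacher = fine-tuned BERT, student = width-adaptive model
    #          (depth_mults=(1.0,)).
    # Stage 2: teacher = the trained width-adaptive model, student = the
    #          width- and depth-adaptive model (pass several depth multipliers).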
Strengths: The empirical results are strong, and the paper presents an interesting analysis of whether depth or width matters more. The combination of depth and width reduction is novel, as is the two-stage procedure. The analysis of which attention heads are important is also interesting.
Weaknesses: I found the paper quite dense and hard to read. The results are impressive, but rely on various complicated procedures (data augmentation, multiple rounds of fine-tuning and distillation).
Correctness: Yes.
Clarity: I found the paper quite dense, and would appreciate more precise definitions of the various tricks. E.g., in Table 4, what exactly is 'fine-tuning'? You should link more between your results section and your methods section, i.e. say something like 'this result is from the method described in Section 3.1', or whatever the case may be. Line 140: 'In this work, we use the model that has higher average validation accuracy of all sub-networks, between with and without fine-tuning.' I can't work out what this sentence means.
Relation to Prior Work: Yes.
Reproducibility: Yes
Additional Feedback: I believe Equation 1 is normally expressed in terms of concatenating the attention heads, which confused me at first (I spell out the standard form I mean below). You might want to be clearer about why you use a different formulation of the standard transformer layer for your purposes.

I think you need to briefly define TinyBERT and LayerDrop in case the reader isn't familiar with them. Why are they fair comparison points (i.e. maybe they require less fine-tuning compute, given your multi-step process)? I would also briefly define your data augmentation procedure, even if it is described in prior work.

There are a lot of moving parts here, and you perform ablation studies to show that they all help. But something to think about is whether there is a simpler method that could give similar results.

You should consider potential negative consequences in your broader impact statement; for example, the complexity of the distillation procedure might make it harder to apply to new domains (if we need to tune many hyperparameters, etc.).

Update: I thank the authors for their useful response to my questions. I will increase my score given that most of my concerns were addressed, but I urge the authors to pay attention to readability and clarity in the final version.
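On the Equation 1 point above: the standard presentation of multi-head attention concatenates the heads before the output projection, which is algebraically equivalent to summing per-head projections,

    \mathrm{MHA}(x) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O = \sum_{i=1}^{h} \mathrm{head}_i\, W^O_i,

where W^O_i denotes the block of rows of W^O that multiplies head_i (this is standard notation, not taken from the paper). The summed form makes it straightforward to drop or mask individual heads, which I assume is why the authors adopt it, but a sentence noting the equivalence would help readers.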
Summary and Contributions: The authors propose DynaBERT, which allows a user to adjust size and latency via adaptive width and depth of the BERT model. They show that this works, giving the user a range of size/latency trade-offs to choose from depending on their needs. Their results seem to indicate performance gains at similar sizes (in terms of width and depth) in comparison to existing compression methods for BERT.
Strengths: The experimentation is comprehensive and addresses a gap in general distillation, where the user lacks a method that gives them full control over model size. The paper is well written and mostly clear.
Weaknesses: The broader impact and related work sections are a bit weak. The broader impact section reads as a rehashing of the conclusion, and the related work discussion is folded into the introduction and given just half a paragraph. I'd like to see these improved. See the additional feedback section.
Correctness: I think all of these aspects are correct and clean.
Clarity: Paper is well written.
Relation to Prior Work: Kind of. It's clear how it relates to some of the other fixed distillation approaches, but without a comprehensive related work section it's hard to figure out.
Reproducibility: Yes
Additional Feedback: Random things:
- Table 1 is a bit overloaded and difficult to parse. Also, I'm not sure which of the rows and columns correspond to m_w vs. m_d.
- Figure 3 is really difficult to parse too: RoBERTa's performance is covered by the DynaRoBERTa curves, and sometimes similarly for BERT. Can you present this differently, with lines corresponding to the base models?
- Why MNLI and SST-2 specifically for Figure 3, rather than other tasks?

Related Work: There is a little bit of discussion in the first half of paragraph 2 of the introduction, but no comprehensive treatment of how your work sits in context with the work already out there. It would be important to include work on the capacity of large language models, what they can and cannot do, and how more layers/parameters help language models in general (Jawahar et al., 2019, "What Does BERT Learn about the Structure of Language?"; Jozefowicz et al., 2016, "Exploring the Limits of Language Modeling"; Melis et al., 2017, "On the State of the Art of Evaluation in Neural Language Models"; Subramani et al., 2019, "Can Unconditional Language Models Recover Arbitrary Sentences?"). There are many others, but this is a subset.

Experimentation & Analysis: The discussion and analysis sections list out results and present tables, but there isn't much of a discussion in the main body. Can you talk about and put your results in context so that it's easier for the reader to understand how and why certain approaches work? I know it's difficult and sometimes cost-prohibitive, but reporting statistical significance for your studies would go a long way toward showing whether there really is a statistical difference between your approach and others.

Broader Impacts: This reads like a rehashing of your conclusion rather than a real discussion of the ethical and societal implications of your work. I'd like to see quite a bit more here discussing the potential of this method to help and to harm if used in certain ways.

------------------------------------------------------------------------------------------

Thanks to the authors for submitting an author response. I read through the other reviews and the author response. Thank you for addressing some of the related work and statistical significance concerns. More clarity on how this work relates to and is put into context with prior work would help me make a better judgment. The statement connecting this to the Jawahar et al. work and the stability of the system is a positive. I was between a 4 and a 5 before the author response and am now at around a 5.5, leaning slightly away from acceptance, so I'm gonna stick with the 5.