NeurIPS 2020

Movement Pruning: Adaptive Sparsity by Fine-Tuning

Meta Review

This paper proposes movement pruning, a first-order weight pruning method that makes pruning more adaptive during fine-tuning. It is compared against traditional magnitude pruning, and movement pruning is shown to be better suited to the setting where weights shift during fine-tuning. All four reviewers recommend accepting this paper (though some found it borderline). I agree with the reviewers and recommend acceptance. One weakness pointed out is that, while the baselines are strong, the way they are reported may be somewhat misleading. In particular, models are compared by sparsity percentage, which puts models with fewer parameters (e.g., MiniBERT) at a disadvantage. The authors' clarification that sparsity is reported relative to BERT-base (rather than relative sparsity) indicates that the comparison is fairer than originally realized. I encourage the authors to take the reviewers' suggestions into account in the final version. In particular, they should add the missing citations pointed out by R3 and clarify the relation to prior work in a discussion section.
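
To make the reporting concern concrete, the sketch below contrasts the two ways of counting remaining weights. It assumes that "relative sparsity" means sparsity measured against each model's own parameter count. All parameter counts and the names `remaining_fraction`, `REFERENCE_PARAMS`, `pruned_large`, and `small_dense` are hypothetical and chosen only for illustration. They are not figures from the paper.

```python
def remaining_fraction(nonzero_params: int, denominator_params: int) -> float:
    """Fraction of weights that remain, measured against a chosen denominator."""
    if denominator_params <= 0:
        raise ValueError("denominator_params must be positive")
    return nonzero_params / denominator_params


# Hypothetical size of the common reference model (stand-in for BERT-base).
REFERENCE_PARAMS = 100_000_000

# Hypothetical models: a pruned large model and a small dense model.
models = {
    "pruned_large": {"total": 100_000_000, "nonzero": 10_000_000},
    "small_dense": {"total": 20_000_000, "nonzero": 20_000_000},
}

report = {}
for name, m in models.items():
    report[name] = {
        # Measured against the model's own size: a dense small model shows 100%.
        "relative_to_own_size": remaining_fraction(m["nonzero"], m["total"]),
        # Measured against the shared reference: models land on a common axis.
        "relative_to_reference": remaining_fraction(m["nonzero"], REFERENCE_PARAMS),
    }

for name, row in report.items():
    print(
        f"{name}: {row['relative_to_own_size']:.0%} of own size, "
        f"{row['relative_to_reference']:.0%} of reference"
    )
```

Under these assumed numbers, the small dense model looks unpruned when measured against its own size, but it retains only a fifth of the reference model's weights. Reporting against a shared reference is what places both models on the same axis.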