Sun, Dec 8 through Sat, Dec 14, 2019, at the Vancouver Convention Center
This work investigates whether multi-headed attention models (e.g., BERT) actually need multiple attention heads. The perhaps surprising finding is that, in some cases, a single head suffices. Reviewers agreed that the question is interesting and the empirical work sound. This work may motivate follow-up efforts investigating similar pruning approaches.
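The pruning idea the review refers to can be illustrated by zeroing out the outputs of individual heads in a multi-head self-attention layer at inference time. Below is a minimal NumPy sketch under toy assumptions (random weights, made-up dimensions), not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, head_mask=None):
    """Toy multi-head self-attention; head_mask zeroes out pruned heads."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project and split into (n_heads, seq, d_head)
    q = (x @ Wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ v            # (n_heads, seq, d_head)
    if head_mask is not None:
        # "Prune" heads by masking their entire output
        heads = heads * head_mask[:, None, None]
    out = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
d_model, n_heads, seq = 8, 4, 3            # toy sizes, chosen arbitrarily
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1
                  for _ in range(4))
x = rng.standard_normal((seq, d_model))

full = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
mask = np.zeros(n_heads)
mask[0] = 1.0                              # keep only head 0, prune the rest
pruned = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, head_mask=mask)
```

In the paper's setting, one would then measure how much task performance degrades when `pruned` replaces `full`; the reported finding is that for some layers it barely degrades at all.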