Review for NeurIPS paper: Bayesian Attention Modules

NeurIPS 2020

Bayesian Attention Modules

Meta Review

This paper proposes treating attention weights as continuous latent variables, trained in the style of VAEs. Unnormalized attention weights are drawn from reparameterizable distributions such as the Weibull and log-normal, and are then normalized. Experiments show that the proposed continuous latent attention mechanism outperforms deterministic attention on a wide variety of tasks, including image captioning, machine translation, graph classification, and fine-tuning BERT. All reviewers recommended acceptance, describing the idea as interesting and the work as solid and well executed. Concerns were raised about the significance of the improvements on VQA and NMT, and about directly setting the prior to the approximate posterior; the authors addressed both in the rebuttal. I agree with the reviewers and recommend acceptance. However, I encourage the authors to follow the reviewers’ suggestions to improve the paper, including discussing the advantages and disadvantages of continuous latent variable models relative to discrete latent variable models, as suggested by R1.
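For readers unfamiliar with the mechanism summarized above, the sample-then-normalize step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `sample_weibull_attention`, the fixed Weibull shape `k`, and the choice to derive the Weibull scale from exponentiated attention scores are all assumptions made for illustration, and the training objective is omitted entirely.

```python
import numpy as np

def sample_weibull_attention(logits, k=10.0, rng=None):
    """Draw one stochastic attention vector per row of `logits`.

    Hypothetical helper: `k` (Weibull shape) and the use of exp(logits)
    as the Weibull scale are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive scale parameter; subtracting the row max only improves
    # numerical stability and cancels out after normalization.
    lam = np.exp(logits - logits.max(axis=-1, keepdims=True))
    # Reparameterized Weibull draw via the inverse CDF:
    # x = lam * (-log(1 - u))**(1/k), with u ~ Uniform(0, 1).
    u = rng.uniform(size=logits.shape)
    unnormalized = lam * (-np.log1p(-u)) ** (1.0 / k)
    # Normalize so that each row of weights sums to one.
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)

# Example: two queries attending over three keys.
scores = np.array([[0.0, 1.0, 2.0], [3.0, 3.0, 3.0]])
weights = sample_weibull_attention(scores, rng=np.random.default_rng(0))
```

Because the randomness enters only through `u`, the sampled weights are a deterministic, differentiable function of the scores for a fixed `u`, which is the property that makes such distributions usable with reparameterized gradient estimates.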