This paper proposes treating attention weights as continuous latent variables and training them with variational autoencoders (VAEs). It draws unnormalized weights from reparametrizable distributions, such as the Weibull and log-normal, which are then normalized into attention weights. Experiments show that the proposed continuous latent attention mechanism outperforms deterministic attention on a wide variety of tasks, including image captioning, machine translation, graph classification, and fine-tuning BERT. All reviewers recommended acceptance, noting that this is an interesting idea and a solid, well-executed work. Concerns were raised about the significance of the improvements on VQA and NMT, and about directly setting the prior equal to the approximate posterior; the authors addressed both in the rebuttal. I agree with the reviewers and recommend acceptance. However, I encourage the authors to follow the reviewers’ suggestions for improving the paper, including a discussion of the advantages and disadvantages of continuous latent variable models relative to discrete latent variable models, as suggested by R1.
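To make the sampling step concrete, here is a minimal sketch of the core idea described above: reparameterized Weibull samples serve as unnormalized attention weights, which are then normalized. This is an illustrative assumption, not the paper's implementation; the function name `weibull_attention`, the choice of tying the Weibull scale to exponentiated attention scores, and the shape parameter `k` are all hypothetical.

```python
import numpy as np

def weibull_attention(scores, k=1.0, rng=None):
    """Sketch: sample unnormalized weights from a reparameterized Weibull
    and normalize them into attention weights.

    Reparameterization: w = lam * (-ln u)^(1/k), with u ~ Uniform(0, 1),
    so gradients can flow through lam (and hence through the scores).
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = np.exp(scores)                      # positive per-key scale (assumed link)
    u = rng.uniform(low=1e-12, size=np.shape(scores))  # avoid log(0)
    w = lam * (-np.log(u)) ** (1.0 / k)       # reparameterized Weibull sample
    return w / w.sum(axis=-1, keepdims=True)  # normalize to attention weights
```

Because the sample is a deterministic transform of the noise `u` and the scores, a VAE-style training objective can backpropagate through the attention weights, which is what distinguishes this from non-reparameterizable stochastic attention.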