NeurIPS 2020

Bayesian Attention Modules

Review 1

Summary and Contributions: This paper proposes to treat attention weights as continuous latent variables, trained with a VAE-style variational objective. To parameterize the prior and the approximate posterior, the work uses reparameterizable distributions such as the Weibull and lognormal to produce unnormalized weights, which are then normalized to obtain attention weights. This procedure has lower variance because it uses pathwise gradients. To compute the KL term, the work approximates the KL between the attention distributions with the KL between the untransformed distributions, which has an analytic form. Equivalently, this approximation can be viewed as treating the unnormalized attention weights as the latent variables. The parameterization of the approximate posterior also differs from prior work: it comes directly from the soft attention weights, which makes it lightweight (no separate inference network is needed), but it also means the inference network does not see the entire target and so can never reach the true posterior. Experiments show that the proposed continuous latent attention mechanism outperforms deterministic attention on a wide variety of tasks, including image captioning, machine translation, graph classification, and fine-tuning BERT.
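The sample-then-normalize procedure summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_weibull_attention`, the choice of `exp(score)` as the Weibull scale, and the shared shape parameter `k` are all assumptions made for the example.

```python
import numpy as np

def sample_weibull_attention(scores, k=10.0, rng=None):
    """Draw one stochastic attention vector by reparameterized Weibull sampling.

    Assumption (not from the paper): the Weibull scale of each unnormalized
    weight is exp(score), and all weights share one shape parameter k.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    # Subtracting the max only rescales every weight by a common factor,
    # which cancels in the normalization below; it avoids overflow in exp.
    scale = np.exp(scores - scores.max(axis=-1, keepdims=True))
    # Reparameterization: the noise u is parameter-free, and the sample is a
    # deterministic, differentiable function of (scale, k) given u.
    u = rng.uniform(size=scores.shape)
    # Inverse CDF of Weibull(k, scale): x = scale * (-log(1 - u))^(1/k).
    unnormalized = scale * (-np.log1p(-u)) ** (1.0 / k)
    # Normalize the positive samples so they sum to one along the last axis.
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)
```

In this sketch a larger `k` concentrates each sample around its scale, so the normalized weights approach the ordinary softmax of `scores`; a smaller `k` gives noisier attention.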

Strengths: 1. This work enables modeling the uncertainty of the attention variable, complementing prior work on discrete latent attention. 2. The proposed method is simple, does not require a separate inference network, and can easily be plugged into existing attention-based models. 3. Empirically, the results are better than deterministic attention on a variety of tasks.

Weaknesses: 1. The improvement over deterministic attention seems marginal on some tasks, such as VQA and machine translation. 2. It is not clear what benefit modeling attention uncertainty provides. Unlike prior work on discrete latent variables, which can make interpretability claims, I am not sure why we would want to model soft attention as a latent variable. Is it better in low-resource scenarios? Can you also quantify the benefits of modeling attention uncertainty further, or at least show qualitatively a few samples from the attention distribution to see whether they truly reflect the underlying uncertainty? 3. Related to 2, what is the benefit of latent continuous attention over latent discrete attention? The hard attention baseline looks too weak to me; a Gumbel-Softmax type of discrete attention would probably be a more reasonable baseline. 4. The inference network does not see the true outputs, so the "approximate posterior" can never reach the true posterior. I am not sure why a separate prior is needed in this formulation. IMHO the "approximate posterior" here should be used as the prior (this is how it is used at test time), which immediately sets the KL term to zero, so the ELBO is better than when optimizing with a different prior. This would be the continuous version of the "hard attention" of Xu et al. 2015. Why would using a different prior help (to me it gets the same information as the "approximate posterior" here)? Can you run an ablation study that removes the KL term in Eq. 4? 5. While I understand that the true KL between attention distributions is not computable, the approximation here essentially shifts the latent variable modeling from the attention weights to the unnormalized weights. However, using the "approximate posterior" as the prior gets rid of the KL term, so this is no longer a problem.
===post-rebuttal=== My concern about setting the prior to the approximate posterior has been addressed in the authors' response. It might be useful to incorporate it into the next version.
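The observation in point 4, that the KL term vanishes when the prior equals the approximate posterior, can be checked numerically for the lognormal case. The sketch below assumes lognormal distributions on the unnormalized weights; the function name `kl_lognormal` and its parameter names are illustrative and not taken from the paper. It relies on the fact that KL divergence is invariant under invertible transformations, so the KL between two lognormals equals the KL between their underlying Gaussians.

```python
import math

def kl_lognormal(mu_q, sigma_q, mu_p, sigma_p):
    """KL( LogNormal(mu_q, sigma_q^2) || LogNormal(mu_p, sigma_p^2) ).

    Equal to the closed-form KL between the Gaussians N(mu_q, sigma_q^2)
    and N(mu_p, sigma_p^2) of the log-variables.
    """
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Prior identical to the approximate posterior: the KL term is zero.
kl_same = kl_lognormal(0.3, 0.7, 0.3, 0.7)

# Prior with a different location: the KL term is strictly positive.
kl_diff = kl_lognormal(0.0, 1.0, 1.0, 1.0)
```

Here `kl_same` is zero and `kl_diff` is 0.5, which is the penalty the ELBO pays when the prior and the approximate posterior disagree.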

Correctness: Yes.

Clarity: Yes, it's very easy to follow.

Relation to Prior Work: Yes, to the best of my knowledge.

Reproducibility: Yes

Additional Feedback: I think using a separate prior requires more justification. An easier and more parameter-efficient alternative is to simply use the "approximate posterior" here as the prior, resulting in a continuous version of "hard attention".

Review 2

Summary and Contributions: This paper proposes a stochastic version of attention (BAM) with a differentiable ELBO objective. The experiments show competitive results against strong baselines. The variational distributions can be Weibull or lognormal, both of which are reparameterizable, which makes stochastic soft attention practical.

Strengths: 1. Introduces a context-dependent soft attention model through a data-dependent prior $p_{\eta}(W)$, into which prior beliefs about the attention distribution can be injected. 2. BAM uses a reparameterizable variational distribution to approximate the posterior of the attention weights, which allows training with gradient descent. 3. BAM is applied to various attention-based models (graph node classification, VQA, NLP, etc.), and the experiments show improvements.

Weaknesses: The variational distribution is a single fixed distribution rather than a family, which potentially restricts the expressiveness of the attention. Compared with existing methods, the improvement comes from the different variational distribution and the reparameterization trick.

Correctness: yes

Clarity: yes

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: As mentioned in the paper, BAM is more efficient than existing methods. Could you please provide the running time of BAM and the comparison methods? This would give me a quantitative sense of the difference.

Review 3

Summary and Contributions: The paper proposes a general-purpose Bayesian soft-attention module that is efficient in terms of memory and computational cost.

Strengths: 1. Stochastic attention modules that enable Bayesian inference for neural network learning, which in turn yields posterior distributions over predictions rather than point estimates. 2. Extensive study over different domains and problems. 3. Efficient design that can easily be applied on top of existing attention models.

Weaknesses: Good paper; the following are notes to further improve the paper rather than major weaknesses: 1. Some discussion of why posterior inference is performed over W, among all the parameters in the attention network, would be beneficial. 2. It would be interesting to know why the gamma distribution itself was not chosen instead of the Weibull distribution. 3. In the experiments it would be interesting to see more comparisons with other stochastic attention methods. For example, Deng, Y., Kim, Y., Chiu, J., Guo, D. and Rush, A., 2018. Latent alignment and variational attention. In Advances in Neural Information Processing Systems (pp. 9712-9724) (mentioned in the related work) also used the same dataset for the VQA task, but the paper compares against it only on the machine translation task. 4. It is unclear why uncertainty estimation is evaluated on only one task. 5. Table 1: why does only BAM-WC have standard deviation results?

Correctness: The claims and empirical methodology appear to be correct

Clarity: The paper is very well written and easy to follow. I enjoyed reading it. A couple of comments: 1. Figure 1 requires a bit more explanation. 2. Line 299: MLE is not defined. 3. Line 301: the reference for the evaluation metrics is missing; a couple of words on the 4 different BLEU metrics in Table 3 would be appreciated. Minor: Eq. (1): in the denominator it is better to use a different notation for j (e.g. j') to avoid confusion with the j on the left-hand side and in the numerator.
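To illustrate the minor notational point, and assuming Eq. (1) is the usual softmax normalization over keys (the symbols $W_{i,j}$ and $\Phi_{i,j}$ below are placeholders, not necessarily the paper's notation), the suggested form would read:

```latex
W_{i,j} = \frac{\exp(\Phi_{i,j})}{\sum_{j'} \exp(\Phi_{i,j'})}
```

Here $j$ indexes the specific key on the left-hand side and in the numerator, while $j'$ is the dummy summation index in the denominator.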

Relation to Prior Work: It is clearly discussed how this work differs from previous contributions

Reproducibility: Yes

Additional Feedback: Update after the response: I have read the other reviews and the authors' response. I would like to thank the authors for their replies. I remain of the opinion that this is a good paper, worthy of being presented at NeurIPS.
===========================================
See weaknesses.