Reviews: e-SNLI: Natural Language Inference with Natural Language Explanations

Update after Author Feedback: Thanks for all of the clarification, especially with regard to the BLEU scores. I think the idea of explicable models is worth pursuing, and this is a decent contribution to showing how one might do that. It is unfortunate that this work shows a huge tradeoff between models that perform at high levels and those that explain well (from 4.1 it seems like we can get good performance but then can't generate correct explanations very often, and from 4.2 we can generate correct explanations more often at the expense of good performance). It also seems disappointing that the BLEU scores in the PREDICT setting are already so close to the inter-annotator agreement even though the generated explanations are not correct very often; this seems to suggest that we really do need to rely on the percent correct given by human evaluation and that the BLEU scores are not very meaningful. This seems like a bottleneck for this resource being widely adopted. Nonetheless, these findings are a solid contribution, and so is the data if others are willing to do human evaluation or work on a new automatic metric for a task like this.

Original Review: This work augments the SNLI corpus with explanations for each label. The authors then run several different kinds of experiments to show what can be done with such an augmented dataset. First, they show that it is possible to train a model to predict the label and then generate an explanation based on that label. Second, they show that given a ground truth explanation, the labels can be easily predicted. Third, they show that the model that both predicts and explains (e-SNLI) can provide useful sentence representations. Fourth, they try to transfer e-SNLI to MultiNLI. These seem like the perfect experiments for a very interesting attempt to make deep learning models more interpretable.

As the authors mentioned, this line of work has some history with Park et al., Ling et al., and Jansen et al., but I have not seen anything quite like this, especially for a task like SNLI, which does hold potential for transfer learning. It is an especially important task to learn how to interpret if we want to have a better sense of how our models are thinking of logical inference.

Subsection 4.1
I’m confused/concerned by the evaluation of the generated explanations and the resulting low BLEU scores. First of all, why use only BLEU-4? Why not use the same BLEU that is typically used for tasks like neural machine translation? This would make the numbers more interpretable to other NLP researchers. Second, is the .2 really on the same scale on which NMT researchers report BLEU scores of 30+? The obvious answer is no because it is just BLEU-4, but how do I relate those? .2 BLEU on translation would give complete nonsense, but the generated explanations in Table 2 look quite good. How do I make sense of this as a reader? Does this mean the explanations are coherent but mostly ungrounded in the prediction? If the inter-annotator agreement is truly so low, then how can we ever really evaluate these systems? Will we always have to rely on human judgment in comparing future systems to the one you’ve proposed?

I’m also worried by these numbers: if 99% of labels are consistent with their explanations, why is there such a wide gap between the 84% correct labels and the 35.7% correct explanations? Does this mean that the explanations are usually not explaining at all? They just happen to give the same label if you were to infer from them alone? Based on the discussion in the next section, I’m assuming the ‘them’ in ‘35.7% of them’ refers to the 84 out of 100 examples correctly predicted rather than to all 100 qualitatively examined examples, but that was ambiguous until I read 4.2. How exactly were partial points awarded to get the 35.7% correct explanations? You mention it is done in the footnote, but I’d like more detail there if there is going to be work that builds off this in the future. Maybe it would just be less ambiguous to not award these partial points.

Subsection 4.2
The 67.75% is quite low in this section. Using the hypothesis alone can give you a score right around there (Gururangan et al. 2018); that should probably be mentioned. I like the conclusion here: the GENERATE model can just generate explanations after the fact that don’t have any real incentive to be correct. Perhaps with a higher alpha value (Eq. 1) this would change, but we don’t have those experiments to verify.

4.3 and 4.4 appear to show some empirical validation of the method for transfer learning and universal embeddings.

I think there might be a typo in Table 5’s std deviation for InferSent on MultiNLI.
Line 203: “a model on model” seems like a typo.

I really like this idea, the experiments, and the example generated explanations, but I have major concerns about evaluation for this kind of task. Even though the language of the generated explanations is surprisingly coherent, it seems like in most cases the explanations are not actually aligning with the predictions in a meaningful way. What’s worse is that the only way to find out is to manually inspect. This doesn’t seem like a sustainable way to compare which future models are better at the explanation part of e-SNLI, which is really the crux of the paper. In short, I’m left feeling like the evaluation setup just isn’t quite there yet for the community to be able to use these ideas. This is a hard problem. I don’t have great ideas for how to fix the evaluation, and I’m definitely looking forward to hearing what the authors have to say about this, especially when it comes to the BLEU evaluation and how I might better interpret those results. For now, the overall score is a 5 with a confidence of 4.
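
As a hedged illustration of the scale question raised under Subsection 4.1: under the common definition, BLEU is the brevity-penalized geometric mean of 1- to 4-gram precisions on a 0-1 scale, and NMT papers usually report that same quantity multiplied by 100. Whether the paper's BLEU-4 is exactly this quantity is an assumption here, not something the review confirms. The sketch below is a sentence-level, single-reference version without smoothing; the names `bleu` and `ngram_counts` and the example sentences are made up for illustration and are not taken from e-SNLI.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level, single-reference BLEU on a 0-1 scale, without smoothing."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / max_n)


# Made-up sentences, not taken from e-SNLI.
reference = "a man is playing a guitar on stage".split()
candidate = "a man is playing a guitar outside".split()

score = bleu(candidate, reference)
print(f"0-1 scale: {score:.3f}   0-100 scale: {100 * score:.1f}")
```

On this reading, a reported .2 would correspond to 20 on the scale NMT papers use. Sentence-level scores against a single short reference can still behave quite differently from corpus-level translation scores, so the two remain hard to compare directly.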

This work augments the natural language inference (NLI) task with explanations for entailment relationships between the premise and hypothesis. The authors introduce a new dataset including these explanations, and explore models that exploit these explanations for natural language inference and sentence-level representation learning. I appreciate the general idea of employing supervision for model interpretability. I am also impressed with the authors' meticulous annotation protocol, which employs a number of techniques during and after annotation to ensure data quality.

However, I see three main problems that prevent me from recommending the acceptance of this paper. First, the experimental results of this paper fall short in demonstrating the strengths of the authors' approach. Second, SNLI seems like a suboptimal choice as the base for such a dataset. Third, the contribution of this paper seems to better fit an NLP venue than a general ML venue. See details below.

Comments:
-- Unlike what the authors claim, the experimental results presented in this paper do not indicate that the dataset is useful in any downstream task. Currently these results are better seen as negative rather than positive evidence. While this does not mean that the dataset will not turn out to be useful in the future, it is discouraging to see that the authors were not able to show any positive result using it. In particular:
- Table 1 is not very informative as it stands, since there is no perplexity comparison to other baselines, and perplexity itself may not be an accurate way of measuring the quality of generated explanations (low perplexity may not equate to coherent sentences).
- Running multiple seeds and reporting standard deviation is important. However, in both Table 4 and Table 5, the improvements over baselines are in almost all cases well within the standard deviations, so the null hypothesis of no improvement cannot be rejected.
- The SNLI results reported in Section 4.1 are considerably lower than the state of the art (https://nlp.stanford.edu/projects/snli/), which further weakens the authors' claims.
-- The choice of SNLI as a base dataset for adding explanations seems suboptimal given recent evidence on the type of information encoded in it.
- It seems that choosing another dataset would have been preferable, at the very least MultiNLI, which also suffers from some of the same problems, but to a lesser extent (Gururangan et al., 2018; Poliak et al., 2018). The works pointing to these problems are relatively new and might not have been available when the authors started working on this project. However, (a) the authors do cite some of these works, and (b) while it is not entirely fair to blame the authors for not being aware of those problems, they still substantially reduce the usefulness of this dataset. The authors actually mention that the annotated explanations in their dataset heavily correlate with the entailment classes, leading one to suspect that the explanations might be reinforcing stylistic artifacts in the training data. Indeed, the examples in Figure 1 seem very templated and lexical.
- To explore this phenomenon, the authors are encouraged to perform a more global analysis of the generated explanations, perhaps by using crowdworkers to label the correctness of generated explanations across the full evaluation dataset. Such an analysis would help reinforce the authors' claim that the model is learning to explain entailment phenomena, rather than associating specific words or templates with the inference class and premise/hypothesis pair.
-- The authors point towards a tradeoff between l_label and l_explanation. I was surprised to see that the authors took the model with the best l_label score. It seems that the highest alpha value explored (0.9) was selected, indicating that the best l_label might have been obtained with alpha=1. Optimizing a different function might make more sense here.
-- I would have liked to see a better exposition and analysis around Section 4.4 (Transfer), which is quite sparse as it stands. As the authors state, MultiNLI is a good testbed for cross-genre transfer, so it would be pertinent to see, for instance, comparisons for each genre.

Questions to the authors:
-- The annotation protocol is missing some important details (Section 2.2):
- How many annotators took part in creating this dataset? How much were they paid per explanation?
- Quality assurance: what do the authors refer to as an error? Is it a binary decision (any score < 1?) or an averaged score?
-- What is L_explanation in Equation 1?
-- Typos and such:
- Table 2(b): shouldn't the label be "entailment"?
- Table 5 caption: "fine-tunning" should be "fine-tuning"
- Line 179: "with the same 5 seeds as above.": the seeds are only mentioned later
- Line 246: "consisten" should be "consistent"

=======

Thank you for your response. While I partially agree with some of the comments, I am still unsure about the quality of the generated explanations. I would advise the authors to include the analyses described in their response, as well as the other ones suggested by all reviewers, and resubmit to an NLP conference. I would also suggest working harder to improve results using e-InferSent (potentially using a different model).
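
The concern above about Tables 4 and 5, that improvements over baselines fall within the reported standard deviations, can be checked mechanically once per-seed scores are available. The following is a minimal sketch under stated assumptions: the helper name `gap_vs_spread` is hypothetical, and the per-seed accuracies are placeholders chosen for illustration, not numbers from the paper.

```python
from statistics import mean, stdev


def gap_vs_spread(baseline_runs, model_runs):
    """Return (mean gap, larger sample std, whether the gap is within that std)."""
    gap = mean(model_runs) - mean(baseline_runs)
    spread = max(stdev(baseline_runs), stdev(model_runs))
    return gap, spread, abs(gap) <= spread


# Placeholder accuracies for 5 seeds each; NOT numbers from the paper.
baseline_runs = [70.0, 71.0, 72.0, 73.0, 74.0]
model_runs = [71.0, 72.0, 73.0, 74.0, 75.0]

gap, spread, within = gap_vs_spread(baseline_runs, model_runs)
print(f"gap={gap:.2f}, std={spread:.2f}, within one std: {within}")
```

This only operationalizes the informal "within one standard deviation" check; a proper significance test over the seeds would be a stronger way to support or reject a claimed improvement.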

Paper ID:	5810