Review for NeurIPS paper: Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention

NeurIPS 2020

Review 1

Summary and Contributions: Thank you for the author response. --- This paper addresses the problem of using eye-tracking data for NLP tasks and presents two models. The first is a text saliency model that learns fixation durations from synthetic data, generated by a cognitive model of gaze during reading, the E-Z Reader model. The second is a joint model that learns an NLP task using the output of the text saliency model in its attention layer. The joint model achieves state-of-the-art performance on two different NLP tasks: paraphrasing and sentence compression. Ablation experiments also show the effectiveness of the proposed models.

Strengths: I think the work could be used for a variety of NLP applications, since it proposes an approach to the scarcity of eye-tracking data (which I believe is indeed a problem, especially for some medical NLP applications). It is also interesting and valuable that the proposed model uses a cognitive theory to generate synthetic eye-tracking data and shows that this data is effective.

Weaknesses: I don't see a major problem.

Correctness: The method and experiments seem ok. The claims are supported by the experiments.

Clarity: Yes. It is clearly written and I enjoyed reading it.

Relation to Prior Work: Yes. Explicitly using human attention for text comprehension tasks is new. Combining a cognitive theory and data-driven approaches is new. Combining gaze information in the attention layers for NLP is new.

Reproducibility: Yes

Additional Feedback: 1. Why did you combine a BiLSTM and a Transformer instead of just using a Transformer? The paper says "to better capture the sequential context". Did you try a Transformer only? 2. I'm wondering why there is no comparison with prior methods that use eye gaze data. I can guess a few reasons, but could you give a justification? 3. I'm wondering why the experiment section is in its current order, with the joint model first. This is not critical, but it felt a bit strange because I expected the evaluation of the text saliency model to come first. 4. Typo: line 327, "to coarse" => "too coarse"?

Review 2

Summary and Contributions: The paper makes two contributions: 1) a way to bootstrap a reading text saliency model (TSM) from a cognitive model of reading gaze fixation and a small amount of fine-tuning data; 2) a way to incorporate and fine-tune the TSM in paraphrase and in sentence compression, showing that gaze prediction improves pure text models for those tasks.

Strengths: Nice, novel contributions, both in the TSM and in its incorporation in two reading comprehension tasks. Creative approach to build a strong TSM with very limited training data by using synthetic data from a cognitive model. Convincing empirical evaluation.

Weaknesses: The TSM architecture (GloVe>BiLSTM>Transformer) should be evaluated through ablation studies. For instance, why not just a single (deeper) Transformer, maybe pre-trained as an MLM on a large text corpus to get good token embeddings? In the rebuttal, the authors mention preliminary experiments that supported their choice; I strongly encourage them to summarize those experiments in the final paper.

Correctness: Claims, method, and evaluation are solid and convincing.

Clarity: The paper was a real pleasure to read, with only a couple of points where it could be improved for the reader. It packs a lot of information in an easy-to-digest way; I learned a lot from reviewing it.

Relation to Prior Work: As far as I can see, prior work is carefully addressed in a nicely organized Related Work section. I have not worked on these specific problems myself (gaze modeling and sentence paraphrase/compression), but I have colleagues who do, I follow their work closely, and I saw nothing of significance missing.

Reproducibility: Yes

Additional Feedback: Lines 116-118: Did you compare BiLSTM+Transformer with a Transformer alone to validate this? What about Transformer variants with longer contexts, such as the Compressive Transformer? It would be good to be more precise about what is gained with the BiLSTM. Sec 3.2: Explain that h_i and s_j are computed according to the task-specific architectures given in Sec 4; I was a bit lost here wondering about those architectural details. What's the intuition behind the two score functions? Sec 4.1: This is where you should say what h_i and s_j mean for the two task networks. I can guess from the text, but it would be better to be explicit.

Review 3

Summary and Contributions: I like the paper in many ways, but it is based on a very fundamental, questionable premise: “Gaze has also been used to regularize neural attention layers via multi-task learning [3, 37]. To the best of our knowledge, however, no previous work has supervised NLP attention models by integrating human gaze predictions into neural attention layers.” This is, in fact, exactly what [3] does, and their proposal is closely related to the proposal here. While the proposal here multiplies in gaze-based attention weights, [3] uses attention (over LSTM states) as a regularizer. Arguably, the second proposal models gaze and NLP tasks “more” jointly. I therefore suggest the authors instead focus on the real merit of their work: using E-Z Reader for pre-training. This is novel. It would also be interesting to see a more direct comparison of the two approaches to integrating gaze information. Such a comparison should ideally cover some or most of the tasks and set-ups used in previous work.
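To make the distinction between the two approaches concrete, here is a minimal sketch in plain Python. It is not taken from the paper or from [3]. It assumes that "multiplying in" means scaling each attention score by a predicted saliency value before the softmax, and that the regularizer is a squared-error penalty between the attention weights and a human attention distribution. The function names, the penalty form, and the `weight` parameter are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaze_multiplied_attention(scores, saliency):
    """Integration (assumed form): scale each task attention score by the
    predicted gaze saliency of the same token, then normalize."""
    return softmax([s * g for s, g in zip(scores, saliency)])


def gaze_regularized_loss(task_loss, scores, human_attention, weight=0.1):
    """Regularization (assumed form): the attention weights themselves are
    computed without gaze; a penalty term pulls them toward human attention."""
    attn = softmax(scores)
    penalty = sum((a - h) ** 2 for a, h in zip(attn, human_attention))
    return task_loss + weight * penalty
```

In the first function gaze changes the forward pass directly, so it is needed at inference time as well. In the second, gaze only shapes the training objective. A direct comparison of the kind requested above would hold the task network fixed and swap only this component.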

Strengths: The experiments all make sense, are very thorough, and represent a substantial amount of work.

Weaknesses: Claims of novelty are false, at least in part, and relations with previous work are not adequately discussed.

Correctness: Methods, yes; claims less so, I feel.

Clarity: Yes.

Relation to Prior Work: No, this is the main shortcoming.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The paper proposes an E-Z Reader-based pre-training objective for utilizing human gaze information, which can potentially be used to improve a downstream NLP task. Additionally, the paper discusses how the pre-trained model's information can be injected into a task-specific network using attention mechanisms. Using empirical results on paraphrasing and sentence compression, the paper claims that utilizing gaze information through a pre-trained model helps achieve state-of-the-art results on both tasks. ### Update after reading author response ### I would like to thank the authors for answering most of my questions/concerns, in particular for providing details about the Li et al. [2019] experiment. I have revised my score accordingly.

Strengths: The paper has the following two novelties: (1) a new pre-training technique based on E-Z Reader data, which might be useful for many NLP tasks; (2) a simple yet intuitive way to add pre-trained gaze knowledge to the task-specific DNN using attention. The paper shows that the proposed TSM model's duration predictions are highly correlated with human gaze data, which potentially shows that E-Z Reader-based pre-training is a good proxy for utilizing gaze information in NLP models/tasks. Further, the paper provides an empirical evaluation using two public datasets for the paraphrasing and sentence compression tasks, claiming that such training helps in downstream tasks.

Weaknesses: There are multiple issues with the claims and evaluations presented in the paper. In particular, as a reader, I am not convinced that the reported gains are due to exploiting gaze information. 1. An improvement over SOTA?: For the paraphrasing task, the paper treats Patro et al. (2018) as SOTA, which is an outdated baseline. [Decom_para ACL19] is a better baseline for comparison. Given that the "No Fixation" method gives a 27.81 BLEU-4 score with 69M params, I doubt that the proposed model's 28.82 BLEU-4 score with 79M params is truly better than Patro et al. (2018)'s model. Ideally, the authors should report the performance of baseline models using the same number of parameters. Similarly, on the sentence compression task they should use a baseline with a similar number of parameters. With the current evaluation setup, it's not clear whether the gains can be attributed to the higher model capacity. 2. Evaluation: The paper reports only BLEU-4 scores for the paraphrase task. People often report multiple metrics to compare methods, as a 1-point improvement in BLEU (27.81->28.82) on a single dataset might not mean anything in general. Usually, people report other metrics such as METEOR and ROUGE along with BLEU for a fair evaluation. For a future revision of the paper, the authors can also consider using more accurate metrics such as [BERTScore ICLR20] and [BLEURT ACL20]. 3. Model architecture choice: What is the motivation for adding a Transformer layer after a BiLSTM in the text saliency model? The paper claims that this architecture allows it to better capture the sequential context, without quantifying what is meant by "better". A BiLSTM followed by n-layer Transformers is a non-standard NLP architecture, so the authors should describe what advantages it provides over a standard BiLSTM or a standard Transformer model. 4. Impact of pre-training on CNN and Daily Mail: Since the proposed models were pre-trained on CNN and Daily Mail and the baseline models are not pre-trained, it's not clear whether the gains are due to the model exploiting gaze information. We know that pre-training models on an unlabeled corpus leads to better generalization performance across NLP tasks. I am still not convinced that predicting fixation durations provides any advantage over a standard pre-training task such as masked language modeling. 5. Task/Dataset Choice: I think text summarization might be a good candidate to show the advantage of adding gaze information. Is there any particular reason for not considering that task? Also, to ensure that these techniques generalize, it's important to report numbers on more than one dataset for a given task. 6. Missing important implementation details: For the seq2seq model, the authors mention that they used greedy search. Is there any reason for not using standard beam search?
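The capacity concern in point 1 can be put in numbers with a short calculation. This sketch uses only the figures quoted above (27.81 vs. 28.82 BLEU-4, 69M vs. 79M parameters); the helper `relative_change` and the variable names are introduced here for illustration. It does not settle whether the gain comes from gaze or from capacity; it only shows the two changes side by side.

```python
def relative_change(baseline, new):
    """Fractional change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# Figures quoted in the review for the paraphrasing task.
bleu_no_fixation, bleu_proposed = 27.81, 28.82
params_no_fixation, params_proposed = 69e6, 79e6

bleu_gain = bleu_proposed - bleu_no_fixation
bleu_rel = relative_change(bleu_no_fixation, bleu_proposed)
params_rel = relative_change(params_no_fixation, params_proposed)

print(f"absolute BLEU-4 gain:        {bleu_gain:.2f}")
print(f"relative BLEU-4 gain:        {bleu_rel:.1%}")
print(f"relative parameter increase: {params_rel:.1%}")
```

The BLEU-4 score rises by 1.01 points (about 3.6% relative) while the parameter count rises by about 14.5%, which is why a parameter-matched baseline would make the comparison easier to interpret.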

Correctness: Yes, the proposed method and evaluations are correct.

Clarity: Yes. I was able to understand most of the paper without any issues.

Relation to Prior Work: The paper discusses prior work related to exploiting gaze attention; however, it doesn't discuss paraphrase models proposed in 2019 such as [Decom_para ACL19]. Refer to the Additional Feedback section for missing references. The paper identifies its main contribution in terms of adding gaze information using attention.

Reproducibility: No

Additional Feedback: Please refer to the Weaknesses section for suggestions/questions. References: 1. [Decom_para ACL19] Decomposable Neural Paraphrase Generation https://arxiv.org/abs/1906.09741 2. [BERTScore ICLR20] BERTScore: Evaluating Text Generation with BERT https://arxiv.org/abs/1904.09675 3. [BLEURT ACL20] BLEURT: Learning Robust Metrics for Text Generation https://arxiv.org/abs/2004.04696