Review for NeurIPS paper: CogLTX: Applying BERT to Long Texts

NeurIPS 2020

Review 1

Summary and Contributions: This paper addresses a problem that stems from the well-known quadratic space complexity of the attention mechanism. Because attention is ubiquitous in modern sequence models such as Transformers, handling longer sequences has become difficult. The authors propose an elegant retrieval-like method that integrates iterative "pondering" into the scoring of candidate subsequences. It relies on the assumption that a longer text contains a subsequence that is shorter than the length limit of pretrained Transformers and that is necessary and sufficient for performing the target task. Their method works as follows:
- A sequence Z is initialized with the question (for QA) or left empty (for text classification).
- Every subsequence is assigned a coarse score by a fine-tuned BERT judge after being concatenated with Z. Only as many high-scoring subsequences as fit in a window are kept.
- The high-scoring subsequences are embedded jointly and given a fine-grained score. The top k are kept and used to initialize Z again.
- This loop is repeated for I iterations.
- The final Z is passed to the BERT reasoner, which performs extractive QA or text classification.
The most enjoyable part of the paper is the very clever trick for obtaining relevance labels to train the BERT judge when the supporting sentences are not explicitly annotated (as in most QA datasets and in text classification): it exploits the difference between the BERT reasoner's loss with and without a given subsequence.
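The loop summarized above can be sketched in a few lines of Python. This is only an illustration of the description in this review, not the authors' implementation: the `judge` callable, the whitespace-token length measure, and the parameter names `window`, `k`, and `iterations` are assumptions, and a toy word-overlap scorer (`overlap_judge`) stands in for the fine-tuned BERT judge.

```python
from typing import Callable, List, Optional

# Hypothetical judge interface: (current key blocks, candidate block) -> score.
Judge = Callable[[List[str], str], float]


def length(blocks: List[str]) -> int:
    """Length in whitespace tokens (an assumption; BERT counts wordpieces)."""
    return sum(len(b.split()) for b in blocks)


def mem_recall(blocks: List[str], judge: Judge, query: Optional[str] = None,
               window: int = 20, k: int = 1, iterations: int = 2) -> List[str]:
    # Z starts with the question (QA) or empty (text classification).
    z = [query] if query else []
    for _ in range(iterations):
        candidates = [b for b in blocks if b not in z]
        # Coarse scoring: each candidate is scored against the current Z.
        ranked = sorted(candidates, key=lambda b: judge(z, b), reverse=True)
        # Keep only as many high-scoring blocks as fit in the window.
        kept: List[str] = []
        for b in ranked:
            if length(z + kept + [b]) <= window:
                kept.append(b)
        # Fine scoring: each kept block is rescored with the others as context.
        rescored = sorted(
            kept,
            key=lambda b: judge(z + [c for c in kept if c != b], b),
            reverse=True,
        )
        # The top k blocks are added to Z for the next iteration.
        z = z + rescored[:k]
    return z


def overlap_judge(context: List[str], block: str) -> float:
    """Toy stand-in for the BERT judge: counts words shared with the context."""
    context_words = {w.lower() for c in context for w in c.split()}
    return float(sum(w.lower() in context_words for w in block.split()))
```

For example, `mem_recall(blocks, overlap_judge, query="What is the capital of France")` would grow Z by one block per iteration, and the final Z would then be handed to the reasoner. A smaller `window` limits how many candidates survive the coarse step.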

Strengths: Quadratic space complexity is traded for linear time complexity, which is nice. The method is well thought out and yields clear improvements on the datasets considered in the evaluation. The trick used for the unsupervised training of the judge is worth at least as much as the rest of the paper. It is also good to see that the authors were inspired by well-known principles from cognitive science, so that CogLTX is also theoretically motivated.
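The label-free training signal for the judge can be illustrated with a small sketch. It assumes only what this review states, namely that relevance is inferred from the difference in reasoner loss with and without a block. The `reasoner_loss` callable, the `threshold` value, and the toy loss are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List


def label_by_intervention(z: List[str],
                          reasoner_loss: Callable[[List[str]], float],
                          threshold: float = 0.5) -> Dict[str, bool]:
    """Mark a block as relevant if removing it raises the loss by more than threshold."""
    base = reasoner_loss(z)
    labels: Dict[str, bool] = {}
    for i, block in enumerate(z):
        without = z[:i] + z[i + 1:]  # Z with this one block removed
        labels[block] = (reasoner_loss(without) - base) > threshold
    return labels


def toy_loss(blocks: List[str]) -> float:
    """Toy reasoner loss (assumption): low only if the evidence word is present."""
    return 0.1 if any("capital" in b for b in blocks) else 2.0
```

With `toy_loss`, a block containing the evidence is labeled relevant because dropping it raises the loss, while an unrelated block leaves the loss unchanged and is labeled irrelevant.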

Weaknesses:
1) The assumption that within a longer text there exists a subsequence shorter than the length limit of pretrained Transformers that is necessary and sufficient for performing a target task is quite strong. It holds more or less for QA, but will it generalize to long-document summarization, for example?
2) The time complexity is identical to that of the sliding window approach, factoring out the number of iterative steps. However, there is quite a bit of overhead due to the iterative MemRecall mechanism. This overhead is only evaluated in Figure 5 with a batch size of 1, if I understand correctly. If so, I would have appreciated results with a more realistic number of samples per batch.
3) The Longformer should have been used in the experiments as well as RoBERTa, as it has a space complexity of O(n log n). I am not convinced by the claim of the author(s) that this is completely orthogonal to their contribution. If comparable figures could be reached, this would weaken their empirical contribution.
4) The Reformer paper is referenced once in the introduction, without elaborating on it. I don't think the authors would be expected to compare against it, since no pre-trained checkpoint has been released, but I would have liked a discussion of it.
5) No qualitative assessment of the MemRecall mechanism is provided. It would have been nice to see what the unsupervised training of the BERT judge is selecting, especially for text classification, where there are no support sentences (in which the answer occurs). This could also be interesting for interpretability, a topic which is completely unexplored.
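To make the space argument behind points 2) and 3) concrete, the following back-of-the-envelope sketch counts attention score entries. It is a simplification under stated assumptions: one head and one layer, non-overlapping windows, and no accounting for the extra judge passes of MemRecall. The function names are illustrative only.

```python
import math
from typing import Tuple


def full_attention_entries(n: int) -> int:
    """Score-matrix entries when all n tokens attend to each other."""
    return n * n


def windowed_attention_entries(n: int, window: int) -> Tuple[int, int]:
    """Return (total entries over all windows, peak entries held at once)."""
    num_windows = math.ceil(n / window)
    per_window = window * window
    return num_windows * per_window, per_window


# A 4096-token text processed in 512-token windows.
total, peak = windowed_attention_entries(4096, 512)
print(full_attention_entries(4096), total, peak)
```

The total work grows linearly with the text length for a fixed window, while the peak memory depends only on the window size. The overhead the review asks about would come on top of this count.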

Correctness: Yes, as far as I can tell.

Clarity: No. It took me quite a while to figure out what the authors are doing, as many things are described very hastily, especially in the method exposition. The figures, though they could certainly be made clearer, help quite a bit and are essential to understanding the method. This should not be the case. Moreover, there are a number of typos and grammatical mistakes. The past participle of 'break' is 'broken', not 'breaked'. 'RoBERTa' is consistently misspelled as 'Robert' or 'Roberta'.

Relation to Prior Work: Yes, with the caveats mentioned in points 3) and 4) of the Weaknesses section.

Reproducibility: No

Additional Feedback: I would also explore what the approach offers for interpretability.

Review 2

Summary and Contributions: This paper considers applying BERT to long texts by finding the key sentences related to a target sentence. That is, among the many sentences in a document, it extracts the relevant context for a given sentence and uses it as the input, which limits the input size and the memory consumption.

Strengths: - The approach is simple and sound. - Finding the right context is an important problem, and this approach solves it systematically, in a way that can be applied to diverse corpora by dividing them into three types of tasks. - The evaluation is thorough and shows significant improvements on many tasks.

Weaknesses: - Training the judge without labels can be expensive because of the trial-and-error search for relevant sentences. - The sufficient condition in Eq. (6) is not explained. - More qualitative analysis could be added to the evaluation. For example, more example outputs of the judge for each task would be useful for understanding its behavior. An accuracy breakdown by document length (e.g., a histogram) would also be useful.
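The accuracy breakdown by document length suggested above could be computed along the following lines. This is a minimal sketch: the function name, the `bin_size` of 512 tokens, and the input format (one length and one correctness flag per document) are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_length(lengths: Iterable[int], correct: Iterable[bool],
                       bin_size: int = 512) -> Dict[Tuple[int, int], float]:
    """Group documents into length bins and return the accuracy per bin."""
    totals = defaultdict(lambda: [0, 0])  # bin start -> [hits, count]
    for n, ok in zip(lengths, correct):
        start = (n // bin_size) * bin_size
        totals[start][0] += int(ok)
        totals[start][1] += 1
    return {
        (start, start + bin_size): hits / count
        for start, (hits, count) in sorted(totals.items())
    }
```

Each key is a half-open length range `[start, start + bin_size)`, so the result can be plotted directly as a histogram of accuracy against document length.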

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Some prior work on long-distance context is missing. For example, "Unsupervised Sentence Embedding Using Document Structure-based Context" (ECML 2019) leverages document structure as a cue to find context. I wonder whether the proposed approach actually extracts something similar to what can be extracted from document structure information.

Reproducibility: Yes

Additional Feedback:
------------After rebuttal---------------
I appreciate the authors' detailed response to many of the concerns. Assuming the authors reflect these and the other comments from the reviews in the camera-ready version, I would still accept this paper.
-----------------------------------------------
- Please see the Weaknesses section above.
- Initially, z+ passed to the judge contains all sentences. How do you avoid the memory problem with the judge?
- "We hypothesize that the first sentence (the lead) and the last sentence (the conclusion) are usually the most informative parts in news articles." Can a document-structure-based approach, or even a "first and last sentences" baseline, perform similarly?
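The "first and last sentences" baseline raised in the last question could look roughly like this. It is a hypothetical sketch, not something described in the paper: sentences are taken alternately from the start and the end of the document until a token budget is exhausted, with length measured in whitespace tokens as a simplifying assumption.

```python
from typing import List


def lead_and_last(sentences: List[str], budget: int) -> List[str]:
    """Alternately take sentences from the front and the back within a token budget."""
    chosen = set()
    used = 0
    front, back = 0, len(sentences) - 1
    take_front = True
    while front <= back:
        idx = front if take_front else back
        cost = len(sentences[idx].split())
        if used + cost > budget:
            break  # stop at the first sentence that does not fit
        chosen.add(idx)
        used += cost
        if take_front:
            front += 1
        else:
            back -= 1
        take_front = not take_front
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

The selected sentences would be fed to the reasoner in place of the blocks chosen by the judge, which gives a cheap reference point for how much the learned selection adds.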

Review 3

Summary and Contributions: The paper proposes a model that, given a long text, its blocks/sentences, and their relevance judgements (which can be initialized from IR methods), iteratively refines the relevance scores and finds supporting sentences for a downstream task. The model uses two BERT models: a "judge" for scoring and a "reasoner" for, e.g., answering. The main contributions are the iterative fine-tuning/training procedure and good results on a number of datasets and tasks.

Strengths: The model is well evaluated, including ablation studies of the model features. The iterative application of retrieval/reasoning using pretrained (BERT) models seems to be novel.

Weaknesses: Some important related work is missing. Please discuss the relation of your model to "Latent Retrieval for Weakly Supervised Open Domain Question Answering" (ACL 2019). That work performs single-step paragraph retrieval and reasoning and was applied to a different set of tasks, whereas this work runs multiple iterations of similar retrieval/reasoning, operates on sentences, and is applied to different datasets (the application to multi-hop datasets is nice). Still, the former would have been a good benchmark.

Correctness: Looks good.

Clarity: The paper is well written.

Relation to Prior Work: The paper discusses previous work but could do so in more detail. It would really benefit from a discussion of recent open-domain QA work (e.g., the one mentioned above) and of sentence extraction/summarization, and from a more detailed discussion of recent methods for scaling attention to long sequences (some of which are cited, e.g., Reformer).

Reproducibility: Yes

Additional Feedback: - Please add more details to the paper on the relevance scores; are they binary? How exactly are they updated?

Review 4

Summary and Contributions:
################## After rebuttal ###############################
I thank the authors for their response. I still have some concerns about the baselines, which the authors could address to improve the paper. For Table 3 (20NewsGroups), it would be better to offer a BERT-based version of CogLTX for a fair comparison with the baselines, since RoBERTa significantly outperforms BERT on the GLUE leaderboard (88.1 vs 80.5). For Table 4 (A+), why does the sliding window use BERT while CogLTX uses RoBERTa? It would be better for the sliding window to use RoBERTa too. On the other hand, for Table 2 (HotpotQA), I am convinced by the comparison between Longformer and CogLTX on HotpotQA (69.5 vs 69.2) discussed in the rebuttal. I raise my overall score from 5 to 6.
#########################################################
The paper introduces a method called CogLTX for applying BERT/RoBERTa to long texts. Experiments show that the proposed method outperforms BERT sliding-window methods but underperforms SOTA models.

Strengths: 1. Figure 3 is really clear and helped me understand the main idea of the paper. 2. Experimental results on different tasks are shown. 3. The paper proposes unsupervised training of the judge to handle tasks without relevance labels.

Weaknesses: 1. The main idea is similar to SAE, as the authors themselves say in the paper. SAE scores paragraphs and CogLTX scores sentences. Compared with SAE, CogLTX is a more complex and fine-grained method, but its performance is worse. It would be better to compare the memory, computation, and inference latency of SAE and CogLTX. 2. The paper shows experimental results on 4 tasks. However, except for HotpotQA, the baselines on the other three tasks are not strong: they either use a weaker backbone (e.g., [29] in Table 3 uses BERT, whereas CogLTX uses RoBERTa, which is much better than BERT) or are just sliding-window methods (Table 1 and Table 4). The backbones of the baselines and of CogLTX should be stated clearly. For HotpotQA (Table 2), SAE can be seen as a fair baseline for CogLTX, but CogLTX performs worse than SAE by a non-negligible margin (69.21 vs 71.45).

Correctness: Yes

Clarity: Yes, I like Figure 3.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: