Summary and Contributions: The BigBird model proposed in this paper uses sparse attention to handle long sequences without increasing hardware requirements. The results on various NLP tasks support their method.
Strengths: - Modeling long-range dependencies in text is challenging for BERT-based models. The sparse attention proposed in this paper is quite interesting; in particular, the authors claim that it can handle sequences up to 8x longer than previous works. - They also provide theoretical results to support their sparse attention. - The experiments on QA and document classification look quite good (compared with state-of-the-art methods).
Weaknesses: I quite agree that modeling long text is challenging for the current Transformer (Vaswani et al.). One of the inspirations of this work is "locality of reference", which assumes that a token can largely be derived from its neighboring tokens. However, in a document with a series of paragraphs, a token may sometimes be related to a sentence in another paragraph. I think this is a weakness of the sliding window in BigBird. Did the authors conduct a speed (inference time) comparison between BigBird and other methods?
Correctness: The method and experiments of this work are solid and convincing.
Clarity: This paper is well-written.
Relation to Prior Work: This paper provides a solid comparison with other methods.
Reproducibility: Yes
Additional Feedback: The authors' feedback resolves my questions; I maintain my recommendation.
Summary and Contributions: The authors point out that full self-attention has computational and memory requirements that are quadratic in the sequence length. They propose a sparse attention mechanism that improves performance on a multitude of tasks requiring long contexts. Meanwhile, they prove that the proposed BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
Strengths: The overall motivation and the theoretical part of the paper are detailed and clear. The proposed sparse attention mechanism improves performance on a multitude of tasks that require long contexts and satisfies all the known theoretical properties of the full transformer. Moreover, the experiments demonstrating the performance of the model are also sufficient.
Weaknesses: The authors select three representative NLP tasks to showcase the benefits of modeling longer input sequences. However, there is no experiment showing whether the model performs as well as other models on short text.
Correctness: The claims, method and empirical methodology are all correct.
Clarity: This paper is well organized and clearly described.
Relation to Prior Work: Yes
Reproducibility: Yes
Additional Feedback: This is good work. The paper is well organized and clearly described, and its overall motivation and theoretical part are detailed. However, I suggest that the experiment section be enriched, as noted above.
Summary and Contributions: The authors propose a sparse attention mechanism that reduces the quadratic dependency on sequence length in self-attention. The model combines three types of attention: global attention on fixed positions, local attention within a sliding window, and attention on random positions. The authors prove that the proposed method preserves the properties of the full attention model and achieves SOTA on a variety of NLP tasks.
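To make the three patterns concrete, here is a minimal NumPy sketch of how such a combined attention mask could be assembled; the parameter names and sizes are illustrative assumptions of mine and this is not the authors' blocked implementation.

```python
# Illustrative sketch only (not the paper's blocked implementation):
# combine global, sliding-window, and random attention into one boolean mask.
import numpy as np

def sparse_attention_mask(seq_len=16, window=3, num_global=2, num_random=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # 1) Local attention: each token attends to a sliding window around itself.
    for i in range(seq_len):
        lo, hi = max(0, i - window // 2), min(seq_len, i + window // 2 + 1)
        mask[i, lo:hi] = True

    # 2) Global attention: a few fixed positions attend to, and are attended
    #    by, every token (e.g. [CLS]-like tokens).
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # 3) Random attention: each token additionally attends to a few random positions.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask  # mask[i, j] == True means query i may attend to key j

# Per-row counts stay around window + num_global + num_random, not seq_len.
print(sparse_attention_mask().sum(axis=1))
```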
Strengths: 1. The authors provide a theoretical analysis of the sparse attention mechanism, which is interesting and useful for further work in this direction. 2. It is interesting to see that random attention alone can also work on the SQuAD and MNLI tasks. 3. The experimental results are quite solid. The proposed model achieves SOTA on a variety of NLP tasks covering multi-hop QA, QA with longer context, and document classification. It is also tested on genomics data.
Weaknesses: 1. Although the random attention in BigBird is interesting, the global attention and the local attention in a sliding window are not novel and are similar to the Sparse Transformer (Generating Long Sequences with Sparse Transformers). 2. The model does not work well on short-answer extraction on the Natural Questions dataset.
Correctness: Yes
Clarity: Yes
Relation to Prior Work: Yes
Reproducibility: Yes
Additional Feedback: I don't agree with "reduces this quadratic dependency to linear" in line 4. It does not seem strictly linear, since it also depends on the window size. Please clarify this. Typo: line 155, "turning complete" should be "Turing complete".
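To state the concern precisely (notation mine: assume each query attends to a window of size w, g global tokens, and r random positions), the cost is roughly

```latex
\underbrace{n}_{\text{queries}} \times \underbrace{(w + g + r)}_{\text{keys per query}}
\;=\; O\bigl(n\,(w + g + r)\bigr),
```

which is linear in n only if w, g, and r are treated as constants independent of the sequence length; if the window grows with n, the cost is no longer linear.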