Review for NeurIPS paper: SMYRF - Efficient Attention using Asymmetric Clustering

NeurIPS 2020

SMYRF - Efficient Attention using Asymmetric Clustering

Review 1

Summary and Contributions: The paper describes a method for reducing the size of attention maps using LSH based hashing. As motivation for the approach, the authors describe that attention maps of trained models are usually rather sparse and that mapping keys and values to cluster, innerproducts between the corresponding clusters can replace the products of keys and values. The paper also shows that finding an optimal assignment is NP-hard and thus, heuristic approaches must be persued. To cluster the data key and values are mapped into a metric space where finding maximal inner products corresponds to kNN search. After applying an assysmtric local sensitive hash function keys and values can be partitioned into L equal sized clusters. The method can be applied to replace pretrained dense attention layers and can be used for training from scratch. However, this only seems to work well given that the dense attention map has a certain level of sparsity and the number of clusters is sufficiently large enough. These results show performance can be kept up even under considerable memory decrease.

Strengths: The paper addresses a major performance problem of attentions layers which is the quadratic size of attention maps. Results indicate that proposed method is capable to reduce memory requirements and keep up performance on two state-of-the-art architectures BigGAN and BERT. A final advantage of the method that it can be used for from scratch training. However, as mentioned in the appendix this is often problematic due to the dense nature of attention maps in early state of training.

Weaknesses: The evaluation of the method does not compare to any other method for reducing the memory requirements of attention layers like the closely related Reformer. The authors provided the comparison in the author feedback and they should add this analysis to the final version if accepted.

Correctness: I did not find any flaws in the underlying theoretic resutls.

Clarity: The paper is well understandable. However, for following the details reading the appendix helped me a lot.

Relation to Prior Work: The paper sufficiently highlights it differences to previous works. Comparison to the related methods were provided in the author feedback and should be included in the paper.

Reproducibility: Yes

Additional Feedback: ==================== post feedback ==================== I followed the author feedback as mentioned before I would like to see the providedd comparision to previous work in the paper.

Review 2

Summary and Contributions: The authors propose a novel type of balanced clustering algorithm to approximate attention. It uses Locality Sensitive Hashing (LSH) in a novel way by defining new Asymmetric transformations and an adaptive scheme that produces balanced clusters. The method can be directly used for pre-trained models and achieves competitive/better performance with BigGAN/BERT/RoBERTa by shrinking 50% memory.

Strengths: 1. The proposed model can save memory for both training and inference. The experiments are significant compared to the memory usage of BigGAN and BERT. 2. The proposed model can be widely applied for both CV and NLP tasks. 3. The analysis of the number of queries per cluster in the proposed model is quite interesting.

Weaknesses: 1. There is no ablation study for Novel Asymmetric Transformations. It unclear whether the proposed method is better than only using Locality Sensitive Hashing (LSH). 2. The proposed Adaptive Clustering still relies on LSH. Although there is some further optimisation on E2LSH to create balanced clusters, the method is still quite similar to Re-Former work. The authors need to have some fair comparison with Re-Former or LSH. 3. The proposed method can not significantly boost the performance. I would like to see whether the method can be applied on longer sequence which can not be encoded by BERT, such as LongFormer work. For the models of BERT/RoBERTa on GLUE, it seems no need to save memory for training. 4. I would like to see some analysis of reversible Transformer/CNN (reformer) and distillation works which can also save memory for either training or inference.

Correctness: Yes

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: [EDIT: thank you for the response, for clarifying the proof, and for the additional experiments. I am increasing my score.] This paper presents an asymmetric LHS clustering strategy for speeding up neural attention to O(N log N). The key property is the "drop-in replacement" nature of the proposed layer, decently preserving performance when replacing a dense attention module with no fine tuning.

Strengths: - Drop-in nature makes proposed method much more useful than alternatives. - Strong significance: Efficient attention is highly relevant and sought after at the moment. - Evaluation with great results on image&text data in multiple regimes: with/without fine-tuning, training from scratch, several compression levels.

Weaknesses: - Experiments do not compare against related efficient attention models (reformer, routing transformer, longformer) even in settings where the comparison would be possible. - Phrasing and proof of Theorem 1 is a bit confusing and seems of little impact to the overall goals of the paper.

Correctness: To my assessment the proposed algorithm is correct. The "catch" seems to be knowing the max query and key norms up front, which is not available in the standard (A)LSH streaming setup, but is perfectly fine for attention. Some small doubts about Theorem 1 below, but IMO this result is not required for this paper. - Calling Theorem 1 and the associated problem "approximate attention clustering" is a bit confusing because it sounds like _approximating_ the optimization problem is NP-hard, rather than solving it exactly. Why not just "attention clustering" or "attention co-clustering", if connections to co-clustering turn out to be relevant? - In Lemma 1, it's not clear to me that the proofs works for eps=0, since you would need to subtract infinities, and in fact this seems to be the only reason you need to introduce eps at all, rather than to just mask everything to -inf. However, you need to set eps=0 to get to Theorem 1. Please check. - Right before line 88 (supplementary material) I believe you forgot a minus sign and the last line (as well as the eqns on line 90 and 91) should be max, not min?

Clarity: The paper is mostly clear and easy to read, with some minor concerns brought below. (4/5) - Figure 1 does not bring much clarity IMO and the letters in the circles are too small to read. Perhaps it's clearer to show a clustered attention weight matrix with some markings (or permuted row-cols)? - Expression (2) is a bit hard to read. It may be easier to define cluster assignments as a function from Q u K to cluster indices, rather than this reverse formulation. (See also notations in literature on biclustering.) - You need L=O(N) to get NlogN complexity, but the paper is not clear about how L is chosen. From the code & supp it seems to me you parametrize the *cluster size*, please clarify this in the main paper.

Relation to Prior Work: - Relationship to efficient attention is very clearly discussed. - Performance is not benchmarked against the previous contributions. - Missing connections to the literature on biclustering / co-clustering, which seem to deal with similar optimization problems -- please check and highlight this connection. For instance: Biclustering of Expression Data, Y. Cheng & G.M.Church, ISMB-AAAI, 2000.

Reproducibility: Yes

Additional Feedback: This paper seems to be a valuable contribution to the "efficient attention" body of work, specifically appealing for the drop-in potential. It is hard to say whether any performance is lost due to this drop-in potential: comparisons against e.g. Reformer would address this. On line 40 you point out that Star Transformer prevents causal masking. Does your architecture support causal masking? I appreciate your discussion of broader impact, especially the final concern that block-wise attention may introduce some hard-to-characterize bias. This seems worth thinking more about. Presentation and typos: - Table 2 is too wide - L245 "is drop-in" -> "is a drop-in" - L253 tight -> ties (? not sure what you meant) - L273 its' -> its - Check capitalization in references (e.g. lsh (alsh) is lowercase) - What is q_D, Q_C_D and q_C in the equation between lines 87-88 of the appendix? - The plots in the supplementary material are very low resolution, please use vector graphics (e.g. pdf). Plots like Fig. 3/4 in Supplementary should be log-scaled probably.

Review 4

Summary and Contributions: The paper proposes a balanced clustering method to approximate attention based on Asymmetric LSH and an adaptive clustering scheme. The work also proves that the underlying optimization problem is NP-hard.

Strengths: - The model does not require changes to attention. - The memory and speed are improved compared with the original self-attention module.

Weaknesses: - The results of the BERT experiments are not strong enough. Most gains come from the RTE that is a very small dataset. However, on the most important task MNLI, the performance degrades significantly. In terms of performance, it seems that knowledge distillation outperforms the proposed method, with the same reduction of memory size. - The speedup in terms of detailed GPU hours shouls also be reported in the tables. - The previous attention clustering methods are not throughly compared with in the experiments. - It's unclear how GPU friendly the method is. - More analysis would help readers to understand the limitation of the proposed method. - The method is evaluated for fine-tuned BERT models. It's more insightful to show that the proposed method can also work well in the pre-training setting.

Correctness: - The experiments should also be evaulated on the large-size BERT, where the attention approximation should be harder. It's unclear whether the proposed method performs worse along with the increased model size. - The results of the BERT experiments are not strong enough. Most gains come from the RTE that is a very small dataset. However, on the most important task MNLI, the performance degrades significantly. In terms of performance, it seems that knowledge distillation outperforms the proposed method, with the same reduction of memory size.

Clarity: The presentation is easy to follow in general. The font size in Figure 1 is too tiny.

Relation to Prior Work: - The previous attention clustering methods are not throughly compared with in the experiments. [1] Efficient Content-Based Sparse Attention with Routing Transformers [2] Reformer: The efficient transformer.

Reproducibility: Yes

Additional Feedback: ====after author response I've read the author response and increased my score to 6.