This paper addresses the quadratic bottleneck of self-attention in the transformer architecture. It proposes Sparse Adaptive Connection (SAC), a model that learns to predict sparse connections (attention links) between inputs, so that attention is computed only over those predicted links. The proposed method is competitive with state-of-the-art models on WMT, language modeling, and image classification tasks while significantly reducing memory cost.

Overall, three of the four reviewers seem to have liked the paper, although they had some concerns (below), while one reviewer (R3) recommended weak rejection. A weakness pointed out by R2 and R3 is that only accuracy is reported, not speed, which seems necessary to support the title "Accelerating Self-Attention". The authors promised to add more details about computational efficiency and memory cost in the final version, and I urge them to do so.

While I am a bit on the fence about this paper, I still find the proposed approach interesting and therefore recommend acceptance. However, I strongly urge the authors to follow the reviewers' recommendations to improve the paper. Besides the weakness mentioned above, two reviewers (R2 and R4) pointed out missing related work; in particular, R4 says a comparison against Correia et al. (2019)'s "Adaptively Sparse Transformers" is needed. This is indeed an important piece of missing related work, and a discussion of it in the context of the proposed approach should be added. While an empirical comparison would be nice, I don't think it is crucial, given that the goal of Correia et al. is not to accelerate transformers (e.g., their method still takes quadratic time). I urge the authors to cite this work, along with the two citations suggested by R2 in the context of relative position encoding.
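To make the sparse-links idea concrete, here is a minimal illustrative sketch (not the authors' implementation): each query attends only to a predicted subset of positions, so the cost scales with the number of links rather than quadratically. The function name and the `links` interface are hypothetical, and the learned link predictor itself is not shown.

```python
import numpy as np

def sparse_attention(q, k, v, links):
    """Attend each query only to its predicted links.

    q, k, v: (n, d) arrays. links: list of index arrays, where links[i]
    holds the key positions query i may attend to (hypothetical interface;
    in SAC these links are predicted by a learned model, omitted here).
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for i, idx in enumerate(links):
        # Scores computed only over the linked keys, not all n positions.
        scores = q[i] @ k[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        # Per-query cost is O(|links[i]| * d) instead of O(n * d).
        out[i] = w @ v[idx]
    return out
```

When every query is linked to every position, this reduces to ordinary dense softmax attention; the savings come from keeping each `links[i]` small.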