NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 41
Title: Stand-Alone Self-Attention in Vision Models

Reviewer 1

This paper proposes building deep neural networks for computer vision entirely from self-attention layers. A local self-attention layer is proposed to overcome the limitation of global attention layers, which must be applied to reduced versions of the input image because of their computational load. This self-attention layer is used to replace all the convolutional layers in ResNet architectures. The proposed model is compared to standard ResNet CNNs on the ImageNet and COCO datasets. Using self-attention layers reduces the number of parameters of the model while maintaining the performance level. Ablation studies are conducted on different aspects of the proposed model.

The goal of replacing convolutions with local self-attention is somewhat in contradiction with the initial objective of using attention layers. As noted by the authors, attention layers were introduced to take into account long-term dependencies, which convolution layers could not do. An alternative direction is to use recurrent layers. Reducing self-attention to a local attention layer disables this long-term dependency effect. Moreover, using multiple heads makes self-attention even closer to convolution: it is local and learns multiple representations of the input. The paper could have studied more deeply the differences, both computationally and in terms of expected properties, between local self-attention layers and convolutional layers. In the end, it seems that the parameters of the convolutional layer (the different kernel parameters) have been replaced by the parameters of the linear transformations Wq, Wk and Wv.

The importance of positional features, shown in Table 5, could also have been explored further, because this is one of the main differences from CNNs. Why is it important for self-attention? Is it also useful for CNNs on images? (It has been used for CNNs on language.) Results in Table 6 are really surprising. One could consider that using only the positional interaction is a degenerate form of convolution, where each position in the kernel is encoded with a different set of parameters.
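
To make the comparison with convolution concrete, here is a minimal pure-Python sketch of a local self-attention output at a single pixel. It is an illustration, not the authors' implementation. The function names, the border policy (out-of-range neighbors are skipped), and the form of the positional term (the query dotted with a learned vector per relative offset) are all assumptions made for this sketch.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def local_attention_pixel(image, i, j, k, Wq, Wk, Wv, rel=None):
    """Attention output at pixel (i, j) over its k x k neighborhood.

    image: height x width grid of feature vectors (lists of floats).
    rel:   optional dict mapping (row offset, col offset) to a vector with
           the same length as the query (hypothetical positional term).
    Out-of-range neighbors are skipped (an assumed border policy).
    """
    height, width = len(image), len(image[0])
    query = matvec(Wq, image[i][j])
    half = k // 2
    logits, values = [], []
    for da in range(-half, half + 1):
        for db in range(-half, half + 1):
            a, b = i + da, j + db
            if not (0 <= a < height and 0 <= b < width):
                continue
            # Content interaction: query against the neighbor's key.
            logit = dot(query, matvec(Wk, image[a][b]))
            # Positional interaction: query against the offset's vector.
            if rel is not None:
                logit += dot(query, rel[(da, db)])
            logits.append(logit)
            values.append(matvec(Wv, image[a][b]))
    # Numerically stable softmax over the neighborhood.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    d_out = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_out)]
```

In this sketch the mixing weights over the neighborhood are recomputed from the query at every pixel, whereas a convolution applies the same fixed kernel everywhere. If the content term is dropped, the weights depend only on the query and the relative offset, which is the case the "degenerate convolution" remark above refers to.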

Reviewer 2

This paper addresses the question of whether self-attention can be used as a stand-alone primitive for many vision tasks. It provides a clear answer and demonstrates, through methodological and empirical analyses, that self-attention can be used as a stand-alone primitive. It also provides specific analyses, including several ablations, showing under what circumstances attention underperforms or outperforms convolution, and under what circumstances an optimal combination of the two delivers the best performance. Empirical results are obtained using ImageNet for classification and COCO for object detection.

The paper reads well and provides a good balance of methodological and empirical analyses and discussions. The findings and conclusions are interesting but not surprising. They are useful for computer vision practitioners and researchers. In this sense, the work has both academic and societal impact. Overall, I like this work and I am confident that the community will benefit from its findings and conclusions.

I have a few minor questions on this work.
- Presumably, convolution-stem + attention should deliver the best performance. Why, in Table 1, is full attention better than convolution-stem + attention for ResNet-50? Similar observations can be made in Figure 3 for certain scenarios.
- The conclusion that enlarging the spatial extent k in attention improves performance but plateaus at 11x11 is based on the observations in Table 4. I wonder whether this conclusion is premature: what if you continue enlarging the spatial extent? Would the performance drop after 11x11? Or plateau? Or increase again?
- In general, the conclusion that convolution is good at capturing low-level features while attention is good at higher-level ones is probably valid for all “natural” images like those in ImageNet and COCO. What if you have binary/illusory/sketch images where you may need attention in the first place?

The presentation of the work is acceptable, but there are grammatical errors and typos. Also, most of the references were somehow missing from the paper.

The above was my initial review. I read the authors' response and am happy with their answers. I stay with my original review.

Reviewer 3

The paper addresses the problem of replacing convolutions with self-attention layers in vision models. This is done by devising a new stand-alone self-attention layer, which borrows ideas from both convolution and self-attention. Like convolution, it works on a neighborhood of the image, but replaces dot operations with self-attention operations. Unlike convolution, this layer features a significant reduction in the number of parameters and in computational complexity, and its parameter count is independent of the size of the spatial extent. As in sequence modelling, the authors employ relative position embeddings, on both rows and columns of the neighborhood.

Experiments are carried out on (almost) state-of-the-art classification and detection architectures, where the authors replace convolutional blocks with their self-attention layer and use average pooling with stride for spatial downsampling. As their experiments show that self-attention is not effective in replacing the convolution stem, they also build a replacement for stem layers, devised by injecting distance-based information into the computation of the attention values. The rest of the experimental evaluation ablates the architectures by considering the role of the convolution stem, the network width, the set of layers that are replaced with self-attention, and the spatial extent of the self-attention.

Overall, the paper introduces a novel and significant idea. Investigating the role of self-attention as a stand-alone layer is a fundamental research question given the recent advances in both vision and sequence processing. The paper is also well written and clear. On the other hand, a deeper investigation of what self-attention can learn with respect to convolution would have been useful, but is left for future work. The provided experiments underline that the new layer can significantly reduce the number of parameters and the computational complexity; the result is therefore definitely interesting by itself and opens possibilities for future research.
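
The parameter-count remark above can be illustrated with a rough back-of-the-envelope sketch. The function names are hypothetical, biases are ignored, and the attention count covers only the three projections Wq, Wk and Wv; relative position embeddings are left out, and they would add a further term that depends on how they are parameterized.

```python
def conv_params(k, d_in, d_out):
    """Weights in a k x k convolution (biases ignored)."""
    return k * k * d_in * d_out


def local_attention_params(k, d_in, d_out):
    """Weights in the Wq, Wk, Wv projections (an assumed simplification).

    k is accepted only to make the point explicit: it does not appear in
    the count, because the projections are applied per pixel.
    """
    return 3 * d_in * d_out


for k in (3, 7, 11):
    print(k, conv_params(k, 64, 64), local_attention_params(k, 64, 64))
```

Under these assumptions the convolution count grows quadratically with the spatial extent k, while the projection count stays fixed.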