NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 354
Title: Saccader: Improving Accuracy of Hard Attention Models for Vision

Reviewer 1

This paper addresses the problem of training hard-attention mechanisms for image classification. To do so, it introduces a new hard-attention layer (called a Saccader cell) together with a pretraining procedure that improves performance. More importantly, the authors show that the approach is more interpretable, requiring fewer glimpses than other methods, while outperforming similar approaches and coming close in performance to non-interpretable models such as ResNet.

Originality: The proposed Saccader model is original and compares favorably to state-of-the-art work in terms of performance and, more importantly, interpretability. Related work is cited adequately. However, it is not clear from the paper what the main technical differences are between Saccader and its main competitor, DRAM.

Quality: Experimental results show how the Saccader model outperforms comparable and state-of-the-art models. Indeed, the figures let us see the differences in accuracy and in image coverage; the latter is quite informative for the interpretability claims of the paper. However, no weaknesses of the work have been noted. In particular, while the results are important, no ablation study has been made of the Saccader cell and the attention network. These two components contain several sub-components tied together, and it is not clear that they are all necessary. Furthermore, it is not clear why the attention network is needed at all. Could the additional parameters of the attention network account for the performance gain of the Saccader model over the DRAM model?

Clarity: The paper is well organized, and sections follow the usual order of NeurIPS papers. Small comments: In Section 3.1, item 2, it is not clear why the "what" and "where" features are named this way. In Section 3.1, item 3, at that point in the paper, the concept of time $t$ has not yet been motivated. One or two sentences roughly explaining the reinforcement learning part of the paper at that point might make it clearer.

Significance: In addition to what has been noted in the Contributions section, while the Saccader cell and its pretraining procedure were designed for convolutional networks, it is a safe bet that this cell can and will be used beyond computer vision, in tasks such as NLP and few-shot learning.
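The "time $t$" confusion noted above comes from the sequential nature of hard attention: one location is chosen per step, over T steps. A toy sketch of that loop (all shapes and names hypothetical, not the authors' code; greedy argmax stands in for the learned policy):

```python
import numpy as np

def hard_attention_classify(logits_grid, policy_scores, T=6):
    """Toy sketch of sequential hard attention (illustrative only).

    logits_grid:   (H, W, C) per-location class logits ("what").
    policy_scores: (T, H, W) per-step location scores ("where");
                   the time index t exists because one location is
                   chosen per step, for T steps in total.
    """
    H, W, C = logits_grid.shape
    picked = []
    for t in range(T):
        flat = policy_scores[t].argmax()   # hard (greedy) choice at step t
        i, j = divmod(flat, W)
        picked.append(logits_grid[i, j])   # keep only the chosen glimpse
    return np.mean(picked, axis=0)         # average class logits over T glimpses

rng = np.random.default_rng(0)
probs = hard_attention_classify(rng.normal(size=(7, 7, 10)),
                                rng.normal(size=(6, 7, 7)))
```

In the real model the hard choice is non-differentiable, which is why reinforcement-learning-style training enters the picture.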

Reviewer 2

This paper proposes a hard attention model named Saccader, together with a pretraining procedure for efficient training of the model. The network is pretrained in two parts using self-supervision, and the whole network is trained after that. The use of hard attention allows the image to be understood from only a small portion of the original pixels, and also reveals which part of the image is useful for classification. The authors experiment on ImageNet and discuss possible applications to other image-based tasks.

Obtaining a good hard attention model is intriguing because of the computational cost it might save and the interpretability it brings. I think training a hard attention model is an interesting and important task, and the proposed model and pretraining procedure are straightforward and seem well motivated. I do have a few concerns (listed below), but to my current understanding the authors have proposed an effective model that is widely applicable.

Some concerns:
- The experiment section could be improved. It would be helpful to include a comparison of the computational cost and/or parameter count of the Saccader against other image classification networks, with or without hard attention. The reported accuracy would be more impressive if the size of the network were taken into account. It would probably also help to try the network on large-scale and/or fine-grained datasets.
- The description of the model could be improved in terms of clarity. For example, I don't see in Section 3.1 how the final prediction is made: is it based purely on the "logits" at the predicted location, or does the network also see the original image at the given location? Also, the description of the Saccader cell seems a bit rushed; a sentence or two on what equations (1), (2), and (3) do might help.
- Does the "location network" in line 137 refer to the attention network, the 1-by-1 conv, and the Saccader cell? It might be helpful to introduce the term before it first appears.
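To make the computational-cost concern concrete, a back-of-the-envelope sketch of the fraction of input pixels a glimpse model touches (glimpse size and count below are hypothetical, not taken from the paper):

```python
def glimpse_pixel_fraction(image_hw, glimpse_hw, T):
    """Upper bound on the fraction of input pixels processed by a hard
    attention model taking T glimpses of size glimpse_hw, ignoring the
    savings from overlapping glimpses (hence min with 1.0)."""
    ih, iw = image_hw
    gh, gw = glimpse_hw
    return min(1.0, T * gh * gw / (ih * iw))

# e.g. six hypothetical 77x77 glimpses on a 224x224 ImageNet image
frac = glimpse_pixel_fraction((224, 224), (77, 77), T=6)
```

The point of the requested comparison is that such pixel-level savings only translate into real compute savings if the attention machinery itself is cheap, which is why reporting FLOPs and parameter counts matters.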

Reviewer 3

Strength: The idea of using hard attention for interpretability is novel to the field. Moreover, the design of the representation network limits the receptive field and prevents the model from using global information for classification, which serves the purpose of interpretability evaluation. (However, this is also a limitation, discussed in this review.) In addition, the Saccader cell is also a simple yet novel design.

Weakness: This model is only applicable to classification tasks on a limited type of images. Due to the design of the representation network, the model can only generate a prediction on a patch of the image. The model selects a fixed number of glimpses with a fixed size for all inputs, and the final prediction is a simple average across the fixed number (T) of patches. This design would fail on many classification tasks, such as pedestrian detection, where the image-level label is determined by multiple small ROIs. Moreover, the global distribution of spatial features is neglected by the model. The model does not generalize to classification tasks, such as cancer classification, where global features (such as the spatial distribution of radiodense tissue) and local features (lesion borders) together determine the label. The model is also evaluated only on ImageNet, which is not representative.

The pretraining procedure for the Saccader cell is questionable. The loss function is designed to force the Saccader cell to assign large probability to regions where the representation network gives large-value logits. This step introduces a strong bias and creates a self-feedback loop. (Most of this point has been addressed by the authors' response.)

The experiment design is somewhat insufficient. First of all, the authors only compare against models that fit the framework of this paper (i.e., models that have an explicit glimpse-selection mechanism). However, models from other families (such as weakly supervised localization models) are not compared. A simple baseline could be built from a black-box classification network (such as ResNet-v2-50) with a model-agnostic technique (such as Class Activation Mapping). In addition, in Section 4.3 the authors report higher classification performance with NASNet, but it is unclear whether this improvement comes from the increased model capacity or the higher input resolution.
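The suggested CAM baseline is straightforward to prototype. A minimal NumPy sketch of Class Activation Mapping (Zhou et al., CVPR 2016); the shapes below are hypothetical placeholders for a ResNet-style backbone:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Class Activation Map: weight the final conv feature maps by the
    classifier weights of one class, then normalize to [0, 1].

    features:   (H, W, K) activations of the last conv layer
    fc_weights: (K, num_classes) weights of the classifier that follows
                global average pooling
    """
    cam = features @ fc_weights[:, class_idx]  # (H, W) class evidence map
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()                       # normalize to [0, 1]
    return cam

# hypothetical 7x7x2048 feature map and 1000-way classifier
rng = np.random.default_rng(0)
cam = class_activation_map(rng.normal(size=(7, 7, 2048)),
                           rng.normal(size=(2048, 1000)), class_idx=3)
```

Thresholding such a map gives attended regions to compare against the Saccader's glimpse locations, at essentially no extra training cost.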