NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 8507
Title: Semantic-Guided Multi-Attention Localization for Zero-Shot Learning

Reviewer 1

The problem is relevant, and the method is based on an interesting attention-based idea: looking at different regions of the image for the task of ZSL. The losses used focus on (i) making each attention map peaky while keeping different maps diverse, (ii) an embedding-based softmax for better prediction, and (iii) a class-center triplet loss that pulls features closer to their own class center than to the other class centers.
Line 190 mentions that the image and parts are sent to “separate backbone networks”, which implies that the network parameters are not shared. If that is the case, the method has ~3x the parameters of competing methods, i.e. a significantly higher-capacity network overall. What happens when the CNN parameters are shared? And what happens when the image-only baseline has a higher-capacity backbone (which is then also fine-tuned end to end)?
The learning of the channel-wise attention weights is initialized by clustering the features with k-means, which is shown as an L2 loss minimization in eqn. (3) of the supplementary material. A detailed ablation study showing the importance of this initialization would have been helpful. The number of clusters is fixed to 2 for all datasets, with no justification or experiments to validate this choice. Parametric studies should be provided, preferably both with and without network weight sharing.
As mentioned in the supplementary material, the clustering of feature channels is done using a “CNN trained for the conventional classification task and extract the coordinates of the peak for each channel”. Is this a different CNN from the one used in the approach? If so, the description of the approach should include it. If instead the clustering is done along with the training of the CNN embedded in the approach, it would give erroneous peak channel values in the initial iterations. A clearer explanation of how the channel-wise attention is computed is required.
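
To make the clustering question concrete, here is a minimal sketch of the step as I understand it from the quoted sentence: take the peak coordinate of each feature channel and group the channels with k-means (k = 2). This is an illustration under stated assumptions, not the authors' procedure: the function names `channel_peaks` and `kmeans`, the optional `init_centers` argument, and the synthetic feature maps are all hypothetical, and eqn. (3) of the supplementary is not reproduced here.

```python
import numpy as np

def channel_peaks(features):
    """Return the (row, col) of the maximum response of each channel.

    features: array of shape (C, H, W); hypothetical conv feature maps.
    """
    c, h, w = features.shape
    flat_idx = features.reshape(c, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_idx, (h, w))
    return np.stack([rows, cols], axis=1).astype(float)

def assign(points, centers):
    """Assign each point to its nearest center (squared L2 distance)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k=2, n_iter=50, seed=0, init_centers=None):
    """Plain Lloyd's k-means; returns (labels, centers).

    init_centers is a hypothetical convenience argument for making the
    initialization explicit; otherwise k random points are used.
    """
    if init_centers is None:
        rng = np.random.default_rng(seed)
        picks = rng.choice(len(points), size=k, replace=False)
        centers = points[picks].copy()
    else:
        centers = np.array(init_centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return assign(points, centers), centers

# Synthetic example: 8 channels whose peaks fall into two spatial groups.
peak_locations = [(1, 1), (1, 2), (2, 1), (2, 2),
                  (5, 5), (5, 6), (6, 5), (6, 6)]
features = np.zeros((len(peak_locations), 8, 8))
for ch, (r, c) in enumerate(peak_locations):
    features[ch, r, c] = 1.0

peaks = channel_peaks(features)
# Explicit initialization: one center taken from each spatial group.
labels, centers = kmeans(peaks, k=2, init_centers=peaks[[0, 4]])
```

Because Lloyd's algorithm only finds a local optimum, the resulting channel groups can depend on the initial centers (the `seed` or `init_centers` above). Rerunning with several seeds and comparing the groupings is one simple way to probe that sensitivity.
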
Results for Generalized ZSL, which is a more practical and harder task combining both seen and unseen classes at test time.
The paper misses one of the relevant approaches in this area: Ji, Zhong, et al. "Stacked semantics-guided attention model for fine-grained zero-shot learning." Advances in Neural Information Processing Systems. 2018.
A more recent paper built on the same idea of attention for ZSL (it appeared after the NeurIPS’19 submission date, so I am not considering it in this review; FYI only): Xie, Guo-Sen, et al. "Attentive Region Embedding Network for Zero-shot Learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.
Minor comments: the title of appendix G should say ‘qualitative’ in place of ‘quantitative’. Some of the main contributions are pushed to the supplementary material, such as the details of the clustering of channels in appendix A and the inference in appendix D. It would have been better to include these in the main paper.
--------------------- Post rebuttal ---------------------
I appreciate the interesting rebuttal. However, I still find that the submission would need more work.
- The comparison with and without parameter sharing exposes high variance in the results with respect to the number of clusters. Since the part definition step is stochastic, I would also run multiple initializations and report variances.
- The ZSL results are not quite state of the art (e.g. many results from CVPR 2018 Feature Generating Networks, Xian et al.; there are even better numbers out there now).
- The clustering process and its working would still need more analysis and discussion (with experiments).
I would keep my rating.

Reviewer 2

While there are weaknesses, this paper is a solid submission. The idea is interesting and effective. It outperforms the state of the art.
Strengths:
+ The paper is well written and the explanations are clear.
+ The quantitative results (especially Table 2) clearly demonstrate the effectiveness of the proposed method.
+ Figure 1 is well designed and useful for understanding the model.
+ The qualitative results in Figure 2 are convincing and demonstrate the consistency of the attention module across different classes.
Weaknesses:
- Motivation behind 3.2
Section 3.2 describes the cropping network, which uses a 2D continuous boxcar function. The motivation for this design choice is weak, as previous attempts at local attention have used Gaussian masks [a], simple bilinear sampling using spatial transformers [b], or even pooling methods [c]. If this makes a difference, it would be great to demonstrate it in an experiment. At minimum, bilinear sampling should be compared against.
[a] Gregor, Karol, et al. "Draw: A recurrent neural network for image generation." ICML, 2015.
[b] Jaderberg, Max, Karen Simonyan, and Andrew Zisserman. "Spatial transformer networks." NeurIPS, 2015.
[c] He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE international conference on computer vision. 2017.
- Discrepancy between eq. 9 and Figure 1
From eq. 9, it seems that the output patches are not cropped parts of the input image but just masked versions of the input image in which most pixels are black. Is this correct? If so, Figure 1 is misleading. And in that case, wouldn't zooming in on the region of interest using bilinear sampling provide better results?
- Class-Center Triplet Loss
The formulation of the class-center triplet loss (L_CCT) is not entirely convincing. While the authors claim that L2 normalization is introduced to ease the setting of a proper margin, it also has a different effect: it makes the formulation diverge from the traditional definition of a margin. For example, two points in the semantic feature space could be close, yet far apart after the normalization that projects them onto a unit hypersphere. The converse is also true. Especially given that the unnormalized version of phi is also used in L_CLS, the effect of this formulation is not obvious. In fact, the formulation resembles the cosine distance in an inner product, and the margin would be set, roughly speaking, on the cosine angle. The authors should discuss this in their paper. I find the current explanation misleading.
- Backbone CNN
Although I assume they do, it is not clear from Section 3.3 / Figure 1 which backbone CNNs share their weights and which don't (if some don't). Is the input image going through the same CNN as the local patches? Are the local patches going through the same CNN? I suggest some coloring to make it clear if not all are shared.
- Minor issues
L15: "must be limited to one paragraph".
L193: L_CAT --> L_CCT
Equation 11: it would be clearer with indices under the max function.
L215: "unit sphere" -> "unit hypersphere", unless the dimension of the semantic feature space is 3, in which case this should be mentioned.
Potential Enhancements:
* This paper targets zero-shot classification, but since the multi-attention module is a major contribution by itself, it could have been validated on other tasks. An obvious one is fine-grained classification, on CUB-200 for instance. It may be possible for the authors to report this result since they already use CUB-200, but I would understand if it is not done in the rebuttal.
==== POST REBUTTAL ====
The additional results have made the submission even stronger than before. I am therefore more confident in the rating.
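
The concern about L2 normalization in the class-center triplet loss can be checked numerically. For unit vectors u and v, ||u - v||^2 = 2 - 2 cos(u, v), so a margin on squared distances between normalized features is effectively a margin on cosine similarity. The sketch below is illustrative only: the helper names `l2_normalize` and `cosine` and the toy 2D vectors are hypothetical, and the use of squared Euclidean distance is an assumption, since the exact form of L_CCT is not given in this review.

```python
import numpy as np

def l2_normalize(x):
    """Project a vector onto the unit hypersphere."""
    return x / np.linalg.norm(x)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Close in the raw space, far apart after normalization (orthogonal directions).
a, b = np.array([0.1, 0.0]), np.array([0.0, 0.1])
raw_ab = float(np.linalg.norm(a - b))
norm_ab = float(np.linalg.norm(l2_normalize(a) - l2_normalize(b)))

# Far apart in the raw space, identical after normalization (same direction).
c, d = np.array([1.0, 0.0]), np.array([10.0, 0.0])
raw_cd = float(np.linalg.norm(c - d))
norm_cd = float(np.linalg.norm(l2_normalize(c) - l2_normalize(d)))

# Identity behind the "margin on the cosine angle" remark.
identity_gap = abs(norm_ab ** 2 - (2.0 - 2.0 * cosine(a, b)))

print(f"a,b: raw={raw_ab:.3f} normalized={norm_ab:.3f}")
print(f"c,d: raw={raw_cd:.3f} normalized={norm_cd:.3f}")
print(f"|d^2 - (2 - 2cos)| = {identity_gap:.2e}")
```

Under the squared-distance assumption, a triplet constraint d^2(x, c+) + m < d^2(x, c-) on normalized features is equivalent to cos(x, c-) + m/2 < cos(x, c+), which is why the margin m acts on cosine similarity rather than on distances in the original feature space.
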

Reviewer 3

Originality: The work combines a set of well-known components from the DL community. We do not see a real technical innovation, but the combination of these methods is interesting in itself. The difference between the Multi-Attention and previous methods is not clearly explained and not very convincing.
Clarity: I find that the submission is clear overall. About the class-center triplet loss: the authors should explain how the extracted features are finally used for zero-shot learning. This gets clearer in the supplementary material, but it should be explained in the main paper, and a reference to the supplementary should be added. There are a couple of typos, including one in the abstract. Figure 1 and the text of the paper clearly describe how the multiple components are articulated.
Significance: The impact and significance of the paper for the community will be average. This is a good piece of work and the code is provided; nevertheless, we do not see a clear technical breakthrough that would be reused by other authors.