NeurIPS 2020

Consistent Structural Relation Learning for Zero-Shot Segmentation

Review 1

Summary and Contributions: === Post rebuttal update === I originally gave this paper an '8' and I will keep my original rating. The method is a good improvement upon [1]: it extends [1] with a simple and reproducible idea. Experimentally, the authors demonstrate good improvements over [1]. In contrast to R3, I think that this is not only a decent amount of novelty, but also the simple kind of novelty that is likely to be adopted by other researchers. The other two main weaknesses highlighted by several reviewers were: 1) A better positioning w.r.t. [1], which the authors provided in their rebuttal (and which should make its way into the paper). 2) Comparison to more related work, convincingly done in the rebuttal. Therefore I am satisfied with the author response and will keep my original rating. === End of post-rebuttal update === The authors address zero-shot semantic segmentation by exploiting word embeddings. In particular, [1] proposed a framework to create visual features based on (A) images of seen categories, and (B) embedding-features generated from word-embeddings, for both seen and unseen categories. They first make the seen visual features consistent with the generated seen word-embedding-features (using Maximum Mean Discrepancy). This enables generating examples in word-embedding-feature space for unseen categories (from their word-embeddings). These in turn can be used to train a classifier over both seen categories (using real visual features) and unseen categories (using word-embedding features). This paper extends [1] by: - Improving the feature generation using explicit weighted averages over nearby concepts (Sec. 4.1). - Adding two 'consistency' losses which aim to preserve relations between classes in both the visual and word-embedding space. Results show decent improvements over [1] and earlier works [25, 50].
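For readers less familiar with the Maximum Mean Discrepancy mentioned above, the following is a minimal NumPy sketch of how such a distribution-matching term can be computed between real and generated features. It assumes an RBF kernel with a fixed bandwidth and the biased estimator; the review does not state which kernel or estimator [1] uses, and the names `rbf_kernel`, `mmd2`, and `bandwidth` are illustrative only.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """RBF kernel matrix between the rows of a (n, d) and b (m, d)."""
    # Squared Euclidean distance between every row of a and every row of b.
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y.

    Assumption: a single fixed-bandwidth RBF kernel; the method in [1]
    may use a different kernel or estimator.
    """
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy
```

Used as a training loss, `x` would hold real visual features of seen classes and `y` the generated features for the same classes; the value shrinks toward zero as the two sample sets become indistinguishable under the kernel.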

Strengths: - Well-motivated idea. - Simple solution, which can be used and reproduced in practice. - Good results. - Well written.

Weaknesses: The following weaknesses are minor: - The discussion of related work can be improved, specifically w.r.t. [1]. - The evaluation dataset (PASCAL Context) is rather old. Evaluation on a more modern dataset such as the COCO panoptic dataset or ADE20k would be desirable, especially since knowledge transfer is expected to work better when there are more unseen classes. (PASCAL Context has 33 classes, COCO has on the order of 160, and ADE20k has many more but also more label ambiguities.)

Correctness: Yes.

Clarity: Yes.

Relation to Prior Work: Contribution w.r.t. [1] can be better described.

Reproducibility: Yes

Additional Feedback: The following items are suggestions to improve the paper. Addressing the first three in the author response is optional. Please don't address the others. 1. Please better discuss the differences w.r.t. [1]. In particular: how does Sec. 4.1 differ from their method? Please mark more clearly that the point-wise consistency was already used in [1]. 2. How is \gamma determined? Is your system sensitive to its value? 3. It would be great if the authors could apply their method to the COCO panoptic dataset (or maybe ADE20k). This dataset seems more suitable for this task (see motivation above). Having such numbers would set a baseline and a new standard for future work. 4. The Reference section has et al. instead of the full author lists. This is unacceptable and needs to be changed to include the names of all authors. 5. Please discuss how the current work relates to similar work in Computer Vision: 6. L28: "lower quality annotations". This is false. The quality may be the same or higher (since segmentation is harder to do). Please rephrase, for example as 'weaker form of annotations' or 'annotations with less information than the target task'. 7. L140: At l=L [...]. It would be clearer to relate \hat{x} to v^l_{i,j} directly. 8. L157-L165: This paragraph is now incorrectly part of 'point-wise consistency'. Consider moving it to the beginning of 4.2 or putting its content into the correct paragraphs below (pair-wise consistency and list-wise consistency).

Review 2

Summary and Contributions: This paper proposes a novel feature generation approach for zero-shot semantic segmentation. The key idea is to constrain the generation of unseen visual features by exploiting the structural relations between seen and unseen categories. The authors show better performance compared to previous works.

Strengths: - The proposed approach is novel. - Results are decent.

Weaknesses: - It is not clearly stated whether the pixels of unseen classes are used during training or not. If yes, then it is not fair to compare with Z3SNet and SPNet. In Section 3, the problem statement describes the standard zero-shot setting without using any pixels of unseen classes. But in Section 4.1, the nodes consist of pixel feature embeddings of unseen classes, which contradicts the setup described in Section 3. - Feature generation has been extensively explored in the context of zero-shot image classification. But this paper fails to compare with other feature generation methods except [1]. - This paper looks like another feature generation approach for zero-shot image classification. There is no specific technique designed for semantic segmentation. The authors did not justify why the proposed feature generator works well for the semantic segmentation problem. --------------------------- Post-rebuttal: My concerns regarding the weaknesses are addressed in the rebuttal. Therefore I have decided to increase my rating to a "7".

Correctness: Yes.

Clarity: The writing needs to be improved. The problem setup is not very clear. In particular, it is unclear whether pixels of unseen classes are used, how the background pixels are handled, and what happens if unseen and seen classes co-occur in one image.

Relation to Prior Work: Yes, the authors discuss the differences.

Reproducibility: No

Additional Feedback:

Review 3

Summary and Contributions: This paper proposes an approach for zero-shot segmentation tasks by considering relations between different categories. In particular, the framework is conditioned on semantic word embeddings and tries to generate visual features of unseen classes by using the similarity between seen classes and unseen classes. This similarity is modeled by the relation aggregation and the pair-wise and list-wise consistency operations. The authors conducted experiments on Pascal VOC datasets, and better performance is achieved on these two datasets.
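To make the pair-wise and list-wise consistency idea concrete, here is a rough NumPy sketch of what such terms could look like. This is an assumed formulation, not the paper's: it takes cosine similarity between per-class prototypes as the "relation", a mean squared difference as the pair-wise term, and a KL divergence between softmax-normalized relation rows as the list-wise term. All function names are hypothetical.

```python
import numpy as np


def cosine_relations(prototypes):
    """Class-by-class cosine similarity matrix from (num_classes, dim) prototypes."""
    unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return unit @ unit.T


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def pairwise_consistency(rel_word, rel_visual):
    """Assumed pair-wise term: mean squared gap between matching relations."""
    return float(np.mean((rel_word - rel_visual) ** 2))


def listwise_consistency(rel_word, rel_visual):
    """Assumed list-wise term: KL divergence between each class's
    softmax-normalized relation row in the two spaces, averaged over classes."""
    p = softmax(rel_word, axis=1)
    q = softmax(rel_visual, axis=1)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

Because both relation matrices are num_classes by num_classes, the word-embedding and visual feature spaces may have different dimensionalities; both terms are zero when the two spaces induce identical relations.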

Strengths: 1. The problem of zero-shot segmentation is valuable, since collecting labels for semantic segmentation tasks is really expensive. The task is also relatively new compared to recent progress in ZSL for classification and detection tasks. 2. Good results have been achieved compared to other methods.

Weaknesses: - The novelty of this paper is limited and a bit incremental compared to [1]. It seems the differences are the pair-wise and list-wise consistency losses, which are not used in [1]. The feature aggregation step is pretty standard. The contribution of the proposed relation aggregation step (i.e., Eqn 3) is not justified in the experiments. A useful baseline for comparison would be to compute M^A and M^X explicitly without the relation aggregation module. - There is nothing special that is designed for segmentation tasks other than using DeepLabv3+ as the backbone network. Thus, why not try these things on classification tasks? If the whole framework is about ZSL for segmentation, I'd like to see more modules that model the spatial information in images. I think this would be more important for segmentation. - Which word embeddings are used in the experiments? - Implementation details are missing, such as learning rate, optimizer, etc. - The presentation could be further improved, e.g., L140, simultaneous --> simultaneously; L175, focus --> focusing. Please carefully proofread the paper.

Correctness: Seems correct.

Clarity: Could be improved. There are many grammar mistakes.

Relation to Prior Work: I would like to see more discussion of the relation to [1].

Reproducibility: No

Additional Feedback: The authors addressed some of my concerns in the rebuttal, particularly about the relations to [1] and clarifications about the feature aggregation module. The reason I'm not excited about the proposed pairwise consistency loss is that it is used a lot in zero-shot learning classification tasks. Overall, I'm satisfied with the rebuttal and I am changing my score to 6.

Review 4

Summary and Contributions: This paper proposes a Generalized Zero-Shot Segmentation method that generates better visual features by considering the relationships between different categories in both the word embedding space and the visual feature space. The experimental results show the effectiveness of the proposed method.

Strengths: (1) The proposed method integrates both feature generation and relation learning in a unified network architecture. (2) The proposed method introduces relational constraints at different structural granularities to facilitate generalization to unseen categories. (3) The experimental results demonstrate the effectiveness of the proposed method.

Weaknesses: (1) Although considering the consistency of the structural relationships between the word embedding space and the visual feature space is new in GZS3, it has been applied in other tasks, e.g., visual recognition. However, these references are not mentioned. (2) I suggest the authors clarify the contributions of this paper. Similar works related to each contribution should be discussed. (3) Why do the authors not dynamically learn the weights of the three consistency losses? (4) It is not quite clear why different categories have similar relations in the semantic word embedding space and the visual feature space. Although Fig. 4 shows the relations in both spaces, I suggest the authors give a more intuitive presentation, because this is the foundation of the paper's motivation. (5) The authors present some failure cases in the supplementary materials, but the reasons are not given. For example, why does the proposed method fail when there are multiple instances? (6) In Fig. 4, it seems that the proposed method learns a more consistent relationship. But some relations are not reasonable, e.g., motorbike and horse: this relation is weak in the word embedding and visual feature spaces, but stronger in the generated visual space with the proposed method.

Correctness: Yes

Clarity: Yes, this paper is clear to read.

Relation to Prior Work: See the weaknesses. Some methods have considered the consistency of the structural relationships between the word embedding space and the visual feature space; the authors should discuss these methods. Besides, some recent GAN-based zero-shot methods are missing, for example: "Generative Dual Adversarial Network for Generalized Zero-shot Learning" (CVPR 2019) and "GTNet: Generative Transfer Network for Zero-Shot Object Detection" (AAAI 2020).

Reproducibility: Yes

Additional Feedback: