Review for NeurIPS paper: Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

NeurIPS 2020

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

Review 1

Summary and Contributions: The authors propose a novel method for domain adaptation of semantic segmentation networks in the challenging synthetic-to-real scenario. The work mainly proposes to align the feature representations of the two domains along parts of the images that share the same visual context. These parts can be automatically identified by deep feature correlations and cycle consistency. Given a pair of images, a source deep feature is spatially matched to the most similar target one; the same process is then repeated between the selected target feature and all the source ones. If the final matching source feature shares the same semantic label as the initial source feature, the cycle is considered a success and the triplet of features can be used to enforce similarity among features extracted across domains. This technique, together with other regularization, allows the method to achieve state-of-the-art results on the two considered datasets.
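The matching procedure summarized above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`cosine`, `nearest`, `cycle_associations`), the use of cosine similarity, and the toy features are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def nearest(query, candidates):
    """Index of the candidate feature most similar to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))

def cycle_associations(src_feats, src_labels, tgt_feats):
    """Return (i, j, k) triplets: source i -> target j -> source k.

    A triplet is kept only when the start and end source pixels share
    the same ground-truth label (hypothetical helper for illustration).
    """
    triplets = []
    for i, feat in enumerate(src_feats):
        j = nearest(feat, tgt_feats)          # source -> target
        k = nearest(tgt_feats[j], src_feats)  # target -> back to source
        if src_labels[i] == src_labels[k]:    # cycle-consistency check
            triplets.append((i, j, k))
    return triplets

# Toy data: three source "pixels" with labels, two unlabeled target "pixels".
src_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
src_labels = [0, 1, 1]
tgt_feats = [[0.95, 0.05], [0.1, 0.9]]

triplets = cycle_associations(src_feats, src_labels, tgt_feats)
print(triplets)  # [(0, 0, 0), (2, 1, 2)]
```

In this toy case source pixel 1 is rejected: its cycle ends at source pixel 0, which carries a different label, so no association is formed for it.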

Strengths: + I really liked the idea of matching deep features by deep correlation to automatically identify portions of the images with the same visual appearance. I find it quite novel, and at the same time it is quite intuitive why we would want to make the representation of only a subset of the image features similar instead of aligning everything. + My intuition on the method, and a slightly different point of view from the one presented in the paper, is that the proposed gradient diffusion by spatial aggregation basically provides an elegant and self-supervised way of performing unsupervised image segmentation (see Fig. 2). Then, by matching this feature with two coherent ones from the source image, the method is implicitly doing some sort of soft-labeling of the target sample. + The whole system can be trained end-to-end without requiring cumbersome processes or subsequent training stages, and it seems to be pretty robust with respect to the choice of hyperparameters, as discussed in the supplementary material. + The ablation study in Tab. 2 is concise but provides all the useful information to show the effectiveness of the proposed loss functions and regularizations.

Weaknesses: - The method has been tested on a single architecture even though, as the authors state at line 243, newer architectures like PSPNet might perform even better. - Some more recent works that might be highly related need to be cited and compared against. For example, “Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation - Wang et al. CVPR 2020” might share some similarities with the proposed method that need to be discussed in context. - The method has been tested only on synthetic and real datasets that share quite a lot of similarities. It is somewhat unclear how well the proposed feature matching strategy generalizes to datasets with bigger domain discrepancies. - Some implementation details can be made clearer in the paper.

Correctness: I believe the claims and the experimental evaluation methodology to be correct.

Clarity: The paper is well written and easy to follow. Some minor implementation details can be made more clear, but this can be addressed in a polishing of the paper before camera ready.

Relation to Prior Work: Previous works have mostly been properly cited and discussed in this manuscript. However, some more recent and highly related works should be added and discussed. Some of them outperform the proposal, but I still think that this work is pretty valuable. For example: a. Bidirectional Learning for Domain Adaptation of Semantic Segmentation - Li et al. 2019 b. Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation - Kim et al. 2020 c. Differential Treatment for Stuff and Things: A Simple Unsupervised Domain Adaptation Method for Semantic Segmentation - Wang et al. 2020 d. Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision - Pan et al. 2020

Reproducibility: Yes

Additional Feedback: I am giving an 8 because I consider this submission very interesting and quite novel. I have some doubts that I would like the authors to clarify in the rebuttal and, eventually, in a revised version of the paper: (a) Is there a reason why the method has not been tested with newer architectures than DeepLab v2, e.g. the cited PSPNet? (b) How do you select which source pixels to use to define pixel-level associations? Are all source pixels of a certain image considered as candidates? (c) At which resolution is the pixel-level association between source and target pixels computed? (d) You have mentioned using your method together with a self-training method, but how would you carry out the integration? Simply pseudo-labeling target samples and adding an additional loss on them? Or would you integrate the pseudo-labeling coming from self-training into the feature matching and cycle consistency formulation? ===== Post rebuttal comments ===== The authors have correctly addressed my few concerns in the rebuttal; therefore I am keeping my original rating and I would suggest accepting this work to NeurIPS.

Review 2

Summary and Contributions: This paper proposes a domain adaptation technique that finds cycle-consistent pixels between source and target images and reinforces their association. That is, given a pixel in the source, the method finds its nearest neighbor in the target, then finds that point’s nearest neighbor back in the source, and checks that the original and final source points belong to the same class. If so, they are brought together in feature space, away from the other points (in a contrastive manner). This is different from CyCADA (which aims to stylize the pixels) or feature discrepancy minimization methods.
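To make the "contrastive manner" concrete, the sketch below shows one common way such a loss is written, an InfoNCE-style softmax over similarities. This is an assumed form chosen for illustration; the paper's exact loss, and the names `contrastive_loss` and `temperature`, are not taken from the source.

```python
import math

def contrastive_loss(sim_row, pos_index, temperature=1.0):
    """InfoNCE-style loss for a single source pixel (assumed form).

    sim_row:   similarities between the source pixel and every target pixel.
    pos_index: index of the target pixel associated by the cycle check.
    The loss is low when the associated pixel is much more similar
    than all the other target pixels.
    """
    scaled = [s / temperature for s in sim_row]
    m = max(scaled)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_denominator - scaled[pos_index]

# With no preference among four target pixels the loss equals log(4).
uniform = contrastive_loss([0.0, 0.0, 0.0, 0.0], pos_index=0)

# Raising the similarity of the associated pixel lowers the loss.
weak = contrastive_loss([0.5, 0.5], pos_index=0)
strong = contrastive_loss([0.9, 0.1], pos_index=0)
```

Minimizing this quantity pulls the associated pair together while pushing the source pixel away from the remaining target pixels, which matches the behavior described in the summary.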

Strengths: The paper proposes to mine for cycle-consistent pixel associations: going from source to target back to source, using nearest neighbors, and checking if the start and end points are in the same class. These points are then brought together in feature space, in contrast to other points in the image. To my knowledge, this method is novel and different from previous methods. The paper explores this idea quite thoroughly. The method performs this in feature space (Sec 3.2) and output space (Sec 3.4), with some feature diffusion such that a more diverse set of points is selected (Sec 3.3), and with contrast normalization on features. The paper validates each of these design decisions (along with contrastive vs “simply” associating cycle-consistent points together) in Table 2. The paper beats state-of-the-art methods.

Weaknesses: Section 3.2 proposes associating pixels through cycle-consistency in a feature representation. One dimension the paper does not study (that I see) is the receptive field, or how deep in the network these features are extracted from. Is the method sensitive to this design choice? Can the system benefit from perhaps performing this operation in every layer (or multiple layers) of a feature extractor, such as in a VGG perceptual loss? I also believe the mechanism of finding nearest neighbors could be better studied and visualized. Figure 2 does visualize qualitative similarity maps. How often do the cycle-consistent associations find an associated pixel in the target domain with the same, correct label? Is it difficult to find matches for classes that are not often represented?

Correctness: Yes, the paper’s claims and methodology seem correct.

Clarity: I was able to understand the paper and method.

Relation to Prior Work: While the related work covers domain adaptation literature, I believe it can draw some further connections outside of this immediate area. For example, [1] learns features guided by cycle consistency and [2] finds correspondences between images by cycle-consistent feature matching for graphics applications. Furthermore, the work uses the contrastive loss, which has seen popular use throughout the unsupervised learning community [3] and shown benefits in knowledge distillation [4]. [1] Zhou et al. Learning dense correspondence via 3d-guided cycle consistency. CVPR 2016. [2] Aberman et al. Neural Best-Buddies: Sparse Cross-Domain Correspondence. SIGGRAPH 2018. [3] van den Oord et al. Contrastive Predictive Coding. 2018. [4] Tian et al. Contrastive Representation Distillation. ICLR 2020.

Reproducibility: Yes

Additional Feedback: Overall, I was able to understand the method. To my knowledge, it is novel and different from previous methods. The paper studied various aspects and design decisions related to this idea through an ablation study. The method also outperforms previous methods. I do think studying the feature extractor in greater detail would make the paper stronger, along with more connections to previous literature. Overall, I believe this is a solid submission. ----------- I thank the authors for the additional information in the rebuttal. I hope it can be incorporated in an updated revision or supplementary material.

Review 3

Summary and Contributions: This paper addresses the problem of unsupervised domain adaptation for semantic segmentation, where source data with annotated labels and target data without labels are available. It proposes to enhance the similarities between cycle-consistent pixels in source and target images, relative to other pixel pairs. To address the problem that the cycle-consistent pixels are sparse, a spatial aggregation module is used so as to back-propagate gradients across all pixels in the training images. Experiments on two adaptation cases show the effectiveness of the proposed method.
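As a rough illustration of why spatial aggregation spreads gradients to all pixels, the sketch below replaces each feature with a softmax-weighted average of every feature in the same image. This particular weighting is an assumption made for the example; the review does not specify the form of the paper's module, and the names `softmax`, `dot`, and `aggregate` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def aggregate(feats):
    """Similarity-weighted spatial aggregation (assumed form).

    Each output feature is a convex combination of ALL input features
    in the image, so a loss applied to one aggregated feature depends
    on, and sends gradient to, every pixel.
    """
    out = []
    for f in feats:
        weights = softmax([dot(f, g) for g in feats])
        out.append([sum(w * g[d] for w, g in zip(weights, feats))
                    for d in range(len(f))])
    return out

# Two orthogonal "pixels": after aggregation each one mixes in the other.
mixed = aggregate([[1.0, 0.0], [0.0, 1.0]])
```

Because every weight is strictly positive, even a pixel that never takes part in a cycle-consistent association still contributes to the aggregated features that do.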

Strengths: The proposed method is technically sound and reasonable. While existing methods for domain adaptation minimize the distribution discrepancy, this paper proposes a new perspective for domain adaptation, which finds possible pixel pairs belonging to the same category across domains and minimizes the discrepancy between them. The basic idea is reasonable and somewhat straightforward, but the paper finds a good way to incorporate this idea into addressing the domain adaptation problem end-to-end. The paper has strong experimental results. The ablation study is also helpful for understanding the method.

Weaknesses: The loss contains many parts, so tuning the weights for these parts could be a tedious task. The paper lacks studies on how to tune these parameters and how they influence the final results. It is unclear whether the proposed method is sensitive to them or not.

Correctness: It is technically sound.

Clarity: Yes, the paper is well organized and easy to follow.

Relation to Prior Work: The relation to previous works could be improved by describing existing works in more detail and explaining how the proposed one differs from them. Currently, the paper only very briefly describes three main categories of domain adaptation methods.

Reproducibility: Yes

Additional Feedback: After rebuttal, I maintain my initial recommendation.