Review for NeurIPS paper: A Dictionary Approach to Domain-Invariant Learning in Deep Networks

NeurIPS 2020

A Dictionary Approach to Domain-Invariant Learning in Deep Networks

Review 1

Summary and Contributions: This paper proposes using filter decomposition and layer branching to encourage a CNN to learn domain-invariant features. Experiments verify the effectiveness of the proposed method.

Strengths: This paper investigates which architectures are suitable for learning invariant features. This is complementary to previous invariant-feature learning methods and offers a valuable perspective.

Weaknesses: 1) Some necessary baselines are missing. 2) The baseline methods used for comparison are not state of the art. 3) No ablation studies are presented.

Correctness: The claims are reasonable, and the empirical evaluation is correct.

Clarity: Generally, the paper is easy to follow, and the ideas are clearly stated.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Some issues need to be addressed:
1) Missing baselines. Recent studies [1-2] have shown that domain-specific BN may improve adaptation performance. So why are all BN layers shared across domains in both Basic Branching and DAFD? For DAFD this may be reasonable, because it assumes that filter decomposition can 'correct' the shifts. For Basic Branching, however, not sharing BN across domains may help. The authors should compare with a variant that does not share BN. Also, for Basic Branching, why should the FC layers and the classifier be shared across domains? Splitting only the convolutional layers into branches may not mitigate the discrepancy in feature statistics across domains, so a unified classifier may not exist.
2) The baseline methods are not state of the art. The authors should conduct experiments on top of some recent state-of-the-art methods, e.g. [3-4], to show the effectiveness. The authors claim that the proposed framework is a plug-in method, so it would be better to try it with various domain-invariant feature learning methods. Besides, it would be better if the authors could empirically compare with the methods of [1] and [24] in the paper's references, whose motivations seem to share some similarities with this paper.
3) No ablation studies are presented. For example, how does the hyper-parameter K affect the adaptation accuracy? We cannot find the specific settings of K in the experiment section. We would also like to see a comparison between Basic Branching and DAFD in the unsupervised domain adaptation setting.

[1] Domain-Specific Batch Normalization for Unsupervised Domain Adaptation, CVPR 2019
[2] Transferable Normalization: Towards Improving Transferability of Deep Neural Networks, NeurIPS 2019
[3] Contrastive Adaptation Network for Unsupervised Domain Adaptation, CVPR 2019
[4] ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation, CVPR 2019

--------------- after rebuttal ---------------
I think the authors have addressed most of my concerns in their rebuttal, so I raise my rating to accept. The authors should add the related details to the revised paper.

Review 2

Summary and Contributions: This paper proposes a multi-domain convolutional filter. The filter is factorized into domain independent and domain dependent parts. The factorization is motivated by potential discrepancy in spatial structure for different domains. The authors demonstrated that the filter can be effective in multi-domain supervised learning and unsupervised domain adaptation.

Strengths: The contribution of the paper is modular and clearly stated. The experiments are designed with reasonable comparisons. The main claim is that sharing part of the parameters across domains at the filter level is effective compared to using the same or totally separate parameters for different domains. The contribution is relatively independent of the use case (e.g. supervised learning or unsupervised adaptation), so it can be useful whenever multiple domains are present and can be easily combined with other techniques. The authors also provide supporting evidence for the effectiveness through visualization: high-level alignment (obtained by supervised learning alone, without adding an alignment loss) in Figure 3 and low-level filters in App. A.

Weaknesses: This paper could be made more convincing by demonstrating what the filters learn from real datasets, in addition to the toy examples constructed in App. A. Although the design is well motivated, we probably need more supporting evidence of how the design can be effective in practice.

Correctness: Not sure.

Clarity: yes

Relation to Prior Work: I am not very familiar with previous work on this topic, so I am not sure whether any prior work is missing.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper proposes a dictionary approach to capture domain-invariant features in domain adaptation tasks. Domain-specific dictionary atoms and domain-shared coefficients are used as plugins in deep networks. The dictionary idea seems to have been proposed in other related works, but using it to solve the domain adaptation problem is a simple yet effective solution. The paper proves that filter transfer can be implemented by atom transfer alone, if the domain transfer results from a sequence of spatial transforms of the filters in a generative network.
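To make the atoms-only argument concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the tensor layout (K spatial atoms per layer, coefficients of shape c_out x c_in x K), the helper name `compose_filters`, and the use of a 90-degree rotation as a stand-in for a linear spatial transform are all assumptions made for illustration.

```python
import numpy as np

def compose_filters(coeffs, atoms):
    """Build conv filters as linear combinations of dictionary atoms.

    coeffs: (c_out, c_in, K), assumed shared across domains
    atoms:  (K, k, k), assumed specific to one domain
    returns filters of shape (c_out, c_in, k, k)
    """
    return np.einsum("oik,kxy->oixy", coeffs, atoms)

rng = np.random.default_rng(0)
c_out, c_in, K, ksize = 4, 3, 6, 5  # hypothetical sizes
coeffs = rng.standard_normal((c_out, c_in, K))
source_atoms = rng.standard_normal((K, ksize, ksize))

# Stand-in for a linear spatial transform: rotate every atom by 90 degrees.
target_atoms = np.rot90(source_atoms, k=1, axes=(1, 2))

source_filters = compose_filters(coeffs, source_atoms)
target_filters = compose_filters(coeffs, target_atoms)

# Rotating the full source filters directly gives the same result as
# recomposing them from the rotated atoms with unchanged coefficients.
rotated_source_filters = np.rot90(source_filters, k=1, axes=(2, 3))
atoms_only_transfer_matches = np.allclose(rotated_source_filters, target_filters)
```

Because the transform acts linearly on the spatial dimensions only, it commutes with the linear combination over atoms, which is why the coefficients can stay shared in this sketch.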

Strengths: 1. Combining the dictionary mechanism with the domain adaptation problem is an interesting and concise idea. As a pluggable module, the domain-specific dictionary atoms and domain-shared coefficients can be applied to other domain adaptation methods to improve performance. 2. In the experiments, the effectiveness of the proposed module is shown in both semi-supervised and unsupervised domain adaptation settings. 3. The proposed plugin framework has the potential to help research in areas beyond domain adaptation.

Weaknesses: 1. Compared to the basic branching architecture (Fig. 1(b)), the proposed approach adds far fewer parameters. However, I am curious how the performance improvements of the two architectures compare when they are used as plugins for existing domain adaptation tasks. It seems that the basic branching architecture is more expressive in the unsupervised setting (though overfitting occurs in the semi-supervised setting). Can the authors give more insight into why the proposed approach can (or cannot) outperform the basic branching architecture in terms of performance alone?

Correctness: The claims and method are correct, but it is possible that I missed some mistakes in the proof. The empirical methodology follows the common setting in domain adaptation works.
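The parameter comparison can be illustrated with a rough count for a single convolutional layer. This is only a sketch under assumptions: the layer sizes are hypothetical, biases are ignored, and the decomposed layer is assumed to use K atoms of size k x k per domain with one shared coefficient tensor of shape c_out x c_in x K.

```python
def branching_params(c_out, c_in, k, num_domains):
    """Basic branching: every domain keeps its own full set of filters."""
    return num_domains * c_out * c_in * k * k

def decomposed_params(c_out, c_in, k, num_atoms, num_domains):
    """Assumed decomposition: shared coefficients plus per-domain atoms."""
    shared_coeffs = c_out * c_in * num_atoms
    per_domain_atoms = num_atoms * k * k
    return shared_coeffs + num_domains * per_domain_atoms

# Hypothetical layer: 64 -> 64 channels, 3x3 kernels, K = 6 atoms.
c_out, c_in, k, K = 64, 64, 3, 6

branching_two = branching_params(c_out, c_in, k, num_domains=2)
decomposed_two = decomposed_params(c_out, c_in, k, K, num_domains=2)

# Cost of adding one more domain under each scheme.
branching_increment = branching_params(c_out, c_in, k, 3) - branching_two
decomposed_increment = decomposed_params(c_out, c_in, k, K, 3) - decomposed_two
```

Under these assumptions, each extra domain costs a full copy of the filters in the branching case but only K small atoms in the decomposed case, which is the gap in parameter increments the comment above refers to.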

Clarity: Yes, the paper is well written.

Relation to Prior Work: Yes, the discussion about the difference with related works is sufficient. And this paper's contribution is also clear.

Reproducibility: No

Additional Feedback: My main concern is that there are not enough details about the implementation of the method, either in the paper or in the supplementary material. For example, what value of K is used in the experiments? How many cross-domain layers are used? Adding more details would make the empirical results more convincing and the work easier to understand.

Review 4

Summary and Contributions: This paper proposes a branching layer for learning domain-invariant representations. It uses dual networks, one for the source domain and the other for the target domain. The proposed branching learns different filters for each domain separately as dictionary atoms and, at the same time, learns coefficients of these atoms that are shared across domains. Experiments on several benchmarks show the effectiveness of the proposed method. In addition, the paper gives a theoretical proof of the invariance of the learned representations.

Strengths: The idea of sharing dictionary coefficients across domains while letting each domain keep its own atoms is reasonable. The proof of invariance gives readers more insight into the proposed work, as well as into other related deep learning works. Experimental results are good, showing that the proposed method can be plugged into existing methods and contribute to performance improvement.

Weaknesses: The basic idea of separating convolutions for learning domain-invariant features is similar to [1]. Although the paper mentions [1] in the related work section, it should discuss its differences from and relation to [1] more clearly. Because the idea of [1] is so closely related, the proposed method has to be compared with [1] in the different experiments. To sum up, missing such an important comparison and discussion could be a severe limitation of this work.

Correctness: Could be improved. For example, 1. L219, "the performance on source domain degrades when the number of target domain data is comparable": the experimental setup and results cannot support this statement, since 10% of SVHN cannot be called comparable to MNIST. 2. In the visualization, (c) is very well aligned across domains, much better than (a). However, the results in Table 1 show only a relatively small improvement of A3 over A1. A reasonable explanation is lacking. 3. The improvements in the tables should be given as absolute numbers. Currently they are given as relative ratios, which can mislead readers and exaggerate the improvements, especially when the baseline is low.
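The third point can be illustrated with a small calculation. The accuracy values below are hypothetical and are not taken from the paper's tables; they only show how the same absolute gain yields very different relative ratios depending on the baseline.

```python
def absolute_gain(baseline_acc, new_acc):
    """Improvement in accuracy, in percentage points."""
    return new_acc - baseline_acc

def relative_gain(baseline_acc, new_acc):
    """Improvement as a percentage of the baseline accuracy."""
    return (new_acc - baseline_acc) / baseline_acc * 100.0

# Hypothetical accuracies (in %), chosen only for illustration.
low_baseline, low_improved = 10.0, 12.0
high_baseline, high_improved = 90.0, 92.0

low_abs = absolute_gain(low_baseline, low_improved)
low_rel = relative_gain(low_baseline, low_improved)
high_abs = absolute_gain(high_baseline, high_improved)
high_rel = relative_gain(high_baseline, high_improved)
```

Both hypothetical cases gain 2 percentage points, but the relative ratio is about 20% for the low baseline and only about 2.2% for the high one, so reporting the ratio alone makes the low-baseline result look far stronger.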

Clarity: Yes, easy to follow.

Relation to Prior Work: No, relation to [1] should be discussed more.

Reproducibility: Yes

Additional Feedback: The rebuttal partially addresses my concerns about the relation to [1]. However, I think this discussion should have been provided in the original submission. I maintain my initial rating.