NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1135
Title: Transferable Normalization: Towards Improving Transferability of Deep Neural Networks

Reviewer 1

Originality: TransNorm follows a similar style to AdaBN and AutoDIAL, but shows higher performance.
Clarity: The paper is well written. I especially like Figure 1, which clearly illustrates the proposed TransNorm.
Quality: The proposed technique seems sound, and the experimental settings/results are solid.
Significance: The algorithm is easy to use (as are other normalization techniques). Though higher performance is shown in the paper, it is unknown how significant it will be when adopted in other real-world scenarios compared with other normalization techniques.

Reviewer 2

Unlike most works, which focus on reducing domain shift through loss functions, this paper contributes to network design by developing a novel transferable normalization (TransNorm) layer. TransNorm is well motivated: it separately normalizes source and target features in a minibatch while weighting each channel according to its transferability. It is clearly different from, and significantly outperforms, related methods such as AdaBN [15] and AutoDIAL [21]. The TransNorm layer is simple and free of learnable parameters, and can be conveniently plugged into mainstream networks. I think this work will have a non-trivial impact: the proposed TransNorm can be used as a backbone layer to improve other state-of-the-art methods. The experiments are extensive, both qualitative and quantitative, demonstrating the effectiveness of the proposed TransNorm.

The TransNorm layer is the key contribution of this paper. It separately normalizes the source and target features, then weights the channels by $\alpha$ with respect to transferability, which is empirically defined. I would like to see an ablation analysis on $\alpha$: what will the performance be if one sets $\alpha=1$, and what will it be if the transferability is defined as a softmax or a Gaussian with tunable variance over the discrepancy of the statistics, i.e., $\mu/\sqrt{\sigma^2+\epsilon}$ (rather than only $\mu$)?

------ Update -----
One of my comments asked what the performance would be if the two probabilities were built upon $\mu/\sqrt{\sigma^2+\epsilon}$ rather than only $\mu$. However, after reading the rebuttal, I still cannot tell from the second table and the analysis therein what kind of distances are used in the softmax and Gaussian probabilities. Note that I checked the submitted code (lines 152 and 156) and found that the two probabilities are built upon only $\mu$.
I hope the authors will make further clarifications and perform the corresponding experiments in the final version. I keep my original recommendation unchanged.
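To make the mechanism under discussion concrete, the following is a minimal numpy sketch of a TransNorm-style forward pass as the review describes it: per-domain normalization followed by channel-wise transferability weighting. The function name `transnorm_forward` and the exact weighting formula are illustrative assumptions, not the paper's Eqs. 5-7; consistent with the reviewer's code inspection, the discrepancy here is built on $\mu$ only.

```python
import numpy as np

def transnorm_forward(x_src, x_tgt, eps=1e-5):
    """Illustrative sketch (not the authors' code) of a TransNorm-style
    layer on (batch, channels) minibatches: normalize each domain with
    its own statistics, then re-weight channels by a transferability
    score alpha derived from the discrepancy of the domain means."""
    # Per-domain, per-channel statistics.
    mu_s, var_s = x_src.mean(axis=0), x_src.var(axis=0)
    mu_t, var_t = x_tgt.mean(axis=0), x_tgt.var(axis=0)

    # Separately normalize source and target features.
    xn_src = (x_src - mu_s) / np.sqrt(var_s + eps)
    xn_tgt = (x_tgt - mu_t) / np.sqrt(var_t + eps)

    # Channel-wise discrepancy of domain means (mu only, per the review);
    # a smaller discrepancy marks the channel as more transferable.
    d = np.abs(mu_s - mu_t)

    # Softmax over inverse discrepancies gives a distribution over the C
    # channels; scaling by C keeps the average channel weight at 1.
    c = d.shape[0]
    probs = np.exp(1.0 / (1.0 + d))
    probs = probs / probs.sum()
    alpha = c * probs

    return alpha * xn_src, alpha * xn_tgt, alpha
```

The $\alpha=1$ ablation the reviewer asks for corresponds to skipping the last weighting step, and the $\mu/\sqrt{\sigma^2+\epsilon}$ variant would replace the definition of `d` with `np.abs(mu_s / np.sqrt(var_s + eps) - mu_t / np.sqrt(var_t + eps))`.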

Reviewer 3

Strengths:
+ Proposes a new normalisation-statistics-based method for DA. This line of attack on the domain-adaptation problem is rather under-studied compared to other approaches.
+ A good range of benchmarks, with evaluation on multiple base DA methods and network backbones.
+ The analysis in Sec 4.4 is interesting.

Weaknesses:
1. Weak novelty. Addressing domain shift via domain-specific moments is not new; it was done, among others, by Bilen & Vedaldi, 2017, "Universal representations: The missing link between faces, text, planktons, and cat breeds", although this paper may have made better design decisions about exactly how to do it.
2. Justification & analysis: A normalisation-layer-based algorithm is proposed, but without much theoretical analysis to justify the specific choices. E.g., why exactly should gamma and beta be domain-agnostic, while alpha is domain-specific?
3. Positioning w.r.t. AutoDIAL, etc.: The paper claims "parameter-free" as a strength compared to AutoDIAL, which has a domain-mixing parameter. However, this spin is a bit misleading: it removes one learnable parameter but instead introduces a somewhat complicated heuristic (Eqs. 5-7) governing transferability. It is not clear that replacing a single parameter (which is learned in AutoDIAL) with a complicated heuristic function (which is hand-crafted here) is a clear win.
4. The evaluation is a good start, comparing several base DA methods with and without the proposed TransNorm architecture. It would be stronger if the base DA methods were similarly evaluated with/without architectural competitors such as AutoDIAL and AdaBN, which are direct competitors to TN.
5. The English is full of errors throughout ("Seldom previous works", etc.).

------ Update -----
The authors' response did a decent job of addressing the concerns, and the paper could be reasonable to accept. I hope the authors can update the paper with the additional information from the response.