NeurIPS 2020

Self-training Avoids Using Spurious Features Under Domain Shift

Review 1

Summary and Contributions: This paper concentrates its theoretical study on the utility of self-training (in the form of entropy minimization/pseudo-labeling algorithms, which do not have access to labels $y$) in a simple generative model with a structured domain shift. The set-up assumes one is given an arbitrary source/initialization classifier $w_S$ and is then presented with unlabeled data from a target domain in which there are some “useful” features $x_1$ correlated with the labels $y$, and spurious features $x_2$ which are just noise unrelated to the labels $y$. Under Gaussian assumptions on the spurious features, a suitable mixture assumption on the useful features, and a good source initialization, guarantees are provided showing that the aforementioned algorithms return vectors with small norm on the spurious support. Several theoretical counter-examples/thought experiments and semi-synthetic experiments on real data are provided to validate the theory.
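For concreteness, the set-up summarized above can be sketched numerically. The following is a minimal illustration under stated assumptions, not the authors' implementation: a two-dimensional target distribution with one useful coordinate $x_1$ (a two-component mixture centered at $\pm\gamma$) and one spurious Gaussian coordinate $x_2$, with normalized gradient descent on the surrogate $\exp(-|w^\top x|)$ over unlabeled target samples. The function names, the parameter values (gamma, sigma1, sigma2, step size), the unit-norm projection, and the exact form of the surrogate loss are all assumptions chosen for illustration.

```python
import numpy as np

def sample_target(n, gamma=2.0, sigma1=0.5, sigma2=1.0, seed=0):
    """Toy target domain (assumed parameters): x1 carries the label, x2 is pure noise."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)                # hidden labels, never used in training
    x1 = gamma * y + sigma1 * rng.standard_normal(n)   # useful feature: mixture at +/- gamma
    x2 = sigma2 * rng.standard_normal(n)               # spurious feature: independent of y
    return np.stack([x1, x2], axis=1), y

def self_train(w_init, X, steps=300, lr=0.5):
    """Gradient descent on mean(exp(-|w.x|)), renormalizing w to unit norm after each step."""
    w = w_init / np.linalg.norm(w_init)
    for _ in range(steps):
        t = X @ w
        # d/dw exp(-|t|) = -sign(t) * exp(-|t|) * x, averaged over the unlabeled samples
        grad = -(np.sign(t) * np.exp(-np.abs(t))) @ X / len(X)
        w = w - lr * grad
        w = w / np.linalg.norm(w)
    return w

X, y = sample_target(20000)
w_init = np.array([1.0, 0.5])            # hypothetical source classifier that leans on x2
w_init = w_init / np.linalg.norm(w_init)
w_final = self_train(w_init, X)

print("initial |w_2|:", abs(w_init[1]))
print("final   |w_2|:", abs(w_final[1]))
```

In this toy instance the weight on the spurious coordinate shrinks relative to its initial value. The labels y are generated only to define the mixture and are never passed to the update.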

Strengths: This paper targets a timely and important problem. With regard to domain shift, I also like that this paper posits and analyzes a (albeit simple) structured model of domain shift to study self-training. Much prior work in distributional robustness assumes a worst-case model for distribution shift, which seems too pessimistic to capture behavior that is salient to real applications. I also enjoyed reading the intuitions in the proof sketch and the examples/toy experiments used to support the validity of the assumptions; indeed, I believe one of the key contributions of the paper is not the proof techniques (which seem to involve detailed computations, for example to bound densities of mixtures of Gaussians in various regimes) but rather the relevance of the assumptions and set-up. I think Theorems 3.1 and 3.2 form a nice contribution showing how some forms of unsupervised self-training can provide benefits in a simple setting.

Weaknesses: One important caveat to the above is that Theorem 3.1 is missing a bound on the target accuracy of the final classifier (in analogy to Theorem 3.2, which studies a much simpler setting). The algorithmic set-up of pseudo-labeling is nice, but I feel the statistical implications for generalization are an important point in a paper studying self-training, perhaps more important than the algorithmic connections to entropy minimization. Identifying assumptions and conditions under which self-training can aid learning, using any procedure, would seem to be the most interesting contribution. Clarification on why this is not provided, or is difficult to provide, in the context of Theorem 3.1 would be useful. The setting studied is also simple, but I do not view this as a significant downside.

Correctness: The claims appear correct.

Clarity: The paper is well-written. One small point is that in the preamble to Theorem 3.2 the parameter $\gamma$ is not explained (and is not referenced in Theorem 3.2); it appears related to the SNR for the problem, but I am not sure. One other point I may have misunderstood is that in the second paragraph of Section 3 it is stated that “... motivates our assumptions of separation … and that the spurious $x_2$ is a mixture of sliced log-concave distributions”. My understanding was that $x_2$ is assumed to be Gaussian (based on Equation 2.1) and $x_1$ is a mixture.

Relation to Prior Work: Seems to be cited appropriately.

Reproducibility: Yes

Additional Feedback: I have read the reviews/feedback and appreciate the comment on the relationship to max-margin classification. I will maintain my score.

Review 2

Summary and Contributions: The paper studies why self-training can avoid spurious features, based on a convergence analysis, in an attempt to explain the recent empirical success of self-training on large-scale data. Under a linear model with binary labels and mild assumptions on the distribution of input features, the paper shows that if the initialization is good enough, entropy-based updates shrink the parameters on the spurious features. The theory is validated with synthetic experiments on variants of the CelebA and MNIST datasets.

Strengths: The paper studies the power of self-training from a theoretical perspective and proposes a toy but new model setting in which to carry out a convergence analysis. Self-training for large-scale data is a hot topic, and this paper tries to provide some explanation.

Weaknesses: While the paper tries to explain the success of self-training, it is unclear how the framework or the theory can be generalized or provide guidance for empirical discoveries. From the theoretical side, the “surprising” part mentioned in the paper is not that surprising, because the loss function would favor good features due to their correlation with the model. === Post rebuttal === Thanks for writing the rebuttal. I have carefully read it and understand the arguments. Based on all the feedback and my own evaluation, I will keep my score as it is.

Correctness: I did not check the proof details but the claims seem to be correct.

Clarity: In general yes -- there are some typos and formatting issues which can be fixed easily.

Relation to Prior Work: I think so.

Reproducibility: Yes

Additional Feedback: I did not check the proof details, but I am interested in the following: it seems that Theorem 3.2 holds for $d_2 > 1$ and the only condition that depends on $\Sigma_2$ is the minimum eigenvalue, so is the analysis here equivalent to analyzing $d_2 = 1$?

Review 3

Summary and Contributions: For linear classifiers under certain assumptions on the data distribution, the authors prove that when the initial classifier is sufficiently accurate, self-training, via either pseudo-labels or entropy minimization, improves the robustness of the classifier against domain shift caused by spurious features. In particular, this improvement occurs due to a feature selection effect. This work provides a novel theoretical explanation for the effectiveness of self-training on unsupervised domain adaptation problems.

Strengths: The work makes a great effort to carefully craft assumptions that lay a workable foundation for the theory. The theoretical results are insightful, and the experimental results support their theoretical implications. Its feature selection mechanism can be further examined by practitioners.

Weaknesses: 1. There are many technical assumptions, and it is hard to identify which play an essential role in the result and which are purely technical and thus possible to relax in the future. 2. The loss functions for pseudo-label self-training and entropy minimization are not standard. I suggest analyzing the log-loss for pseudo-label self-training and making the equivalence between l_exp and l_ent formal. 3. The authors need to explain why the additional source loss on labeled data in self-training is required in the experiment on CelebA. === Post rebuttal === The authors' response answers my questions. My rating remains unchanged.

Correctness: I think the claims and methods are correct, though I did not check the proofs carefully.

Clarity: The paper is well written overall, but several places can be further improved. 1. l_exp is used for two different purposes on Page 3. 2. Instead of showing many, yet far from sufficient, details of the proof in Section 4, providing more intuition and interpretation about the relation between the assumptions and the conclusions may be more helpful. 3. The proof of Prop 2.1 is in Appendix D, not E.

Relation to Prior Work: The relation between this work and previous contributions is clearly discussed.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: The authors propose an analysis, in the setting of unsupervised domain adaptation, of how to avoid spurious features which correlate with the source labels but have no correlation with the target labels. The authors claim and prove that entropy minimization on unlabelled target data will avoid using the spurious features if initialized with a decently accurate source classifier, even though the objective is non-convex and contains multiple bad local minima that use the spurious features.

Strengths: The paper is well written and all the theoretical claims are well established. Spurious or useless features which do not have any correlation with the target domain are a source of inherent bias for many unsupervised domain adaptation tasks. This paper proves that this inherent bias is automatically avoided given a good initial classifier for the source domain. The authors have also carried out experimental studies on two datasets: a semi-synthetic coloured MNIST and CelebA. On both datasets, the results validate the claims by showing that self-training with a decent initial classifier indeed improves the performance in the target domain.

Weaknesses: Even though it is understood that the paper is more theoretical in nature, the experimental section feels ad-hoc. The authors assume that the spurious features are Gaussian and the non-spurious features a mixture of log-concave distributions. However, the experiments do not show that these assumptions hold. Is it possible to show that the weights for the spurious features are indeed driven to zero during the self-supervised pre-training? Also, what happens to the model accuracy if the classifier gets stuck at some sub-optimal point? Will self-training reject low-confidence samples? It would be better to include some study of the effect of regularization on the ability of the self-training approach to identify and avoid spurious features. Post-Rebuttal: Concerns addressed in the rebuttal. Ratings unchanged, in spite of some minor weaknesses and promises to add some details. I will not object if the majority of reviewers feel otherwise about the ratings.

Correctness: Appears to be correct

Clarity: Quite clear

Relation to Prior Work: Discussions on prior work done

Reproducibility: Yes

Additional Feedback: