NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
This paper explores a very interesting idea, that unfortunately is marred by rushed writing. The paper is also filled with typos, that make understanding it and judging the quality of the work extremely hard. Exploring the central idea of the paper, that of a generator playing vs a discriminator and a classifier, and giving more intuition about this approach and why the authors expect it to work better than other traditional GAN based methods was extremely necessary. Instead the authors just mention it in passing just before section 3. This has the potential to be a good paper in a future version if all the writing issues are addressed and the paper is planned out a bit more. Nevertheless, I outline a few questions I had to the authors of the paper. 1. Why does the RHS of min max U(C,G,D) in the main equation (10) not depend on G? Same question for equation 14. 2. Is there a inf_{\lambda > 0} missing from the second term of the LHS of equation 4 and equation 5? If yes, how does that change things? 3. I understand trying to bound the error for the worst-case distribution in an \epsilon Wasserstein ball during the theoretical result, as that gives an upper bound for all other distributions. But why train on the worst-case distribution? What does the framework achieve from this? 4. Given that the generator approximates a worst-case distribution in the \epsilon ball, one would expect that for good data (without noise added) just the classifier would achieve better performance than this framework. Why isn't this the case in the MNIST set within the Imbalance 1 column and SVHN set under the Normal column?
Reviewer 2
The formulation in eq. 1 appears new. However, I am not convinced why minimizing the loss w.r.t. worst case distribution (in some epsilon ball) is a good idea (eq. 1). In this formulation, the task of classifier is made harder by having to model distributions that may be unlike the true distribution. Further, it is evident from experiments that when the epsilon is large, the method doesn’t work. Please comment on how the method is different from adversarial data augmentation methods for smaller epsilon. It is not clear to me how the second line of eq. 8 follows from first line and what Q_i are? Line 267, concludes that proposed method learns distributions closer to true distribution consistent with some claim in sec. 3. It is not clear which claim it is consistent with. Further, it is not clear why it would model “true” distribution. This is because, the generator is learned to model the worst case distribution within a epsilon ball of the empirical distribution and there is no reason for that worst case distribution to be the true distribution. In my view any closeness to true distribution is purely circumstantial. How does the method compare to data augmentation methods e.g. mixup or simple addition of noise to the training examples? This comparison is important because the proposed method can be viewed as a data augmentation method, where the generator acts as a learned data augmentor. Other minor points Please change “Crap” in the title and paper to something more prudent e.g. “Bad” Line 62: demotes -> denotes Line 67: assumed to include For clarity (e.g. in eq. 3) it may be better to use some other letter instead of “d” to represent d(Q, P) Line 89: matrix -> metric
Reviewer 3
I have studied machine learning for around eight years. This paper is out of my expertise. However, based my educated guess, this paper is well written. The theory seems strong and sound.
Reviewer 4
The paper has provided an interesting approach to conduct learning from crap training data. Different from existing three-player works that lie in semi-supervised setting, this paper aims to investigate the underlying data generating scheme with the help of generator, and thus a different three-player game has been established to learn from crap data. The authors have clearly illustrated the proposed algorithm. Beginning with a general optimization problem over probability distributions in the ambiguity set, the authors derived a tractable solution by introducing a generator to approximate the worst-case distribution and Wasserstein distance as the measure of distributions. With the help of a generator network, the proposed algorithm actually implies a robust solution that can deal with the worst-case data distribution. Experimental results on real world dataset demonstrate the effectiveness of exploring the generation and classification tasks in a unified framework. Most importantly, the paper provided some theoretical analysis to better understand the properties and optimization of the proposed algorithm.