NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 7170
Title: Generative Well-intentioned Networks

Reviewer 1

The paper is pleasant to read and presents a compelling innovation that complements GANs. The insight is simple yet powerful, with strong potential impact on production systems that rely on uncertainty quantification and offer the option to reject inputs.

Strengths:
1) The exposition is flawless, as is the experimental evaluation.
2) The method is original and simple, yet powerful, as shown in the experimental section.
3) The method differs significantly from Defense-GAN in a practically critical aspect: it only requires transformation of rejected examples.

Weaknesses:
1) It would be interesting to comment on the added latency of re-processing an input, as this may be a major practical problem when deploying such an approach. For instance, what would be the added computational cost of the transformer in typical cases?
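[Editorial note: the latency question can be made concrete with a minimal timing sketch. The sketch below assumes hypothetical PyTorch stand-ins `G` (for the GWIN transformer/generator) and `clf` (for the pre-trained classifier); neither reflects the paper's actual architectures.]

```python
import time
import torch

# Hypothetical stand-ins: in the paper's pipeline, G would be the trained
# GWIN transformer/generator and clf the pre-trained classifier.
G = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 784))
clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))

x = torch.randn(1, 1, 28, 28)  # one rejected MNIST-sized input

with torch.no_grad():
    t0 = time.perf_counter()
    logits = clf(x)              # first pass: classify
    t1 = time.perf_counter()
    x_new = G(x).view_as(x)      # transform the rejected input
    logits = clf(x_new)          # second pass: re-classify
    t2 = time.perf_counter()

print(f"classifier pass: {(t1 - t0) * 1e3:.2f} ms")
print(f"added transform + re-classify: {(t2 - t1) * 1e3:.2f} ms")
```

Since only rejected inputs incur the second pass, the amortized overhead scales with the rejection rate rather than the full input stream.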

Reviewer 2

The method, which uses GANs to map low-confidence examples to high-confidence examples of the same class, is highly original and shows promise as an effective approach. The authors also provide a good description of how their work differs from Defense-GAN, and they note some of the prior work on classifiers with a reject option (although they should also cite some of the more recent work doing so with DNNs, e.g., [1]). The paper is clearly written; however, there seem to be some information gaps regarding the network architecture and the mechanism for conditioning on the low-confidence image x. Additionally, several methods exist in the literature for conditioning the discriminator on label information ([2], [3], and simple concatenation), and it is not clear from the paper which one is used.

The main drawback of this paper is that the datasets used (MNIST and FashionMNIST) are too simple to allow the reader to draw informed conclusions. The performance of the baseline classifier (from which the rejected images are derived) is extremely poor. Simple convolutional network architectures achieve >99.5% accuracy on MNIST, compared to the 98% baseline, and >95% on FashionMNIST, compared to the 87.4% of the baseline used. This makes it difficult to evaluate how much of the improvement shown would be dwarfed by employing a better architecture. It is also not clear how the use of a Bayesian Neural Network interacts with the rejection function. How different are the rejections if only a single model's confidence score is used for the rejection criterion rather than the median? What about the mean?

Update: The authors have made an effort to address my main concern by using a stronger baseline classifier and training on harder datasets such as CIFAR-10. They have also clarified some of the GAN architectural details. I still think that the incremental benefit of using an ensemble for computing rejection thresholds is dubious at best and does not obviously incorporate any kind of Bayesian model uncertainty; I would like to see this clarified or investigated in the text. However, due to the improvements, I will raise my score to a 6.

[1] Geifman, Yonatan, and Ran El-Yaniv. "Selective classification for deep neural networks." Advances in Neural Information Processing Systems. 2017.
[2] Odena, Augustus, Christopher Olah, and Jonathon Shlens. "Conditional image synthesis with auxiliary classifier GANs." Proceedings of the 34th International Conference on Machine Learning. 2017.
[3] Miyato, Takeru, and Masanori Koyama. "cGANs with projection discriminator." arXiv preprint arXiv:1802.05637 (2018).
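[Editorial note: the single/mean/median rejection variants the reviewer asks about can be sketched minimally as below, assuming the BNN yields T stochastic softmax outputs per input (e.g., via MC dropout or weight samples). The function `reject` and threshold `tau` are illustrative, not the paper's implementation.]

```python
import numpy as np

def reject(probs, tau=0.8, rule="median"):
    """Decide whether to reject an input given T stochastic forward passes.

    probs: array of shape (T, num_classes) -- softmax outputs from T
           MC samples of the BNN.
    tau:   confidence threshold below which the input is rejected.
    rule:  'single' uses one sample; 'mean'/'median' aggregate over samples.
    """
    if rule == "single":
        conf = probs[0].max()                 # one pass, no ensembling
    elif rule == "mean":
        conf = probs.mean(axis=0).max()       # confidence of averaged predictive
    else:  # "median"
        conf = np.median(probs.max(axis=1))   # median of per-sample confidences
    return conf < tau

# Example: 10 MC samples over 3 classes for one input
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[5.0, 1.0, 1.0], size=10)  # samples lean to class 0
for rule in ("single", "mean", "median"):
    print(rule, reject(probs, tau=0.8, rule=rule))
```

Comparing the rejection sets these three rules produce on a held-out split would directly quantify the incremental benefit of the ensemble that the review questions.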

Reviewer 3

The author(s) developed an elegant architecture that employs GANs to improve the accuracy of classifiers. The authors have also proposed a baseline implementation of the proposed GWIN architecture. This work can potentially have significant impact in areas in which uncertain classification and mis-classification are costly.

Classification of uncertain observations with increased certainty is a potentially important and useful problem. However, it also risks increasing the false positive rate. The authors fail to discuss such vital consequences, and their experiments focus only on accuracy. For observations that cannot be classified with high certainty, it is somewhat unreasonable to translate them just for better recognizability under the specified classifier. To demonstrate the effectiveness of the proposed procedure, the experiments compare prediction accuracy on rejected samples, since the pre-trained classifier is not expected to have high accuracy on samples it does not have the confidence to judge. Also, when comparing with the baseline BNN, the BNN+GWIN has many more parameters and involves additional training steps on the confident data set. It is unclear whether the higher accuracy is a result of a more complex model and more training epochs.

Last but not least, for image classification, uncertainty is often induced by noise in the observations. How would the proposed procedure compare with state-of-the-art denoising methods in terms of overall prediction accuracy? On the other hand, if denoising can largely reduce the number of uncertain observations, then the contribution of this work is incremental at most.
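[Editorial note: the false-positive concern is straightforward to audit. Below is a minimal sketch of reporting per-class false positive rate alongside accuracy on the rejected subset; the labels are synthetic illustrations, not data from the paper's experiments.]

```python
import numpy as np

def per_class_fpr(y_true, y_pred, num_classes):
    """False positive rate per class: FP_k / (FP_k + TN_k)."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    fp = cm.sum(axis=0) - np.diag(cm)   # predicted class k, true class != k
    tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)
    return fp / np.maximum(fp + tn, 1)

# Illustrative labels/predictions on the rejected subset only
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 0])
acc = (y_true == y_pred).mean()
print(f"accuracy on rejected set: {acc:.2f}")
print("per-class FPR:", per_class_fpr(y_true, y_pred, 3).round(2))
```

Reporting these rates before and after the GWIN transformation would show whether the accuracy gains come at the cost of more confident mistakes.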