NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 7259
Title: Self-supervised GAN: Analysis and Improvement with Multi-class Minimax Game

Reviewer 1

Originality: The method is relatively new, although it is similar to some conditional GAN works in the literature. The main contribution is the analysis showing the limitations of prior GAN+SSL work and the proposal of a scheme with a better chance of succeeding (at least theoretically); experiments then show an improvement. It would be good to show more of the analogies to prior conditional GAN work. This would not hurt the contribution; rather, it would better clarify its context and provide more links for practitioners, who could then better understand it.

Quality: There are some inconsistencies that I would like the authors to clarify. A minimax game should use the same cost function for the optimization of the discriminator, the generator, and the classifier. However, as is evident from the general formulation in eqs. 1 and 2, this is not the case here. It would be good to clarify that this is not a standard minimax optimization. Even setting this inconsistency aside, I have further considerations I would like the authors to respond to. To my understanding, the proposed method (eq. 11) is also probably not the ultimate solution, as the optimal solution of eq. 11 is a tradeoff between the ideal loss (the KL term) and the additional SSL term. Indeed, after eq. 11 one can continue the calculations and arrive at the final optimization objective for the generator G. As a first step, the optimal D in eq. 8 is the usual one, since the second term does not depend on D. We can therefore plug the expression D(x) = pd(x)/(pd(x)+pg(x)) and the expression in eq. 11 into eq. 9 to obtain the cost function in G. If lambda_g = 1 (and my calculations are not wrong), this expression simplifies to a KL divergence between (pd+pg)/2 and pd, plus a constant and the second term of eq. 11. This means two things: 1) the new objective in G is not symmetric in the distributions pd and pg (unlike the original formulation); 2) because of the second term, the solution in G will be a tradeoff between the GAN objective and the classifier objective. A tradeoff could hurt both objectives, and in particular could be worse for the GAN objective, which is the one of interest. Could the authors clarify, with a clear mathematical/logical explanation, whether my derivation or my conclusions are wrong? In general, I would reserve theorems for major results; some of the theorems in the paper are basic intermediate calculations. For example, Theorem 1 is quite trivial, and I would not make a theorem out of a simple substitution.

Clarity: The paper is clearly written for the most part. It clearly has a space problem, as it needs room for equations and theorem proofs. The supplementary material helps, but I have the impression that more could be written in the theoretical analysis (e.g., see my comments above on continuing the derivation up to the final optimization in G). The work is probably non-trivial to reproduce, so sharing the code would be quite important. There are some typos and grammar errors, but they are not too serious. It would certainly help the reader if the authors revised the writing or had a native English speaker revise the text.

Significance: The results show that the proposed method is better than [4]. It is always difficult to judge whether this is due to the proposed change or instead to better tuning and engineering. The authors mention that they also searched parameters for the competing method but could not replicate its published performance. This makes the experimental evaluations difficult to assess.

I read all the reviews and the rebuttal. We briefly discussed the rebuttal and I decided to keep my score.
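To make my derivation concrete, here is a sketch of the steps (eq. numbers refer to the paper; only the classical GAN part is spelled out, and the SS term of eq. 11 is left abstract as T_SS):

```latex
% Step 1: the second term of eq. 8 does not depend on D, so the
% optimal discriminator is the usual one:
D^{*}(x) = \frac{p_d(x)}{p_d(x) + p_g(x)}

% Step 2 (classical result, for reference): plugging D^{*} into the
% standard GAN objective yields the symmetric form
V(G, D^{*}) = -\log 4 + 2\,\mathrm{JSD}(p_d \,\|\, p_g)

% Step 3 (my claim): with the additional term of eq. 11 and
% \lambda_g = 1, the generator objective instead simplifies to
V(G) = \mathrm{KL}\!\left(\tfrac{p_d + p_g}{2} \,\Big\|\, p_d\right)
       + \mathrm{const} + T_{\mathrm{SS}}(G)
% which is no longer symmetric in p_d and p_g, and whose minimizer
% trades off the divergence term against T_{SS}.
```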

Reviewer 2

Overall the paper is well-written and easy to understand; I actually like it a lot. The logic is also clear: the paper discovers the limitation of the previous methods using both theoretical analysis and a toy example, then proposes a simple solution that works better. The only concern is that the empirical experiments are conducted only on simple datasets. Results on more complex datasets (such as ImageNet and CIFAR-100) would improve the quality of this paper.

1. Self-supervised signal
- The proposed model and the previous models use rotation as the self-supervised signal.
- Are there other signals that could be used? Rotation applies only to image datasets and cannot be extended to other domains such as time-series or tabular data.

2. Figure 3
- Lines 167-180: The authors claim that the previous method can mode-collapse toward generating samples whose rotations are easily captured by the classifier.
- It would be good if the authors also showed the results of the proposed model and demonstrated that it suffers less from mode collapse.
- The point is that the toy example can be used not only to show the limitation of the previous method, but also to show the superiority of the proposed one. Currently I can see that the previous method has a problem; however, it is not clear whether the proposed method solves it in the same setting.

3. Section 5
- It seems the additional losses can be interpreted as another discriminator that distinguishes rotated fake images from rotated real images.
- In that case, why do the authors put this in the classifier? Could a separate discriminator be used to distinguish real rotated images from fake rotated images? Such a separate discriminator would have higher flexibility.
- In the theoretical analysis, the authors also mention that this minimizes the KL divergence between rotated real and fake images.
- It would be good to show results for this variant as well, to support the proposed method in comparison to other possible models.

4. Datasets
- The authors provide results on fairly small and less complex datasets (CIFAR-10 and STL-10).
- Results on more complex datasets with more (i.e., more diverse) labels, such as CIFAR-100 and ImageNet, would improve the quality of the paper.
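For concreteness, the rotation pretext task discussed in point 1 can be sketched as follows (a minimal NumPy sketch; the function name and array shapes are illustrative assumptions, not the paper's code):

```python
import numpy as np

def make_rotation_task(images):
    """Build the 4-way rotation pretext task used as the SS signal.

    images: array of shape (N, H, W, C). Returns all four rotated
    copies stacked together, plus their rotation labels
    (0: 0 deg, 1: 90 deg, 2: 180 deg, 3: 270 deg).
    """
    rotated, labels = [], []
    for k in range(4):
        # np.rot90 rotates k quarter-turns in the (H, W) plane
        rotated.append(np.rot90(images, k=k, axes=(1, 2)))
        labels.append(np.full(len(images), k))
    return np.concatenate(rotated), np.concatenate(labels)
```

The classifier is then trained with cross-entropy on these labels. Note that this construction is only defined for image tensors, which is exactly why the signal does not transfer to time-series or tabular data.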

Reviewer 3

This paper provides an in-depth discussion of the self-supervised (SS) learning mechanism in GAN training. The theoretical analysis reveals that the existing SS design indulges mode-collapsed generators. This work presents an improved SS task for GAN, which is shown to achieve better convergence properties in both theoretical and empirical analysis. The paper is well-written and easy to follow, as it presents both the motivation and the solution clearly. The analysis of the interaction between the SS task and the learning of the generator provides insights for other designs of GAN training methods. Empirical results are shown both in the main content and in the appendix. Below are some points of confusion that the authors should clarify in the rebuttal.

1. Figure 3 is hard to understand based on the content on Page 5, Lines 151-157. The authors should clearly explain what Figure 3 (a) and (b) show.

2. The paper directly adopts many reported baseline results from other papers, e.g. in Table 1. Are these methods compared under the same settings? Moreover, many results are missing without explanation. I encourage the authors to clarify this.

3. Since the improved version focuses on tackling the mode-collapse problem, it would be better to present comparisons on some criteria, e.g. the number of modes covered, in the main body of the paper.
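For reference, the mode-coverage criterion suggested in point 3 can be computed on Gaussian-mixture toy data roughly as follows (a sketch; the distance threshold and minimum-fraction convention are illustrative assumptions, as papers vary in the exact definition):

```python
import numpy as np

def count_covered_modes(samples, mode_centers, thresh=0.1, min_frac=0.01):
    """Count mixture modes 'covered' by generated samples.

    A mode counts as covered when at least `min_frac` of the samples
    fall within distance `thresh` of its center.
    samples: (N, D) generated points; mode_centers: (K, D) true modes.
    """
    # pairwise distances, shape (N, K)
    d = np.linalg.norm(samples[:, None, :] - mode_centers[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    within = d[np.arange(len(samples)), nearest] < thresh
    covered = 0
    for m in range(len(mode_centers)):
        # fraction of all samples assigned to mode m and close to it
        if np.mean(within & (nearest == m)) >= min_frac:
            covered += 1
    return covered
```

A fully mode-collapsed generator would score 1 on such a criterion, while a generator covering the whole mixture would score K, which makes it a direct complement to the FID-style scores already reported.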