NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Overall the paper is quite well written and convincing. The claimed rigorous results are not proven in the paper itself, but proofs (based on known techniques, in particular those developed for ML problems in recent works, including some by the authors) are provided in the SM. Still, I think it would have been nice to give at least a high-level hint of how the proof works, to convince the reader not keen to read the SM, or as an introduction for the more mathematically inclined one. The results are strong, and the studied model is relevant despite being simple enough for a complete analysis. The results in terms of ODEs/PDEs/SDEs are rather dense and do not give much insight by themselves (the equations are quite complicated), but the authors then provide nice numerical experiments to back up the results, and a meaningful interpretation of the observed learning regimes, which I find very nice and enlightening. The paper is of high quality. Still, there are a number of typos that should be corrected. I recommend acceptance.

Update: I have read the authors' rebuttal and taken it into account.
Reviewer 2
In this paper, the authors derive a mean-field theory for the analysis of GAN training when the learning rate is not necessarily small. They show that the training dynamics of a large network are described by a smaller set of parameters that scale with the number of hidden features. The dynamics of the order parameters obey a collection of ODEs, while the microscopic parameters follow stochastic dynamics. The source of the noise in the dynamics is the noise in the data (here modeled with a spiked covariance matrix) and in the generative model. They provide a detailed analysis and simulation of a simple model with two hidden features and a simple quadratic loss function.

Analysis of the temporal evolution of the order parameters reveals several dynamical phases. In one phase the weights of the generator converge to high overlap with those of the data model, signifying successful feature retrieval by the generator. In another dynamical phase, the weights of the student oscillate around the correct values without ever converging. In the last phase, the system exhibits mode collapse, where the generator finds only some of the features. They find that the bifurcation parameter separating these solutions is the total amount of noise in the system, given by the level of noise in the data together with the noise level used for the generator. Interestingly, the macroscopic dynamics of the order parameters reveal an interesting hierarchical interaction in which one feature is learned after the other. I think the authors should have focused more on this result, and perhaps compared it to known analyses of simple GANs (I do not know of any, so either this is an exciting new result or the authors should compare it with previous findings).

However, I also have some criticism:
- The authors introduce the macroscopic overlap matrix M, which aggregates all the order parameters into one matrix, but this makes the presentation unclear: different index ranges of this matrix mean different things, which makes it hard to follow.
- The result that the macroscopic dynamics are deterministic and accurate in the asymptotic regime, with fluctuations scaling as 1/sqrt{N}, while the microscopic dynamics are stochastic, seems trivial to me. These are the starting point for any mean-field theory. While these assumptions can, and often should, be proven rigorously, the underlying microscopic noise and the self-averaging of the teacher and student weights in this model are enough to support this claim. (See the numerical sketch after this review.)

To conclude, I think this paper shows interesting findings on the training dynamics of GANs. I suggest that the authors focus more on the results they gain by analyzing the dynamics of the order parameters, and discuss further their meaning and applications. This further discussion could replace the comparison between the deterministic macroscopic dynamics and the stochastic microscopic ones, which seems trivial.

** update **
Despite the authors' response, I still find the use of the concatenated matrix M confusing. Nevertheless, it may be the better choice given the short-format limitation. Either way, I still find this work interesting and believe it should be published. I am also satisfied with the authors' response to my comments about the emphasis on the stochastic microscopic dynamics. I agree that, given previous works on the statistical analysis of GANs, with which I am not well familiar, it is an important point to make.
I would suggest the authors add a couple of sentences explaining and emphasizing this, for the likes of me who find these statements trivial.
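To make the self-averaging point above concrete, here is a minimal numerical sketch (an illustration under simplified assumptions, not the paper's actual model or estimator): samples are drawn from a spiked covariance model, an overlap order parameter between the planted feature and a crude one-step estimate is measured over many runs, and its run-to-run fluctuations shrink roughly as 1/sqrt(N), which is the concentration behaviour the macroscopic description relies on. The spike strength, the square sample size, and the estimator are assumptions made purely for illustration.

# Minimal sketch (not the paper's exact model): self-averaging of an overlap
# order parameter under a spiked covariance data model, x = spike * c * u + noise.
import numpy as np

rng = np.random.default_rng(0)

def overlap_fluctuation(N, trials=200, spike=2.0):
    """Std. dev. over trials of the overlap m = u.v between the planted
    direction u and a one-step Hebbian estimate v built from n = N samples."""
    u = np.ones(N) / np.sqrt(N)            # planted feature direction
    ms = []
    for _ in range(trials):
        c = rng.standard_normal(N)          # latent coefficients, one per sample
        X = spike * np.outer(c, u) + rng.standard_normal((N, N))   # data matrix
        v = X.T @ c / N                     # crude "student" estimate of u
        ms.append(u @ v)                    # overlap order parameter
    return np.std(ms)

for N in (100, 400, 1600):
    print(N, overlap_fluctuation(N))        # fluctuations shrink roughly as 1/sqrt(N)

Quadrupling N roughly halves the spread of the measured overlap, while its mean stays put, so the order parameter concentrates on a deterministic value even though the individual weights remain noisy.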
Reviewer 3
The analyses performed in this study are concrete and sound. It is an interesting direction to apply the theoretical framework of scaling limits of stochastic processes to the training of GANs. This work reveals that the training dynamics are highly nontrivial even in a simple setting with single-layer models (i.e., a linear model as the generator and a simple perceptron as the discriminator). It enables us to theoretically quantify the effects of stochastic noise on the dynamics and, especially, to observe that appropriately sized noise leads to successful feature recovery and prevents oscillation and mode collapse.

My major concern is how one can generalize this analysis to more general settings.
・When d>1: Although the dynamics shown in Figure 1 are for the case d=2, the phase diagram shown in the Supplementary Material corresponds to d=1. Since practical GANs usually assume multiple features (d>1), it seems better to show a detailed analysis of d=2. It would be helpful to enrich the description of d=2, which is briefly discussed in lines 45-50 of the Supplementary Material, and to show its phase diagram.
・Norm constraints on U (in line 77): The authors say that we can take the columns of U to be orthonormal without loss of generality. Is there any preconditioning or whitening method which can normalize an arbitrary U to a matrix with orthonormal columns? Assuming this orthonormal U will be essential in the analysis of Example 1, because it should be impossible to find the solution V=U when U and V have different norms. It would be better to describe how one can normalize U, which determines an unknown data distribution. (A possible construction is sketched after this review.)
・Case of nonlinear generators: The linear generator (1) seems to be an idealized setting, and one natural question is how one can generalize this analysis to non-linear generators. It would be helpful to remark on the case of non-linear generators and explain at what point the analytical evaluation becomes intractable for non-linear generators such as a mixture of Gaussians or a VAE.

Presentational issues:
- line 1, “shallow GAN”: This term seems to be hardly used. Precisely speaking, this study investigated a solvable GAN with single-layer generative and discriminative models.
- Supplementary Material, line 51, “heuristic derivations”: In what sense is it heuristic? The derivation shown here seems to be asymptotically exact when n is sufficiently large. Is there any approximation, or do you mean that the derivation is not mathematically rigorous but intuitive?

Typos:
- Caption in Fig 1: model collapsing -> mode collapsing?
- Line 8: this analysis provide"s"
- Line 297: our analysis provide"s"

--- After rebuttal -------------------------------------------------------
The authors' response resolved most of my concerns. I am still not fully convinced of how the analysis can be extended to more general settings, but the authors did some additional work on understanding the phase diagram for d>=2 and promised to add comments on more complicated generators such as ReLU and a learnable P_\tilde{c}. So, I keep my score and look forward to seeing the revised paper.
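On the norm constraint on U raised above: one standard way to reduce an arbitrary full-column-rank U to a matrix with orthonormal columns is a thin QR factorization, U = QR, which preserves the column span while absorbing the column norms and correlations into R (equivalently, into the covariance of the latent code). The following is a minimal sketch under the assumption that U has full column rank; it is an illustrative construction, not necessarily the normalization the authors have in mind.

# Minimal sketch (assumption: U has full column rank d): reduce an arbitrary
# feature matrix U to one with orthonormal columns via a thin QR factorization.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 2
U = rng.standard_normal((n, d)) * np.array([3.0, 0.5])   # columns with different norms

Q, R = np.linalg.qr(U, mode="reduced")    # U = Q @ R, with Q having orthonormal columns
print(np.allclose(Q.T @ Q, np.eye(d)))    # True: columns of Q are orthonormal
print(np.allclose(Q @ R, U))              # True: Q spans the same column space as U

# The norms and correlations of the original columns sit in R, so a data model
# built from U can equivalently be written in terms of Q with a transformed
# latent covariance; this is one sense in which orthonormal U loses no generality.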