Reviews: Two Generator Game: Learning to Sample via Linear Goodness-of-Fit Test

====== After reviewing the author feedback, my updated comments are as follows:

The theorems are a good addition and clear things up; I recommend including them.

Regarding the experiment comparing MMD, Lin-MMD, and FSSD: I don’t doubt that FSSD yields a more powerful test than MMD (as you say, a GOF test has access to information which isn’t available in a two-sample test). But GANs (even MMD-GANs) don’t just minimize the MMD; they minimize a loss involving a discriminator which was estimated from potentially millions of samples in past iterations. Similarly, DEAN doesn’t just minimize the FSSD; it minimizes an estimate of the FSSD where the true energy is replaced by a model which was estimated from samples (as with GANs, potentially millions of samples).

Overall I raise my score to a 6, with the strong recommendation that the authors back away from justifying DEAN through arguments about GOF tests being more efficient than two-sample tests. GANs are not two-sample tests, at least not between minibatches. It's a novel GAN algorithm, and it gives you an energy model "for free"; that should be justification enough.

========

The idea of training the generator in a GAN by using a GOF test is quite novel (to my knowledge) and interesting (in my opinion). GANs have computational and statistical challenges, and exploring the space of possible formulations has the potential to lead to improvements. I consider this paper a good exploration of a formulation which is quite different from the usual ones, even if it doesn't result in any improvements.

The paper is generally written clearly; the derivation and the experiments are easy enough to follow.

This paper seems to be missing a characterization of the solutions. In particular, is DEAN consistent (i.e., at equilibrium, the IGN recovers the data density) for any choice of energy model objective? L158-161 seems to suggest so, but it doesn't seem intuitive to me.

GOF tests are often more powerful than two-sample tests, and this paper seems to imply that this gives DEAN improved statistical power over usual GAN techniques (e.g. L61-65, L170-171, L175-176); in fact it seems to be the main selling point of the method. However, this claim isn't formally stated or explained in detail, nor backed by any theorem or experiment in the paper. I'd like to see the claim be made clearer and more explicit, and also a demonstration of this improved statistical power in a toy setting where the exact solution is known and exact optimization is possible. If the claim is improved statistical power, I'd want to see a plot showing faster convergence on a toy problem in terms of sample size.

Regarding the experiments: I think they're sufficient for the claim that "DEAN performs roughly on par with a good GAN", but not to demonstrate any practical improvement. For that you'd either need a bigger effect size, or a much more careful setup (e.g. human eval against strong baselines).

I like the novelty of the idea behind this paper and would very much like to see it accepted, but in its current state I think improvements are needed; hence I vote to reject.
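
For concreteness, the toy setting requested in the review above could start from something like the sketch below. It is not taken from the paper: it assumes a 1-D problem with data density p = N(0, 1) whose score is known exactly, a generator distribution q = N(shift, 1), a Gaussian kernel with bandwidth 1, and three hand-picked test locations. The function names (`fssd2_unbiased`, `linear_mmd2`, `toy_statistics`) are hypothetical. It computes the unbiased FSSD^2 estimate (a GOF statistic that uses the score of p) and the linear-time MMD^2 estimate (a two-sample statistic that only sees samples), so that both can be tracked as the sample size n grows.

```python
import numpy as np


def fssd2_unbiased(x, score, v, sigma=1.0):
    """Unbiased FSSD^2 estimate in 1-D with a Gaussian kernel.

    x:     (n,) samples from the generator q
    score: function returning d/dx log p(x) for the target p
    v:     (J,) test locations (hand-picked here; an assumption)
    """
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    n, J = x.shape[0], v.shape[0]
    diff = x[:, None] - v[None, :]                    # (n, J)
    k = np.exp(-diff**2 / (2.0 * sigma**2))           # k(x_i, v_j)
    dk = -diff / sigma**2 * k                         # d/dx k(x_i, v_j)
    tau = (score(x)[:, None] * k + dk) / np.sqrt(J)   # Stein features, d = 1
    total = tau.sum(axis=0)
    # U-statistic: average of tau_i . tau_j over all pairs with i != j
    return (total @ total - np.sum(tau**2)) / (n * (n - 1))


def linear_mmd2(x, y, sigma=1.0):
    """Linear-time MMD^2 estimate from disjoint pairs of samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = min(len(x), len(y)) // 2
    x1, x2, y1, y2 = x[:m], x[m:2 * m], y[:m], y[m:2 * m]

    def k(a, b):
        return np.exp(-(a - b) ** 2 / (2.0 * sigma**2))

    return np.mean(k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1))


def toy_statistics(n, shift, seed=0):
    """Return (FSSD^2, linear MMD^2) for q = N(shift, 1) against p = N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=shift, scale=1.0, size=n)   # generator samples
    y = rng.normal(loc=0.0, scale=1.0, size=n)     # data samples (MMD only)
    v = np.array([-1.0, 0.0, 1.0])                 # assumed test locations
    # Exact score of N(0, 1); in DEAN this would come from a learned energy model.
    fssd = fssd2_unbiased(x, lambda t: -t, v)
    return fssd, linear_mmd2(x, y)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, toy_statistics(n, shift=0.5))
```

Note that this is the idealized case in which the score of p is exact. Replacing `lambda t: -t` with the score of an energy model fitted to samples would bring the comparison closer to what DEAN actually optimizes, which is the distinction the updated comments above draw.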

The proposed model has two networks: one network that explicitly learns the energy distribution, and a second, implicit generation network that is trained by minimizing a goodness-of-fit statistic between the energy distribution and the generated data. The authors demonstrate how the Stein operator can be used for training the implicit generator, resulting in a two-step training procedure.

Originality: The proposed approach of minimizing the GOF statistic between generated data and an energy distribution seems novel. The authors have clearly highlighted their approach in the context of prior work on energy-based GANs.

Clarity: The paper is well written, though the experiments section could use more discussion of additional takeaways and insights. It's not clear whether the authors considered two different energy functions, one autoencoder-based and another RBM-based; a comparison of the two doesn't come up in the experiments.

Quality: The authors experiment with 3 datasets and compare against 5 baseline methods. They report Inception Score, Fréchet Inception Distance, and Kernel Inception Distance. The examples in Figure 2 generated from the proposed DEAN model seem to be of high quality. It would have been great to see some crowdsourced experiments to judge the quality of the generated images. Additionally, there aren't many insights about the method apart from better quantitative measures.

Paper ID:	6021
Title:	Two Generator Game: Learning to Sample via Linear Goodness-of-Fit Test
