NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 2260
Title: A Prior of a Googol Gaussians: a Tensor Ring Induced Prior for Generative Models

Reviewer 1

I have read the author response and the other reviews and have decided to keep my original score of 7.

Summary: The paper proposes a family of priors for GANs and VAEs. These priors are mixtures of Gaussians with a very large number of components that can nevertheless be represented with a small number of learnable parameters via a tensor ring decomposition. This family of priors enables efficient marginalization and conditioning, and the method is applicable to both discrete and continuous latent variables. The method is extended to conditional generative modeling; in particular, missing values in the conditioning variable can be marginalized out. Experiments are conducted on CelebA and CIFAR-10.

Originality: The proposed method is novel to my knowledge.

Clarity and Quality: The paper is very well written and easy to follow. The experiments are somewhat satisfying. I would have liked to see comparisons to works using richer priors; for example, a comparison to the VampPrior [1] for the VAE experiment would be useful. Furthermore, it is not clear whether TRIP outperforms the GMM baseline solely because it has higher capacity. For example, the appendix mentions that the GMM uses 1000 components; I was expecting 1280 = 128 * 10 components (128-dimensional latents with 10 Gaussians per dimension). See Section 3 of the supplement.

Significance: For the VAE, I would deem this work significant if it were shown that it can also help with latent variable collapse. For the GAN, I would deem this work less significant, as it relies on REINFORCE, which is somewhat problematic due to its high variance (this is rightfully acknowledged in the paper; see the sketch after this review).

Questions and Minor Comments: (1) What happens when you use this approach to form the variational distribution in the VAE? (2) Line 100: it is "log marginal likelihood", not "marginal log-likelihood". (3) For the GAN, did you also use multiple samples from the prior as a GAN baseline? (4) Why not use 1280 = 128 * 10 components for the GMM baseline in the GAN model? That would be fairer to the baseline. (5) How do you select the core size m_k?

[1] Tomczak, J. M. and Welling, M. VAE with a VampPrior. AISTATS, 2018.
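As a purely illustrative aside to the REINFORCE point above, the following minimal NumPy sketch shows a generic score-function (REINFORCE) gradient estimator for a categorical variable with a mean-reward baseline; the names (logits, reward_fn, n_samples) are hypothetical, and this is not the paper's actual training procedure. The estimate is unbiased, but its variance scales with the spread of the rewards, which is the practical concern raised above.

    import numpy as np

    def reinforce_grad(logits, reward_fn, n_samples=64, rng=None):
        """Score-function (REINFORCE) gradient of E_{s ~ Cat(softmax(logits))}[reward_fn(s)]
        with respect to the logits, using a mean-reward baseline to reduce variance."""
        rng = np.random.default_rng() if rng is None else rng
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        samples = rng.choice(len(logits), size=n_samples, p=probs)
        rewards = np.array([reward_fn(s) for s in samples])
        baseline = rewards.mean()                 # crude control variate
        grad = np.zeros_like(probs)
        for s, r in zip(samples, rewards):
            score = -probs.copy()                 # d log Cat(s) / d logits = one_hot(s) - probs
            score[s] += 1.0
            grad += (r - baseline) * score
        return grad / n_samples

In practice one would use stronger baselines or control variates; the sketch is only meant to make the variance concern concrete.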

Reviewer 2

Thank you to the authors for performing these experiments and addressing the concerns raised by the reviewers. I am pleased to see the performance of TRIP in the context of flows as well. I recommend that this paper be accepted.

==

The authors present TRIP (Tensor Ring Induced Prior), a parametric family of distributions. These distributions are parameterized as a tensor ring decomposition (Zhao et al., 2016) by d "cores," which define a distribution over d discrete variables. A continuous distribution over R^n is then obtained by placing one Gaussian distribution on each value of the discrete variables, which corresponds to a mixture of a very large number of Gaussians (10^100 Gaussians in this paper); see the sketch after this review for how the cores induce such a mixture. The authors then demonstrate the effectiveness of this parameterization as a learnable prior for VAEs and GANs. They justify the approach by arguing that the inherent multimodality of this parameterization may better suit the multimodal nature of natural images, citing half-present glasses in CelebA samples from GANs as a drawback of unimodal priors.

Originality: This work builds on a wide body of work on learned priors. The approach seems novel as far as I am aware, although I am not familiar with the related work on tensor decompositions.

Quality: The authors carefully motivate, define, and experimentally test this approach in a wide variety of settings. One concern I have about the experimental setup is that the authors compare TRIP to an N(0, I) prior and a GMM prior. These seem like unfair comparisons because TRIP has many more parameters than N(0, I) and the GMM. It may be fairer to compare to a decoder with the same number of parameters as a TRIP-based decoder would have. I would also have liked a better sense of how much slower a TRIP prior is to train compared to the standard approaches.

Clarity: I found this paper to be well-written and easy to follow. The logic flows well from section to section. I very much appreciated the visualizations, especially Figures 1, 4, and 5.

Significance: TRIP seems like a practical algorithm that can be used as a prior for VAEs and GANs, or more generally whenever a mixture of a large number of Gaussians is desired.
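To make the construction above concrete, here is a minimal NumPy sketch of how d nonnegative tensor-ring cores (each of assumed shape m x N x m, with one 1-D Gaussian per dimension and discrete value) can induce a mixture of N^d Gaussians whose density remains cheap to evaluate. This is my reading of the construction under those assumptions, not the authors' code; the core shapes, normalization, and exact parameterization in the paper may differ.

    import numpy as np
    from scipy.stats import norm

    def trip_density(z, cores, mu, sigma):
        """Density of z in R^d under a tensor-ring induced mixture of N**d Gaussians.

        cores: list of d nonnegative arrays of shape (m, N, m)   -- the TR "cores"
        mu, sigma: arrays of shape (d, N)  -- one 1-D Gaussian per (dimension, value)
        The weight of component (s_1, ..., s_d) is proportional to
        trace(cores[0][:, s_1, :] @ ... @ cores[d-1][:, s_d, :]).
        """
        d = len(cores)
        num = np.eye(cores[0].shape[0])   # running product for the numerator
        den = np.eye(cores[0].shape[0])   # running product for the normalizer
        for k in range(d):
            lik = norm.pdf(z[k], loc=mu[k], scale=sigma[k])      # shape (N,)
            num = num @ np.einsum('isj,s->ij', cores[k], lik)    # sum_s Q_k[s] * N(z_k | ...)
            den = den @ cores[k].sum(axis=1)                     # sum_s Q_k[s]
        return np.trace(num) / np.trace(den)

The denominator also shows where the tractable marginals come from: summing a core over its discrete index marginalizes out the corresponding latent dimension, so using the cores[k].sum(axis=1) factor in place of the likelihood-weighted one for a subset of dimensions marginalizes them out at no extra cost.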

Reviewer 3

# Overall
This paper introduces a complex prior (TRIP) for deep generative models. TRIP has tractable marginal and conditional distributions and can represent an exponentially large mixture of Gaussians with a small number of parameters. Overall, the paper is well written, the proposed technique is elegant, and the motivation is clear. The main weakness is the experimental evaluation.

# Weaknesses
- Some important related works are discussed in Sec. 5 but are not compared against directly in the experiments. What is gained by TRIP over autoregressive priors [12, 13] or flow-based priors [15]? There are no quantitative comparisons between training the generative models with TRIP and with other advanced parameterized priors.
- What is the computational cost of TRIP? Since TRIP introduces additional parameters for the prior and brings extra computation, it is worth knowing how much it slows down training (see the rough estimate after this review).
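For a rough sense of scale (my own back-of-the-envelope numbers with illustrative values, not figures from the paper): the parameter count of a tensor-ring prior grows linearly in the latent dimension d, the implied number of mixture components grows as N^d, and one density evaluation costs on the order of d * N * m^2 operations.

    # Back-of-the-envelope cost of a tensor-ring prior (illustrative values, not from the paper).
    d, N, m = 128, 10, 16           # latent dims, Gaussians per dim, core (ring) size

    core_params = d * m * N * m     # d cores of shape (m, N, m)
    gauss_params = 2 * d * N        # a mean and a variance per (dimension, value)
    total_params = core_params + gauss_params
    components = N ** d             # size of the implied Gaussian mixture

    ops_per_eval = d * N * m * m    # dominant cost of one density evaluation (matrix products)

    print(f"parameters: {total_params:,}")           # ~330,000
    print(f"mixture components: {components:.1e}")   # 1.0e+128 for these values
    print(f"~ops per density eval: {ops_per_eval:,}")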