NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 1661
Title: Learning Hierarchical Priors in VAEs

Reviewer 1


This is solid work. The authors propose a simple hierarchical prior, modelled as an infinite mixture, and improve state-of-the-art density estimation. This can be seen as an extension of both GECO (Rezende et al.) and the VampPrior VAE (Tomczak et al.), both of which are properly cited. The paper is well written and very clear, and the evaluation corroborates the expected improvement in performance. Some minor comments:
* Should the optimisation problem be: min KL(q||p) s.t. E[C(x, z)] < \kappa^2?
* I wish the authors had described the results of the graph-based interpolations in more detail. In what way are the interpolations done with VHP+REWO better than those of the models compared against? Is there a way to quantify these results, e.g. path length under the A* algorithm, as sketched below?
* It would also be good to see whether the Lagrangian update alone already leads to good performance, or whether the improvement is actually due to the hierarchical prior.
* When reporting results, it would be better to do multiple runs and report the log-likelihood mean plus standard error.
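On the quantification point, here is a minimal sketch of one possible metric: the shortest-path length between two latent codes on a k-nearest-neighbour graph, computed with A*. This is a hypothetical illustration using networkx and scikit-learn, not code from the paper; latent_codes stands in for encoder means, and all names are mine.

```python
# Hypothetical sketch: score a latent-space interpolation by the length of the
# shortest path between two codes on a k-NN graph (one possible quantification).
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def knn_graph(latent_codes, k=10):
    """Build an undirected k-NN graph whose edge weights are Euclidean distances."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(latent_codes)
    dists, idxs = nbrs.kneighbors(latent_codes)
    graph = nx.Graph()
    for i, (d_row, j_row) in enumerate(zip(dists, idxs)):
        for d, j in zip(d_row[1:], j_row[1:]):  # skip the point itself
            graph.add_edge(i, int(j), weight=float(d))
    return graph

def interpolation_path_length(graph, latent_codes, start, goal):
    """A* shortest-path length between two node indices, Euclidean heuristic.
    Assumes start and goal lie in the same connected component."""
    heuristic = lambda u, v: float(np.linalg.norm(latent_codes[u] - latent_codes[v]))
    return nx.astar_path_length(graph, start, goal, heuristic=heuristic, weight="weight")

# Usage with a random stand-in for encoder means:
codes = np.random.randn(500, 2)
g = knn_graph(codes, k=10)
print(interpolation_path_length(g, codes, start=0, goal=42))
```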

Reviewer 2


This paper discusses how existing methods with a hand-designed prior can over-regularize the posterior, and proposes to learn a complex prior that captures the structure of the data manifold more effectively. To learn such a prior, the paper adopts and modifies a dual optimization technique and introduces an efficient algorithm for updating the hierarchical prior and posterior parameters. The combination of the complex prior with the introduced algorithm learns a posterior with a more informative latent representation and avoids posterior collapse. In addition, the paper introduces a graph-search method to interpolate between latent states and shows in the experiments section how effectively the algorithm discovers a meaningful posterior. The contributions of this paper can be summarized as follows:
- Introducing a hierarchical prior that avoids over-regularization of the posterior while learning the latent-variable manifold.
- Adopting and extending an optimization technique and an algorithm to learn the hierarchical prior and hierarchical posterior parameters; the authors use importance-weighting techniques to model the prior and the hierarchical posterior.
- Defining an interpolation technique in the latent space and demonstrating its success in several experiments.
In the experiments section, the authors demonstrate their method on synthetic and real data and show that it outperforms competing algorithms. The appendix contains very useful information about how changing pieces of the algorithm or the prior changes the outcome, and it makes a convincing case for why this setup was chosen.

Quality: The motivation, claims, and supporting material in the main paper and the supplementary material are explained well, and I could not find any significant technical issues with the details of the claims made in this paper. The quality of the experimental results is very good; many possible variations have been tried, and the authors' choices are justified. My one request is to show whether equation 9, the objective function of the optimization, lower-bounds the marginal log-likelihood of the data under certain conditions on \lambda and \kappa (see the sketch below).

Clarity: I think the paper's objectives and explanations are quite clear, and the flow of the material is very smooth.

Originality: As mentioned in the summary, the main contributions are the hierarchical prior that avoids over-regularizing the posterior, the adapted optimization technique and algorithm (with importance weighting) for learning the hierarchical prior and posterior, and the latent-space interpolation technique demonstrated across several experiments. The paper makes a solid original contribution. The authors have done a detailed literature review; most related works are mentioned, and the contribution of this paper is compared against them and highlighted clearly. My only suggestion is that the authors take a look at Molchanov, Dmitry, et al., "Doubly semi-implicit variational inference," arXiv preprint arXiv:1810.02789 (2018), which uses a semi-implicit hierarchical prior and hierarchical posterior distribution, making it very similar to this paper. I believe the optimization of parameters here is more efficient than in doubly semi-implicit variational inference because of how the inner-outer loop is handled in this paper.
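To make the bound question concrete, here is a minimal sketch of a GECO-style constrained objective and its Lagrangian relaxation. This is my own reconstruction from the review's description and the published GECO formulation, not necessarily the paper's equation 9; C, \kappa, and \lambda follow the notation used above.

```latex
% Sketch (my reconstruction, GECO-style notation; not copied from the paper under review):
% the constrained problem and its Lagrangian relaxation.
\begin{align}
  &\min_{\theta,\phi}\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
    \quad \text{s.t.} \quad \mathbb{E}_{q_\phi(z \mid x)}\big[C(x, z)\big] \le \kappa^2, \\
  &\mathcal{L}(\theta,\phi;\lambda) =
    \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
    + \lambda\Big(\mathbb{E}_{q_\phi(z \mid x)}\big[C(x, z)\big] - \kappa^2\Big).
\end{align}
% With C(x, z) = -\log p_\theta(x \mid z) and \lambda = 1, the Lagrangian equals the
% negative ELBO minus \kappa^2, so minimising \mathcal{L} maximises the ELBO and
% -\mathcal{L} - \kappa^2 \le \log p_\theta(x).
```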
Significance: The method is new and original. I think the experiments section answered many questions I had while reading the paper, and the extent of the experiments exceeds that of the average NeurIPS submission. As mentioned in the originality section, I refer the authors to Molchanov, Dmitry, et al., "Doubly semi-implicit variational inference," arXiv preprint arXiv:1810.02789 (2018), whose method they could potentially compare against.

Reviewer 3


The paper contributes an extension of GECO that admits a hierarchical prior p(z) = \int p(\zeta) p(z|\zeta) d\zeta whose likelihood is intractable. The motivation for such a prior is to combat posterior collapse and enable better disentanglement. The authors then design an importance-weighted upper bound on KL(q(z|x) || p(z)) by lower-bounding log p(z) using samples from an importance distribution q(\zeta|z); a sketch is given below. GECO, as well as REWO (the proposed variant), relies on constrained optimisation, where the constraint is replaced by a weighted penalty whose weight (the Lagrange multiplier \lambda) is optimised via SGD. In this paper the authors propose a modification of the update for \lambda that promotes convergence to a proper ELBO whenever the constraint is satisfied. The constraint corresponds to a pre-specified expected reconstruction loss. The novelty lies in the combination of techniques (GECO + importance-weighted bounds) and in the different update rule for the Lagrange multiplier. I found the paper quite clear, though the argumentation is sometimes a bit too informal (see, for example, lines 123--131, where the authors list merits of the technique, perhaps justified empirically, that are hard to predict from the design choices alone).
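For reference, a minimal sketch of the importance-weighted construction described above; this is my own reconstruction following standard IWAE-style bounds, not the paper's exact equations.

```latex
% Lower-bounding \log p(z) with K samples from q(\zeta | z) yields an upper bound
% on the KL term (my reconstruction of the construction described in the review).
\begin{align}
  \log p(z)
    &= \log \int p(\zeta)\, p(z \mid \zeta)\, d\zeta
     \;\ge\; \mathbb{E}_{\zeta_{1:K} \sim q(\zeta \mid z)}
       \left[ \log \frac{1}{K} \sum_{k=1}^{K}
         \frac{p(\zeta_k)\, p(z \mid \zeta_k)}{q(\zeta_k \mid z)} \right], \\
  \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
    &\le \mathbb{E}_{q(z \mid x)}\big[ \log q(z \mid x) \big]
     - \mathbb{E}_{q(z \mid x)}\, \mathbb{E}_{\zeta_{1:K} \sim q(\zeta \mid z)}
       \left[ \log \frac{1}{K} \sum_{k=1}^{K}
         \frac{p(\zeta_k)\, p(z \mid \zeta_k)}{q(\zeta_k \mid z)} \right].
\end{align}
% The first inequality follows from Jensen's inequality and tightens as K grows.
```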