NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Originality: Although VAEs using a stick-breaking construction with Kumaraswamy distributions have been considered before (Nalisnick and Smyth, Stick-Breaking Variational Autoencoders, 2017), the idea of using such a construction and extending it by mixing over the orderings to obtain a density more similar to a Dirichlet is new and interesting. Related work is adequately cited.

Quality: The paper seems technically sound and the claims are largely supported. Although Theorem 1 is a standard result, reiterating it is likely useful for the subsequent exposition. Experimental results show that the method outperforms some baselines; however, I feel that some additional experiments would be useful (see details below in Section 5, Improvements).

Clarity: The paper is written relatively clearly, with some rather minor issues, see below.

Significance: Mixing over the orderings in the stick-breaking construction could be a quite useful idea - for practitioners it could reduce gradient computation to reparameterisation plus some Monte Carlo averaging - although it is not so clear how well it compares to some recent work. It could also be used in applications other than the VAE considered here, such as non-amortized variational inference in Bayesian models with a Dirichlet prior.

Some issues:
- Can you explain why equation 181 holds and how you compute the KL? There is some dependence over the components when doing the stick-breaking, plus some additional mixing over the orderings, so it is not obvious to me how this works (see also the notation sketch at the end of this review).
- How many Monte Carlo samples over the orderings does one need to get approximately symmetric distributions in practice? With the 50-dimensional latent space you are considering, does this not increase the variance of the gradients too much?
- Apart from symmetry, can anything be said about what randomizing over the ordering implies for the moments of the distribution, compared to, say, y \sim Dirichlet(\alpha) having the negative covariance Cov(y_i, y_j) = -\alpha_i \alpha_j / ((\sum_k \alpha_k)^2 (\sum_k \alpha_k + 1))?

Some minor issues:
- It might be clearer to make the conditioning on x explicit in q(\pi) and q(z) in equation 7.
- There seems to be some confusion between the letters \pi and x in lines 173 and 181.
- After lines 385 and 387, I find it clearer if the integration over the variational distribution is done earlier rather than after the third line.

#POST AUTHOR RESPONSE: The authors' response makes the paper stronger, as they include additional experiments comparing the proposed approach with recent alternatives (Figurnov et al., Implicit Reparameterization Gradients, 2018; and Naesseth et al., Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms) that I would have liked to see, so I increase my score from 5 to 6. The proposed method performs well. A comparison with a simple score-function/REINFORCE gradient estimator would also be nice, as it might show that low-variance gradients are necessary for this specific application. However, the calculation of the KL divergence in line 181 is still not clear to me, even after the response. In particular, it is not obvious to me why the dependence of x_{o_i} on x_{o_1}, ..., x_{o_{i-1}} seems not to matter in the KL calculation, or is this meant to be just an approximation?
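For concreteness, and to fix notation for the KL and ordering questions above, here is my reading of the ordering mixture; the number of sampled orderings S, the sampled orderings o_1, ..., o_S, and the per-ordering density q_o are my own notation and not necessarily the paper's exact estimator:

$$
q(\pi \mid x) \;=\; \frac{1}{K!} \sum_{o \in \mathcal{S}_K} q_o(\pi \mid x)
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} q_{o_s}(\pi \mid x),
\qquad o_s \sim \mathrm{Uniform}(\mathcal{S}_K),
$$

where q_o(\pi \mid x) is the Kumaraswamy stick-breaking density under a fixed ordering o of the K components. Sampling from q amounts to drawing one ordering uniformly and then stick-breaking in that order, which is what a reparameterised gradient estimator would differentiate through; the questions above concern how the KL term and the gradient variance behave under this Monte Carlo approximation.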
Reviewer 2
## A New Distribution on the Simplex with Auto-Encoding Applications

## Review after author rebuttal
The authors significantly strengthened the paper by addressing the reviewers' comments. In particular, they extended the experimental section (which I was particularly concerned about) by adding all the baselines that I suggested. IRG is the most competitive baseline and seems to match the proposed method's performance. The authors argue that the implementation of the proposed method is simpler than IRG; on the other hand, IRG is more general (it is not restricted to the Dirichlet distribution). Based on this, I am increasing my score to 6.

## Summary
The authors propose a new distribution over the simplex that is amenable to the reparameterization trick. To do so, they resort to a stick-breaking construction, sampling the sticks from i.i.d. Kumaraswamy distributions. To avoid the influence of the order in which the sticks are sampled, they propose to integrate over all possible orders, and they resort to Monte Carlo to estimate this integral. In the experiments, they apply the proposed distribution to approximate the posterior over the labels of a semi-supervised conditional VAE. As baselines they use the original proposal, which does not use a prior, and a Gaussian-softmax prior. However, the results are incomplete and they fail to compare to other, more recent proposals (Gamma-SB-VAE, Kumar-SB-VAE, the generalized reparameterization trick, ...).

### Details
The main idea of the paper is to use a stick-breaking construction, sampling the sticks from i.i.d. Kumaraswamy distributions (an idea already published a few years ago), and to get rid of the influence of the order in which the sticks are sampled by integrating over all possible orders. This integration is intractable, so they approximate it using plain Monte Carlo estimates (a minimal code sketch of this construction follows below). This kind of symmetrization has been applied in other contexts before, but to the extent of my knowledge it has not been used to symmetrize the Kumaraswamy stick-breaking distribution. However, it seems somewhat straightforward. These results are presented in the first part of the paper; even though it is technically correct, claiming a "new distribution" seems to me to be maybe an oversell. Nevertheless, the main weakness of the paper is where they try to demonstrate the superiority of this distribution when applied as a prior of a model. The authors choose the semi-supervised conditional autoencoder originally proposed in [1]. In the original proposal, Kingma et al. do not use a prior over the conditioning categorical variables. In this paper, the authors propose to use the symmetrized Kumaraswamy stick-breaking distribution and they show that it improves the performance. They also compare to a Gaussian-softmax prior, which is known to be unable to model multi-modal distributions, and again the symmetrized Kumaraswamy stick-breaking slightly outperforms this prior. However, the authors do not compare to the most obvious baseline, the Kumaraswamy stick-breaking already proposed in [2]. This baseline is needed to see the actual contribution of the paper, which is the approximate integration over all possible orders, not the Kumaraswamy stick-breaking construction, which has already been used in several papers in the literature. Also, in [2] they use a Gamma stick-breaking construction based on an approximation of the inverse CDF. Finally, there have been some interesting advances that extend the reparameterization trick to the Beta/Gamma distributions [3, 4], which are also missing from the experimental section.
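To make the construction concrete, here is a minimal PyTorch sketch of the sampler as I read it: Kumaraswamy sticks standing in for the Beta sticks of a Dirichlet stick-breaking, with one uniformly random ordering per draw. The function names (`kumaraswamy_rsample`, `sample_simplex_mixed_order`) and the particular choice of stick parameters are my assumptions, not the authors' implementation.

```python
import torch

def kumaraswamy_rsample(a, b):
    # Reparameterised Kumaraswamy draw via the inverse CDF:
    # u ~ Uniform(0, 1),  v = (1 - (1 - u)^(1/b))^(1/a).
    # Gradients with respect to a and b flow through this transform.
    u = torch.rand_like(a)
    return (1.0 - (1.0 - u).pow(1.0 / b)).pow(1.0 / a)

def sample_simplex_mixed_order(alpha):
    # One draw from the ordering mixture: choose a uniformly random ordering of the
    # K components, stick-break in that order with Kumaraswamy sticks whose parameters
    # mimic the Beta(alpha_{o_i}, sum_{j>i} alpha_{o_j}) sticks of a Dirichlet(alpha)
    # (an assumption on my part), then scatter the pieces back to their positions.
    K = alpha.shape[-1]
    perm = torch.randperm(K)
    alpha_perm = alpha[perm]
    tail_sums = torch.flip(torch.cumsum(torch.flip(alpha_perm, [0]), 0), [0])
    a = alpha_perm[:-1]          # first Kumaraswamy parameter per stick
    b = tail_sums[1:]            # second parameter: sum of the remaining alphas
    v = kumaraswamy_rsample(a, b)

    remainder = torch.cumprod(1.0 - v, dim=-1)        # stick left after each break
    pieces = torch.cat([v[:1], v[1:] * remainder[:-1], remainder[-1:]])
    return pieces[torch.argsort(perm)]                # undo the random ordering

# Example: one reparameterised draw of a 5-component probability vector.
alpha = torch.tensor([2.0, 2.0, 2.0, 2.0, 2.0], requires_grad=True)
pi = sample_simplex_mixed_order(alpha)                # sums to 1, gradients flow to alpha
```

Averaging the objective (or the density) over several such draws, each with its own random ordering, would then be the plain Monte Carlo treatment of the intractable sum over all K! orderings described above.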
Overall, the theoretical contribution is not novel enough and the experimental section is far from complete. It could be a candidate for a workshop paper, but it falls below the NeurIPS novelty and quality bar.

### Minors
* Why does the results table seem incomplete?
* Why use a Beta in the KL term when you could use the symmetrized Kumaraswamy stick-breaking there as well?

### References
[1] Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems 27, pages 3581–3589. Curran Associates, Inc., 2014.
[2] Eric Nalisnick and Padhraic Smyth. Stick-breaking variational autoencoders. In International Conference on Learning Representations (ICLR), 2017.
[3] Francisco R. Ruiz, Michalis Titsias, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems 29, pages 460–468. Curran Associates, Inc., 2016.
[4] Michael Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems 31, 2018.
Reviewer 3
This is a very well written paper and I enjoyed reading it. The motivation is clear: the authors describe problems that arose with the Dirichlet prior in a VAE setting and how they were circumvented in reference 12. They argue that this circumvention was necessary because of the otherwise intractable inference procedure. Being able to utilize the reparameterization trick and thus cheaply arrive at gradient updates enables them to recreate the described model in a more principled and rigorous way. They describe a stick-breaking construction for the new type of simplex distribution and how this initially leads to a strong dependence on the component ordering. Their solution to this is both straightforward and compelling. They conclude with thorough experiments, demonstrating the usefulness of their approach and showing the benefit of having access to closed-form gradient updates in terms of smaller error rates. In all, I have very little to criticize; this is well executed and presented research and may have a large impact on VAE applications, which are henceforth not limited to Dirichlet priors when modelling multivariate random variables.