Review for NeurIPS paper: Constraining Variational Inference with Geometric Jensen-Shannon Divergence

NeurIPS 2020

Constraining Variational Inference with Geometric Jensen-Shannon Divergence

Meta Review

The paper considers the use of skew geometric Jensen-Shannon divergences as a replacement for the standard rate term (i.e. the divergence KL(q(z|x) || p(z))) in variational autoencoders. The paper makes three contributions. First, it explores the properties of the skew geometric Jensen-Shannon divergence (JS^G_\alpha) relative to the classic Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences. Second, the authors modify JS^G_\alpha by reversing the skew parameter in the geometric mean distribution, which yields a divergence with more intuitive properties in the limit cases. Finally, the authors use JS^G_\alpha in place of the KL divergence in the Variational Autoencoder (VAE) setting and perform experiments on several benchmark datasets to show the relevance of their method.

This submission was, on balance, positively received by the reviewers. Reviewers found the paper clearly written and appreciated the detail with which the properties of the various divergences are discussed. One point of criticism raised by reviewers is that the paper is primarily an application of divergences that were proposed by Nielsen (2019), although the authors do propose extensions to this divergence. There was some disagreement among reviewers about the strength of the experimental evaluation. One reviewer found the evaluation strong, whereas another noted that the experiments only demonstrate improvements over baselines for particular choices of hyperparameters, which makes it difficult to assess the robustness of the reported results.

The AC noted three points where the discussion of the experiments was not clear (and reviewers were not able to clarify these points):

1. A β-VAE objective with β=4 will typically perform poorly in terms of reconstruction loss. When reconstruction loss is the performance metric, this can be improved by reducing β. Why did the authors not consider a value β<1?

2. Why do the authors consider KL(p(z) || q(z|x)) as a baseline? One would expect this to lead to a large covariance in the encoder, which once again would lead to a poor reconstruction loss.

3. Most importantly, the proposed loss in equation (22) contains a hyperparameter α and a hyperparameter λ, where λ is analogous to the constant β in the β-VAE objective. While the authors report results for different α values, it was unclear to the AC (and, after discussion, also to the reviewers) how the parameter λ is chosen. Do the authors simply use λ=1 in all experiments?

Based on the above points, the AC considers it appropriate to run additional experiments that report results for a range of β and λ values. These experiments are not computationally expensive, so they should be easy to complete by the camera-ready deadline. The AC is inclined to say that the contribution of this paper does not hinge on the outcome of these additional experiments. There is value in evaluating the use of these divergences in VAE objectives, regardless of their exact performance relative to other objectives. On this basis, the AC judges the paper to be just above the bar for acceptance, provided the authors address the issues with clarity and experimentation discussed in the reviews and above.
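
To make the limit-case behaviour of the skew parameter concrete, the following is a minimal sketch for univariate Gaussians. It is not the authors' code. It assumes the convention JS^G_\alpha(p:q) = (1-α) KL(p || G) + α KL(q || G), where G is the normalised weighted geometric mean p^(1-w) q^w, with w = α for the original form and w = 1-α for the reversed skew. The function names and the `reverse_skew` flag are hypothetical.

```python
import math


def kl_gauss(mu_a, var_a, mu_b, var_b):
    """Closed-form KL(N(mu_a, var_a) || N(mu_b, var_b)), univariate case."""
    return 0.5 * (math.log(var_b / var_a)
                  + (var_a + (mu_a - mu_b) ** 2) / var_b
                  - 1.0)


def geometric_mean_gauss(mu_p, var_p, mu_q, var_q, w):
    """Normalised p^(1-w) * q^w for two Gaussians, which is again Gaussian.

    The precisions combine linearly, and the mean is the
    precision-weighted average of the two means.
    """
    precision = (1.0 - w) / var_p + w / var_q
    var = 1.0 / precision
    mu = var * ((1.0 - w) * mu_p / var_p + w * mu_q / var_q)
    return mu, var


def skew_geometric_js(mu_p, var_p, mu_q, var_q, alpha, reverse_skew=True):
    """Skew geometric JS divergence between two univariate Gaussians.

    Assumed convention (see lead-in): the outer weights are always
    (1 - alpha, alpha); reverse_skew only changes the weight used
    inside the geometric mean.
    """
    w = 1.0 - alpha if reverse_skew else alpha
    mu_g, var_g = geometric_mean_gauss(mu_p, var_p, mu_q, var_q, w)
    return ((1.0 - alpha) * kl_gauss(mu_p, var_p, mu_g, var_g)
            + alpha * kl_gauss(mu_q, var_q, mu_g, var_g))


if __name__ == "__main__":
    # Illustrative values only: p plays the role of q(z|x), q of the prior p(z).
    mu_p, var_p, mu_q, var_q = 1.0, 0.5, 0.0, 1.0
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        reversed_form = skew_geometric_js(mu_p, var_p, mu_q, var_q, alpha)
        original_form = skew_geometric_js(mu_p, var_p, mu_q, var_q, alpha,
                                          reverse_skew=False)
        print(alpha, reversed_form, original_form)
```

Under these assumptions, the reversed form recovers KL(p || q) at α=0 and KL(q || p) at α=1, whereas the original form vanishes at both endpoints. A λ-weighted objective of the kind referred to in point 3 would multiply this term by λ before adding it to the reconstruction loss, so a sweep over λ, as requested above, only requires changing that scalar.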