Review for NeurIPS paper: Recursive Inference for Variational Autoencoders

NeurIPS 2020

Review 1

Summary and Contributions: This paper proposes an amortized variational inference technique which uses a mixture distribution. Each component of the mixture is output by a separate neural network encoder, trained to optimize the improvement in the ELBO. The authors compare their technique with semi-amortized VI, normalizing flows, and a non-recursive mixture encoder across a variety of image datasets, where their technique performs well. In terms of inference time, their technique is also faster than semi-amortized VI. Update: After reading the authors' response, the other reviews, and the reviewer discussion, I am increasing my score toward acceptance. My primary concerns were regarding the overstated claims on semi-amortized VI and the evaluation. The authors reasonably addressed these issues in their response, and I am hopeful that they will include these changes (and others) in the final submission. This paper presents a reasonably novel contribution that is also demonstrated reasonably well. As such, I feel that this paper would be a useful addition to the conference.

Strengths: Soundness: The theoretical grounding and empirical evaluation are largely sound. The authors derive their technique in Section 3 and show how this naturally results in an objective for the current component that trades off between the ELBO and a KL from the current approximate posterior distribution. The authors compare their approach against a comprehensive set of baselines, including a standard VAE, semi-amortized VI (which also involves additional computation), normalizing flows (which use a more expressive distribution), and a non-recursive mixture distribution (which has the same form of distribution). This comparison is performed across a range of image datasets and multiple model sizes using multiple runs. The authors also compare their approach with boosted VI, showing that the KL, rather than entropy, is useful for mixture estimation. Significance: The proposed approach outperforms more expressive VI techniques, particularly IAF. Normalizing flows, such as IAF, are seen as a general-purpose technique for improving the flexibility of approximate posterior distributions. If this technique can be further scaled up, it could provide an alternative to normalizing flows, instead based on mixture distributions. This could be significant in improving VI in more complex settings and models. Further, in the authors’ formulation, estimating the components can be parallelized at test time, enabling more expressive distributions with constant time cost. Novelty: While mixture distributions are a common feature of probabilistic models, it seems that the exact formulation of splitting the ELBO as a sequence of mixture components is a novel aspect of this work. It’s perhaps possible that this estimation technique could also be extended to aspects of the generative model or other probabilistic modeling settings. Relevance: This paper focuses on improving variational inference in deep latent variable models, i.e. VAEs. Because these topics relate to deep learning and probabilistic modeling, this paper is relevant to the generative modeling community and the NeurIPS community more broadly.
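
To make the mixture form of the approximate posterior concrete, here is a minimal sketch of evaluating log q(z|x) = log sum_m alpha_m N(z; mu_m, sigma_m^2) for a scalar latent. In the setting described above, each (mu_m, sigma_m) pair would be output by a separate encoder, which is why the per-component terms can be computed independently. The function names and the one-dimensional simplification are assumptions made for illustration, not the authors' code.

```python
import math


def gaussian_log_pdf(z, mean, std):
    """Log-density of a univariate Gaussian N(z; mean, std^2)."""
    return (-0.5 * math.log(2.0 * math.pi)
            - math.log(std)
            - 0.5 * ((z - mean) / std) ** 2)


def mixture_log_pdf(z, weights, means, stds):
    """Log-density of a Gaussian mixture, computed with log-sum-exp.

    weights are the mixing proportions (positive, summing to one);
    means and stds hold one entry per component (hypothetical encoder outputs).
    """
    # Each term depends on one component only, so they can be computed in parallel.
    terms = [math.log(w) + gaussian_log_pdf(z, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Usage: a two-component mixture evaluated at z = 0.3.
log_q = mixture_log_pdf(0.3, weights=[0.7, 0.3], means=[0.0, 2.0], stds=[1.0, 0.5])
```

The log-sum-exp form matters in practice because the ELBO requires log q(z|x) under the full mixture, and individual component densities can underflow when a sample lies far from a component.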

Weaknesses: Soundness: I did not completely follow the derivation/explanation in Section 3, and I imagine that the definition of Q (with or without the current mixture) could trip up readers. To better understand the method, it would be helpful to re-write and/or expand the method, perhaps in the appendix. Some of the claims around semi-amortized VI are exaggerated, and it may not be entirely fair to frame this work in comparison with semi-amortized VI. In the introduction and background, the authors lump a number of different works under the name “semi-amortized variational inference,” stating that it is difficult to tune the step-size and that it requires estimating Hessians. One of these works, iterative amortized inference (Marino et al., 2018), does not require tuning a step-size or estimating a Hessian, whereas another previous work (Krishnan et al., 2018) only requires tuning the step size. Further, these works still use Gaussian approximate posteriors, focusing on decreasing the amortization gap (Cremer et al., 2018). This paper, in contrast, improves the form of the distribution, which can be seen as decreasing the approximation gap. The authors show, surprisingly, that their method outperforms normalizing flows across a range of datasets. The authors claim that these models tend to overfit. However, from what I can tell, the authors do not provide the training performance of these models. This would be helpful to allow readers to assess these claims. Further, this claim may be overly general; the authors explore a particular family of single latent level convolutional model architectures. It’s unclear whether their technique would equally apply to more complex hierarchical models. The authors utilize multiple inference networks in their approach. While they compare with a non-recursive version of this model, it should be emphasized that their approach contains more parameters than many of the baselines. 
Significance: Ultimately, the experiments demonstrate slight improvements in performance for a particular family of convolutional architectures on image datasets. To further evaluate and demonstrate the significance of these results, it would be helpful to expand the experiments to include other models and data modalities. One simple additional example would be to use binarized MNIST with a Bernoulli conditional likelihood. This is a fairly standard benchmark in the deep generative modeling community. Novelty: The proposed technique is similar in spirit to boosted VI, and ultimately comes down to a new way to train mixture distributions. While the authors demonstrate the benefits of their particular approach, the difference appears to lie in this relative training scheme and the clamped KL term. In terms of novelty, these are not large contributions. Relevance: No weaknesses.

Correctness: In the experiments, log-densities are reported (in nats). Typically, on image datasets log-probabilities are reported (in bits/dimension). Correcting this would help to improve the ability of readers and other researchers to compare these results with past/future papers. As noted above, some of the claims around semi-amortized VI are overstated or lumped across multiple different works. It should be noted that the authors’ method also requires multiple gradient passes during training. An analysis of the training time of each method would be helpful to assess the additional cost of this approach. Conventionally, many previous works use 5,000 importance-weighted samples to evaluate the marginal log-likelihood. The paper reports the value obtained with 100 samples. I don’t imagine that the results will change significantly, but it would be useful to eventually re-perform these evaluations with additional samples. Also as noted above, the test results for models with normalizing flows are, in many cases, below those for simpler models. The authors claim that this is due to overfitting, but do not provide the training results.
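
The two evaluation points above (reporting in bits/dimension and estimating the marginal log-likelihood with importance-weighted samples) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the conversion assumes the log-density is defined on the same pixel scale as the results it is being compared with (for continuous densities the value depends on how the data were scaled or dequantized).

```python
import math


def nats_to_bits_per_dim(log_density_nats, num_dims):
    """Convert a log-density in nats to bits per dimension (lower is better)."""
    return -log_density_nats / (num_dims * math.log(2.0))


def iw_log_marginal(log_weights):
    """Importance-weighted estimate of log p(x) from K samples.

    log_weights[k] = log p(x, z_k) - log q(z_k | x), with z_k drawn from q(z | x).
    Returns log( (1/K) * sum_k exp(log_weights[k]) ), computed stably.
    """
    k = len(log_weights)
    top = max(log_weights)  # subtract the max before exponentiating
    return top + math.log(sum(math.exp(lw - top) for lw in log_weights)) - math.log(k)


# Usage with hypothetical log-weights for a single test image of 784 pixels.
estimate = iw_log_marginal([-1205.3, -1198.7, -1201.1])
bits_per_dim = nats_to_bits_per_dim(estimate, num_dims=784)
```

The estimator is a lower bound on log p(x) in expectation and tightens as K grows, which is why the number of samples (100 versus 5,000) affects comparability across papers.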

Clarity: Yes, the paper is well written. Section 3 could be slightly improved, particularly to distinguish between the mixture distribution before and after the addition of the current mixture component. The setup for Figure 1 does not appear to be fully explained in the paper. Some details regarding the experiment setup appear to be omitted from the paper and appendix. For instance, from the code, it’s clear that the authors use Gaussian conditional likelihoods, but I could not find this detail in the paper.

Relation to Prior Work: For the most part, the relation to previous work is discussed well. For readers, such as myself, who are unfamiliar with boosted VI, it would be helpful to explain in math how the objective derived here relates to the entropy objective employed in the boosted VI setting. The authors hint at this but it could be further expanded formally. Also, it would be useful to explain further why amortization cannot/has not been applied in this setting thus far. Again, as noted above, the authors lump multiple works together under the name “semi-amortized VI.” The authors should distinguish between the differences in these works, as some of the claims are incorrect/overstated. The authors may also want to cite “Iterative refinement of the approximate posterior for directed belief networks” by Hjelm et al., 2016, as well as “Recurrent inference machines for solving inference problems” by Putzky et al., 2017.

Reproducibility: Yes

Additional Feedback: It would be helpful to better distinguish between the amortization gap vs. the approximation gap (Cremer et al., 2018). Much of this paper is framed in terms of comparing with semi-amortized VI, however, the proposed approach uses a more expressive distribution, so this framing is somewhat flawed. Line 51: I would consider changing the word “inappropriate” here, or at least explaining this point in further detail. Line 119: is epsilon in [0,1]? Figure 1: This figure could be improved, e.g. by adding labels to the axes, labeling the mixture components with a legend, etc. It’s also unclear why the cyan component in the conventional mixture in instance 2 does not find a mode of the distribution. Section 3.1: This form of recursive estimation is not explicitly conditioned on the previous components of the distribution, but must instead learn these implicitly through amortization. While this allows parallelized inference at test time, it may be computationally limited. The authors may consider explicitly conditioning on the previous mixture distribution parameters, or, for example, using a recurrent network as the output of the encoder to sequentially generate each mixture component for all dimensions. Section 5: The authors should state clearly that their method contains more parameters in the inference network than most of the baseline methods.
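
For readers less familiar with the distinction drawn by Cremer et al. (2018), the total inference gap splits into an approximation gap (due to the variational family) and an amortization gap (due to using an encoder instead of per-datapoint optimization). The sketch below simply encodes that decomposition; the function name and the numbers in the usage line are hypothetical and chosen only for illustration.

```python
def inference_gaps(log_px, elbo_optimal_q, elbo_amortized_q):
    """Split the inference gap into its two parts.

    log_px:           (estimate of) the true marginal log-likelihood log p(x)
    elbo_optimal_q:   ELBO under the best member of the variational family
    elbo_amortized_q: ELBO under the distribution output by the encoder
    """
    # Gap caused by the limited form of the family (e.g. a single Gaussian).
    approximation_gap = log_px - elbo_optimal_q
    # Gap caused by amortization, i.e. the encoder not reaching the optimum.
    amortization_gap = elbo_optimal_q - elbo_amortized_q
    return approximation_gap, amortization_gap


# Usage with made-up values in nats.
approx_gap, amort_gap = inference_gaps(-90.0, -92.5, -94.0)
```

Under this decomposition, semi-amortized methods mainly target the second term, whereas a more expressive family such as a mixture targets the first, which is the reason the framing of the comparison matters.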

Review 2

Summary and Contributions: The paper proposes a new way of constructing encoders in the VAE framework. The idea is to build a mixture of encoders in a recursive manner. The paper is rather clearly written, and all concepts are explained.

Strengths: + The proposed idea is interesting. + The proposed approach is easy to implement and seems to work in practice. + The paper is well-written.

Weaknesses: - All results reported in the paper suggest that all datasets were modeled with Gaussian decoders. This is not the best possible choice for images, which are discrete in nature. I wonder whether the same conclusions could be drawn if discretized logistic distributions were used. - The authors claim that posterior collapse is not an issue. I understand that and buy this argument. However, it would be beneficial to provide empirical evidence for that.
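
As a concrete reference for the discretized logistic alternative mentioned above, here is a minimal sketch of the per-pixel likelihood for integer intensities 0..255. It assumes unit-width bins on the raw integer scale, with the two edge bins absorbing the tails; the function names and this particular parameterization are assumptions for illustration, not something taken from the paper.

```python
import math


def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def discretized_logistic_prob(k, mu, scale, num_levels=256):
    """Probability of integer pixel value k under a binned logistic distribution.

    The mass of bin k is the logistic CDF evaluated at k + 0.5 minus the CDF
    at k - 0.5; the first and last bins extend to -inf and +inf respectively.
    """
    upper = 1.0 if k == num_levels - 1 else _sigmoid((k + 0.5 - mu) / scale)
    lower = 0.0 if k == 0 else _sigmoid((k - 0.5 - mu) / scale)
    return upper - lower


def discretized_logistic_log_prob(k, mu, scale, num_levels=256):
    """Log-probability with a small floor to avoid log(0) far from the mean."""
    return math.log(max(discretized_logistic_prob(k, mu, scale, num_levels), 1e-12))


# Usage: log-likelihood of pixel value 130 under a hypothetical decoder output.
log_p = discretized_logistic_log_prob(130, mu=128.0, scale=4.0)
```

Unlike a Gaussian density, this assigns a proper probability mass to each discrete intensity, so the resulting log-likelihoods are directly comparable with other discrete-likelihood results.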

Correctness: The derivations seem to be correct. The reasoning is logical and sound. The only issue I have right now is the choice of the distribution. It seems the authors used Gaussian decoders, which are not necessarily the right choice for image data.

Clarity: The paper is well-written. I do not have any concerns about the clarity of the paper.

Relation to Prior Work: The prior work is properly discussed.

Reproducibility: Yes

Additional Feedback: I wonder how stable the proposed procedure is. The authors use an additional quantity C, i.e., the barrier point. I imagine that if C is too large, the method could return strange results. I am curious what the authors' experience is with this matter. ===AFTER THE REBUTTAL=== I would like to thank the authors for their rebuttal. After a discussion with the other reviewers I decided to raise my score to 7. My main concern about the paper is the presentation of the idea. Please try to work on that. But otherwise, good job!

Review 3

Summary and Contributions: After author rebuttal: Thank you very much for the detailed rebuttal. I in particular appreciated the experiments on binary versions of datasets and encourage the authors to include them in the final version of the paper. I also encourage the authors to reconsider/revamp the presentation of the proposed approach (and discussion of related work) given R1's excellent comments. Finally, as noted by R5, it would be good to investigate how decoupling the inference network and the generative model changes the training/optimization dynamics. ========= This paper proposes a method to improve variational approximations to the posterior in the context of learning deep generative models (i.e., variational autoencoders). The approach utilizes richer variational distributions than the standard Gaussian by successfully learning mixture distributions via iterative refinement. This recursive inference approach is found to outperform a variety of existing methods that also increase the flexibility of the posterior.

Strengths: - Well-motivated method that proposes an amortized alternative to methods such as boosted variational inference. Solid contribution to the existing line of work on improving variational approximations within the VAE framework. - Comprehensive comparison against a wide range of existing approaches (including semi-amortized approaches, which still assume a Gaussian variational posterior, and flow-based methods, which increase the flexibility of the variational family)

Weaknesses: - The model is only compared against internally implemented baselines. In particular, the difference in evaluation setup makes it hard to compare against external baselines. For example, why not apply the approach on the standard binarized setup for MNIST/Omniglot? (It is of course not necessary to achieve "SOTA" results, but it seems less than ideal to use a completely different setup from the majority of existing approaches.) - It would have been interesting to see the effectiveness of this approach as a function of the encoder/decoder capacity. For example, what happens with an autoregressive decoder like the PixelCNN? Can you get away with less powerful encoders in future refinement steps? - The model is only tested on image domains. It would have been interesting to apply this on sequential data such as text (not a huge drawback though).

Correctness: Yes

Clarity: The paper is very clear, well-motivated, and easy to understand

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: - Small point: "For training, it requires backpropagation for the objective that involves gradients, implying the need for Hessian evaluation" ==> it would be more accurate to say that we need the Hessian-vector product (vs. the full Hessian). - I would encourage the authors to discuss the potential impacts of deep generative models (such as VAEs) in their broader impacts statement.
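
To illustrate the Hessian-vector product point, the sketch below computes H v from two gradient evaluations, without ever forming the full Hessian. It uses a central finite difference purely to stay dependency-free (automatic differentiation frameworks obtain the same quantity exactly); the toy objective and all names here are hypothetical.

```python
def hvp_finite_difference(grad_fn, x, v, eps=1e-5):
    """Approximate the Hessian-vector product H(x) v.

    Uses H v ~= (grad(x + eps*v) - grad(x - eps*v)) / (2*eps), which costs
    two gradient evaluations regardless of the dimension of x.
    """
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus = grad_fn(x_plus)
    g_minus = grad_fn(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


def example_grad(x):
    """Gradient of the toy objective f(x0, x1) = x0**2 * x1 + x1**3."""
    x0, x1 = x
    return [2.0 * x0 * x1, x0 ** 2 + 3.0 * x1 ** 2]


# Usage: the Hessian at (1, 2) is [[4, 2], [2, 12]], so H v for v = (1, -1)
# is approximately [2, -10].
hv = hvp_finite_difference(example_grad, x=[1.0, 2.0], v=[1.0, -1.0])
```

The cost is linear in the number of parameters, in contrast to the quadratic cost of materializing the Hessian, which is the substance of the correction suggested above.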

Review 4

Summary and Contributions: A method is proposed for variational autoencoders for better posterior approximation. The method (RME) is like boosting in that mixture components are added iteratively to the approximate posterior, but previously added components are trained together. The mixture coefficients depend on the data point and are trained concurrently with the parameters of the mixture components. The main contribution is bringing Boosted VI to amortized inference. UPDATE: I've read the authors' response and I stand by my assessment and confidence score.

Strengths: Suboptimal inference in VAEs is an important problem. The paper proposes a novel method in the context of amortized variational inference and demonstrates good experimental results. The figures with the 2D latent space illustrate the strength of the algorithm.

Weaknesses: My main concern is that although the 2D examples look intuitively appealing, they are produced with a given p() (i.e. a fixed decoder). VAEs train q() and p() jointly and so does RME. The interaction between an already hard mixture estimation problem (the encoder mixture) and learning the decoder p() is not explored. - Are multi-modal posteriors needed in VAEs? - Is p(x|z) implicitly regularized by RME to have higher variance? This implicit regularization concern would be alleviated by a more careful experimental setup, which included multiple regularization methods on p() and more effort to tune hyperparameters such as the learning rate and network sizes. Also related, RME(1) is not equivalent to the VAE due to how it updates the encoder and decoder parameters in a coordinate descent manner. Did you do experiments with it? Finally, since the method can be viewed as a way to avoid component collapse in mixture encoders, the VampPrior is a very obvious missing baseline.

Correctness: As alluded to in weaknesses, the experiments could be improved.

Clarity: Yes.

Relation to Prior Work: Mixture estimation and component collapse have a large literature, which should be mentioned in the related work. Lines 152-158 are the closest to even mentioning this.

Reproducibility: Yes

Additional Feedback: