Review for NeurIPS paper: Exemplar VAE: Linking Generative Models, Nearest Neighbor Retrieval, and Data Augmentation

NeurIPS 2020

Exemplar VAE: Linking Generative Models, Nearest Neighbor Retrieval, and Data Augmentation

Review 1

Summary and Contributions: The paper proposes a new class of VAEs that introduces a nonparametric component by including training data. The idea boils down to a specific choice of the marginal over latents: a mixture distribution whose number of components equals the number of training data points. To counter overfitting, the authors propose two regularizers. Moreover, to allow efficient training, the authors use a kNN procedure to choose a subset of the training data. The experiments clearly show an advantage of the proposed approach over the standard marginal over z's and the VampPrior.

Strengths: + The idea of adding a nonparametric flavor to VAEs is interesting. + The paper is clearly written. + The presented concepts are explained in a lucid manner.

Weaknesses: - The presented idea is closely related to: Graves, A., Menick, J., & Oord, A. V. D. (2018). Associative compression networks for representation learning. arXiv preprint arXiv:1804.02476. It would be beneficial to comment on the similarities and differences between that paper and the presented approach. - Since the approach uses training data, it would be insightful to provide training/inference wall-clock time for a vanilla VAE (w/ Gaussian prior and w/ VampPrior) and the proposed VAE, and also to provide a simple computational complexity analysis. This would make the tradeoff between better bpd/image quality and higher complexity much easier to assess.

Correctness: The presented method seems to be correct; I cannot find any flaw or unclear choice. All steps are well motivated.

Clarity: Yes, the paper is clearly written. All concepts are easy to follow.

Relation to Prior Work: The prior work is clearly presented and the proper papers are cited. However, one paper is definitely missing: Graves, A., Menick, J., & Oord, A. V. D. (2018). Associative compression networks for representation learning. arXiv preprint arXiv:1804.02476. I would expect a proper discussion of this paper.

Reproducibility: Yes

Additional Feedback: o The idea of leave-one-out during training seems to be equivalent to the pseudolikelihood approach. o Please correct "Algorithm ??" in line 140. ===AFTER THE REBUTTAL=== I would like to thank the authors for their rebuttal. After the discussion with the other reviewers, I decided to keep my score. The paper is well written and the idea is interesting; however, as indicated in the reviews, the novelty is somewhat limited. Since the ideas are rather incremental, more convincing experiments are necessary. o The authors present samples from CelebA, but no bpd is reported. I would suggest (at least in the appendix) reporting the bpd for the proposed VAE, the VAE w/ Gaussian prior, and the VAE w/ VampPrior. I do not expect SOTA scores, but I would appreciate these scores on RGB images, not only on black-and-white or gray-scale images.

Review 2

Summary and Contributions: The paper introduces Exemplar Variational Autoencoders, an extension of the VAE that uses a non-parametric approach for training and sampling. The model re-uses the encoder network to embed exemplar data points and uses a Parzen window estimator with a Gaussian kernel as the latent prior. To reduce the computational cost of evaluating the exemplar prior over a large number of exemplars, the authors use an approximate k-nearest-neighbour search for the most influential points in latent space, justifying this as a lower bound on the prior. To avoid overfitting (a simple reconstruction of exemplars), the authors propose two regularization techniques, namely leave-one-out training and exemplar subsampling, both resulting in lower bounds on the original unregularized KL divergence between the variational posterior and the exemplar prior (essentially a Gaussian mixture on exemplar points). Density estimation experiments show marginal improvement, although other applications (e.g. data augmentation for classification) seem promising.
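To make the summarized construction concrete, here is a minimal sketch of a Parzen window prior with a shared isotropic Gaussian kernel, with optional leave-one-out exclusion and a kNN truncation. It is an illustration under assumptions, not the authors' implementation: the function names, the brute-force (exact rather than approximate) neighbour search, and the choice to keep the full normalizer in the truncated sum are all hypothetical. Under that last choice the truncated value can only be smaller than the full one, because the dropped mixture terms are non-negative.

```python
import math

def log_gaussian_iso(z, mu, sigma2):
    """Log-density of an isotropic Gaussian N(z; mu, sigma2 * I)."""
    d = len(z)
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
    return -0.5 * (d * math.log(2.0 * math.pi * sigma2) + sq_dist / sigma2)

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_exemplar_prior(z, means, sigma2, exclude=None):
    """Log of a uniform Gaussian mixture centred on encoded exemplars.

    `means` stands in for encoder outputs (an assumption of this sketch);
    `exclude` drops one exemplar index, mimicking leave-one-out training.
    """
    idx = [n for n in range(len(means)) if n != exclude]
    terms = [log_gaussian_iso(z, means[n], sigma2) for n in idx]
    return logsumexp(terms) - math.log(len(idx))

def log_exemplar_prior_knn(z, means, sigma2, k, exclude=None):
    """Truncate the mixture to the k nearest exemplars of z.

    The normalizer still counts all retained exemplars, so the result
    is a lower bound on log_exemplar_prior (assumed design choice).
    """
    idx = [n for n in range(len(means)) if n != exclude]
    # Brute-force exact search; a real system would use an approximate index.
    idx.sort(key=lambda n: sum((a - b) ** 2 for a, b in zip(z, means[n])))
    terms = [log_gaussian_iso(z, means[n], sigma2) for n in idx[:k]]
    return logsumexp(terms) - math.log(len(idx))
```

With k equal to the number of retained exemplars the two functions agree; smaller k trades accuracy of the bound for fewer kernel evaluations.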

Strengths: The paper has an extensive experimental section showing benefits in a number of applications. For example, data augmentation with Exemplar VAE is shown to reduce error rates on the classification task, which is not as prominent in the case of VAEs with Gaussian prior and VampPrior. Apart from that, I believe that bridging parametric and non-parametric approaches is a relevant direction to explore.

Weaknesses: I might be wrong, but I tend to see the introduced exemplar-based prior as a way of constraining the model towards the training data and reducing generalization. The authors did a great job introducing regularization techniques, but might these techniques also boost the original VAE and the VAE with VampPrior? It is also not discussed how one should choose the hyperparameters k and M in the kNN search and exemplar subsampling.

Correctness: There are a few things to address in the experiments. 1. The reported scores for VampPrior (Table 3) differ from those in the original paper (Tables 1,3,4), although the experimental setting seems to be the same (40d latent space, architectures, etc.). For example, the authors of VampPrior report NLL = -101.18 for HVAE on OMNIGLOT while the authors of this paper report -103.30. Judged against the numbers in the original VampPrior paper, the improvement of Exemplar VAE in density estimation disappears. 2. The authors report an FID score = 39 on 64x64 CelebA, saying that this result is possible with a post-processing step that reduces the variance of the latent prior. However, it is not compared to a VAE with VampPrior with or without similar post-processing. Also, wouldn't this post-processing step limit the diversity of samples? How can it be justified? 3. When testing different ratios of M/N in exemplar subsampling, changing M does not seem to have an apparent effect on the model. 4. I believe that Exemplar VAE data augmentation is complementary to the methods reported in Tables 5,6, so it should not be a problem to add the two outperforming methods that are mentioned in the text. Also, what is the performance of Exemplar VAE without Label Smoothing?

Clarity: The paper has many typos; I recommend proofreading it for these minor errors. It would also be better to refer to specific sections of the appendix rather than generally referring to the supplementary materials. Including pseudo-code of the method would also help readers understand how Exemplar VAE works.

Relation to Prior Work: Yes, I believe it is well discussed.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: The paper proposes a variant of the VAE with a Parzen window prior in the latent space. Data augmentation can be done with this VAE, using latent-space nearest-neighbor relationships, to boost image classification performance. Using a non-parametric Parzen window prior is also found to perform better than the traditional factorized Gaussian. Practical tricks for reducing the computational requirements of Exemplar VAEs are also presented.

Strengths: 1. The intuition is to build the prior distribution from statistics of the training samples. The RAT training algorithm is in line with this intuition. 2. The authors use random sampling and kNN search to reduce the computational cost of estimating the prior distribution. 3. The proposed method works across a range of problems: density estimation, representation learning, and data augmentation. 4. The experiments are set up decently.

Weaknesses: 1. In lines 117-118, the authors claim that "an exemplar prior with a shared isotropic covariance does not greatly impact the expressive power of the model." However, no experimental results are provided to support this. 2. In the ablation study (Section 5.1), the authors only show the impact of the sampling size M; what about the effect of k in kNN? 3. There is no clear description of the generation process. (How does one sample from the Parzen window prior?) 4. There should be more comparisons with other VAE variants. 5. In the RAT algorithm, should it be "kNN = Cache.kNN(z) \cap π"? What if Cache.kNN(z) and π have no common elements? 6. In the data augmentation experiments, why augment xi only with the expectation of r(z | xi)? What if we draw the latent code from r(z | xi)? 7. Exemplar subsampling and kNN are proposed to deal with large training datasets, yet there are no experimental results on large datasets like ImageNet to validate this algorithm.
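For reference on point 3, the textbook way to sample from a uniform Gaussian mixture is ancestral sampling: pick a component uniformly, then add Gaussian noise to its centre. The sketch below shows only that generic procedure; whether it matches the paper's generation process is exactly what the point asks the authors to clarify. The name `sample_parzen_prior`, the use of precomputed latent means in place of an encoder, and the omission of the decoder step are assumptions of this illustration.

```python
import math
import random

def sample_parzen_prior(means, sigma2, rng):
    """Draw one latent code from a uniform isotropic Gaussian mixture.

    `means` is a list of latent centres, standing in for encoded
    training exemplars (hypothetical; no encoder is modelled here).
    """
    # Step 1: choose an exemplar uniformly at random.
    mu = rng.choice(means)
    # Step 2: perturb its latent centre with isotropic Gaussian noise.
    std = math.sqrt(sigma2)
    return [m + std * rng.gauss(0.0, 1.0) for m in mu]

# Usage: a seeded generator keeps the draw reproducible.
rng = random.Random(0)
z = sample_parzen_prior([[0.0, 0.0], [3.0, 3.0]], sigma2=0.1, rng=rng)
```

In a full model the sampled code would then be passed through the decoder to produce an image.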

Correctness: The experiments are well tested on the given datasets. However, most empirical results are only demonstrated on small datasets, though the paper claims that it can scale to large datasets.

Clarity: The paper is well presented and organized around the exemplar VAE. However, a lot of details are in supplemental materials, for example the RAT algorithms.

Relation to Prior Work: The proposed method is a variant of the VAE, modified in aspects such as the prior and the sampling method, in the spirit of combining exemplar-based approaches with distribution-regularized approaches.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper proposes to replace the Gaussian prior in a standard VAE with a Parzen window prior, i.e., a uniform mixture with one component centered on each training data point (exemplar). Heuristics based on a leave-one-out objective and exemplar sampling (a random subset of the training set) are used to combat overfitting.

Strengths: Unlike VampVAE with learned exemplars, the exemplars in the proposed method are sampled from the training data, and the prior calculation for each z can be accelerated by approximate nearest neighbor search. Experiments demonstrate that the proposed method, Exemplar VAE, outperforms VAEs with a Gaussian prior and slightly outperforms VampVAE. Experiments on MNIST and Fashion-MNIST show that Exemplar VAE is effective in generating augmented data to learn good representations for downstream tasks.

Weaknesses: My main concern is about the novelty and significance of this work. The proposed method Exemplar VAE is a trivial variant of VampVAE: In VampVAE, exemplars are learned; in Exemplar VAE, the exemplars are randomly sampled from the training set. Although this work is highly similar to VampVAE, the presentation of this paper is not as clear as VampVAE. As already shown in VampVAE, it is not surprising that VampVAE (Exemplar VAE) beats VAE with a standard Gaussian prior. A better baseline is VAE with a Gaussian mixture prior that has already been thoroughly studied for clustering and data visualization in the literature. Compared to the further hierarchical prior development in the VampVAE paper conducted several years ago, the simple modification presented in this work is not significant enough.

Correctness: The method is technically correct.

Clarity: The paper is well written in general.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: I read all the reviews and the rebuttal. I agree with the authors that the proposed method is different from learned pseudo-exemplars in the embedding space as in VampVAE, and that this work uses real exemplars in the image space. However, I am not convinced that randomly sampling exemplars in the data space, with some heuristics based on LOO and trivial exemplar subsampling as regularizers, evaluated on toy datasets, is a significant contribution extending the exemplar-based prior in VampVAE. A possible limitation of the proposed Exemplar VAE is that the generative model might not learn much beyond reconstruction; instead, it may only produce random samples that stay within an epsilon-ball around training data points. It is possible that Exemplar VAE performs no better than a deterministic autoencoder with tiny Gaussian noise added to the latent codes and k-means regularization in the latent space. VampVAE doesn't have this issue. Moreover, demonstrating the representation learning capabilities and data augmentation benefits (an orthogonal contribution) of Exemplar VAE on (F-)MNIST is not interesting. If this work wants to demonstrate the significance of Exemplar VAE, large-scale experiments should be conducted on more challenging datasets, such as ImageNet, that do require approximate NN search. In sum, considering previous work on exemplar learning as a prior, the marginal improvement over VampVAE, and the small-scale, non-challenging experiments that do not require approximate nearest-neighbor search, I am not convinced that the work in its current form is a significant contribution to NeurIPS.