Review for NeurIPS paper: Quantized Variational Inference

NeurIPS 2020

Quantized Variational Inference

Review 1

Summary and Contributions: This work proposes Quantized VI, which uses optimal quantization for variational inference. In essence, it replaces regular MC sampling in BBVI with optimal quantization. It further uses Richardson extrapolation to reduce the bias caused by quantization. Update: I appreciate the authors' effort on the additional experiments in the rebuttal. Adding the experiments with BNNs and larger datasets with more baselines will indeed make the work stronger. However, because the method cannot be applied outside the exponential family (otherwise it would not be affordable), and because of the batching problem pointed out by other reviewers, its application impact may be too limited. I would increase my score from 5 to 5.5 (if that is allowed). I appreciate the nice use of optimal quantization in VI. However, I still cannot support accepting the paper at the current stage, as the advantage (fast convergence) does not carry over to non-exponential-family models in the SVI setting, where it matters.
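As a reader's aid, the idea summarized above can be illustrated with a minimal sketch. It assumes a one-dimensional standard normal distribution and is not the paper's implementation: the names `lloyd_gaussian`, `quantized_expectation` and `richardson` are hypothetical, the quantizer is fitted with a plain Lloyd iteration, and the extrapolation step is a generic two-level Richardson combination that assumes the leading quantization error scales as N^-2 (the one-dimensional rate), which may differ from the scheme used in the paper.

```python
import math
import numpy as np

def _phi(x):
    # Standard normal density; evaluates to 0 at +/- infinity.
    return np.exp(-0.5 * x ** 2) / math.sqrt(2.0 * math.pi)

def _Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))

def lloyd_gaussian(n_points=20, n_iter=1000):
    """Lloyd iteration for an N-point quantizer of N(0, 1).

    Returns the cell centroids and the probability mass of each cell.
    """
    centroids = np.linspace(-2.5, 2.5, n_points)
    for _ in range(n_iter):
        # Voronoi cell boundaries in 1D are midpoints between neighbours.
        mids = 0.5 * (centroids[:-1] + centroids[1:])
        edges = np.concatenate(([-np.inf], mids, [np.inf]))
        mass = np.diff(_Phi(edges))
        # Conditional mean of N(0, 1) on (a, b): (phi(a) - phi(b)) / mass.
        centroids = -np.diff(_phi(edges)) / mass
    return centroids, mass

def quantized_expectation(f, n_points):
    # Deterministic (zero-variance, biased) estimate of E[f(Z)].
    points, weights = lloyd_gaussian(n_points)
    return float(np.sum(weights * f(points)))

def richardson(f, n_points):
    # Assumes an error term proportional to N^-2 and cancels it
    # by combining quantizers with N and 2N points.
    coarse = quantized_expectation(f, n_points)
    fine = quantized_expectation(f, 2 * n_points)
    return (4.0 * fine - coarse) / 3.0

def square(z):
    return z ** 2

points, weights = lloyd_gaussian(20)
quantized = quantized_expectation(square, 20)      # exact value is 1
extrapolated = richardson(square, 10)              # uses 10 and 20 points
rng = np.random.default_rng(0)
monte_carlo = float(np.mean(square(rng.standard_normal(20))))
print(quantized, extrapolated, monte_carlo)
```

For a convex integrand such as z^2 the quantized estimate sits below the true value, which is the one-dimensional analogue of the bias the reviews discuss; the Monte Carlo estimate with the same number of points is unbiased but changes from seed to seed.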

Strengths: 1. The choice of mathematical tools, optimal quantization and Richardson extrapolation, is suitable for the purpose that the authors pose. 2. There is a clear theoretical analysis of the method. 3. Experimental results show promising performance in the small-data setting. 4. Variational inference is important to the community.

Weaknesses:
1. Optimal quantization is not scalable (as the paper itself mentions). Even with clustering beforehand, it is costly in both N (the number of data points) and M (the dimension). The paper (in the abstract and introduction) aims to speed up VI through fast convergence, which is needed in the big-data/big-model setting, yet quantization is a bottleneck in exactly that setting, so the method loses its point.
2. Apart from scalability, I also wonder about the effectiveness in high-dimensional spaces, where everything is far away from everything else.
3. The experiments use only very simple, small UCI datasets and very simple, small models (linear regression). It would be great to see more "real-life" experiments.
4. The baselines are also limited. [a] is discussed in the paper but not compared against; only basic BBVI is compared. It would be good to see at least baselines such as [a] and [b] in the experiments.
5. For Algorithm 2, it would be prohibitively expensive if the quantization had to be recomputed every round. The paper explains that for the exponential family it needs to be computed only once. But if the method is limited to the exponential family, then the whole point of BBVI is lost.
6. Minor points: line 2, minimize -> maximize; can you explicitly discuss the computational complexity of optimal quantization?
[a] Alexander Buchholz, Florian Wenzel, and Stephan Mandt. “Quasi-Monte Carlo Variational Inference”.
[b] Stochastic Learning on Imbalanced Data: Determinantal Point Processes for Mini-batch Diversification.

Correctness: The method seems correct.

Clarity: Clear in general.

Relation to Prior Work: Some of them are discussed but not compared.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: The paper proposes a new optimization method for objectives that are based on sampling latent variables (the authors focus on the ELBO in variational inference). The approach constructs an optimal Voronoi tessellation that leads to biased but variance-free gradient estimates.

Strengths: * The paper communicates the idea of Voronoi tessellation that might be new to many ML researchers.

Weaknesses:
* The relevance of the setting is not clear to me (see detailed comments). Does this approach really deliver what it promises? It is not clear that it really helps with "quick model checking".
* The method seems to be limited to full-batch gradients, but most modern applications of VI involve mini-batch sampling.

Correctness: All claims and derivations seem to be correct.

Clarity: The paper is well written and easy to follow.

Relation to Prior Work: The related work is clearly discussed.

Reproducibility: Yes

Additional Feedback:
1. Is the variance of the gradient estimator the real bottleneck for faster optimization?
* Especially for reparameterization gradients, in my experience the variance w.r.t. the latent variable sampling is quite small in most cases. So is this really an issue?
* To clarify, you could plot the variance of the MCVI gradient (not only the gradient norm).
2. Application to mini-batch gradients?
* Can you also apply your approach in the SVI setting (i.e., stochastic mini-batch gradients), which is used in most cases in practice?
* In this setting the mini-batch noise would probably dominate the latent variable sampling noise. Would you still see a benefit from using your method?
3. Experiments
* Experiments on datasets that go beyond a couple of thousand data points would be nice (especially since in this regime it might be more relevant to have a method for quick model checking).
* Can you provide an experiment that really supports your claim of quick model checking, e.g., tuning hyperparameters faster than with regular MCVI?
* Please quantify this gain in evaluation time.

AFTER REBUTTAL: I thank the authors for their detailed rebuttal. I have decided to keep my initial score since I am still not convinced of the motivation for the method/setting. To me, the benefit of variance reduction (at the cost of bias) does not seem to be really demonstrated in this setting. However, I think the paper is an interesting read, and I would agree to accepting it if that is the overall consensus.
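The suggestion to look at the variance of the MCVI gradient, rather than only its norm, could be checked along the following lines. This is a minimal sketch on a toy problem chosen purely for illustration, not an experiment from the paper: the variational distribution is assumed to be N(mu, sigma^2), the target log-density is assumed to be -0.5 (z - 3)^2, only the gradient of the expected log-density term with respect to mu is considered, and the function names are hypothetical.

```python
import numpy as np

def reparam_grad_mu(mu, sigma, n_samples, rng):
    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1).
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps
    # d/dmu of log p(z) = -0.5 * (z - 3)^2, using dz/dmu = 1.
    return float(np.mean(-(z - 3.0)))

def gradient_mean_and_variance(mu, sigma, n_samples, n_repeats=5000, seed=0):
    # Repeat the estimator many times and report its empirical
    # mean and variance across repetitions.
    rng = np.random.default_rng(seed)
    grads = np.array([
        reparam_grad_mu(mu, sigma, n_samples, rng) for _ in range(n_repeats)
    ])
    return float(grads.mean()), float(grads.var(ddof=1))

grad_mean, grad_var = gradient_mean_and_variance(mu=0.0, sigma=0.5, n_samples=10)
print(grad_mean, grad_var)
```

In this toy case the variance has the closed form sigma^2 / n_samples, so the empirical value can be compared against it; for a real model one would track the same statistic per parameter over the course of training.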

Review 3

Summary and Contributions: The paper introduces an optimal quantization approach to deterministically estimate the gradient of the ELBO in a black-box VI setting. Edit: I updated my score to 7, as all my concerns have been properly addressed. The new experiments test the method in a challenging setting, and the chosen baselines are very relevant.

Strengths: The new approach is simple and theoretically motivated. Everything can be quite straightforwardly implemented in a very generic automatic inference framework simply by pre-computing the optimal quantization for a large family of distributions. This makes the approach very suitable for probabilistic programming.

Weaknesses: The experiment section is suggestive but far too limited. No comparison with the many existing variance reduction methods is offered. Variance reduction in stochastic gradient-based VI is a rich area of research, and without those comparisons it is simply not possible to evaluate the performance of the method. The only baseline is the vanilla MCVI method, which is a very elementary baseline and does not allow the reader to assess the performance of the quantized approach against other variance reduction techniques. Comparisons with variance reduction/Rao-Blackwellization methods based on control variates should be included. Furthermore, an experimental comparison with quasi-MC methods should also be included, given their strong similarities with the proposed approach. Besides the lack of relevant baselines, the experiments focus on very simple models. However, the biggest benefit of variance reduction approaches often comes from deep models such as Bayesian neural networks and deep exponential families. I will consider shifting to a weak accept if these comparisons are included in the rebuttal and incorporated in the camera-ready version. I will also consider an accept if an additional, more complex experiment with proper baselines is included.

Correctness: The methodology is correct, but the lack of baselines makes it very difficult to properly assess performance against relevant VI variance reduction alternatives.

Clarity: Yes

Relation to Prior Work: The coverage of the related literature is somewhat lacking. More detailed discussion should be given of other variance reduction methods and their relationships with the new approach. In particular, the differences and similarities with control variates methods should be extensively discussed as they are currently dominant in the literature and in applications.

Reproducibility: Yes

Additional Feedback: The method is elegant and has potential, but its usefulness cannot really be assessed without a much stronger experiments section. This paper can definitely turn into a high-quality submission if proper attention is given to including relevant baselines and more challenging experiments with deep models.

Review 4

Summary and Contributions: The paper proposes a biased, zero-variance estimator for the gradients of the ELBO based on an Optimal Voronoi Tessellation of q(z|x). It is shown that this estimator is a lower bound on the ELBO. UPDATE: Based on the authors' response, I'm raising my score to 6, mainly because the general idea is neat. I still have reservations about the quality of the presentation, but hopefully that can be improved for the final version. Also, I meant to ask for a comparison to IWAE, not IWAE+quantization.

Strengths: This is a neat idea attacking an important problem, and it deserves to be explored. The theoretical part seems to check out, although I didn't go into the details.

Weaknesses: The proposed Quantized VI method employs several (20) "samples" (the middle points of the Voronoi cells) to estimate the gradients of the ELBO. My main concern is that the ELBO is hardly the best way to make use of several samples. A comparison with IWAE, or better yet DReG (see "Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives"), would make the experiments much more informative.
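The point that a multi-sample objective can make better use of 20 samples than the ELBO can be seen on a toy example. The sketch below is an illustration under assumed settings, not something taken from the paper or the rebuttal: the model is z ~ N(0, 1), x | z ~ N(z, 1) with a single observation x = 1, the proposal q is deliberately set to the prior, and the function names are hypothetical. With k = 1 the importance-weighted bound reduces to the ELBO.

```python
import math
import numpy as np

X_OBS = 1.0  # assumed single observation for the toy model

def log_normal(x, mean, var):
    # Log-density of N(mean, var) evaluated at x.
    return -0.5 * (np.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_weights(z):
    # log p(x, z) - log q(z), with q chosen equal to the prior N(0, 1).
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(X_OBS, z, 1.0)
    log_q = log_normal(z, 0.0, 1.0)
    return log_joint - log_q

def multi_sample_bound(k, n_repeats=4000, seed=0):
    # Monte Carlo estimate of E[log (1/k) sum_i w_i] with k samples from q.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_repeats, k))
    lw = log_weights(z)
    # Numerically stable log-mean-exp over the k samples.
    m = lw.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.mean(np.exp(lw - m), axis=1))
    return float(log_mean_w.mean())

elbo = multi_sample_bound(k=1)          # standard ELBO
iwae_20 = multi_sample_bound(k=20)      # importance-weighted bound, 20 samples
log_px = float(log_normal(X_OBS, 0.0, 2.0))  # exact log marginal likelihood
print(elbo, iwae_20, log_px)
```

Averaging the ELBO integrand over 20 samples leaves its expectation unchanged, whereas the 20-sample importance-weighted bound lies between the ELBO and the exact log marginal likelihood, which is why it is a natural baseline when several samples are available.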

Correctness: Yes.

Clarity: Unfortunately, the paper feels rather rushed and draft-like. The content is mostly there, but it is left to the reader to connect the pieces. Spelling could be improved. Sometimes confusing mistakes are made ("gradient-free" is used multiple times instead of "variance-free").

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: