NeurIPS 2020

Robust, Accurate Stochastic Optimization for Variational Inference

Review 1

Summary and Contributions: This paper proposes a stochastic optimization method tailored for variational inference. The approach builds on recent work showing that stochastic gradient descent can be seen as a discrete-time stochastic process. It then introduces tools from the MCMC literature to define stopping criteria, to justify iterate averaging, and to diagnose convergence problems. The empirical evaluation shows the benefits of the presented approach.
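For context, the iterate-averaging idea summarized above can be sketched in a few lines: run SGD with a constant step size, discard a warm-up phase, and average the remaining iterates, much as one discards MCMC warm-up draws. The objective, step size, and warm-up cutoff below are illustrative choices for a toy problem, not the paper's actual settings.

```python
import numpy as np

def sgd_with_iterate_averaging(grad, w0, lr=0.01, n_iters=1000, avg_start=500):
    """Plain constant-step SGD whose tail iterates are averaged.

    `grad` returns a stochastic gradient estimate at w; `avg_start` is a
    hypothetical warm-up cutoff, analogous to MCMC warm-up.
    """
    w = np.asarray(w0, dtype=float)
    tail = []
    for t in range(n_iters):
        w = w - lr * grad(w)
        if t >= avg_start:          # discard the "warm-up" iterates
            tail.append(w.copy())
    return np.mean(tail, axis=0)    # average over the stationary phase

# Toy quadratic loss with noisy gradients; the minimum is at w = 3.
rng = np.random.default_rng(0)
noisy_grad = lambda w: 2.0 * (w - 3.0) + rng.normal(scale=1.0)
w_avg = sgd_with_iterate_averaging(noisy_grad, w0=0.0, lr=0.05)
```

With a constant step size the individual iterates keep bouncing around the optimum, but the tail average concentrates near it, which is the behavior the paper's stopping criteria try to detect.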

Strengths: The paper addresses a relevant problem. Many works in recent years have focused on improving variational inference, but in a different direction; this work puts the focus on the stochastic optimization process. The authors empirically show that improving this optimization step also improves the quality of the resulting variational solution. The method exploits recent characterizations of stochastic gradient descent (SGD) as a Markov chain. Based on that, the authors employ MCMC methods to address issues of SGD, which empirically seems to be beneficial.

Weaknesses: The paper focuses on variational inference, but the addressed problem is not directly associated with variational inference: it is to improve the convergence of stochastic gradient descent. The ELBO can be seen as a loss function, so variational inference is a special case. The paper does not properly discuss this point. The paper also does not provide any theoretical guarantee, even though the methodology is theoretically sound.

Correctness: I have an issue with the empirical methodology. The authors establish the output of Stan's implementation of Hamiltonian MC as the ground truth. I think this should not be the only proxy used to evaluate the quality of the variational inference algorithm. The predictive log-likelihood on an independent test set should also be considered.

Clarity: In general, the paper is well written and easy to follow for someone with basic knowledge of variational inference and stochastic optimization methods.

Relation to Prior Work: In my opinion, the main missing point with respect to prior work is the relation of this work to other similar approaches outside the context of VI, which could also be used in VI settings by treating the ELBO as a special loss function. This is not clearly discussed in the paper.

Reproducibility: Yes

Additional Feedback: Interpretation of the k-hat statistic in the experimental results is not easy to follow. Please try to improve this part.

********** REVIEW UPDATE **********

I thank the authors for their nice response. I find the theoretical contributions of the paper quite limited, even though I agree with the authors that the connection with Dieuleveut et al. could provide a promising line of work. I think the paper approaches a problem which has been partially overlooked by the variational inference literature. The connection with MCMC is promising and the results are encouraging. The new experiments provided by the authors point in the right direction, even though I think a more extensive experimental evaluation would make this paper a much more solid contribution.

Review 2

Summary and Contributions: This paper analyzes the convergence of variational inference by viewing the iterates as a Markov chain. The authors develop iterate averaging and import convergence diagnostics from Markov chain Monte Carlo.

Strengths: This work is theoretically grounded in previous work by Stephan Mandt et al. The work is relevant to the NeurIPS community, and has a similar flavor to Francis Bach's line of work on stochastic average gradient optimization algorithms, but from a variational inference perspective. The authors empirically evaluate their methods on standard datasets and models.

Weaknesses: It would be helpful to see whether the iterate averaging methods the authors propose also provide benefits for more recent variational inference developments such as deep generative models (variational autoencoders; normalizing flows). I am curious whether the k-hat statistic the authors use is applicable to these methods, or whether their stopping rule can accurately assess convergence for amortized inference; demonstrating this would increase the impact of this work.

Correctness: Yes; the authors perform a thorough empirical evaluation on standard models.

Clarity: Yes.

Relation to Prior Work: Yes.

Reproducibility: Yes

Additional Feedback: Nits:
- Captions can be improved; acronyms such as IA and MCSE that are not defined in captions make the paper harder to read.
- Using the ELBO rather than the negative ELBO would be clearer and avoid confusion; this is variational inference, not risk minimization.
- Use the booktabs and siunitx LaTeX packages for the tables, which are currently difficult to read and assess (e.g., siunitx enables alignment of decimals across rows). Removing the vertical and horizontal lines will help.
- The figures can be made clearer: larger font sizes, thicker lines, legends in standard places (not squeezed on top), de-spining the top and right axes, etc.

Review 3

Summary and Contributions: The paper proposes a stopping rule for training in VI based on averaging the iterates. The idea is motivated by the MCMC literature. The approach is illustrated to work well on a range of examples. The idea of iterate averaging has been used in the literature before, but this paper provides some motivation for it.

Strengths: The idea of iterate averaging is simple and would be useful in practice.

Weaknesses: The main limitation is that the suggested technique seems too heuristic and lacks theoretical grounding. Sure, the authors use some theoretical results from Robbins–Monro-type optimization and MCMC to motivate their iterate averaging method, but these results were developed under idealized conditions and might not be valid in their setting. Although I appreciate the work, I believe that the contribution isn't substantial enough and more study is needed, e.g., of when the method works and when it does not. One thing that bothers me a bit is that the authors suggest using MCMC to motivate the stopping rule in their SGD: it is well known that convergence diagnostics are often problematic in MCMC.

Correctness: Most of the claims are reasonable and the empirical methodology seems to be correct.

Clarity: In general, the paper is well written and well structured. I enjoyed reading it.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: In this paper, the authors study stochastic optimization algorithms for variational inference. In particular, the authors argue that existing stochastic optimization techniques for variational inference are fragile with respect to the hyperparameters of the optimization algorithm. Mainly, the authors argue that the standard stopping rule for stochastic optimization in variational inference is insufficient. The authors view the SGD algorithm with the ELBO objective as a Markov chain with a stationary distribution centered around the true variational posterior. The main contributions of this paper are: a) to use iterate averaging to determine the parameters of the variational posterior; b) to use various heuristics that are typically used to judge the convergence of a Markov chain to determine the stopping time of the stochastic optimization algorithm.
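For readers unfamiliar with the convergence heuristics mentioned in contribution (b), a minimal sketch of the split-\hat{R} statistic, as commonly defined in the MCMC literature (not necessarily the paper's exact variant), is:

```python
import numpy as np

def split_r_hat(chains):
    """Split-R-hat convergence diagnostic (Gelman-Rubin style).

    `chains` is an (m, n) array: m chains of n iterates each. Each chain
    is split in half, giving 2m chains, and between-chain variance is
    compared to within-chain variance; values near 1 suggest convergence.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    half = n // 2
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]])
    k, n2 = splits.shape
    chain_means = splits.mean(axis=1)
    W = splits.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n2 * chain_means.var(ddof=1)        # between-chain variance
    var_plus = (n2 - 1) / n2 * W + B / n2   # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
stationary = rng.normal(size=(4, 1000))  # 4 chains already at stationarity
rhat = split_r_hat(stationary)           # close to 1 for mixed chains
```

Applied to SGD, the "chains" would be parallel optimization runs (or segments of one run); a \hat{R} well above 1 signals that the iterates have not yet reached their stationary phase.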

Strengths: The authors highlight an interesting problem in their experiments: even when the true posterior belongs to the variational family, certain stopping rules can lead a stochastic optimization algorithm to fail to reach the true posterior.

===== Post Rebuttal ======

I apologize to the authors for having a trailing sentence here; I somehow failed to save my last edit. I wanted to point out that it is a strength of the paper that they highlight that even in this ideal setting stochastic optimization can fail.

Weaknesses: The main weakness of the work is that the progress is only incremental: the authors study a very specific variational inference problem in which the true posterior belongs to the variational family. Under these assumptions, the authors leverage existing theory to view SGD as a Markov chain whose stationary distribution is exactly specified by a Gaussian centered around the true posterior. All the heuristics then proposed to judge convergence are well known from the Markov chain literature.

Correctness: There are several claims made in the paper that are not validated with extensive experimentation; at least, such validation is not presented in the current form of the paper. For example, the authors claim that for a certain stopping rule the chain does not converge. It would be illuminating to run experiments under the different hyperparameter settings and learning rates for which the stopping rule failed to converge. These claims need to be backed up with thorough experimentation.

Clarity: The paper is not self-contained. Many heuristics are used to compute the stopping time, for example \hat{k} and \hat{R}, that are never defined in the paper itself; it would be useful for a reader to have these definitions handy. The figures in the paper are not very clear either. In particular, the y-axis of Fig. 1 (left) measures the distance between two distributions, but which distance is used is never exactly defined. The authors have written "distance between moments", but I am not sure how this is being computed.

========= Post Rebuttal =============

I thank the authors for pointing out where they have defined the metrics. It would definitely be useful to have a reference to these in Figure 1.

Relation to Prior Work: The authors do a good job in discussing prior work and how their work relates to it.

Reproducibility: Yes

Additional Feedback:

============== Post Rebuttal Comments =================

After reading through the other reviews and the authors' feedback, I still think that the progress here is only incremental: the main idea is that we can view SGD as a Markov chain centered around the optimal parameter, which is the true parameter when the true posterior belongs to the family of variational approximations. The heuristics proposed are then the ones used for judging the convergence of a Markov chain.