NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Their algorithm is clearly explained. It is easy to implement and has good theoretical convergence properties. As the authors have pointed out, a closely related (but not identical) approach is already used in the literature for an optimisation problem; however, its use in sampling is novel. I find this paper to be high-quality work that could have impact both in terms of the practical applications of the proposed sampler and the way the convergence analysis is carried out.
Reviewer 2
UPDATE: The authors' rebuttal addressed some of my concerns; while this work still feels incremental, I'm inclined to accept it now.

Strengths:
- A theoretical analysis of the convergence rate for the proposed algorithm.
- A nice review of gradient complexity for existing variants of SG-HMC and SGLD.
- Experiments demonstrate improved mixing for the proposed algorithm over competing methods.

Weaknesses:
- The proposed algorithm is a relatively standard extension of SG-HMC and SGLD.
- While the proposed framework, analysis, and bounds are a contribution, such results are fairly familiar in the literature.

References: The following articles might also be related:
- Bardenet, R., Doucet, A., & Holmes, C. (2017). On Markov chain Monte Carlo methods for tall data. The Journal of Machine Learning Research, 18(1), 1515-1557.
- Brosse, N., Durmus, A., & Moulines, E. (2018). The promises and pitfalls of stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems (pp. 8268-8278).
- Dang, K. D., Quiroz, M., Kohn, R., Tran, M. N., & Villani, M. (2018). Hamiltonian Monte Carlo with energy conserving subsampling. arXiv preprint arXiv:1708.00955v2.
Reviewer 3
Update: The authors have helpfully pointed out that they do provide some guidelines on setting the hyperparameters.

This paper creatively combines the underdamped Langevin MCMC work of Cheng et al. with the SPIDER gradient estimator of Fang et al. This allows the paper to use the theoretical result from Fang et al. to prove a better bound for achieving epsilon accuracy in 2-Wasserstein distance. Effectively it is the UL-MCMC algorithm with a better gradient estimator. This isn't meant to imply that the work is trivial, as adapting any insight from the optimisation literature for use in an HMC algorithm requires careful work to yield measurable improvements.

The paper is technically sound, with a compelling theoretical analysis and adequate experimental results. The theory already suggests the algorithm works best with a small batch size, but it is unclear whether one can do more than roughly estimate how that hyperparameter should be set.

The paper is wonderfully organised, though also a fairly dense read. This is likely unavoidable given the theoretical contributions. The supplementary material greatly helped in understanding the paper, but the key ideas were still in the main paper.

This paper is a great step forward in incorporating SPIDER, and it will be nice to see more variance-reduction methods introduced for Langevin methods, where they are sorely needed.
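For concreteness, here is a minimal sketch of the combination as I read it: a SPIDER-style variance-reduced gradient estimate plugged into a discretisation of underdamped Langevin dynamics. This is not the authors' exact algorithm; the toy Gaussian target, the crude Euler-Maruyama integrator, and the hyperparameter choices (anchor period q, small batch size b, step size h, friction gamma) are my own illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
data = rng.normal(loc=1.0, scale=1.0, size=(n, d))   # toy data set

def grad_i(theta, idx):
    # per-example gradient of U(theta) = 0.5*||theta||^2 + 0.5*sum_i ||theta - x_i||^2,
    # with the prior term split evenly across the n examples
    return theta / n + (theta - data[idx])

def full_grad(theta):
    return theta + np.sum(theta - data, axis=0)

def spider_ul_mcmc(n_iter=4000, h=5e-4, gamma=10.0, q=50, b=10):
    theta = np.zeros(d)            # position
    p = np.zeros(d)                # momentum / velocity
    theta_prev = theta.copy()
    v = full_grad(theta)           # anchor: exact gradient at the start
    samples = []
    for k in range(1, n_iter + 1):
        if k % q == 0:
            v = full_grad(theta)               # periodic anchor refresh
        else:
            idx = rng.integers(0, n, size=b)   # small minibatch
            # SPIDER update: correct the running estimate with a gradient
            # difference evaluated at consecutive iterates
            v = v + (n / b) * np.sum(grad_i(theta, idx) - grad_i(theta_prev, idx), axis=0)
        theta_prev = theta.copy()
        # crude Euler-Maruyama step of underdamped Langevin dynamics
        p = p - h * (gamma * p + v) + np.sqrt(2.0 * gamma * h) * rng.normal(size=d)
        theta = theta + h * p
        samples.append(theta.copy())
    return np.array(samples)

samples = spider_ul_mcmc()
print("posterior mean estimate:", samples[len(samples) // 2:].mean(axis=0))
# true posterior mean is data.mean(axis=0) * n / (n + 1), roughly (1, 1)

The point the sketch tries to capture is that the estimator touches the full data set only every q iterations and otherwise corrects the running gradient estimate with small-batch differences at consecutive iterates, which is where the improved gradient complexity is meant to come from.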