NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:4419
Title:Rapid Convergence of the Unadjusted Langevin Algorithm: Isoperimetry Suffices

Reviewer 1

Summary: This paper analyzes both Langevin dynamics (LD) and unadjusted Langevin algorithm (ULA) for sampling a target distribution When the target distribution is smooth and satisfies a log-Sobolev inequality (LSI), the authors show exponential convergence of LD to the target distribution in KL-divergence (Theorem 1) and then extends convergence bound to ULA with an additional error term (Theorem 2) by first proving Langevin dynamics (LD), . The author extend this result to convergence in Renyi-divergence (Theorems 3 + 4 for LD and ULA respectively), under the additional assumptions that perturbations of the target satisfy LSI (Lemma 8) and the existence of a growth function $g$ for ULA. The authors also provide a convergence results when the target distribution satisfies a relaxation of LSI (Poincare inequality) in the Supplement. Quality: The paper appears to be technically sound. To complement the theoretical convergence bounds, although the toy Gaussian examples are nice illustrations, the paper lacks (synthetic) experiments to complement the theoretical results. The paper does not (empirically) demonstrate the tightness (or lack of tightness) of the theoretical convergence bounds. The paper does not provide an example that is not strongly log-concave but satisfies LSI. The convergence bounds for ULA for Renyi divergence (Theorem 4 and Section F.2 for targets satisfying LSI and Poincare inequality respectively) rely on an unknown growth function (line 248). An example is provide for a Gaussian target, but it is not clear to me that this exists in general (for any target distribution satisfying LSI). Originality: To the best of my knowledge, analyzing the convergence of ULA for target distributions under LSI is novel as is measuring convergence in Renyi-divergence. The proof of the Theorems and Lemmas in the Supplement are elegant and easy to follow. Appropriate references to related work and comparisons to previous contributions appear to be included. Clarity: This paper is very well written and excellently organized. Really nice work. Although lines (66-72) provide background on what Renyi divergence is, it could be made more clear what the advantages and disadvantages Renyi divergence offers compared to KL-divergence (for q > 1). Significance: The main contribution of this paper is generalizing the convergence results of ULA beyond log-concave targets. This paper generalizes to target distributions satisfying LSI. Section 2.2 provides a compelling case for the reasonableness of the LSI condition - it is indeed a wider class of distributions and includes practical. These results are of interest to the stochastic gradient Langevin dynamics (SGLD) literature, as they could be extended to SGLD by handling additional noise in the gradient. This is highly significant, because SGLD (and it variants) are commonly used in practice to sample from non log-concave distributions. The significance of the results for Renyi divergence is more difficult to evaluate. I am not sure what additional benefits Renyi divergence offers (compared to KL-divergence) and the additional assumptions (Lemma 8) and existence of an unknown growth function are more difficult to justify. However, I believe the significance of the first contribution is enough for me to support acceptance. Typos / Minor Comments: -Line 29 could change "L^2" to "L^2-space" to avoid confusion with "L" used for smoothness. -Line 441 "Hessian" -> "Laplacian" === Update based on Author Feedback === Thank you for providing additional motivation for why Renyi divergence may be of practical interest over KL divergence and how the growth function only needs to be a rough initial estimate of the asymptotic bias. Although the toy examples of nonconvex functions that satisfy LSI (in the author feedback) are nice, I still wish that there was a clear motivating practical example (perhaps for Bayesian inference using the Holley-Stroock perturbation theorem, where the prior satisfies a LSI, the loglikelihood is bounded, and the target posterior distribution is not logconcave but will still satisfy LSI).

Reviewer 2

Summary: Authors study the convergence of unadjusted Langevin algorithm (ULA) which is Euler discretization of overdamped Langevin dynamics. They establish rates in KL divergence and Renyi divergence by only assuming a log-Sobolev inequality and smoothness. The results are significant and paper is well-written. My main concern is that there is no practical demonstration (examples or experiments) of the usefulness of the established results. See my further questions/comments below. ** Major comments: - Authors start their algorithm at N(x_*, . ) where x_* is a first order critical point of the potential function. They claim that x_* can be found via gradient descent. However, in practice they can only find a point that is arbitrarily close to a first-order critical point x_*. Would your final bound change after taking this into account? Same question for line 187. - The class of distributions that satisfy LSI is larger than class of logconcave distributions. However, authors do not provide an example of such a distribution that is used in practice (satisfies LSI but not strong convexity). An example of this sort would further motivate the reader and demonstrate the significance of the established results. - What happens to the upper bounds when you start the algorithm at a deterministic point? - LSI implies (by Talagrand inequality) exponential decay in Wasserstein-2 metric. In the case of exponential contraction in W2, this implies back strong convexity. Is it because you only have exponential decay rather than contraction that your results are more general than strong convexity? ** Minor comments: - I haven't noticed a single typo in the paper. -- I would like to thank the authors for answering my questions. I like the results of this paper; however, I will keep my score (see major concerns above).

Reviewer 3

After reading authors' response: I now understand that there might be subtle issues with preprint [3]. However, I maintain the original comment that the continuous-time convergence of KL under LSI isn't surprising and the discretization analysis seems novel. I now see that the generalization to a weaker Poincare ineq. condition is a good addition (which the authors claim to be one of three key contributions in their response). However, it is slightly confusing that they have not included any description regarding this contribution in the abstract of the original submission. - Originality There has been a resurgent interest in ULA and its derivatives in recent years partly due to the empirical success of SGLD [1] and strong theoretical guarantees for its performance in non-convex optimization [2]. Most results were proven with the rather stringent assumption of strong convexity, however, recent results were able to replace this with the log-Sobolev condition [2,3] or distant dissipativity [4,5] (based on switching between synchronous and reflection couplings). The first contribution of the paper, which proves a convergence bound in the KL divergence without assuming strong convexity is not surprising. Most of the techniques are known or bear a resemblance to existing works in the literature (see e.g. [2,6]), and the authors are upfront about this. Nevertheless, the discretization analysis using arguments based on the Fokker-Planck PDE for the density seems novel. The overall analysis is clean and is a good addition to the current literature on sampling methods based on the Langevin diffusion. The second part of the paper, where a generalized convergence bound in the Renyi divergence is shown, is novel. - Quality The paper is technically sound with generally well-written proofs provided. Although I did not read over every detail in every proof, I followed most of them (proofs of the main theorems 2 and 4), and they seem correct. A potential weakness of the second part of the paper is the rather strong assumption on the (modified) biased target measure. It seems that to verify LSI for the perturbed biased target, we may need to first infer some property of it, which seems much more difficult than merely bounding its error from the true target. Additionally, the second part of the paper would be more convincing if ample motivation was given as to why the Renyi divergence would be an interesting distance measure to consider for sampling applications. - Clarity While the paper is generally pretty readable, the overall clarity can be further improved. It seems that the introduction is quite long and contains the description of some less relevant information, i.e. the weaker Poincare inequality condition. Specifically regarding the introduction, one of the other issues is that some of the important concepts are not formally defined. This is generally fine for informal high-level theorems/proofs, but it is the usage of less standard notation and terminology that might confuse people. For instance, the first theorem in the intro assumes the target measure \nu (instead of the negative gradient of log-density) is L-smooth, where only until section 2.1 was it mentioned that this actually means the gradient of -\log \nu is Lipschitz (or the Hessian is upper bounded). Additionally, I'm a little bit curious as to why Theorem 1 and 3 don't assume Lipschitz drift coefficients (or in this case just smoothness of the gradient of potential), since this is usually required for the existence and uniqueness of the solution to the SDE. - Significance See part 1 of the review. References [1] Welling, Max, and Yee W. Teh. "Bayesian learning via stochastic gradient Langevin dynamics." Proceedings of the 28th international conference on machine learning (ICML-11). 2011. [2] Raginsky, Maxim, Alexander Rakhlin, and Matus Telgarsky. "Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis." arXiv preprint arXiv:1702.03849 (2017). [3] Ma, Yi-An, et al. "Is There an Analog of Nesterov Acceleration for MCMC?." arXiv preprint arXiv:1902.00996 (2019). [4] Gorham, Jackson, and Lester Mackey. "Measuring sample quality with kernels." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017. [5] Eberle, Andreas, Arnaud Guillin, and Raphael Zimmer. "Couplings and quantitative contraction rates for Langevin dynamics." arXiv preprint arXiv:1703.01617 (2017). [6] Cheng, Xiang, and Peter Bartlett. "Convergence of Langevin MCMC in KL-divergence." arXiv preprint arXiv:1705.09048 (2017).