NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019 at Vancouver Convention Center
Paper ID 4021: Estimating Convergence of Markov chains with L-Lag Couplings

### Reviewer 1

The authors generalize 1-lag coupling of chains to L-lag coupling and provide upper bounds on several distribution distances, including the total variation and 1-Wasserstein distances. This bound serves as a convergence check for MCMC, e.g., to stop the burn-in phase. The main contributions of the paper are 1) deriving a computable bound on the distance between the distributions of two (L-lagged) chains, and 2) presenting algorithms (e.g., coupled random-walk Metropolis-Hastings, coupled HMC, etc.) that use the bound as a stopping criterion for burn-in. Unfortunately, the second part, together with the proof of the bound, is in the supplementary material. The presented bound and the method to compute it are, to the best of my knowledge, novel and significantly extend the state of the art.

My only concern is the presentation/composition of the paper: the bound requires computing the (L-lag) meeting time. The algorithm to compute it compares samples (with time lag L) from different chains and stops when they coincide. Naively, this will never happen for continuous values; hence, the algorithm assumes some coupled/joint MCMC kernel. Some viable variants are presented in the appendix. While those are based on prior work, I would still expect them to be part of the main body. Without the supplementary material, I would not consider the presented work self-contained and reproducible.

In addition, I would expect a broader discussion of the relevant literature, also in the context of applications of the bound, such as detecting convergence. For example, there are methods to quantify bias in a sample (e.g., "A Kernelized Stein Discrepancy for Goodness-of-fit Tests", Liu et al.) and two-sample tests to measure sample discrepancy (e.g., "A Kernel Two-Sample Test", Gretton et al.). Both approaches may also be used to detect MCMC convergence and could be considered as baselines in the experiments.

After reading the authors' feedback, I revised my vote from 6 to 7. They addressed my concerns and suggested some changes so that, overall, I'm inclined to accept the paper.
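To make the coupled-kernel point concrete: below is a minimal, illustrative sketch (not the authors' code) of an L-lag coupled random-walk Metropolis-Hastings pair on a 1-D standard normal target. It uses a maximal coupling of the Gaussian proposals plus a shared acceptance uniform, so the two chains meet exactly in finite time despite the continuous state space; a batch of meeting times is then turned into a total-variation bound of the form E[max(0, ⌈(τ − L − t)/L⌉)], which is my recollection of the bound in the paper. All function names and parameter values here are my own illustration.

```python
import numpy as np


def log_target(x):
    # Unnormalized log-density of a standard normal target (illustrative choice).
    return -0.5 * x ** 2


def maximal_coupling_normal(mu1, mu2, sigma, rng):
    """Sample (X, Y) from a maximal coupling of N(mu1, sigma^2) and N(mu2, sigma^2).

    Standard rejection construction: X = Y with the maximal overlap probability.
    """
    logp = lambda z, mu: -0.5 * ((z - mu) / sigma) ** 2  # shared normalization cancels
    x = rng.normal(mu1, sigma)
    if np.log(rng.uniform()) + logp(x, mu1) <= logp(x, mu2):
        return x, x  # chains propose the same point
    while True:
        y = rng.normal(mu2, sigma)
        if np.log(rng.uniform()) + logp(y, mu2) > logp(y, mu1):
            return x, y


def coupled_rwmh_meeting_time(L=5, sigma=0.7, max_iter=10_000, seed=0):
    """Return the L-lag meeting time tau of two coupled RWMH chains."""
    rng = np.random.default_rng(seed)
    x = rng.normal()  # X_0
    # Advance the leading chain X for L steps with ordinary RWMH.
    for _ in range(L):
        prop = rng.normal(x, sigma)
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop
    y = rng.normal()  # Y_0, the lagged chain
    for t in range(L, max_iter):
        px, py = maximal_coupling_normal(x, y, sigma, rng)
        u = np.log(rng.uniform())  # shared acceptance variable (common random number)
        if u < log_target(px) - log_target(x):
            x = px
        if u < log_target(py) - log_target(y):
            y = py
        if x == y:  # exact meeting; the chains stay together afterwards
            return t + 1
    return max_iter  # did not meet within the budget


def tv_bound(taus, t, L):
    """Empirical upper bound on d_TV(pi_t, pi) from a sample of meeting times."""
    return float(np.mean([max(0, np.ceil((tau - L - t) / L)) for tau in taus]))
```

A usage pattern would be to draw many independent meeting times and evaluate `tv_bound` on a grid of `t` values; the bound decays to zero once `t` exceeds the largest observed meeting time, which is exactly the burn-in diagnostic the reviewer describes.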

### Reviewer 2

The paper reads well, and the content is very clearly presented. The stylized example and the application section provide a convincing illustration. However, the paper is not very original, at least not from a mathematical perspective.

A reader may understand that the motivation for using an L-lag coupling rather than a 1-lag coupling is to obtain "sharper bounds" [line 40]. First, it would be useful to provide some intuition behind this fact. Second, the authors should try to compare the convergence speed of the L-lag bound estimator with the corresponding 1-lag bound estimator. The examples show that the L-lag bound estimators are typically more stable, but a quantitative analysis appears compelling. It would help a practitioner to know whether the L-lag bound is worth using, since a replication of the L-lag coupling is typically much more computationally involved than an equivalent 1-lag replica.

In the application section, the authors should mention whether Assumptions 2.2--2.3 are satisfied.

Minor comment: l106, "N times independently".

### Reviewer 3

Strengths

- A crisply-written paper, targeting an important problem: assessing the convergence of MCMC methods.
- The proposed algorithm provides practical bounds on the distance between the marginal distribution of a Markov chain and the target distribution. Estimating such bounds is a challenging research problem in computational statistics, and this paper proposes a useful method based on L-lag couplings.
- Empirical evaluation verifies the theoretical claims and demonstrates the practicality of the proposed method on several common sampling algorithms.
- A fair assessment of the proposed method, with experiments demonstrating both efficacy and a failure case (sec 2.2.2). It is worth pointing out that the failure case should not be considered a weakness of this paper, as it is beneficial for understanding the proposed method.

Weaknesses

- Convergence estimates from the proposed method could be incorrect when the underlying MCMC sampler does not mix well, e.g., in high-dimensional and/or multimodal target distribution settings. This issue is pointed out by the authors through numerical experiments and deserves more investigation.