{"title": "Estimating Convergence of Markov chains with L-Lag Couplings", "book": "Advances in Neural Information Processing Systems", "page_first": 7391, "page_last": 7401, "abstract": "Markov chain Monte Carlo (MCMC) methods generate samples that are asymptotically distributed from a target distribution of interest as the number of iterations goes to infinity. Various theoretical results provide upper bounds on the distance between the target and marginal distribution after a fixed number of iterations. These upper bounds are on a case by case basis and typically involve intractable quantities, which limits their use for practitioners. We introduce L-lag couplings to generate computable, non-asymptotic upper bound estimates for the total variation or the Wasserstein distance of general Markov chains. We apply L-lag couplings to the tasks of (i) determining MCMC burn-in, (ii) comparing different MCMC algorithms with the same target, and (iii) comparing exact and approximate MCMC. Lastly, we (iv) assess the bias of sequential Monte Carlo and self-normalized importance samplers.", "full_text": "Estimating Convergence of Markov chains with\n\nL-Lag Couplings\n\nNiloy Biswas\n\nHarvard University\n\nniloy_biswas@g.harvard.edu\n\nPaul Vanetti\n\nUniversity of Oxford\n\npaul.vanetti@spc.ox.ac.uk\n\nPierre E. Jacob\nHarvard University\n\npjacob@fas.harvard.edu\n\nAbstract\n\nMarkov chain Monte Carlo (MCMC) methods generate samples that are asymptoti-\ncally distributed from a target distribution of interest as the number of iterations\ngoes to in\ufb01nity. Various theoretical results provide upper bounds on the distance\nbetween the target and marginal distribution after a \ufb01xed number of iterations.\nThese upper bounds are on a case by case basis and typically involve intractable\nquantities, which limits their use for practitioners. 
We introduce L-lag couplings to\ngenerate computable, non-asymptotic upper bound estimates for the total variation\nor the Wasserstein distance of general Markov chains. We apply L-lag couplings\nto the tasks of (i) determining MCMC burn-in, (ii) comparing different MCMC al-\ngorithms with the same target, and (iii) comparing exact and approximate MCMC.\nLastly, we (iv) assess the bias of sequential Monte Carlo and self-normalized\nimportance samplers.\n\n1\n\nIntroduction\n\nMarkov chain Monte Carlo (MCMC) algorithms generate Markov chains that are invariant with\nrespect to probability distributions that we wish to approximate. Numerous works help understanding\nthe convergence of these chains to their invariant distributions, hereafter denoted by \u03c0. Denote\nby \u03c0t the marginal distribution of the chain (Xt)t\u22650 at time t. The discrepancy between \u03c0t and \u03c0\ncan be measured in different ways, typically the total variation (TV) distance or the Wasserstein\ndistance in the MCMC literature. Various results provide upper bounds on this distance, of the form\nC(\u03c00)f (t), where C(\u03c00) < \u221e depends on \u03c00 but not on t, and where f (t) decreases to zero as t\ngoes to in\ufb01nity, typically geometrically; see Section 3 in [48] for a gentle survey, and [17, 13, 18]\nfor recent examples. These results typically relate convergence rates to the dimension of the state\nspace or to various features of the target. Often these results do not provide computable bounds on\nthe distance between \u03c0t and \u03c0, as C(\u03c00) and f (t) typically feature unknown constants; although see\n[49] where these constants can be bounded analytically, and [12] for examples where they can be\nnumerically approximated.\nVarious tools have been developed to assess the quality of MCMC estimates. 
Some focus on the\nbehaviour of the chains assuming stationarity, comparing averages computed within and across chains,\nor de\ufb01ning various notions of effective sample sizes based on asymptotic variance estimates (e.g.\n[20, 21, 19, 56], [46, Chapter 8]). Few tools provide computable bounds on the distance between \u03c0t\nand \u03c0 for a \ufb01xed t; some are mentioned in [6] for Gibbs samplers with tractable transition kernels.\nNotable exceptions, beyond [12] mentioned above, include the method of [31, 32] which relies on\ncoupled Markov chains. A comparison with our proposed method will be given in Section 2.4.\nWe propose to use L-lag couplings of Markov chains to estimate the distance between \u03c0t and \u03c0\nfor a \ufb01xed time t, building on 1-lag couplings used to obtain unbiased estimators in [23, 29]. The\ndiscussion of [29] mentions that upper bounds on the TV between \u03c0t and \u03c0 can be estimated with such\ncouplings. We generalize this idea to L-lag couplings, which provide sharper bounds, particularly\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\ffor small values of t. The proposed technique extends to a class of probability metrics [52] beyond\nTV. We demonstrate numerically that the bounds provide a practical assessment of convergence for\nvarious popular MCMC algorithms, on either discrete or continuous and possibly high-dimensional\nspaces. The proposed bounds can be used to (i) determine burn-in period for MCMC estimates,\nto (ii) compare different MCMC algorithms targeting the same distribution, or to (iii) compare\nexact and approximate MCMC algorithms, such as Unadjusted and Metropolis-adjusted Langevin\nalgorithms, providing a computational companion to studies such as [18]. 
We also (iv) assess the bias\nof sequential Monte Carlo and self-normalized importance samplers.\nIn Section 2 we introduce L-lag couplings to estimate metrics between marginal and invariant\ndistributions of a Markov chain. We illustrate the method on simple examples, discuss the choice of\nL, and compare with the approach of [31]. In Section 3 we consider applications including Gibbs\nsamplers on the Ising model and gradient-based MCMC algorithms on log-concave targets. In Section\n4 we assess the bias of sequential Monte Carlo and self-normalized importance samplers. All scripts\nin R are available at https://github.com/niloyb/LlagCouplings.\n\n2 L-lag couplings\n\nConsider two Markov chains (Xt)t\u22650, (Yt)t\u22650, each with the same initial distribution \u03c00 and Markov\nkernel K on (Rd,B(Rd)) which is \u03c0-invariant. Choose some integer L \u2265 1 as the lag parameter. We\ngenerate the two chains using Algorithm 1. The joint Markov kernel \u00afK on (Rd \u00d7 Rd,B(Rd \u00d7 Rd)) is\nsuch that, for all x, y, \u00afK((x, y), (\u00b7, Rd)) = K(x,\u00b7), and \u00afK((x, y), (Rd,\u00b7)) = K(y,\u00b7). This ensures\nthat Xt and Yt have the same marginal distribution at all times t. Furthermore, \u00afK is constructed\nsuch that the pair of chains can meet exactly after a random number of steps, i.e. the meeting time\n\u03c4 (L) := inf{t > L : Xt = Yt\u2212L} is almost surely \ufb01nite. Finally we assume that the chains remain\nfaithful after meeting, i.e. 
Xt = Yt−L for all t ≥ τ(L).
Various constructions for K̄ have been derived in the literature: for instance coupled Metropolis-Hastings and Gibbs kernels in [31, 29], coupled Hamiltonian Monte Carlo kernels in [36, 5, 26], and coupled particle Gibbs samplers in [9, 3, 28].

Algorithm 1: Sampling L-lag meeting times
Input: lag L ≥ 1, initial distribution π0, single kernel K and joint kernel K̄
Output: meeting time τ(L), and chains (Xt)0≤t≤τ(L), (Yt)0≤t≤τ(L)−L
Initialize: generate X0 ∼ π0, Xt|Xt−1 ∼ K(Xt−1, ·) for t = 1, ..., L, and Y0 ∼ π0
for t > L do
    Sample (Xt, Yt−L)|(Xt−1, Yt−L−1) ∼ K̄((Xt−1, Yt−L−1), ·)
    if Xt = Yt−L then return τ(L) := t, and chains (Xt)0≤t≤τ(L), (Yt)0≤t≤τ(L)−L
end

We next introduce integral probability metrics (IPMs, e.g. [52]).
Definition 2.1. (Integral Probability Metric). Let H be a class of real-valued functions on a measurable space X. For all probability measures P, Q on X, the corresponding IPM is defined as:

    dH(P, Q) := sup_{h∈H} | E_{X∼P}[h(X)] − E_{X∼Q}[h(X)] |.    (1)

Common IPMs include the total variation distance dTV with H := {h : sup_{x∈X} |h(x)| ≤ 1/2}, and the 1-Wasserstein distance dW with H = {h : |h(x) − h(y)| ≤ dX(x, y), ∀x, y ∈ X}, where dX is a metric on X [42]. Our proposed method applies to IPMs such that sup_{h∈H} |h(x) − h(y)| ≤ MH(x, y) for all x, y ∈ X, for some computable function MH on X × X. 
For dTV we have MH(x, y) = 1, and for dW we have MH(x, y) = dX(x, y).
With a similar motivation for the assessment of sample approximations, and not restricted to the MCMC setting, [25] considers a restricted class of functions H to develop a specific measure of sample quality based on Stein's identity. [35, 10] combine Stein's identity with reproducing kernel Hilbert space theory to develop goodness-of-fit tests. [24] obtains further results and draws connections to the literature on couplings of Markov processes. Here we directly aim at upper bounds on the total variation and Wasserstein distances. The total variation controls the maximal difference between the masses assigned by πt and π to any measurable set, and thus directly helps in assessing the error of histograms of the target marginals. The 1-Wasserstein distance controls the error made on expectations of 1-Lipschitz functions, which with X = Rd and dX(x, y) = ‖x − y‖1 (the L1 norm on Rd) include all first moments.

2.1 Main results

We make the three following assumptions, similar to those of [29].
Assumption 2.2. (Marginal convergence and moments.) For all h ∈ H, as t → ∞, E[h(Xt)] → E_{X∼π}[h(X)]. Also, ∃η > 0, D < ∞ such that E[MH(Xt, Yt−L)^{2+η}] ≤ D for all t ≥ L.
The above assumption concerns the marginal convergence of the MCMC algorithm and the moments of the associated chains. The next assumptions concern the coupling operated by the joint kernel K̄.
Assumption 2.3. (Sub-exponential tails of meeting times.) The chains are such that the meeting time τ(L) := inf{t > L : Xt = Yt−L} satisfies P((τ(L) − L)/L > t) ≤ Cδ^t for all t ≥ 0, for some constants C < ∞ and δ ∈ (0, 1).
The above assumption can be relaxed to allow for polynomial tails as in [37]. 
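In practice, the meeting times appearing in Assumption 2.3 are generated by running Algorithm 1 with an algorithm-specific joint kernel. As a concrete illustration we can combine Algorithm 1 with the coupled random-walk Metropolis-Hastings kernel of Section 2.2.1 (a maximal coupling of the two Normal proposals, as in Algorithm 2, plus a common accept-reject uniform), for the N(0, 1) target with σMH = 0.5 and π0 = δ10. The paper's released code is in R; the sketch below is our own Python illustration, and all function names are ours:

```python
import numpy as np

SIGMA_MH = 0.5  # random-walk step size of Section 2.2.1

def log_target(x):
    # standard Normal target, up to an additive constant
    return -0.5 * x * x

def log_prop(z, mean):
    # log-density of the N(mean, SIGMA_MH^2) proposal, up to a constant
    return -0.5 * (z - mean) ** 2 / SIGMA_MH ** 2

def max_coupling(mx, my, rng):
    # Maximal coupling of the two proposal distributions (Algorithm 2):
    # correct marginals, and the pair is equal with the maximal probability.
    xs = rng.normal(mx, SIGMA_MH)
    if np.log(rng.uniform()) + log_prop(xs, mx) <= log_prop(xs, my):
        return xs, xs
    while True:
        ys = rng.normal(my, SIGMA_MH)
        if np.log(rng.uniform()) + log_prop(ys, my) > log_prop(ys, mx):
            return xs, ys

def mh_step(x, rng):
    # marginal kernel K: one random-walk Metropolis-Hastings step
    xs = rng.normal(x, SIGMA_MH)
    return xs if np.log(rng.uniform()) < log_target(xs) - log_target(x) else x

def coupled_mh_step(x, y, rng):
    # joint kernel K-bar: maximally coupled proposals, common uniform
    xs, ys = max_coupling(x, y, rng)
    logu = np.log(rng.uniform())
    return (xs if logu < log_target(xs) - log_target(x) else x,
            ys if logu < log_target(ys) - log_target(y) else y)

def sample_meeting_time(lag, rng, x0=10.0, max_iter=10 ** 6):
    # Algorithm 1: advance X for `lag` steps, then run the coupled pair
    # (X_t, Y_{t-lag}) until X_t = Y_{t-lag} exactly.
    x = x0
    for _ in range(lag):
        x = mh_step(x, rng)
    y = x0  # pi_0 is a point mass at 10
    for t in range(lag + 1, max_iter):
        x, y = coupled_mh_step(x, y, rng)
        if x == y:
            return t  # the meeting time tau(L)
    raise RuntimeError("chains did not meet within max_iter steps")

rng = np.random.default_rng(1)
tau = sample_meeting_time(lag=150, rng=rng)
```

Once the chains meet, the common uniform and the maximal coupling keep them equal, so Assumption 2.4 holds by construction for this kernel.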
The final assumption, on faithfulness, is typically satisfied by design.
Assumption 2.4. (Faithfulness.) The chains stay together after meeting: Xt = Yt−L for all t ≥ τ(L).
We assume that the three assumptions above hold in the rest of the article. The following theorem is our main result.
Theorem 2.5. (Upper bounds.) For an IPM with function set H and upper bound MH, with the Markov chains (Xt)t≥0, (Yt)t≥0 satisfying the above assumptions, for any L ≥ 1, and any t ≥ 0,

    dH(πt, π) ≤ E[ Σ_{j=1}^{⌈(τ(L)−L−t)/L⌉} MH(X_{t+jL}, Y_{t+(j−1)L}) ].    (2)

Here ⌈x⌉ denotes the smallest integer above x, for x ∈ R. When ⌈(τ(L) − L − t)/L⌉ ≤ 0, the sum in inequality (2) is set to zero by convention. We next give an informal proof. Seeing the invariant distribution π as the limit of πt as t → ∞, applying triangle inequalities, and recalling that dH(πs, πt) ≤ E[MH(Xs, Xt)] for all s, t, we obtain

    dH(πt, π) ≤ Σ_{j=1}^{∞} dH(π_{t+jL}, π_{t+(j−1)L}) ≤ Σ_{j=1}^{∞} E[MH(X_{t+jL}, Y_{t+(j−1)L})].    (3)

The right-hand side of (2) is retrieved by swapping expectation and limit, and noting that terms indexed by j > ⌈(τ(L) − L − t)/L⌉ are equal to zero by Assumption 2.4. The above reasoning highlights that increasing L leads to sharper bounds through the use of fewer triangle inequalities. 
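For the total variation distance, where MH ≡ 1, the bound (2) depends only on the meeting time, and its empirical-average estimator over independent runs of Algorithm 1 is essentially a one-liner. A minimal Python sketch (illustrative only; the paper's released code is in R, and the meeting times in the example are made up):

```python
import numpy as np

def tv_upper_bound(meeting_times, lag, t):
    # Empirical TV bound: average of max(0, ceil((tau(L) - L - t) / L))
    # over independent meeting times tau(L).
    taus = np.asarray(meeting_times, dtype=float)
    return np.maximum(np.ceil((taus - lag - t) / lag), 0.0).mean()

# Example with placeholder meeting times, lag L = 1:
bound = tv_upper_bound([12, 8, 25], lag=1, t=10)  # (1 + 0 + 14) / 3 = 5.0
```

As the example shows, the estimate can exceed 1 (and is then vacuous as a TV bound) when t is small relative to the meeting times, which is exactly the regime that larger lags L improve.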
An alternate, formal proof based on an unbiased estimation argument is given in the supplementary material.
Theorem 2.5 gives the following bounds for dTV and dW:

    dTV(πt, π) ≤ E[ max(0, ⌈(τ(L) − L − t)/L⌉) ],    (4)

    dW(πt, π) ≤ E[ Σ_{j=1}^{⌈(τ(L)−L−t)/L⌉} dX(X_{t+jL}, Y_{t+(j−1)L}) ].    (5)

For the total variation distance, the boundedness part of Assumption 2.2 is directly satisfied. For the 1-Wasserstein distance on Rd with dX(x, y) = ‖x − y‖1 (the L1 norm on Rd), the boundedness part is equivalent to a uniform bound on the (2 + η)-th moments of the marginal distributions for some η > 0. We emphasize that the proposed bounds can be estimated directly by running Algorithm 1 N times independently, and using empirical averages. All details of the MCMC algorithms and their couplings mentioned below are provided in the supplementary material.

2.2 Stylized examples

2.2.1 A univariate Normal

We consider a Normal example where we can compute total variation and 1-Wasserstein distances (using the L1 norm on R throughout) exactly. The target π is N(0, 1) and the kernel K is that of a Normal random walk Metropolis-Hastings (MH) with step size σMH = 0.5. We set the initial distribution π0 to be a point mass at 10. The joint kernel K̄ operates as follows. 
Given (Xt−1, Yt−L−1), sample (X∗, Y∗) from a maximal coupling of p := N(Xt−1, σ²MH) and q := N(Yt−L−1, σ²MH). This is done using Algorithm 2, which ensures X∗ ∼ p, Y∗ ∼ q and P(X∗ ≠ Y∗) = dTV(p, q).

Algorithm 2: A maximal coupling of p and q
Sample X∗ ∼ p, and W ∼ U(0, 1)
if p(X∗)W ≤ q(X∗) then set Y∗ = X∗ and return (X∗, Y∗)
else sample Ỹ ∼ q and W̃ ∼ U(0, 1) until q(Ỹ)W̃ > p(Ỹ). Set Y∗ = Ỹ and return (X∗, Y∗)

Having obtained (X∗, Y∗), sample U ∼ U(0, 1); set Xt = X∗ if U < π(X∗)/π(Xt−1); otherwise set Xt = Xt−1. With the same U, set Yt−L = Y∗ if U < π(Y∗)/π(Yt−L−1); otherwise set Yt−L = Yt−L−1. Such a kernel K̄ is a coupling of K with itself, and Assumption 2.4 holds by design. The verification of Assumption 2.3 is harder but can be done via drift conditions in various cases; we refer to [29] for more discussion.
Figure 1 shows the evolution of the marginal distribution of the chain, and the TV and 1-Wasserstein distance upper bounds. We use L = 1 and L = 150. For each L, N = 10000 independent runs of Algorithm 1 were performed to estimate the bounds in Theorem 2.5 by empirical averages. Exact distances are shown for comparison. Tighter bounds are obtained with larger values of L, as discussed further in Section 2.3.

Figure 1: Marginal distributions of the chain (left), and upper bounds on the total variation (middle) and the 1-Wasserstein distance (right) between πt and π, for a Metropolis-Hastings algorithm targeting N(0, 1) and starting from a Dirac mass at 10. 
With L = 150 the estimated upper bounds for both are close to the exact distances.

2.2.2 A bimodal target

We consider a bimodal target to illustrate the limitations of the proposed technique. The target is π = (1/2)N(−4, 1) + (1/2)N(4, 1), as in Section 5.1 of [29]. The MCMC algorithm is again random walk MH, with σMH = 1, π0 = N(10, 1). Now, the chains struggle to jump between the modes, as seen in Figure 2 (left), which shows a histogram of the 500th marginal distribution from 1000 independent chains. Figure 2 (right) shows the TV upper bound estimates for lags L = 1 and L = 18000 (considered very large), obtained with N ∈ {1000, 5000, 10000} independent runs of Algorithm 1. With L = 18000, we do not see a difference between the obtained upper bounds, which suggests that the variance of the estimators is small for the different values of N. In contrast, the dashed-line bounds corresponding to lag L = 1 are very different. This is because, over 1000 experiments, the 1-lag meetings always occurred quickly in the mode nearest to the initial distribution. However, over 5000 and 10000 experiments, there were instances where one of the two chains jumped to the other mode before meeting, resulting in a much longer meeting time. Thus the results obtained with N = 1000 repeats can be misleading. This is a manifestation of the estimation error associated with empirical averages, which are not guaranteed to be accurate after any fixed number N of repeats. The shape of the bounds obtained with L = 18000, with a plateau, reflects how the chains first visit one of the modes, and then both.

Figure 2: Metropolis-Hastings algorithm with π0 ∼ N(10, 1), σMH = 1 on a bimodal target. 
Left:\nHistogram of the 500th marginal distribution from 1000 independent chains, and target density in full\nline. Right: Total variation bounds obtained with lags L \u2208 {1, 18000} and N \u2208 {1000, 5000, 10000}\nindependent runs of Algorithm 1.\n\n2.3 Choice of lag L\n\nSection 2.2.2 illustrates the importance of the choice of lag L. Obtaining \u03c4 (L) requires sampling L\ntimes from K and \u03c4 (L) \u2212 L from \u00afK. When L gets large, we can consider XL to be at stationarity,\nwhile Y0 still follows \u03c00. Then the distribution of \u03c4 (L) \u2212 L depends entirely on \u00afK and not on L. In\nthat regime the cost of obtaining \u03c4 (L) increases linearly in L. On the other hand, if L is small, the\ncost might be dominated by the \u03c4 (L) \u2212 L draws from \u00afK. Thus increasing L might not signi\ufb01cantly\nimpact the cost until the distribution of \u03c4 (L) \u2212 L becomes stable in L.\nThe point of increasing L is to obtain sharper bounds. For example, from (4) we see that, for \ufb01xed t,\nthe variable in the expectation takes values in [0, 1] with increasing probability as L \u2192 \u221e, resulting\nin upper bounds more likely to be in [0, 1] and thus non-vacuous. The upper bound is also decreasing\nin t. This motivates the strategy of starting with L = 1, plotting the bounds as in Figure 1, and\nincreasing L until the estimated upper bound for dTV(\u03c00, \u03c0) is close to 1.\nIrrespective of the cost, the bene\ufb01ts of increasing L eventually diminish: the upper bounds are loose\nto some extent since the coupling operated by \u00afK is not optimal [54]. The couplings considered in\nthis work are chosen to be widely applicable but are not optimal in any way.\n\n2.4 Comparison with Johnson\u2019s diagnostics\n\nThe proposed approach is similar to that proposed by Valen Johnson in [31], which works as\nfollows. 
A number c ≥ 2 of chains start from π0 and evolve jointly (without time lags), such that they all coincide exactly after a random number of steps Tc, while each chain marginally evolves according to K. If we assume that any draw from π0 would be accepted as a draw from π in a rejection sampler with probability 1 − r, then the main result of [31] provides the bound: dTV(πt, π) ≤ P(Tc > t) × (1 − r^c)^{−1}. As c increases, for any r ∈ (0, 1) the upper bound approaches P(Tc > t), which itself is small if t is a large quantile of the meeting time Tc. A limitation of this result is its reliance on the quantity r, which might be unknown or very close to one in challenging settings. Another difference is that we rely on pairs of lagged chains and tune the lag L, while the tuning parameter in [31] is the number of coupled chains c.

3 Experiments and applications

3.1 Ising model

We consider an Ising model, where the target is defined on a large discrete space, namely a square lattice with 32 × 32 sites (each site has 4 neighbors) and periodic boundaries. For a state x ∈ {−1, +1}^{32×32}, we define the target probability πβ(x) ∝ exp(β Σ_{i∼j} xixj), where the sum is over all pairs i, j of neighboring sites. As β increases, the correlation between nearby sites increases and single-site Gibbs samplers are known to perform poorly [39]. Difficulties in the assessment of the convergence of these samplers are in part due to the discrete nature of the state space, which limits the possibilities of visual diagnostics. 
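For this target, the single-site Gibbs update has a simple closed form: conditionally on its four neighbours, site (i, j) is set to +1 with probability 1/(1 + exp(−2β s)), where s is the sum of the neighbouring spins. A brief Python sketch of one sweep (our own illustration with the paper's lattice size and β; the experiments themselves use R):

```python
import numpy as np

def ssg_sweep(x, beta, rng):
    # One sweep of single-site Gibbs on a periodic Ising lattice.
    # x is an n-by-n array of -1/+1 spins, updated in place.
    n = x.shape[0]
    for i in range(n):
        for j in range(n):
            s = (x[(i - 1) % n, j] + x[(i + 1) % n, j]
                 + x[i, (j - 1) % n] + x[i, (j + 1) % n])
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))  # P(x_ij = +1 | rest)
            x[i, j] = 1 if rng.uniform() < p_plus else -1
    return x

rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(32, 32))  # initial distribution of Section 3.1
ssg_sweep(x, beta=0.46, rng=rng)
```

Coupling two such samplers so that they meet exactly (for the bounds below) additionally requires sharing the uniforms, or a maximal coupling of the two conditional distributions at each site; the details are in the paper's supplementary material.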
Users might observe trace plots of one-dimensional statistics of the chains, such as x ↦ Σ_{i∼j} xixj, and declare convergence when the statistic seems to stabilize; see [55, 60] where trace plots of summary statistics are used to monitor Markov chains.
Here we compute the proposed upper bounds on the TV distance for two algorithms: a single-site Gibbs sampler (SSG) and a parallel tempering (PT) algorithm, where different chains target different πβ with SSG updates, and regularly attempt to swap their states [22, 53]. The initial distribution assigns −1 and +1 with equal probability on each site independently. For β = 0.46, we obtain TV bounds for SSG using a lag L = 10^6, and N = 500 independent repeats. For PT we use 12 chains, each targeting πβ with β in an equispaced grid ranging from 0.3 to 0.46, a frequency of swap moves of 0.02, and a lag L = 2 × 10^4. The results are in Figure 3, where we see a plateau for the TV bounds on SSG and faster convergence for the TV bounds on PT. Our results are consistent with theoretical work on faster mixing times of PT targeting multimodal distributions including Ising models [59]. Note that the targets are different for the two algorithms, as PT operates on an extended space. The behavior of meeting times of coupled chains motivated by the "coupling from the past" algorithm [44] for Ising models has been studied e.g. in [11].

Figure 3: Single-site Gibbs (SSG) versus Parallel Tempering (PT) for an Ising model; bounds on the total variation distance between πt and π, for t up to 10^6 and inverse temperature β = 0.46.

3.2 Logistic regression

We next consider a target on a continuous state space defined as the posterior in a Bayesian logistic regression. Consider the German Credit data from [34]. 
There are n = 1000 binary responses (Yi)_{i=1}^{n} ∈ {−1, 1}^n indicating whether individuals are creditworthy or not, and d = 49 covariates xi ∈ Rd for each individual i. The logistic regression model states P(Yi = yi|xi) = (1 + exp(−yi xi^T β))^{−1}, with a normal prior β ∼ N(0, 10 Id). We can sample from the posterior using Hamiltonian Monte Carlo (HMC, [40]) or the Pólya-Gamma Gibbs sampler (PG, [43]). The former involves tuning parameters εHMC and SHMC corresponding to a step size and a number of steps in a leapfrog integration scheme performed at every iteration. We can use the proposed bounds to compare convergence associated with HMC for different εHMC, SHMC, and with the PG sampler. Figure 4 shows the total variation bounds for HMC with εHMC = 0.025 and SHMC = 4, 5, 6, 7, and the corresponding bound for the parameter-free PG sampler, both starting from π0 ∼ N(0, 10 Id). In this example, the bounds are smaller for the PG sampler than for all HMC samplers under consideration. We emphasize that the HMC tuning parameters associated with the fastest convergence to stationarity might not necessarily be optimal in terms of the asymptotic variance of ergodic averages of functions of interest; see related discussions in [26]. Also, since the proposed upper bounds are not tight, the true convergence rates of the Markov chains under consideration may be ordered differently. The proposed upper bounds still allow a comparison of how confident we can be about the bias of different MCMC algorithms after a fixed number of iterations.

Figure 4: Proposed upper bounds on dTV(πt, π) for a Pólya-Gamma Gibbs sampler and for Hamiltonian Monte Carlo on a 49-dimensional posterior distribution in a logistic regression model. 
For HMC the step size is εHMC = 0.025 and the number of steps is SHMC = 4, 5, 6, 7.

3.3 Comparison of exact and approximate MCMC algorithms

In various settings approximate MCMC methods trade off asymptotic unbiasedness for gains in computational speed, e.g. [30, 50, 14]. We compare an approximate MCMC method (the Unadjusted Langevin Algorithm, ULA) with its exact counterpart (the Metropolis-Adjusted Langevin Algorithm, MALA) in various dimensions. Our target is a multivariate normal:

    π = N(0, Σ) where [Σ]_{i,j} = 0.5^{|i−j|} for 1 ≤ i, j ≤ d.

Both MALA and ULA chains start from π0 ∼ N(0, Id), and have step sizes of d^{−1/6} and 0.1 d^{−1/6} respectively. Step sizes are linked to an optimal scaling result of [47], and the 0.1 multiplicative factor for ULA ensures that the target distribution of ULA is close to π (see [13]). We can use couplings to study the mixing times tmix(ε) of the two algorithms, where tmix(ε) := inf{k ≥ 0 : dTV(πk, π) < ε}. Figure 5 highlights how the dimension impacts the estimated upper bounds on the mixing time tmix(0.25), calculated as inf{k ≥ 0 : Ê[max(0, ⌈(τ(L) − L − k)/L⌉)] < 0.25}, where Ê denotes an empirical average. The results are consistent with the theoretical analysis in [18]. For a strongly log-concave target such as N(0, Σ), Table 2 of [18] indicates mixing time upper bounds of order O(d) and O(d^2) for ULA and MALA respectively (with a non-warm start centered at the unique mode of the target). In comparison to theoretical studies in [13, 18], our bounds can be directly estimated by simulation. 
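The mixing-time estimate described above can be read off directly from simulated meeting times: the empirical bound is non-increasing in k and reaches zero once k exceeds the largest observed τ(L) − L. A small Python sketch (our own; the meeting times in the example are placeholders, not output of the experiments):

```python
import numpy as np

def estimated_tmix(meeting_times, lag, eps=0.25):
    # Smallest k with empirical E[max(0, ceil((tau(L) - L - k)/L))] < eps.
    # The bound is non-increasing in k and hits 0 at k = max(tau) - lag,
    # so a linear scan always terminates.
    taus = np.asarray(meeting_times, dtype=float)
    for k in range(int(taus.max()) + 1):
        if np.maximum(np.ceil((taus - lag - k) / lag), 0.0).mean() < eps:
            return k
    return int(taus.max())
```

Note that a single long meeting time can dominate the estimate: with meeting times (5, 5, 5, 9) and lag 1, the 0.25 threshold is only crossed at k = 8, echoing the sensitivity to rare long meetings discussed in Section 2.2.2.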
On the other hand, the bounds in [13, 18] are more explicit about the impact of different aspects of the problem, including dimension, step size, and features of the target.

Figure 5: Mixing time bounds for ULA and MALA targeting a multivariate Normal distribution, as a function of the dimension. The mixing time tmix(0.25) denotes the first iteration t for which the estimated TV between πt and π is less than 0.25.

4 Assessing the bias of sequential Monte Carlo samplers

Lastly, we consider the bias associated with samples generated by sequential Monte Carlo (SMC) samplers [16]; the bias of self-normalized importance samplers can be treated similarly. Let (wn, ξn)_{n=1}^{N} be the weighted sample from an SMC sampler with N particles targeting π, and let q(N) be the marginal distribution of a particle ξ sampled among (ξn)_{n=1}^{N} with probabilities (wn)_{n=1}^{N}. Our aim is to upper bound a distance between q(N) and π for a fixed N. We denote by Ẑ the normalizing constant estimator generated by the SMC sampler.
The particle independent MH algorithm (PIMH, [2]) operates as an independent MH algorithm using SMC samplers as proposals. Let (Ẑt)t≥0 be the normalizing constant estimates from a PIMH chain. Consider an L-lag coupling of a pair of such PIMH chains as introduced in [38], initializing the chains by running an SMC sampler. 
Here τ(L) is constructed so that it can be equal to L with positive probability; more precisely,

    τ(L) − (L − 1) | Ẑ_{L−1} ∼ Geometric(α(Ẑ_{L−1})),    (6)

where α(Ẑ) := E[min(1, Ẑ∗/Ẑ) | Ẑ] is the average acceptance probability of PIMH from a state with normalizing constant estimate Ẑ; see [38, Proposition 8] for a formal statement in the case of 1-lag couplings. With this insight, we can bound the TV distance between the target and particles generated by SMC samplers, using Theorem 2.5 applied with t = 0. Details are in the supplementary material. We obtain

    dTV(q(N), π) ≤ E[ max(0, ⌈(τ(L) − L)/L⌉) ] = E[ (1 − α(Ẑ_{L−1})) / (1 − (1 − α(Ẑ_{L−1}))^L) ].    (7)

The bound in (7) depends only on the distribution of the normalizing constant estimator Ẑ, and can be estimated using independent runs of the SMC sampler. We can also estimate the distribution of Ẑ from a single SMC sampler by appealing to large-N asymptotic results such as [4], combined with asymptotically valid variance estimators such as [33]. As N goes to infinity we expect α(Ẑ_{L−1}) to approach one and the proposed upper bound to go to zero. The proposed bound aligns with the common practice of considering the variance of Ẑ as a measure of global performance of SMC samplers.
Existing TV bounds for particle approximations, such as those in [15, Chapter 8] and [27], are more informative qualitatively but harder to approximate numerically. The result also applies to self-normalized importance samplers (see [46, Chapter 3] and [41, Chapter 8]). 
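Given draws of the acceptance probability α(Ẑ_{L−1}) from independent SMC runs, estimating the right-hand side of (7) is a single vectorized average. A Python sketch (our own; the α values in the example are placeholders, not from the experiments):

```python
import numpy as np

def smc_tv_bound(alphas, lag):
    # Empirical version of (7): average of (1 - a) / (1 - (1 - a)^L)
    # over draws a of alpha(Z-hat_{L-1}); the term equals 0 when a = 1.
    a = np.asarray(alphas, dtype=float)
    out = np.zeros_like(a)
    m = a < 1.0
    out[m] = (1.0 - a[m]) / (1.0 - (1.0 - a[m]) ** lag)
    return out.mean()

# With lag L = 1 the summand reduces to (1 - a)/a:
smc_tv_bound([0.5], lag=1)  # 1.0
```

As L grows the estimate decreases towards E[1 − α(Ẑ_{L−1})], and as the number of particles N grows, α concentrates near one and the bound goes to zero, as noted above. For self-normalized importance samplers a simpler direct bound is also available.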
In that case [1, Theorem\n2.1] shows dTV(q(N ), \u03c0) \u2264 6N\u22121\u03c1 for \u03c1 = E\u03be\u223cq[w(\u03be)2]/E\u03be\u223cq[w(\u03be)]2, with w the importance\nsampling weight function, which is a simpler and more informative bound; see also [8] for related\nresults and concentration inequalities.\n\n5 Discussion\n\nThe proposed method can be used to obtain guidance on the choice of burn-in, to compare different\nMCMC algorithms targeting the same distribution, and to compare mixing times of approximate\nand exact MCMC methods. The main requirement for the application of the method is the ability to\ngenerate coupled Markov chains that can meet exactly after a random but \ufb01nite number of iterations.\nThe couplings employed here, and described in supplementary materials, are not optimal in any\nway. As the couplings are algorithm-speci\ufb01c and not target-speci\ufb01c, they can potentially be added to\nstatistical software such as PyMC3 [51] or Stan [7].\nThe bounds are not tight, in part due to the couplings not being maximal [54], but experiments suggest\nthat they can be practical. The proposed bounds go to zero as t increases, making them informative\nat least for large enough t. The combination of time lags and coupling of more than two chains as\nin [31] could lead to new diagnostics. Further research might also complement the proposed upper\nbounds with lower bounds, obtained by considering speci\ufb01c functions among the classes of functions\nused to de\ufb01ne the integral probability metrics.\n\nAcknowledgments. The authors are grateful to Espen Bernton, Nicolas Chopin, Andrew Gelman,\nLester Mackey, John O\u2019Leary, Christian Robert, Jeffrey Rosenthal, James Scott, Aki Vehtari and\nreviewers for helpful comments on an earlier version of the manuscript. The second author gratefully\nacknowledges support by the National Science Foundation through awards DMS-1712872 and\nDMS-1844695. 
The \ufb01gures were created with packages [58, 57] in R Core Team [45].\n\n8\n\n\fReferences\n[1] S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart. Importance sampling:\n\nIntrinsic dimension and computational cost. Statistical Science, 32(3):405\u2013431, 08 2017.\n\n[2] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte\nCarlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology),\n72(3):269\u2013342, 2010.\n\n[3] Christophe Andrieu, Anthony Lee, and Matti Vihola. Uniform ergodicity of the iterated\nconditional SMC and geometric ergodicity of particle Gibbs samplers. Bernoulli, 24(2):842\u2013\n872, 2018.\n\n[4] Jean B\u00e9rard, Pierre Del Moral, and Arnaud Doucet. A lognormal central limit theorem for\nparticle approximations of normalizing constants. Electronic Journal of Probability, 19, 2014.\n\n[5] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. Coupling and convergence for\n\nHamiltonian Monte Carlo. arXiv preprint arXiv:1805.00452, 2018.\n\n[6] Stephen P. Brooks and Gareth O. Roberts. Assessing convergence of Markov chain Monte\n\nCarlo algorithms. Statistics and Computing, 8(4):319\u2013335, 1998.\n\n[7] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael\nBetancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan : A probabilistic\nprogramming language. Journal of Statistical Software, 76(1), 1 2017.\n\n[8] Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. Annals\n\nof Applied Probability, 28(2):1099\u20131135, 04 2018.\n\n[9] Nicolas Chopin and Sumeetpal S Singh. On particle Gibbs sampling. Bernoulli, 21(3):1855\u2013\n\n1883, 2015.\n\n[10] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of\n\ufb01t. 
In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2606–2615, New York, New York, USA, 2016. PMLR.

[11] Andrea Collevecchio, Eren Metin Elçi, Timothy M. Garoni, and Martin Weigel. On the coupling time of the heat-bath process for the Fortuin–Kasteleyn random-cluster model. Journal of Statistical Physics, 170(1):22–61, 2018.

[12] Mary Kathryn Cowles and Jeffrey S. Rosenthal. A simulation approach to convergence rates for Markov chain Monte Carlo algorithms. Statistics and Computing, 8(2):115–124, 1998.

[13] Arnak S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.

[14] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.

[15] Pierre Del Moral. Feynman-Kac Formulae. Springer New York, 2004.

[16] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

[17] Alain Durmus, Gersende Fort, and Éric Moulines. Subgeometric rates of convergence in Wasserstein distance for Markov chains. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 52, pages 1799–1822. Institut Henri Poincaré, 2016.

[18] Raaz Dwivedi, Yuansi Chen, Martin J. Wainwright, and Bin Yu. Log-concave sampling: Metropolis–Hastings algorithms are fast! In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 793–797. PMLR, 2018.

[19] Andrew Gelman and Stephen P. Brooks.
General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 1998.

[20] Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 1992.

[21] John Geweke. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. Bayesian Statistics, 1998.

[22] Charles Geyer. Markov chain Monte Carlo maximum likelihood. Technical report, University of Minnesota, School of Statistics, 1991.

[23] Peter W. Glynn and Chang-Han Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.

[24] Jackson Gorham, Andrew Duncan, Sebastian Vollmer, and Lester Mackey. Measuring sample quality with diffusions. arXiv preprint arXiv:1611.06972v6, 2018.

[25] Jackson Gorham and Lester Mackey. Measuring sample quality with Stein's method. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 226–234. Curran Associates, Inc., 2015.

[26] Jeremy Heng and Pierre E. Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106(2):287–302, 2019.

[27] Jonathan H. Huggins and Daniel M. Roy. Sequential Monte Carlo as approximate sampling: bounds, adaptive resampling via ∞-ESS, and an application to particle Gibbs. Bernoulli, 25(1):584–622, 2019.

[28] Pierre E. Jacob, Fredrik Lindsten, and Thomas B. Schön. Smoothing with couplings of conditional particle filters. Journal of the American Statistical Association, 2019.

[29] Pierre E. Jacob, John O'Leary, and Yves F. Atchadé. Unbiased Markov chain Monte Carlo with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2019.

[30] James E. Johndrow, Paulo Orenstein, and Anirban Bhattacharya.
Scalable MCMC for Bayes shrinkage priors. arXiv preprint arXiv:1705.00841v3, 2018.

[31] Valen E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.

[32] Valen E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.

[33] Anthony Lee and Nick Whiteley. Variance estimation in the particle filter. Biometrika, 105(3):609–625, 2018.

[34] Moshe Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[35] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 276–284, New York, New York, USA, 2016. PMLR.

[36] Oren Mangoubi and Aaron Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.

[37] Lawrence Middleton, George Deligiannidis, Arnaud Doucet, and Pierre E. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. arXiv preprint arXiv:1807.08691, 2018.

[38] Lawrence Middleton, George Deligiannidis, Arnaud Doucet, and Pierre E. Jacob. Unbiased smoothing using particle independent Metropolis-Hastings. In Proceedings of Machine Learning Research, volume 89, pages 2378–2387. PMLR, 2019.

[39] Elchanan Mossel and Allan Sly. Exact thresholds for Ising–Gibbs samplers on general graphs. The Annals of Probability, 41(1):294–328, 2013.

[40] Radford M. Neal. Bayesian learning via stochastic dynamics. Advances in Neural Information Processing Systems, 1993.

[41] Art B.
Owen. Monte Carlo theory, methods and examples. 2019.

[42] Gabriel Peyré and Marco Cuturi. Computational optimal transport. arXiv preprint arXiv:1803.00567v3, 2019.

[43] Nicholas G. Polson, James G. Scott, and Jesse Windle. Bayesian inference for logistic models using Polya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013.

[44] James G. Propp and David B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures & Algorithms, 9(1-2):223–252, 1996.

[45] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.

[46] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. Springer New York, 2013.

[47] Gareth O. Roberts and Jeffrey S. Rosenthal. Optimal scaling for various Metropolis–Hastings algorithms. Statistical Science, 16(4):351–367, 2001.

[48] Gareth O. Roberts and Jeffrey S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.

[49] Jeffrey S. Rosenthal. Analysis of the Gibbs sampler for a model related to James–Stein estimators. Statistics and Computing, 6(3):269–275, 1996.

[50] Daniel Rudolf and Nikolaus Schweizer. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli, 24(4A):2610–2639, 2018.

[51] John Salvatier, Thomas Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2(55), 2016.

[52] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.

[53] Saifuddin Syed, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet.
Non-reversible parallel tempering: an embarrassingly parallel MCMC scheme. arXiv preprint arXiv:1905.02939, 2019.

[54] Hermann Thorisson. On maximal and distributional coupling. The Annals of Probability, pages 873–876, 1986.

[55] Michalis K. Titsias and Christopher Yau. The Hamming ball sampler. Journal of the American Statistical Association, 112(520):1598–1611, 2017.

[56] Dootika Vats, James M. Flegal, and Galin L. Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 2019.

[57] Hadley Wickham. ggplot2: elegant graphics for data analysis. Springer, 2016.

[58] Claus O. Wilke. ggridges: Ridgeline plots in 'ggplot2'. R package version 0.4.1, 2017.

[59] Dawn B. Woodard, Scott C. Schmidler, and Mark Huber. Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. Annals of Applied Probability, 19(2):617–640, 2009.

[60] Giacomo Zanella. Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, 2019.