{"title": "Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path", "book": "Advances in Neural Information Processing Systems", "page_first": 1459, "page_last": 1467, "abstract": "This article provides the first procedure for computing a fully data-dependent interval that traps the mixing time $t_{mix}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time $t_{relax}$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $\\sqrt{n}$ rate, where $n$ is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least $\\Omega(t_{relax})$ times on the average. Finally, future directions of research are identified.", "full_text": "Mixing Time Estimation in Reversible Markov\n\nChains from a Single Sample Path\n\nDaniel Hsu\n\nColumbia University\ndjhsu@cs.columbia.edu\n\nAryeh Kontorovich\n\nBen-Gurion University\n\nCsaba Szepesv\u00b4ari\n\nUniversity of Alberta\n\nkaryeh@cs.bgu.ac.il\n\nszepesva@cs.ualberta.ca\n\nAbstract\n\nThis article provides the \ufb01rst procedure for computing a fully data-dependent in-\nterval that traps the mixing time tmix of a \ufb01nite reversible ergodic Markov chain at\na prescribed con\ufb01dence level. 
The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge. The interval is constructed around the relaxation time trelax, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a √n rate, where n is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least Ω(trelax) times on the average. Finally, future directions of research are identified.\n\n1 Introduction\n\nThis work tackles the challenge of constructing fully empirical bounds on the mixing time of Markov chains based on a single sample path. Let (Xt)t=1,2,... be an irreducible, aperiodic time-homogeneous Markov chain on a finite state space [d] := {1, 2, . . . , d} with transition matrix P. Under this assumption, the chain converges to its unique stationary distribution π = (πi)i∈[d] regardless of the initial state distribution q:\n\nlim t→∞ Prq(Xt = i) = lim t→∞ (qP^t)i = πi for each i ∈ [d].\n\nThe mixing time tmix of the Markov chain is the number of time steps required for the chain to be within a fixed threshold of its stationary distribution:\n\ntmix := min{ t ∈ N : sup_q max_{A⊂[d]} |Prq(Xt ∈ A) − π(A)| ≤ 1/4 }. (1)\n\nHere, π(A) = Σ_{i∈A} πi is the probability assigned to set A by π, and the supremum is over all possible initial distributions q. The problem studied in this work is the construction of a non-trivial confidence interval Cn = Cn(X1, X2, . . . , Xn, δ) ⊂ [0, ∞], based only on the observed sample path (X1, X2, . . . , Xn) and δ ∈ (0, 1), that succeeds with probability 1 − δ in trapping the value of the mixing time tmix.\n\nThis problem is motivated by the numerous scientific applications and machine learning tasks in which the quantity of interest is the mean π(f) = Σ_i πi f(i) for some function f of the states of a Markov chain. This is the setting of the celebrated Markov Chain Monte Carlo (MCMC) paradigm [1], but the problem also arises in performance prediction involving time-correlated data, as is common in reinforcement learning [2]. Observable bounds on mixing times are useful in the design and diagnostics of these methods; they yield effective approaches to assessing the estimation quality, even when a priori knowledge of the mixing time or correlation structure is unavailable.\n\nMain results. We develop the first procedure for constructing non-trivial and fully empirical confidence intervals for the Markov mixing time. Consider a reversible ergodic Markov chain on d states with absolute spectral gap γ⋆ and stationary distribution minorized by π⋆. As is well known [3, Theorems 12.3 and 12.4],\n\n(trelax − 1) ln 2 ≤ tmix ≤ trelax ln(4/π⋆), (2)\n\nwhere trelax := 1/γ⋆ is the relaxation time. Hence, it suffices to estimate γ⋆ and π⋆. Our main results are summarized as follows.\n\n1. In Section 3.1, we show that in some problems n = Ω((d log d)/γ⋆ + 1/π⋆) observations are necessary for any procedure to guarantee constant multiplicative accuracy in estimating γ⋆ (Theorems 1 and 2). 
Essentially, in some problems every state may need to be visited about log(d)/γ⋆ times, on average, before an accurate estimate of the mixing time can be provided, regardless of the actual estimation procedure used.\n\n2. In Section 3.2, we give a point-estimator for γ⋆, and prove in Theorem 3 that it achieves multiplicative accuracy from a single sample path of length Õ(1/(π⋆γ⋆³)).¹ We also provide a point-estimator for π⋆ that requires a sample path of length Õ(1/(π⋆γ⋆)). This establishes the feasibility of estimating the mixing time in this setting. However, the valid confidence intervals suggested by Theorem 3 depend on the unknown quantities π⋆ and γ⋆. We also discuss the importance of reversibility, and some possible extensions to non-reversible chains.\n\n3. In Section 4, the construction of valid fully empirical confidence intervals for π⋆ and γ⋆ is considered. First, the difficulty of the task is explained, i.e., why the standard approach of turning the finite-time confidence intervals of Theorem 3 into fully empirical ones fails. Combining several results from perturbation theory in a novel fashion, we propose a new procedure and prove that it avoids slow convergence (Theorem 4). We also explain how to combine the empirical confidence intervals from Algorithm 1 with the non-empirical bounds from Theorem 3 to produce valid empirical confidence intervals. We prove in Theorem 5 that the width of these new intervals converges to zero asymptotically at least as fast as the widths from either Theorem 3 or Theorem 4.\n\nRelated work. There is a vast statistical literature on estimation in Markov chains. 
For instance, it is known that under the assumptions on (Xt)t from above, the law of large numbers guarantees that the sample mean πn(f) := (1/n) Σ_{t=1}^{n} f(Xt) converges almost surely to π(f) [4], while the central limit theorem tells us that as n → ∞, the distribution of the deviation √n(πn(f) − π(f)) will be normal with mean zero and asymptotic variance lim n→∞ n Var(πn(f)) [5].\n\nAlthough these asymptotic results help us understand the limiting behavior of the sample mean over a Markov chain, they say little about the finite-time, non-asymptotic behavior, which is often needed for the prudent evaluation of a method or even its algorithmic design [6–13]. To address this need, numerous works have developed Chernoff-type bounds on Pr(|πn(f) − π(f)| > ε), thus providing valuable tools for non-asymptotic probabilistic analysis [6, 14–16]. These probability bounds are larger than corresponding bounds for independent and identically distributed (iid) data due to the temporal dependence; intuitively, for the Markov chain to yield a fresh draw Xt′ that behaves as if it were independent of Xt, one must wait Θ(tmix) time steps. Note that the bounds generally depend on distribution-specific properties of the Markov chain (e.g., P, tmix, γ⋆), which are often unknown a priori in practice. Consequently, much effort has been put towards estimating these unknown quantities, especially in the context of MCMC diagnostics, in order to provide data-dependent assessments of estimation accuracy [e.g., 11, 12, 17–19]. However, these approaches generally only provide asymptotic guarantees, and hence fall short of our goal of empirical bounds that are valid with any finite-length sample path.\n\nLearning with dependent data is another main motivation for our work. 
Many results from statistical learning and empirical process theory have been extended to sufficiently fast mixing, dependent data [e.g., 20–26], providing learnability assurances (e.g., generalization error bounds). These results are often given in terms of mixing coefficients, which can be consistently estimated in some cases [27]. However, the convergence rates of the estimates from [27], which are needed to derive confidence bounds, are given in terms of unknown mixing coefficients. When the data comes from a Markov chain, these mixing coefficients can often be bounded in terms of mixing times, and hence our main results provide a way to make them fully empirical, at least in the limited setting we study.\n\n¹The Õ(·) notation suppresses logarithmic factors.\n\nIt is possible to eliminate many of the difficulties presented above when allowed more flexible access to the Markov chain. For example, given a sampling oracle that generates independent transitions from any given state (akin to a “reset” device), the mixing time becomes an efficiently testable property in the sense studied in [28, 29]. On the other hand, when one only has a circuit-based description of the transition probabilities of a Markov chain over an exponentially-large state space, there are complexity-theoretic barriers for many MCMC diagnostic problems [30].\n\n2 Preliminaries\n\n2.1 Notations\n\nWe denote the set of positive integers by N, and the set of the first d positive integers {1, 2, . . . , d} by [d]. The non-negative part of a real number x is [x]+ := max{0, x}, and ⌈x⌉+ := max{0, ⌈x⌉}. We use ln(·) for natural logarithm, and log(·) for logarithm with an arbitrary constant base. Boldface symbols are used for vectors and matrices (e.g., v, M), and their entries are referenced by subindexing (e.g., vi, Mi,j). 
For a vector v, ‖v‖ denotes its Euclidean norm; for a matrix M, ‖M‖ denotes its spectral norm. We use Diag(v) to denote the diagonal matrix whose (i, i)-th entry is vi. The probability simplex is denoted by Δd−1 = {p ∈ [0, 1]d : Σi pi = 1}, and we regard vectors in Δd−1 as row vectors.\n\n2.2 Setting\n\nLet P ∈ (Δd−1)d ⊂ [0, 1]d×d be a d × d row-stochastic matrix for an ergodic (i.e., irreducible and aperiodic) Markov chain. This implies there is a unique stationary distribution π ∈ Δd−1 with πi > 0 for all i ∈ [d] [3, Corollary 1.17]. We also assume that P is reversible (with respect to π):\n\nπiPi,j = πjPj,i, i, j ∈ [d]. (3)\n\nThe minimum stationary probability is denoted by π⋆ := mini∈[d] πi.\n\nDefine the matrices\n\nM := Diag(π)P and L := Diag(π)−1/2 M Diag(π)−1/2.\n\nThe (i, j)-th entry Mi,j of the matrix M contains the doublet probability associated with P: Mi,j = πiPi,j is the probability of seeing state i followed by state j when the chain is started from its stationary distribution. The matrix M is symmetric on account of the reversibility of P, and hence it follows that L is also symmetric. (We will strongly exploit this symmetry in our results.) Further, L = Diag(π)1/2 P Diag(π)−1/2, hence L and P are similar, and thus their eigenvalue systems are identical. Ergodicity and reversibility imply that the eigenvalues of L are contained in the interval (−1, 1], and that 1 is an eigenvalue of L with multiplicity 1 [3, Lemmas 12.1 and 12.2]. Denote and order the eigenvalues of L as\n\n1 = λ1 > λ2 ≥ · · · ≥ λd > −1.\n\nLet λ⋆ := max{λ2, |λd|}, and define the (absolute) spectral gap to be γ⋆ := 1 − λ⋆, which is strictly positive on account of ergodicity.\n\nLet (Xt)t∈N be a Markov chain whose transition probabilities are governed by P. For each t ∈ N, let π(t) ∈ Δd−1 denote the marginal distribution of Xt, so\n\nπ(t+1) = π(t)P, t ∈ N.\n\nNote that the initial distribution π(1) is arbitrary, and need not be the stationary distribution π.\n\nThe goal is to estimate π⋆ and γ⋆ from the length-n sample path (Xt)t∈[n], and also to construct fully empirical confidence intervals that trap π⋆ and γ⋆ with high probability; in particular, the construction of the intervals should not depend on any unobservable quantities, including π⋆ and γ⋆ themselves. As mentioned in the introduction, it is well known that the mixing time tmix of the Markov chain (defined in Eq. (1)) is bounded in terms of π⋆ and γ⋆, as shown in Eq. (2). Moreover, convergence rates for empirical processes on Markov chain sequences are also often given in terms of mixing coefficients that can ultimately be bounded in terms of π⋆ and γ⋆ (as we will show in the proof of our first result). 
Therefore, valid confidence intervals for π⋆ and γ⋆ can be used to make these rates fully observable.\n\n3 Point estimation\n\nIn this section, we present lower and upper bounds on achievable rates for estimating the spectral gap as a function of the length of the sample path n.\n\n3.1 Lower bounds\n\nThe purpose of this section is to show lower bounds on the number of observations necessary to achieve a fixed multiplicative (or even just additive) accuracy in estimating the spectral gap γ⋆. By Eq. (2), the multiplicative accuracy lower bound for γ⋆ gives the same lower bound for estimating the mixing time. Our first result holds even for two-state Markov chains and shows that a sequence length of Ω(1/π⋆) is necessary to achieve even a constant additive accuracy in estimating γ⋆.\n\nTheorem 1. Pick any π̄ ∈ (0, 1/4). Consider any estimator γ̂⋆ that takes as input a random sample path of length n ≤ 1/(4π̄) from a Markov chain starting from any desired initial state distribution. There exists a two-state ergodic and reversible Markov chain distribution with spectral gap γ⋆ ≥ 1/2 and minimum stationary probability π⋆ ≥ π̄ such that\n\nPr[|γ̂⋆ − γ⋆| ≥ 1/8] ≥ 3/8.\n\nNext, considering d-state chains, we show that a sequence of length Ω(d log(d)/γ⋆) is required to estimate γ⋆ up to a constant multiplicative accuracy. Essentially, the sequence may have to visit all d states at least log(d)/γ⋆ times each, on average. This holds even if π⋆ is within a factor of two of the largest possible value of 1/d that it can take, i.e., when π is nearly uniform.\n\nTheorem 2. There is an absolute constant c > 0 such that the following holds. Pick any positive integer d ≥ 3 and any γ̄ ∈ (0, 1/2). 
Consider any estimator γ̂⋆ that takes as input a random sample path of length n < cd log(d)/γ̄ from a d-state reversible Markov chain starting from any desired initial state distribution. There is an ergodic and reversible Markov chain distribution with spectral gap γ⋆ ∈ [γ̄, 2γ̄] and minimum stationary probability π⋆ ≥ 1/(2d) such that\n\nPr[|γ̂⋆ − γ⋆| ≥ γ̄/2] ≥ 1/4.\n\nThe proofs of Theorems 1 and 2 are given in Appendix A.²\n\n²A full version of this paper, with appendices, is available on arXiv [31].\n\n3.2 A plug-in based point estimator and its accuracy\n\nLet us now consider the problem of estimating γ⋆. For this, we construct a natural plug-in estimator. Along the way, we also provide an estimator for the minimum stationary probability, allowing one to use the bounds from Eq. (2) to trap the mixing time.\n\nDefine the random matrix M̂ ∈ [0, 1]d×d and random vector π̂ ∈ Δd−1 by\n\nM̂i,j := |{t ∈ [n − 1] : (Xt, Xt+1) = (i, j)}| / (n − 1), i, j ∈ [d],\nπ̂i := |{t ∈ [n] : Xt = i}| / n, i ∈ [d].\n\nFurthermore, define\n\nSym(L̂) := (1/2)(L̂ + L̂⊤)\n\nto be the symmetrized version of the (possibly non-symmetric) matrix\n\nL̂ := Diag(π̂)−1/2 M̂ Diag(π̂)−1/2.\n\nLet λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂d be the eigenvalues of Sym(L̂). Our estimator of the minimum stationary probability π⋆ is π̂⋆ := mini∈[d] π̂i, and our estimator of the spectral gap γ⋆ is γ̂⋆ := 1 − max{λ̂2, |λ̂d|}.\n\nThese estimators have the following accuracy guarantees:\n\nTheorem 3. There exists an absolute constant C > 0 such that the following holds. Assume the estimators π̂⋆ and γ̂⋆ described above are formed from a sample path of length n from an ergodic and reversible Markov chain. Let γ⋆ > 0 denote the spectral gap and π⋆ > 0 the minimum stationary probability. For any δ ∈ (0, 1), with probability at least 1 − δ,\n\n|π̂⋆ − π⋆| ≤ C ( √( π⋆ log(d/(π⋆δ)) / (γ⋆n) ) + log(d/(π⋆δ)) / (γ⋆n) ) (4)\n\nand\n\n|γ̂⋆ − γ⋆| ≤ C ( √( log(d/(π⋆δ)) / (π⋆γ⋆n) ) + log(1/δ) · log(n) / (π⋆γ⋆n) + log(1/γ⋆) / (γ⋆n) ). (5)\n\nTheorem 3 implies that the sequence lengths required to estimate π⋆ and γ⋆ to within constant multiplicative factors are, respectively, Õ(1/(π⋆γ⋆)) and Õ(1/(π⋆γ⋆³)). By Eq. (2), the second of these is also a bound on the required sequence length to estimate tmix.\n\nThe proof of Theorem 3 is based on analyzing the convergence of the sample averages M̂ and π̂ to their expectations, and then using perturbation bounds for eigenvalues to derive a bound on the error of γ̂⋆. However, since these averages are formed using a single sample path from a (possibly) non-stationary Markov chain, we cannot use standard large deviation bounds; moreover, applying Chernoff-type bounds for Markov chains to each entry of M̂ would result in a significantly worse sequence length requirement, roughly a factor of d larger. 
Instead, we adapt probability tail bounds for sums of independent random matrices [32] to our non-iid setting by directly applying a blocking technique of [33] as described in the article of [20]. Due to ergodicity, the convergence rate can be bounded without any dependence on the initial state distribution π(1). The proof of Theorem 3 is given in Appendix B.\n\nNote that because the eigenvalues of L are the same as those of the transition probability matrix P, we could have instead opted to estimate P, say, using simple frequency estimates obtained from the sample path, and then computing the second largest eigenvalue of this empirical estimate P̂. In fact, this approach is a way to extend to non-reversible chains, as we would no longer rely on the symmetry of M or L. The difficulty with this approach is that P lacks the structure required by certain strong eigenvalue perturbation results. One could instead invoke the Ostrowski–Elsner theorem [cf. Theorem 1.4 on Page 170 of 34], which bounds the matching distance between the eigenvalues of a matrix A and its perturbation A + E by O(‖E‖1/d). Since ‖P̂ − P‖ is expected to be of size O(n−1/2), this approach would give a confidence interval for γ⋆ whose width shrinks at a rate of O(n−1/(2d)), an exponential slow-down compared to the rate from Theorem 3. As demonstrated through an example from [34], the dependence on the d-th root of the norm of the perturbation cannot be avoided in general. 
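To make the plug-in construction concrete, here is a minimal NumPy sketch of the Section 3.2 estimators (our illustration, not the authors' code; the function name and interface are hypothetical, and every state is assumed to be visited at least once):

```python
import numpy as np

def estimate_pi_and_gap(path, d):
    # Plug-in estimators of Section 3.2 (illustrative sketch): doublet
    # frequencies M_hat, empirical occupancies pi_hat, and the spectral gap
    # of the symmetrized L_hat. Assumes every state in {0, ..., d-1} appears.
    path = np.asarray(path)
    n = len(path)
    # M_hat[i, j] estimates pi_i * P_{i,j} (doublet probabilities).
    M_hat = np.zeros((d, d))
    np.add.at(M_hat, (path[:-1], path[1:]), 1.0 / (n - 1))
    # pi_hat[i] is the empirical frequency of state i.
    pi_hat = np.bincount(path, minlength=d) / n
    # L_hat = Diag(pi_hat)^{-1/2} M_hat Diag(pi_hat)^{-1/2}, then symmetrize.
    s = 1.0 / np.sqrt(pi_hat)
    L_hat = s[:, None] * M_hat * s[None, :]
    sym = 0.5 * (L_hat + L_hat.T)
    evals = np.sort(np.linalg.eigvalsh(sym))   # ascending order
    lam_star = max(evals[-2], abs(evals[0]))   # max{lambda_2, |lambda_d|}
    return pi_hat.min(), 1.0 - lam_star        # (pi_hat_star, gamma_hat_star)
```

For a two-state chain with transition matrix [[0.7, 0.3], [0.3, 0.7]] (so γ⋆ = 0.6 and π = (1/2, 1/2)), the returned estimates approach (0.5, 0.6) as the sample path grows.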
Our approach based on estimating a symmetric matrix affords us the use of perturbation results that exploit more structure.\n\nAlgorithm 1 Empirical confidence intervals\nInput: Sample path (X1, X2, . . . , Xn), confidence parameter δ ∈ (0, 1).\n1: Compute state visit counts and smoothed transition probability estimates:\nNi := |{t ∈ [n − 1] : Xt = i}|, i ∈ [d];\nNi,j := |{t ∈ [n − 1] : (Xt, Xt+1) = (i, j)}|, (i, j) ∈ [d]²;\nP̂i,j := (Ni,j + 1/d) / (Ni + 1).\n2: Let Â# be the group inverse of Â := I − P̂.\n3: Let π̂ ∈ Δd−1 be the unique stationary distribution for P̂.\n4: Compute the eigenvalues λ̂1 ≥ λ̂2 ≥ · · · ≥ λ̂d of Sym(L̂), where L̂ := Diag(π̂)1/2 P̂ Diag(π̂)−1/2.\n5: Spectral gap estimate: γ̂⋆ := 1 − max{λ̂2, |λ̂d|}.\n6: Empirical bounds for |P̂i,j − Pi,j| for (i, j) ∈ [d]²: c := 1.01, τn,δ := inf{t ≥ 0 : 2d²(1 + ⌈logc t⌉+)e−t ≤ δ}, and\nB̂i,j := √( cτn,δ / (2Ni) ) + √( 2cP̂i,j(1 − P̂i,j)τn,δ / Ni ) + ( (5/3)τn,δ + |P̂i,j − 1/d| ) / Ni.\n7: Relative sensitivity of π:\nκ̂ := (1/2) max{ Â#j,j − min{ Â#i,j : i ∈ [d] } : j ∈ [d] }.\n8: Empirical bounds for maxi∈[d] |π̂i − πi| and maxi∈[d] { |√(πi/π̂i) − 1|, |√(π̂i/πi) − 1| }:\nb̂ := κ̂ max{ B̂i,j : (i, j) ∈ [d]² }, ρ̂ := (1/2) maxi∈[d] b̂ / [π̂i − b̂]+.\n9: Empirical bound for |γ̂⋆ − γ⋆|:\nŵ := 2ρ̂ + ρ̂² + (1 + 2ρ̂ + ρ̂²) ( Σ(i,j)∈[d]² (π̂i/π̂j) B̂i,j² )^(1/2).\n\nReturning to the question of obtaining a fully empirical confidence interval for γ⋆ and π⋆, we notice that, unfortunately, Theorem 3 falls short of being directly suitable for this, at least without further assumptions. This is because the deviation terms themselves depend inversely on both γ⋆ and π⋆, and hence can never rule out 0 (or an arbitrarily small positive value) as a possibility for γ⋆ or π⋆.³ In effect, the fact that the Markov chain could be slow mixing and the long-term frequency of some states could be small makes it difficult to be confident in the estimates provided by γ̂⋆ and π̂⋆.\n\n³Using Theorem 3, it is possible to trap γ⋆ in the union of two empirical confidence intervals: one around γ̂⋆ and the other around zero, both of which shrink in width as the sequence length increases. 
This suggests that in order to obtain fully empirical confidence intervals, we need an estimator that is not subject to such effects; we pursue this in Section 4. Theorem 3 thus primarily serves as a point of comparison for what is achievable in terms of estimation accuracy when one does not need to provide empirical confidence bounds.\n\n4 Fully empirical confidence intervals\n\nIn this section, we address the shortcoming of Theorem 3 and give fully empirical confidence intervals for the stationary probabilities and the spectral gap γ⋆. The main idea is to use the Markov property to eliminate the dependence of the confidence intervals on the unknown quantities (including π⋆ and γ⋆). Specifically, we estimate the transition probabilities from the sample path using simple frequency estimates: as a consequence of the Markov property, for each state, the frequency estimates converge at a rate that depends only on the number of visits to the state, and in particular the rate (given the visit count of the state) is independent of the mixing time of the chain.\n\nAs discussed in Section 3, it is possible to form a confidence interval for γ⋆ based on the eigenvalues of an estimated transition probability matrix by appealing to the Ostrowski–Elsner theorem. However, as explained earlier, this would lead to a slow O(n−1/(2d)) rate. We avoid this slow rate by using an estimate of the symmetric matrix L, so that we can use a stronger perturbation result (namely Weyl’s inequality, as in the proof of Theorem 3) available for symmetric matrices.\n\nTo form an estimate of L based on an estimate of the transition probabilities, one possibility is to estimate π using a frequency-based estimate for π as was done in Section 3, and appeal to the relation L = Diag(π)1/2 P Diag(π)−1/2 to form a plug-in estimate. 
However, as noted in Section 3.2, confidence intervals for the entries of π formed this way may depend on the mixing time. Indeed, such an estimate of π does not exploit the Markov property.\n\nWe adopt a different strategy for estimating π, which leads to our construction of empirical confidence intervals, detailed in Algorithm 1. We form the matrix P̂ using smoothed frequency estimates of P (Step 1), then compute the so-called group inverse Â# of Â = I − P̂ (Step 2), followed by finding the unique stationary distribution π̂ of P̂ (Step 3), this way decoupling the bound on the accuracy of π̂ from the mixing time. The group inverse Â# of Â is uniquely defined; and if P̂ defines an ergodic chain (which is the case here due to the use of the smoothed estimates), Â# can be computed at the cost of inverting a (d−1)×(d−1) matrix [35, Theorem 5.2].⁴ Further, once given Â#, the unique stationary distribution π̂ of P̂ can be read out from the last row of Â# [35, Theorem 5.3]. The group inverse is also used to compute the sensitivity of π. Based on π̂ and P̂, we construct the plug-in estimate L̂ of L, and use the eigenvalues of its symmetrization to form the estimate γ̂⋆ of the spectral gap (Steps 4 and 5). In the remaining steps, we use perturbation analyses to relate π̂ and π, viewing P as the perturbation of P̂, and also to relate γ̂⋆ and γ⋆, viewing L as a perturbation of Sym(L̂). Both analyses give error bounds entirely in terms of observable quantities (e.g., κ̂), tracing back to empirical error bounds for the smoothed frequency estimates of P.\n\nThe most computationally expensive step in Algorithm 1 is the computation of the group inverse Â#, which, as noted, reduces to matrix inversion. Thus, with a standard implementation of matrix inversion, the algorithm’s time complexity is O(n + d³), while its space complexity is O(d²).\n\nTo state our main theorem concerning Algorithm 1, we first define κ to be analogous to κ̂ from Step 7, with Â# replaced by the group inverse A# of A := I − P. The result is as follows.\n\nTheorem 4. Suppose Algorithm 1 is given as input a sample path of length n from an ergodic and reversible Markov chain and confidence parameter δ ∈ (0, 1). Let γ⋆ > 0 denote the spectral gap, π the unique stationary distribution, and π⋆ > 0 the minimum stationary probability. Then, on an event of probability at least 1 − δ,\n\nπi ∈ [π̂i − b̂, π̂i + b̂] for all i ∈ [d], and γ⋆ ∈ [γ̂⋆ − ŵ, γ̂⋆ + ŵ].\n\nMoreover, b̂ and ŵ almost surely satisfy (as n → ∞)\n\nb̂ = O( max(i,j)∈[d]² κ √( Pi,j log log n / (πi n) ) ), ŵ = O( (κ/π⋆) √( log log n / (π⋆n) ) + √( d log log n / (π⋆n) ) ).⁵\n\nThe proof of Theorem 4 is given in Appendix C. As mentioned above, the obstacle encountered in Theorem 3 is avoided by exploiting the Markov property. We establish fully observable upper and lower bounds on the entries of P that converge at a √(n/log log n) rate using standard martingale tail inequalities; this justifies the validity of the bounds from Step 6. 
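For intuition about Steps 2, 3, and 7 of Algorithm 1, here is a small NumPy sketch (our illustration, not the authors' implementation): it obtains the group inverse via the classical identity Â# = (Â + 1π̂)−1 − 1π̂, where 1π̂ is the rank-one limit matrix, rather than the (d−1)×(d−1) inversion of [35], and the function name is our own.

```python
import numpy as np

def group_inverse_and_kappa(P_hat):
    # Group inverse A# of A = I - P_hat for an ergodic chain, plus the
    # sensitivity kappa_hat of Step 7 (illustrative sketch). We first solve
    # for the stationary distribution pi_hat, then use the identity
    # A# = (A + 1 pi_hat)^{-1} - 1 pi_hat, instead of the (d-1)x(d-1)
    # inversion cited in the paper.
    d = P_hat.shape[0]
    A = np.eye(d) - P_hat
    # Stationary distribution: pi_hat A = 0 together with sum(pi_hat) = 1,
    # solved as a (consistent) least-squares system.
    M = np.vstack([A.T, np.ones(d)])
    b = np.zeros(d + 1)
    b[-1] = 1.0
    pi_hat = np.linalg.lstsq(M, b, rcond=None)[0]
    W = np.outer(np.ones(d), pi_hat)   # limit matrix: every row equals pi_hat
    A_sharp = np.linalg.inv(A + W) - W
    # Step 7: kappa_hat = (1/2) max_j ( A#_{j,j} - min_i A#_{i,j} ).
    kappa_hat = 0.5 * np.max(np.diag(A_sharp) - A_sharp.min(axis=0))
    return A_sharp, pi_hat, kappa_hat
```

The returned Â# satisfies the three defining identities from footnote 4 (AA#A = A, A#AA# = A#, AA# = A#A), which gives an easy sanity check on the computation.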
Properties of the group inverse [35, 36] and eigenvalue perturbation theory [34] are used to validate the empirical bounds on πi and γ⋆ developed in the remaining steps of the algorithm.\n\nThe first part of Theorem 4 provides valid empirical confidence intervals for each πi and for γ⋆, which are simultaneously valid at confidence level δ. The second part of Theorem 4 shows that the width of the intervals decreases as the sequence length increases. We show in Appendix C.5 that κ ≤ d/γ⋆, and hence\n\nb̂ = O( max(i,j)∈[d]² (d/γ⋆) √( Pi,j log log n / (πi n) ) ), ŵ = O( (d/(π⋆γ⋆)) √( log log n / (π⋆n) ) ).\n\n⁴The group inverse of a square matrix A, a special case of the Drazin inverse, is the unique matrix A# satisfying AA#A = A, A#AA# = A#, and A#A = AA#.\n\n⁵In Theorems 4 and 5, our use of big-O notation is as follows. For a random sequence (Yn)n and a (non-random) positive sequence (εθ,n)n parameterized by θ, we say “Yn = O(εθ,n) holds almost surely as n → ∞” if there is some universal constant C > 0 such that for all θ, lim supn→∞ Yn/εθ,n ≤ C holds almost surely.\n\nIt is easy to combine Theorems 3 and 4 to yield intervals whose widths shrink at least as fast as both the non-empirical intervals from Theorem 3 and the empirical intervals from Theorem 4. Specifically, determine lower bounds on π⋆ and γ⋆ using Algorithm 1: π⋆ ≥ mini∈[d] [π̂i − b̂]+, γ⋆ ≥ [γ̂⋆ − ŵ]+; then plug in these lower bounds for π⋆ and γ⋆ in the deviation bounds in Eq. (5) from Theorem 3. This yields a new interval centered around the estimate of γ⋆ from Theorem 3, and it no longer depends on unknown quantities. 
The interval is a valid 1 − 2δ probability confidence interval for γ⋆, and for sufficiently large n, the width shrinks at the rate given in Eq. (5). We can similarly construct an empirical confidence interval for π⋆ using Eq. (4), which is valid on the same 1 − 2δ probability event.6 Finally, we can take the intersection of these new intervals with the corresponding intervals from Algorithm 1. This is summarized in the following theorem, which we prove in Appendix D.

Theorem 5. The following holds under the same conditions as Theorem 4. For any δ ∈ (0, 1), the confidence intervals Û and V̂ described above for π⋆ and γ⋆, respectively, satisfy π⋆ ∈ Û and γ⋆ ∈ V̂ with probability at least 1 − 2δ. Furthermore, the widths of these intervals almost surely satisfy (as n → ∞)

$|\hat{U}| = O\left( \sqrt{\frac{\pi_\star \log\frac{d}{\delta}}{\gamma_\star n}} \right), \qquad |\hat{V}| = O\left( \min\left\{ \sqrt{\frac{\log\frac{d}{\pi_\star\delta} \cdot \log n}{\pi_\star\gamma_\star n}},\ \hat{w} \right\} \right)$,

where ŵ is the width from Algorithm 1.

5 Discussion

The construction used in Theorem 5 applies more generally: given a confidence interval of the form I_n = I_n(γ⋆, π⋆, δ) for some confidence level δ and a fully empirical confidence set E_n(δ) for (γ⋆, π⋆) at the same level, I′_n = E_n(δ) ∩ ⋃_{(γ,π)∈E_n(δ)} I_n(γ, π, δ) is a valid fully empirical 2δ-level confidence interval whose asymptotic width matches that of I_n up to lower-order terms, under reasonable assumptions on E_n and I_n. In particular, this suggests that future work should focus on closing the gap between the lower and upper bounds on the accuracy of point estimation.
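The interval operations in the construction above reduce to endpoint arithmetic; a minimal sketch (with made-up endpoints, not values from the paper) of intersecting a plug-in interval with the corresponding interval from Algorithm 1:

```python
# Minimal sketch of the interval-intersection step; endpoints are made up.

def intersect(a, b):
    """Intersection of two closed intervals given as (lo, hi) pairs."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("empty intersection")
    return (lo, hi)

plug_in_interval = (0.18, 0.42)     # hypothetical interval via Theorem 3
algorithm1_interval = (0.22, 0.55)  # hypothetical interval from Algorithm 1
V_hat = intersect(plug_in_interval, algorithm1_interval)
```

The intersection is at least as narrow as either input, so its width inherits the better of the two guarantees while remaining valid on the same probability event.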
Another interesting direction is to reduce the computational cost: the current cubic cost in the number of states can be too high even when the number of states is only moderately large.

Perhaps more important, however, is to extend our results to large state-space Markov chains: in most practical applications, the state space is continuous or is exponentially large in some natural parameters. As follows from our lower bounds, without further assumptions, the problem of fully data-dependent estimation of the mixing time is intractable for information-theoretic reasons. Interesting directions for future work must therefore consider Markov chains with specific structure. Parametric classes of Markov chains, including but not limited to Markov chains with factored transition kernels with a few factors, are promising candidates for such future investigations. The results presented here are a first step in the ambitious research agenda outlined above, and we hope that they will serve as a point of departure for further insights in the area of fully empirical estimation of Markov chain parameters based on a single sample path.

References

[1] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer-Verlag, 2001.

[2] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). A Bradford Book, 1998.

[3] D. Levin, Y. Peres, and E. Wilmer. Markov Chains and Mixing Times. AMS, 2008.

[4] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer, 1993.

[5] C. Kipnis and S. R. S. Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Comm. Math.
Phys., 104(1):1–19, 1986.

6 For the π⋆ interval, we plug in lower bounds on π⋆ and γ⋆ only where these quantities appear as 1/π⋆ and 1/γ⋆ in Eq. (4). It is then possible to "solve" for observable bounds on π⋆. See Appendix D for details.

[6] I. Kontoyiannis, L. A. Lastras-Montaño, and S. P. Meyn. Exponential bounds and stopping rules for MCMC and general Markov chains. In VALUETOOLS, page 45, 2006.

[7] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, pages 65–72, 2006.

[8] V. Mnih, Cs. Szepesvári, and J.-Y. Audibert. Empirical Bernstein stopping. In ICML, pages 672–679, 2008.

[9] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.

[10] L. Li, M. L. Littman, T. J. Walsh, and A. L. Strehl. Knows what it knows: a framework for self-aware learning. Machine Learning, 82(3):399–443, 2011.

[11] J. M. Flegal and G. L. Jones. Implementing MCMC: estimating with confidence. In Handbook of Markov Chain Monte Carlo, pages 175–197. Chapman & Hall/CRC, 2011.

[12] B. M. Gyori and D. Paulin. Non-asymptotic confidence intervals for MCMC in practice. arXiv:1212.2016, 2014.

[13] A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, 2015.

[14] D. Gillman. A Chernoff bound for random walks on expander graphs. SIAM Journal on Computing, 27(4):1203–1220, 1998.

[15] C. A. León and F. Perron. Optimal Hoeffding bounds for discrete reversible Markov chains. Annals of Applied Probability, pages 958–970, 2004.

[16] D. Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20:1–32, 2015.

[17] S. T. Garren and R. L. Smith.
Estimating the second largest eigenvalue of a Markov transition matrix. Bernoulli, 6:215–242, 2000.

[18] G. L. Jones and J. P. Hobert. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statist. Sci., 16(4):312–334, 2001.

[19] Y. Atchadé. Markov Chain Monte Carlo confidence intervals. Bernoulli, 2015. (to appear).

[20] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.

[21] R. L. Karandikar and M. Vidyasagar. Rates of uniform convergence of empirical means with mixing processes. Statistics and Probability Letters, 58(3):297–307, 2002.

[22] D. Gamarnik. Extension of the PAC framework to finite and countable Markov chains. IEEE Transactions on Information Theory, 49(1):338–345, 2003.

[23] M. Mohri and A. Rostamizadeh. Stability bounds for non-i.i.d. processes. In NIPS, 2008.

[24] M. Mohri and A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In NIPS, 2009.

[25] I. Steinwart and A. Christmann. Fast learning from non-i.i.d. observations. In NIPS, 2009.

[26] I. Steinwart, D. Hush, and C. Scovel. Learning from dependent observations. Journal of Multivariate Analysis, 100(1):175–194, 2009.

[27] D. McDonald, C. Shalizi, and M. Schervish. Estimating beta-mixing coefficients. In AISTATS, pages 516–524, 2011.

[28] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In FOCS, pages 259–269. IEEE, 2000.

[29] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing closeness of discrete distributions. Journal of the ACM, 60(1):4:2–4:25, 2013.

[30] N. Bhatnagar, A. Bogdanov, and E. Mossel. The computational complexity of estimating MCMC convergence time. In RANDOM, pages 424–435. Springer, 2011.

[31] D. Hsu, A.
Kontorovich, and C. Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. CoRR, abs/1506.02903, 2015.

[32] J. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 2015.

[33] S. Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Mathematische Annalen, 97:1–59, 1927.

[34] G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, Boston, 1990.

[35] C. D. Meyer Jr. The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review, 17(3):443–464, 1975.

[36] G. Cho and C. Meyer. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335:137–150, 2001.