{"title": "Learning Multiple Markov Chains via Adaptive Allocation", "book": "Advances in Neural Information Processing Systems", "page_first": 13343, "page_last": 13353, "abstract": "We study the problem of learning the transition matrices of a set of Markov chains from a single stream of observations on each chain. We assume that the Markov chains are ergodic but otherwise unknown. The learner can sample Markov chains sequentially to observe their states. The goal of the learner is to sequentially select various chains to learn transition matrices uniformly well with respect to some loss function. We introduce a notion of loss that naturally extends the squared loss for learning distributions to the case of Markov chains, and further characterize the notion of being \\emph{uniformly good} in all problem instances. We present a novel learning algorithm that efficiently balances \\emph{exploration} and \\emph{exploitation} intrinsic to this problem, without any prior knowledge of the chains. We provide finite-sample PAC-type guarantees on the performance of the algorithm. Further, we show that our algorithm asymptotically attains an optimal loss.", "full_text": "Learning Multiple Markov Chains\n\nvia Adaptive Allocation\n\nMohammad Sadegh Talebi\n\nSequeL Team, Inria Lille \u2013 Nord Europe\n\nsadegh.talebi@inria.fr\n\nOdalric-Ambrym Maillard\n\nSequeL Team, Inria Lille \u2013 Nord Europe\n\nodalric.maillard@inria.fr\n\nAbstract\n\nWe study the problem of learning the transition matrices of a set of Markov chains\nfrom a single stream of observations on each chain. We assume that the Markov\nchains are ergodic but otherwise unknown. The learner can sample Markov chains\nsequentially to observe their states. The goal of the learner is to sequentially select\nvarious chains to learn transition matrices uniformly well with respect to some\nloss function. 
We introduce a notion of loss that naturally extends the squared loss for learning distributions to the case of Markov chains, and further characterize the notion of being uniformly good in all problem instances. We present a novel learning algorithm that efficiently balances exploration and exploitation intrinsic to this problem, without any prior knowledge of the chains. We provide finite-sample PAC-type guarantees on the performance of the algorithm. Further, we show that our algorithm asymptotically attains an optimal loss.\n\n1 Introduction\n\nWe study a variant of the following sequential adaptive allocation problem: A learner is given a set of K arms, where to each arm k ∈ [K], an unknown real-valued distribution νk with mean µk and variance σ²k > 0 is associated. At each round t ∈ N, the learner must select an arm kt ∈ [K], and receives a sample drawn from νkt. Given a total budget of n pulls, the objective is to estimate the expected values (µk)k∈[K] of all distributions uniformly well. The quality of estimation in this problem is classically measured through the expected quadratic estimation error E[(µk − ˆµk,n)²] for the empirical mean estimate ˆµk,n built with the Tk,n = Σ_{t=1}^n I{k = kt} samples received from νk by time n, and the performance of an allocation strategy is the maximal error, max_{k∈[K]} E[(µk − ˆµk,n)²]. Using ideas from the Multi-Armed Bandit (MAB) literature, previous works (e.g., [1, 2]) have provided optimistic sampling strategies with near-optimal performance guarantees for this setup. This generic adaptive allocation problem is related to several applications arising in optimal experiment design [3, 4], active learning [5], or Monte-Carlo methods [6]; we refer to [1, 7, 2, 8] and references therein for further motivation.
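To make the classical setup above concrete, the variance-proportional static allocation it suggests can be sketched as follows. This is an illustrative aside, not code from the cited works; `static_allocation` is our own helper name, and numpy is assumed:

```python
import numpy as np

def static_allocation(variances, budget):
    """Allocate a sampling budget across arms proportionally to their
    variances, which (for deterministic allocations) equalizes the
    quadratic estimation errors sigma_k^2 / T_k,n across arms."""
    v = np.asarray(variances, dtype=float)
    shares = v / v.sum()
    alloc = np.floor(shares * budget).astype(int)
    # hand out the leftover pulls to the largest fractional remainders
    leftover = budget - alloc.sum()
    for i in np.argsort(shares * budget - alloc)[::-1][:leftover]:
        alloc[i] += 1
    return alloc
```

For instance, with variances (1, 3) and a budget of 8 pulls, the second arm receives three times as many samples as the first.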
We extend this line of work to the case where each process is a discrete Markov chain, hence introducing the problem of active bandit learning of Markov chains. More precisely, we no longer assume that (νk)k are real-valued distributions, but we study the case where each νk is a discrete Markov process over a state space S ⊂ N. The law of the observations (Xk,i)i∈N on arm (or chain) k is given by νk(Xk,1, . . . , Xk,n) = pk(Xk,1) Π_{i=2}^n Pk(Xk,i−1, Xk,i), where pk denotes the initial distribution of states, and Pk is the transition function of the Markov chain. The goal of the learner is to learn the transition matrices (Pk)k∈[K] uniformly well on the chains. Note that the chains are not controlled (we only decide which chain to advance, not the states it transits to).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBefore discussing the challenges of the extension to Markov chains, let us give further comments on the performance measure considered in bandit allocation for real-valued distributions: Using the expected quadratic estimation error on each arm k makes sense since, when the Tk,n, k ∈ [K], are deterministic, it coincides with σ²k/Tk,n, thus suggesting to pull arm k proportionally to its variance σ²k. However, for a learning strategy, Tk,n typically depends on all past observations. The analyses presented in this series of works rely on Wald's second identity as the technical device, heavily relying on the use of a quadratic loss criterion, which prevents one from extending the approach therein to other distances. Another peculiarity arising in working with expectations is the order of the “max” and “expectation” operators.
While it makes more sense to control the expected value of the maximum, the works cited above consider the maximum of the expected value, which is more in line with a pseudo-loss definition than with the loss; indeed, extensions of these works consider a pseudo-loss instead of this performance measure. As we show, all of these difficulties can be avoided by resorting to a high-probability setup. Hence, in this paper, we deviate from using an expected loss criterion, and rather use a high-probability control. We formally define our performance criterion in Section 2.3.\n\n1.1 Related Work\n\nOn the one hand, our setup can be framed within the line of works on active bandit allocation, considered for the estimation of reward distributions in MABs as introduced in [1, 7], and further studied in [2, 9]. This has been extended to stratified sampling for Monte-Carlo methods in [10, 8], and to continuous mean functions in, e.g., [11]. On the other hand, our extension from real-valued distributions to Markov chains can be framed within the rich literature on Markov chain estimation; see, e.g., [12, 13, 14, 15, 16, 17]. This stream of works extends a wide range of results from the i.i.d. case to the Markov case. These include, for instance, the law of large numbers for (functions of) state values [17], the central limit theorem for Markov sequences [13] (see also [17, 18]), and Chernoff-type and Bernstein-type concentration inequalities for Markov sequences [19, 20].
Note that the majority of\nthese results are available for ergodic Markov chains.\nAnother stream of research on Markov chains, which is more relevant to our work, investigates\nlearning and estimation of the transition matrix (as opposed to its full law); see, e.g., [16, 15, 21, 22].\nAmong the recent studies falling into this category, [22] investigates learning of the transition matrix\nwith respect to a loss function induced by f-divergences in a minimax setup, thus extending [23] to\nthe case of Markov chains. [21] derives a Probably Approximately Correct (PAC) type bound for\nlearning the transition matrix of an ergodic Markov chain with respect to the total variation loss. It\nfurther provides a matching lower bound. Among the existing literature on learning Markov chains,\nto the best of our knowledge, [21] is the closest to ours. There are however two aspects distinguishing\nour work: Firstly, the challenge in our problem resides in dealing with multiple Markov chains, which\nis present neither in [21] nor in the other studies cited above. Secondly, our notion of loss does not\ncoincide with that considered in [21], and hence, the lower bound of [21] does not apply to our case.\nAmong the results dealing with multiple chains, we may refer to learning in the Markovian bandits\nsetup [24, 25, 26]. Most of these studies address the problem of reward maximization over a \ufb01nite\ntime horizon. We also mention that in a recent study, [27] introduces the so-called active exploration\nin Markov decision processes, where the transition kernel is known, and the goal is rather to learn the\nmean reward associated to various states. To the best of our knowledge, none of these works address\nthe problem of learning the transition matrix. 
Last, as we target high-probability performance bounds (as opposed to those holding in expectation), our approach is naturally linked to PAC analysis. [28] provides one of the first PAC bounds for learning discrete distributions. Since then, the problem of learning discrete distributions has been well studied; see, e.g., [29, 30, 23] and references therein. We refer to [23] for a rather complete characterization of learning distributions in a minimax setting under a large class of smooth loss functions. We remark that, except for very few studies (e.g., [29]), most of these results are provided for discrete distributions.\n\n1.2 Overview and Contributions\n\nOur contributions are the following: (i) For the problem of learning Markov chains, we consider a notion of loss function which appropriately extends the loss function for learning distributions to the case of Markov chains. Our notion of loss is similar to that considered in [22] (we refer to Section 2.3 for a comparison between our notion and the one in [22]). In contrast to existing works on similar bandit allocation problems, our loss function avoids technical difficulties faced when extending the squared loss function to this setup. We further characterize the notion of a “uniformly good algorithm” under the considered loss function for ergodic chains; (ii) We present an optimistic algorithm, called BA-MC, for active learning of Markov chains, which is simple to implement and does not require any prior knowledge of the chains. To the best of our knowledge, this constitutes the first algorithm for active bandit allocation for learning Markov chains; (iii) We provide non-asymptotic PAC-type bounds as well as an asymptotic one on the loss incurred by BA-MC, indicating three regimes. In the first regime, which holds for any learning budget n ≥ 4K, we present (in Theorem 1) a high-probability bound on the loss scaling as Õ(KS²/n), where Õ(·) hides log(log(n)) factors. Here, K and S respectively denote the number of chains and the number of states in a given chain. This result holds for homogeneous Markov chains. We then characterize a cut-off budget ncutoff (in Theorem 2) so that when n ≥ ncutoff, the loss behaves as Õ(Λ/n + C0/n^{3/2}), where Λ = Σ_{k∈[K]} Σ_{x,y} Pk(x, y)(1 − Pk(x, y)) denotes the sum of the variances over all states and all chains, and where Pk denotes the transition matrix of chain k. This latter bound constitutes the second regime, in view of the fact that Λ/n equals the asymptotically optimal loss (see Section 2.4 for more details). Thus, this bound indicates that the pseudo-excess loss incurred by the algorithm vanishes at a rate C0 n^{−3/2} (we refer to Section 4 for a more precise definition). Furthermore, we carefully characterize the constant C0. In particular, we discuss that C0 does not deteriorate with the mixing times of the chains, which, we believe, is a strong feature of our algorithm. We also discuss how various properties of the chains, e.g., discrepancies between the stationary probabilities of the various states of a given chain, may impact the learning performance. Finally, we demonstrate a third regime, the asymptotic one, when the budget n grows large, in which we show (in Theorem 3) that the loss of BA-MC matches the asymptotically optimal loss Λ/n. All proofs are provided in the supplementary material.\n\nMarkov chains have been successfully used for modeling a broad range of practical problems, and their success makes the problem studied in this paper relevant in practice.
There are practical applications in reinforcement learning (e.g., active exploration in MDPs [27]) and in rested Markov bandits (e.g., channel allocation in wireless communication systems, where a given channel's state follows a Markov chain¹), for which we believe our contributions could serve as a technical tool.\n\n2 Preliminaries and Problem Statement\n\n2.1 Preliminaries\n\nBefore describing our model, we recall some preliminaries on Markov chains; these are standard definitions and results, and can be found in, e.g., [32, 33]. Consider a Markov chain defined on a finite state space S with cardinality S. Let P_S denote the collection of all row-stochastic matrices over S. The Markov chain is specified by its transition matrix P ∈ P_S and its initial distribution p: For all x, y ∈ S, P(x, y) denotes the probability of transitioning to y if the current state is x. In what follows, we may refer to a chain by just referring to its transition matrix.\n\nWe recall that a Markov chain P is ergodic if P^m > 0 (entry-wise) for some m ∈ N. If P is ergodic, then it has a unique stationary distribution π satisfying π = πP. Moreover, π̲ := min_{x∈S} π(x) > 0. A chain P is said to be reversible if its stationary distribution π satisfies the detailed balance equations: For all x, y ∈ S, π(x)P(x, y) = π(y)P(y, x). Otherwise, P is called non-reversible. For a Markov chain P, the largest eigenvalue is λ1(P) = 1 (with multiplicity one). In a reversible chain P, all eigenvalues belong to (−1, 1]. We define the absolute spectral gap of a reversible chain P as γ(P) = 1 − λ⋆(P), where λ⋆(P) denotes the second largest (in absolute value) eigenvalue of P. If P is reversible, the absolute spectral gap γ(P) controls the convergence rate of the state distribution of the chain towards the stationary distribution π.
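As a quick numerical illustration of these spectral quantities (our own sketch, not part of the paper; the function names are ours, and numpy is assumed):

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution pi with pi = pi P, obtained as the left
    eigenvector of P associated with eigenvalue 1, normalized to sum to 1."""
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def absolute_spectral_gap(P):
    """gamma(P) = 1 - lambda_star(P), where lambda_star is the second
    largest eigenvalue of P in absolute value (reversible chains)."""
    mags = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - mags[1]
```

For the reversible two-state chain P = [[0.9, 0.1], [0.2, 0.8]], the eigenvalues are 1 and 0.7, so the absolute spectral gap is 0.3 and the stationary distribution is (2/3, 1/3).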
If P is non-reversible, the convergence rate is determined by the pseudo-spectral gap introduced in [20], defined as follows. Define P⋆ by P⋆(x, y) := π(y)P(y, x)/π(x) for all x, y ∈ S. Then, the pseudo-spectral gap γps(P) is defined as\n\nγps(P) := max_{ℓ≥1} γ((P⋆)^ℓ P^ℓ)/ℓ .\n\n2.2 Model and Problem Statement\n\nWe are now ready to describe our model. We consider a learner interacting with a finite set of Markov chains indexed by k ∈ [K] := {1, 2, . . . , K}. For ease of presentation, we assume that all Markov chains are defined on the same state space² S with cardinality S. The Markov chain k, or for short chain k, is specified by its transition matrix Pk ∈ P_S. In this work, we assume that all Markov chains are ergodic, which implies that any chain k admits a unique stationary distribution, which we denote by πk. Moreover, the minimal element of πk is bounded away from zero: π̲k := min_{x∈S} πk(x) > 0. The initial distributions of the chains are assumed to be arbitrary. Further, we let γk := γ(Pk) denote the absolute spectral gap of chain k if k is reversible; otherwise, we define the pseudo-spectral gap of k by γps,k := γps(Pk).\n\n¹For example, in the Gilbert-Elliott channels [31].\n²Our algorithm and results straightforwardly extend to the case where the Markov chains are defined on different state spaces.\n\nA related quantity in our results is the Gini index of the various states.
For a chain k, the Gini index of state x ∈ S is defined as\n\nGk(x) := Σ_{y∈S} Pk(x, y)(1 − Pk(x, y)).\n\nNote that Gk(x) ≤ 1 − 1/S. This upper bound follows from the fact that the maximal value of Gk(x) is achieved when Pk(x, y) = 1/S for all y ∈ S (in view of the concavity of z ↦ Σ_{x∈S} z(x)(1 − z(x))). In this work, we assume that Σ_{x∈S} Gk(x) > 0 for all k.³ Another related quantity in our results is the sum (over states) of the inverse stationary probabilities: For a chain k, we define Hk := Σ_{x∈S} πk(x)^{−1}. Note that S² ≤ Hk ≤ S π̲k^{−1}. The quantity Hk reflects the discrepancy between the individual elements of πk.\n\nThe online learning problem. The learner wishes to design a sequential allocation strategy to adaptively sample the various Markov chains so that all transition matrices are learnt uniformly well. The game proceeds as follows: Initially all chains are assumed to be non-stationary, with arbitrary initial distributions chosen by the environment. At each step t ≥ 1, the learner samples a chain kt, based on the past decisions and the observed states, and observes the state Xkt,t. The state of kt evolves according to Pkt. The state of any chain k ≠ kt does not change: Xk,t = Xk,t−1 for all k ≠ kt.\n\nWe introduce the following notations: Let Tk,t denote the number of times chain k has been selected by the learner up to time t: Tk,t := Σ_{t'=1}^t I{kt' = k}, where I{·} denotes the indicator function. Likewise, we let Tk,x,t represent the number of observations of chain k, up to time t, when the chain was in state x: Tk,x,t := Σ_{t'=1}^t I{kt' = k, Xk,t' = x}. Further, we note that the learner only controls Tk,t (or equivalently, Σ_x Tk,x,t), but not the number of visits to individual states. At each step t, the learner maintains empirical estimates of the stationary distributions, and estimates the transition probabilities of the various chains based on the observations gathered up to t. We define the empirical stationary distribution of chain k at time t as ˆπk,t(x) := Tk,x,t/Tk,t for all x ∈ S. For chain k, we maintain the following smoothed estimate of the transition probabilities:\n\nˆPk,t(x, y) := (α + Σ_{t'=2}^t I{Xk,t'−1 = x, Xk,t' = y}) / (αS + Tk,x,t) ,   ∀x, y ∈ S,   (1)\n\nwhere α is a positive constant. In the literature, the case of α = 1/S is usually referred to as the Laplace-smoothed estimator. The learner is given a budget of n samples, and her goal is to obtain an accurate estimate of the transition matrices of the Markov chains. The accuracy of the estimation is measured by some notion of loss, which will be discussed later. The learner adaptively selects the various chains so that a minimal loss is achieved.\n\n2.3 Performance Measures\n\nWe are now ready to provide a precise definition of our notion of loss, which serves as the performance measure of a given algorithm. Given n ∈ N, we define the loss of an adaptive algorithm A as:\n\nLn(A) := max_{k∈[K]} Lk,n , with Lk,n := Σ_{x∈S} ˆπk,n(x) ‖Pk(x,·) − ˆPk,n(x,·)‖₂² .\n\nThe use of the L2-norm in the definition of the loss is quite natural in the context of learning and estimation of distributions, as it is directly inspired by the quadratic estimation error used in active\n\n³We remark that there exist chains with Σ_x Gk(x) = 0.
In view of the definition of the Gini index, such chains are necessarily deterministic (or degenerate), namely their transition matrices belong to {0, 1}^{S×S}. One example is a deterministic cycle with S nodes. We note that such chains may fail to satisfy irreducibility or aperiodicity.\n\nbandit allocation (e.g., [2]). Given a budget n, the loss Ln(A) of an adaptive algorithm A is a random variable, due to the evolution of the various chains as well as the possible randomization in the algorithm. Here, we aim at controlling this random quantity in a high-probability setup as follows: Let δ ∈ (0, 1). For a given algorithm A, we wish to find ε := ε(n, δ) such that\n\nP(Ln(A) ≥ ε) ≤ δ .   (2)\n\nRemark 1 We remark that the empirical stationary distribution ˆπk,t may differ from the stationary distribution associated with the smoothed estimator ˆPk,t of the transition matrix. Our algorithm and results, however, do not rely on possible relations between ˆπk,t and ˆPk,t, though one could have used smoothed estimators for πk. The motivation behind using the empirical estimate ˆπk,t of πk in Ln is that it naturally corresponds to the occupancy of the various states along a given sample path.\n\nComparison with other losses. We now turn our attention to the comparison between our loss function and some other possible notions. First, we compare ours to the loss function L′n(A) = max_k Σ_{x∈S} ‖Pk(x,·) − ˆPk,n(x,·)‖₂². Such a notion of loss might look more natural or simpler, since the weights ˆπk,n(x) are replaced simply with 1 (equivalently, uniform weights).
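The smoothed estimator of Eq. (1) and the loss Lk,n can be sketched numerically as follows. This is an illustrative aside with hypothetical counts, not code from the paper; the helper names are ours, and numpy is assumed:

```python
import numpy as np

def smoothed_transition_estimate(counts, alpha):
    """alpha-smoothed estimator of Eq. (1): each entry is
    (alpha + N(x, y)) / (alpha * S + N(x)), with N(x, y) the observed
    x -> y transition counts and N(x) the visits to x."""
    S = counts.shape[0]
    return (alpha + counts) / (alpha * S + counts.sum(axis=1, keepdims=True))

def empirical_loss(P, P_hat, pi_hat):
    """L_{k,n} = sum_x pi_hat(x) * ||P(x,.) - P_hat(x,.)||_2^2."""
    return float(np.sum(pi_hat * np.sum((P - P_hat) ** 2, axis=1)))
```

Note that each row of the smoothed estimate sums to one by construction, and the loss vanishes exactly when the estimate matches the true transition matrix.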
However, this means a strategy may incur a high loss on a part of the state space that is rarely visited, even though we have absolutely no control over the chain. For instance, in the extreme case where some states x are reachable only with a very small probability, Tk,x,n may be arbitrarily small, thus resulting in a large loss L′n for all algorithms, while it makes little sense to penalize an allocation strategy on these “virtual” states. Weighting the loss according to the empirical frequency ˆπk,n of visits avoids such a phenomenon, and is thus more meaningful.\n\nIn view of the above discussion, it is also tempting to replace the empirical state distribution ˆπk,n with its expectation πk, namely to define a pseudo-loss function of the form L″n(A) = max_k Σ_x πk(x)‖Pk(x,·) − ˆPk,n(x,·)‖₂² (as studied in, e.g., [22] in a different setup). We recall that our aim is to derive performance guarantees on the algorithm's loss that hold with high probability (for a 1 − δ portion of the sample paths of the algorithm, for a given δ). To this end, Ln (which uses ˆπk,n) is more natural and meaningful than L″n, as Ln penalizes the algorithm's performance by the relative visit counts of the various states in a given sample path (through ˆπk,n), and not by the expected value of these counts. This matters a lot in the small-budget regime, where ˆπk,n could differ significantly from πk; when n is large enough, ˆπk,n becomes well concentrated around πk with high probability. To clarify further, let us consider the small-budget regime, and some state x where πk(x) is not small. In the case of Ln, using ˆπk,n we penalize the performance by the mismatch between ˆPk,n(x,·) and Pk(x,·), weighted proportionally to the number of rounds the algorithm has actually visited x. In contrast, in the case of L″n, weighting the mismatch proportionally to πk(x) does not seem reasonable, since in a given sample path the algorithm might not have visited x enough even though πk(x) is not small. We remark that our results in subsequent sections easily apply to the pseudo-loss L″n, at the expense of an additive second-order term, which might depend on the mixing times.\n\nFinally, we position the high-probability guarantee on Ln, in the sense of Eq. (2), against guarantees holding in expectation. Prior studies on bandit allocation, such as [7, 2], whose objectives involve a max operator, consider the expected squared distance. The analyses presented in this series of works rely on Wald's second identity as the technical device. This prevents one from extending the approach therein to other distances. Another peculiarity arising in working with expectations is the order of the “max” and “expectation” operators. While it makes more sense to control the expected value of the maximum, the works cited above consider the maximum of the expected value, which is more in line with a pseudo-loss definition than with the loss. All of these difficulties can be avoided by resorting to a high-probability setup (in the sense of Eq. (2)).\n\nFurther intuition and example. We now provide an illustrative example to further clarify some of the above comments. Let us consider the following two-state Markov chain:\n\nP = [[1/2, 1/2], [ε/4, 1 − ε/4]] ,\n\nwhere ε ∈ (0, 1). The stationary distribution of this Markov chain is π = [ε/(2 + ε), 2/(2 + ε)]. Let s1 (resp. s2) denote the state corresponding to the first (resp. second) row of the transition matrix. In view of π, when ε ≪ 1, the chain tends to stay in s2 (the lazy state) most of the time: Out of n observations, one gets on average only nπ(s1) = nε/(2 + ε) observations of state s1, which means, for ε ≪ 1/n, essentially no observation of state s1. Hence, no algorithm can estimate the transitions from s1 in such a setup, and all strategies would suffer a huge loss according to L′n, no matter how samples are allocated to this chain. Thus, L′n is of limited interest for distinguishing between good and bad sampling strategies. On the other hand, using Ln enables one to better distinguish between allocation strategies, since the weight given to s1 would be essentially 0 in this case, thus focusing on the good estimation of s2 (and the other chains) only.\n\n2.4 Static Allocation\n\nIn this subsection, we investigate the optimal loss asymptotically achievable by an oracle policy that is aware of some properties of the chains. To this end, let us consider a non-adaptive strategy in which the sampling of the various chains is deterministic, so that Tk,n, k = 1, . . . , K, are not random. The following lemma is a consequence of the central limit theorem:\n\nLemma 1 We have for any chain k: Tk,n Lk,n → Σ_x Gk(x) as Tk,n → ∞.\n\nThe proof of this lemma consists of two steps: First, we provide lower and upper bounds on Lk,n in terms of the loss L̃k,n incurred by the learner had she used the empirical estimator (corresponding to α = 0 in (1)). Second, we show that, by the central limit theorem, Tk,n L̃k,n → Σ_x Gk(x) as Tk,n → ∞.\n\nNow, consider an oracle policy Aoracle, which is aware of Σ_{x∈S} Gk(x) for the various chains. In view of the above discussion, and taking into account the constraint Σ_{k∈[K]} Tk,n = n, it would be asymptotically optimal to allocate Tk,n = ηk n samples to chain k, where\n\nηk := (1/Λ) Σ_{x∈S} Gk(x) , with Λ := Σ_{k∈[K]} Σ_{x∈S} Gk(x) .\n\nThe corresponding loss would satisfy: n Ln(Aoracle) → Λ as n → ∞. We shall refer to the quantity Λ/n as the asymptotically optimal loss, which is a problem-dependent quantity. The coefficients ηk, k ∈ [K], characterize the discrepancy between the transition matrices of the various chains, and indicate that an algorithm needs to account for such discrepancy in order to achieve the asymptotically optimal loss. Having characterized the notion of asymptotically optimal loss, we are now ready to define the notion of a uniformly good algorithm:\n\nDefinition 1 (Uniformly Good Algorithm) An algorithm A is said to be uniformly good if, for any problem instance, it achieves the asymptotically optimal loss as n grows large; that is, lim_{n→∞} n Ln(A) = Λ for all problem instances.\n\n3 The BA-MC Algorithm\n\nIn this section, we introduce an algorithm designed for adaptive bandit allocation over a set of Markov chains. It is designed based on the optimistic principle, as in MAB problems (e.g., [34, 35]), and relies on an index function.
More precisely, at each time t, the algorithm maintains an index function bk,t+1 for each chain k, which provides an upper confidence bound (UCB) on the loss incurred by chain k at time t; that is, with high probability, bk,t+1 ≥ Lk,t := Σ_{x∈S} ˆπk,t(x)‖Pk(x,·) − ˆPk,t(x,·)‖₂², where ˆPk,t denotes the smoothed estimate of Pk with some α > 0 (see Eq. (1)). Now, by sampling a chain kt ∈ argmax_{k∈[K]} bk,t+1 at time t, we can balance exploration and exploitation by selecting more often the chains with higher estimated losses or those with higher uncertainty in these estimates.\n\nIn order to specify the index function bk,·, let us choose α = 1/(3S) (we motivate this choice of α later on), and for each state x ∈ S, define the estimate of the Gini coefficient at time t as ˆGk,t(x) := Σ_{y∈S} ˆPk,t(x, y)(1 − ˆPk,t(x, y)). The index bk,t+1 is then defined as\n\nbk,t+1 = (2β/Tk,t) Σ_{x∈S} I{Tk,x,t > 0} ˆGk,t(x) + (6.6 β^{3/2}/Tk,t) Σ_{x∈S} (Tk,x,t^{3/2}/(Tk,x,t + αS)²) Σ_{y∈S} √(ˆPk,t(x, y)(1 − ˆPk,t(x, y))) + (28 β² S/Tk,t) Σ_{x∈S} I{Tk,x,t > 0}/(Tk,x,t + αS) ,\n\nwhere β := β(n, δ) := c log(⌈log(n)/log(c)⌉ 6KS²/δ), with c > 1 an arbitrary constant. In this paper, we choose c = 1.1.\n\nWe remark that the design of the index bk,· above comes from applying an empirical Bernstein concentration inequality for α-smoothed estimators (see Lemma 4 in the supplementary material) to the loss function Lk,t. In other words, Lemma 4 guarantees that, with high probability, bk,t+1 ≥ Lk,t.
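For concreteness, the index computation for a single chain can be sketched as follows. This is our own illustration, following the display above as we have reconstructed it from the extraction; the function name is ours, numpy is assumed, and, as a simplification, the number of observed transitions is used as a proxy for Tk,t:

```python
import numpy as np

def bamc_index(trans_counts, beta, alpha):
    """Sketch of the BA-MC index b_{k,t+1} for one chain.
    trans_counts[x, y] = number of observed x -> y transitions."""
    S = trans_counts.shape[0]
    T_x = trans_counts.sum(axis=1)         # visits to each state x
    T = T_x.sum()                          # proxy for T_{k,t}
    P_hat = (alpha + trans_counts) / (alpha * S + T_x[:, None])  # Eq. (1)
    G_hat = np.sum(P_hat * (1.0 - P_hat), axis=1)                # Gini estimates
    visited = (T_x > 0).astype(float)
    term1 = (2 * beta / T) * np.sum(visited * G_hat)
    term2 = (6.6 * beta ** 1.5 / T) * np.sum(
        (T_x ** 1.5 / (T_x + alpha * S) ** 2)
        * np.sqrt(P_hat * (1.0 - P_hat)).sum(axis=1))
    term3 = (28 * beta ** 2 * S / T) * np.sum(visited / (T_x + alpha * S))
    return term1 + term2 + term3
```

As expected of a UCB on a vanishing loss, the index is positive and shrinks as more transitions from the chain are observed.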
Our concentration inequality (Lemma 4) is new, to our knowledge, and could be of independent interest.\n\nHaving defined the index function bk,·, we are now ready to describe our algorithm, which we call BA-MC (Bandit Allocation for Markov Chains). BA-MC receives as input a confidence parameter δ, a budget n, as well as the state space S. It initially samples each chain twice (hence, this phase lasts for 2K rounds). Then, BA-MC simply consists in sampling the chain with the largest index bk,t+1 at each round t. Finally, after n pulls, it returns an estimate ˆPk,n for each chain k. We provide the pseudo-code of BA-MC in Algorithm 1. Note that BA-MC does not require any prior knowledge of the chains (neither the initial distributions nor the mixing times).\n\nAlgorithm 1 BA-MC – Bandit Allocation for Markov Chains\n\nInput: Confidence parameter δ, budget n, state space S;\nInitialize: Sample each chain twice;\nfor t = 2K + 1, . . . , n do\nSample chain kt ∈ argmax_k bk,t+1;\nObserve Xkt,t, and update Tk,x,t and Tk,t;\nend for\n\nIn order to provide more insight into the design of BA-MC, let us remark that (as shown in Lemma 8 in the supplementary material) bk,t+1 also provides a high-probability UCB on the quantity (1/Tk,t) Σ_x Gk(x). Hence, by sampling the chain kt ∈ argmax_{k∈[K]} bk,t+1 at time t, in view of the discussion in Section 2.4, BA-MC tries to mimic an oracle algorithm aware of Σ_x Gk(x) for the various chains.\n\nWe remark that our concentration inequality in Lemma 4 (of the supplementary material) parallels Lemma 8.3 in [36]. In contrast to the latter, our concentration lemma features the terms Tk,x,t + αS in the denominator, whereas Lemma 8.3 in [36] features the terms Tk,x,t in the denominator.
This feature plays an important role in dealing with situations where some states have not been sampled up to time t, that is, when T_{k,x,t} = 0 for some x.

4 Performance Bounds

We are now ready to study performance bounds on the loss L_n(BA-MC) in both the asymptotic and non-asymptotic regimes. We begin with a generic non-asymptotic bound:

Theorem 1 (BA-MC, Generic Performance) Let δ ∈ (0, 1). Then, for any budget n ≥ 4K, with probability at least 1 − δ, the loss under A = BA-MC satisfies

\[
L_n(A) \;\leq\; \frac{287\, K S^2 \beta^2}{n} + \widetilde{O}\Bigl(\frac{K^2 S^2}{n^2}\Bigr) .
\]

The proof of this theorem, provided in Section C of the supplementary, reveals the motivation for choosing α = 1/(3S): it verifies that to minimize the dependency of the loss on S, one must choose α ∝ S^{-1}. In particular, the proof does not rely on the ergodicity assumption:

Remark 2 Theorem 1 is valid even if the Markov chains P_k, k ∈ [K] are reducible or periodic.

In the following theorem, we state another non-asymptotic bound on the performance of BA-MC, which refines Theorem 1 for when n ≥ n_cutoff, where

\[
n_{\mathrm{cutoff}} := n_{\mathrm{cutoff}}(\delta) := K \max_{k} \Bigl( \frac{300}{\gamma'_k \pi_k} \log\Bigl( \frac{2KS}{\delta} \sqrt{\pi_k^{-1}} \Bigr) \Bigr)^2 ,
\]

where γ'_k = γ_k if chain k is reversible, and γ'_k = γ_{ps,k} otherwise. To present the theorem, we recall the notation Λ := \sum_k \sum_x G_k(x), and that for any chain k, η_k = (1/Λ) \sum_{x∈S} G_k(x), H_k := \sum_{x∈S} π_k(x)^{-1}, and π_k := min_{x∈S} π_k(x) > 0.

Theorem 2 Let δ ∈ (0, 1), and assume that n ≥ n_cutoff.
Then, with probability at least 1 − 2δ,

\[
L_n(A) \;\leq\; \frac{2\beta\Lambda}{n} + \frac{C_0\, \beta^{3/2}}{n^{3/2}} + \widetilde{O}(n^{-2}) ,
\qquad \text{where } C_0 := 150K \Bigl( \sqrt{S\Lambda \max_k H_k} + 3\sqrt{S\Lambda \max_k \tfrac{H_k}{\eta_k}} \Bigr) .
\]

Recalling that the asymptotic loss of the oracle algorithm discussed in Section 2.4 equals Λ/n, in view of the Bernstein concentration, the oracle would incur a loss of at most 2βΛ/n when the budget n is finite. In this regard, we may look at the quantity L_n(A) − 2βΛ/n as the pseudo-excess loss of A (we refrain from calling this quantity the excess loss, as 2βΛ/n is not equal to the high-probability loss of the oracle). Theorem 2 implies that when n is greater than the cut-off budget n_cutoff, the pseudo-excess loss under BA-MC vanishes at a rate \widetilde{O}(n^{-3/2}). In particular, Theorem 2 characterizes the constant C_0 controlling the main term of the pseudo-excess loss: C_0 = O\bigl(K\bigl(\sqrt{S\Lambda \max_k H_k} + \sqrt{S\Lambda \max_k H_k/\eta_k}\bigr)\bigr). This further indicates that the pseudo-excess loss is controlled by the quantity H_k/η_k, which captures (i) the discrepancy among the \sum_x G_k(x) values of the various chains k, and (ii) the discrepancy between the various stationary probabilities π_k(x), x ∈ S. We emphasize that the dependency of the learning performance (through C_0) on H_k is in alignment with the result obtained in [21] for the estimation of a single ergodic Markov chain.

The proof of Theorem 2, provided in Section D of the supplementary, shows that to determine the cut-off budget n_cutoff, one needs to determine the value of n such that, with high probability, for any chain k and state x, the term T_{k,n}(T_{k,x,n} + αS)^{-1} approaches π_k(x)^{-1}; this is in turn controlled by γ_{ps,k} (or γ_k if chain k is reversible) as well as the minimal stationary probability π_k.
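The quantities Λ, η_k, and H_k appearing in Theorem 2 can be computed exactly when the chains are known. The sketch below (ours; it follows our reading of the notation recalled before Theorem 2, with the stationary distribution obtained as the leading left eigenvector) makes them concrete.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an ergodic chain P, via the left
    eigenvector of eigenvalue 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])
    return pi / pi.sum()

def oracle_quantities(P_list):
    """Lambda, oracle budget fractions eta_k, and H_k for known chains."""
    G = [float(np.sum(P * (1 - P))) for P in P_list]   # sum_x G_k(x) per chain
    Lam = sum(G)                                       # Lambda
    eta = [g / Lam for g in G]                         # eta_k = G_k-share of Lambda
    H = [float(np.sum(1.0 / stationary(P))) for P in P_list]
    return Lam, eta, H
```

For instance, for two copies of the symmetric chain P = [[0.5, 0.5], [0.5, 0.5]], each state has Gini coefficient 0.5, so Λ = 2, η = (1/2, 1/2), and H_k = 4 for both chains.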
This in turn allows us to show that, under BA-MC, the number T_{k,n} of samples of any chain k comes close to the quantity η_k n. Finally, we remark that the proof of Theorem 2 also reveals that the result in the theorem is in fact valid for any constant α > 0.

In the following theorem, we characterize the asymptotic performance of BA-MC:

Theorem 3 (BA-MC, Asymptotic Regime) Under A = BA-MC, lim sup_{n→∞} n L_n(A) = Λ.

The above theorem asserts that, asymptotically, the loss under BA-MC matches the asymptotically optimal loss Λ/n characterized in Section 2.4. We may thus conclude that BA-MC is uniformly good (in the sense of Definition 1). The proof of Theorem 3 (provided in Section E of the supplementary) proceeds as follows: it divides the estimation problem into two consecutive sub-problems, one with a budget of n_0 = √n pulls and the other with the remaining n − √n pulls. We then show that when n ≥ n_cutoff, the number T_{k,n_0} of samples of each chain k at the end of the first sub-problem is lower bounded by Ω(n^{1/4}), and as a consequence, the index b_k is accurate enough: with high probability,

\[
b_{k,n_0} \in \frac{1}{T_{k,n_0}} \Bigl[ \sum_x G_k(x),\; \sum_x G_k(x) + \widetilde{O}(n^{-1/8}) \Bigr] .
\]

This allows us to relate the allocation under BA-MC over the course of the second sub-problem to that of the oracle, and further to show that the difference vanishes as n → ∞.

Below, we provide some further comments on the bounds presented in Theorems 1–3:

Various regimes. Theorem 1 provides a non-asymptotic bound on the loss valid for any n, while Theorem 3 establishes the optimality of BA-MC in the asymptotic regime.
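As a rough empirical illustration of the 1/n scaling behind Theorems 1–3 (ours; a single simulated chain with a fixed trajectory, not BA-MC), the sketch below evaluates the empirical loss \sum_x \hat{\pi}(x)\|P(x,\cdot) - \widehat{P}(x,\cdot)\|_2^2 at two trajectory lengths; the chain, trajectory lengths, and smoothing constant are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def chain_loss(P, n, alpha):
    """Empirical loss sum_x pi_hat(x) * ||P(x,.) - P_hat(x,.)||_2^2 of one
    chain after n observed transitions (simulation sketch)."""
    S = P.shape[0]
    counts = np.zeros((S, S))
    x = 0
    for _ in range(n):                 # roll out a single trajectory
        y = rng.choice(S, p=P[x])
        counts[x, y] += 1
        x = y
    T_x = counts.sum(axis=1)
    pi_hat = T_x / n                   # empirical state frequencies
    P_hat = (counts + alpha) / (T_x + alpha * S)[:, None]  # assumed Laplace smoothing
    return float(pi_hat @ np.sum((P - P_hat) ** 2, axis=1))

P = np.array([[0.9, 0.1], [0.2, 0.8]])
loss_small = chain_loss(P, 300, 1 / 6)
loss_big = chain_loss(P, 30000, 1 / 6)
```

With a 100-fold larger budget, the loss drops by roughly two orders of magnitude, consistent with the Λ/n rate.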
In view of the inequality Λ ≤ K(S − 1), the bound in Theorem 1 is off by at least a factor of S from the asymptotic loss Λ/n. Theorem 2 bridges the two results, thereby establishing a third regime in which the algorithm enjoys the asymptotically optimal loss up to an additive pseudo-excess loss scaling as \widetilde{O}(n^{-3/2}).

The effect of mixing. It is worth emphasizing that the mixing times of the chains do not appear explicitly in the bounds; they only control (through the pseudo-spectral gap γ_{ps,k}) the cut-off budget n_cutoff that determines when the pseudo-excess loss vanishes at a rate n^{-3/2}. This is indeed a strong aspect of our results, attributable to our definition of loss, which employs the empirical estimates \hat{\pi}_{k,n} in lieu of π_k. Specifically, as argued in [36], given the numbers of samples of the various states (akin to using \hat{\pi}_{k,t}(x) in the loss definition), the convergence of the frequency estimates towards their true values is independent of the mixing time of the chain. We note that despite the dependence of n_cutoff on the mixing times, BA-MC does not need to estimate them: when n ≤ n_cutoff, it still enjoys the loss guarantee of Theorem 1. We also mention that to define an index function for the loss function max_k \sum_x π_k(x) \|P_k(x,\cdot) − \widehat{P}_{k,n}(x,\cdot)\|_2^2, one may have to derive confidence bounds on the mixing time and/or the stationary distribution π_k as well.

More on the pseudo-excess loss. We stress that the notion of pseudo-excess loss bears some similarity to the definition of regret for active bandit learning of distributions as introduced in [7, 2] (see Section 1). In the latter case, the regret typically decays as n^{-3/2}, similarly to the pseudo-excess loss in our case.
An interesting question is whether the decay rate of the pseudo-excess loss, as a function of n, can be improved and, more importantly, whether a (problem-dependent) lower bound on the pseudo-excess loss can be established. These questions are open even for the simpler case of active learning of distributions in the i.i.d. setup; see, e.g., [37, 8, 2]. We plan to address them in future work.

5 Conclusion

In this paper, we addressed the problem of active bandit allocation in the case of discrete and ergodic Markov chains. We considered a notion of loss function appropriately extending the loss function for learning distributions to the case of Markov chains. We further characterized the notion of a "uniformly good algorithm" under the considered loss function. We presented an algorithm for learning Markov chains, which we called BA-MC. Our algorithm is simple to implement and does not require any prior knowledge of the Markov chains. We provided non-asymptotic PAC-type bounds on the loss incurred by BA-MC, and showed that asymptotically, it incurs an optimal loss. We further discussed that the (pseudo-excess) loss incurred by BA-MC in our bounds does not deteriorate with the mixing times of the chains. As future work, we plan to derive a problem-dependent lower bound on the pseudo-excess loss. Another interesting, yet very challenging, future direction is to devise adaptive learning algorithms for restless Markov chains, where the states of the various chains evolve at each round independently of the learner's decisions.

Acknowledgements

This work has been supported by CPER Nord-Pas-de-Calais/FEDER DATA Advanced data science and technologies 2015-2020, the French Ministry of Higher Education and Research, Inria, and the French Agence Nationale de la Recherche (ANR), under grant ANR-16-CE40-0002 (project BADASS).

References

[1] András Antos, Varun Grover, and Csaba Szepesvári.
Active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 287–302. Springer, 2008.

[2] Alexandra Carpentier, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, and Peter Auer. Upper-confidence-bound algorithms for active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 189–203. Springer, 2011.

[3] Valerii Vadimovich Fedorov. Theory of optimal experiments. Elsevier, 1972.

[4] Hovav A. Dror and David M. Steinberg. Sequential experimental designs for generalized linear models. Journal of the American Statistical Association, 103(481):288–298, 2008.

[5] David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

[6] Pierre Etoré and Benjamin Jourdain. Adaptive optimal allocation in stratified sampling methods. Methodology and Computing in Applied Probability, 12(3):335–360, 2010.

[7] András Antos, Varun Grover, and Csaba Szepesvári. Active learning in heteroscedastic noise. Theoretical Computer Science, 411(29-30):2712–2728, 2010.

[8] Alexandra Carpentier, Rémi Munos, and András Antos. Adaptive strategy for stratified Monte Carlo sampling. Journal of Machine Learning Research, 16:2231–2271, 2015.

[9] James Neufeld, András György, Dale Schuurmans, and Csaba Szepesvári. Adaptive Monte Carlo via bandit allocation. In Proceedings of the 31st International Conference on Machine Learning, pages 1944–1952, 2014.

[10] Alexandra Carpentier and Rémi Munos. Adaptive stratified sampling for Monte-Carlo integration of differentiable functions. In Advances in Neural Information Processing Systems, pages 251–259, 2012.

[11] Alexandra Carpentier and Odalric-Ambrym Maillard.
Online allocation and homogeneous partitioning for piecewise constant mean-approximation. In Advances in Neural Information Processing Systems, pages 1961–1969, 2012.

[12] Patrick Billingsley. Statistical methods in Markov chains. The Annals of Mathematical Statistics, pages 12–40, 1961.

[13] Claude Kipnis and S. R. Srinivasa Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Communications in Mathematical Physics, 104(1):1–19, 1986.

[14] Moshe Haviv and Ludo Van der Heyden. Perturbation bounds for the stationary probabilities of a finite Markov chain. Advances in Applied Probability, 16(4):804–818, 1984.

[15] Nicky J. Welton and A. E. Ades. Estimation of Markov chain transition probabilities and rates from fully and partially observed data: uncertainty propagation, evidence synthesis, and model calibration. Medical Decision Making, 25(6):633–645, 2005.

[16] Bruce A. Craig and Peter P. Sendi. Estimation of the transition matrix of a discrete-time Markov chain. Health Economics, 11(1):33–42, 2002.

[17] Sean P. Meyn and Richard L. Tweedie. Markov chains and stochastic stability. Springer Science & Business Media, 2012.

[18] Emmanuel Rio. Asymptotic theory of weakly dependent random processes. Springer, 2017.

[19] Pascal Lezaud. Chernoff-type bound for finite Markov chains. Annals of Applied Probability, pages 849–867, 1998.

[20] Daniel Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.

[21] Geoffrey Wolfer and Aryeh Kontorovich. Minimax learning of ergodic Markov chains. In Algorithmic Learning Theory, pages 903–929, 2019.

[22] Yi Hao, Alon Orlitsky, and Venkatadheeraj Pichapati. On learning Markov chains.
In Advances in Neural Information Processing Systems, pages 648–657, 2018.

[23] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On learning distributions from their samples. In Conference on Learning Theory, pages 1066–1100, 2015.

[24] Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. Theoretical Computer Science, 558:62–76, 2014.

[25] Cem Tekin and Mingyan Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.

[26] Christopher R. Dance and Tomi Silander. Optimal policies for observing time series and related restless bandit problems. Journal of Machine Learning Research, 20(35):1–93, 2019.

[27] Jean Tarbouriech and Alessandro Lazaric. Active exploration in Markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 974–982, 2019.

[28] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, pages 273–282. ACM, 1994.

[29] David Gamarnik. Extension of the PAC framework to finite and countable Markov chains. IEEE Transactions on Information Theory, 49(1):338–345, 2003.

[30] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

[31] Mordechai Mushkin and Israel Bar-David. Capacity and coding for the Gilbert-Elliott channels. IEEE Transactions on Information Theory, 35(6):1277–1290, 1989.

[32] James R. Norris. Markov chains. Number 2. Cambridge University Press, 1998.

[33] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer.
Markov chains and mixing times. American Mathematical Society, Providence, RI, 2009.

[34] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[35] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[36] Daniel J. Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. In Advances in Neural Information Processing Systems, pages 1459–1467, 2015.

[37] Alexandra Carpentier and Rémi Munos. Minimax number of strata for online stratified sampling: The case of noisy samples. Theoretical Computer Science, 558:77–106, 2014.