{"title": "Optimal Best Markovian Arm Identification with Fixed Confidence", "book": "Advances in Neural Information Processing Systems", "page_first": 5605, "page_last": 5614, "abstract": "We give a complete characterization of the sampling complexity\nof best Markovian arm identification in one-parameter Markovian bandit models. We derive instance specific nonasymptotic and asymptotic lower bounds which generalize those of the IID setting.\nWe analyze the Track-and-Stop strategy, initially proposed for the IID setting, and we prove that asymptotically it is at most a factor of four apart from the lower bound. Our one-parameter Markovian bandit model is based on the notion of an exponential family of stochastic matrices for which we establish many useful properties. For the analysis of the Track-and-Stop strategy we derive a novel and optimal concentration inequality for Markov chains that may be of interest in its own right.", "full_text": "Optimal Best Markovian Arm Identi\ufb01cation with\n\nFixed Con\ufb01dence\n\nVrettos Moulos\n\nUniversity of California Berkeley\n\nvrettos@berkeley.edu\n\nAbstract\n\nWe give a complete characterization of the sampling complexity of best Markovian\narm identi\ufb01cation in one-parameter Markovian bandit models. We derive instance\nspeci\ufb01c nonasymptotic and asymptotic lower bounds which generalize those of\nthe IID setting. We analyze the Track-and-Stop strategy, initially proposed for the\nIID setting, and we prove that asymptotically it is at most a factor of four apart\nfrom the lower bound. Our one-parameter Markovian bandit model is based on the\nnotion of an exponential family of stochastic matrices for which we establish many\nuseful properties. For the analysis of the Track-and-Stop strategy we derive a novel\nconcentration inequality for Markov chains that may be of interest in its own right.\n\n1\n\nIntroduction\n\nThis paper is about optimal best Markovian arm identi\ufb01cation with \ufb01xed con\ufb01dence. 
There are K independent options which are referred to as arms. Each arm a is associated with a discrete time stochastic process, which is characterized by a parameter θ_a and is governed by the probability law P_{θ_a}. At each round we select one arm, without any prior knowledge of the statistics of the stochastic processes. The stochastic process that corresponds to the selected arm evolves by one time step, and we observe this evolution through a reward function, while the stochastic processes for the rest of the arms stay still. A confidence level δ ∈ (0, 1) is prescribed, and our goal is to identify the arm that corresponds to the process with the highest stationary mean with probability at least 1 − δ, using as few samples as possible.

1.1 Contributions

In the work of Garivier and Kaufmann (2016) the discrete time stochastic process associated with each arm a is assumed to be an IID process. Here we go one step further and we study more complicated dependent processes, which allow us to use more expressive models in the stochastic multi-armed bandits framework. More specifically we consider the case that each P_{θ_a} is the law of an irreducible finite state Markov chain associated with a stationary mean μ(θ_a). We establish a lower bound (Theorem 1) for the expected sample complexity, as well as an analysis of the Track-and-Stop strategy, proposed for the IID setting in Garivier and Kaufmann (2016), which shows (Theorem 3) that asymptotically the Track-and-Stop strategy in the Markovian dependence setting attains a sample complexity which is at most a factor of four apart from our asymptotic lower bound.
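The sequential protocol described at the beginning of this section can be sketched in a few lines of code. The following is a minimal illustrative simulation (the class name, transition matrices and seed are our own choices, not part of the paper): pulling an arm advances only that arm's chain by one step and reveals f of the new state, while the other chains stay frozen.

```python
import random

class MarkovianBandit:
    """Toy simulator of the best-Markovian-arm protocol (illustrative only)."""

    def __init__(self, transition_matrices, f, rng=None):
        self.P = transition_matrices      # one row-stochastic matrix per arm
        self.f = f                        # reward function on the states
        self.states = [0] * len(self.P)   # current state of each arm's chain
        self.rng = rng or random.Random(0)

    def pull(self, a):
        # advance only chain `a` by one step, sampling from its current row
        row = self.P[a][self.states[a]]
        u, acc = self.rng.random(), 0.0
        for y, p in enumerate(row):
            acc += p
            if u < acc:
                self.states[a] = y
                break
        return self.f(self.states[a])

# Two arms on states {0, 1}; f is the identity, so rewards are 0/1.
P_slow = [[0.9, 0.1], [0.9, 0.1]]     # rarely visits state 1
P_fast = [[0.2, 0.8], [0.05, 0.95]]   # mostly sits in state 1
env = MarkovianBandit([P_slow, P_fast], f=lambda x: x)
rewards = [env.pull(1) for _ in range(1000)]
```

Pulling only arm 1 leaves arm 0's chain untouched, which is exactly the "stay still" property of the unobserved arms.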
Both our lower and upper bounds extend the work of Garivier and Kaufmann (2016) to the more complicated and more general Markovian dependence setting.

The abstract framework of multi-armed bandits has numerous applications in areas like clinical trials, ad placement, adaptive routing, resource allocation, gambling etc. For more context we refer the interested reader to the survey of Bubeck and Cesa-Bianchi (2012). Here we generalize this model to allow for the presence of Markovian dependence, enabling this way the practitioner to use richer and more expressive models for the various applications. In particular, Markovian dependence allows models where the distribution of the next sample depends on the sample just observed. This way one can model, for instance, the evolution of a rigged slot machine which, as soon as it generates a big reward for the gambler, changes its reward distribution to one that is skewed towards smaller rewards.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our key technical contributions stem from the large deviations theory for Markov chains Miller (1961); Donsker and Varadhan (1975); Ellis (1984); Dembo and Zeitouni (1998). In particular we utilize the concept of an exponential family of stochastic matrices, first introduced in Miller (1961), in order to model our one-parameter Markovian bandit model. Many properties of the family are established which are then used for our analysis of the Track-and-Stop strategy. The most important one is an optimal concentration inequality for the empirical means of Markov chains (Theorem 2). We are able to establish this inequality for a large class of Markov chains, including those in which all the transitions have positive probability.
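The rigged slot machine above can be sketched as a two-state reward chain (an illustrative toy example; the transition probabilities and payouts are our own): right after a big payout the machine almost surely reverts to the small-payout state, so the conditional probability of a big payout immediately after a big payout falls well below its long-run frequency.

```python
import random

# State 0 pays a small reward, state 1 a big one; after a big payout the
# machine reverts to the small-payout state with probability 0.95.
P = [[0.7, 0.3],    # from small: occasionally allow a big payout
     [0.95, 0.05]]  # from big: almost surely revert to small payouts
payout = {0: 1.0, 1: 50.0}

rng = random.Random(1)
state = 0
n_steps = 20000
visits_big = after_big = big_then_big = 0
for _ in range(n_steps):
    prev = state
    state = 0 if rng.random() < P[state][0] else 1
    visits_big += (state == 1)
    if prev == 1:
        after_big += 1
        big_then_big += (state == 1)

p_big_overall = visits_big / n_steps        # long-run frequency of big payouts
p_big_after_big = big_then_big / after_big  # chance of a big payout right after one
```

The gap between the two estimated frequencies is precisely the Markovian dependence that an IID model cannot express.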
Prior work on the topic, Gillman (1993); Dinwoodie (1995); Lezaud (1998); León and Perron (2004), fails to capture the optimal exponential decay, or introduces a polynomial prefactor, Davisson et al. (1981), as opposed to our constant prefactor. This result may be of independent interest due to the wide applicability of Markov chains in many aspects of learning theory, such as reinforcement learning, Markov chain Monte Carlo and others.

1.2 Related Work

The cornerstone of stochastic multi-armed bandits is the seminal work of Lai and Robbins (1985). They considered K IID processes with the objective being to maximize the expected value of the sum of the observed rewards, or equivalently to minimize the so called regret. In the same spirit Anantharam et al. (1987a,b) examine the generalization where one is allowed to collect multiple rewards at each time step, first in the case that the processes are IID Anantharam et al. (1987a), and then in the case that the processes are irreducible and aperiodic Markov chains Anantharam et al. (1987b). A survey of the regret minimization literature is contained in Bubeck and Cesa-Bianchi (2012).

An alternative objective is the one of identifying the process with the highest stationary mean as fast and as accurately as possible, notions which are made precise in Subsection 2.1. In the IID setting, Even-Dar et al. (2006) establish an elimination based algorithm in order to find an approximate best arm, and Mannor and Tsitsiklis (2004) provide a matching lower bound. Jamieson et al. (2014) propose an upper confidence strategy, inspired by the law of the iterated logarithm, for exact best arm identification given some fixed level of confidence. In the asymptotic high confidence regime, the problem is settled by the work of Garivier and Kaufmann (2016), who provide instance specific matching lower and upper bounds.
For their upper bound they propose the Track-and-Stop strategy, which is further explored in the work of Kaufmann and Koolen (2018).

The earliest reference for the exponential family of stochastic matrices which is being used to model the Markovian arms can be found in the work of Miller (1961). Exponential families of stochastic matrices lie at the heart of the theory of large deviations for Markov processes, which was popularized with the pioneering work of Donsker and Varadhan (1975). A comprehensive overview of the theory can be found in the book of Dembo and Zeitouni (1998). Naturally they also show up when one conditions on the second order empirical distribution of a Markov chain; see the work of Csiszár et al. (1987) about conditional limit theorems. A variant of the exponential family that we are going to discuss has been developed in the context of hypothesis testing in Nakagawa and Kanaya (1993). A more recent development by Nagaoka (2005) gives an information geometry perspective to this concept, and the work of Hayashi and Watanabe (2016) examines parameter estimation for the exponential family. Our development of the exponential family of stochastic matrices tries to parallel the development of simple exponential families of probability distributions of Wainwright and Jordan (2008).

Regarding concentration inequalities for Markov chains, one of the earliest works, Davisson et al. (1981), is based on counting, and is able to capture the optimal rate of exponential decay dictated by the theory of large deviations, but has a suboptimal polynomial prefactor. More recent approaches follow the line of work started by Gillman (1993), who used matrix perturbation theory to derive a bound for reversible Markov chains. This bound attains a constant prefactor but with a suboptimal rate of exponential decay which depends on the spectral gap of the transition matrix.
This work was later extended by Dinwoodie (1995); Lezaud (1998), but still with a sub-optimal rate. The work of León and Perron (2004) reduces the problem to a two state Markov chain, and attains the optimal rate only for the case of a two state Markov chain. Chung et al. (2012) obtain rates that depend on the mixing time of the chain rather than the spectral gap, but which are still suboptimal.

2 Problem Formulation

2.1 One-parameter family of Markov Chains

In order to model the problem we will use a one-parameter family of Markov chains on a finite state space S. Each Markov chain in the family corresponds to a parameter θ ∈ Θ, where Θ ⊆ R is the parameter space, and is completely characterized by an initial distribution q_θ = [q_θ(x)]_{x∈S} and a stochastic transition matrix P_θ = [P_θ(x, y)]_{x,y∈S}, which satisfy the following conditions.

P_θ is irreducible, for all θ ∈ Θ. (1)

q_θ(x) > 0 ⇒ q_λ(x) > 0, for all θ, λ ∈ Θ, x ∈ S. (2)

P_θ(x, y) > 0 ⇒ P_λ(x, y) > 0, for all θ, λ ∈ Θ, x, y ∈ S. (3)

There are K Markovian arms with parameters θ = (θ_1, . . . , θ_K) ∈ Θ^K, and each arm a ∈ [K] = {1, . . . , K} evolves as a Markov chain with parameter θ_a, which we denote by {X^a_n}_{n∈Z≥0}. A non-constant real valued reward function f : S → R is applied at each state and produces the reward process {Y^a_n}_{n∈Z≥0} given by Y^a_n = f(X^a_n). We can only observe the reward process but not the internal Markov chain. Note that the reward process is a function of the Markov chain, and so in general it will have more complicated dependencies than the Markov chain. The reward process is a Markov chain if and only if f is injective.
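The failure of the Markov property for a non-injective f can be checked empirically. In the following sketch (our own toy example, not from the paper) two hidden states emit the same reward but behave very differently, so the past of the reward process carries information about its future beyond the current reward:

```python
import random

# Hidden states {0, 1, 2}; states 1 and 2 both emit reward 1 (f not injective),
# but state 1 quickly falls back to 0 while state 2 is sticky.
P = [[0.2, 0.4, 0.4],
     [0.9, 0.05, 0.05],
     [0.05, 0.05, 0.9]]
f = [0, 1, 1]

rng = random.Random(0)
x, ys = 0, []
for _ in range(200000):
    u, acc = rng.random(), 0.0
    for y, p in enumerate(P[x]):
        acc += p
        if u < acc:
            x = y
            break
    ys.append(f[x])

def cond_prob(prev):
    # empirical P(Y_{n+1} = 1 | Y_n = 1, Y_{n-1} = prev)
    hits = tot = 0
    for a, b, c in zip(ys, ys[1:], ys[2:]):
        if a == prev and b == 1:
            tot += 1
            hits += (c == 1)
    return hits / tot

p_after_0, p_after_1 = cond_prob(0), cond_prob(1)
```

If the reward process were Markov, both conditional probabilities would agree; here they differ substantially, because the extra reward of history helps identify which hidden state currently emits reward 1.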
For each θ ∈ Θ there is a unique stationary distribution π_θ = [π_θ(x)]_{x∈S} associated with the stochastic matrix P_θ, due to (1). This allows us to define the stationary reward of the Markov chain corresponding to the parameter θ as μ(θ) = Σ_x f(x) π_θ(x).

We will assume that among the K Markovian arms there exists precisely one that possesses the highest stationary mean, and we will denote this arm by a*(θ), so in particular

{a*(θ)} = arg max_{a∈[K]} μ(θ_a).

The set of all parameter configurations that possess a unique highest mean is denoted by

Θ* = { θ ∈ Θ^K : |arg max_{a∈[K]} μ(θ_a)| = 1 }.

The Kullback-Leibler divergence rate characterizes the sample complexity of the Markovian identification problem that we are about to study. For two Markov chains of the one-parameter family that are indexed by θ and λ respectively, it is given by

D(θ ‖ λ) = Σ_{x,y∈S} π_θ(x) P_θ(x, y) log( P_θ(x, y) / P_λ(x, y) ),

where we use the standard notational conventions α log(α/0) = ∞ if α > 0, and 0 log(0/0) = 0. It is always nonnegative, D(θ ‖ λ) ≥ 0, with equality occurring if and only if P_θ = P_λ, and so μ(θ) ≠ μ(λ) yields that D(θ ‖ λ) > 0.
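The divergence rate above is straightforward to compute numerically. The following sketch (illustrative; the helper names and the example matrices are ours) obtains the stationary distribution by power iteration and then evaluates the sum, relying on the absolute continuity condition P(x, y) > 0 ⇒ Q(x, y) > 0:

```python
import math

def stationary(P, iters=10000):
    # power iteration for the stationary distribution of an irreducible chain
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
    return pi

def kl_rate(P, Q):
    """KL divergence rate sum_{x,y} pi_P(x) P(x,y) log(P(x,y)/Q(x,y)),
    with the convention 0 log(0/0) = 0 (zero entries of P are skipped)."""
    pi = stationary(P)
    total = 0.0
    for x, row in enumerate(P):
        for y, p in enumerate(row):
            if p > 0.0:
                total += pi[x] * p * math.log(p / Q[x][y])
    return total

P = [[0.5, 0.5], [0.3, 0.7]]
Q = [[0.6, 0.4], [0.2, 0.8]]
```

For the two-state example, the stationary distribution of P is (0.375, 0.625), and kl_rate(P, P) = 0 while kl_rate(P, Q) > 0, matching the nonnegativity property above.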
Furthermore, D(θ ‖ λ) < ∞ due to (3).

With some abuse of notation we will also write D(P ‖ Q) for the Kullback-Leibler divergence between two probability measures P and Q on the same measurable space, which is defined as

D(P ‖ Q) = E_P[ log(dP/dQ) ] if P ≪ Q, and D(P ‖ Q) = ∞ otherwise,

where P ≪ Q means that P is absolutely continuous with respect to Q, and in that case dP/dQ denotes the Radon-Nikodym derivative of P with respect to Q.

2.2 Best Markovian Arm Identification with Fixed Confidence

Let θ ∈ Θ* be an unknown parameter configuration for the K Markovian arms. Let δ ∈ (0, 1) be a given confidence level. Our goal is to identify a*(θ) with probability at least 1 − δ using as few samples as possible. At each time t we select a single arm A_t and we observe the next sample from the reward process {Y^{A_t}_n}_{n∈Z≥0}, while all the other reward processes stay still. Let N_a(t) = Σ_{s=1}^t I{A_s = a} − 1 be the number of transitions of the Markovian arm a up to time t. Let F_t be the σ-field generated by our choices A_1, . . . , A_t and the observations {Y^1_n}_{n=0}^{N_1(t)}, . . . , {Y^K_n}_{n=0}^{N_K(t)}. A sampling strategy, A_δ, is a triple A_δ = ((A_t)_{t∈Z>0}, τ_δ, â_{τ_δ}) consisting of:

• a sampling rule (A_t)_{t∈Z>0}, which based on the past decisions and observations F_t determines which arm A_{t+1} we should sample next, so A_{t+1} is F_t-measurable;

• a stopping rule τ_δ, which denotes the end of the data collection phase and is a stopping time with respect to the filtration (F_t)_{t∈Z>0}, such that E^{A_δ}_λ[τ_δ] < ∞ for all λ ∈ Θ*;

• a decision rule â_{τ_δ}, which is F_{τ_δ}-measurable, and determines the arm that we estimate to be the best one.

Sampling strategies need to perform well across all possible parameter configurations in Θ*, therefore we need to restrict our strategies to a class of uniformly accurate strategies. This motivates the following standard definition.

Definition 1 (δ-PC). Given a confidence level δ ∈ (0, 1), a sampling strategy A_δ = ((A_t)_{t∈Z>0}, τ_δ, â_{τ_δ}) is called δ-PC (Probably Correct) if

P^{A_δ}_λ( â_{τ_δ} ≠ a*(λ) ) ≤ δ, for all λ ∈ Θ*.

Therefore our goal is to study the quantity

inf_{A_δ : δ-PC} E^{A_δ}_θ[τ_δ],

both in terms of finding a lower bound, i.e. establishing that no δ-PC strategy can have expected sample complexity less than our lower bound, and also in terms of finding an upper bound, i.e. a δ-PC strategy with very small expected sample complexity.
We will do so in the high confidence regime of δ → 0, by establishing instance specific lower and upper bounds which differ just by a factor of four.

3 Lower Bound on the Sample Complexity

Deriving lower bounds in the multi-armed bandits setting is a task performed by change of measure arguments initially introduced by Lai and Robbins (1985). Those change of measure arguments capture the simple idea that in order to identify the best arm we should at least be able to differentiate between two bandit models that exhibit different best arms but are statistically similar. Fix θ ∈ Θ*, and define the set of parameter configurations that exhibit as best arm an arm different than a*(θ) by

Alt(θ) = { λ ∈ Θ* : a*(λ) ≠ a*(θ) }.

Then we consider an alternative parametrization λ ∈ Alt(θ) and we write their log-likelihood ratio up to time t as

log( dP^{A_δ}_θ|F_t / dP^{A_δ}_λ|F_t ) = Σ_{a=1}^K I{N_a(t) ≥ 0} log( q_{θ_a}(X^a_0) / q_{λ_a}(X^a_0) ) + Σ_{a=1}^K Σ_{x,y} N_a(x, y, 0, t) log( P_{θ_a}(x, y) / P_{λ_a}(x, y) ), (4)

where N_a(x, y, 0, t) = Σ_{s=0}^{t−1} 1{X^a_s = x, X^a_{s+1} = y}. The log-likelihood ratio enables us to perform changes of measure for fixed times t, and more generally for stopping times τ with respect to (F_t)_{t∈Z>0} which are P^{A_δ}_θ-a.s. and P^{A_δ}_λ-a.s. finite, through the following change of measure formula:

P^{A_δ}_λ(E) = E^{A_δ}_θ[ ( dP^{A_δ}_λ|F_τ / dP^{A_δ}_θ|F_τ ) I_E ], for any E ∈ F_τ. (5)

In order to derive our lower bound we use a technique developed for the IID case by Garivier and Kaufmann (2016) which combines several changes of measure at once. To make this technique work in the Markovian setting we need the following inequality, which we derive in Appendix A using a renewal argument for Markov chains.

Lemma 1. Let θ ∈ Θ* and λ ∈ Alt(θ) be two parameter configurations. Let τ be a stopping time with respect to (F_t)_{t∈Z>0}, with E^{A_δ}_θ[τ], E^{A_δ}_λ[τ] < ∞. Then

D( P^{A_δ}_θ|F_τ ‖ P^{A_δ}_λ|F_τ ) ≤ Σ_{a=1}^K D( q_{θ_a} ‖ q_{λ_a} ) + Σ_{a=1}^K ( E^{A_δ}_θ[N_a(τ)] + R_{θ_a} ) D( θ_a ‖ λ_a ),

where R_{θ_a} = E_{θ_a}[ inf{n > 0 : X^a_n = X^a_0} ] < ∞; the first summand is finite due to (2), and the second summand is finite due to (3).

Combining those ingredients with the data processing inequality, we derive our instance specific lower bound for the Markovian bandit identification problem in Appendix A.

Theorem 1. Assume that the one-parameter family of Markov chains on the finite state space S satisfies conditions (1), (2), and (3).
Fix δ ∈ (0, 1), let f : S → R be a nonconstant reward function, let A_δ be a δ-PC sampling strategy, and fix a parameter configuration θ ∈ Θ*. Then

T*(θ) ≤ lim inf_{δ→0} E^{A_δ}_θ[τ_δ] / log(1/δ),

where

T*(θ)^{−1} = sup_{w∈M_1([K])} inf_{λ∈Alt(θ)} Σ_{a=1}^K w_a D(θ_a ‖ λ_a),

and M_1([K]) denotes the set of all probability distributions on [K].

As noted in Garivier and Kaufmann (2016) the sup in the definition of T*(θ) is actually attained uniquely, and therefore we can define w*(θ) as the unique maximizer,

{w*(θ)} = arg max_{w∈M_1([K])} inf_{λ∈Alt(θ)} Σ_{a=1}^K w_a D(θ_a ‖ λ_a).

4 One-Parameter Exponential Family of Markov Chains

4.1 Definition and Basic Properties

In this section we instantiate the abstract one-parameter family of Markov chains from Subsection 2.1 with the one-parameter exponential family of Markov chains. Given the finite state space S, and the nonconstant reward function f : S → R, we define M = max_x f(x) and m = min_x f(x). Based on f we construct two subsets of the state space, S_M = {x ∈ S : f(x) = M} and S_m = {x ∈ S : f(x) = m}, corresponding to states of maximum and minimum f-value respectively. Our goal is to create a family of Markov chains which can realize any stationary mean in the interval (m, M), which will later be used to model the Markovian arms.
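The sup-inf defining T*(θ)^{−1} can be approximated numerically. As a purely illustrative stand-in for the Markovian divergence rate we use the Gaussian-style divergence D(μ ‖ λ) = (μ − λ)²/2, under which the inner infimum over alternatives that swap the best arm with a competitor b has the closed form (w_a w_b / (w_a + w_b)) · (μ_a − μ_b)²/2, as is well known from the IID best arm identification literature; the crude random search over the simplex and all function names below are our own assumptions, not the paper's method.

```python
import math
import random

def inner_inf(w, mu, best):
    # inf over alternatives: the minimizing lambda moves arms `best` and `b`
    # to their w-weighted mean, one competitor b at a time
    vals = []
    for b in range(len(mu)):
        if b == best:
            continue
        gap2 = (mu[best] - mu[b]) ** 2 / 2.0
        vals.append(w[best] * w[b] / (w[best] + w[b]) * gap2)
    return min(vals)

def t_star_inv(mu, samples=100000, seed=0):
    # random search for sup over the probability simplex M_1([K])
    rng = random.Random(seed)
    best = max(range(len(mu)), key=lambda a: mu[a])
    out = 0.0
    for _ in range(samples):
        raw = [-math.log(1.0 - rng.random()) for _ in mu]  # ~ Dirichlet(1,...,1)
        s = sum(raw)
        w = [r / s for r in raw]
        out = max(out, inner_inf(w, mu, best))
    return out

t2 = t_star_inv([1.0, 0.0])        # two arms: optimum is w = (1/2, 1/2)
t3 = t_star_inv([1.0, 0.5, 0.0])   # an extra close competitor slows things down
```

For two arms with gap 1 the objective reduces to w_0 w_1 / 2, maximized at w = (1/2, 1/2) with value 1/8, i.e. T* = 8; adding a third arm can only shrink the inner infimum and hence increase T*.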
Towards this goal we use as a generator for our family an irreducible stochastic matrix P which satisfies the following properties.

The submatrix of P with rows and columns in S_M is irreducible. (6)

For every x ∈ S − S_M, there is a y ∈ S_M such that P(x, y) > 0. (7)

The submatrix of P with rows and columns in S_m is irreducible. (8)

For every x ∈ S − S_m, there is a y ∈ S_m such that P(x, y) > 0. (9)

For example, a positive stochastic matrix, i.e. one where all the transition probabilities are positive, satisfies all those properties. Note that in practice this can always be attained by substituting zero transition probabilities with ε transition probabilities, where ε ∈ (0, 1) is some small constant.

Our parameter space will be the whole real line, Θ = R. Given a parameter θ ∈ Θ, we pick an arbitrary initial distribution q_θ ∈ M_1(S) such that q_θ(x) > 0 for all x ∈ S, and we tilt exponentially all the transitions of P by constructing the matrix P̃_θ(x, y) = P(x, y) e^{θ f(y)}. Note that P̃_θ is not a stochastic matrix, but we can normalize it and turn it into a stochastic matrix by invoking Perron-Frobenius theory. Let ρ(θ) be the spectral radius of P̃_θ. From Perron-Frobenius theory we know that ρ(θ) is a simple eigenvalue of P̃_θ, called the Perron-Frobenius eigenvalue, associated with unique left and right eigenvectors u_θ, v_θ such that they are both positive, Σ_x u_θ(x) = 1, and Σ_x u_θ(x) v_θ(x) = 1; see for instance Theorem 8.4.4 in the book of Horn and Johnson (2013). Let A(θ) = log ρ(θ) be the log-Perron-Frobenius eigenvalue, a quantity which plays a role similar to that of a log-moment-generating function.
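The construction of ρ(θ) and A(θ), together with the identity Ȧ(θ) = μ(θ) established in Lemma 2 below, can be verified numerically. This sketch (our own; plain power iteration on a 2 × 2 example) tilts a generator P by e^{θ f(y)}, extracts the Perron-Frobenius eigenvalue and right eigenvector, normalizes to the stochastic matrix P_θ as described next, and compares a finite difference of A against the stationary mean:

```python
import math

P = [[0.5, 0.5], [0.4, 0.6]]  # generator stochastic matrix (theta = 0 member)
f = [0.0, 1.0]                # reward function on the two states

def perron(theta, iters=2000):
    # dominant eigenvalue/right eigenvector of P_tilde(x,y) = P(x,y) e^{theta f(y)}
    Pt = [[P[x][y] * math.exp(theta * f[y]) for y in (0, 1)] for x in (0, 1)]
    v, rho = [1.0, 1.0], 1.0
    for _ in range(iters):
        w = [sum(Pt[x][y] * v[y] for y in (0, 1)) for x in (0, 1)]
        rho = max(w)
        v = [wx / rho for wx in w]
    return rho, v

def A(theta):
    return math.log(perron(theta)[0])  # log-Perron-Frobenius eigenvalue

def stationary_mean(theta):
    # normalize P_tilde into P_theta, then get its stationary mean by iteration
    rho, v = perron(theta)
    Pt = [[P[x][y] * math.exp(theta * f[y]) * v[y] / (rho * v[x])
           for y in (0, 1)] for x in (0, 1)]
    pi = [0.5, 0.5]
    for _ in range(5000):
        pi = [pi[0] * Pt[0][y] + pi[1] * Pt[1][y] for y in (0, 1)]
    return pi[0] * f[0] + pi[1] * f[1]

h = 1e-5
dA_at_1 = (A(1.0 + h) - A(1.0 - h)) / (2 * h)  # finite-difference A'(1)
mu_at_1 = stationary_mean(1.0)
```

At θ = 0 the tilted matrix is P itself, so ρ(0) = 1 and A(0) = 0; the finite difference of A matches the stationary mean of P_θ, and larger θ shifts the stationary mean towards the high-reward state.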
From P̃_θ we can construct an irreducible nonnegative matrix

P_θ(x, y) = P̃_θ(x, y) v_θ(y) / ( ρ(θ) v_θ(x) ) = ( v_θ(y) / v_θ(x) ) e^{θ f(y) − A(θ)} P(x, y),

which is stochastic, since

Σ_y P_θ(x, y) = ( 1 / (ρ(θ) v_θ(x)) ) · Σ_y P̃_θ(x, y) v_θ(y) = 1.

In addition its stationary distribution is given by

π_θ(x) = u_θ(x) v_θ(x),

since

Σ_x π_θ(x) P_θ(x, y) = ( v_θ(y) / ρ(θ) ) · Σ_x u_θ(x) P̃_θ(x, y) = u_θ(y) v_θ(y) = π_θ(y).

Note that the generator stochastic matrix P is the member of the family that corresponds to θ = 0, i.e. P = P_0, ρ(0) = 1, and A(0) = 0.

The following lemma, whose proof is presented in Appendix B, suggests that the family can be reparametrized using the mean parameters μ(θ). More specifically Ȧ is a strictly increasing bijection between the set Θ of canonical parameters and the set M = {μ ∈ (m, M) : μ(θ) = μ, for some θ ∈ Θ} of mean parameters. Therefore with some abuse of notation, we will write u_μ, v_μ, P_μ, π_μ for u_{Ȧ^{−1}(μ)}, v_{Ȧ^{−1}(μ)}, P_{Ȧ^{−1}(μ)}, π_{Ȧ^{−1}(μ)}, and D(μ_1 ‖ μ_2) for D( Ȧ^{−1}(μ_1) ‖ Ȧ^{−1}(μ_2) ).

Lemma 2. Let P be an irreducible stochastic matrix on a finite state space S which combined with a real-valued function f : S → R satisfies (6), (7), (8) and (9).
Then the following properties hold true for the exponential family of stochastic matrices generated by P and f.

(a) ρ(θ), A(θ), u_θ and v_θ are analytic functions of θ on Θ = R.

(b) Ȧ(θ) = μ(θ), for all θ ∈ Θ.

(c) Ȧ(θ) is strictly increasing.

(d) M = (m, M).

4.2 Concentration for Markov Chains

For a Markov chain {X_n}_{n∈Z≥0}, driven by an irreducible transition matrix P and an initial distribution q, the large deviations theory, Miller (1961); Donsker and Varadhan (1975); Ellis (1984); Dembo and Zeitouni (1998), suggests that the probability of the large deviation event {f(X_1) + . . . + f(X_n) ≥ nμ}, when μ is greater than or equal to the stationary mean μ(0), is asymptotically an exponential decay with the rate of the decay given by a Kullback-Leibler divergence rate. In particular, Theorem 3.1.2 from Dembo and Zeitouni (1998) in our context can be written as

lim_{n→∞} (1/n) log P_0( f(X_1) + . . . + f(X_n) ≥ nμ ) = −A*(μ), for any μ ≥ μ(0),

where A*(μ) = sup_{θ∈R} {θμ − A(θ)} is the convex conjugate of the log-Perron-Frobenius eigenvalue and represents a Kullback-Leibler divergence rate, as we illustrate in Lemma 10.

In the following theorem we present a concentration inequality for Markov chains which attains the rate of exponential decay prescribed by the large deviations theory, as well as a constant prefactor which is independent of μ.

Theorem 2. Let S be a finite state space, and let P be an irreducible stochastic matrix on S, which combined with a function f : S → R satisfies (6), (7), (8), and (9).
Fix θ ∈ R, and let {X_n}_{n∈Z≥0} be a Markov chain on S, which is driven by P_θ, the stochastic matrix from the exponential family which corresponds to the parameter θ and has stationary mean μ(θ). Then

P_θ( f(X_1) + . . . + f(X_n) ≥ nμ ) ≤ C² e^{−n D(μ ‖ μ(θ))}, for μ ∈ [μ(θ), M],

where C = C(P, f) is a constant depending only on the generator stochastic matrix P and the function f. In particular, if P is a positive stochastic matrix then we can take C = max_{x,y,z} P(y, z)/P(x, z).

We note that in the special case that the process is an IID process the constant C(P, f) can be taken to be 1, and thus Theorem 2 generalizes the classic Cramér-Chernoff bound, Chernoff (1952). Observe also that Theorem 2 has a straightforward counterpart for the lower tail as well.

Moreover our inequality is optimal up to the constant prefactor, since the exponential decay is unimprovable due to the large deviations theory, while with respect to the prefactor we cannot expect anything better than a constant because otherwise we would contradict the central limit theorem for Markov chains. In particular, when our conditions on P and f are met, our bound dominates similar bounds given by Davisson et al.
(1981); Gillman (1993); Dinwoodie (1995); Lezaud (1998); León and Perron (2004).

We give a proof of Theorem 2 in Appendix C, where the main techniques involved are a uniform upper bound on the ratio of the entries of the right Perron-Frobenius eigenvector, as well as an approximation of the log-Perron-Frobenius eigenvalue using the log-moment-generating function.

5 Upper Bound on the Sample Complexity: the (α, δ)-Track-and-Stop Strategy

The (α, δ)-Track-and-Stop strategy, which was proposed in Garivier and Kaufmann (2016) in order to tackle the IID setting, tries to track the optimal weights w*_a(θ). In the sequel we will also write w*(μ), with μ = (μ(θ_1), . . . , μ(θ_K)), to denote w*(θ). Not having access to μ, the (α, δ)-Track-and-Stop strategy tries to approximate μ using sample means. Let μ̂(t) = (μ̂_1(N_1(t)), . . . , μ̂_K(N_K(t))) be the sample means of the K Markov chains when t samples have been observed overall, where the very first sample from each Markov chain is excluded from the calculation of its sample mean, i.e.

μ̂_a(N_a(t)) = ( 1 / N_a(t) ) Σ_{s=1}^{N_a(t)} Y^a_s.

By imposing sufficient exploration the law of large numbers for Markov chains will kick in and the sample means μ̂(t) will almost surely converge to the true means μ, as t → ∞.

We proceed by briefly describing the three components of the (α, δ)-Track-and-Stop strategy.

5.1 Sampling Rule: Tracking the Optimal Proportions

For initialization reasons the first 2K samples that we are going to observe are Y^1_0, Y^1_1, . . . , Y^K_0, Y^K_1. After that, for t ≥ 2K we let U_t = {a : N_a(t) < √t − K/2} and we follow the tracking rule:

A_{t+1} ∈ arg min_{a∈U_t} N_a(t), if U_t ≠ ∅ (forced exploration),
A_{t+1} ∈ arg max_{a=1,...,K} { w*_a(μ̂(t)) − N_a(t)/t }, otherwise (direct tracking).

The forced exploration step is there to ensure that μ̂(t) → μ almost surely as t → ∞. Then the continuity of μ ↦ w*(μ), combined with the direct tracking step, guarantees that almost surely the frequencies N_a(t)/t converge to the optimal weights w*_a(μ) for all a = 1, . . . , K.

5.2 Stopping Rule: the (α, δ)-Chernoff Stopping Rule

For the stopping rule we will need the following statistics. For any two distinct arms a, b, if μ̂_a(N_a(t)) ≥ μ̂_b(N_b(t)) we define

Z_{a,b}(t) = ( N_a(t) / (N_a(t) + N_b(t)) ) D( μ̂_a(N_a(t)) ‖ μ̂_{a,b}(N_a(t), N_b(t)) ) + ( N_b(t) / (N_a(t) + N_b(t)) ) D( μ̂_b(N_b(t)) ‖ μ̂_{a,b}(N_a(t), N_b(t)) ),

while if μ̂_a(N_a(t)) < μ̂_b(N_b(t)), we define Z_{a,b}(t) = −Z_{b,a}(t), where

μ̂_{a,b}(N_a(t), N_b(t)) = ( N_a(t) / (N_a(t) + N_b(t)) ) μ̂_a(N_a(t)) + ( N_b(t) / (N_a(t) + N_b(t)) ) μ̂_b(N_b(t)).

Note that the statistics Z_{a,b}(t) do not arise as the closed form solutions of the Generalized Likelihood Ratio statistics for Markov chains, as is the case in the IID bandits setting.

For a confidence level δ ∈ (0, 1), and a convergence parameter α > 1, we define the (α, δ)-Chernoff stopping rule following Garivier and Kaufmann (2016):

τ_{α,δ} = inf{ t ∈ Z>0 : ∃a ∈ {1, . . . , K} ∀b ≠ a, Z_{a,b}(t) > (0 ∨ β_{α,δ}(t)) },

where β_{α,δ}(t) = 2 log( D t^α / δ ), D = 2αKC²/(α − 1), and C = C(P, f) is the constant from Lemma 11. In the special case that P is a positive stochastic matrix we can explicitly set C = max_{x,y,z} P(y, z)/P(x, z). It is important to notice that the constant C = C(P, f) does not depend on the bandit instance θ or the confidence level δ, but only on the generator stochastic matrix P and the reward function f. In other words it is a characteristic of the exponential family of Markov chains and not of the particular bandit instance θ under consideration.

5.3 Decision Rule: Best Sample Mean

For a fixed arm a it is clear that min_{b≠a} Z_{a,b}(t) > 0 if and only if μ̂_a(N_a(t)) > μ̂_b(N_b(t)) for all b ≠ a. Hence the following simple decision rule is well defined when used in conjunction with the (α, δ)-Chernoff stopping rule:

{â_{τ_{α,δ}}} = arg max_{a=1,...,K} μ̂_a(N_a(τ_{α,δ})).

5.4 Sample Complexity Analysis

In this section we establish that the (α, δ)-Track-and-Stop strategy is δ-PC, and we upper bound its expected sample complexity. In order to do this we use our Markovian concentration bound, Theorem 2. We first use it in order to establish the following uniform deviation bound.

Lemma 3. Let θ ∈ Θ*, δ ∈ (0, 1), and α > 1. Let A_δ be a sampling strategy that uses an arbitrary sampling rule, the (α, δ)-Chernoff stopping rule and the best sample mean decision rule.
Then, for any arm a,

    \mathbb{P}^{\mathcal{A}_\delta}_{\theta} \left( \exists t \in \mathbb{Z}_{>0} : N_a(t)\, D\left( \hat{\mu}_a(N_a(t)) \,\middle\|\, \mu_a \right) \geq \beta_{\alpha,\delta}(t)/2 \right) \leq \frac{\delta}{K}.

With this in our possession we are able to prove in Appendix D that the (α, δ)-Track-and-Stop strategy is δ-PC.

Proposition 1. Let δ ∈ (0, 1), and α ∈ (1, e/4]. The (α, δ)-Track-and-Stop strategy is δ-PC.

Finally, we obtain that in the high confidence regime, δ → 0, the (α, δ)-Track-and-Stop strategy has a sample complexity which is at most 4α times the asymptotic lower bound that we established in Theorem 1.

Theorem 3. Let θ ∈ Θ, and α ∈ (1, e/4]. The (α, δ)-Track-and-Stop strategy, denoted here by A_δ, has its asymptotic expected sample complexity upper bounded by

    \limsup_{\delta \to 0} \frac{\mathbb{E}^{\mathcal{A}_\delta}_{\theta}[\tau_{\alpha,\delta}]}{\log(1/\delta)} \leq 4\alpha\, T^*(\theta).

Acknowledgements

We would like to thank Venkat Anantharam, Jim Pitman and Satish Rao for many helpful discussions. This research was supported in part by the NSF grant CCF-1816861.

References

Anantharam, V., Varaiya, P., and Walrand, J. (1987a). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. I. I.I.D. rewards. IEEE Trans. Automat. Control, 32(11):968–976.

Anantharam, V., Varaiya, P., and Walrand, J. (1987b). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control, 32(11):977–982.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122.

Chernoff, H. (1952).
A measure of asymptotic ef\ufb01ciency for tests of a hypothesis based on the sum\n\nof observations. Ann. Math. Statistics, 23:493\u2013507.\n\nChung, K.-M., Lam, H., Liu, Z., and Mitzenmacher, M. (2012). Chernoff-Hoeffding Bounds for\n\nMarkov Chains: Generalized and Simpli\ufb01ed. In STACS.\n\nCover, T. M. and Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience [John\n\nWiley & Sons], Hoboken, NJ, second edition.\n\nCsisz\u00e1r, I., Cover, T. M., and Choi, B. S. (1987). Conditional limit theorems under Markov condition-\n\ning. IEEE Trans. Inform. Theory, 33(6):788\u2013801.\n\nDavisson, L. D., Longo, G., and Sgarro, A. (1981). The error exponent for the noiseless encoding of\n\n\ufb01nite ergodic Markov sources. IEEE Trans. Inform. Theory, 27(4):431\u2013438.\n\nDembo, A. and Zeitouni, O. (1998). Large deviations techniques and applications, volume 38 of\n\nApplications of Mathematics (New York). Springer-Verlag, New York, second edition.\n\nDinwoodie, I. H. (1995). A probability inequality for the occupation measure of a reversible Markov\n\nchain. Ann. Appl. Probab., 5(1):37\u201343.\n\nDonsker, M. D. and Varadhan, S. R. S. (1975). Asymptotic evaluation of certain Markov process\n\nexpectations for large time. I. II. Comm. Pure Appl. Math., 28:1\u201347; ibid. 28 (1975), 279\u2013301.\n\nDurrett, R. (2010). Probability: theory and examples, volume 31 of Cambridge Series in Statistical\n\nand Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition.\n\nEllis, R. S. (1984). Large deviations for a general class of random vectors. Ann. Probab., 12(1):1\u201312.\n\nEven-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for\nthe multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079\u20131105.\n\nGarivier, A. and Kaufmann, E. (2016). 
Optimal best arm identification with fixed confidence. Proceedings of the 29th Conference On Learning Theory, 49:1–30.

Gillman, D. (1993). A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA.

Hayashi, M. and Watanabe, S. (2016). Information geometry approach to parameter estimation in Markov chains. Ann. Statist., 44(4):1495–1535.

Horn, R. A. and Johnson, C. R. (2013). Matrix analysis. Cambridge University Press, Cambridge, second edition.

Jamieson, K. G., Malloy, M., Nowak, R. D., and Bubeck, S. (2014). lil' UCB : An Optimal Exploration Algorithm for Multi-Armed Bandits. In COLT, volume 35 of JMLR Workshop and Conference Proceedings, pages 423–439.

Kaufmann, E. and Koolen, W. (2018). Mixture martingales revisited with applications to sequential tests and confidence intervals.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22.

Lax, P. D. (2007). Linear algebra and its applications. Pure and Applied Mathematics (Hoboken). Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.

León, C. A. and Perron, F. (2004). Optimal Hoeffding bounds for discrete reversible Markov chains. Ann. Appl. Probab., 14(2):958–970.

Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab., 8(3):849–867.

Mannor, S. and Tsitsiklis, J. N. (2003/04). The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648.

Miller, H. D. (1961). A convexity property in the theory of random variables defined on a finite Markov chain. Ann. Math. Statist., 32:1260–1270.

Nagaoka, H. (2005). The exponential family of Markov chains and its information geometry.
In Proceedings of The 28th Symposium on Information Theory and Its Applications (SITA2005), pages 1091–1095, Okinawa, Japan.

Nakagawa, K. and Kanaya, F. (1993). On the converse theorem in statistical hypothesis testing for Markov chains. IEEE Trans. Inform. Theory, 39(2):629–633.

Ortega, J. M. (1990). Numerical analysis, volume 3 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, second edition. A second course.

Wainwright, M. J. and Jordan, M. I. (2008). Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn., 1(1-2):1–305.
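For concreteness, the three components of Section 5 (tracking sampling rule, Chernoff stopping rule, best-sample-mean decision rule) can be sketched in code. The sketch below is illustrative only, under several stated simplifications that are not part of the paper: the Markovian reward streams are replaced by IID Bernoulli arms, so the Bernoulli KL divergence stands in for the divergence D of the exponential family of Markov chains; a caller-supplied `weights` oracle stands in for the optimal weights w*(µ); the constant C = C(P, f) is a placeholder argument; and the paper's exclusion of the initial state from each sample mean is ignored. The helper names (`track_and_stop`, `z_statistic`) are our own, not the author's.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-9):
    # Bernoulli KL divergence: an IID stand-in for the paper's divergence D.
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def z_statistic(mu, n, a, b):
    # Z_{a,b}(t) = N_a D(mu_a || mu_ab) + N_b D(mu_b || mu_ab), with mu_ab the
    # count-weighted average of the two sample means; antisymmetric otherwise.
    if mu[a] < mu[b]:
        return -z_statistic(mu, n, b, a)
    mu_ab = (n[a] * mu[a] + n[b] * mu[b]) / (n[a] + n[b])
    return n[a] * kl_bernoulli(mu[a], mu_ab) + n[b] * kl_bernoulli(mu[b], mu_ab)

def track_and_stop(arms, delta, alpha, weights, C=1.0, t_max=100_000):
    # arms: callables returning one reward sample each.
    # weights: oracle mu -> proxy for the optimal proportions w*(mu).
    # C: placeholder for the constant C(P, f) of the paper.
    K = len(arms)
    D_const = 2 * alpha * K * C ** 2 / (alpha - 1)
    n, s = [0] * K, [0.0] * K
    for a in range(K):                      # initialization: 2K samples
        for _ in range(2):
            s[a] += arms[a](); n[a] += 1
    t = 2 * K
    while t < t_max:
        mu = [s[a] / n[a] for a in range(K)]
        beta = 2 * math.log(D_const * t ** alpha / delta)
        for a in range(K):                  # (alpha, delta)-Chernoff stopping rule
            if all(z_statistic(mu, n, a, b) > max(0.0, beta)
                   for b in range(K) if b != a):
                return a, t                 # decision: best sample mean
        under = [a for a in range(K) if n[a] < math.sqrt(t) - K / 2]
        if under:                           # forced exploration
            a = min(under, key=lambda i: n[i])
        else:                               # direct tracking of the oracle weights
            w = weights(mu)
            a = max(range(K), key=lambda i: w[i] - n[i] / t)
        s[a] += arms[a](); n[a] += 1
        t += 1
    raise RuntimeError("no decision within t_max samples")
```

With a uniform proxy in place of w*(µ) the direct-tracking step degenerates to near round-robin sampling; the stopping and decision rules are nevertheless valid for an arbitrary sampling rule, which is exactly the setting of Lemma 3. Plugging in the true optimal-weights oracle recovers the tracking behavior analyzed in Section 5.4.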