{"title": "Online Markov Decision Processes under Bandit Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 1804, "page_last": 1812, "abstract": "We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved convergence rate result for the problem.", "full_text": "Online Markov Decision Processes under Bandit\n\nFeedback\n\nGergely Neu\u2217\u2020\n\n\u2217Department of Computer Science and\n\nInformation Theory, Budapest University of\n\nTechnology and Economics, Hungary\n\nneu.gergely@gmail.com\n\nAndr\u00b4as Gy\u00a8orgy\n\n\u2020Machine Learning Research Group\nMTA SZTAKI Institute for Computer\n\nScience and Control, Hungary\n\ngya@szit.bme.hu\n\nCsaba Szepesv\u00b4ari\n\nDepartment of Computing Science,\n\nUniversity of Alberta, Canada\nszepesva@ualberta.ca\n\nAndr\u00b4as Antos\n\nMachine Learning Research Group\nMTA SZTAKI Institute for Computer\n\nScience and Control, Hungary\n\nantos@szit.bme.hu\n\nAbstract\n\nWe consider online learning in \ufb01nite stochastic Markovian environments where in\neach time step a new reward function is chosen by an oblivious adversary. 
The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3}(ln T)^{1/3}), giving the first rigorously proved regret bound for the problem.

1 Introduction

We consider online learning in finite Markov decision processes (MDPs) with fixed, known dynamics. The formal problem definition is as follows: An agent navigates in a finite stochastic environment by selecting actions based on the states and rewards experienced previously. At each time instant the agent observes the reward associated with the last transition and the current state; that is, at time t + 1 the agent observes r_t(x_t, a_t), where x_t is the state visited at time t and a_t is the action chosen. The agent does not observe the rewards associated with other transitions, that is, the agent faces a bandit situation. The goal of the agent is to maximize its total expected reward R̂_T in T steps. As opposed to the standard MDP setting, the reward function at each time step may be different. The only assumption about this sequence of reward functions r_t is that they are chosen ahead of time, independently of how the agent acts. However, no statistical assumptions are made about the choice of this sequence.
As usual in such cases, a meaningful performance measure for the agent is how well it can compete with a certain class of reference policies, in our case the set of all stationary policies: If R*_T denotes the expected total reward in T steps that can be collected by choosing the best stationary policy (this policy can be chosen based on the full knowledge of the sequence r_t), the goal of learning can be expressed as minimizing the total expected regret, L̂_T = R*_T − R̂_T.

In this paper we propose a new algorithm for this setting. Assuming that the stationary distributions underlying stationary policies exist, are unique and uniformly bounded away from zero, and that these policies mix uniformly fast, our main result shows that the total expected regret of our algorithm in T time steps is O(T^{2/3}(ln T)^{1/3}).

The first work that considered a similar online learning setting is due to Even-Dar et al. (2005, 2009). In fact, this is the work that provides the starting point for our algorithm and analysis. The major difference between our work and that of Even-Dar et al. (2005, 2009) is that they assume that the reward function is fully observed (i.e., in each time step the learning agent observes the whole reward function r_t), whereas we consider the bandit setting. The main result in these works is a bound on the total expected regret which scales with the square root of the number of time steps, under mixing assumptions identical to ours. Another work that considered the full information problem is due to Yu et al. (2009), who proposed new algorithms and proved a bound on the expected regret of order O(T^{3/4+ε}) for arbitrary ε ∈ (0, 1/3). The advantage of the algorithm of Yu et al. (2009) over that of Even-Dar et al. (2009) is that it is computationally less expensive, which, however, comes at the price of an increased bound on the regret.
Yu et al. (2009) introduced another algorithm (“Q-FPL”) and showed a sublinear (o(T)) almost sure bound on the regret.

All the works reviewed so far considered the full information case. The requirement that the full reward function must be given to the agent at every time step significantly limits their applicability. There are only three papers that we know of where the bandit situation was considered. The first paper which falls into this category is due to Yu et al. (2009), who proposed an algorithm (“Exploratory FPL”) for this setting and showed an o(T) almost sure bound on the regret. Recently, Neu et al. (2010) gave O(√T) regret bounds for a special bandit setting where the agent interacts with a loop-free episodic environment. The algorithm and analysis in this work heavily exploit the specifics of these environments (i.e., that in the same episode no state can be visited twice) and so they do not generalize to our setting. Another closely related work is due to Yu and Mannor (2009a,b), who considered the problem of online learning in MDPs where the transition probabilities may also change arbitrarily after each transition. This problem, however, is significantly different from ours and the algorithms studied are unsuitable for our setting. Further, the analysis in these papers seems to have gaps (see Neu et al., 2010). Thus, currently, the only result for the case considered in this paper is an asymptotic “no-regret” result.

The rest of the paper is organized as follows: The problem is laid out in Section 2, which is followed by a section about our assumptions (Section 3).
The algorithm and the main result are given in Section 4, while a proof sketch of the latter is presented in Section 5.

2 Problem definition

Formally, a finite Markov Decision Process (MDP) M is defined by a finite state space X, a finite action set A, a transition probability kernel P : X × A × X → [0, 1], and a reward function r : X × A → [0, 1]. In time step t ∈ {1, 2, . . .}, knowing the state x_t ∈ X, an agent acting in the MDP M chooses an action a_t ∈ A(x_t) to be executed based on (x_t, r(x_{t−1}, a_{t−1}), a_{t−1}, x_{t−1}, . . . , x_2, r(x_1, a_1), a_1, x_1).¹ Here A(x) ⊂ A is the set of admissible actions at state x. As a result of executing the chosen action the process moves to state x_{t+1} ∈ X with probability P(x_{t+1}|x_t, a_t) and the agent receives reward r(x_t, a_t). In the so-called average-reward problem, the goal of the agent is to maximize the average reward received over time. For a more detailed introduction the reader is referred to, for example, Puterman (1994).

2.1 Online learning in MDPs

In this paper we consider the online version of MDPs when the reward function is allowed to change arbitrarily. That is, instead of a single reward function r, a sequence of reward functions {r_t} is given. This sequence is assumed to be fixed ahead of time, and, for simplicity, we assume that r_t(x, a) ∈ [0, 1] for all (x, a) ∈ X × A and t ∈ {1, 2, . . .}. No other assumptions are made about this sequence.

¹We follow the convention that boldface letters denote random variables.

The learning agent is assumed to know the transition probabilities P, but is not given the sequence {r_t}. The protocol of interaction with the environment is unchanged: At time step t the agent receives x_t and then selects an action a_t which is sent to the environment.
In response, the reward r_t(x_t, a_t) and the next state x_{t+1} are communicated to the agent. The initial state x_1 is generated from a fixed distribution P_0.

The goal of the learning agent is to maximize its expected total reward

R̂_T = E[ Σ_{t=1}^T r_t(x_t, a_t) ].

An equivalent goal is to minimize the regret, that is, to minimize the difference between the expected total reward received by the best algorithm within some reference class and the expected total reward of the learning algorithm. In the case of MDPs a reasonable reference class, used by various previous works (e.g., Even-Dar et al., 2005, 2009; Yu et al., 2009), is the class of stationary stochastic policies.² A stationary stochastic policy π (or, in short: a policy) is a mapping π : A × X → [0, 1], where π(a|x) ≡ π(a, x) is the probability of taking action a in state x. We say that a policy π is followed in an MDP if the action at time t is drawn from π, independently of previous states and actions given the current state x′_t: a′_t ∼ π(·|x′_t). The expected total reward while following a policy π is defined as

R^π_T = E[ Σ_{t=1}^T r_t(x′_t, a′_t) ],

where {(x′_t, a′_t)} denotes the trajectory that results from following policy π from x′_1 ∼ P_0. The expected regret (or expected relative loss) of the learning agent relative to the class of policies (in short, the regret) is defined as

L̂_T = sup_π R^π_T − R̂_T,

where the supremum is taken over all (stochastic stationary) policies. Note that the optimal policy is chosen in hindsight, depending acausally on the reward function.
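The quantities R̂_T and R^π_T above are expectations over trajectories. For a stationary policy, R^π_T can be computed exactly by propagating the state distribution one step at a time. The following sketch is not from the paper; the toy kernel, policy, and reward numbers are invented purely for illustration.

```python
def expected_total_reward(P, pi, P0, rewards):
    """Exact R^pi_T = E[sum_t r_t(x'_t, a'_t)] for a stationary policy pi.

    P[x][a][y] : transition probability of x --a--> y
    pi[x][a]   : probability of taking action a in state x
    P0[x]      : initial state distribution
    rewards    : list of T reward functions, rewards[t][x][a] in [0, 1]
    """
    d = list(P0)                      # state distribution at the current step
    n = len(d)
    total = 0.0
    for r in rewards:
        # expected reward collected at this step under distribution d and policy pi
        total += sum(d[x] * pi[x][a] * r[x][a]
                     for x in range(n) for a in range(len(pi[x])))
        # propagate the state distribution one step: d <- d P^pi
        d = [sum(d[x] * pi[x][a] * P[x][a][y]
                 for x in range(n) for a in range(len(pi[x])))
             for y in range(n)]
    return total

# Toy 2-state, 2-action MDP (hypothetical numbers, illustration only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.7, 0.3]]]
pi = [[0.5, 0.5], [1.0, 0.0]]
P0 = [1.0, 0.0]
T = 10
rewards = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(T)]
print(expected_total_reward(P, pi, P0, rewards))
```

A quick sanity check: with r_t ≡ 1 the sum is exactly T, since the propagation conserves probability mass; with rewards in [0, 1] the total always lies in [0, T].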
If the regret of an agent grows sublinearly with T then we can say that in the long run it acts as well as the best (stochastic stationary) policy (i.e., the average expected reward of the agent is asymptotically equal to that of the best policy).

3 Assumptions

In this section we list the assumptions that we make throughout the paper about the transition probability kernel (hence, these assumptions will not be mentioned in the subsequent results). In addition, recall that we assume that the rewards are bounded in [0, 1].

Before describing the assumptions, a few more definitions are needed: Let π be a stationary policy. Define P^π(x′|x) = Σ_a π(a|x) P(x′|x, a). We will also view P^π as a matrix: (P^π)_{x,x′} = P^π(x′|x), where, without loss of generality, we assume that X = {1, 2, . . . , |X|}. In general, distributions will also be treated as row vectors. Hence, for a distribution μ, μP^π is the distribution over X that results from using policy π for one step from μ (i.e., the “next-state distribution” under π).
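The matrix view of P^π and the row-vector convention for distributions can be sketched as follows (the toy kernel, policy, and distribution are our own invented numbers, not from the paper):

```python
def policy_kernel(P, pi):
    """P^pi(x'|x) = sum_a pi(a|x) P(x'|x, a), as a row-stochastic matrix."""
    n = len(P)
    return [[sum(pi[x][a] * P[x][a][y] for a in range(len(pi[x])))
             for y in range(n)]
            for x in range(n)]

def next_state_distribution(mu, Ppi):
    """mu P^pi: distributions are row vectors, so this is a vector-matrix product."""
    n = len(mu)
    return [sum(mu[x] * Ppi[x][y] for x in range(n)) for y in range(n)]

# Toy 2-state, 2-action kernel and policy (hypothetical numbers, illustration only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.7, 0.3]]]
pi = [[0.5, 0.5], [1.0, 0.0]]
Ppi = policy_kernel(P, pi)
mu = [1.0, 0.0]
print(next_state_distribution(mu, Ppi))  # the "next-state distribution" under pi
```

Each row of `Ppi` sums to one, and applying it to a distribution again yields a distribution; iterating `next_state_distribution` converges to the stationary distribution when the assumptions of the next section hold.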
Remember that the stationary distribution of a policy π is a distribution μ which satisfies μP^π = μ.

Assumption A1 Every policy π has a well-defined unique stationary distribution μ^π.

Assumption A2 The stationary distributions are uniformly bounded away from zero: inf_{π,x} μ^π(x) ≥ β for some β > 0.

Assumption A3 There exists some fixed positive τ such that for any two arbitrary distributions μ and μ′ over X,

sup_π ‖(μ − μ′)P^π‖_1 ≤ e^{−1/τ} ‖μ − μ′‖_1,

where ‖·‖_1 is the 1-norm of vectors: ‖v‖_1 = Σ_i |v_i|.

²This is a reasonable reference class because for a fixed reward function one can always find a member of it which maximizes the average reward per time step, see Puterman (1994).

Note that Assumption A3 implies Assumption A1. The quantity τ is called the mixing time underlying P by Even-Dar et al. (2009), who also assume A3.

4 Learning in online MDPs under bandit feedback

In this section we shall first introduce some additional, standard MDP definitions which will be used later. That these are well-defined follows from our assumptions on P and from standard results to be found, for example, in the book by Puterman (1994). After the definitions, we specify our algorithm. The section is finished by the statement of our main result concerning the performance of the proposed algorithm.

4.1 Preliminaries

Fix an arbitrary policy π and t ≥ 1. Let {(x′_s, a′_s)} be the random trajectory generated by π and the transition probability kernel P. Define ρ^π_t, the average reward per stage corresponding to π, P and r_t, by

ρ^π_t = lim_{S→∞} (1/S) Σ_{s=0}^{S} E[r_t(x′_s, a′_s)].

An alternative expression for ρ^π_t is ρ^π_t = Σ_x μ^π(x) Σ_a π(a|x) r_t(x, a), where μ^π is the stationary distribution underlying π. Let q^π_t be the action-value function of π, P and r_t, and v^π_t be the corresponding state-value function. These can be uniquely defined as the solutions of the following Bellman equations:

q^π_t(x, a) = r_t(x, a) − ρ^π_t + Σ_{x′} P(x′|x, a) v^π_t(x′),    v^π_t(x) = Σ_a π(a|x) q^π_t(x, a).

Now, consider the trajectory {(x_t, a_t)} underlying a learning agent, where x_1 is randomly chosen from P_0, and define

u_t = ( x_1, a_1, r_1(x_1, a_1), x_2, a_2, r_2(x_2, a_2), . . . , x_t, a_t, r_t(x_t, a_t) )

and π_t(a|x) = P[a_t = a | u_{t−1}, x_t = x]. That is, π_t denotes the policy followed by the agent at time step t (which is computed based on past information and is therefore random). We will use the following notation:

q_t = q^{π_t}_t,    v_t = v^{π_t}_t,    ρ_t = ρ^{π_t}_t.

Note that q_t, v_t satisfy the Bellman equations underlying π_t, P and r_t.

For reasons to be made clear later in the paper, we shall need the state distribution at time step t given that we start from the state-action pair (x, a) at time t − N, conditioned on the policies used between time steps t − N and t:

μ^N_{t,x,a}(x′) := P[x_t = x′ | x_{t−N} = x, a_{t−N} = a, π_{t−N+1}, . . . , π_{t−1}],    x, x′ ∈ X, a ∈ A.

It will be useful to view μ^N_t as a matrix of dimensions |X × A| × |X|. Thus, μ^N_{t,x,a}(·) will be viewed as one row of this matrix. To emphasize the conditional nature of this distribution, we will also use μ^N_t(·|x, a) instead of μ^N_{t,x,a}(·).

4.2 The algorithm

Our algorithm is similar to that of Even-Dar et al. (2009) in that we use an expert algorithm in each state. Since in our case the full reward function r_t is not observed, the agent uses an estimate of it. The main difficulty is to come up with an unbiased estimate of r_t with a controlled variance. Here we propose to use the following estimate:

r̂_t(x, a) = r_t(x, a) / ( π_t(a|x) μ^N_t(x|x_{t−N}, a_{t−N}) )  if (x, a) = (x_t, a_t),  and  r̂_t(x, a) = 0  otherwise,    (1)

where t ≥ N + 1. Define q̂_t, v̂_t and ρ̂_t as the solution to the Bellman equations underlying the average-reward MDP defined by (P, π_t, r̂_t):

q̂_t(x, a) = r̂_t(x, a) − ρ̂_t + Σ_{x′} P(x′|x, a) v̂_t(x′),    v̂_t(x) = Σ_a π_t(a|x) q̂_t(x, a),    ρ̂_t = Σ_{x,a} μ^{π_t}(x) π_t(a|x) r̂_t(x, a).    (2)

Note that if N is sufficiently large and π_t changes sufficiently slowly then

μ^N_t(x|x_{t−N}, a_{t−N}) > 0    (3)

almost surely, for arbitrary x ∈ X, t ≥ N + 1. This fact will be shown in Lemma 4. Now, assume that π_t is computed based on u_{t−N}, that is, π_t is measurable with respect to the σ-field σ(u_{t−N}) generated by the history u_{t−N}:

π_t ∈ σ(u_{t−N}).    (4)

Then also π_{t−1}, . . . , π_{t−N} ∈ σ(u_{t−N}) and μ^N_t can be computed using

μ^N_{t,x,a} = e_x P^a P^{π_{t−N+1}} · · · P^{π_{t−1}},    (5)

where P^a is the transition probability matrix when in every state action a is used and e_x is the unit row vector corresponding to x (and we assumed that X = {1, . . . , |X|}). Moreover, a simple but tedious calculation shows that (3) and (4) ensure the conditional unbiasedness of our estimates, that is,

E[ r̂_t(x, a) | u_{t−N} ] = r_t(x, a).    (6)

It then follows that E[ρ̂_t | u_{t−N}] = ρ_t and, hence, by the uniqueness of the solutions of the Bellman equations, we have, for all (x, a) ∈ X × A,

E[q̂_t(x, a) | u_{t−N}] = q_t(x, a)  and  E[v̂_t(x) | u_{t−N}] = v_t(x).    (7)

As a consequence, we also have, for all (x, a) ∈ X × A, t ≥ N + 1,

E[ρ̂_t] = E[ρ_t],    E[q̂_t(x, a)] = E[q_t(x, a)],    and    E[v̂_t(x)] = E[v_t(x)].    (8)

The bandit algorithm that we propose is shown as Algorithm 1. It follows the approach of Even-Dar et al. (2009) in that a bandit algorithm is used in each state, which together determine the policy to be used. These bandit algorithms are fed with estimates of action-values for the current policy and the current reward. In our case these action-value estimates are q̂_t defined earlier, which are based on the reward estimates r̂_t. A major difference is that the policy computed based on the most recent action-value estimates is used only N steps later. This delay allows us to construct unbiased estimates of the rewards. Its price is that we need to store N policies (or the weights leading to the policies); thus, the memory needed by our algorithm scales with N|A||X|.
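The estimate (1), with the matrix-product formula (5) supplying the normalizing distribution μ^N_t, can be sketched as follows. This is an illustration of the construction, not the authors' code; the kernel, policies, and rewards are invented toy values. The check below the block mirrors the unbiasedness property (6): averaging r̂_t over the conditional law of (x_t, a_t) recovers r_t.

```python
def mu_N(P, policies, x, a):
    """mu^N_{t,x,a} = e_x P^a P^{pi_{t-N+1}} ... P^{pi_{t-1}}, cf. equation (5).

    P        : P[x][a][y] transition kernel
    policies : the N-1 intermediate policies pi_{t-N+1}, ..., pi_{t-1}
    (x, a)   : state-action pair occupied at time t - N
    """
    n = len(P)
    d = list(P[x][a])                # e_x P^a: one step taking action a from state x
    for pi in policies:              # then one step per intermediate policy
        d = [sum(d[u] * pi[u][b] * P[u][b][y]
                 for u in range(n) for b in range(len(pi[u])))
             for y in range(n)]
    return d

def reward_estimate(r_t, pi_t, mu, x_t, a_t):
    """Estimate (1): importance weighting on the single observed pair (x_t, a_t)."""
    est = [[0.0] * len(row) for row in pi_t]
    est[x_t][a_t] = r_t[x_t][a_t] / (pi_t[x_t][a_t] * mu[x_t])
    return est

# Toy kernel and policies (hypothetical numbers, illustration only).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.7, 0.3]]]
pi = [[0.5, 0.5], [0.6, 0.4]]        # stands in for pi_{t-N+1}, ..., pi_{t-1} and pi_t
N = 3
mu = mu_N(P, [pi] * (N - 1), x=0, a=1)
r_t = [[0.3, 0.7], [0.1, 0.9]]
est = reward_estimate(r_t, pi, mu, x_t=1, a_t=0)
print(mu, est[1][0])
```

Averaging `reward_estimate` over x_t ∼ μ^N_t(·|x_{t−N}, a_{t−N}) and a_t ∼ π_t(·|x_t) returns r_t exactly, which is the content of (6) in this toy setting.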
The computational complexity of the algorithm is dominated by the cost of computing r̂_t (and, in particular, by the cost of computing μ^N_t(·|x_{t−N}, a_{t−N})). The cost of this is O(N|A||X|³). In addition to dealing with the delay, we also need to deal with the fact that in our case q_t and q̂_t can both be negative, which must be taken into account in the proper tuning of the algorithm's parameters.

4.3 Main result

Our main result is the following bound concerning the performance of Algorithm 1.

Theorem 1. Let N = ⌈τ ln T⌉,

η = T^{−2/3} · (ln|A|)^{2/3} · ( (4τ+8)/β · ( (2τ+4)τ|A| ln T + (3τ+1)² ) )^{−1/3},

γ = T^{−1/3} · (2τ+4)^{−2/3} · ( (2 ln|A|)/β · ( (2τ+4)τ|A| ln T + (3τ+1)² ) )^{1/3}.

Algorithm 1 Algorithm for the online bandit MDP.
Set N ≥ 1, w_1(x, a) = w_2(x, a) = · · · = w_{2N}(x, a) = 1, γ ∈ (0, 1), η ∈ (0, γ].
For t = 1, 2, . . . , T, repeat
1. Set
π_t(a|x) = (1 − γ) w_t(x, a)/Σ_b w_t(x, b) + γ/|A|
for all (x, a) ∈ X × A.
2. Draw an action a_t randomly, according to the policy π_t(·|x_t).
3. Receive reward r_t(x_t, a_t) and observe x_{t+1}.
4. If t ≥ N + 1,
(a) Compute μ^N_t(x|x_{t−N}, a_{t−N}) for all x ∈ X using (5).
(b) Construct estimates r̂_t using (1) and compute q̂_t using (2).
(c) Set w_{t+N}(x, a) = w_{t+N−1}(x, a) e^{η q̂_t(x,a)} for all (x, a) ∈ X × A.

Then the regret can be bounded as

L̂_T ≤ 3 T^{2/3} · ( (4τ+8) ln|A| / β · ( (2τ+4)τ|A| ln T + (3τ+1)² ) )^{1/3} + O(T^{1/3}).

It is interesting to note that, similarly to the regret bound of Even-Dar et al. (2009), the main term of the regret bound does not directly depend on the size of the state space, but depends on it only through β and the mixing time τ, defined in Assumptions A2 and A3, respectively; however, we also need to note that β ≤ 1/|X|. While the theorem provides the first rigorously proved finite sample regret bound for the online bandit MDP problem, we suspect that the given convergence rate is not sharp in the sense that it may be possible, in agreement with the standard bandit lower bound of Auer et al. (2002), to give an algorithm with an O(√T) regret (up to some logarithmic factors).

The proof of the theorem is similar to the proof of a similar bound done for the full-information case by Even-Dar et al. (2009). Clearly, it suffices to bound R^π_T − R̂_T for an arbitrary fixed policy π. We use the following decomposition of this difference (also used by Even-Dar et al., 2009):

R^π_T − R̂_T = ( R^π_T − Σ_{t=1}^T ρ^π_t ) + ( Σ_{t=1}^T ρ^π_t − Σ_{t=1}^T ρ_t ) + ( Σ_{t=1}^T ρ_t − R̂_T ).    (9)

The first term is bounded using the following standard MDP result.

Lemma 1 (Even-Dar et al., 2009). For any policy π and any T ≥ 1 it holds that R^π_T − Σ_{t=1}^T ρ^π_t ≤ 2(τ + 1).

Hence, it remains to bound the expectation of the other terms, which is done in the following two propositions.

Proposition 1. Let N ≥ ⌈τ ln T⌉. For any policy π and for all T large enough, we have

Σ_{t=1}^T E[ρ^π_t − ρ_t] ≤ (4τ+10)N + ln|A|/η + (2τ+4) T ( γ + (2η/β)|A|( N(1/γ + 4τ+6) + (e−2)(2τ+4) ) ).

Proposition 2. Let N ≥ ⌈τ ln T⌉. For any T large enough,

Σ_{t=1}^T E[ρ_t] − R̂_T ≤ T (2η/β)( N(1/γ + 4τ+6) + (e−2)(2τ+4) )(3τ+1)² + 2T e^{−N/τ} + 2N.    (10)

Note that the choice of N ensures that the second term in (10) becomes O(1).

The proofs are broken into a number of statements presented in the next section. Due to space constraints we present proof sketches only; the full proofs are presented in the extended version of the paper.

5 Analysis

5.1 General tools

First, we show that if the policies that we follow up to time step t change slowly, μ^N_t is “close” to μ^{π_t}:

Lemma 2. Let 1 ≤ N < t ≤ T and c > 0 be such that max_x Σ_a |π_{s+1}(a|x) − π_s(a|x)| ≤ c holds for 1 ≤ s ≤ t − 1.
Then we have

max_{x,a} Σ_{x′} |μ^N_{t,x,a}(x′) − μ^{π_t}(x′)| ≤ c(3τ+1)² + 2e^{−N/τ}.

In the next two lemmas we compute the rate of change of the policies produced by Exp3 and show that for a large enough value of N, μ^N_{t,x,a} can be uniformly bounded from below by β/2.

Lemma 3. Assume that for some N + 1 ≤ t ≤ T, μ^N_{t,x_{t−N},a_{t−N}}(x′) ≥ β/2 holds for all states x′. Let c = (2η/β)(1/γ + 4τ + 6). Then,

max_x Σ_a |π_{t+N−1}(a|x) − π_{t+N}(a|x)| ≤ c.    (11)

The previous results yield the following result, which shows that by choosing the parameters appropriately, the policies will change slowly and μ^N_t will be uniformly bounded away from zero.

Lemma 4. Let c be as in Lemma 3. Assume that c(3τ+1)² < β/2, and let

N ≥ ⌈ τ ln( 4 / (β − 2c(3τ+1)²) ) ⌉.

Then, for all N < t ≤ T, x, x′ ∈ X and a ∈ A, we have μ^N_{t,x,a}(x′) ≥ β/2 and max_{x′} Σ_{a′} |π_{t+1}(a′|x′) − π_t(a′|x′)| ≤ c.    (12)

This result is proved by first ensuring that μ_t is uniformly lower bounded for t = N+1, . . . , 2N, which can be easily seen since the policies do not change in this period. For the rest of the time instants, one can proceed by induction, using Lemmas 2 and 3 in the inductive step.

5.2 Proof of Proposition 1

The statement is trivial for T ≤ N. The following simple result is the first step in proving Proposition 1 for T > N.

Lemma 5 (cf. Lemma 4.1 in Even-Dar et al., 2009). For any policy π and t ≥ 1,

ρ^π_t − ρ_t = Σ_{x,a} μ^π(x) π(a|x) [ q_t(x, a) − v_t(x) ].

For every x, a define Q_T(x, a) = Σ_{t=N+1}^T q_t(x, a) and V_T(x) = Σ_{t=N+1}^T v_t(x). The preceding lemma shows that in order to prove Proposition 1, it suffices to prove an upper bound on E[Q_T(x, a) − V_T(x)].

Lemma 6. Let c be as in Lemma 3. Assume that γ ∈ (0, 1), c(3τ+1)² < β/2, N ≥ ⌈τ ln( 4/(β − 2c(3τ+1)²) )⌉, 0 < η ≤ β/(2(1/γ + 2τ + 3)), and T > N hold. Then, for all (x, a) ∈ X × A,

E[Q_T(x, a) − V_T(x)] ≤ (4τ+8)N + ln|A|/η + (2τ+4) T ( γ + (2η/β)|A|( N(1/γ + 4τ+6) + (e−2)(2τ+4) ) ).

Proof sketch. The proof essentially follows the original proof of Auer et al. (2002) concerning the regret bound of Exp3, although some details are more subtle in our case: our estimates have different properties than the ones considered in the original proof, and we also have to deal with the N-step delay. Let

V̂^N_T(x) = Σ_{t=N+1}^{T−N+1} Σ_a π_{t+N−1}(a|x) q̂_t(x, a)    and    Q̂^N_T(x, b) = Σ_{t=N+1}^{T−N+1} q̂_t(x, b).

Observe that although q_t(x, a) is not necessarily positive (in contrast to the rewards in the Exp3 algorithm), one can prove that π_t(a|x)|q̂_t(x, a)| ≤ (4/β)(τ+2) and

E[|q̂_t(x, a)|] ≤ 2(τ+2).    (13)

Similarly, it can be easily seen that the constraint on η ensures that η q̂_t(x, a) ≤ 1 for all x, a, t. Then, following the proof of Auer et al. (2002), we can show that

V̂^N_T(x) ≥ (1−γ) Q̂^N_T(x, b) − ln|A|/η − (4/β)(τ+2) η(e−2) Σ_{t=N+1}^{T−N+1} Σ_a |q̂_t(x, a)|.    (14)

Next, since the policies satisfy max_x Σ_a |π_{s+1}(a|x) − π_s(a|x)| ≤ c by Lemma 4, we can prove, using (8) and (13), that

E[V̂^N_T(x)] ≤ E[V_T(x)] + 2(τ+2) N (c T|A| + 1).

Now, taking the expectation of both sides of (14) and using the bound on E[V̂^N_T(x)], we get

E[V_T(x)] ≥ (1−γ) E[Q^N_T(x, b)] − ln|A|/η − (4/β)(τ+2) η(e−2) Σ_{t=N+1}^{T−N+1} Σ_a E[|q̂_t(x, a)|] − 2(τ+2) N (c T|A| + 1),

where we used that E[Q̂^N_T(x, b)] = E[Q^N_T(x, b)] by (8). Since q_t(x, b) ≤ 2(τ+2), we get

E[Q_T(x, b)] ≤ E[Q^N_T(x, b)] + 2(τ+2) N.

Combining the above results and using (13) again, then substituting the definition of c, yields the desired result.

Proof of Proposition 1. Under the conditions of the proposition, combining Lemmas 5-6 yields

Σ_{t=1}^T E[ρ^π_t − ρ_t] ≤ 2N + Σ_{x,a} μ^π(x) π(a|x) E[Q_T(x, a) − V_T(x)]
≤ (4τ+10)N + ln|A|/η + (2τ+4) T ( γ + (2η/β)|A|( N(1/γ + 4τ+6) + (e−2)(2τ+4) ) ),

proving Proposition 1.

Acknowledgments

This work was supported in part by the Hungarian Scientific Research Fund and the Hungarian National Office for Research and Technology (OTKA-NKTH CNK 77782), the PASCAL2 Network of Excellence under EC grant no.
216886, NSERC, AITF, the Alberta Ingenuity Centre for Machine\nLearning, the DARPA GALE project (HR0011-08-C-0110) and iCore.\n\n8\n\n\fReferences\nAuer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed\n\nbandit problem. SIAM J. Comput., 32(1):48\u201377.\n\nEven-Dar, E., Kakade, S. M., and Mansour, Y. (2005). Experts in a Markov decision process. In Saul,\nL. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17,\npages 401\u2013408.\n\nEven-Dar, E., Kakade, S. M., and Mansour, Y. (2009). Online Markov decision processes. Mathe-\n\nmatics of Operations Research, 34(3):726\u2013736.\n\nNeu, G., Gy\u00a8orgy, A., and Szepesv\u00b4ari, C. (2010). The online loop-free stochastic shortest-path prob-\n\nlem. In COLT-10.\n\nPuterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.\n\nWiley-Interscience.\n\nYu, J. Y. and Mannor, S. (2009a). Arbitrarily modulated Markov decision processes. In Joint 48th\n\nIEEE Conference on Decision and Control and 28th Chinese Control Conference. IEEE Press.\n\nYu, J. Y. and Mannor, S. (2009b). Online learning in Markov decision processes with arbitrarily\nchanging rewards and transitions. In GameNets\u201909: Proceedings of the First ICST international\nconference on Game Theory for Networks, pages 314\u2013322, Piscataway, NJ, USA. IEEE Press.\n\nYu, J. Y., Mannor, S., and Shimkin, N. (2009). Markov decision processes with arbitrary reward\n\nprocesses. Mathematics of Operations Research, 34(3):737\u2013757.\n\n9\n\n\f", "award": [], "sourceid": 1311, "authors": [{"given_name": "Gergely", "family_name": "Neu", "institution": null}, {"given_name": "Andras", "family_name": "Antos", "institution": null}, {"given_name": "Andr\u00e1s", "family_name": "Gy\u00f6rgy", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}