{"title": "Model-based Reinforcement Learning and the Eluder Dimension", "book": "Advances in Neural Information Processing Systems", "page_first": 1466, "page_last": 1474, "abstract": "We consider the problem of learning to optimize an unknown Markov decision process (MDP). We show that, if the MDP can be parameterized within some known function class, we can obtain regret bounds that scale with the dimensionality, rather than cardinality, of the system. We characterize this dependence explicitly as $\\tilde{O}(\\sqrt{d_K d_E T})$ where $T$ is time elapsed, $d_K$ is the Kolmogorov dimension and $d_E$ is the \\emph{eluder dimension}. These represent the first unified regret bounds for model-based reinforcement learning and provide state of the art guarantees in several important settings. Moreover, we present a simple and computationally efficient algorithm \\emph{posterior sampling for reinforcement learning} (PSRL) that satisfies these bounds.", "full_text": "Model-based Reinforcement Learning and the Eluder Dimension\n\nIan Osband\nStanford University\niosband@stanford.edu\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nWe consider the problem of learning to optimize an unknown Markov decision\nprocess (MDP). We show that, if the MDP can be parameterized within\nsome known function class, we can obtain regret bounds that scale with the\ndimensionality, rather than cardinality, of the system. We characterize this\ndependence explicitly as $\\tilde{O}(\\sqrt{d_K d_E T})$ where $T$ is time elapsed, $d_K$ is the\nKolmogorov dimension and $d_E$ is the eluder dimension. These represent\nthe first unified regret bounds for model-based reinforcement learning and\nprovide state of the art guarantees in several important settings. 
Moreover,\nwe present a simple and computationally efficient algorithm posterior\nsampling for reinforcement learning (PSRL) that satisfies these bounds.\n\n1 Introduction\nWe consider the reinforcement learning (RL) problem of optimizing rewards in an unknown\nMarkov decision process (MDP) [1]. In this setting an agent makes sequential decisions\nwithin its environment to maximize its cumulative rewards through time. We model the\nenvironment as an MDP; however, unlike the standard MDP planning problem, the agent\nis unsure of the underlying reward and transition functions. Through exploring poorly-understood\npolicies, an agent may improve its understanding of its environment but it may\ninstead improve its short term rewards by exploiting its existing knowledge [2, 3].\nThe focus of the literature in this area has been to develop algorithms whose performance\nwill be close to optimal in some sense. There are numerous criteria for statistical and\ncomputational efficiency that might be considered. Some of the most common include PAC\n(Probably Approximately Correct) [4], MB (Mistake Bound) [5], KWIK (Knows What It\nKnows) [6] and regret [7]. We will focus our attention upon regret, or the shortfall in the\nagent\u2019s expected rewards compared to that of the optimal policy. We believe this is a natural\ncriterion for performance during learning, although these concepts are closely linked. A good\noverview of various efficiency guarantees is given in section 3 of Li et al. [6].\nBroadly, algorithms for RL can be separated as either model-based, which build a generative\nmodel of the environment, or model-free, which do not. Algorithms of both types have been\ndeveloped to provide PAC-MDP bounds polynomial in the number of states $S$ and actions\n$A$ [8, 9, 10]. However, model-free approaches can struggle to plan efficient exploration.\nNear-optimal regret bounds to time $T$ of $\\tilde{O}(S\\sqrt{AT})$ have only been attained by model-based\nalgorithms [7, 11]. 
But even these bounds grow with the cardinality of the state and\naction spaces, which may be extremely large or even infinite. Worse still, there is a lower\nbound $\\Omega(\\sqrt{SAT})$ for the expected regret in an arbitrary MDP [7].\nIn special cases, where the reward or transition function is known to belong to a certain\nfunctional family, existing algorithms can exploit the structure to move beyond this \u201ctabula\nrasa\u201d (where nothing is assumed beyond $S$ and $A$) lower bound. The most widely-studied\nparameterization is the degenerate MDP with no transitions, the multi-armed bandit [12,\n13, 14]. Another common assumption is that the transition function is linear in states and\nactions. Papers here establish regret bounds $\\tilde{O}(\\sqrt{T})$ for linear quadratic control [16], but\nwith constants that grow exponentially with dimension. Later works remove this exponential\ndependence, but only under significant sparsity assumptions [17]. The most general previous\nanalysis considers rewards and transitions that are $\\alpha$-H\u00f6lder in a $d$-dimensional space to\nestablish regret bounds $\\tilde{O}(T^{(2d+\\alpha)/(2d+2\\alpha)})$ [18]. However, the proposed algorithm UCCRL\nis not computationally tractable and the bounds approach linearity in many settings.\nIn this paper we analyse the simple and intuitive algorithm posterior sampling for reinforcement\nlearning (PSRL) [20, 21, 11]. PSRL was initially introduced as a heuristic method [21],\nbut has since been shown to satisfy state of the art regret bounds in finite MDPs [11] and\nalso to exploit the structure of factored MDPs [15]. We show that this same algorithm satisfies\ngeneral regret bounds that depend upon the dimensionality, rather than the cardinality, of\nthe underlying reward and transition function classes. 
To characterize the complexity of this\nlearning problem we extend the definition of the eluder dimension, previously introduced for\nbandits [19], to capture the complexity of the reinforcement learning problem. Our results\nprovide a unified analysis of model-based reinforcement learning in general and provide new\nstate of the art bounds in several important problem settings.\n\n2 Problem formulation\nWe consider the problem of learning to optimize a random finite horizon MDP\n$M = (\\mathcal{S}, \\mathcal{A}, R^M, P^M, \\tau, \\rho)$ in repeated finite episodes of interaction. $\\mathcal{S}$ is the state space, $\\mathcal{A}$ is\nthe action space, $R^M(s, a)$ is the reward distribution over $\\mathbb{R}$ and $P^M(\\cdot \\mid s, a)$ is the transition\ndistribution over $\\mathcal{S}$ when selecting action $a$ in state $s$, $\\tau$ is the time horizon, and $\\rho$ the initial\nstate distribution. All random variables we will consider are on a probability space $(\\Omega, \\mathcal{F}, \\mathbb{P})$.\nA policy $\\mu$ is a function mapping each state $s \\in \\mathcal{S}$ and $i = 1, \\ldots, \\tau$ to an action $a \\in \\mathcal{A}$. For\neach MDP $M$ and policy $\\mu$, we define a value function $V$:\n\n$$V^M_{\\mu,i}(s) := \\mathbb{E}_{M,\\mu}\\left[ \\sum_{j=i}^{\\tau} r^M(s_j, a_j) \\,\\Big|\\, s_i = s \\right] \\qquad (1)$$\n\nwhere $r^M(s, a) := \\mathbb{E}[r \\mid r \\sim R^M(s, a)]$ and the subscripts of the expectation operator indicate\nthat $a_j = \\mu(s_j, j)$ and $s_{j+1} \\sim P^M(\\cdot \\mid s_j, a_j)$ for $j = i, \\ldots, \\tau$. A policy $\\mu$ is said to be optimal\nfor MDP $M$ if $V^M_{\\mu,i}(s) = \\max_{\\mu'} V^M_{\\mu',i}(s)$ for all $s \\in \\mathcal{S}$ and $i = 1, \\ldots, \\tau$. We will associate with\neach MDP $M$ a policy $\\mu^M$ that is optimal for $M$.\nWe require that the state space $\\mathcal{S}$ is a subset of $\\mathbb{R}^d$ for some finite $d$ with a $\\|\\cdot\\|_2$-norm\ninduced by an inner product. These results actually extend to general Hilbert spaces, but we\nwill not deal with that in this paper. 
This allows us to decompose the transition function\nas a mean value in $\\mathcal{S}$ plus additive noise: $s' \\sim P^M(\\cdot \\mid s, a) \\implies s' = p^M(s, a) + \\epsilon^P$. At\nfirst this may seem to exclude discrete MDPs with $S$ states from our analysis. However,\nwe can represent the discrete state as a probability vector $s_t \\in \\mathcal{S} = [0, 1]^S \\subset \\mathbb{R}^S$ with a\nsingle active component equal to 1 and 0 otherwise. In fact, the notational convention that\n$\\mathcal{S} \\subseteq \\mathbb{R}^d$ should not impose a great restriction for most practical settings.\nFor any distribution $\\Phi$ over $\\mathcal{S}$, we define the one step future value function $U$ to be the\nexpected value of the optimal policy with the next state distributed according to $\\Phi$.\n\n$$U^M_i(\\Phi) := \\mathbb{E}_{M,\\mu^M}\\left[ V^M_{\\mu^M,i+1}(s) \\mid s \\sim \\Phi \\right]. \\qquad (2)$$\n\nOne natural regularity condition for learning is that the future values of similar distributions\nshould be similar. We examine this idea through the Lipschitz constant on the means of\nthese state distributions. We write $\\mathcal{E}(\\Phi) := \\mathbb{E}[s \\mid s \\sim \\Phi] \\in \\mathcal{S}$ for the mean of a distribution\n$\\Phi$ and express the Lipschitz continuity for $U^M_i$ with respect to the $\\|\\cdot\\|_2$-norm of the mean:\n\n$$|U^M_i(\\Phi) - U^M_i(\\tilde{\\Phi})| \\le K^M_i(\\mathcal{D}) \\, \\|\\mathcal{E}(\\Phi) - \\mathcal{E}(\\tilde{\\Phi})\\|_2 \\text{ for all } \\Phi, \\tilde{\\Phi} \\in \\mathcal{D}. \\qquad (3)$$\n\nWe define $K^M(\\mathcal{D}) := \\max_i K^M_i(\\mathcal{D})$ to be a global Lipschitz constant for the future value\nfunction with state distributions from $\\mathcal{D}$. Where appropriate, we will condense our notation\nto write $K^M := K^M(\\mathcal{D}(M))$ where $\\mathcal{D}(M) := \\{P^M(\\cdot \\mid s, a) \\mid s \\in \\mathcal{S}, a \\in \\mathcal{A}\\}$ is the set of all\npossible one-step state distributions under the MDP $M$.\nThe reinforcement learning agent interacts with the MDP over episodes that begin at times\n$t_k = (k - 1)\\tau + 1$, $k = 1, 2, \\ldots$. Let $H_t = (s_1, a_1, r_1, \\ldots, s_{t-1}, a_{t-1}, r_{t-1})$ denote the history\nof observations made prior to time $t$. A reinforcement learning algorithm is a deterministic\nsequence $\\{\\pi_k \\mid k = 1, 2, \\ldots\\}$ of functions, each mapping $H_{t_k}$ to a probability distribution\n$\\pi_k(H_{t_k})$ over policies which the agent will employ during the $k$th episode. We define the\nregret incurred by a reinforcement learning algorithm $\\pi$ up to time $T$ to be\n\n$$\\mathrm{Regret}(T, \\pi, M^*) := \\sum_{k=1}^{\\lceil T/\\tau \\rceil} \\Delta_k,$$\n\nwhere $\\Delta_k$ denotes regret over the $k$th episode, defined with respect to the MDP $M^*$ by\n\n$$\\Delta_k := \\int_{s \\in \\mathcal{S}} \\rho(s) \\left( V^{M^*}_{\\mu^*,1} - V^{M^*}_{\\mu_k,1} \\right)(s)$$\n\nwith $\\mu^* = \\mu^{M^*}$ and $\\mu_k \\sim \\pi_k(H_{t_k})$. Note that regret is not deterministic since it can\ndepend on the random MDP $M^*$, the algorithm\u2019s internal random sampling and, through\nthe history $H_{t_k}$, on previous random transitions and random rewards. We will assess and\ncompare algorithm performance in terms of regret and its expectation.\n\n3 Main results\nWe now review the algorithm PSRL, an adaptation of Thompson sampling [20] to reinforcement\nlearning. PSRL was first proposed by Strens [21] and later was shown to satisfy\nefficient regret bounds in finite MDPs [11]. The algorithm begins with a prior distribution\nover MDPs. At the start of episode $k$, PSRL samples an MDP $M_k$ from the posterior. PSRL\nthen follows the policy $\\mu_k = \\mu^{M_k}$ which is optimal for this sampled MDP during episode $k$.\n\nAlgorithm 1\nPosterior Sampling for Reinforcement Learning (PSRL)\n1: Input: Prior distribution $\\phi$ for $M^*$, $t = 1$\n2: for episodes $k = 1, 2, \\ldots$ do\n3:   sample $M_k \\sim \\phi(\\cdot \\mid H_t)$\n4:   compute $\\mu_k = \\mu^{M_k}$\n5:   for timesteps $j = 1, \\ldots, \\tau$ do\n6:     apply $a_t \\sim \\mu_k(s_t, j)$\n7:     observe $r_t$ and $s_{t+1}$\n8:     advance $t = t + 1$\n9:   end for\n10: end for\n\nTo state our results we first introduce some notation. 
For any set $\\mathcal{X}$ and $\\mathcal{Y} \\subseteq \\mathbb{R}^d$ for $d$ finite\nlet $\\mathcal{P}^{C,\\sigma}_{\\mathcal{X},\\mathcal{Y}}$ be the family of distributions from $\\mathcal{X}$ to $\\mathcal{Y}$ with mean $\\|\\cdot\\|_2$-bounded in $[0, C]$ and\nadditive $\\sigma$-sub-Gaussian noise. We let $N(\\mathcal{F}, \\alpha, \\|\\cdot\\|_2)$ be the $\\alpha$-covering number of $\\mathcal{F}$ with\nrespect to the $\\|\\cdot\\|_2$-norm and write $n_{\\mathcal{F}} = \\log(8 N(\\mathcal{F}, 1/T^2, \\|\\cdot\\|_2) T)$ for brevity. Finally we\nwrite $d_E(\\mathcal{F}) = \\dim_E(\\mathcal{F}, T^{-1})$ for the eluder dimension of $\\mathcal{F}$ at precision $T^{-1}$, a notion of\ndimension specialized to sequential measurements described in Section 4.\nOur main result, Theorem 1, bounds the expected regret of PSRL at any time $T$.\nTheorem 1 (Expected regret for PSRL in parameterized MDPs).\nFix a state space $\\mathcal{S}$, action space $\\mathcal{A}$, function families $\\mathcal{R} \\subseteq \\mathcal{P}^{C_{\\mathcal{R}},\\sigma_{\\mathcal{R}}}_{\\mathcal{S} \\times \\mathcal{A}, \\mathbb{R}}$ and $\\mathcal{P} \\subseteq \\mathcal{P}^{C_{\\mathcal{P}},\\sigma_{\\mathcal{P}}}_{\\mathcal{S} \\times \\mathcal{A}, \\mathcal{S}}$ for\nany $C_{\\mathcal{R}}, C_{\\mathcal{P}}, \\sigma_{\\mathcal{R}}, \\sigma_{\\mathcal{P}} > 0$. Let $M^*$ be an MDP with state space $\\mathcal{S}$, action space $\\mathcal{A}$, rewards\n$R^* \\in \\mathcal{R}$ and transitions $P^* \\in \\mathcal{P}$. If $\\phi$ is the distribution of $M^*$ and $K^* = K^{M^*}$ is a global\nLipschitz constant for the future value function as per (3) then:\n\n$$\\mathbb{E}[\\mathrm{Regret}(T, \\pi^{PS}, M^*)] \\le \\left[ C_{\\mathcal{R}} + C_{\\mathcal{P}} \\right] + \\tilde{D}(\\mathcal{R}) + \\mathbb{E}[K^*] \\left( 1 + \\frac{1}{T-1} \\right) \\tilde{D}(\\mathcal{P}) \\qquad (4)$$\n\nWhere for $\\mathcal{F}$ equal to either $\\mathcal{R}$ or $\\mathcal{P}$ we will use the shorthand:\n\n$$\\tilde{D}(\\mathcal{F}) := 1 + \\tau C_{\\mathcal{F}} d_E(\\mathcal{F}) + 8 \\sqrt{d_E(\\mathcal{F}) \\left( 4 C_{\\mathcal{F}} + \\sqrt{2 \\sigma_{\\mathcal{F}}^2 \\log(32 T^3)} \\right)} + 8 \\sqrt{2 \\sigma_{\\mathcal{F}}^2 n_{\\mathcal{F}} d_E(\\mathcal{F}) T}.$$\n\nTheorem 1 is a general result that applies to almost all RL settings of interest. In particular,\nwe note that any bounded function is sub-Gaussian. To clarify the asymptotics of this bound\nwe use another classical measure of dimensionality.\nDefinition 1. The Kolmogorov dimension of a function class $\\mathcal{F}$ is given by:\n\n$$\\dim_K(\\mathcal{F}) := \\limsup_{\\alpha \\downarrow 0} \\frac{\\log(N(\\mathcal{F}, \\alpha, \\|\\cdot\\|_2))}{\\log(1/\\alpha)}. \\qquad (5)$$\n\nUsing Definition 1 in Theorem 1 we can obtain our Corollary.\nCorollary 1 (Asymptotic regret bounds for PSRL in parameterized MDPs).\nUnder the assumptions of Theorem 1 and writing $d_K(\\mathcal{F}) := \\dim_K(\\mathcal{F})$:\n\n$$\\mathbb{E}[\\mathrm{Regret}(T, \\pi^{PS}, M^*)] = \\tilde{O}\\left( \\sigma_{\\mathcal{R}} \\sqrt{d_K(\\mathcal{R}) d_E(\\mathcal{R}) T} + \\mathbb{E}[K^*] \\sigma_{\\mathcal{P}} \\sqrt{d_K(\\mathcal{P}) d_E(\\mathcal{P}) T} \\right)$$\n\nWhere $\\tilde{O}(\\cdot)$ ignores terms logarithmic in $T$.\nIn Section 4 we provide bounds on the eluder dimension of several function classes. These\nlead to explicit regret bounds in a number of important domains such as discrete MDPs,\nlinear-quadratic control and even generalized linear systems. In all of these cases the eluder\ndimension scales comparably with more traditional notions of dimensionality. For clarity,\nwe present bounds in the case of linear-quadratic control.\nCorollary 2 (Asymptotic regret bounds for PSRL in bounded linear quadratic systems).\nLet $M^*$ be an $n$-dimensional linear-quadratic system with $\\sigma$-sub-Gaussian noise. If the state\nis $\\|\\cdot\\|_2$-bounded by $C$ and $\\phi$ is the distribution of $M^*$, then:\n\n$$\\mathbb{E}[\\mathrm{Regret}(T, \\pi^{PS}, M^*)] = \\tilde{O}\\left( \\sigma C \\lambda_1 n^2 \\sqrt{T} \\right). \\qquad (6)$$\n\nHere $\\lambda_1$ is the largest eigenvalue of the matrix $Q$ given as the solution of the Riccati equations\nfor the unconstrained optimal value function $V(s) = -s^T Q s$ [22].\nProof. We simply apply the results for eluder dimension in Section 4 to Corollary 1 and\nupper bound the Lipschitz constant of the constrained LQR by $2 C \\lambda_1$, see Appendix D.\n\nAlgorithms based upon posterior sampling are intimately linked to those based upon optimism\n[14]. In Appendix E we outline an optimistic variant that would attain similar regret\nbounds but with high probability in a frequentist sense. 
Unfortunately this algorithm remains\ncomputationally intractable even when presented with an approximate MDP planner. Further,\nwe believe that PSRL will generally be more statistically efficient than an optimistic\nvariant with similar regret bounds since the algorithm is not affected by loose analysis [11].\n\n4 Eluder dimension\nTo quantify the complexity of learning in a potentially infinite MDP, we extend the existing\nnotion of eluder dimension for real-valued functions [19] to vector-valued functions. For any\n$\\mathcal{G} \\subseteq \\mathcal{P}^{C,\\sigma}_{\\mathcal{X},\\mathcal{Y}}$ we define the set of mean functions $\\mathcal{F} = \\mathbb{E}[\\mathcal{G}] := \\{f \\mid f = \\mathbb{E}[G] \\text{ for } G \\in \\mathcal{G}\\}$. If\nwe consider sequential observations $y_i \\sim G^*(x_i)$ we can equivalently write them as $y_i = f^*(x_i) + \\epsilon_i$\nfor some $f^*(x_i) = \\mathbb{E}[y \\mid y \\sim G^*(x_i)]$ and $\\epsilon_i$ zero mean noise. Intuitively, the eluder\ndimension of $\\mathcal{F}$ is the length $d$ of the longest possible sequence $x_1, \\ldots, x_d$ such that for all $i$,\nknowing the function values of $f(x_1), \\ldots, f(x_i)$ will not reveal $f(x_{i+1})$.\nDefinition 2 ($(\\mathcal{F}, \\epsilon)$-dependence).\nWe will say that $x \\in \\mathcal{X}$ is $(\\mathcal{F}, \\epsilon)$-dependent on $\\{x_1, \\ldots, x_n\\} \\subseteq \\mathcal{X}$\n\n$$\\iff \\forall f, \\tilde{f} \\in \\mathcal{F}, \\quad \\sum_{i=1}^{n} \\|f(x_i) - \\tilde{f}(x_i)\\|_2^2 \\le \\epsilon^2 \\implies \\|f(x) - \\tilde{f}(x)\\|_2 \\le \\epsilon.$$\n\n$x \\in \\mathcal{X}$ is $(\\mathcal{F}, \\epsilon)$-independent of $\\{x_1, \\ldots, x_n\\}$ iff it does not satisfy the definition for dependence.\n\nDefinition 3 (Eluder Dimension).\nThe eluder dimension $\\dim_E(\\mathcal{F}, \\epsilon)$ is the length of the longest possible sequence of elements\nin $\\mathcal{X}$ such that for some $\\epsilon' \\ge \\epsilon$ every element is $(\\mathcal{F}, \\epsilon')$-independent of its predecessors.\nTraditional notions from supervised learning, such as the VC dimension, are not sufficient to\ncharacterize the complexity of reinforcement learning. In fact, a family learnable in constant\ntime for supervised learning may take arbitrarily long to learn to control well [19]. 
The\neluder dimension mirrors the linear dimension for vector spaces, which is the length of the\nlongest sequence such that each element is linearly independent of its predecessors, but\nallows for nonlinear and approximate dependencies. We overload our notation for $\\mathcal{G} \\subseteq \\mathcal{P}^{C,\\sigma}_{\\mathcal{X},\\mathcal{Y}}$\nand write $\\dim_E(\\mathcal{G}, \\epsilon) := \\dim_E(\\mathbb{E}[\\mathcal{G}], \\epsilon)$, which should be clear from the context.\n4.1 Eluder dimension for specific function classes\nTheorem 1 gives regret bounds in terms of the eluder dimension, which is well-defined for\nany $\\mathcal{F}, \\epsilon$. However, for any given $\\mathcal{F}, \\epsilon$ actually calculating the eluder dimension may take\nsome additional analysis. We now provide bounds on the eluder dimension for some common\nfunction classes in a similar approach to earlier work for real-valued functions [14]. These\nproofs are available in Appendix C.\nProposition 1 (Eluder dimension for finite $\\mathcal{X}$).\nA counting argument shows that for $|\\mathcal{X}| = X$ finite, any $\\epsilon > 0$ and any function class $\\mathcal{F}$:\n\n$$\\dim_E(\\mathcal{F}, \\epsilon) \\le X$$\n\nThis bound is tight in the case of independent measurements.\nProposition 2 (Eluder dimension for linear functions).\nLet $\\mathcal{F} = \\{f \\mid f(x) = \\theta \\phi(x) \\text{ for } \\theta \\in \\mathbb{R}^{n \\times p}, \\phi \\in \\mathbb{R}^p, \\|\\theta\\|_2 \\le C_\\theta, \\|\\phi\\|_2 \\le C_\\phi\\}$ then $\\forall \\mathcal{X}$:\n\n$$\\dim_E(\\mathcal{F}, \\epsilon) \\le p(4n - 1) \\frac{e}{e - 1} \\log\\left[ \\left( 1 + \\left( \\frac{2 C_\\phi C_\\theta}{\\epsilon} \\right)^2 \\right) (4n - 1) \\right] + 1 = \\tilde{O}(np)$$\n\nProposition 3 (Eluder dimension for quadratic functions).\nLet $\\mathcal{F} = \\{f \\mid f(x) = \\phi(x)^T \\theta \\phi(x) \\text{ for } \\theta \\in \\mathbb{R}^{p \\times p}, \\phi \\in \\mathbb{R}^p, \\|\\theta\\|_2 \\le C_\\theta, \\|\\phi\\|_2 \\le C_\\phi\\}$ then $\\forall \\mathcal{X}$:\n\n$$\\dim_E(\\mathcal{F}, \\epsilon) \\le p(4p - 1) \\frac{e}{e - 1} \\log\\left[ \\left( 1 + \\left( \\frac{2 p C_\\phi^2 C_\\theta}{\\epsilon} \\right)^2 \\right) (4p - 1) \\right] + 1 = \\tilde{O}(p^2).$$\n\nProposition 4 (Eluder dimension for generalized linear functions).\nLet $g(\\cdot)$ be a component-wise independent function on $\\mathbb{R}^n$ with derivative in each component\nbounded $\\in [\\underline{h}, \\overline{h}]$ with $\\underline{h} > 0$. Define $r = \\overline{h} / \\underline{h} > 1$ to be the condition number. If\n$\\mathcal{F} = \\{f \\mid f(x) = g(\\theta \\phi(x)) \\text{ for } \\theta \\in \\mathbb{R}^{n \\times p}, \\phi \\in \\mathbb{R}^p, \\|\\theta\\|_2 \\le C_\\theta, \\|\\phi\\|_2 \\le C_\\phi\\}$ then for any $\\mathcal{X}$:\n\n$$\\dim_E(\\mathcal{F}, \\epsilon) \\le p \\left( r^2 (4n - 2) + 1 \\right) \\frac{e}{e - 1} \\left( \\log\\left[ \\left( r^2 (4n - 2) + 1 \\right) \\left( 1 + \\left( \\frac{2 C_\\theta C_\\phi}{\\epsilon} \\right)^2 \\right) \\right] \\right) + 1 = \\tilde{O}(r^2 n p)$$\n\n5 Confidence sets\nWe now follow the standard argument that relates the regret of an optimistic or posterior\nsampling algorithm to the construction of confidence sets [7, 11]. We will use\nthe eluder dimension to build confidence sets for the reward and transition which contain\nthe true functions with high probability and then bound the regret of our algorithm by\nthe maximum deviation within the confidence sets. For observations from $f^* \\in \\mathcal{F}$ we\nwill center the sets around the least squares estimate $\\hat{f}^{LS}_t \\in \\arg\\min_{f \\in \\mathcal{F}} L_{2,t}(f)$ where\n$L_{2,t}(f) := \\sum_{i=1}^{t-1} \\|f(x_i) - y_i\\|_2^2$ is the cumulative squared prediction error. The confidence\nsets are defined $\\mathcal{F}_t = \\mathcal{F}_t(\\beta_t) := \\{f \\in \\mathcal{F} \\mid \\|f - \\hat{f}^{LS}_t\\|_{2,E_t} \\le \\sqrt{\\beta_t}\\}$ where $\\beta_t$ controls the growth\nof the confidence set and the empirical 2-norm is defined $\\|g\\|^2_{2,E_t} := \\sum_{i=1}^{t-1} \\|g(x_i)\\|_2^2$.\n\nFor $\\mathcal{F} \\subseteq \\mathcal{P}^{C,\\sigma}_{\\mathcal{X},\\mathcal{Y}}$, we define the distinguished control parameter:\n\n$$\\beta^*_t(\\mathcal{F}, \\delta, \\alpha) := 8 \\sigma^2 \\log(N(\\mathcal{F}, \\alpha, \\|\\cdot\\|_2)/\\delta) + 2 \\alpha t \\left( 8C + \\sqrt{8 \\sigma^2 \\log(4t^2/\\delta)} \\right) \\qquad (7)$$\n\nThis leads to confidence sets which contain the true function with high probability.\nProposition 5 (Confidence sets with high probability).\nFor all $\\delta > 0$ and $\\alpha > 0$ and the confidence sets $\\mathcal{F}_t = \\mathcal{F}_t(\\beta^*_t(\\mathcal{F}, \\delta, \\alpha))$ for all $t \\in \\mathbb{N}$ then:\n\n$$\\mathbb{P}\\left( f^* \\in \\bigcap_{t=1}^{\\infty} \\mathcal{F}_t \\right) \\ge 1 - 2\\delta$$\n\nProof. 
We combine standard martingale concentrations with a discretization scheme. The\nargument is essentially the same as Proposition 6 in [14], but extends statements about $\\mathbb{R}$\nto vector-valued functions. A full derivation is available in Appendix A.\n\n5.1 Bounding the sum of set widths\nWe now bound the deviation from $f^*$ by the maximum deviation within the confidence set.\nDefinition 4 (Set widths).\nFor any set of functions $\\mathcal{F}$ we define the width of the set at $x$ to be the maximum $L_2$ deviation\nbetween any two members of $\\mathcal{F}$ evaluated at $x$.\n\n$$w_{\\mathcal{F}}(x) := \\sup_{\\overline{f}, \\underline{f} \\in \\mathcal{F}} \\|\\overline{f}(x) - \\underline{f}(x)\\|_2$$\n\nWe can bound the number of large widths in terms of the eluder dimension.\nLemma 1 (Bounding the number of large widths).\nIf $\\{\\beta_t > 0 \\mid t \\in \\mathbb{N}\\}$ is a nondecreasing sequence with $\\mathcal{F}_t = \\mathcal{F}_t(\\beta_t)$ then\n\n$$\\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} \\mathbf{1}\\{w_{\\mathcal{F}_{t_k}}(x_{t_k+i}) > \\epsilon\\} \\le \\left( \\frac{4 \\beta_T}{\\epsilon^2} + \\tau \\right) \\dim_E(\\mathcal{F}, \\epsilon)$$\n\nProof. This result follows from Proposition 8 in [14] but with a small adjustment to account\nfor episodes. A full proof is given in Appendix B.\n\nWe now use Lemma 1 to control the cumulative deviation through time.\nProposition 6 (Bounding the sum of widths).\nIf $\\{\\beta_t > 0 \\mid t \\in \\mathbb{N}\\}$ is nondecreasing with $\\mathcal{F}_t = \\mathcal{F}_t(\\beta_t)$ and $\\|f\\|_2 \\le C$ for all $f \\in \\mathcal{F}$ then:\n\n$$\\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} w_{\\mathcal{F}_{t_k}}(x_{t_k+i}) \\le 1 + \\tau C \\dim_E(\\mathcal{F}, T^{-1}) + 4 \\sqrt{\\beta_T \\dim_E(\\mathcal{F}, T^{-1}) T} \\qquad (8)$$\n\nProof. Once again we follow the analysis of Russo [14] and streamline notation by letting $w_t = w_{\\mathcal{F}_{t_k}}(x_{t_k+i})$\nand $d = \\dim_E(\\mathcal{F}, T^{-1})$. Reordering the sequence $(w_1, \\ldots, w_T) \\to (w_{i_1}, \\ldots, w_{i_T})$\nsuch that $w_{i_1} \\ge \\ldots \\ge w_{i_T}$ we have that:\n\n$$\\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} w_{\\mathcal{F}_{t_k}}(x_{t_k+i}) = \\sum_{t=1}^{T} w_{i_t} \\le 1 + \\sum_{t=1}^{T} w_{i_t} \\mathbf{1}\\{w_{i_t} \\ge T^{-1}\\}.$$\n\nBy the reordering we know that $w_{i_t} > \\epsilon$ means that $\\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} \\mathbf{1}\\{w_{\\mathcal{F}_{t_k}}(x_{t_k+i}) > \\epsilon\\} \\ge t$.\nFrom Lemma 1, $\\epsilon \\le \\sqrt{\\frac{4 \\beta_T d}{t - \\tau d}}$. So that if $w_{i_t} > T^{-1}$ then $w_{i_t} \\le \\min\\{C, \\sqrt{\\frac{4 \\beta_T d}{t - \\tau d}}\\}$. Therefore,\n\n$$\\sum_{t=1}^{T} w_{i_t} \\mathbf{1}\\{w_{i_t} \\ge T^{-1}\\} \\le \\tau C d + \\sum_{t=\\tau d+1}^{T} \\sqrt{\\frac{4 \\beta_T d}{t - \\tau d}} \\le \\tau C d + 2 \\sqrt{\\beta_T} \\int_0^T \\sqrt{\\frac{d}{t}} \\, dt \\le \\tau C d + 4 \\sqrt{\\beta_T d T}$$\n\n6 Analysis\nWe will now reproduce the decomposition of expected regret in terms of the Bellman\nerror [11]. From here, we will apply the confidence set results from Section 5 to obtain\nour regret bounds. We streamline our discussion of $P^M$, $R^M$, $V^M_{\\mu,i}$, $U^M_i$ and $\\mathcal{T}^M_\\mu$ by simply\nwriting $*$ in place of $M^*$ or $\\mu^*$ and $k$ in place of $M_k$ or $\\mu_k$ where appropriate; for example\n$V^*_{k,i} := V^{M^*}_{\\mu_k,i}$.\nThe first step in our analysis breaks down the regret by adding and subtracting the imagined\noptimal reward of $\\mu_k$ under the MDP $M_k$.\n\n$$\\Delta_k = \\left( V^*_{*,1} - V^*_{k,1} \\right)(s_0) = \\left( V^*_{*,1} - V^k_{k,1} \\right)(s_0) + \\left( V^k_{k,1} - V^*_{k,1} \\right)(s_0) \\qquad (9)$$\n\nHere $s_0$ is a distinguished initial state, but moving to general $\\rho(s)$ poses no real challenge.\nAlgorithms based upon optimism bound $(V^*_{*,1} - V^k_{k,1}) \\le 0$ with high probability. For PSRL\nwe use Lemma 2 and the tower property to see that this is zero in expectation.\nLemma 2 (Posterior sampling).\nIf $\\phi$ is the distribution of $M^*$ then, for any $\\sigma(H_{t_k})$-measurable function $g$,\n\n$$\\mathbb{E}[g(M^*) \\mid H_{t_k}] = \\mathbb{E}[g(M_k) \\mid H_{t_k}] \\qquad (10)$$\n\nWe introduce the Bellman operator $\\mathcal{T}^M_\\mu$, which for any MDP $M = (\\mathcal{S}, \\mathcal{A}, R^M, P^M, \\tau, \\rho)$,\nstationary policy $\\mu : \\mathcal{S} \\to \\mathcal{A}$ and value function $V : \\mathcal{S} \\to \\mathbb{R}$, is defined by\n\n$$\\mathcal{T}^M_\\mu V(s) := r^M(s, \\mu(s)) + \\int_{s' \\in \\mathcal{S}} P^M(s' \\mid s, \\mu(s)) V(s').$$\n\nThis returns the expected value of state $s$ where we follow the policy $\\mu$ under the laws of $M$,\nfor one time step. The following lemma gives a concise form for the dynamic programming\nparadigm in terms of the Bellman operator.\nLemma 3 (Dynamic programming equation).\nFor any MDP $M = (\\mathcal{S}, \\mathcal{A}, R^M, P^M, \\tau, \\rho)$ and policy $\\mu : \\mathcal{S} \\times \\{1, \\ldots, \\tau\\} \\to \\mathcal{A}$, the value\nfunctions $V^M_\\mu$ satisfy\n\n$$V^M_{\\mu,i} = \\mathcal{T}^M_{\\mu(\\cdot,i)} V^M_{\\mu,i+1} \\qquad (11)$$\n\nfor $i = 1, \\ldots, \\tau$, with $V^M_{\\mu,\\tau+1} := 0$.\nThrough repeated application of the dynamic programming operator and taking expectation\nof martingale differences we can mirror earlier analysis [11] to equate expected regret with\nthe cumulative Bellman error:\n\n$$\\mathbb{E}[\\Delta_k] = \\mathbb{E}\\left[ \\sum_{i=1}^{\\tau} (\\mathcal{T}^k_{k,i} - \\mathcal{T}^*_{k,i}) V^k_{k,i+1}(s_{t_k+i}) \\right] \\qquad (12)$$\n\n6.1 Lipschitz continuity\nEfficient regret bounds for MDPs with an infinite number of states and actions require some\nregularity assumption. One natural notion is that nearby states might have similar optimal\nvalues, or that the optimal value function might be Lipschitz. Unfortunately, any\ndiscontinuous reward function will usually lead to discontinuous value functions so that this\nassumption is violated in many settings of interest. 
However, we only require that the\nfuture value is Lipschitz in the sense of equation (3). This will be satisfied whenever the\nunderlying value function is Lipschitz, but is a strictly weaker requirement since the system\nnoise helps to smooth future values.\nSince $P$ has $\\sigma_{\\mathcal{P}}$-sub-Gaussian noise we write $s_{t+1} = p^M(s_t, a_t) + \\epsilon^P_t$ in the natural way. We\nnow use equation (12) to reduce regret to a sum of set widths. To reduce clutter and more\nclosely follow the notation of Section 4 we will write $x_{k,i} = (s_{t_k+i}, a_{t_k+i})$.\n\n$$\\mathbb{E}[\\Delta_k] \\le \\mathbb{E}\\left[ \\sum_{i=1}^{\\tau} \\left\\{ r^k(x_{k,i}) - r^*(x_{k,i}) + U^k_i(P^k(x_{k,i})) - U^k_i(P^*(x_{k,i})) \\right\\} \\right] \\le \\mathbb{E}\\left[ \\sum_{i=1}^{\\tau} \\left\\{ |r^k(x_{k,i}) - r^*(x_{k,i})| + K^k \\|p^k(x_{k,i}) - p^*(x_{k,i})\\|_2 \\right\\} \\right] \\qquad (13)$$\n\nWhere $K^k$ is a global Lipschitz constant for the future value function of $M_k$ as per (3).\nWe now use the results from Sections 4 and 5 to form the corresponding confidence sets\n$\\mathcal{R}_k := \\mathcal{R}_{t_k}(\\beta^*(\\mathcal{R}, \\delta, \\alpha))$ and $\\mathcal{P}_k := \\mathcal{P}_{t_k}(\\beta^*(\\mathcal{P}, \\delta, \\alpha))$ for the reward and transition functions\nrespectively. Let $A = \\{R^*, R_k \\in \\mathcal{R}_k \\ \\forall k\\}$ and $B = \\{P^*, P_k \\in \\mathcal{P}_k \\ \\forall k\\}$ and condition upon\nthese events to give:\n\n$$\\mathbb{E}[\\mathrm{Regret}(T, \\pi^{PS}, M^*)] \\le \\mathbb{E}\\left[ \\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} \\left\\{ |r^k(x_{k,i}) - r^*(x_{k,i})| + K^k \\|p^k(x_{k,i}) - p^*(x_{k,i})\\|_2 \\right\\} \\right] \\le \\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} \\left\\{ w_{\\mathcal{R}_k}(x_{k,i}) + \\mathbb{E}[K^k \\mid A, B] \\, w_{\\mathcal{P}_k}(x_{k,i}) + 8\\delta(C_{\\mathcal{R}} + C_{\\mathcal{P}}) \\right\\} \\qquad (14)$$\n\nThe posterior sampling lemma ensures that $\\mathbb{E}[K^k] = \\mathbb{E}[K^*]$ so that $\\mathbb{E}[K^k \\mid A, B] \\le \\mathbb{E}[K^*] / \\mathbb{P}(A, B) \\le \\mathbb{E}[K^*] / (1 - 8\\delta)$\nby a union bound on $\\{A^c \\cup B^c\\}$. We fix $\\delta = 1/8T$ to see that:\n\n$$\\mathbb{E}[\\mathrm{Regret}(T, \\pi^{PS}, M^*)] \\le (C_{\\mathcal{R}} + C_{\\mathcal{P}}) + \\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} w_{\\mathcal{R}_k}(x_{k,i}) + \\mathbb{E}[K^*] \\left( 1 + \\frac{1}{T-1} \\right) \\sum_{k=1}^{m} \\sum_{i=1}^{\\tau} w_{\\mathcal{P}_k}(x_{k,i})$$\n\nWe now use equation (7) together with Proposition 6 to obtain our regret bounds. For ease\nof notation we will write $d_E(\\mathcal{R}) = \\dim_E(\\mathcal{R}, T^{-1})$ and $d_E(\\mathcal{P}) = \\dim_E(\\mathcal{P}, T^{-1})$.\n\n$$\\mathbb{E}[\\mathrm{Regret}(T, \\pi^{PS}, M^*)] \\le 2 + (C_{\\mathcal{R}} + C_{\\mathcal{P}}) + \\tau (C_{\\mathcal{R}} d_E(\\mathcal{R}) + C_{\\mathcal{P}} d_E(\\mathcal{P})) + 4 \\sqrt{\\beta^*_T(\\mathcal{R}, 1/8T, \\alpha) d_E(\\mathcal{R}) T} + 4 \\sqrt{\\beta^*_T(\\mathcal{P}, 1/8T, \\alpha) d_E(\\mathcal{P}) T} \\qquad (15)$$\n\nWe let $\\alpha = 1/T^2$ and write $n_{\\mathcal{F}} = \\log(8 N(\\mathcal{F}, 1/T^2, \\|\\cdot\\|_2) T)$ for $\\mathcal{R}$ and $\\mathcal{P}$ to complete our\nproof of Theorem 1:\n\n$$\\mathbb{E}[\\mathrm{Regret}(T, \\pi^{PS}, M^*)] \\le \\left[ C_{\\mathcal{R}} + C_{\\mathcal{P}} \\right] + \\tilde{D}(\\mathcal{R}) + \\mathbb{E}[K^*] \\left( 1 + \\frac{1}{T-1} \\right) \\tilde{D}(\\mathcal{P}) \\qquad (16)$$\n\nWhere $\\tilde{D}(\\mathcal{F})$ is shorthand for $1 + \\tau C_{\\mathcal{F}} d_E(\\mathcal{F}) + 8 \\sqrt{d_E(\\mathcal{F}) ( 4 C_{\\mathcal{F}} + \\sqrt{2 \\sigma_{\\mathcal{F}}^2 \\log(32 T^3)} )} + 8 \\sqrt{2 \\sigma_{\\mathcal{F}}^2 n_{\\mathcal{F}} d_E(\\mathcal{F}) T}$.\nThe first term $[C_{\\mathcal{R}} + C_{\\mathcal{P}}]$ bounds the contribution from missed confidence\nsets. The cost of learning the reward function $R^*$ is bounded by $\\tilde{D}(\\mathcal{R})$. In most\nproblems the remaining contribution bounding transitions and lost future value will be\ndominant. Corollary 1 follows from Definition 1 together with $n_{\\mathcal{R}}$ and $n_{\\mathcal{P}}$.\n7 Conclusion\nWe present a new analysis of posterior sampling for reinforcement learning that leads to\na general regret bound in terms of the dimensionality, rather than the cardinality, of the\nunderlying MDP. These are the first regret bounds for reinforcement learning in such a\ngeneral setting and provide new state of the art guarantees when specialized to several\nimportant problem settings. That said, there are a few clear shortcomings which we do not\naddress in the paper. First, we assume that it is possible to draw samples from the posterior\ndistribution exactly and in some cases this may require extensive computational effort.\nSecond, we wonder whether it is possible to extend our analysis to learning in MDPs without\nepisodic resets. Finally, there is a fundamental hurdle to model-based reinforcement learning\nthat planning for the optimal policy even in a known MDP may be intractable. 
We assume\naccess to an approximate MDP planner, but this will generally require lengthy computations.\nWe would like to examine whether similar bounds are attainable in model-free learning\n[23], which may obviate complicated MDP planning, and examine the computational and\nstatistical efficiency tradeoffs between these methods.\n\nAcknowledgments\nOsband is supported by Stanford Graduate Fellowships courtesy of PACCAR inc. This work\nwas supported in part by Award CMMI-0968707 from the National Science Foundation.\n\nReferences\n[1] Apostolos Burnetas and Michael Katehakis. Optimal adaptive policies for Markov decision\nprocesses. Mathematics of Operations Research, 22(1):222\u2013255, 1997.\n\n[2] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.\nAdvances in applied mathematics, 6(1):4\u201322, 1985.\n\n[3] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A\nsurvey. arXiv preprint cs/9605103, 1996.\n\n[4] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134\u20131142,\n1984.\n\n[5] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold\nalgorithm. Machine learning, 2(4):285\u2013318, 1988.\n\n[6] Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it\nknows: a framework for self-aware learning. Machine learning, 82(3):399\u2013443, 2011.\n\n[7] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement\nlearning. The Journal of Machine Learning Research, 99:1563\u20131600, 2010.\n\n[8] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.\nMachine Learning, 49(2-3):209\u2013232, 2002.\n\n[9] Ronen Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for\nnear-optimal reinforcement learning. 
The Journal of Machine Learning Research, 3:213\u2013231,\n2003.\n\n[10] Alexander Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael Littman. PAC model-free\nreinforcement learning. In Proceedings of the 23rd international conference on Machine\nlearning, pages 881\u2013888. ACM, 2006.\n\n[11] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) Efficient Reinforcement Learning\nvia Posterior Sampling. Advances in Neural Information Processing Systems, 2013.\n\n[12] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. The Journal of\nMachine Learning Research, 3:397\u2013422, 2003.\n\n[13] S\u00e9bastien Bubeck, R\u00e9mi Munos, Gilles Stoltz, and Csaba Szepesv\u00e1ri. X-armed bandits. Journal\nof Machine Learning Research, 12:1587\u20131627, 2011.\n\n[14] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR,\nabs/1301.2609, 2013.\n\n[15] Ian Osband and Benjamin Van Roy. Near-optimal regret bounds for reinforcement learning in\nfactored MDPs. arXiv preprint arXiv:1403.3741, 2014.\n\n[16] Yassin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Improved algorithms for linear\nstochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.\n\n[17] Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient reinforcement learning\nfor high dimensional linear quadratic systems. In NIPS, pages 2645\u20132653, 2012.\n\n[18] Ronald Ortner, Daniil Ryabko, et al. Online regret bounds for undiscounted continuous\nreinforcement learning. In NIPS, pages 1772\u20131780, 2012.\n\n[19] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic\nexploration. In Advances in Neural Information Processing Systems, pages 2256\u20132264,\n2013.\n\n[20] William Thompson. On the likelihood that one unknown probability exceeds another in view\nof the evidence of two samples. 
Biometrika, 25(3/4):285\u2013294, 1933.\n\n[21] Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th\nInternational Conference on Machine Learning, pages 943\u2013950, 2000.\n\n[22] Dimitri Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific,\nBelmont, MA, 1995.\n\n[23] Benjamin Van Roy and Zheng Wen. Generalization and exploration via randomized value\nfunctions. arXiv preprint arXiv:1402.0635, 2014.", "award": [], "sourceid": 802, "authors": [{"given_name": "Ian", "family_name": "Osband", "institution": "Stanford"}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": "Stanford University"}]}