{"title": "Fitted Q-iteration in continuous action-space MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 9, "page_last": 16, "abstract": "We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by another policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous theoretical analysis of this algorithm, proving what we believe is the first finite-time bounds for value-function based algorithms for continuous state- and action-space problems.", "full_text": "Fitted Q-iteration in continuous action-space MDPs\n\nAndr\u00b4as Antos\n\nComputer and Automation Research Inst.\nof the Hungarian Academy of Sciences\nKende u. 13-17, Budapest 1111, Hungary\n\nantos@sztaki.hu\n\nR\u00b4emi Munos\n\nSequeL project-team, INRIA Lille\n59650 Villeneuve d\u2019Ascq, France\n\nremi.munos@inria.fr\n\nCsaba Szepesv\u00b4ari\u2217\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nEdmonton T6G 2E8, Canada\n\nszepesva@cs.ualberta.ca\n\nAbstract\n\nWe consider continuous state, continuous action batch reinforcement learning\nwhere the goal is to learn a good policy from a suf\ufb01ciently rich trajectory gen-\nerated by some policy. We study a variant of \ufb01tted Q-iteration, where the greedy\naction selection is replaced by searching for a policy in a restricted set of can-\ndidate policies by maximizing the average action values. We provide a rigorous\nanalysis of this algorithm, proving what we believe is the \ufb01rst \ufb01nite-time bound\nfor value-function based algorithms for continuous state and action problems.\n\n1 Preliminaries\n\nWe will build on the results from [1, 2, 3] and for this reason we use the same notation as these\npapers. 
The unattributed results cited in this section can be found in the book [4].
A discounted MDP is defined by a quintuple (X, A, P, S, γ), where X is the (possibly infinite) state space, A is the set of actions, P : X × A → M(X) is the transition probability kernel, with P(·|x, a) defining the next-state distribution upon taking action a in state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M(X) denotes the set of all probability measures over X. The Lebesgue measure shall be denoted by λ. We start with the following mild assumption on the MDP:

Assumption A1 (MDP Regularity) X is a compact subset of the dX-dimensional Euclidean space and A is a compact subset of [−A∞, A∞]^dA. The random immediate rewards are bounded by R̂max, and the expected immediate reward function, r(x, a) = ∫ r S(dr|x, a), is uniformly bounded by Rmax: ‖r‖∞ ≤ Rmax.

A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies, which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted reward that is encountered while the policy is executed:

V^π(x) = E^π[ ∑_{t=0}^∞ γ^t R_t | X_0 = x ].

Here R_t ∼ S(·|X_t, A_t) is the reward received at time step t, and the state X_t evolves according to X_{t+1} ∼ P(·|X_t, A_t), where A_t is sampled from the distribution determined by π.

(* Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary.)
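To make these definitions concrete on a small example, the following sketch runs value iteration Q ← TQ on a tiny finite MDP and extracts a greedy (hence optimal) policy. The three-state, two-action transition kernel and rewards below are invented purely for illustration; they are not from the paper.

```python
import numpy as np

# Toy finite MDP, purely illustrative: 3 states, 2 actions.
# P[a, x, y] = P(y | x, a); r[x, a] = expected immediate reward r(x, a).
P = np.array([[[0.9, 0.1, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
r = np.array([[0.0, 0.1],
              [0.0, 0.2],
              [1.0, 0.0]])
gamma = 0.9

def bellman_T(Q):
    """Bellman optimality operator: (TQ)(x,a) = r(x,a) + gamma * E_y[max_b Q(y,b)]."""
    V = Q.max(axis=1)                    # V(y) = max_b Q(y, b)
    return r + gamma * np.einsum('axy,y->xa', P, V)

Q = np.zeros_like(r)
for _ in range(500):                     # T is a gamma-contraction, so this converges
    Q = bellman_T(Q)

pi_greedy = Q.argmax(axis=1)             # greedy policy w.r.t. Q*, hence optimal
```

As the text states, the iterates stay bounded by Rmax/(1 − γ) (here 1/0.1 = 10), and the limit is the fixed point Q* = TQ*.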
We use Q^π : X × A → R to denote the action-value function of policy π:

Q^π(x, a) = E^π[ ∑_{t=0}^∞ γ^t R_t | X_0 = x, A_0 = a ].

The goal is to find a policy that attains the best possible values, V*(x) = sup_π V^π(x), at all states x ∈ X. Here V* is called the optimal value function, and a policy π* that satisfies V^{π*}(x) = V*(x) for all x ∈ X is called optimal. The optimal action-value function is Q*(x, a) = sup_π Q^π(x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π̂(·; Q), if, for all x ∈ X, π(x) ∈ argmax_{a∈A} Q(x, a). Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q* is optimal. For π : X → A we define its evaluation operator, T^π : B(X × A) → B(X × A), by

(T^π Q)(x, a) = r(x, a) + γ ∫_X Q(y, π(y)) P(dy|x, a).

It is known that Q^π = T^π Q^π. Further, if we let the Bellman operator, T : B(X × A) → B(X × A), be defined by

(T Q)(x, a) = r(x, a) + γ ∫_X sup_{b∈A} Q(y, b) P(dy|x, a),

then Q* = TQ*. It is known that V^π and Q^π are bounded by Rmax/(1 − γ), just like Q* and V*. For π : X → A, the operator E^π : B(X × A) → B(X) is defined by (E^π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X) is defined by (EQ)(x) = sup_{a∈A} Q(x, a).
Throughout the paper, F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A, and Π ⊂ A^X will be a set of policies. For ν ∈ M(X) and measurable f : X → R, we let (for p ≥ 1) ‖f‖^p_{p,ν} = ∫_X |f(x)|^p ν(dx). We simply write ‖f‖_ν for ‖f‖_{2,ν}. Further, we extend ‖·‖_ν to F by

‖f‖²_ν = ∫_A ∫_X |f|²(x, a) dν(x) dλ_A(a),

where λ_A is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral ∫ f(x) ν(dx). We denote the space of bounded measurable functions with domain X by B(X). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X; K). We let ‖·‖∞ denote the supremum norm.

2 Fitted Q-iteration with approximate policy maximization

We assume that we are given a finite trajectory, {(X_t, A_t, R_t)}_{1≤t≤N}, generated by some stochastic stationary policy π_b, called the behavior policy: A_t ∼ π_b(·|X_t), X_{t+1} ∼ P(·|X_t, A_t), R_t ∼ S(·|X_t, A_t), where π_b(·|x) is a density with π_0 := inf_{(x,a)∈X×A} π_b(a|x) > 0.
The generic recipe for fitted Q-iteration (FQI) [5] is

Q_{k+1} = Regress(D_k(Q_k)),    (1)

where Regress is an appropriate regression procedure and D_k(Q_k) is a dataset defining a regression problem in the form of a list of data-point pairs:

D_k(Q_k) = { [ (X_t, A_t), R_t + γ max_{b∈A} Q_k(X_{t+1}, b) ] }_{1≤t≤N}.¹

Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this, note that value iteration would assign the value (TQ_k)(x, a) = r(x, a) + γ ∫ max_{b∈A} Q_k(y, b) P(dy|x, a) to Q_{k+1}(x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y) is defined by the conditional expectation of Y given Z: m(Z) = E[Y|Z].
Since for any fixed function Q, E[R_t + γ max_{b∈A} Q(X_{t+1}, b) | X_t, A_t] = (TQ)(X_t, A_t), the regression function corresponding to the data D_k(Q) is indeed TQ, and hence if FQI solved the regression problem defined by Q_k exactly, it would simulate value iteration exactly. However, this argument by itself does not lead to a rigorous analysis of FQI: since Q_k is obtained from the data, it is itself a random function. Hence, after the first iteration, the "target" function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties, consider a dataset where X_1, ..., X_N is a sequence of independent random variables, all distributed uniformly in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (X_t)_{t=1}^N, and let U be another random variable, uniformly distributed in [0, 1]. Now define the regression problem by Y_t = f_{M,U}(X_t), where f_{M,U}(x) = sgn(sin(2M²π(x + U))). Then it is not hard to see that, no matter how big N is, no procedure can estimate the regression function f_{M,U} with a small error (in expectation, or with high probability), even if the procedure could exploit knowledge of the specific form of f_{M,U}. On the other hand, if we restricted M to a finite range, then the estimation problem could be solved successfully. The example shows that if the complexity of the random functions defining the regression problem is uncontrolled, then successful estimation might be impossible.

¹Since the designer controls Q_k, we may assume that it is continuous, hence the maximum exists.

Amongst the many available regression methods, in this paper we have chosen to work with least-squares methods.
In this case Equation (1) takes the form

Q_{k+1} = argmin_{Q∈F} ∑_{t=1}^N [1/π_b(A_t|X_t)] ( Q(X_t, A_t) − [ R_t + γ max_{b∈A} Q_k(X_{t+1}, b) ] )².    (2)

We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/π_b(A_t|X_t) since we do not want to give more weight to those actions that are preferred by the behavior policy.
Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F, consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights to the neural net in all possible ways, and the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions; in this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a reproducing kernel Hilbert space (e.g., a ball in it). In this case the training procedure becomes similar to LS-SVM training [7].
As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset.
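For a linearly parameterized choice of F, the minimization in (2) is a weighted least-squares problem with a closed-form solution. The sketch below performs one iterate Q_k → Q_{k+1}; the feature map `phi`, the behavior density `pi_b_density`, and the synthetic batch are stand-ins invented for illustration, and the maximization over b ∈ A is approximated by a finite grid of candidate actions (the paper's point is precisely that this step is delicate in continuous action spaces).

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, a):
    """Hypothetical feature map for Q(x,a) = phi(x,a)^T theta; any basis would do."""
    return np.array([1.0, x, a, x * a, a * a])

def pi_b_density(a, x):
    """Stand-in behavior-policy density pi_b(a|x); uniform on A = [-1, 1] here."""
    return 0.5

def lsfqi_update(data, Q_k, gamma=0.9, actions=np.linspace(-1, 1, 21)):
    """One step of (2): weighted least squares onto the targets
    R_t + gamma * max_b Q_k(X_{t+1}, b), with weights 1/pi_b(A_t|X_t)."""
    Phi, y, w = [], [], []
    for (x, a, r_t, x_next) in data:
        Phi.append(phi(x, a))
        y.append(r_t + gamma * max(Q_k(x_next, b) for b in actions))
        w.append(1.0 / pi_b_density(a, x))
    Phi, y, w = np.array(Phi), np.array(y), np.array(w)
    # Weighted least squares: scale rows by sqrt(w) and solve in closed form.
    theta, *_ = np.linalg.lstsq(Phi * w[:, None] ** 0.5, y * w ** 0.5, rcond=None)
    return lambda x, a, th=theta: phi(x, a) @ th

# Synthetic batch {(X_t, A_t, R_t, X_{t+1})} from some behavior policy (invented).
data = [(x, a, -x * x + a, float(np.clip(x + a, -1, 1)))
        for x, a in zip(rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200))]
Q1 = lsfqi_update(data, lambda x, a: 0.0)   # one update starting from Q_0 = 0
```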
Another complication is that the samples in a trajectory are in general correlated, and that the bias introduced by the imperfections of the approximation architecture may lead to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8].
Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that, under appropriate conditions, these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions, since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further:

Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fat-shattering function of

F∨max = { V_Q : V_Q(·) = max_{a∈A} Q(·, a), Q ∈ F }

can be infinite over (0, 1/2).²

Without going into further details, let us just note that the finiteness of the fat-shattering function is a necessary and sufficient condition for learnability, and that the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9]. The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible.
One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes infeasible (even enumerating 2^{dA} points could be impossible when dA is large).
Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. Then the same property will hold for functions in F∨max, and hence, by a classical result, we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set.

²The proofs of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/.

In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy, in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q_0. Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations, and Q_0 is an initial value function (a typical choice is Q_0 ≡ 0). The algorithm computes a sequence of iterates (Q_k, π̂_k), k = 0, ..., K, defined by the following equations:

π̂_0 = argmax_{π∈Π} ∑_{t=1}^N Q_0(X_t, π(X_t)),

Q_{k+1} = argmin_{Q∈F} ∑_{t=1}^N [1/π_b(A_t|X_t)] ( Q(X_t, A_t) − [ R_t + γ Q_k(X_{t+1}, π̂_k(X_{t+1})) ] )²,    (3)

π̂_{k+1} = argmax_{π∈Π} ∑_{t=1}^N Q_{k+1}(X_t, π(X_t)).    (4)

Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method.
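Putting (3) and (4) together, one run of the modified algorithm can be sketched as follows. Everything concrete here is an invented stand-in: F is a small linear class, Π consists of clipped linear state-feedback policies, a coarse grid search over policy parameters replaces the gradient procedure mentioned above, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

def phi(x, a):
    """Hypothetical feature map; anything defining a linear class F would do."""
    return np.array([1.0, x, a, x * a, a * a])

def policy(theta, x):
    """Candidate policy set Pi: linear-in-state actions clipped to A = [-1, 1]."""
    return float(np.clip(theta[0] + theta[1] * x, -1.0, 1.0))

def search_policy(Q, states, candidates):
    """(4): pick the policy maximizing the empirical average of Q(X_t, pi(X_t)).
    A coarse grid search stands in for the gradient procedure."""
    scores = [np.mean([Q(x, policy(th, x)) for x in states]) for th in candidates]
    return candidates[int(np.argmax(scores))]

def fqi_policy_search(data, K=5, gamma=0.9, pi_b=0.5):
    """Iterates (3)-(4); pi_b is a stand-in constant behavior density."""
    states = [x for (x, _, _, _) in data]
    candidates = [(t0, t1) for t0 in np.linspace(-1, 1, 5)
                  for t1 in np.linspace(-1, 1, 5)]
    Q = lambda x, a: 0.0                            # Q_0 = 0
    pi_hat = search_policy(Q, states, candidates)   # pi_hat_0
    for _ in range(K):
        Phi = np.array([phi(x, a) for (x, a, _, _) in data])
        y = np.array([r + gamma * Q(xn, policy(pi_hat, xn))
                      for (_, _, r, xn) in data])
        w = np.full(len(data), 1.0 / pi_b) ** 0.5   # sqrt of weights 1/pi_b(A_t|X_t)
        theta, *_ = np.linalg.lstsq(Phi * w[:, None], y * w, rcond=None)  # (3)
        Q = lambda x, a, th=theta: float(phi(x, a) @ th)
        pi_hat = search_policy(Q, states, candidates)                     # (4)
    return Q, pi_hat

# Synthetic batch {(X_t, A_t, R_t, X_{t+1})}, invented for the example.
data = [(x, a, -x * x + a, float(np.clip(x + a, -1, 1)))
        for x, a in zip(rng.uniform(-1, 1, 200), rng.uniform(-1, 1, 200))]
Q_K, pi_K = fqi_policy_search(data)
```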
The cost of this step will be primarily determined by how well-behaved the iterates Q_{k+1} are in their action arguments. For example, if they were quadratic and π were linear, then the problem would be a quadratic optimization problem. However, except for special cases³, the action-value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, ..., N if the approximately maximizing actions are similar across similar states.
This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above-mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set

F∨Π = { V : V(·) = Q(·, π(·)), Q ∈ F, π ∈ Π },

and the proof will rely on controlling the complexity of F∨Π by selecting F and Π appropriately.

3 The main theoretical result

3.1 Outline of the analysis

In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f, Q ∈ F and a policy π, we define the t-th TD-error as follows:

d_t(f; Q, π) = R_t + γ Q(X_{t+1}, π(X_{t+1})) − f(X_t, A_t).

Further, we define the empirical loss function by

L̂_N(f; Q, π) = (1/N) ∑_{t=1}^N d_t²(f; Q, π) / (λ(A) π_b(A_t|X_t)),

where the normalization with λ(A) is introduced for mathematical convenience.
Then (3) can be written compactly as Q_{k+1} = argmin_{f∈F} L̂_N(f; Q_k, π̂_k).
The algorithm can then be motivated by the observation that for any f, Q, and π, L̂_N(f; Q, π) is an unbiased estimate of

L(f; Q, π) := ‖f − T^π Q‖²_ν + L*(Q, π),    (5)

where the first term is the error we are interested in and the second term captures the variance of the random samples:

L*(Q, π) = ∫_A E[ Var[ R_1 + γ Q(X_2, π(X_2)) | X_1, A_1 = a ] ] dλ_A(a).

This result is stated formally by E[L̂_N(f; Q, π)] = L(f; Q, π). Since the variance term in (5) is independent of f, argmin_{f∈F} L(f; Q, π) = argmin_{f∈F} ‖f − T^π Q‖²_ν. Thus, if π̂_k were greedy w.r.t. Q_k, then argmin_{f∈F} L(f; Q_k, π̂_k) = argmin_{f∈F} ‖f − TQ_k‖²_ν. Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting TQ_k onto the space F in an approximate manner, using empirical risk minimization w.r.t. the ‖·‖_ν distance. Since π̂_k is only approximately greedy, we will have to deal with both the error coming from the approximate projection and the error coming from the choice of π̂_k.

³Linear quadratic regulation is such a nice case. It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper.
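The identity behind (5) is the standard conditional bias–variance decomposition. Writing Z_t = (X_t, A_t) and Y_t = R_t + γQ(X_{t+1}, π(X_{t+1})), so that E[Y_t | Z_t] = (T^π Q)(Z_t), one step of the argument reads:

```latex
\begin{align*}
\mathbb{E}\left[(Y_t - f(Z_t))^2 \mid Z_t\right]
  &= \left(\mathbb{E}[Y_t \mid Z_t] - f(Z_t)\right)^2 + \operatorname{Var}\left[Y_t \mid Z_t\right] \\
  &= \left((T^{\pi}Q)(Z_t) - f(Z_t)\right)^2
   + \operatorname{Var}\left[R_t + \gamma Q(X_{t+1}, \pi(X_{t+1})) \mid Z_t\right].
\end{align*}
```

Dividing by λ(A)π_b(A_t|X_t) and taking expectations, the importance weight converts the sampled action distribution π_b(·|X_t) into the uniform λ_A, so the first term integrates to ‖f − T^π Q‖²_ν and the second to L*(Q, π), which gives (5).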
To make this clear, we write the iteration in the form

Q_{k+1} = T^{π̂_k} Q_k + ε′_k = TQ_k + ε′_k + (T^{π̂_k} Q_k − TQ_k) = TQ_k + ε_k,

where ε′_k is the error committed while computing T^{π̂_k} Q_k, ε″_k := T^{π̂_k} Q_k − TQ_k is the error committed because the greedy policy is computed approximately, and ε_k = ε′_k + ε″_k is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that, when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning.

3.2 Assumptions

3.2.1 Assumptions on the training data

We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that the future depends only weakly on the past.

Assumption A2 (Sample Path Properties) Assume that {(X_t, A_t, R_t)}_{t=1,...,N} is the sample path of π_b, a stochastic stationary policy. Further, assume that {X_t} is strictly stationary (X_t ∼ ν ∈ M(X)) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).⁴ We further assume that the sampling policy π_b satisfies π_0 = inf_{(x,a)∈X×A} π_b(a|x) > 0.

The β-mixing property will be used to establish tail inequalities for certain empirical processes.⁵ Note that the mixing coefficients do not need to be known.
In the case when no mixing condition is satisfied, learning might be impossible. To see this, just consider the case when X_1 = X_2 = ... = X_N. In this case the learner has many copies of the same random variable, and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result: when the process reaches its steady state quickly, the result would still hold, at the price of a more involved proof.

⁴For the definition of β-mixing, see e.g. [2].
⁵We say "empirical process" and "empirical measure", but note that in this work these are based on dependent (mixing) samples.

3.2.2 Assumptions on the MDP

In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]:

Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P(·|x, a) is absolutely continuous w.r.t. ν and that the Radon–Nikodym derivative of P w.r.t. ν is uniformly bounded with bound C_ν: C_ν := sup_{x∈X, a∈A} ‖ dP(·|x, a)/dν ‖∞ < +∞.

Note that by the definition of measure differentiation, Assumption A3 means that P(·|x, a) ≤ C_ν ν(·). This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption:

Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {π_m}_{m≥1}, assume that the future-state distribution ρP^{π_1}P^{π_2}···P^{π_m} is absolutely continuous w.r.t. ν.
Assume that

c(m) := sup_{π_1,...,π_m} ‖ d(ρP^{π_1}P^{π_2}···P^{π_m})/dν ‖∞

satisfies ∑_{m≥1} m γ^{m−1} c(m) < +∞. We shall call

C_{ρ,ν} := max{ (1 − γ)² ∑_{m≥1} m γ^{m−1} c(m), (1 − γ) ∑_{m≥1} γ^m c(m) }

the discounted-average concentrability coefficient of the future-state distributions.

The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general, we expect c(m) to grow with m. In fact, the condition that C_{ρ,ν} is finite is a growth rate condition on c(m). Thanks to discounting, C_{ρ,ν} is finite for a reasonably large class of systems (see the discussion in [11]).
A related assumption is needed in the error analysis of the approximate greedy step of the algorithm:

Assumption A5 (The random policy "makes no peak-states") Consider the distribution µ = (ν × λ_A)P, which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.⁶ Then Γ_ν = ‖dµ/dν‖∞ < +∞.

Note that under Assumption A3 we have Γ_ν ≤ C_ν.
This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability that the next state is in a set is upper bounded by Γ_ν times the probability that the starting state is in the same set.
Besides, we assume that A has the following regularity property: let

Py(a, h, ρ) := { (a′, v) ∈ R^{dA+1} : ‖a − a′‖_1 ≤ ρ, 0 ≤ v/h ≤ 1 − ‖a − a′‖_1/ρ }

denote the pyramid with height h and base given by the ℓ1-ball B(a, ρ) := { a′ ∈ R^{dA} : ‖a − a′‖_1 ≤ ρ } centered at a.

Assumption A6 (Regularity of the action space) We assume that there exists α > 0 such that for all a ∈ A and for all ρ > 0,

λ(Py(a, 1, ρ) ∩ (A × R)) / λ(Py(a, 1, ρ)) ≥ min( α, λ(A)/λ(B(a, ρ)) ).

For example, if A is an ℓ1-ball itself, then this assumption will be satisfied with α = 2^{−dA}.
Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition:

Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exist L_P, L_r > 0 such that for all (x, a, a′) ∈ X × A × A and all measurable sets B of X,

|P(B|x, a) − P(B|x, a′)| ≤ L_P ‖a − a′‖_1,
|r(x, a) − r(x, a′)| ≤ L_r ‖a − a′‖_1.

Note that previously Lipschitzness w.r.t.
the state variables was used, e.g., in [11] to construct consistent planning algorithms.

3.2.3 Assumptions on the function sets used by the algorithm

These assumptions are less demanding, since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see from the bounds. The first assumption concerns the class F:

Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any element of F is uniformly Lipschitz in its action argument, in the sense that |Q(x, a) − Q(x, a′)| ≤ L_A ‖a − a′‖_1 holds for any x ∈ X, a, a′ ∈ A, and Q ∈ F.

⁶Remember that λ_A denotes the uniform distribution over the action set A.

We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.⁷ Here we use the pseudo-dimension of function sets, which builds upon the concept of VC-dimension:

Definition 3.1 (Pseudo-dimension). The pseudo-dimension V_{F+} of F is defined as the VC-dimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F).

Since A is multidimensional, we define V_{Π+} to be the sum of the pseudo-dimensions of the coordinate projection spaces Π_k of Π:

V_{Π+} = ∑_{k=1}^{dA} V_{Π_k^+},  where  Π_k = { π_k : X → R : π = (π_1, ..., π_k, ..., π_{dA}) ∈ Π }.

Now we are ready to state our assumptions on our function sets:

Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Q_max) for some Q_max > 0 and that V_{F+} < +∞.
Also, A ⊂ [−A∞, A∞]^{dA} and V_{Π+} < +∞.
Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce

e*(F, Π) = sup_{Q∈F} inf_{π∈Π} ν(EQ − E^π Q).

Note that inf_{π∈Π} ν(EQ − E^π Q) measures the quality of approximating νEQ by νE^π Q. Hence, e*(F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large.
Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: for a fixed policy π, the one-step Bellman-error of F w.r.t. T^π is defined as

E_1(F; π) = sup_{Q∈F} inf_{Q′∈F} ‖Q′ − T^π Q‖_ν.

Taking again a pessimistic approach, the one-step Bellman-error of F is defined as

E_1(F, Π) = sup_{π∈Π} E_1(F; π).

Typically, by increasing F, E_1(F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions), which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning.

3.3 The main result

Theorem 3.2. Let π_K be a greedy policy w.r.t. Q_K, i.e., π_K(x) ∈ argmax_{a∈A} Q_K(x, a). Then, under Assumptions A1, A2, and A5–A9, for all δ > 0 we have, with probability at least 1 − δ: given Assumption A3 (respectively A4), ‖V* − V^{π_K}‖∞ (resp. ‖V* − V^{π_K}‖_{1,ρ}) is bounded by

C { ( E_1(F, Π) + e*(F, Π) + (log N + log(K/δ))^{(κ+1)/(4κ)} / N^{1/4} )^{1/(dA+1)} + γ^K },

where C depends on dA, V_{F+}, (V_{Π_k^+})_{k=1}^{dA}, γ, κ, b, β, C_ν (resp. C_{ρ,ν}), Γ_ν, L_A, L_P, L_r, α, λ(A), π_0, Q_max, R_max, R̂_max, and A∞. In particular, C scales with V^{(κ+1)/(4κ(dA+1))}, where V = 2V_{F+} + V_{Π+} plays the role of the "combined effective" dimension of F and Π.

⁷Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14].

4 Discussion

We have presented what we believe are the first finite-time bounds for continuous-state and action-space RL that use value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers, for which no previous theoretical analysis existed (e.g., [15, 16]). In fact, our main motivation was to show that there is a systematic way of making these algorithms work and, at the same time, to point at possible problem sources. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actor-critic algorithms. The bound in this paper is similar in many respects to a previous bound for a Bellman-residual minimization algorithm [2]. It appears that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces.
Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the method of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similarly to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility.

Acknowledgments

András Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences (Bolyai Fellowship). Csaba Szepesvári gratefully acknowledges the support received from the Alberta Ingenuity Fund, NSERC, and the Computer and Automation Research Institute of the Hungarian Academy of Sciences.

References

[1] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006.
[2] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted).
[3] A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE ADPRL, pages 330–337, 2007.
[4] D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978.
[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
[6] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.
[7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods).
Cambridge University Press, 2000.
[8] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995.
[9] P. L. Bartlett, P. M. Long, and R. C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996.
[10] A. N. Kolmogorov and V. M. Tihomirov. ε-entropy and ε-capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961.
[11] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Technical report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006.
[12] A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000.
[13] P. L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007.
[14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
[15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005.
[16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007.
", "award": [], "sourceid": 917, "authors": [{"given_name": "Andr\u00e1s", "family_name": "Antos", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}