{"title": "Regularized Policy Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 441, "page_last": 448, "abstract": "In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to two widely-used policy evaluation methods: Bellman residual minimization (BRM) and least-squares temporal difference learning (LSTD). We derive efficient implementation for our algorithms when the approximate value-functions belong to a reproducing kernel Hilbert space. We also provide finite-sample performance bounds for our algorithms and show that they are able to achieve optimal rates of convergence under the studied conditions.", "full_text": "Regularized Policy Iteration\n\nAmir-massoud Farahmand1, Mohammad Ghavamzadeh2, Csaba Szepesv\u00b4ari1, Shie Mannor3 \u2217\n\n1Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada\n\n3Department of ECE, McGill University, Canada - Department of EE, Technion, Israel\n\n2INRIA Lille - Nord Europe, Team SequeL, France\n\nAbstract\n\nIn this paper we consider approximate policy-iteration-based reinforcement learn-\ning algorithms. In order to implement a \ufb02exible function approximation scheme\nwe propose the use of non-parametric methods with regularization, providing a\nconvenient way to control the complexity of the function approximator. We pro-\npose two novel regularized policy iteration algorithms by adding L2-regularization\nto two widely-used policy evaluation methods: Bellman residual minimization\n(BRM) and least-squares temporal difference learning (LSTD). 
We derive ef\ufb01-\ncient implementation for our algorithms when the approximate value-functions\nbelong to a reproducing kernel Hilbert space. We also provide \ufb01nite-sample per-\nformance bounds for our algorithms and show that they are able to achieve optimal\nrates of convergence under the studied conditions.\n\n1 Introduction\n\nA key idea in reinforcement learning (RL) is to learn an action-value function which can then be\nused to derive a good control policy [15]. When the state space is large or in\ufb01nite, value-function\napproximation techniques are necessary, and their quality has a major impact on the quality of the\nlearned policy. Existing techniques include linear function approximation (see, e.g., Chapter 8 of\n[15]), kernel regression [12], regression tree methods [5], and neural networks (e.g., [13]). The user\nof these techniques often has to make non-trivial design decisions such as what features to use in\nthe linear function approximator, when to stop growing trees, how many trees to grow, what kernel\nbandwidth to use, or what neural network architecture to employ. Of course, the best answers to\nthese questions depend on the characteristics of the problem in hand. Hence, ideally, these questions\nshould be answered in an automated way, based on the training data.\nA highly desirable requirement for any learning system is to adapt to the actual dif\ufb01culty of the\nlearning problem. If the problem is easier (than some other problem), the method should deliver\nbetter solution(s) with the same amount of data. In the supervised learning literature, such proce-\ndures are called adaptive [7]. There are many factors that can make a problem easier, such as when\nonly a few of the inputs are relevant, when the input data lies on a low-dimensional submanifold of\nthe input space, when special noise conditions are met, when the expansion of the target function\nis sparse in a basis, or when the target function is highly smooth. 
These are called the regularities\nof the problem. An adaptive procedure is built in two steps: 1) designing flexible methods with\na few tunable parameters that are able to deliver \u201coptimal\u201d performance for any targeted regularity,\nprovided that their parameters are chosen properly, and 2) tuning the parameters automatically\n(automatic model-selection).\nSmoothness is one of the most important regularities: In regression, when the target function has\nsmoothness of order p, the optimal rate of convergence of the squared L2-error is $n^{-2p/(2p+d)}$,\nwhere n is the number of data points and d is the dimension of the input space [7]. Hence, the rate\nof convergence is higher for larger p\u2019s. Methods that achieve the optimal rate are more desirable, at\nleast in the limit for large n, and seem to perform well in practice. However, only a few methods\nin the regression literature are known to achieve the optimal rates. In fact, it is known that tree\nmethods with averaging in the leaves, linear methods with piecewise constant basis functions, and\nkernel estimates do not achieve the optimal rate, while neural networks and regularized least-squares\nestimators do [7]. An advantage of using a regularized least-squares estimator compared to neural\nnetworks is that these estimators do not get stuck in local minima and therefore their training is more\nreliable.\n\n\u2217Csaba Szepesv\u00e1ri is on leave from MTA SZTAKI. This research was funded in part by the National Science\nand Engineering Research Council (NSERC), iCore and the Alberta Ingenuity Fund. We acknowledge the\ninsightful comments by the reviewers.\n\nIn this paper we study how to add L2-regularization to value function approximation in RL. The\nproblem setting is to find a good policy in a batch or active learning scenario for infinite-horizon\nexpected total discounted reward Markovian decision problems with continuous state and finite action\nspaces. 
We propose two novel policy evaluation algorithms by adding L2-regularization to two\nwidely-used policy evaluation methods in RL: Bellman residual minimization (BRM) [16; 3] and\nleast-squares temporal difference learning (LSTD) [4]. We show how our algorithms can be imple-\nmented ef\ufb01ciently when the value-function approximator belongs to a reproducing kernel Hilbert\nspace. We also prove \ufb01nite-sample performance bounds for our algorithms. In particular, we show\nthat they are able to achieve a rate that is as good as the corresponding regression rate when the\nvalue functions belong to a known smoothness class. We further show that this rate of convergence\ncarries through to the performance of a policy found by running policy iteration with our regularized\npolicy evaluation methods. The results indicate that from the point of view of convergence rates\nRL is not harder than regression estimation, answering an open question of Antos et al. [2]. Due\nto space limitations, we do not present the proofs of our theorems in the paper; they can be found,\nalong with some empirical results using our algorithms, in [6].\nTo our best knowledge this is the \ufb01rst work that addresses \ufb01nite-sample performance of a regularized\nRL algorithm. While regularization in RL has not been thoroughly explored, there has been a few\nworks that used regularization. Xu et al. [17] used sparsi\ufb01cation in LSTD. Although sparsi\ufb01cation\ndoes achieve some form of regularization, to the best of our knowledge the effect of sparsi\ufb01cation\non generalization error is not well-understood. Note that sparsi\ufb01cation is fundamentally different\nfrom our approach. In our method the empirical error and the penalties jointly determine the solu-\ntion, while in sparsi\ufb01cation \ufb01rst a subset of points is selected independently of the empirical error,\nwhich are then used to obtain a solution. 
Comparing the efficiency of these methods requires further\nresearch, but the two methods can be combined, as was done in our experiments. Jung and Polani\n[9] explored adding regularization to BRM, but this solution is restricted to deterministic problems.\nThe main contribution of that work was the development of fast incremental algorithms using sparsification\ntechniques. L1 penalties have been considered by [11], who were similarly concerned with\nincremental implementations and computational efficiency.\n\n2 Preliminaries\n\nAs we shall work with continuous spaces, we first introduce a few concepts from analysis. This is\nfollowed by an introduction to Markovian Decision Processes (MDPs) and the associated concepts\nand notation.\nFor a measurable space with domain S, we let M(S) and B(S; L) denote the set of probability\nmeasures over S and the space of bounded measurable functions with domain S and bound 0 <\nL < \u221e, respectively. For a measure \u03bd \u2208 M(S), and a measurable function f : S \u2192 R, we define\nthe L2(\u03bd)-norm of f, $\\|f\\|_{\\nu}$, and its empirical counterpart $\\|f\\|_{\\nu,n}$ as follows:\n\n$$\\|f\\|_{\\nu}^2 = \\int |f(s)|^2 \\, \\nu(ds), \\qquad \\|f\\|_{\\nu,n}^2 \\overset{\\mathrm{def}}{=} \\frac{1}{n} \\sum_{t=1}^{n} f^2(s_t), \\quad s_t \\sim \\nu. \\tag{1}$$\n\nIf {st} is ergodic, $\\|f\\|_{\\nu,n}^2$ converges to $\\|f\\|_{\\nu}^2$ as n \u2192 \u221e.\nA finite-action discounted MDP is a tuple (X ,A, P, S, \u03b3), where X is the state space, A =\n{a1, a2, . . . , aM} is the finite set of M actions, P : X \u00d7 A \u2192 M(X ) is the transition probability\nkernel with P (\u00b7|x, a) defining the next-state distribution upon taking action a in state x, S(\u00b7|x, a)\ngives the corresponding distribution of immediate rewards, and \u03b3 \u2208 (0, 1) is a discount factor. 
We\nmake the following assumptions on the MDP:\nAssumption A1 (MDP Regularity) The set of states X is a compact subspace of the d-dimensional\nEuclidean space and the expected immediate rewards $r(x, a) = \\int r \\, S(dr|x, a)$ are bounded by\nRmax.\nWe denote by \u03c0 : X \u2192 M(A) a stationary Markov policy. A policy is deterministic if it is a\nmapping from states to actions \u03c0 : X \u2192 A. The value and the action-value functions of a policy \u03c0,\ndenoted respectively by V \u03c0 and Q\u03c0, are defined as the expected sum of discounted rewards that are\nencountered when the policy \u03c0 is executed:\n\n$$V^{\\pi}(x) = \\mathbb{E}_{\\pi}\\left[ \\sum_{t=0}^{\\infty} \\gamma^t R_t \\,\\middle|\\, X_0 = x \\right], \\qquad Q^{\\pi}(x, a) = \\mathbb{E}_{\\pi}\\left[ \\sum_{t=0}^{\\infty} \\gamma^t R_t \\,\\middle|\\, X_0 = x, A_0 = a \\right].$$\n\nHere Rt denotes the reward received at time step t; Rt \u223c S(\u00b7|Xt, At), Xt evolves according to\nXt+1 \u223c P (\u00b7|Xt, At), and At is sampled from the policy At \u223c \u03c0(\u00b7|Xt). It is easy to see that for any\npolicy \u03c0, the functions V \u03c0 and Q\u03c0 are bounded by Vmax = Qmax = Rmax/(1\u2212\u03b3). The action-value\nfunction of a policy is the unique fixed-point of the Bellman operator T \u03c0 : B(X \u00d7A) \u2192 B(X \u00d7A)\ndefined by\n\n$$(T^{\\pi} Q)(x, a) = r(x, a) + \\gamma \\int Q(y, \\pi(y)) \\, P(dy|x, a).$$\n\nGiven an MDP, the goal is to find a policy that attains the best possible values, V \u2217(x) =\nsup\u03c0 V \u03c0(x),\u2200x \u2208 X . Function V \u2217 is called the optimal value function. Similarly the optimal\naction-value function is defined as Q\u2217(x, a) = sup\u03c0 Q\u03c0(x, a),\u2200x \u2208 X ,\u2200a \u2208 A. We say that\na deterministic policy \u03c0 is greedy w.r.t. an action-value function Q and write $\\pi = \\hat{\\pi}(\\cdot; Q)$, if\n$\\pi(x) \\in \\arg\\max_{a \\in A} Q(x, a)$ for all x \u2208 X . 
Greedy policies are important because any greedy\npolicy w.r.t. Q\u2217 is optimal. Hence, knowing Q\u2217 is sufficient for behaving optimally. In this paper\nwe shall deal with a variant of the policy iteration algorithm [8]. In the basic version of policy\niteration an optimal policy is found by computing a series of policies, each being greedy w.r.t. the\naction-value function of the previous one.\nThroughout the paper we denote by F^M \u2282 { f : X \u00d7 A \u2192 R} some subset of real-valued functions\nover the state-action space X \u00d7 A, and use it as the set of admissible functions in the optimization\nproblems of our algorithms. We will treat f \u2208 F^M as f \u2261 (f1, . . . , fM ), fj(x) =\nf(x, aj), j = 1, . . . , M. For \u03bd \u2208 M(X ), we extend $\\|\\cdot\\|_{\\nu}$ and $\\|\\cdot\\|_{\\nu,n}$ defined in Eq. (1) to F^M by\n$\\|f\\|_{\\nu}^2 = \\frac{1}{M} \\sum_{j=1}^{M} \\|f_j\\|_{\\nu}^2$, and\n\n$$\\|f\\|_{\\nu,n}^2 = \\frac{1}{nM} \\sum_{t=1}^{n} \\sum_{j=1}^{M} \\mathbb{I}_{\\{A_t = a_j\\}} f_j^2(X_t) = \\frac{1}{nM} \\sum_{t=1}^{n} f^2(X_t, A_t), \\tag{2}$$\n\nwhere I{\u00b7} is the indicator function: for an event E, I{E} = 1 if and only if E holds and I{E} = 0,\notherwise.\n\n3 Approximate Policy Evaluation\n\nThe ability to evaluate a given policy is the core requirement to run policy iteration. Loosely speaking,\nin policy evaluation the goal is to find a \u201cclose enough\u201d approximation V (or Q) of the value\n(or action-value) function of a fixed target policy \u03c0, V \u03c0 (or Q\u03c0). There are several interpretations to\nthe term \u201cclose enough\u201d in this context and it does not necessarily refer to a minimization of some\nnorm. 
If Q\u03c0 (or noisy estimates of it) is available at a number of points (Xt, At), one can form a\ntraining set of examples of the form {(Xt, At), Qt}1\u2264t\u2264n, where Qt is an estimate of Q\u03c0(Xt, At)\nand then use a supervised learning algorithm to infer a function Q that is meant to approximate Q\u03c0.\nUnfortunately, in the context of control, the target function, Q\u03c0, is not known in advance and its\nhigh quality samples are often very expensive to obtain if this option is available at all. Most often\nthese values have to be inferred from the observed system dynamics, where the observations do not\nnecessarily come from following the target policy \u03c0. This is referred to as the off-policy learning\nproblem in the RL literature. The dif\ufb01culty arising is similar to the problem when training and test\ndistributions differ in supervised learning. Many policy evaluation techniques have been developed\nin RL. Here we concentrate on the ones that are directly related to our proposed algorithms.\n\n\f3.1 Bellman Residual Minimization\n\nThe idea of Bellman residual minimization (BRM) goes back at least to the work of Schweitzer and\nSeidmann [14]. It was used later in the RL community by Williams and Baird [16] and Baird [3].\nThe basic idea of BRM comes from writing the \ufb01xed-point equation for the Bellman operator in the\nform Q\u03c0 \u2212 T \u03c0Q\u03c0 = 0. When Q\u03c0 is replaced by some other function Q, the left-hand side becomes\nnon-zero. The resulting quantity, Q \u2212 T \u03c0Q, is called the Bellman residual of Q. If the magnitude\nof the Bellman residual, kQ \u2212 T \u03c0Qk, is small, then Q can be expected to be a good approximation\nof Q\u03c0. For an analysis using supremum norms see, e.g., [16]. 
It seems, however, more natural\nto use a weighted L2-norm to measure the magnitude of the Bellman residual as it leads to an\noptimization problem with favorable characteristics and enables an easy connection to regression\nfunction estimation [7]. Hence, we define the loss function $L_{BRM}(Q; \\pi) = \\|Q - T^{\\pi} Q\\|_{\\nu}^2$, where\n\u03bd is the stationary distribution of states in the input data. Using Eq. (2) with samples (Xt, At)\nand by replacing (T \u03c0Q)(Xt, At) with its sample-based approximation $(\\hat{T}^{\\pi} Q)(X_t, A_t) = R_t + \\gamma Q(X_{t+1}, \\pi(X_{t+1}))$,\nthe empirical counterpart of $L_{BRM}(Q; \\pi)$ can be written as\n\n$$\\hat{L}_{BRM}(Q; \\pi, n) = \\frac{1}{nM} \\sum_{t=1}^{n} \\left[ Q(X_t, A_t) - \\left( R_t + \\gamma Q\\big(X_{t+1}, \\pi(X_{t+1})\\big) \\right) \\right]^2. \\tag{3}$$\n\nHowever, as is well known (e.g., see [15], [10]), in general, $\\hat{L}_{BRM}$ is not an unbiased estimate of\n$L_{BRM}$; $\\mathbb{E}\\big[\\hat{L}_{BRM}(Q; \\pi, n)\\big] \\neq L_{BRM}(Q; \\pi)$. The reason is that stochastic transitions may lead to a\nnon-vanishing variance term in Eq. (3). A common suggestion to deal with this problem is to use\nuncorrelated or \u201cdouble\u201d samples in $\\hat{L}_{BRM}$. According to this proposal, for each state-action pair in\nthe sample, at least two next states should be generated (e.g., see [15]). This is neither realistic nor\nsample-efficient unless a generative model of the environment is available or the state transitions are\ndeterministic. Antos et al. [2] recently proposed a de-biasing procedure for this problem. We will\nrefer to it as modified BRM in this paper. 
The idea is to cancel the unwanted variance by introducing\nan auxiliary function h and a new loss function $L_{BRM}(Q, h; \\pi) = L_{BRM}(Q; \\pi) - \\|h - T^{\\pi} Q\\|_{\\nu}^2$, and\napproximating the action-value function Q\u03c0 by solving\n\n$$\\hat{Q}_{BRM} = \\arg\\min_{Q \\in \\mathcal{F}^M} \\sup_{h \\in \\mathcal{F}^M} L_{BRM}(Q, h; \\pi), \\tag{4}$$\n\nwhere the supremum comes from the negative sign of $\\|h - T^{\\pi} Q\\|_{\\nu}^2$. They showed that optimizing\nthe new loss function still makes sense and the empirical version of this loss is unbiased. Solving\nEq. (4) requires solving the following nested optimization problems:\n\n$$h^*_Q = \\arg\\min_{h \\in \\mathcal{F}^M} \\|h - \\hat{T}^{\\pi} Q\\|_{\\nu}^2, \\qquad \\hat{Q}_{BRM} = \\arg\\min_{Q \\in \\mathcal{F}^M} \\left[ \\|Q - \\hat{T}^{\\pi} Q\\|_{\\nu}^2 - \\|h^*_Q - \\hat{T}^{\\pi} Q\\|_{\\nu}^2 \\right]. \\tag{5}$$\n\nOf course in practice, T \u03c0Q is replaced by its sample-based approximation $\\hat{T}^{\\pi} Q$.\n\n3.2 Least-Squares Temporal Difference Learning\n\nLeast-squares temporal difference learning (LSTD) was first proposed by Bradtke and Barto [4],\nand later was extended to control by Lagoudakis and Parr [10]. They called the resulting algorithm\nleast-squares policy iteration (LSPI), which is an approximate policy iteration algorithm based on\nLSTD. Unlike BRM that minimizes the distance of Q and T \u03c0Q, LSTD minimizes the distance of Q\nand \u03a0T \u03c0Q, the back-projection of the image of Q under the Bellman operator, T \u03c0Q, onto the space\nof admissible functions F^M (see Figure 1). Formally, this means that LSTD minimizes the loss\nfunction $L_{LSTD}(Q; \\pi) = \\|Q - \\Pi T^{\\pi} Q\\|_{\\nu}^2$. It can also be seen as finding a good approximation for\nthe fixed-point of operator \u03a0T \u03c0. 
The projection operator \u03a0 : B(X \u00d7 A) \u2192 B(X \u00d7 A) is defined\nby $\\Pi f = \\arg\\min_{h \\in \\mathcal{F}^M} \\|h - f\\|_{\\nu}^2$. In order to make this minimization problem computationally\nfeasible, it is usually assumed that F^M is a linear subspace of B(X \u00d7 A). The LSTD solution can\ntherefore be written as the solution of the following nested optimization problems:\n\n$$h^*_Q = \\arg\\min_{h \\in \\mathcal{F}^M} \\|h - T^{\\pi} Q\\|_{\\nu}^2, \\qquad \\hat{Q}_{LSTD} = \\arg\\min_{Q \\in \\mathcal{F}^M} \\|Q - h^*_Q\\|_{\\nu}^2, \\tag{6}$$\n\nwhere the first equation finds the projection of T \u03c0Q onto F^M , and the second one minimizes the\ndistance of Q and the projection.\n\nFigure 1: This figure shows the loss functions minimized by\nBRM, modified BRM, and LSTD methods. The function space\nF^M is represented by the plane. The Bellman operator, T \u03c0, maps\nan action-value function Q \u2208 F^M to a function T \u03c0Q. The vector\nconnecting T \u03c0Q and its back-projection to F^M , \u03a0T \u03c0Q, is\northogonal to the function space F^M . The BRM loss function is\nthe squared Bellman error, the distance of Q and T \u03c0Q. In order\nto obtain the modified BRM loss, the squared distance of T \u03c0Q\nand \u03a0T \u03c0Q is subtracted from the squared Bellman error. LSTD\naims at a function Q that has minimum distance to \u03a0T \u03c0Q.\n\nAntos et al. [2] showed that when F^M is linear, the solution of modified BRM (Eq. 4 or 5) coincides\nwith the LSTD solution (Eq. 6). 
A quick explanation for this is: the first equations in (5) and (6) are\nthe same, the projected vector $h^*_Q - T^{\\pi} Q$ has to be perpendicular to F^M , as a result\n$\\|Q - h^*_Q\\|^2 = \\|Q - T^{\\pi} Q\\|^2 - \\|h^*_Q - T^{\\pi} Q\\|^2$ (Pythagorean theorem), and therefore the second equations in (5) and\n(6) have the same solution.\n\n4 Regularized Policy Iteration Algorithms\n\nIn this section, we introduce two regularized policy iteration algorithms. These algorithms are\ninstances of the generic policy-iteration method, whose pseudo-code is shown in Table 1. By assumption,\nthe training sample Di used at the ith (1 \u2264 i \u2264 N) iteration of the algorithm is a finite trajectory\n{(Xt, At, Rt)}1\u2264t\u2264n generated by a policy \u03c0, thus, At = \u03c0(Xt) and Rt \u223c S(\u00b7|Xt, At). Examples\nof such policy \u03c0 are \u03c0i plus some exploration or some stochastic stationary policy \u03c0b. The action-value\nfunction Q(\u22121) is used to initialize the first policy. Alternatively, one may start with an\narbitrary initial policy. The procedure PEval takes a policy \u03c0i (here the greedy policy w.r.t. the current\naction-value function Q(i\u22121)) along with training sample Di, and returns an approximation to\nthe action-value function of policy \u03c0i. There are many possibilities to design PEval. In this paper,\nwe propose two approaches, one based on regularized (modified) BRM (REG-BRM), and one based\non regularized LSTD (REG-LSTD). 
In REG-BRM, the next iteration is computed by solving the\nfollowing nested optimization problems:\n\n$$h^*(\\cdot; Q) = \\arg\\min_{h \\in \\mathcal{F}^M} \\left[ \\|h - \\hat{T}^{\\pi_i} Q\\|_n^2 + \\lambda_{h,n} J(h) \\right], \\qquad Q^{(i)} = \\arg\\min_{Q \\in \\mathcal{F}^M} \\left[ \\|Q - \\hat{T}^{\\pi_i} Q\\|_n^2 - \\|h^*(\\cdot; Q) - \\hat{T}^{\\pi_i} Q\\|_n^2 + \\lambda_{Q,n} J(Q) \\right], \\tag{7}$$\n\nwhere $(\\hat{T}^{\\pi_i} Q)(Z_t) = R_t + \\gamma Q(Z'_t)$ represents the empirical Bellman operator, $Z_t = (X_t, A_t)$ and\n$Z'_t = (X_{t+1}, \\pi_i(X_{t+1}))$ represent state-action pairs, J(h) and J(Q) are penalty functions (e.g.,\nnorms), and \u03bbh,n, \u03bbQ,n > 0 are regularization coefficients.\nIn REG-LSTD, the next iteration is computed by solving the following nested optimization problems:\n\n$$h^*(\\cdot; Q) = \\arg\\min_{h \\in \\mathcal{F}^M} \\left[ \\|h - \\hat{T}^{\\pi_i} Q\\|_n^2 + \\lambda_{h,n} J(h) \\right], \\qquad Q^{(i)} = \\arg\\min_{Q \\in \\mathcal{F}^M} \\left[ \\|Q - h^*(\\cdot; Q)\\|_n^2 + \\lambda_{Q,n} J(Q) \\right]. \\tag{8}$$\n\nTable 1: The pseudo-code of the policy-iteration algorithm\n\nFittedPolicyQ(N, Q(\u22121), PEval)\n// N: number of iterations\n// Q(\u22121): initial action-value function\n// PEval: fitting procedure\nfor i = 0 to N \u2212 1 do\n    \u03c0i(\u00b7) \u2190 \u02c6\u03c0(\u00b7; Q(i\u22121))    // the greedy policy w.r.t. Q(i\u22121)\n    Generate training sample Di\n    Q(i) \u2190 PEval(\u03c0i, Di)\nend for\nreturn Q(N\u22121) or \u03c0N (\u00b7) = \u02c6\u03c0(\u00b7; Q(N\u22121))\n\nIt is important to note that unlike the non-regularized case described in Sections 3.1 and 3.2, REG-BRM\nand REG-LSTD do not have the same solution. 
This is because, although the first equations\nin (7) and (8) are the same, the projected vector $h^*(\\cdot; Q) - \\hat{T}^{\\pi_i} Q$ is not necessarily perpendicular to\nthe admissible function space F^M . This is due to the regularization term \u03bbh,nJ(h). As a result, the\nPythagorean theorem does not hold: $\\|Q - h^*(\\cdot; Q)\\|^2 \\neq \\|Q - \\hat{T}^{\\pi_i} Q\\|^2 - \\|h^*(\\cdot; Q) - \\hat{T}^{\\pi_i} Q\\|^2$, and\ntherefore the objective functions of the second equations in (7) and (8) are not equal and they do not\nhave the same solution.\nWe now present algorithmic solutions for REG-BRM and REG-LSTD problems described above.\nWe can obtain Q(i) by solving the regularization problems of Eqs. (7) and (8) in a reproducing\nkernel Hilbert space (RKHS) defined by a Mercer kernel K. In this case, we let the regularization\nterms J(h) and J(Q) be the RKHS norms of h and Q, $\\|h\\|_{\\mathcal{H}}^2$ and $\\|Q\\|_{\\mathcal{H}}^2$, respectively. Using\nthe Representer theorem, we can then obtain the following closed-form solutions for REG-BRM\nand REG-LSTD. This is not immediate, because the solutions of these procedures are defined with\nnested optimization problems.\n\nTheorem 1. The optimizer Q \u2208 H of Eqs. (7) and (8) can be written as $Q(\\cdot) = \\sum_{i=1}^{2n} \\tilde{\\alpha}_i k(\\tilde{Z}_i, \\cdot)$,\nwhere $\\tilde{Z}_i = Z_i$ if i \u2264 n and $\\tilde{Z}_i = Z'_{i-n}$, otherwise. The coefficient vector $\\tilde{\\alpha} = (\\tilde{\\alpha}_1, \\ldots, \\tilde{\\alpha}_{2n})^{\\top}$ can\nbe obtained by\n\nREG-BRM: $\\tilde{\\alpha} = (C K_Q + \\lambda_{Q,n} I)^{-1} (D^{\\top} + \\gamma C_2^{\\top} B^{\\top} B) r$,\nREG-LSTD: $\\tilde{\\alpha} = (F^{\\top} F K_Q + \\lambda_{Q,n} I)^{-1} F^{\\top} E r$,\n\nwhere $r = (R_1, \\ldots, R_n)^{\\top}$, $C = D^{\\top} D - \\gamma^2 (B C_2)^{\\top} (B C_2)$, $B = K_h (K_h + \\lambda_{h,n} I)^{-1} - I$, $D = C_1 - \\gamma C_2$,\n$F = C_1 - \\gamma E C_2$, $E = K_h (K_h + \\lambda_{h,n} I)^{-1}$, and $K_h \\in \\mathbb{R}^{n \\times n}$, $C_1, C_2 \\in \\mathbb{R}^{n \\times 2n}$,\nand $K_Q \\in \\mathbb{R}^{2n \\times 2n}$ are defined by $[K_h]_{ij} = k(Z_i, Z_j)$, $[C_1 K_Q]_{ij} = k(Z_i, \\tilde{Z}_j)$,\n$[C_2 K_Q]_{ij} = k(Z'_i, \\tilde{Z}_j)$, and $[K_Q]_{ij} = k(\\tilde{Z}_i, \\tilde{Z}_j)$.\n\n5 Theoretical Analysis of the Algorithms\n\nIn this section, we analyze the statistical properties of the policy iteration algorithms based on REG-BRM\nand REG-LSTD. We provide finite-sample convergence results for the error between Q\u03c0N , the\naction-value function of policy \u03c0N , the policy resulting after N iterations of the algorithms, and Q\u2217,\nthe optimal action-value function. Due to space limitations, we only report assumptions and main\nresults here (refer to [6] for more details). We make the following assumptions in our analysis,\nsome of which are only technical:\n\nAssumption A2 (1) At every iteration, samples are generated i.i.d. using a fixed distribution over\nstates \u03bd and a fixed stochastic policy \u03c0b, i.e., $\\{(Z_t, R_t, Z'_t)\\}_{t=1}^{n}$ are i.i.d. samples, where $Z_t = (X_t, A_t)$,\n$Z'_t = (X'_t, \\pi(X'_t))$, $X_t \\sim \\nu \\in M(X)$, $A_t \\sim \\pi_b(\\cdot|X_t)$, $X'_t \\sim P(\\cdot|X_t, A_t)$, and \u03c0 is the\npolicy being evaluated. We further assume that \u03c0b selects all actions with non-zero probability.\n(2) The function space F used in the optimization problems (7) and (8) is a Sobolev space $W^k(\\mathbb{R}^d)$\nwith 2k > d. 
We denote by Jk(Q) the norm of Q in this Sobolev space.\n(3) The selected function space F M contains the true action-value function, i.e., Q\u03c0 \u2208 F M .\n(4) For every function Q \u2208 F M with bounded norm J(Q), its image under the Bellman operator,\nT \u03c0Q, is in the same space, and we have J(T \u03c0Q) \u2264 BJ(Q), for some positive and \ufb01nite B, which\nis independent of Q.\n(5) We assume F M \u2282 B(X \u00d7 A; Qmax), for Qmax > 0.\n(1) indicates that the training sample should be generated by an i.i.d. process. This assumption is\nused mainly for simplifying the proofs and can be extended to the case where the training sample\nis a single trajectory generated by a \ufb01xed policy with appropriate mixing conditions as was done\nin [2].\n(2) Using Sobolev space allows us to explicitly show the effect of smoothness k on the\nconvergence rate of our algorithms and to make comparison with the regression learning settings.\nNote that Sobolev spaces are large: In fact, Sobolev spaces are more \ufb02exible than H\u00a8older spaces (a\ngeneralization of Lipschitz spaces to higher order smoothness) in that in these spaces the norm mea-\nsures the average smoothness of the functions as opposed to measuring their worst-case smoothness.\nThus, functions that are smooth most over the place except for some parts that have a small measure\nwill have small Sobolev-space norms, i.e., they will be looked as \u201csimple\u201d, while they would be\nviewed as \u201ccomplex\u201d functions in H\u00a8older spaces. Actually, our results extend to other RKHS spaces\n\n\fthat have well-behaved metric entropy capacity, i.e., log N (\u03b5,F) \u2264 A\u03b5\u2212\u03b1 for some 0 < \u03b1 < 2\nand some \ufb01nite positive A. In (3), we assume that the considered function space is large enough\nto include the true action-value function. This is a standard assumption when studying the rate of\nconvergence in supervised learning [7]. 
(4) constrains the growth rate of the complexity of the norm\nof Q under Bellman updates. We believe that this is a reasonable assumption that will hold in most\npractical situations. Finally, (5) is about the uniform boundedness of the functions in the selected\nfunction space. If the solutions of our optimization problems are not bounded, they must be truncated,\nand thus, truncation arguments must be used in the analysis. Truncation does not change the\nfinal result, so we do not address it to avoid unnecessary clutter.\nWe now first derive an upper bound on the policy evaluation error in Theorem 2. We then show how\nthe policy evaluation error propagates through the iterations of policy iteration in Lemma 3. Finally,\nwe state our main result in Theorem 4, which follows directly from the first two results.\n\nTheorem 2 (Policy Evaluation Error). Let Assumptions A1 and A2 hold. Choosing\n$\\lambda_{Q,n} = c_1 \\left( \\frac{\\log(n)}{n J_k^2(Q^{\\pi})} \\right)^{\\frac{2k}{2k+d}}$ and $\\lambda_{h,n} = \\Theta(\\lambda_{Q,n})$, for any policy \u03c0, the following holds with probability at\nleast 1 \u2212 \u03b4, for c1, c2, c3, c4 > 0:\n\n$$\\left\\| \\hat{Q} - T^{\\pi} \\hat{Q} \\right\\|_{\\nu}^2 \\le c_2 \\left( J_k^2(Q^{\\pi}) \\right)^{\\frac{d}{2k+d}} \\left( \\frac{\\log(n)}{n} \\right)^{\\frac{2k}{2k+d}} + \\frac{c_3 \\log(n) + c_4 \\log(\\frac{1}{\\delta})}{n}.$$\n\nTheorem 2 shows how the number of samples and the difficulty of the problem as characterized\nby $J_k^2(Q^{\\pi})$ influence the policy evaluation error. With a large number of samples, we expect\n$\\|\\hat{Q} - T^{\\pi} \\hat{Q}\\|_{\\nu}^2$ to be small with high probability, where \u03c0 is the policy being evaluated and $\\hat{Q}$ is its estimated\naction-value function using REG-BRM or REG-LSTD.\nLet $\\hat{Q}^{(i)}$ and $\\varepsilon_i = \\hat{Q}^{(i)} - T^{\\pi_i} \\hat{Q}^{(i)}$, i = 0, . . . , N \u2212 1 denote the estimated action-value function and\nthe Bellman residual at the ith iteration of our algorithms. 
Theorem 2 indicates that at each iteration\ni, the optimization procedure finds a function $\\hat{Q}^{(i)}$ such that $\\|\\varepsilon_i\\|_{\\nu}^2$ is small with high probability.\nLemma 3, which was stated as Lemma 12 in [2], bounds the final error after N iterations as a\nfunction of the intermediate errors. Note that no assumption is made on how the sequence $\\hat{Q}^{(i)}$ is\ngenerated in this lemma. In Lemma 3 and Theorem 4, \u03c1 \u2208 M(X ) is a measure used to evaluate the\nperformance of the algorithms, and C\u03c1,\u03bd and C\u03bd are the concentrability coefficients defined in [2].\n\nLemma 3 (Error Propagation). Let p \u2265 1 be a real and N be a positive integer. Then, for any\nsequence of functions {Q(i)} \u2282 B(X \u00d7 A; Qmax), 0 \u2264 i < N, and \u03b5i as defined above, the\nfollowing inequalities hold:\n\n$$\\|Q^* - Q^{\\pi_N}\\|_{p,\\rho} \\le \\frac{2\\gamma}{(1-\\gamma)^2} \\left( C_{\\rho,\\nu}^{1/p} \\max_{0 \\le i < N} \\|\\varepsilon_i\\|_{p,\\nu} + \\gamma^{N/p} R_{\\max} \\right),$$\n\n$$\\|Q^* - Q^{\\pi_N}\\|_{\\infty} \\le \\frac{2\\gamma}{(1-\\gamma)^2} \\left( C_{\\nu}^{1/p} \\max_{0 \\le i < N} \\|\\varepsilon_i\\|_{p,\\nu} + \\gamma^{N/p} R_{\\max} \\right).$$\n\nTheorem 4 (Convergence Result). Let Assumptions A1 and A2 hold, \u03bbh,n and \u03bbQ,n use the same\nschedules as in Theorem 2, and the number of samples n be large enough. The error between\nthe optimal action-value function, Q\u2217, and the action-value function of the policy resulting after N\niterations of the policy iteration algorithm based on REG-BRM or REG-LSTD, $Q^{\\pi_N}$, satisfies\n\n$$\\|Q^* - Q^{\\pi_N}\\|_{\\rho} \\le \\frac{2\\gamma}{(1-\\gamma)^2} \\left[ c \\, C_{\\rho,\\nu}^{1/2} \\left( \\left( \\frac{\\log(n)}{n} \\right)^{\\frac{k}{2k+d}} + \\left( \\frac{\\log(\\frac{N}{\\delta})}{n} \\right)^{\\frac{1}{2}} \\right) + \\gamma^{N/2} R_{\\max} \\right],$$\n\n$$\\|Q^* - Q^{\\pi_N}\\|_{\\infty} \\le \\frac{2\\gamma}{(1-\\gamma)^2} \\left[ c \\, C_{\\nu}^{1/2} \\left( \\left( \\frac{\\log(n)}{n} \\right)^{\\frac{k}{2k+d}} + \\left( \\frac{\\log(\\frac{N}{\\delta})}{n} \\right)^{\\frac{1}{2}} \\right) + \\gamma^{N/2} R_{\\max} \\right],$$\n\nwith probability at least 1 \u2212 \u03b4 for some c > 0.\n\nTheorem 4 shows the effect of number of samples n, degree of smoothness k, number of iterations\nN, and concentrability coefficients on the quality of the policy induced by the estimated action-value\nfunction. Three important observations are: 1) the main term in the rate of convergence\nis $O(\\log(n) \\, n^{-\\frac{k}{2k+d}})$, which is an optimal rate for regression up to a logarithmic factor and hence\nit is an optimal rate for value-function estimation, 2) the effect of smoothness k is evident: for two\nproblems with different degrees of smoothness, learning the smoother one is easier \u2013 an intuitive,\nbut previously not rigorously proven result in the RL literature, and 3) increasing the number of\niterations N increases the error of the second term, but its effect is only logarithmic.\n\n6 Conclusions and Future Work\n\nIn this paper we showed how L2-regularization can be added to two widely-used policy evaluation\nmethods in RL: Bellman residual minimization (BRM) and least-squares temporal difference\nlearning (LSTD), and developed two regularized policy evaluation algorithms REG-BRM and REG-LSTD.\nWe then showed how these algorithms can be implemented efficiently when the value-function\napproximation belongs to a reproducing kernel Hilbert space (RKHS). We also proved\nfinite-sample performance bounds for REG-BRM and REG-LSTD, and the regularized policy iteration\nalgorithms built on top of them. Our theoretical results indicate that our methods are able to\nachieve the optimal rate of convergence under the studied conditions.\nOne of the remaining problems is how to find the regularization parameters: \u03bbh,n and \u03bbQ,n. Using\ncross-validation may lead to a completely self-tuning process. Another issue is the type of\nregularization. 
Here we used L2-regularization; however, the idea can be extended naturally to L1-regularization\nin the style of Lasso, opening up the possibility of procedures that can handle a high\nnumber of irrelevant features. Although the i.i.d. sampling assumption is technical, extending our\nanalysis to the case when samples are correlated requires generalizing quite a few results in supervised\nlearning. However, we believe that this can be done without difficulty following the work of\n[2]. Extending the results to continuous-action MDPs is another major challenge. Here the interesting\nquestion is whether it is possible to achieve better rates than the one currently available in the literature,\nwhich scales quite unfavorably with the dimension of the action space [1].\n\nReferences\n\n[1] A. Antos, R. Munos, and Cs. Szepesv\u00e1ri. Fitted Q-iteration in continuous action-space MDPs. In Advances in Neural Information Processing Systems 20 (NIPS-2007), pages 9\u201316, 2008.\n[2] A. Antos, Cs. Szepesv\u00e1ri, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71:89\u2013129, 2008.\n[3] L.C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30\u201337, 1995.\n[4] S.J. Bradtke and A.G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33\u201357, 1996.\n[5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. JMLR, 6:503\u2013556, 2005.\n[6] A. M. Farahmand, M. Ghavamzadeh, Cs. Szepesv\u00e1ri, and S. Mannor. L2-regularized policy iteration. 2009. (under preparation).\n[7] L. Gy\u00f6rfi, M. Kohler, A. Krzy\u017cak, and H. Walk. A distribution-free theory of nonparametric regression. Springer-Verlag, New York, 2002.\n[8] R.A. Howard. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA, 1960.\n[9] T. Jung and D. Polani. Least squares SVM for least squares TD learning. In ECAI, pages 499\u2013503, 2006.\n[10] M. Lagoudakis and R. Parr. Least-squares policy iteration. JMLR, 4:1107\u20131149, 2003.\n[11] M. Loth, M. Davy, and P. Preux. Sparse temporal difference learning using LASSO. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.\n[12] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49:161\u2013178, 2002.\n[13] M. Riedmiller. Neural fitted Q iteration \u2013 first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317\u2013328, 2005.\n[14] P. J. Schweitzer and A. Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of Mathematical Analysis and Applications, 110:568\u2013582, 1985.\n[15] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998.\n[16] R. J. Williams and L.C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, 1994.\n[17] X. Xu, D. Hu, and X. Lu. Kernel-based least squares policy iteration for reinforcement learning. IEEE Trans. on Neural Networks, 18:973\u2013992, 2007.\n", "award": [], "sourceid": 871, "authors": [{"given_name": "Amir", "family_name": "Farahmand", "institution": null}, {"given_name": "Mohammad", "family_name": "Ghavamzadeh", "institution": null}, {"given_name": "Shie", "family_name": "Mannor", "institution": null}, {"given_name": "Csaba", "family_name": "Szepesv\u00e1ri", "institution": null}]}