{"title": "Reinforcement Learning with Function Approximation Converges to a Region", "book": "Advances in Neural Information Processing Systems", "page_first": 1040, "page_last": 1046, "abstract": null, "full_text": "Reinforcement Learning with Function Approximation Converges to a Region\n\nGeoffrey J. Gordon\nggordon@cs.cmu.edu\n\nAbstract\n\nMany algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.\n\n1 Introduction\n\nAlthough there are convergent online algorithms (such as TD(λ) [1]) for learning the parameters of a linear approximation to the value function of a Markov process, no way is known to extend these convergence proofs to the task of online approximation of either the state-value (V*) or the action-value (Q*) function of a general Markov decision process. In fact, there are known counterexamples to many proposed algorithms. For example, fitted value iteration can diverge even for Markov processes [2]; Q-learning with linear function approximators can diverge, even when the states are updated according to a fixed update policy [3]; and SARSA(0) can oscillate between multiple policies with different value functions [4].\n\nGiven the similarities between SARSA(0) and Q-learning, and between V(0) and value iteration, one might suppose that their convergence properties would be identical. 
That is not the case: while Q-learning can diverge for some exploration strategies, this paper proves that the iterates for trajectory-based SARSA(0) converge with probability 1 to a fixed region. Similarly, while value iteration can diverge for some exploration strategies, this paper proves that the iterates for trajectory-based V(0) converge with probability 1 to a fixed region.¹\n\nThe question of the convergence behavior of SARSA(λ) is one of the four open theoretical questions of reinforcement learning that Sutton [5] identifies as \"particularly important, pressing, or opportune.\" This paper covers SARSA(0), and together with an earlier paper [4] describes its convergence behavior: it is stable in the sense that there exist bounded regions which with probability 1 it eventually enters and never leaves, but for some Markov decision processes it may not converge to a single point. The proofs extend easily to SARSA(λ) for λ > 0.\n\n¹In a \"trajectory-based\" algorithm, the exploration policy may not change within a single episode of learning. The policy may change between episodes, and the value function may change within a single episode. (Episodes end when the agent enters a terminal state. This paper considers only episodic tasks, but since any discounted task can be transformed into an equivalent episodic task, the algorithms apply to non-episodic tasks as well.)\n\nUnfortunately the bound given here is not of much use as a practical guarantee: it is loose enough that it provides little reason to believe that SARSA(0) and V(0) produce useful approximations to the state- and action-value functions. However, it is important for several reasons. First, it is the best result available for these two algorithms. Second, such a bound is often the first step towards proving stronger results. 
Finally, in practice it often happens that after some initial exploration period, only a few different policies are ever greedy; if this is the case, the strategy of this paper could be used to prove much tighter bounds.\n\nResults similar to the ones presented here were developed independently in [6].\n\n2 The algorithms\n\nThe SARSA(0) algorithm was first suggested in [7]. The V(0) algorithm was popularized by its use in the TD-Gammon backgammon playing program [8].²\n\nFix a Markov decision process M, with a finite set S of states, a finite set A of actions, a terminal state T, an initial distribution S₀ over S, a one-step reward function r : S × A → ℝ, and a transition function δ : S × A → S ∪ {T}. (M may also have a discount factor γ specifying how to trade future rewards against present ones. Here we fix γ = 1, but our results carry through to γ < 1.) Both the transition and reward functions may be stochastic, so long as successive samples are independent (the Markov property) and the reward has bounded expectation and variance. We assume that all states in S are reachable with positive probability.\n\nWe define a policy π to be a function mapping states to probability distributions over actions. Given a policy we can sample a trajectory (a sequence of states, actions, and one-step rewards) by the following rule: begin by selecting a state s₀ according to S₀. Now choose an action a₀ according to π(s₀). Now choose a one-step reward r₀ according to r(s₀, a₀). Finally choose a new state s₁ according to δ(s₀, a₀). If s₁ = T, stop; otherwise repeat. We assume that all policies are proper, that is, that the agent reaches T with probability 1 no matter what policy it follows. (This assumption is satisfied trivially if γ < 1.)\n\nThe reward for a trajectory is the sum of all of its one-step rewards. 
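The sampling rule above can be sketched in code. The following is a minimal illustration only: the two-state MDP, its rewards, and the `sample_trajectory` helper are all invented for the example, and simple callables stand in for the distributions S₀, π, r, and δ.

```python
import random

# Illustrative sketch of the trajectory-sampling rule. The tiny MDP below
# (states "s0", "s1", terminal "T") is hypothetical, not from the paper.
TERMINAL = "T"

def sample_trajectory(start_dist, policy, reward, transition, rng):
    """Sample (state, action, reward) triples until the terminal state."""
    trajectory = []
    s = rng.choices(list(start_dist), weights=list(start_dist.values()))[0]
    while s != TERMINAL:
        actions = policy(s)                     # distribution over actions
        a = rng.choices(list(actions), weights=list(actions.values()))[0]
        r = reward(s, a)                        # may be stochastic in general
        trajectory.append((s, a, r))
        s = transition(s, a)                    # next state, possibly T
    return trajectory

rng = random.Random(0)
start = {"s0": 1.0}                             # always start in s0
policy = lambda s: {"go": 0.9, "stay": 0.1}     # a proper policy: "go" exits
reward = lambda s, a: 1.0 if s == "s1" and a == "go" else 0.0

def transition(s, a):
    if a == "stay":
        return s
    return "s1" if s == "s0" else TERMINAL

traj = sample_trajectory(start, policy, reward, transition, rng)
total = sum(r for (_, _, r) in traj)            # the reward for the trajectory
```

Since "go" is eventually chosen from every state, the agent reaches T with probability 1, which is what the properness assumption requires.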
Our goal is to find an optimal policy, that is, a policy which on average generates trajectories with the highest possible reward. Define Q*(s, a) to be the best total expected reward that we can achieve by starting in state s, performing action a, and acting optimally afterwards. Define V*(s) = max_a Q*(s, a). Knowledge of either Q* or the combination of V*, δ, and r is enough to determine an optimal policy.\n\n²The proof given here does not cover the TD-Gammon program, since TD-Gammon uses a nonlinear function approximator to represent its value function. Interestingly, though, the proof extends easily to cover games such as backgammon in addition to MDPs. It also extends to cover SARSA(λ) and V(λ) for λ > 0.\n\nThe SARSA(0) algorithm maintains an approximation to Q*. We will write Q(s, a) for s ∈ S and a ∈ A to refer to this approximation. We will assume that Q is a full-rank linear function of some parameters w. For convenience of notation, we will write Q(T, a) = 0 for all a ∈ A, and tack an arbitrary action onto the end of all trajectories (which would otherwise end with the terminal state). After seeing a trajectory fragment s, a, r, s', a', the SARSA(0) algorithm updates\n\nQ(s, a) ← r + Q(s', a')\n\nThe notation Q(s, a) ← V means that the parameters w which represent Q(s, a) should be adjusted by gradient descent to reduce the error (Q(s, a) − V)²; that is, for some preselected learning rate α ≥ 0,\n\nw_new = w_old + α(V − Q(s, a)) ∂Q(s, a)/∂w\n\nFor convenience, we assume that α remains constant within a single trajectory. We also make the standard assumption that the sequence of learning rates is fixed before the start of learning and satisfies Σ_t α_t = ∞ and Σ_t α_t² < ∞.\n\nWe will consider only the trajectory-based version of SARSA(0). This version changes policies only between trajectories. At the beginning of each trajectory, it selects the ε-greedy policy for its current Q function. 
From state s, the ε-greedy policy chooses the action argmax_a Q(s, a) with probability 1 − ε, and otherwise selects uniformly at random among all actions. This rule ensures that, no matter the sequence of learned Q functions, each state-action pair will be visited infinitely often. (The use of ε-greedy policies is not essential. We just need to be able to find a region that contains all of the approximate value functions for every policy considered, and a bound on the convergence rate of TD(0).)\n\nWe can compare the SARSA(0) update rule to the one for Q-learning:\n\nQ(s, a) ← r + max_b Q(s', b)\n\nOften a' in the SARSA(0) update rule will be the same as the maximizing b in the Q-learning update rule; the difference only appears when the agent takes an exploring action, i.e., one which is not greedy for the current Q function.\n\nThe V(0) algorithm maintains an approximation to V*, which we will write V(s) for all s ∈ S. Again, we will assume V is a full-rank linear function of parameters w, and V(T) is held fixed at 0. After seeing a trajectory fragment s, a, r, s', V(0) sets\n\nV(s) ← r + V(s')\n\nThis update ignores a. Often a is chosen according to a greedy or ε-greedy policy for a recent V. However, for our analysis we only need to assume that we consider finitely many policies and that the policy remains fixed during each trajectory.\n\nWe leave open the question of whether updates to w happen immediately after each transition or only at the end of each trajectory. As pointed out in [9], this difference will not affect convergence: the updates within a single trajectory are O(α), so they cause a change in Q(s, a) or V(s) of O(α), which means subsequent updates are affected by at most O(α²). Since α is decaying to zero, the O(α²) terms can be neglected. (If we were to change policies during the trajectory, this argument would no longer hold, since small changes in Q or V can cause large changes in the policy.)
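The trajectory-based scheme can be sketched concretely. Everything in the snippet below is invented for illustration: the two-state MDP, the action names, and the helper functions. A lookup table plays the role of the linear approximator (the special case where the feature matrix is the identity), and the ε-greedy policy is computed from a frozen copy of Q so that, as the paper requires, the policy stays fixed within an episode even though Q changes.

```python
import random

# Sketch of trajectory-based SARSA(0) on a made-up two-state MDP.
# A table is the special case of a full-rank linear approximator (X = I).
ACTIONS = ["a", "b"]

def epsilon_greedy(Q, s, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda act: Q.get((s, act), 0.0))

def step(s, a):
    # From state 0 every action moves to state 1 with reward 0;
    # from state 1 every action reaches the terminal state with reward 1.
    return (1.0, None) if s == 1 else (0.0, 1)

def run_episode(Q, alpha, eps, rng):
    frozen = dict(Q)          # policy is eps-greedy for Q at episode start
    s = 0
    a = epsilon_greedy(frozen, s, eps, rng)
    while s is not None:
        r, s2 = step(s, a)
        # Tack an arbitrary action onto the terminal state, where Q(T, .) = 0.
        a2 = epsilon_greedy(frozen, s2, eps, rng) if s2 is not None else ACTIONS[0]
        target = r + (Q.get((s2, a2), 0.0) if s2 is not None else 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s2, a2

rng = random.Random(0)
Q = {}
for episode in range(500):
    run_episode(Q, alpha=0.1, eps=0.1, rng=rng)
```

After many episodes the greedy action's Q value at each state approaches the true total reward of 1 for this toy chain; a decaying learning-rate schedule (held constant here for brevity) would be needed to match the paper's assumptions exactly.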
\n\n3 The result\n\nOur result is that the weights w in either SARSA(0) or V(0) converge with probability 1 to a fixed region. The proof of the result is based on the following intuition: while SARSA(0) and V(0) might consider many different policies over time, on any given trajectory they always follow the TD(0) update rule for some policy. The TD(0) update is, under general conditions, a 2-norm contraction, and so would converge to its fixed point if it were applied repeatedly; what causes SARSA(0) and V(0) not to converge to a point is just that they consider different policies (and so take steps towards different fixed points) during different trajectories. Crucially, under general conditions, all of these fixed points are within some bounded region. So, we can view the SARSA(0) and V(0) update rules as contraction mappings plus a bounded amount of \"slop.\" With this observation, standard convergence theorems show that the weight vectors generated by SARSA(0) and V(0) cannot diverge.\n\nTheorem 1 For any Markov decision process M satisfying our assumptions, there is a bounded region R such that the SARSA(0) algorithm, when acting on M, produces a series of weight vectors which with probability 1 converges to R. Similarly, there is another bounded region R' such that the V(0) algorithm acting on M produces a series of weight vectors converging with probability 1 to R'.\n\nPROOF: Lemma 2, below, shows that both the SARSA(0) and V(0) updates can be written in the form\n\nw_{t+1} = w_t − α_t(A_t w_t − r_t + ε_t)\n\nwhere A_t is positive definite, α_t is the current learning rate, E(ε_t) = 0, Var(ε_t) ≤ K(1 + ‖w_t‖²), and A_t and r_t depend only on the currently greedy policy. (A_t and r_t represent, in a manner described in the lemma, the transition probabilities and one-step costs which result from following the current policy. 
Of course, w_t, A_t, and r_t will be different depending on whether we are following SARSA(0) or V(0).)\n\nSince A_t is positive definite, the SARSA(0) and V(0) updates are 2-norm contractions for small enough α_t. So, if we kept the policy fixed rather than changing it at the beginning of each trajectory, standard results such as Lemma 1 below would guarantee convergence. The intuition is that we can define a nonnegative potential function J(w) and show that, on average, the updates tend to decrease J(w) as long as α_t is small enough and J(w) starts out large enough compared to α_t.\n\nTo apply Lemma 1 under the assumption that we keep the policy constant rather than changing it every trajectory, write A_t = A and r_t = r for all t, and write w* = A⁻¹r. Let ρ be the smallest eigenvalue of the symmetric part of A (which must be real and positive since A is positive definite). Write s_t = Aw_t − r + ε_t for the update direction at step t. Then if we take J(w) = ‖w − w*‖²,\n\nE(∇J(w_t)ᵀ s_t | w_t) = 2(w_t − w*)ᵀ(Aw_t − r + E(ε_t))\n= 2(w_t − w*)ᵀ(Aw_t − Aw*)\n≥ 2ρ‖w_t − w*‖²\n= 2ρJ(w_t)\n\nso that −s_t is a descent direction in the sense required by the lemma. It is easy to check the lemma's variance condition. So, Lemma 1 shows that J(w_t) converges with probability 1 to 0, which means w_t must converge with probability 1 to w*.\n\nIf we pick an arbitrary vector u and define H(w) = max(0, ‖w − u‖ − C)² for a sufficiently large constant C, then the same argument reaches the weaker conclusion that w_t must converge with probability 1 to a sphere of radius C centered at u. To see why, note that −s_t is also a descent direction for H(w): inside the sphere, H = 0 and ∇H = 0, so the descent condition is satisfied trivially. 
Outside the sphere,\n\n∇H(w) = 2(w − u)(‖w − u‖ − C)/‖w − u‖ = d(w)(w − u)\n\n∇H(w_t)ᵀE(s_t | w_t) = d(w_t)(w_t − u)ᵀE(s_t | w_t)\n= d(w_t)(w_t − w* + w* − u)ᵀA(w_t − w*)\n≥ d(w_t)(ρ‖w_t − w*‖² − ‖w* − u‖ ‖A‖ ‖w_t − w*‖)\n\nThe positive term will be larger than the negative one if ‖w_t − w*‖ is large enough. So, if we choose C large enough, the descent condition will be satisfied. The variance condition is again easy to check. Lemma 3 shows that ∇H is Lipschitz. So, Lemma 1 shows that H(w_t) converges with probability 1 to 0, which means that w_t must converge with probability 1 to the sphere of radius C centered at u.\n\nBut now we are done: since there are finitely many policies that SARSA(0) or V(0) can consider, we can pick any u and then choose a C large enough that the above argument holds for all policies simultaneously. With this choice of C the update for any policy decreases H(w_t) on average as long as α_t is small enough, so the update for SARSA(0) or V(0) does too, and Lemma 1 applies. □\n\nThe following lemma is Corollary 1 of [10]. In the statement of the lemma, a Lipschitz continuous function F is one for which there exists a constant L so that ‖F(u) − F(w)‖ ≤ L‖u − w‖ for all u and w. The Lipschitz condition is essentially a uniform bound on the derivative of F.\n\nLemma 1 Let J be a differentiable function, bounded below by J*, and let ∇J be Lipschitz continuous. Suppose the sequence w_t satisfies\n\nw_{t+1} = w_t − α_t s_t\n\nfor random vectors s_t independent of w_{t+1}, w_{t+2}, .... Suppose −s_t is a descent direction for J in the sense that E(s_t | w_t)ᵀ∇J(w_t) > δ(ε) > 0 whenever J(w_t) > J* + ε. Suppose also that\n\nE(‖s_t‖² | w_t) ≤ K₁J(w_t) + K₂E(s_t | w_t)ᵀ∇J(w_t) + K₃\n\nand finally that the constants α_t satisfy\n\nα_t > 0,  Σ_t α_t = ∞,  Σ_t α_t² < ∞\n\nThen J(w_t) → J* with probability 1.\n\nMost of the work in proving the next lemma is already present in [1]. 
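The contraction-plus-slop picture behind Theorem 1 and Lemma 1 can be checked numerically. In the sketch below, everything is invented for illustration: two positive definite matrices play the role of two "policies", each with its own fixed point w_i* = A_i⁻¹r_i, and a fresh one is drawn at each step. With Robbins-Monro step sizes the iterates enter and stay in a bounded region around the two fixed points, even though they need not settle on either one.

```python
import numpy as np

# Two made-up "policies", each a positive definite A_i with its own fixed
# point; the noisy update is  w <- w - alpha_t (A_i w - r_i + noise).
rng = np.random.default_rng(0)
A = [np.array([[2.0, 0.0], [0.0, 1.0]]),
     np.array([[1.0, 0.3], [0.3, 2.0]])]
r = [np.array([2.0, 1.0]), np.array([-1.0, 3.0])]
fixed_points = [np.linalg.solve(Ai, ri) for Ai, ri in zip(A, r)]

w = np.array([50.0, -50.0])            # start far outside the region
norms = []
for t in range(1, 5001):
    i = int(rng.integers(2))           # a fresh "policy" each trajectory
    alpha = 1.0 / (t + 10)             # sum diverges, sum of squares is finite
    noise = rng.normal(scale=0.1, size=2)
    w = w - alpha * (A[i] @ w - r[i] + noise)
    norms.append(float(np.linalg.norm(w)))

# The iterates end up, and remain, in a modest ball containing both
# fixed points, illustrating convergence to a region rather than a point.
```

The offset in the step-size schedule just tames the first few steps; any schedule with Σα_t = ∞ and Σα_t² < ∞ behaves the same way asymptotically.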
The transformation from an MDP under a fixed policy to a Markov chain is standard.\n\nLemma 2 The update made by SARSA(0) or V(0) during a single trajectory can be written in the form\n\nw_new = w_old − α(A_π w_old − r_π + ε)\n\nwhere the constant matrix A_π and constant vector r_π depend on the currently greedy policy π, α is the current learning rate, and E(ε) = 0. Furthermore, A_π is positive definite, and there is a constant K such that Var(ε) ≤ K(1 + ‖w‖²).\n\nPROOF: Consider the following Markov process M_π: M_π has one state for each state-action pair in M. If M has a transition which goes from state s under action a with reward r to state s' with probability p, then M_π has a transition from state (s, a) with reward r to state (s', a') for every a'; the probability of this transition is pπ(a'|s'). We will represent the value function for M_π in the same way that we represented the Q function for M; in other words, the representation for V((s, a)) is the same as the representation for Q(s, a). With these definitions, it is easy to see that TD(0) acting on M_π produces exactly the same sequence of parameter changes as SARSA(0) acting on M under the fixed policy π. (And since π(a|s) > 0, every state of M_π will be visited infinitely often.)\n\nWrite T_π for the transition probability matrix of the above Markov process. That is, the entry of T_π in row (s, a) and column (s', a') will be equal to the probability of taking a step to (s', a') given that we start in (s, a). By definition, T_π is substochastic. That is, it has nonnegative entries, and its row sums are less than or equal to 1. Write s for the vector whose (s, a)th element is S₀(s)π(a|s), that is, the probability that we start in state s and take action a. Write d_π = (I − T_πᵀ)⁻¹s, where I is the identity matrix. 
As demonstrated in, e.g., [11], d_π is the vector of expected visitation frequencies under π; that is, the element of d_π corresponding to state s and action a is the expected number of times that the agent will visit state s and select action a during a single trajectory following policy π. Write D_π for the diagonal matrix with d_π on its diagonal. Write r for the vector of expected rewards; that is, the component of r corresponding to state s and action a is E(r(s, a)). Finally, write X for the Jacobian matrix ∂Q/∂w.\n\nWith this notation, Sutton [1] showed that the expected TD(0) update is\n\nE(w_new | w_old) = w_old − αXᵀD_π(I − T_π)Xw_old + αXᵀD_π r\n\n(Actually, he only considered the case where all rewards are zero except on transitions from nonterminal to terminal states, but his argument works equally well for the more general case where nonzero rewards are allowed everywhere.) So, we can take A_π = XᵀD_π(I − T_π)X and r_π = XᵀD_π r to make E(ε) = 0.\n\nFurthermore, Sutton showed that, as long as the agent reaches the terminal state with probability 1 (in other words, as long as π is proper) and as long as every state is visited with positive probability (which is true since all states are reachable and π has a nonzero probability of choosing every action), the matrix D_π(I − T_π) is strictly positive definite. Therefore, so is A_π.\n\nFinally, as can be seen from Sutton's equations on p. 25, there are two sources of variance in the update direction: variation in the number of times each transition is visited, and variation in the one-step rewards. The visitation frequencies and the one-step rewards both have bounded variance, and are independent of one another. 
\nThey enter into the overall update in two ways: there is one set of terms which is bilinear in the one-step rewards and the visitation frequencies, and there is another set of terms which is bilinear in the visitation frequencies and the weights w. The former set of terms has constant variance. Because the policy is fixed, w is independent of the visitation frequencies, and so the latter set of terms has variance proportional to ‖w‖². So, there is a constant K such that the total variance in ε can be bounded by K(1 + ‖w‖²).\n\nA similar but simpler argument applies to V(0). In this case we define M_π to have the same states as M, and to have the transition matrix T_π whose element (s, s') is the probability of landing in s' in M on step t + 1, given that we start in s at step t and follow π. Write s for the vector of starting probabilities, that is, s_x = S₀(x). Now define X = ∂V/∂w and d_π = (I − T_πᵀ)⁻¹s. Since we have assumed that all policies are proper and that every policy considered has a positive probability of reaching any state, the update matrix A_π = XᵀD_π(I − T_π)X is strictly positive definite. □\n\nLemma 3 The gradient of the function H(w) = max(0, ‖w‖ − 1)² is Lipschitz continuous.\n\nPROOF: Inside the unit sphere, H and all of its derivatives are uniformly zero. Outside, we have\n\n∇H = d(w)w\n\nwhere d(w) = 2(‖w‖ − 1)/‖w‖, and\n\n∇²H = d(w)I + ∇d(w)wᵀ\n= d(w)I + (2/‖w‖³)wwᵀ\n= d(w)I + (2 − d(w))wwᵀ/‖w‖²\n\nThe norm of the first term is d(w), the norm of the second is 2 − d(w), and since one of the terms is a multiple of I the norms add. So, the norm of ∇²H is 0 inside the unit sphere and at most 2 outside. At the boundary of the unit sphere, ∇H is continuous, and its directional derivatives from every direction are bounded by the argument above. So, ∇H is Lipschitz continuous. 
\n□\n\nAcknowledgements\n\nThanks to Andrew Moore and to the anonymous reviewers for helpful comments. This work was supported in part by DARPA contract number F30602-97-1-0215, and in part by NSF KDI award number DMS-9873442. The opinions and conclusions are the author's and do not reflect those of the US government or its agencies.\n\nReferences\n\n[1] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.\n\n[2] Geoffrey J. Gordon. Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, Carnegie Mellon University, 1995.\n\n[3] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, San Francisco, CA, 1995. Morgan Kaufmann.\n\n[4] Geoffrey J. Gordon. Chattering in SARSA(λ). Internal report, CMU Learning Lab, 1996. Available from www.cs.cmu.edu/~ggordon.\n\n[5] R. S. Sutton. Open theoretical questions in reinforcement learning. In P. Fischer and H. U. Simon, editors, Computational Learning Theory (Proceedings of EuroCOLT'99), pages 11-17, 1999.\n\n[6] D. P. de Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 2000.\n\n[7] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems. Technical Report 166, Cambridge University Engineering Department, 1994.\n\n[8] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215-219, 1994.\n\n[9] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185-1201, 1994.\n\n[10] B. T. Polyak and Ya. Z. Tsypkin. 
Pseudogradient adaptation and training algorithms. Automation and Remote Control, 34(3):377-397, 1973. Translated from Avtomatika i Telemekhanika.\n\n[11] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Van Nostrand-Reinhold, New York, 1960.\n", "award": [], "sourceid": 1911, "authors": [{"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}