{"title": "Stable Fitted Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1052, "page_last": 1058, "abstract": null, "full_text": "Stable Fitted Reinforcement Learning \n\nComputer Science Department \n\nCarnegie Mellon University \n\nGeoffrey J. Gordon \n\nPittsburgh PA 15213 \nggordon@cs.cmu.edu \n\nAbstract \n\nWe describe the reinforcement learning problem, motivate algo(cid:173)\nrithms which seek an approximation to the Q function, and present \nnew convergence results for two such algorithms. \n\n1 \n\nINTRODUCTION AND BACKGROUND \n\nImagine an agent acting in some environment. At time t, the environment is in some \nstate Xt chosen from a finite set of states. The agent perceives Xt, and is allowed to \nchoose an action at from some finite set of actions. The environment then changes \nstate, so that at time (t + 1) it is in a new state Xt+1 chosen from a probability \ndistribution which depends only on Xt and at. Meanwhile, the agent experiences a \nreal-valued cost Ct, chosen from a distribution which also depends only on Xt and \nat and which has finite mean and variance. \n\nSuch an environment is called a Markov decision process, or MDP. The reinforce(cid:173)\nment learning problem is to control an MDP to minimize the expected discounted \ncost Lt ,tCt for some discount factor, E [0,1]. Define the function Q so that \nQ(x, a) is the cost for being in state x at time 0, choosing action a, and behaving \noptimally from then on. If we can discover Q, we have solved the problem: at each \nstep, we may simply choose at to minimize Q(xt, at). For more information about \nMDPs, see (Watkins, 1989, Bertsekas and Tsitsiklis, 1989). \nWe may distinguish two classes of problems, online and offline. In the offline prob(cid:173)\nlem, we have a full model of the MDP: given a state and an action, we can describe \nthe distributions of the cost and the next state. 
We will be concerned with the online problem, in which our knowledge of the MDP is limited to what we can discover by interacting with it. To solve an online problem, we may approximate the transition and cost functions, then proceed as for an offline problem (the indirect approach); or we may try to learn the Q function without the intermediate step (the direct approach). Either approach may work better for any given problem: the direct approach may not extract as much information from each observation, but the indirect approach may introduce additional errors with its extra approximation step. We will be concerned here only with direct algorithms. \n\nWatkins' (1989) Q-learning algorithm can find the Q function for small MDPs, either online or offline. Convergence with probability 1 in the online case was proven in (Jaakkola et al., 1994, Tsitsiklis, 1994). For large MDPs, exact Q-learning is too expensive: representing the Q function requires too much space. To overcome this difficulty, we may look for an inexpensive approximation to the Q function. In the offline case, several algorithms for this purpose have been proven to converge (Gordon, 1995a, Tsitsiklis and Van Roy, 1994, Baird, 1995). For the online case, there are many fewer provably convergent algorithms. As Baird (1995) points out, we cannot even rely on gradient descent for large, stochastic problems, since we must observe two independent transitions from a given state before we can compute an unbiased estimate of the gradient. One of the algorithms in (Tsitsiklis and Van Roy, 1994), which uses state aggregation to approximate the Q function, can be modified to apply to online problems; the resulting algorithm, unlike Q-learning, must make repeated small updates to its control policy, interleaved with comparatively lengthy periods of evaluation of the changes.
After submitting this paper, we were advised of the paper (Singh et al., 1995), which contains a different algorithm for solving online MDPs. In addition, our newer paper (Gordon, 1995b) proves results for a larger class of approximators. \n\nThere are several algorithms which can handle restricted versions of the online problem. In the case of a Markov chain (an MDP where only one action is available at any time step), Sutton's TD(lambda) has been proven to converge for arbitrary linear approximators (Sutton, 1988, Dayan, 1992). For decision processes with linear transition functions and quadratic cost functions (the so-called linear quadratic regulation problem), the algorithm of (Bradtke, 1993) is guaranteed to converge. In practice, researchers have had mixed success with approximate reinforcement learning (Tesauro, 1990, Boyan and Moore, 1995, Singh and Sutton, 1996). \n\nThe remainder of the paper is divided into four sections. In section 2, we summarize convergence results for offline Q-learning, and prove some contraction properties which will be useful later. Section 3 extends the convergence results to online algorithms based on TD(0) and simple function approximators. Section 4 treats nondiscounted problems, and section 5 wraps up. \n\n2 OFFLINE DISCOUNTED PROBLEMS \n\nStandard offline Q-learning begins with an MDP M and an initial Q function q^(0). Its goal is to learn q^(n), a good approximation to the optimal Q function for M. To accomplish this goal, it performs the series of updates q^(i+1) = T_M(q^(i)), where the component of T_M(q^(i)) corresponding to state x and action a is defined to be \n\n[T_M(q^(i))]_xa = c_xa + gamma * sum_y p_xay * min_b [q^(i)]_yb \n\nHere c_xa is the expected cost of performing action a in state x; p_xay is the probability that action a from state x will lead to state y; and gamma is the discount factor.
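As a concrete sketch of this backup operator on a tabular MDP (the array names, toy transition matrix, and costs below are our own illustrative assumptions, not from the paper):

```python
import numpy as np

def q_backup(q, P, c, gamma):
    """One offline Q-learning update: [T_M(q)]_xa = c_xa + gamma * sum_y P_xay * min_b q_yb.

    q: (S, A) array of Q values; P: (S, A, S) transition probabilities;
    c: (S, A) expected costs; gamma: discount factor in [0, 1).
    """
    v = q.min(axis=1)                         # backed-up successor values, v_y = min_b q_yb
    return c + gamma * np.einsum("xay,y->xa", P, v)

# Tiny 2-state, 2-action MDP (hypothetical numbers, for illustration only).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
c = np.array([[1.0, 2.0], [0.0, 1.0]])
q = np.zeros((2, 2))
for _ in range(200):                          # iterate toward the fixed point q*
    q = q_backup(q, P, c, 0.9)
```

Because T_M is a max norm contraction with factor gamma (shown next in the text), iterating `q_backup` from any starting point converges to the same q*.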
\n\nOffline Q-learning converges for discounted MDPs because T_M is a contraction in max norm. That is, for all vectors q and r, \n\n|| T_M(q) - T_M(r) || <= gamma || q - r || \n\nwhere || q || = max_{x,a} | q_xa |. Therefore, by the contraction mapping theorem, T_M has a unique fixed point q*, and the sequence q^(i) converges linearly to q*. \n\nIt is worth noting that a weighted version of offline Q-learning is also guaranteed to converge. Consider the iteration \n\nq^(i+1) = (I + alpha D (T_M - I))(q^(i)) \n\nwhere alpha is a positive learning rate and D is an arbitrary fixed nonsingular diagonal matrix of weights. In this iteration, we update some Q values more rapidly than others, as might occur if, for instance, we visited some states more frequently than others. (We will come back to this possibility later.) This weighted iteration is a max norm contraction for sufficiently small alpha: take two Q functions q and r with || q - r || = l. Suppose alpha is small enough that the largest element of alpha D is B < 1, and let b > 0 be the smallest diagonal element of alpha D. Consider any state x and action a, and write d_xa for the corresponding element of alpha D. We then have \n\n[(I - alpha D)q - (I - alpha D)r]_xa <= (1 - d_xa) l \n[T_M q - T_M r]_xa <= gamma l \n[alpha D T_M q - alpha D T_M r]_xa <= d_xa gamma l \n[(I - alpha D + alpha D T_M)q - (I - alpha D + alpha D T_M)r]_xa <= (1 - d_xa) l + d_xa gamma l <= (1 - b(1 - gamma)) l \n\nso (I - alpha D + alpha D T_M) is a max norm contraction with factor (1 - b(1 - gamma)). The fixed point of weighted Q-learning is the same as the fixed point of unweighted Q-learning: T_M(q*) = q* is equivalent to alpha D(T_M - I)q* = 0. \n\nThe difficulty with standard (weighted or unweighted) Q-learning is that, for MDPs with many states, it may be completely infeasible to compute T_M(q) for even one value of q.
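A quick numerical sketch of the weighted iteration (again on a hypothetical toy MDP of our own; the paper does not give this example): uneven per-entry weights change the update rates but not the fixed point.

```python
import numpy as np

def backup(q, P, c, gamma):
    # [T_M(q)]_xa = c_xa + gamma * sum_y P_xay * min_b q_yb
    return c + gamma * np.einsum("xay,y->xa", P, q.min(axis=1))

def weighted_step(q, P, c, gamma, alpha, D):
    # q <- q + alpha * D * (T_M(q) - q), with D a per-(state, action) weight.
    return q + alpha * D * (backup(q, P, c, gamma) - q)

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
c = np.array([[1.0, 2.0], [0.0, 1.0]])
gamma = 0.9
D = np.array([[1.0, 0.25], [0.5, 1.0]])   # uneven update frequencies

q_plain = np.zeros((2, 2))
q_weighted = np.zeros((2, 2))
for _ in range(2000):
    q_plain = backup(q_plain, P, c, gamma)
    q_weighted = weighted_step(q_weighted, P, c, gamma, 0.5, D)
```

Both iterations approach the same q*, consistent with the observation that alpha D(T_M - I)q* = 0 exactly when T_M(q*) = q*.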
One way to avoid this difficulty is fitted Q-learning: if we can find some function M_A so that M_A composed with T_M is much cheaper to compute than T_M, we can perform the fitted iteration q^(i+1) = M_A(T_M(q^(i))) instead of the standard offline Q-learning iteration. The mapping M_A implements a function approximation scheme (see (Gordon, 1995a)); we assume that q^(0) can be represented as M_A(q) for some q. The fitted offline Q-learning iteration is guaranteed to converge to a unique fixed point if M_A is a nonexpansion in max norm, and to have bounded error if M_A(q*) is near q* (Gordon, 1995a). \n\nFinally, we can define a fitted weighted Q-learning iteration: \n\nq^(i+1) = (I + alpha M_A D(T_M - I))(q^(i)) \n\nIf M_A is a max norm nonexpansion and M_A^2 = M_A (these conditions are satisfied, for example, by state aggregation), then fitted weighted Q-learning is guaranteed to converge: \n\n(I + alpha M_A D(T_M - I))q = ((I - M_A) + M_A(I + alpha D(T_M - I)))q = M_A(I + alpha D(T_M - I))q \n\nsince M_A q = q for q in the range of M_A. (Note that q^(i+1) is guaranteed to be in the range of M_A if q^(i) is.) The last expression is the composition of a max norm nonexpansion with a max norm contraction, and so is a max norm contraction. \n\nThe fixed point of fitted weighted Q-learning is not necessarily the same as the fixed point of fitted Q-learning, unless M_A can represent q* exactly. However, if M_A is linear, we have that \n\n(I + alpha M_A D(T_M - I))(q + c) = c + M_A(I + alpha D(T_M - I))(q + c) \n\nfor any q in the range of M_A and c perpendicular to the range of M_A.
In particular, if we take c so that q* - c is in the range of M_A, and let q = M_A q be a fixed point of the weighted fitted iteration, then we have \n\n|| q* - q || = || (I + alpha M_A D(T_M - I))q* - (I + alpha M_A D(T_M - I))q || \n= || c + M_A(I + alpha D(T_M - I))q* - M_A(I + alpha D(T_M - I))q || \n<= || c || + (1 - b(1 - gamma)) || q* - q || \n\nso that \n\n|| q* - q || <= || c || / (b(1 - gamma)) \n\nThat is, if M_A is linear in addition to the conditions for convergence, we can bound the error for fitted weighted Q-learning. \n\nFor offline problems, the weighted version of fitted Q-learning is not as useful as the unweighted version: it involves about the same amount of work per iteration, the contraction factor may not be as good, the error bound may not be as tight, and it requires M_A^2 = M_A in addition to the conditions for convergence of the unweighted iteration. On the other hand, as we shall see in the next section, the weighted algorithm can be applied to online problems. \n\n3 ONLINE DISCOUNTED PROBLEMS \n\nConsider the following algorithm, which is a natural generalization of TD(0) (Sutton, 1988) to Markov decision problems. (This algorithm has been called \"sarsa\" (Singh and Sutton, 1996).) Start with some initial Q function q^(0). Repeat the following steps for i from 0 onwards. Let pi^(i) be a policy chosen according to some predetermined tradeoff between exploration and exploitation for the Q function q^(i). Now, put the agent in M's start state and allow it to follow the policy pi^(i) for a random number of steps L^(i).
If at step t of the resulting trajectory the agent moves from the state x_t under action a_t with cost c_t to a state y_t for which the action b_t appears optimal, compute the estimated Bellman error \n\ne_t = c_t + gamma [q^(i)]_{y_t b_t} - [q^(i)]_{x_t a_t} \n\nAfter observing the entire trajectory, define e^(i) to be the vector whose xa-th component is the sum of e_t over all t such that x_t = x and a_t = a. Then compute the next weight vector according to the TD(0)-like update rule with learning rate alpha^(i): \n\nq^(i+1) = q^(i) + alpha^(i) M_A e^(i) \n\nSee (Gordon, 1995b) for a comment on the types of mappings M_A which are appropriate for online algorithms. \n\nWe will assume that L^(i) has the same distribution for all i and is independent of all other events related to the i-th and subsequent trajectories, and that E(L^(i)) is bounded. Define d_xa^(i) to be the expected number of times the agent visits state x and chooses action a during the i-th trajectory, given pi^(i). We will assume that the policies are such that d_xa^(i) > epsilon for some positive epsilon and for all i, x, and a. Let D^(i) be the diagonal matrix with elements d_xa^(i). With this notation, we can write the expected update for the sarsa algorithm in matrix form: \n\nE(q^(i+1) | q^(i)) = (I + alpha^(i) M_A D^(i)(T_M - I)) q^(i) \n\nWith the exception of the fact that D^(i) changes from iteration to iteration, this equation looks very similar to the offline weighted fitted Q-learning update. However, the sarsa algorithm is not guaranteed to converge even in the benign case \n\nFigure 1: A counterexample to sarsa. (a) An MDP: from the start state, the agent may choose the upper or the lower path, but from then on its decisions are forced. Next to each arc is its expected cost; the actual costs are randomized on each step.
Boxed pairs of arcs are aggregated, so that the agent must learn identical Q values for arcs in the same box. We used a discount gamma = .9 and a learning rate alpha = .1. To ensure sufficient exploration, the agent chose an apparently suboptimal action 10% of the time. (Any other parameters would have resulted in similar behavior. In particular, annealing alpha to zero wouldn't have helped.) (b) The learned Q value for the right-hand box during the first 2000 steps. \n\nwhere the Q function is approximated by state aggregation: when we apply sarsa to the MDP in figure 1, one of the learned Q values oscillates forever. This problem happens because the frequency-of-update matrix D^(i) can change discontinuously when the Q function fluctuates slightly: when, by luck, the upper path through the MDP appears better, the cost-1 arc into the goal will be followed more often and the learned Q value will decrease, while when the lower path appears better the cost-2 arc will be weighted more heavily and the Q value will increase. Since the two arcs out of the initial state always have the same expected backed-up Q value (because the states they lead to are constrained to have the same value), each path will appear better infinitely often and the oscillation will continue forever. \n\nOn the other hand, if we can represent the optimal Q function q*, then no matter what D^(i) is, the expected sarsa update has its fixed point at q*. Since the smallest diagonal element of D^(i) is bounded away from zero and the largest is bounded above, we can choose an alpha and a gamma' < 1 so that (I + alpha M_A D^(i)(T_M - I)) is a contraction with fixed point q* and factor gamma' for all i. Now if we let the learning rates satisfy sum_i alpha^(i) = infinity and sum_i (alpha^(i))^2 < infinity, convergence w.p.1 to q* is guaranteed by a theorem of (Jaakkola et al., 1994). (See also the theorem in (Tsitsiklis, 1994).)
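A minimal sketch of the sarsa-style update with a state-aggregation approximator: each (state, action) pair maps to a cluster, and all pairs in a cluster share one learned Q value, so the update q <- q + alpha M_A e becomes a per-cluster TD(0)-like step. The tiny MDP below is a hypothetical example of our own, not the counterexample of figure 1.

```python
import random
import numpy as np

random.seed(0)

n_clusters = 3
# Aggregation: the two actions of state 1 share a single learned value.
cluster_of = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 2}
w = np.zeros(n_clusters)                 # one shared Q value per cluster

def q(x, a):
    return w[cluster_of[(x, a)]]

def step(x, a):
    # Hypothetical dynamics: action 0 stays, action 1 flips the state.
    y = x if a == 0 else 1 - x
    cost = 1.0 if x == 0 else 0.5        # illustrative costs
    return y, cost

gamma, alpha = 0.9, 0.1
x = 0
for t in range(5000):
    # Epsilon-greedy exploration, as in the paper's experiment.
    a = random.randrange(2) if random.random() < 0.1 else \
        min(range(2), key=lambda b: q(x, b))
    y, cost = step(x, a)
    b = min(range(2), key=lambda bb: q(y, bb))   # apparently optimal next action
    e = cost + gamma * q(y, b) - q(x, a)         # estimated Bellman error e_t
    w[cluster_of[(x, a)]] += alpha * e           # aggregated TD(0)-like update
    x = y
```

On this benign MDP the shared values settle into a bounded region; the point of figure 1 is that a carefully chosen MDP makes one such value oscillate forever.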
More generally, if M_A is linear and can represent q* - c for some vector c, we can bound the error between q* and the fixed point of the expected sarsa update on iteration i: if we choose an alpha and a gamma' < 1 as in the previous paragraph, \n\n|| E(q^(i+1) | q^(i)) - q* || <= gamma' || q^(i) - q* || + 2 || c || \n\nfor all i. A minor modification of the theorem of (Jaakkola et al., 1994) shows that the distance from q^(i) to the region \n\n{ q : || q - q* || <= 2 || c || / (1 - gamma') } \n\nconverges w.p.1 to zero. That is, while the sequence q^(i) may not converge, the worst it will do is oscillate in a region around q* whose size is determined by how accurately we can represent q* and how frequently we visit the least frequent (state, action) pair. \n\nFinally, if we follow a fixed exploration policy on every trajectory, the matrix D^(i) will be the same for every i; in this case, because of the contraction property proved in the previous section, convergence w.p.1 for appropriate learning rates is guaranteed again by the theorem of (Jaakkola et al., 1994). \n\n4 NONDISCOUNTED PROBLEMS \n\nWhen M is not discounted, the Q-learning backup operator T_M is no longer a max norm contraction. Instead, as long as every policy guarantees absorption w.p.1 into some set of cost-free terminal states, T_M is a contraction in some weighted max norm. The proofs of the previous sections still go through, if we substitute this weighted max norm for the unweighted one in every case. In addition, the random variables L^(i) which determine when each trial ends may be set to the first step t such that x_t is terminal, since this and all subsequent steps will have Bellman errors of zero. This choice of L^(i) is not independent of the i-th trial, but it does have a finite mean and it does result in a constant D^(i).
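The weighted max norm is not defined in the text; the standard definition (see Bertsekas and Tsitsiklis, 1989) assigns a positive weight to each component:

```latex
\|q\|_w \;=\; \max_{x,a} \frac{|q_{xa}|}{w_{xa}}, \qquad w_{xa} > 0 .
```

Under the absorption assumption above there exist weights w and a factor beta < 1 with || T_M(q) - T_M(r) ||_w <= beta || q - r ||_w, so the earlier contraction arguments carry over with this norm in place of the unweighted max norm.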
\n\n5 DISCUSSION \n\nWe have proven new convergence theorems for two online fitted reinforcement learning algorithms based on Watkins' (1989) Q-learning algorithm. These algorithms, sarsa and sarsa with a fixed exploration policy, allow the use of function approximators whose mappings M_A are max norm nonexpansions and satisfy M_A^2 = M_A. The prototypical example of such a function approximator is state aggregation. For similar results on a larger class of approximators, see (Gordon, 1995b). \n\nAcknowledgements \n\nThis material is based on work supported under a National Science Foundation Graduate Research Fellowship and by ARPA grant number F33615-93-1-1330. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the National Science Foundation, ARPA, or the United States government. \n\nReferences \n\nL. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning (proceedings of the twelfth international conference), San Francisco, CA, 1995. Morgan Kaufmann. \n\nD. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989. \n\nJ. A. Boyan and A. W. Moore. Generalization in reinforcement learning: safely approximating the value function. In G. Tesauro and D. Touretzky, editors, Advances in Neural Information Processing Systems, volume 7. Morgan Kaufmann, 1995. \n\nS. J. Bradtke. Reinforcement learning applied to linear quadratic regulation. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann, 1993. \n\nP. Dayan. The convergence of TD(lambda) for general lambda. Machine Learning, 8(3-4):341-362, 1992. \n\nG. J. Gordon. Stable function approximation in dynamic programming.
In Machine Learning (proceedings of the twelfth international conference), San Francisco, CA, 1995. Morgan Kaufmann. \n\nG. J. Gordon. Online fitted reinforcement learning. In J. A. Boyan, A. W. Moore, and R. S. Sutton, editors, Proceedings of the Workshop on Value Function Approximation, 1995. Proceedings are available as tech report CMU-CS-95-206. \n\nT. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185-1201, 1994. \n\nS. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In G. Tesauro and D. Touretzky, editors, Advances in Neural Information Processing Systems, volume 7. Morgan Kaufmann, 1995. \n\nS. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 1996. \n\nR. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988. \n\nG. Tesauro. Neurogammon: a neural network backgammon program. In IJCNN Proceedings III, pages 33-39, 1990. \n\nJ. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Technical Report P-2277, Laboratory for Information and Decision Systems, 1994. \n\nJ. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202, 1994. \n\nC. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England, 1989. \n", "award": [], "sourceid": 1133, "authors": [{"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}