{"title": "Analysis of Temporal-Diffference Learning with Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1075, "page_last": 1081, "abstract": "", "full_text": "Analysis of Temporal-Difference Learning \n\nwith Function Approximation \n\nJohn N. Tsitsiklis and Benjamin Van Roy \nLaboratory for Information and Decision Systems \n\nMassachusetts Institute of Technology \n\nCambridge, MA 02139 \n\ne-mail: jnt@mit.edu, bvr@mit.edu \n\nAbstract \n\nWe present new results about the temporal-difference learning al(cid:173)\ngorithm, as applied to approximating the cost-to-go function of \na Markov chain using linear function approximators. The algo(cid:173)\nrithm we analyze performs on-line updating of a parameter vector \nduring a single endless trajectory of an aperiodic irreducible finite \nstate Markov chain. Results include convergence (with probability \n1), a characterization of the limit of convergence, and a bound on \nthe resulting approximation error. In addition to establishing new \nand stronger results than those previously available, our analysis \nis based on a new line of reasoning that provides new intuition \nabout the dynamics of temporal-difference learning. Furthermore, \nwe discuss the implications of two counter-examples with regards \nto the Significance of on-line updating and linearly parameterized \nfunction approximators. \n\n1 \n\nINTRODUCTION \n\nThe problem of predicting the expected long-term future cost (or reward) of a \nstochastic dynamic system manifests itself in both time-series prediction and con(cid:173)\ntrol. An example in time-series prediction is that of estimating the net present \nvalue of a corporation, as a discounted sum of its future cash flows, based on the \ncurrent state of its operations. In control, the ability to predict long-term future \ncost as a function of state enables the ranking of alternative states in order to guide \ndecision-making. 
Indeed, such predictions constitute the cost-to-go function that is central to dynamic programming and optimal control (Bertsekas, 1995). \n\nTemporal-difference learning, originally proposed by Sutton (1988), is a method for approximating long-term future cost as a function of current state. The algorithm is recursive, efficient, and simple to implement. Linear combinations of fixed basis functions are used to approximate the mapping from state to future cost. The weights of the linear combination are updated upon each observation of a state transition and the associated cost. The objective is to improve approximations of long-term future cost as more and more state transitions are observed. The trajectory of states and costs can be generated either by a physical system or a simulated model. In either case, we view the system as a Markov chain. Adopting terminology from dynamic programming, we will refer to the function mapping states of the Markov chain to expected long-term cost as the cost-to-go function. \n\nIn this paper, we introduce a new line of analysis for temporal-difference learning. In addition to providing new intuition about the dynamics of the algorithm, this approach leads to a stronger convergence result than previously available, as well as an interpretation of the limit of convergence and bounds on the resulting approximation error, neither of which has been available in the past. Aside from the statement of results, we maintain the discussion at an informal level, and make no attempt to present a complete or rigorous proof. The formal and more general analysis based on our line of reasoning can be found in (Tsitsiklis and Van Roy, 1996), which also discusses the relationship between our results and other work involving temporal-difference learning. 
\n\nThe convergence results assume the use of both on-line updating and linearly parameterized function approximators. To clarify the relevance of these requirements, we discuss the implications of two counter-examples that are presented in (Tsitsiklis and Van Roy, 1996). These counter-examples demonstrate that temporal-difference learning can diverge in the presence of either nonlinearly parameterized function approximators or arbitrary (instead of on-line) sampling distributions. \n\n2 DEFINITION OF TD(λ) \n\nIn this section, we define precisely the nature of temporal-difference learning, as applied to approximation of the cost-to-go function for an infinite-horizon discounted Markov chain. While the method as well as our subsequent results are applicable to Markov chains with fairly general state spaces, including continuous and unbounded spaces, we restrict our attention in this paper to the case where the state space is finite. Discounted Markov chains with more general state spaces are addressed in (Tsitsiklis and Van Roy, 1996). Application of this line of analysis to the context of undiscounted absorbing Markov chains can be found in (Bertsekas and Tsitsiklis, 1996) and has also been carried out by Gurvits (personal communication). \n\nWe consider an aperiodic irreducible Markov chain with a state space S = {1, ..., n}, a transition probability matrix P whose (i, j)th entry is denoted by p_ij, transition costs g(i, j) associated with each transition from a state i to a state j, and a discount factor α ∈ (0, 1). The sequence of states visited by the Markov chain is denoted by {i_t | t = 0, 1, ...}. The cost-to-go function J* : S → R associated with this Markov chain is defined by \n\nJ*(i) = E[ Σ_{t=0}^∞ α^t g(i_t, i_{t+1}) | i_0 = i ]. \n\nSince the state space is finite, it is convenient to view J* as a vector instead of a function. 
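For a finite chain, the defining expectation can be evaluated exactly: conditioning on the first transition gives the linear system J* = ḡ + αPJ*, where ḡ(i) = Σ_j p_ij g(i, j) is the expected one-step cost. The following minimal sketch (with an invented three-state chain; the numbers are not from the paper) computes J* this way:

```python
import numpy as np

# Invented three-state chain, for illustration only.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])        # transition probabilities p_ij
g = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0]])        # transition costs g(i, j)
alpha = 0.9                            # discount factor

# Expected one-step cost from each state: g_bar(i) = sum_j p_ij g(i, j).
g_bar = (P * g).sum(axis=1)

# J* solves (I - alpha P) J* = g_bar.
J_star = np.linalg.solve(np.eye(3) - alpha * P, g_bar)
```

This exact solve is only possible because the toy chain is tiny and fully known; the point of TD(λ) is precisely to approximate J* when n is large and only sampled trajectories are available.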
\nWe consider approximations of J* using a function of the form \n\nJ(i, r) = (Φr)(i). \n\nHere, r = (r(1), ..., r(K)) is a parameter vector and Φ is an n × K matrix. We denote the ith row of Φ by the (column) vector φ(i), so that J(i, r) = r'φ(i); the kth column of Φ can be viewed as a basis function over the state space. Given a trajectory of the Markov chain, the TD(λ) algorithm, for λ ∈ [0, 1], updates the parameter vector according to \n\nr_{t+1} = r_t + γ_t (g(i_t, i_{t+1}) + α J(i_{t+1}, r_t) − J(i_t, r_t)) z_t, \n\nwhere γ_t is a step size and the eligibility vector z_t is given by z_t = Σ_{k=0}^t (αλ)^{t−k} φ(i_k). \n\nWe assume that the Markov chain has a unique invariant distribution π, with π(i) > 0 for all i ∈ S. We define an n × n diagonal matrix D with diagonal entries π(1), ..., π(n). We define a weighted norm ‖·‖_D by \n\n‖J‖_D = ( Σ_{i∈S} π(i) J²(i) )^{1/2}. \n\nWe define a \"projection matrix\" Π by \n\nΠJ = arg min_{J̄ = Φr} ‖J − J̄‖_D. \n\nIt is easy to show that Π = Φ(Φ'DΦ)^{-1}Φ'D. \n\nWe define an operator T^(λ) : R^n → R^n, indexed by a parameter λ ∈ [0, 1), by \n\n(T^(λ)J)(i) = (1 − λ) Σ_{m=0}^∞ λ^m E[ Σ_{t=0}^m α^t g(i_t, i_{t+1}) + α^{m+1} J(i_{m+1}) | i_0 = i ]. \n\nFor λ = 1 we define (T^(1)J)(i) = J*(i), so that lim_{λ↑1} (T^(λ)J)(i) = (T^(1)J)(i). To interpret this operator in a meaningful manner, note that, for each m, the term \n\nE[ Σ_{t=0}^m α^t g(i_t, i_{t+1}) + α^{m+1} J(i_{m+1}) | i_0 = i ] \n\nis the expected cost to be incurred over m transitions plus an approximation to the remaining cost to be incurred, based on J. This sum is sometimes called the \"m-stage truncated cost-to-go.\" Intuitively, if J is an approximation to the cost-to-go function, the m-stage truncated cost-to-go can be viewed as an improved approximation. Since T^(λ)J is a weighted average over the m-stage truncated cost-to-go values, T^(λ)J can also be viewed as an improved approximation to J*. A property of T^(λ) that is instrumental in our proof of convergence is that T^(λ) is a contraction with respect to the norm ‖·‖_D. It follows from this fact that the composition ΠT^(λ) is also a contraction with respect to the same norm, and has a fixed point of the form Φr* for some parameter vector r*. 
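These objects are straightforward to form numerically. The sketch below (chain and basis functions invented for illustration) builds D from the steady-state distribution, forms Π = Φ(Φ'DΦ)^{-1}Φ'D, and checks that Π is an idempotent nonexpansion in ‖·‖_D; nonexpansiveness is exactly the property that lets the contraction of T^(λ) carry over to the composition ΠT^(λ):

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])        # invented transition matrix

# Steady-state distribution pi: the left eigenvector of P for eigenvalue 1,
# normalized to sum to one.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
D = np.diag(pi)

Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])           # two invented basis functions (full column rank)

# Projection onto the span of the basis functions, orthogonal w.r.t. ||.||_D.
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

def norm_D(J):
    return np.sqrt(J @ D @ J)

J = np.array([1.0, -2.0, 3.0])                 # arbitrary test vector
assert np.allclose(Pi @ Pi, Pi)                # Pi is a projection
assert norm_D(Pi @ J) <= norm_D(J) + 1e-12     # nonexpansive in ||.||_D
```

Because Π never increases ‖·‖_D, applying it after the contraction T^(λ) cannot undo the shrinkage, which is why ΠT^(λ) inherits the contraction property.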
\nTo clarify the fundamental structure of TD(λ), we construct a process X_t = (i_t, i_{t+1}, z_t). It is easy to see that X_t is a Markov process. In particular, z_{t+1} and i_{t+1} are deterministic functions of X_t, and the distribution of i_{t+2} only depends on i_{t+1}. Note that at each time t, the random vector X_t, together with the current parameter vector r_t, provides all necessary information for computing r_{t+1}. By defining a function s with s(r, X) = (g(i, j) + αJ(j, r) − J(i, r))z, where X = (i, j, z), we can rewrite the TD(λ) algorithm as \n\nr_{t+1} = r_t + γ_t s(r_t, X_t). \n\nFor any r, s(r, X_t) has a \"steady-state\" expectation, which we denote by E_0[s(r, X_t)]. Intuitively, once X_t reaches steady-state, the TD(λ) algorithm, in an \"average\" sense, behaves like the following deterministic algorithm: \n\nr̄_{τ+1} = r̄_τ + γ_τ E_0[s(r̄_τ, X_t)]. \n\nOur study centers on an analysis of this deterministic algorithm. Under some technical assumptions, a theorem from (Benveniste, Metivier, and Priouret, 1990) is then used to formally deduce convergence of the stochastic TD(λ) algorithm from that of its deterministic counterpart. \n\nIt turns out that \n\nE_0[s(r, X_t)] = Φ'D(T^(λ)(Φr) − Φr). \n\nUsing the contraction property of T^(λ), \n\n(r − r*)' E_0[s(r, X_t)] = (Φr − Φr*)' D (ΠT^(λ)(Φr) − Φr* + (Φr* − Φr)) \n ≤ ‖Φr − Φr*‖_D · ‖ΠT^(λ)(Φr) − Φr*‖_D − ‖Φr* − Φr‖²_D \n ≤ (α − 1) ‖Φr − Φr*‖²_D. \n\nSince α < 1, this inequality shows that the steady-state expectation E_0[s(r, X_t)] generally moves the parameter vector towards r*, the fixed point of ΠT^(λ), where \"closeness\" is measured in terms of the norm ‖·‖_D. This provides the main line of reasoning behind the proof of convergence provided in (Tsitsiklis and Van Roy, 1996). 
Some illuminating interpretations of this deterministic algorithm, which are useful in developing an intuitive understanding of temporal-difference learning, are also discussed in (Tsitsiklis and Van Roy, 1996). \n\n4 CONVERGENCE RESULT \n\nWe now present our main result concerning temporal-difference learning. A formal proof is provided in (Tsitsiklis and Van Roy, 1996). \n\nTheorem 1. Let the following conditions hold: \n(a) The Markov chain i_t has a unique invariant distribution π that satisfies π'P = π', with π(i) > 0 for all i. \n(b) The matrix Φ has full column rank; that is, the \"basis functions\" {φ_k | k = 1, ..., K} are linearly independent. \n(c) The step sizes γ_t are positive, nonincreasing, and predetermined. Furthermore, they satisfy Σ_{t=0}^∞ γ_t = ∞ and Σ_{t=0}^∞ γ_t² < ∞. \nWe then have: \n(a) For any λ ∈ [0, 1], the TD(λ) algorithm, as defined in Section 2, converges with probability 1. \n(b) The limit of convergence r* is the unique solution of the equation \n\nΠT^(λ)(Φr*) = Φr*. \n\n(c) Furthermore, r* satisfies \n\n‖Φr* − J*‖_D ≤ ((1 − λα)/(1 − α)) ‖ΠJ* − J*‖_D. \n\nPart (b) of the theorem leads to an interesting interpretation of the limit of convergence. In particular, if we apply the TD(λ) operator to the final approximation Φr*, and then project the resulting function back into the span of the basis functions, we get the same function Φr*. Furthermore, since the composition ΠT^(λ) is a contraction, repeated application of this composition to any function would generate a sequence of functions converging to Φr*. \n\nPart (c) of the theorem establishes that a certain desirable property is satisfied by the limit of convergence. In particular, if there exists a vector r such that Φr = J*, then this vector will be the limit of convergence of TD(λ), for any λ ∈ [0, 1]. 
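Before turning to the case where no exact representation exists, both conclusions of the theorem can be checked numerically on a small chain. Summing the geometric series in the definition of T^(λ) gives the affine form T^(λ)J = AJ + b with A = (1 − λ)αP(I − λαP)^{-1} and b = (I − λαP)^{-1}ḡ, so the fixed-point condition of part (b) reduces to a K × K linear system. The sketch below (all numerical values invented for illustration) solves it and verifies the error bound of part (c):

```python
import numpy as np

# Invented toy problem: 3-state chain, costs, and two basis functions.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
g = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [2.0, 0.0, 1.0]])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 2.0]])
alpha, lam, n = 0.9, 0.5, 3

# Steady-state distribution, diagonal weighting, projection, weighted norm.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()
D = np.diag(pi)
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D
norm_D = lambda J: np.sqrt(J @ D @ J)

# Exact cost-to-go vector: J* = g_bar + alpha P J*.
g_bar = (P * g).sum(axis=1)
J_star = np.linalg.solve(np.eye(n) - alpha * P, g_bar)

# Affine form of the TD(lambda) operator: T(J) = A @ J + b.
M = np.linalg.inv(np.eye(n) - lam * alpha * P)
A = (1 - lam) * alpha * P @ M
b = M @ g_bar

# Part (b): Pi T(Phi r*) = Phi r*  <=>  Phi' D (T(Phi r*) - Phi r*) = 0.
r_star = np.linalg.solve(Phi.T @ D @ (Phi - A @ Phi), Phi.T @ D @ b)
assert np.allclose(Pi @ (A @ (Phi @ r_star) + b), Phi @ r_star)

# Part (c): ||Phi r* - J*||_D <= (1 - lam*alpha)/(1 - alpha) ||Pi J* - J*||_D.
lhs = norm_D(Phi @ r_star - J_star)
rhs = (1 - lam * alpha) / (1 - alpha) * norm_D(Pi @ J_star - J_star)
assert lhs <= rhs + 1e-12
```

Solving for r* directly like this is only a sanity check on a known chain; TD(λ) reaches the same r* incrementally from sampled transitions, without ever forming P or D.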
On the other hand, if no such parameter vector exists, the distance between the limit of convergence Φr* and J* is bounded by a multiple of the distance between the projection ΠJ* and J*. This latter distance is amplified by a factor of (1 − λα)/(1 − α), which becomes larger as λ becomes smaller. \n\n5 COUNTER-EXAMPLES \n\nSutton (1995) has suggested that on-line updating and the use of linear function approximators are both important factors that make temporal-difference learning converge properly. These requirements also appear as assumptions in the convergence result of the previous section. To formalize the fact that these assumptions are relevant, two counter-examples were presented in (Tsitsiklis and Van Roy, 1996). \n\nThe first counter-example involves the use of a variant of TD(0) that does not sample states based on trajectories. Instead, the states i_t are sampled independently from a distribution q(·) over S, and successor states j_t are generated by sampling according to Pr[j_t = j | i_t = i] = p_ij. Each iteration of the algorithm takes on the form \n\nr_{t+1} = r_t + γ_t φ(i_t)(g(i_t, j_t) + α φ'(j_t) r_t − φ'(i_t) r_t). \n\nWe refer to this algorithm as q-sampled TD(0). Note that this algorithm is closely related to the original TD(λ) algorithm as defined in Section 2. In particular, if i_t is generated by the Markov chain and j_t = i_{t+1}, we are back to the original algorithm. It is easy to show, using a subset of the arguments required to prove Theorem 1, that this algorithm converges when q(i) = π(i) for all i, and the assumptions of Theorem 1 are satisfied. However, results can be very different when q(·) is arbitrary. In particular, the counter-example presented in (Tsitsiklis and Van Roy, 1996) shows that for any sampling distribution q(·) 
that is different from π(·), there exists a Markov chain with steady-state probabilities π(·) and a linearly parameterized function approximator for which q-sampled TD(0) diverges. A counter-example with similar implications has also been presented by Baird (1995). \n\nA generalization of temporal-difference learning is commonly used in conjunction with nonlinear function approximators. This generalization involves replacing each vector φ(i_t) in the update rule with the gradient of the approximation with respect to the parameter vector. The second counter-example of (Tsitsiklis and Van Roy, 1996) shows that, with a nonlinearly parameterized function approximator, temporal-difference learning can diverge even when states are sampled along on-line trajectories of the Markov chain. \n\n6 CONCLUSION \n\nThe error bound of Theorem 1 involves an amplification factor (1 − λα)/(1 − α) that grows as λ decreases, so the bound deteriorates for smaller values of λ. Although this is only a bound, it strongly suggests that higher values of λ are likely to produce more accurate approximations. This is consistent with the examples that have been constructed by Bertsekas (1994). \n\nThe sensitivity of the error bound to λ raises the question of whether or not it ever makes sense to set λ to values less than 1. Many reports of experimental results, dating back to Sutton (1988), suggest that setting λ to values less than one can often lead to significant gains in the rate of convergence. A full understanding of how λ influences the rate of convergence is yet to be found, though some insight in the case of look-up table representations is provided by Dayan and Singh (1996). This is an interesting direction for future research. \n\nAcknowledgments \n\nWe thank Rich Sutton for originally making us aware of the relevance of on-line state sampling, and also for pointing out a simplification in the expression for the error bound of Theorem 1. This research was supported by the NSF under grant DMI-9625489 and the ARO under grant DAAL-03-92-G-0115. \n\nReferences \n\nBaird, L. C. (1995) \"Residual Algorithms: Reinforcement Learning with Function Approximation,\" in Prieditis & Russell, eds., Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA. \n\nBertsekas, D. P. 
(1994) \"A Counter-Example to Temporal-Difference Learning,\" \n\n\fAnalysis ofTemporal-Diffference Learning with Function Approximation \n\n1081 \n\nNeural Computation, vol. 7, pp. 270-279. \nBertsekas, D. P. (1995) Dynamic Programming and Optimal Control, Athena Sci(cid:173)\nentific, Belmont, MA. \nBertsekas, D. P. & Tsitsiklis, J. N. (1996) Neuro-Dynamic Programming, Athena \nScientific, Belmont, MA. \nBenveniste, A., Metivier, M., & Priouret, P., (1990) Adaptive Algorithms and \nStochastic Approximations, Springer-Verlag, Berlin. \nDayan, P. D. & Singh, S. P (1996) \"Mean Squared Error Curves in Temporal \nDifference Learning,\" preprint. \nGurvits, L. (1996) personal communication. \nSutton, R. S., (1988) \"Learning to Predict by the Method of Temporal Differences,\" \nMachine Learning, vol. 3, pp. 9-44. \n\nSutton, R.S. (1995) \"On the Virtues of Linear Learning and Trajectory Distribu(cid:173)\ntions,\" Proceedings of the Workshop on Value Function Approximation, Machine \nLearning Conference 1995, Boyan, Moore, and Sutton, Eds., p. 85. Technical \nReport CMU-CS-95-206, Carnegie Mellon University, Pittsburgh, PA 15213. \nTsitsiklis, J. N. & Van Roy, B. (1996) \"An Analysis of Temporal-Difference Learning \nwith Function Approximation,\" to appear in the IEEE Transactions on Automatic \nControl. \n\n\f", "award": [], "sourceid": 1269, "authors": [{"given_name": "John", "family_name": "Tsitsiklis", "institution": null}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": null}]}*