{"title": "Analytical Mean Squared Error Curves in Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1054, "page_last": 1060, "abstract": null, "full_text": "Analytical Mean Squared Error Curves \n\nin Temporal Difference Learning \n\nSatinder Singh \n\nDepartment of Computer Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \nbaveja@cs.colorado.edu \n\nPeter Dayan \n\nBrain and Cognitive Sciences \n\nE25-210, MIT \n\nCambridge, MA 02139 \nbertsekas@lids.mit.edu \n\nAbstract \n\nWe have calculated analytical expressions for how the bias and \nvariance of the estimators provided by various temporal difference \nvalue estimation algorithms change with offline updates over trials \nin absorbing Markov chains using lookup table representations. We \nillustrate classes of learning curve behavior in various chains, and \nshow the manner in which TD is sensitive to the choice of its step(cid:173)\nsize and eligibility trace parameters. \n\n1 \n\nINTRODUCTION \n\nA reassuring theory of asymptotic convergence is available for many reinforcement \nlearning (RL) algorithms. What is not available, however, is a theory that explains \nthe finite-term learning curve behavior of RL algorithms, e.g., what are the different \nkinds of learning curves, what are their key determinants, and how do different \nproblem parameters effect rate of convergence. Answering these questions is crucial \nnot only for making useful comparisons between algorithms, but also for developing \nhybrid and new RL methods. In this paper we provide preliminary answers to some \nof the above questions for the case of absorbing Markov chains, where mean square \nerror between the estimated and true predictions is used as the quantity of interest \nin learning curves. 
\n\nOur main contribution is in deriving the analytical update equations for the two \ncomponents of MSE, bias and variance, for popular Monte Carlo (MC) and TD(A) \n(Sutton, 1988) algorithms. These derivations are presented in a larger paper. Here \nwe apply our theoretical results to produce analytical learning curves for TD on \ntwo specific Markov chains chosen to highlight the effect of various problem and \nalgorithm parameters, in particular the definite trade-offs between step-size, Q, and \neligibility-trace parameter, A. Although these results are for specific problems, we \n\n\fAnalytical MSE Curves/or TD Learning \n\n1055 \n\nbelieve that many ofthe conclusions are intuitive or have previous empirical support, \nand may be more generally applicable. \n\n2 ANALYTICAL RESULTS \n\nA random walk, or trial, in an absorbing Markov chain with only terminal payoffs \nproduces a sequence of states terminated by a payoff. The prediction task is to \ndetermine the expected payoff as a function of the start state i, called the optimal \nvalue function, and denoted v.... Accordingly, vi = E {rls1 = i}, where St is the \nstate at step t, and r is the random terminal payoff. The algorithms analysed are \niterative and produce a sequence of estimates of v\" by repeatedly combining the \nresult from a new trial with the old estimate to produce a new estimate. They have \nthe form: viet) = Viet - 1) + a(t)oi(t) where vet) = {Viet)} is the estimate of the \noptimal value function after t trials, Oi (t) is the result for state i based on random \ntrial t, and the step-size a( t) determines how the old estimate and the new result \nare combined. The algorithms differ in the os produced from a trial. \n\nMonte Carlo algorithms use the final payoff that results from a trial to define the \nOi(t) (e.g., Barto & Duff, 1994). Therefore in MC algorithms the estimated value ofa \nstate is unaffected by the estimated value of any other state. 
The main contribution of TD algorithms (Sutton, 1988) over MC algorithms is that they update the value of a state based not only on the terminal payoff but also on the estimated values of the intervening states. When a state is first visited, it initiates a short-term memory process, an eligibility trace, which then decays exponentially over time with parameter λ. The amount by which the value of an intervening state combines with the old estimate is determined in part by the magnitude of the eligibility trace at that point.

In general, the initial estimate v(0) could be a random vector drawn from some distribution, but often v(0) is fixed to some initial value such as zero. In either case, subsequent estimates, v(t), t > 0, will be random vectors because of the random δs. The random vector v(t) has a bias vector b(t) = E{v(t) - v*} and a covariance matrix C(t) = E{(v(t) - E{v(t)})(v(t) - E{v(t)})^T}. The scalar quantity of interest for learning curves is the weighted MSE as a function of trial number t, and is defined as follows:

MSE(t) = Σ_i p_i E{(v_i(t) - v*_i)^2} = Σ_i p_i (b_i^2(t) + C_ii(t)),

where p_i = (μ^T [I - Q]^{-1})_i / Σ_j (μ^T [I - Q]^{-1})_j is the weight for state i, which is the expected number of visits to i in a trial divided by the expected length of a trial¹ (μ_i is the probability of starting in state i; Q is the transition matrix of the chain). In this paper we present results just for the standard TD(λ) algorithm (Sutton, 1988), but we have analysed (Singh & Dayan, 1996) various other TD-like algorithms (e.g., Singh & Sutton, 1996) and comment on their behavior in the conclusions. Our analytical results are based on two non-trivial assumptions: first that lookup tables are used, and second that the algorithm parameters α and λ are functions of the trial number alone rather than also depending on the state.
We also make two assumptions that we believe would not change the general nature of the results obtained here: that the estimated values are updated offline (after the end of each trial), and that the only non-zero payoffs are on the transitions to the terminal states. With the above caveats, our analytical results allow rapid computation of exact mean squared error (MSE) learning curves as a function of trial number.

¹Other reasonable choices for the weights, p_i, would not change the nature of the results presented here.

2.1 BIAS, VARIANCE, AND MSE UPDATE EQUATIONS

The analytical update equations for the bias, variance and MSE are complex and their details are in Singh & Dayan (1996) - they take the following form in outline:

b(t) = A^m + B^m b(t-1)                    (1)
C(t) = A^s + B^s C(t-1) + f^s(b(t-1))      (2)

where the matrix B^m depends linearly on α(t), and B^s and f^s depend at most quadratically on α(t). We coded this detail in the C programming language to develop a software tool² whose rapid computation of exact MSE curves allowed us to experiment with many different algorithm and problem parameters on many Markov chains. Of course, one could have averaged together many empirical MSE curves obtained via simulation of these Markov chains to get approximations to the analytical MSE curves, but in many cases MSE curves that take minutes to compute analytically take days to derive empirically on the same computer for five significant digit accuracy. Empirical simulation is particularly slow in cases where the variance converges to non-zero values (because of constant step-sizes) with long tails in the asymptotic distribution of estimated values (we present an example in Figure 1c).
Our analytical method, on the other hand, computes exact MSE curves \nfor L trials in O( Istate space 13 L) steps regardless of the behavior of the variance \nand bias curves. \n\n2.2 ANALYTICAL METHODS \n\nTwo consequences of having the analytical forms of the equations for the update \nof the mean and variance are that it is possible to optimize schedules for setting a \nand A and, for fixed A and a, work out terminal rates of convergence for band C. \nComputing one-step optimal a's: Given a particular A, the effect on the MSE \nof a single step for any of the algorithms is quadratic in a. It is therefore straight(cid:173)\nforward to calculate the value of a that minimises MSE(t) at the next time step. \nThis is called the greedy value of a. It is not clear that if one were interested \nin minimising MSE(t + t'), one would choose successive a(u) that greedily min(cid:173)\nimise MSE(t); MSE(t + 1); .... In general, one could use our formulre and dynamic \nprogramming to optimise a whole schedule for a( u), but this is computationally \nchallenging. \n\nNote that this technique for setting greedy a assumes complete knowledge about \nthe Markov chain and the initial bias and covariance of v(O), and is therefore not \ndirectly applicable to realistic applications of reinforcement learning. Nevertheless, \nit is a good analysis tool to approximate omniscient optimal step-size schedules, \neliminating the effect of the choice of a when studying the effect of the A. \nComputing one-step optimal A'S: Calculating analytically the A that would \nminimize MSE(t) given the bias and variance at trial t - 1 is substantially harder \nbecause terms such as [/ - A(t)Q]-l appear in the expressions. However, since it is \npossible to compute MSE(t) for any choice of A, it is straightforward to find to any \ndesired accuracy the Ag(t) that gives the lowest resulting MSE(t). This is possible \nonly because MSE(t) can be computed very cheaply using our analytical equations. 
\n\nThe caveats about greediness in choosing ag{t) also apply to Ag(t). For one of the \nMarkov chains, we used a stochastic gradient ascent method to optimise A( u) and \n\n2The analytical MSE error curve software is available via anonymous ftp from the \n\nfollowing address: ftp.cs.colorado.edu /users/baveja/ AMse. tar.Z \n\n\f1057 \n\nAnalytical MSE Curves/or TD Learning \na( u) to minimise MSE( t + tf) and found that it was not optimal to choose Ag (t) \nand ag(t) at the first step. \nComputing terminal rates of convergence: In the update equations 1 and 2, \n. b(t) depends linearly on b(t - 1) through a matrix Bm; and C(t) depends linearly \non C(t - 1) through a matrix B S . For the case of fixed a and A, the maximal and \nminimal eigenvalues of Bm and B S determine the fact and speed of convergence \nof the algorithms to finite endpoints. If the modulus of the real part of any of \nthe eigenvalues is greater than 1, then the algorithms will not converge in general. \nWe observed that the mean update is more stable than the mean square update, \ni.e., appropriate eigenvalues are obtained for larger values of a (we call the largest \nfeasible a the largest learning rate for which TD will converge). Further, we know \nthat the mean converges to v\" if a is sufficiently small that it converges at all, \nand so we can determine the terminal covariance. Just like the delta rule, these \nalgorithms converge at best to an (-ball for a constant finite step-size. This amounts \nto the MSE converging to a fixed value, which our equations also predict. Further, \nby calculating the eigenvalues of B m , we can calculate an estimate of the rate of \ndecrease of the bias. \n\n3 LEARNING CURVES ON SPECIFIC MARKOV \n\nCHAINS \n\nWe applied our software to two problems: a symmetric random walk (SRW), and a \nMarkov chain for which we can control the frequency of returns to each state in a \nsingle run (we call this the cyclicity of the chain). \n\n_,. 
\n\ne\"....,., MIlE DiIdll.IIan \n\n\u2022\u2022 ,--------------, \n\n) \nc \n\na) ... \n\ni ,.,. \n\u2022 ... \n..... 1,., -----..:-.---;,;;:;; ... .---=,;; ... ~:;;:_~, =iiJ .... , \n\nT .... N~ \n\nb) \" \n\u2022\u2022 \n.. \nI\u00b7. \nI. . \u2022 \n\n10.0 \n\ntellO \n110.0 \nTNlN ...... \n\n_0 \n\n110.0 \n\n. . \n\nFigure 1: Comparing Analytical and Empirical MSE curves. a) analytical and empirical \nlearning curves obtained on the 19 state SRW problem with parameters Ci = 0.01, ,\\ = 0.9. \nThe empirical curve was obtained by averaging together more than three million simulation \nruns, and the analytical and empirical MSE curves agree up to the fourth decimal place; \nb) a case where the empirical method fails to match the analytical learning curve after \nmore than 15 million runs on a 5 state SRW problem. The empirical learning curve is \nvery spiky. c) Empirical distribution plot over 15.5 million runs for the MSE at trial 198. \nThe inset shows impulses at actual sample values greater than 100. The largest value is \ngreater than 200000. \n\nAgreement: First, we present empirical confirmation of our analytical equations \non the 19 state SRW problem. We ran TD(A) for specific choices of a and A for more \nthan three million simulation runs and averaged the resulting empIrical weighted \nMSE error curves. Figure la shows the analytical and empirical learning curves, \nwhich agree to within four decimal places. \nLong-Tails of Empirical MSE distribution: There are cases in which the agree(cid:173)\nment is apparently much worse (see Figure Ib). This is because of the surprisingly \nlong tails for the empirical MSE distribution - Figure lc shows an example for a 5 \n\n\f1058 \n\nS. Singh and P. Dayan \n\nstate SRW. This points to interesting structure that our analysis is unable to reveal. 
\nEffect of a and A: Extensive studies on the 19 state SRW that we do not have \nspace to describe fully show that: HI) for each algorithm, increasing a while holding \nA fixed increases the asymptotic value of MSE, and similarly for increasing A whilst \nholding a constant; H2) larger values of a or A (except A very close to 1) lead \nto faster convergence to the asymptotic value of MSE if there exists one; H3) in \ngeneral, for each algorithm as one decreases A the reasonable range of a shrinks, \ni.e., larger a can be used with larger A without causing excessive MSE. The effect \nin H3 is counter-intuitive because larger A tends to amplify the effective step-size \nand so one would expect the opposite effect. Indeed, this increase in the range of \nfeasible a is not strictly true, especially very near A = 1, but it does seem to hold \nfor a large range of A. \nMe versus TD(A): Sutton (1988) and others have investigated the effect of A \non the empirical MSE at small trial numbers and consistently shown that TD is \nbetter for some A < 1 than MC (A = 1). Figure 2a shows substantial changes as a \nfunction of trial number in the value of A that leads to the lowest MSE. This effect \nis consistent with hypotheses HI-H3. Figure 2b confirms that this remains true \neven if greedy choices of a tailored for each value of A are used. Curves for different \nvalues of A yield minimum MSE over different trial number segments. We observed \nthese effects on several Markov chains. \n\na) \n\nb) \n\n.,. \n\nAccumulate \n\n, \\ \n\\ \\ \n~'':: _____ .o--------.----_u)_ \u2022. __ \n\n. \n\n\u2022 oo \\ \" . :--... \u2022\u2022\u2022 \n\n1OQ.0\u00b7 150.0 \nTn.! Numbor \n\n0.8 0.8 \n\n00000 \n\n50 \n\n,.,..~ \n\n... ----.... -------\n\n100.0 \n\nHO.O \n\nFigure 2: U-shaped Curves. a) Weighted MSE as a function of A and trial number for \nfixed ex = 0.05 (minimum in A shown as a black line). 
This is a 3-d version of the U-shaped curves in Sutton (1988), with trial number being the extra axis. b) Weighted MSE as a function of trial number for various λ using greedy α. Curves for different values of λ yield minimum MSE over different trial number segments.

Initial bias: Watkins (1989) suggested that λ trades off bias for variance, since λ → 1 has low bias, but potentially high variance, and conversely for λ → 0. Figure 3a confirms this in a problem which is a little like a random walk, except that it is highly cyclic so that it returns to each state many times in a single trial. If the initial bias is high (low), then the initial greedy value of λ is high (low). We had expected the asymptotic greedy value of λ to be 0, since once b(t) → 0, λ = 0 leads to lower variance updates. However, Figure 3a shows a non-zero asymptote - presumably because larger learning rates can be used for λ > 0, because of covariance. Figure 3b shows, however, that there is little advantage in choosing λ cleverly except in the first few trials, at least if good values of α are available.

Figure 3: Greedy λ for a highly cyclic problem. a) Greedy λ for high and low initial bias (using greedy α). b) Ratio of MSE for a given value of λ to that for the greedy λ at each trial. The greedy λ is used for every step.

Eigenvalue stability analysis: We analysed the eigenvalues of the covariance update matrix B^s (cf. Equation 2) to determine the maximal fixed α as a function of λ. Note that larger α tends to lead to faster learning, provided that the values converge. Figure 4a shows the largest eigenvalue of B^s as a function of λ for various α. If this eigenvalue is larger than 1, then the algorithm will diverge - a behavior that we observed in our simulations. The effect of hypothesis H3 above is evident - for smaller λ, only smaller α can be used. Figure 4b shows this in more graphic form, indicating the largest α that leads to stable eigenvalues for B^s. Note the reversal very close to λ = 1, which provides more evidence against the pure MC algorithm. The choice of α and λ controls both the rate of convergence and the asymptotic MSE. In Figure 4c we control for the asymptotic variance by choosing appropriate α's as a function of λ and plot maximal eigenvalues of B^m (cf. Equation 1; it controls the terminal rate of convergence of the bias to zero) as a function of λ. Again, we see evidence for TD over MC.

Figure 4: Eigenvalue analysis of covariance reduction. a) Maximal modulus of the eigenvalues of B^s. These determine the rate of convergence of the variance. Values greater than 1 lead to instability. b) Largest α such that the covariance is bounded. The inset shows a blowup for 0.9 ≤ λ ≤ 1. Note that λ = 1 is not optimal. c) Maximal bias reduction rates as a function of λ, after controlling for asymptotic variance (to 0.1 and 0.01) by choosing appropriate α's. Again, λ < 1 is optimal.

4 CONCLUSIONS

We have provided analytical expressions for calculating how the bias and variance of various TD and Monte Carlo algorithms change over iterations.
The expressions themselves seem not to be very revealing, but we have provided many illustrations of their behavior in some example Markov chains. We have also used the analysis to calculate one-step optimal values of the step-size α and eligibility trace λ parameters. Further, we have calculated terminal mean squared errors and maximal bias reduction rates. Since all these results depend on the precise Markov chains chosen, it is hard to make generalisations.

We have posited four general conjectures: H1) for constant λ, the larger α, the larger the terminal MSE; H2) the larger α or λ (except for λ very close to 1), the faster the convergence to the asymptotic MSE, provided that this is finite; H3) the smaller λ, the smaller the range of α for which the terminal MSE is not excessive; H4) higher values of λ are good for cases with high initial biases. The third of these is somewhat surprising, because the effective value of the step-size is really α/(1 - λ). However, the lower λ, the more the value of a state is based on the value estimates for nearby states. We conjecture that with small λ, large α can quickly lead to high correlation in the value estimates of nearby states and result in runaway variance updates.

Two main lines of evidence suggest that using values of λ other than 1 (i.e., using a temporal difference rather than a Monte-Carlo algorithm) can be beneficial. First, the greedy value of λ chosen to minimise the MSE at the end of the step (whilst using the associated greedy α) remains away from 1 (see Figure 3). Second, the eigenvalue analysis of B^s showed that the largest value of α that can be used is higher for λ < 1 (also the asymptotic speed with which the bias can be guaranteed to decrease fastest is higher for λ < 1).
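
The eigenvalue criterion behind this second line of evidence can be sketched in isolation. The matrix below is an invented stand-in for B^m, not the paper's update matrix: for any linear recurrence b(t) = B b(t-1), the spectral radius of B decides whether the bias converges and sets its terminal rate, and that radius can be estimated by power iteration.

```python
# Sketch of the stability criterion (the matrix is an invented stand-in for
# B^m, not the paper's update matrix): a linear recurrence b(t) = B b(t-1)
# converges iff the spectral radius of B is below 1, and that radius sets the
# terminal rate of bias reduction. We estimate it by plain power iteration.

def spectral_radius(B, iters=500):
    """Estimate max |eigenvalue| of B by power iteration."""
    n = len(B)
    x, r = [1.0] * n, 0.0
    for _ in range(iters):
        y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        r = max(abs(c) for c in y) / max(abs(c) for c in x)
        x = y
    return r

def mean_update_matrix(alpha):
    """I - alpha * M for a fixed positive-definite M (invented for the example)."""
    M = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
    n = len(M)
    return [[(1.0 if i == j else 0.0) - alpha * M[i][j] for j in range(n)]
            for i in range(n)]

for alpha in (0.1, 0.3, 0.8):
    print(alpha, spectral_radius(mean_update_matrix(alpha)))
# The radius crosses 1 once alpha is too large (here alpha = 0.8 gives ~1.73),
# at which point the bias recurrence diverges instead of contracting.
```

The largest α for which the radius stays below 1 plays the role of the "largest feasible α" discussed above; repeating the computation while varying λ (through its effect on the update matrix) would trace out curves like those in Figure 4b.
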
\nAlthough in this paper we have only discussed results for the standard TD(A) al(cid:173)\ngorithm (called Accumulate), we have also analysed Replace TD(A) of Singh & \nSutton (1996) and various others. This analysis clearly provides only an early step \nto understanding the course of learning for TD algorithms, and has focused exclu(cid:173)\nsively on prediction rather than control. The analytical expressions for MSE might \nlend themselves to general conclusions over whole classes of Markov chains, and our \ngraphs also point to interesting unexplained phenomena, such as the apparent long \ntails in Figure 1c and the convergence of greedy values of A in Figure 3. Stronger \nanalyses such as those providing large deviation rates would be desirable. \n\nReferences \n\nBarto, A.G. &. Duff, M. (1994). Monte Carlo matrix inversion and reinforcement learning. \nNIPS 6, pp 687-694. \n\nSingh, S.P. &. Dayan, P. (1996). Analytical mean squared error curves in temporal differ(cid:173)\nence learning. Machine Learning, submitted. \n\nSingh, S.P. &. Sutton, R.S. (1996). Reinforcement learning with replacing eligibility traces. \nMachine Learning, to appear. \n\nSutton, R.S. (1988). Learning to predict by the methods of temporal difference. Machine \nLearning, 3, pp 9-44. \n\nWatkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD Thesis. University of \nCambridge, England. \n\n\f", "award": [], "sourceid": 1284, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}