{"title": "TD_gamma: Re-evaluating Complex Backups in Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2402, "page_last": 2410, "abstract": "We show that the lambda-return target used in the TD(lambda) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the gamma-return estimator, an alternative target based on a more accurate model of variance, which defines the TD_gamma family of complex-backup temporal difference learning algorithms. We derive TD_gamma, the gamma-return equivalent of the original TD(lambda) algorithm, which eliminates the lambda parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TD_gamma(C), with a capacity parameter C. TD_gamma(C) requires C times more time and memory than TD(lambda) and is incremental and online. We show that TD_gamma outperforms TD(lambda) for any setting of lambda on 4 out of 5 benchmark domains, and that TD_gamma(C) performs as well as or better than TD_gamma for intermediate settings of C.", "full_text": "TD\u03b3: Re-evaluating Complex Backups in Temporal\n\nDifference Learning\n\nGeorge Konidaris\u2217\u2217\u2020\n\nMIT CSAIL\u2020\n\nCambridge MA 02139\ngdk@csail.mit.edu\n\nScott Niekum\u2217\u2021\n\nPhilip S. Thomas\u2217\u2021\n\nUniversity of Massachusetts Amherst\u2021\n\nAmherst MA 01003\n\n{sniekum,pthomas}@cs.umass.edu\n\nAbstract\n\nWe show that the \u03bb-return target used in the TD(\u03bb) family of algorithms is the\nmaximum likelihood estimator for a speci\ufb01c model of how the variance of an n-\nstep return estimate increases with n. 
We introduce the \u03b3-return estimator, an\nalternative target based on a more accurate model of variance, which de\ufb01nes the\nTD\u03b3 family of complex-backup temporal difference learning algorithms. We de-\nrive TD\u03b3, the \u03b3-return equivalent of the original TD(\u03bb) algorithm, which elimi-\nnates the \u03bb parameter but can only perform updates at the end of an episode and\nrequires time and space proportional to the episode length. We then derive a sec-\nond algorithm, TD\u03b3(C), with a capacity parameter C. TD\u03b3(C) requires C times\nmore time and memory than TD(\u03bb) and is incremental and online. We show that\nTD\u03b3 outperforms TD(\u03bb) for any setting of \u03bb on 4 out of 5 benchmark domains,\nand that TD\u03b3(C) performs as well as or better than TD\u03b3 for intermediate settings\nof C.\n\n1\n\nIntroduction\n\nMost reinforcement learning [1] algorithms are value-function based\u2014learning is performed by es-\ntimating the expected return (discounted sum of rewards) obtained by following the current policy\nfrom each state, and then updating the policy based on the resulting so-called value function. Ef\ufb01-\ncient value function learning algorithms are crucial to this process and have been the focus of a great\ndeal of reinforcement learning research.\nThe most successful and widely-used family of value function algorithms is the TD(\u03bb) family [2].\nThe TD(\u03bb) family forms an estimate of return, called the \u03bb-return, that blends both low variance,\nbootstrapped and biased temporal-difference estimates of return with high variance, unbiased Monte\nCarlo estimates of return using a parameter \u03bb. 
While several different algorithms exist within the TD(λ) family—the original incremental and online algorithm [2], replacing traces [3], least-squares algorithms [4], algorithms for learning state-action value functions [1, 5], and algorithms for adapting λ [6], among others—the λ-return formulation has remained unchanged since its introduction in 1988 [2]. Our goal is to understand the modeling assumptions implicit in the λ-return formulation and improve them.

We show that the λ-return estimator is only a maximum-likelihood estimator of return given a specific model of how the variance of an n-step return estimate increases with n. We propose a more accurate model of that variance increase and use it to obtain an expression for a new return estimator, the γ-return. This results in the TDγ family of algorithms, of which we derive TDγ, the γ-return version of the original TD(λ) algorithm. TDγ is only suitable for the batch setting where updates occur at the end of the episode and requires time and space proportional to the length of the episode, but it eliminates the λ parameter. We then derive a second algorithm, TDγ(C), which requires C times more time and memory than TD(λ) and can be used in an incremental and online setting. 

∗All three authors are primary authors on this occasion.

We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ(C) performs as well as or better than TDγ for intermediate settings of C.

2 Complex Backups

Estimates of return lie at the heart of value-function based reinforcement learning algorithms: a value function V^π (which we denote here as V, assuming a fixed policy) estimates return from each state, and the learning process aims to reduce the error between estimated and observed returns. Temporal difference (TD) algorithms use a return estimate obtained by taking a single transition in the MDP and then estimating the remaining return using the value function itself:

R^{TD}_{s_t} = r_t + γV(s_{t+1}),   (1)

where R^{TD}_{s_t} is the return estimate from state s_t and r_t is the reward for going from s_t to s_{t+1} via action a_t. Monte Carlo algorithms (for episodic tasks) do not use intermediate estimates but instead use the full return sample directly:

R^{MC}_{s_t} = Σ_{i=0}^{L−1} γ^i r_{t+i},   (2)

for an episode L transitions in length after time t. These two types of return estimates can be considered instances of the more general notion of an n-step return sample, for n ≥ 1:

R^{(n)}_{s_t} = r_t + γr_{t+1} + γ^2 r_{t+2} + ... + γ^{n−1} r_{t+n−1} + γ^n V(s_{t+n}).   (3)

Here, n transitions are observed from the MDP and the remaining portion of return is estimated using the value function. The important observation here is that all n-step return estimates can be used simultaneously for learning. The TD(λ) family of algorithms accomplishes this using a parameter λ ∈ [0, 1] to average n-step return estimates, according to the following equation:

R^λ_{s_t} = (1 − λ) Σ_{n=0}^{∞} λ^n R^{(n+1)}_{s_t}.   (4)

Note that for any episodic MDP we always obtain a finite episode length. 
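As a concrete illustration, the n-step returns of equation (3) and the λ-return of equation (4) can be computed directly from a stored episode. The sketch below is our own illustration (the helper names are not from the paper); it uses the finite-episode form in which the weight beyond the episode end is allocated to the Monte Carlo return.

```python
def n_step_return(rewards, values, t, n, gamma):
    # R^(n)_{s_t}: n observed rewards, then bootstrap with gamma^n * V(s_{t+n}).
    # Past the terminal state the bootstrap term is zero (absorbing state).
    L = len(rewards) - t              # transitions remaining after time t
    n = min(n, L)
    ret = sum(gamma**i * rewards[t + i] for i in range(n))
    if n < L:
        ret += gamma**n * values[t + n]
    return ret

def lambda_return(rewards, values, t, lam, gamma):
    # Finite-episode lambda-return: weight (1 - lam) * lam^(n-1) on the n-step
    # return for n < L, with the remaining mass lam^(L-1) on the Monte Carlo return.
    L = len(rewards) - t
    ret = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
              for n in range(1, L))
    return ret + lam**(L - 1) * n_step_return(rewards, values, t, L, gamma)
```

At λ = 0 this reduces to the one-step TD target r_t + γV(s_{t+1}); at λ = 1 it is the Monte Carlo return.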
The usual formulation of an episodic MDP uses absorbing terminal states—states where only zero-reward self-transitions are available. In such cases the n-step returns past the end of the episode are all equal, and therefore TD(λ) allocates the weights corresponding to all of those return estimates to the final transition.

R^λ_{s_t}, known as the λ-return, is an estimator that blends one-step temporal difference estimates (which are biased, but low variance) at λ = 0 and Monte Carlo estimates (which are unbiased, but high variance) at λ = 1. This forms the target for the entire family of TD(λ) algorithms, whose members differ largely in their use of the resulting estimates to update the value function.

What makes this a good way to average the n-step returns? Why choose this method over any other? Viewing this as a statistical estimation problem where each n-step return is a sample of the true return, under what conditions and for what model is equation (4) a good estimator for return?

The most salient feature of the n-step returns is that their variances increase with n. Therefore, consider the following model: each n-step return estimate R^{(n)}_{s_t} is sampled from a Gaussian distribution centered on the true return, R_{s_t},¹ with variance k(n) that is some increasing function of n. If we assume the n-step returns are independent,² then the likelihood function for return estimate R̂_{s_t} is:

L(R̂_{s_t} | R^{(1)}_{s_t}, ..., R^{(L)}_{s_t}; k) = Π_{n=1}^{L} N(R^{(n)}_{s_t} | R̂_{s_t}, k(n)).   (5)

¹ We should note that this assumption is not quite true: only the Monte Carlo return is unbiased.
² Again, this assumption is not true. However, it allows us to obtain a simple, closed-form estimator.

Maximizing the log of this equation obtains the maximum likelihood estimator for R̂_{s_t}:

R̂_{s_t} = [Σ_{n=1}^{L} k(n)^{−1} R^{(n)}_{s_t}] / [Σ_{n=1}^{L} k(n)^{−1}].   (6)

Thus, we obtain a weighted sum: each n-step return is weighted by the inverse of its variance and the entire sum is normalized so that the weights sum to 1. From here we can see that if we let L go to infinity and set k(n) = λ^{−n}, 0 ≤ λ ≤ 1, then we obtain the λ-return estimator in equation (4), since Σ_{n=0}^{∞} λ^n = 1/(1 − λ).

Thus, λ-return (as used in the TD(λ) family of algorithms) is the maximum-likelihood estimator of return under the following assumptions:

1. The n-step returns from a given state are independent.
2. The n-step returns from a given state are normally distributed with a mean of the true return.
3. The variances of the n-step returns from each state increase according to a geometric progression in n, with common ratio λ^{−1}.

All of these assumptions could be improved, but the third is the most interesting. 
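This correspondence is easy to verify numerically. The following sketch (our own illustration, not from the paper) builds the inverse-variance weights of equation (6) for k(n) = λ^{−n} and checks that, once normalized, they match the (1 − λ)λ^{n−1} weighting that TD(λ) assigns to the n-step return R^{(n)}:

```python
import numpy as np

def inverse_variance_weights(lam, N):
    # Weights from equation (6) under the model k(n) = lam**(-n), n = 1..N:
    # each n-step return is weighted by k(n)^(-1) = lam**n, then normalized.
    inv_var = lam ** np.arange(1, N + 1)
    return inv_var / inv_var.sum()

lam, N = 0.9, 500                             # large N stands in for the infinite sum
w = inverse_variance_weights(lam, N)
td_lambda = (1 - lam) * lam ** np.arange(N)   # TD(lambda) weight on R^(n), n = 1..N
assert np.allclose(w, td_lambda / td_lambda.sum())
```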
In this view, the variance of an n-step sample return increases geometrically with n, and λ parametrizes the shape of this geometric increase.

3 γ-Return and the TDγ Family of Algorithms

Consider the variance of an n-step sample return, n > 1:

Var[R^{(n)}_{s_t}]
 = Var[R^{(n−1)}_{s_t} − γ^{n−1}V(s_{t+n−1}) + γ^{n−1}r_{t+n−1} + γ^n V(s_{t+n})]   (7)
 = Var[R^{(n−1)}_{s_t}] + γ^{2(n−1)} Var[V(s_{t+n−1}) − (r_{t+n−1} + γV(s_{t+n}))]
   + 2Cov[R^{(n−1)}_{s_t}, −γ^{n−1}V(s_{t+n−1}) + γ^{n−1}r_{t+n−1} + γ^n V(s_{t+n})].   (8)

Examining the last term, we see that:

Cov[R^{(n−1)}_{s_t}, −γ^{n−1}V(s_{t+n−1}) + γ^{n−1}r_{t+n−1} + γ^n V(s_{t+n})]   (9)
 = Cov[R^{(n−1)}_{s_t}, R^{(n)}_{s_t} − R^{(n−1)}_{s_t}]   (10)
 = Cov[R^{(n−1)}_{s_t}, R^{(n)}_{s_t}] − Cov[R^{(n−1)}_{s_t}, R^{(n−1)}_{s_t}]   (11)
 = Cov[R^{(n−1)}_{s_t}, R^{(n)}_{s_t}] − Var[R^{(n−1)}_{s_t}].   (12)

Since R^{(n−1)}_{s_t} and R^{(n)}_{s_t} are highly correlated—being two successive return samples—we assume that Cov[R^{(n−1)}_{s_t}, R^{(n)}_{s_t}] ≈ Var[R^{(n−1)}_{s_t}] (equality holds when R^{(n)}_{s_t} and R^{(n−1)}_{s_t} are perfectly correlated). Thus, equation (12) is approximately zero. 
Hence, equation (8) becomes:

Var[R^{(n)}_{s_t}] ≈ Var[R^{(n−1)}_{s_t}] + γ^{2(n−1)} Var[V(s_{t+n−1}) − (r_{t+n−1} + γV(s_{t+n}))].   (13)

Notice that the final term on the right hand side of equation (13) is the discounted variance of the temporal difference error n steps after the current state. We assume that this variance is roughly the same for all states; let that value be κ. Since κ also approximates the variance of the 1-step return (i.e., k(1) = κ), we obtain the following simple model of the variance of an n-step sample of return:

k(n) = Σ_{i=1}^{n} γ^{2(i−1)} κ.   (14)

Substituting equation (14) into the general return estimator in equation (6), we obtain:

R^γ_{s_t} = [κ^{−1} Σ_{n=1}^{L} (Σ_{i=1}^{n} γ^{2(i−1)})^{−1} R^{(n)}_{s_t}] / [κ^{−1} Σ_{n=1}^{L} (Σ_{i=1}^{n} γ^{2(i−1)})^{−1}] = Σ_{n=1}^{L} w(n, L) R^{(n)}_{s_t},   (15)

where

w(n, L) = (Σ_{i=1}^{n} γ^{2(i−1)})^{−1} / [Σ_{m=1}^{L} (Σ_{i=1}^{m} γ^{2(i−1)})^{−1}]   (16)

is the weight associated with the nth-step return in a trajectory of length L after time t. This estimator has the virtue of being parameter-free, since the κ values cancel. Therefore, we need not estimate κ—under the assumption of independent, Gaussian n-step returns with variances increasing according to equation (13), equation (15) is the maximum likelihood estimator for any value of κ. 
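Since κ cancels, the weights of equation (16) depend only on γ and L and can be computed in a few lines. This sketch is our own illustration of the formula:

```python
import numpy as np

def gamma_return_weights(gamma, L):
    # k(n) is proportional to sum_{i=1}^{n} gamma^(2(i-1)) (equation (14));
    # equation (16) normalizes the inverse variances over n = 1..L.
    k = np.cumsum(gamma ** (2 * np.arange(L)))  # k[n-1] for n = 1..L, up to kappa
    inv = 1.0 / k
    return inv / inv.sum()

w = gamma_return_weights(0.95, 30)
assert np.isclose(w.sum(), 1.0)
# With gamma = 1 the model gives k(n) proportional to n, so weights fall off as 1/n:
h = 1.0 / np.arange(1, 6)
assert np.allclose(gamma_return_weights(1.0, 5), h / h.sum())
```

For γ < 1, k(n) approaches 1/(1 − γ^2) as n grows, so the unnormalized weights approach the constant 1 − γ^2 rather than vanishing.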
We call this\nestimator the \u03b3-return since it weights the n-step returns according to the discount factor.\nFigure 1 compares the weightings obtained using \u03bb-return and \u03b3-return for a few example trajectory\nlengths. There are two important qualitative differences. First, the \u03bb-return weightings spike at the\nend of the trajectory, whereas the \u03b3-return weightings do not. This occurs because even though any\nsample trajectory has \ufb01nite length, the \u03bb-return as de\ufb01ned in equation (4) is actually an in\ufb01nite sum;\nthe remainder of the weight mass is allocated to the Monte Carlo return. This allows the normalizing\nfactor in equation (4) to be a constant, rather than having it depend on the length of the trajectory, as\nit does in equation (15) for the \u03b3-return. This signi\ufb01cantly complicates the problem of obtaining an\nincremental algorithm using \u03b3-return, as we shall see in later sections.\n\nFigure 1: Example weights for trajectories of various lengths for \u03bb-return (left) and \u03b3-return (right).\n\nSecond, while the \u03bb-return weightings tend to zero over time, the \u03b3-return weightings tend toward\na constant. This means that very long trajectories will be effectively \u201ccut-off\u201d after some point and\nhave effectively no contribution to the \u03bb-return, whereas after a certain length in the \u03b3-return all\nn-step returns have roughly equal weighting. 
This also complicates the problem of obtaining an incremental algorithm using γ-return: even if we were to assume infinitely many n-step returns past the end of the episode, the normalizing constant would not become finite.

Nevertheless, we can use the γ-return estimator to obtain an entire family of TDγ learning algorithms; for any TD(λ) algorithm we can derive an equivalent TDγ algorithm. In the following section, we derive TDγ, the γ-return equivalent of the original TD(λ) algorithm.

4 TDγ

Given a set of m trajectories T = {τ_1, τ_2, ..., τ_m}, let l_τ = |τ| denote the number of (s^τ_t, r^τ_t, s^τ_{t+1}) tuples in the trajectory τ. Using the approximator V̂_θ with parameters θ to approximate V, the objective function for regression is:

E(θ) = ½ Σ_{τ∈T} Σ_{t=0}^{l_τ−1} (R^γ_{s^τ_t} − V̂_θ(s^τ_t))²   (17)
     = ½ Σ_{τ∈T} Σ_{t=0}^{l_τ−1} (Σ_{n=1}^{l_τ−t} w(n, l_τ−t) R^{(n)}_{s^τ_t} − V̂_θ(s^τ_t))².   (18)

Because Σ_{n=1}^{l_τ−t} w(n, l_τ−t) = 1, we can write

E(θ) = ½ Σ_{τ∈T} Σ_{t=0}^{l_τ−1} (Σ_{n=1}^{l_τ−t} w(n, l_τ−t) R^{(n)}_{s^τ_t} − Σ_{n=1}^{l_τ−t} w(n, l_τ−t) V̂_θ(s^τ_t))²   (19)
     = ½ Σ_{τ∈T} Σ_{t=0}^{l_τ−1} (Σ_{n=1}^{l_τ−t} w(n, l_τ−t) [R^{(n)}_{s^τ_t} − V̂_θ(s^τ_t)])².   (20)

Our goal is to minimize E(θ). One approach is to descend the gradient ∇_θ E(θ), assuming that the R^{(n)}_{s^τ_t} are noisy samples of V(s^τ_t) and not a function of θ, as in the derivation of TD(λ) [7]:

Δθ = −α∇_θ E(θ) = −α Σ_{τ∈T} Σ_{t=0}^{l_τ−1} Σ_{n=1}^{l_τ−t} −w(n, l_τ−t) [R^{(n)}_{s^τ_t} − V̂_θ(s^τ_t)] ∇_θ V̂_θ(s^τ_t),   (21)

where α is a learning rate. We can substitute in n = u − t (where u is the current time step, s_t is the state we want to update the value estimate of, and n is the length of the n-step return that ends at the current time step) to get:

Δθ = −α Σ_{τ∈T} Σ_{t=0}^{l_τ−1} Σ_{u=t+1}^{l_τ} −w(u−t, l_τ−t) [R^{(u−t)}_{s^τ_t} − V̂_θ(s^τ_t)] ∇_θ V̂_θ(s^τ_t).   (22)

Swapping the sums allows us to derive the backward version of TDγ:

Δθ = −α Σ_{τ∈T} Σ_{u=1}^{l_τ} Σ_{t=0}^{u−1} −w(u−t, l_τ−t) [R^{(u−t)}_{s^τ_t} − V̂_θ(s^τ_t)] ∇_θ V̂_θ(s^τ_t).   (23)

Expanding and rearranging the terms gives us an algorithm for TDγ when using linear function approximation with weights θ:

Δθ = −α Σ_{τ∈T} Σ_{u=1}^{l_τ} Σ_{t=0}^{u−1} −w(u−t, l_τ−t) [Σ_{i=t}^{u−1} γ^{i−t} r_{s^τ_i} + γ^{u−t}(θ · φ_{s^τ_u}) − (θ · φ_{s^τ_t})] φ_{s^τ_t}   (24)
    = −α Σ_{τ∈T} Σ_{u=1}^{l_τ} Σ_{t=0}^{u−1} w(u−t, l_τ−t) [θ · (φ_{s^τ_t} − γ^{u−t} φ_{s^τ_u}) − Σ_{i=t}^{u−1} γ^{i−t} r_{s^τ_i}] φ_{s^τ_t}   (25)
    = −α Σ_{τ∈T} Σ_{u=1}^{l_τ} Σ_{t=0}^{u−1} w(u−t, l_τ−t) [θ · a − b] φ_{s^τ_t},   (26)

where φ_{s^τ_t} is the feature vector at state s^τ_t, a = φ_{s^τ_t} − γ^{u−t} φ_{s^τ_u}, and b = Σ_{i=t}^{u−1} γ^{i−t} r_{s^τ_i}. This leads to TDγ for episodic tasks (given in Algorithm 1), which eliminates the eligibility trace parameter λ. For episode length l_τ and feature vector size F, the algorithm can be implemented with time complexity of O(l_τ F) per step and space complexity O(l_τ F). 
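A direct transcription of the update in equation (26) for a single stored trajectory might look as follows. This is an illustrative sketch, assuming a linear approximation V̂_θ(s) = θ · φ_s with features stored as NumPy vectors; the helper names are ours, not the paper's.

```python
import numpy as np

def gamma_return_weights(gamma, L):
    # w(n, L) from equation (16); the kappa factors cancel.
    k = np.cumsum(gamma ** (2 * np.arange(L)))
    inv = 1.0 / k
    return inv / inv.sum()

def td_gamma_update(theta, phis, rewards, alpha, gamma):
    # One batch TD_gamma update (equation (26)) over one stored trajectory.
    # phis[0..L] are feature vectors (phis[L] at the terminal state, typically zero);
    # rewards[0..L-1] are the observed rewards.
    L = len(rewards)
    delta = np.zeros_like(theta)
    for u in range(1, L + 1):
        for t in range(u):
            w = gamma_return_weights(gamma, L - t)[u - t - 1]  # w(u - t, l_tau - t)
            a = phis[t] - gamma ** (u - t) * phis[u]
            b = sum(gamma ** (i - t) * rewards[i] for i in range(t, u))
            delta += w * (theta.dot(a) - b) * phis[t]
    return theta - alpha * delta
```

For a single transition (L = 1) this reduces to the ordinary TD(0) update θ ← θ + α(r_0 + γ θ·φ_1 − θ·φ_0)φ_0.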
Unfortunately, implementing this backward TDγ incrementally is problematic: l_τ is not known until the end of the trajectory is reached, and without it, the normalizing constant of the weights used in the updates cannot be computed. Thus, Algorithm 1 can only be used for batch updates where each episode's trajectory data is stored until an update is performed at the end of an episode; this is often undesirable, and in continuing tasks, impossible.

Algorithm 1 TDγ
Given: A discount factor, γ
1: θ ← 0
2: for each trajectory τ ∈ T do
3:   Store φ_0 in memory
4:   for u = 1 to l_τ do
5:     Store φ_u and r_{u−1} in memory
6:     δ ← 0
7:     for t = 0 to u − 1 do
8:       a ← φ_t − γ^{u−t} φ_u
9:       b ← Σ_{i=t}^{u−1} γ^{i−t} r_i
10:      δ ← δ + w(u − t, l_τ − t)[θ · a − b] φ_t
11:    end for
12:    θ ← θ − αδ
13:  end for
14:  Discard all φ and r from memory
15: end for

TD(λ) is able to achieve O(F) time and space for two reasons. First, the weight normalization is a constant and does not depend on the length of the episode. Second, the operation that must be performed on each trace is the same—a multiplication by λ. Thus, TD(λ) need only store the sum of the return estimates from each state, rather than having to store each individually.

One approach to deriving an incremental algorithm is to use only the first C n-step returns from each state, similar to truncated temporal differences [8]. 
This eliminates the \ufb01rst barrier: weight\nnormalization no longer relies on the episode length, except for the \ufb01nal C \u2212 1 states, which can\nbe corrected for after the episode ends. This approach has time complexity O(CF ) and space\ncomplexity O(CF )\u2014and is therefore C times more expensive than TD(\u03bb)\u2014and replaces \u03bb with\nthe more intuitive parameter C rather than eliminating it, but it affords the incremental TD\u03b3(C)\nalgorithm given in Algorithm 2. Note that setting C = 1 obtains TD(0), and C \u2265 l\u03c4 obtains TD\u03b3.\n\n5 Empirical Comparisons\n\nFigure 2 shows empirical comparisons of TD(\u03bb) (for various values of \u03bb), TD\u03b3 and TD\u03b3(C) for 5\ncommon benchmarks. The \ufb01rst is a 10 \u00d7 10 discrete gridworld with the goal in the bottom right,\ndeterministic movement in the four cardinal directions, a terminal reward of +10 and \u22120.5 for all\nother transitions, and \u03b3 = 1.0. For the gridworld, the agent selected one of the optimal actions with\nprobability 0.4, and each of the other actions with probability 0.2. The second domain is the 5-state\ndiscrete chain domain [1] with random transitions to the left and right, and \u03b3 = 0.95. The third\ndomain is the pendulum swing-up task [7] with a reward of 1.0 for entering a terminal state (the\npendulum is nearly vertical) and zero elsewhere, and \u03b3 = 0.95. The optimal action was selected\nwith probability 0.5, with a random action selected otherwise. The fourth domain is mountain car\n[1] with \u03b3 = 0.99, and using actions from a hand-coded policy with probability 0.75, and random\nactions otherwise. The \ufb01fth and \ufb01nal domain is the acrobot [1] with a terminal reward of 10 and\n\u22120.1 elsewhere. A random policy was used with \u03b3 = 0.95. In all cases the start state was selected\nuniformly from the set of nonterminal states. 
A 5th order Fourier basis [9] was used as the function approximator for the 3 continuous domains. We used 10, 5, 10, 3, and 10 trajectories, respectively.

Algorithm 2 TDγ(C)
Given: A discount factor, γ
A cut-off length, C
1: θ ← 0
2: for each trajectory τ ∈ T do
3:   Store φ_0 in memory
4:   for u = 1 to l_τ do
5:     If u > C, discard φ_{u−C−1}, θ_{u−C−1}, and r_{u−C−1} from memory
6:     θ_{u−1} ← θ
7:     Store φ_u, θ_{u−1}, and r_{u−1} in memory
8:     δ ← 0
9:     m = max(0, u − C)
10:    for t = m to u − 1 do
11:      a ← φ_t − γ^{u−t} φ_u
12:      b ← Σ_{i=t}^{u−1} γ^{i−t} r_i
13:      δ ← δ + w(u − t, C)[θ · a − b] φ_t
14:    end for
15:    θ ← θ − αδ
16:  end for
17:  m = min(l_τ, C)
18:  θ ← θ_{l_τ−m}
19:  for û = l_τ − m to l_τ do
20:    δ ← 0
21:    for t = m to û − 1 do
22:      a ← φ_t − γ^{û−t} φ_û
23:      b ← Σ_{i=t}^{û−1} γ^{i−t} r_i
24:      δ ← δ + w(û − t, m − t)[θ · a − b] φ_t
25:    end for
26:    θ ← θ − αδ
27:    Discard φ_û, θ_{û−1}, and r_{û−1} from memory
28:  end for
29: end for

TDγ outperforms TD(λ) for all settings of λ in 4 out of the 5 domains. In the chain domain TDγ performs better than most settings of λ but slightly worse than the optimal setting. An interesting and somewhat unexpected result is that TDγ(C) with a relatively low setting of C does at least as well as—or in some cases better than—TDγ. This could occur because the n-step returns become very similar for large n 
due to either γ discounting diminishing the difference, or to the additional one-step rewards accounting for a very small fraction of the total return. These near-identical estimates will accumulate a large fraction of the weighting (see Figure 1) and come to dominate the γ-return estimate. This suggests that once the returns start to become almost identical they should not be considered independent samples and should instead be discarded.

6 Discussion and Future Work

An immediate goal of future work is finding an automatic way to set C. We may be able to calculate bounds on the diminishing differences between n-step returns due to γ, or empirically track the point at which those differences begin to diminish. Another avenue for future research is deriving a version of TDγ or TDγ(C) that provably converges for off-policy data with function approximation, most likely using recent insights on gradient-descent based TD algorithms [10]. Thereafter, we aim to develop an algorithm based on γ-return for control rather than just prediction, for example Sarsaγ.

We have shown that the widely used λ-return formulation is the maximum-likelihood estimator of return given three assumptions (see section 2). The results presented here have shown that reevaluating just one of these assumptions results in more accurate value function approximation algorithms. We expect that re-evaluating all three will prove a fruitful avenue for future research.

Gridworld    Chain    Pendulum    Mountain Car    Acrobot

Figure 2: Mean squared error (MSE) over sample trajectories from five benchmark domains for TD(λ) with various settings of λ, TDγ, and TDγ(C), for various settings of C. Error bars are standard error over 100 samples. 
Each result is the minimum MSE (weighted by state visitation frequency) between each algorithm's approximation and the correct value function (obtained using a very large number of Monte Carlo samples), found by searching over α at increments of 0.0001.

Acknowledgments

We would like to thank David Silver, Hamid Maei, Gustavo Goretkin, Sridhar Mahadevan and Andy Barto for useful discussions. George Konidaris was supported in part by the AFOSR under grant AOARD-104135 and the Singapore Ministry of Education under a grant to the Singapore-MIT International Design Center. Scott Niekum was supported by the AFOSR under grant FA9550-08-1-0418.

References

[1] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[2] R.S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
[3] S. Singh and R.S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[4] J.A. Boyan. Least squares temporal difference learning. In Proceedings of the 16th International Conference on Machine Learning, pages 49–56, 1999.
[5] H.R. Maei and R.S. Sutton. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, 2010.
[6] C. Downey and S. Sanner. Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. In Proceedings of the 27th International Conference on Machine Learning, pages 311–318, 2010.
[7] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
[8] P. Cichosz. Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research, 2:287–318, 1995.
[9] G.D. Konidaris, S. Osentoski, and P.S. Thomas. 
Value function approximation in reinforcement\nlearning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Arti\ufb01cial\nIntelligence, pages 380\u2013385, 2011.\n\n[10] R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvari, and E. Wiewiora.\nFast gradient-descent methods for temporal-difference learning with linear function approxi-\nmation. In Proceedings of the 26th International Conference on Machine Learning, 2009.\n\n9\n\n\f", "award": [], "sourceid": 1276, "authors": [{"given_name": "George", "family_name": "Konidaris", "institution": null}, {"given_name": "Scott", "family_name": "Niekum", "institution": null}, {"given_name": "Philip", "family_name": "Thomas", "institution": null}]}