{"title": "Analytical Mean Squared Error Curves in Temporal Difference Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1054, "page_last": 1060, "abstract": null, "full_text": "Analytical Mean  Squared Error  Curves \n\nin Temporal Difference Learning \n\nSatinder Singh \n\nDepartment of Computer Science \n\nUniversity of Colorado \nBoulder,  CO 80309-0430 \nbaveja@cs.colorado.edu \n\nPeter Dayan \n\nBrain and Cognitive Sciences \n\nE25-210,  MIT \n\nCambridge,  MA  02139 \nbertsekas@lids.mit.edu \n\nAbstract \n\nWe  have  calculated  analytical  expressions  for  how  the  bias  and \nvariance of the estimators provided by various temporal difference \nvalue estimation algorithms change with offline updates over trials \nin absorbing Markov chains using lookup table representations.  We \nillustrate  classes  of learning curve  behavior in various  chains,  and \nshow the manner in which TD is sensitive to the choice of its step(cid:173)\nsize  and eligibility trace parameters. \n\n1 \n\nINTRODUCTION \n\nA reassuring theory of asymptotic convergence is  available for  many reinforcement \nlearning (RL)  algorithms.  What is  not available, however,  is a  theory that explains \nthe finite-term learning curve behavior of RL algorithms, e.g., what are the different \nkinds  of learning  curves,  what  are  their  key  determinants,  and  how  do  different \nproblem parameters effect  rate of convergence.  Answering these questions is crucial \nnot only for making useful comparisons between algorithms, but also for  developing \nhybrid and new RL methods.  In this paper we  provide preliminary answers to some \nof the above questions for  the case of absorbing Markov chains,  where  mean square \nerror between the estimated and true predictions is  used  as  the quantity of interest \nin  learning curves. \n\nOur  main contribution  is  in  deriving  the  analytical  update  equations for  the  two \ncomponents of MSE,  bias and variance, for  popular Monte Carlo (MC) and TD(A) \n(Sutton,  1988) algorithms.  These  derivations are presented in a  larger paper.  Here \nwe  apply  our  theoretical  results  to  produce  analytical  learning  curves  for  TD  on \ntwo  specific  Markov  chains  chosen  to  highlight  the  effect  of various  problem  and \nalgorithm parameters, in particular the definite trade-offs  between step-size,  Q, and \neligibility-trace parameter,  A.  Although  these  results  are for  specific  problems,  we \n\n\fAnalytical MSE Curves/or TD Learning \n\n1055 \n\nbelieve that many ofthe conclusions are intuitive or have previous empirical support, \nand  may be more generally  applicable. \n\n2  ANALYTICAL  RESULTS \n\nA random walk, or trial,  in  an absorbing  Markov  chain  with  only terminal payoffs \nproduces  a  sequence  of states  terminated  by  a  payoff.  The  prediction  task  is  to \ndetermine  the expected  payoff as  a function of the start state  i,  called the optimal \nvalue  function,  and  denoted  v....  Accordingly,  vi  = E {rls1  = i},  where  St  is  the \nstate at step  t,  and r  is  the random terminal payoff.  The algorithms analysed  are \niterative  and  produce  a  sequence  of estimates of v\"  by  repeatedly  combining the \nresult from a  new  trial with the old estimate to produce a  new estimate.  
They have the form $v_i(t) = v_i(t-1) + \alpha(t)\delta_i(t)$, where $v(t) = \{v_i(t)\}$ is the estimate of the optimal value function after $t$ trials, $\delta_i(t)$ is the result for state $i$ based on random trial $t$, and the step-size $\alpha(t)$ determines how the old estimate and the new result are combined. The algorithms differ in the $\delta$s produced from a trial.

Monte Carlo algorithms use the final payoff that results from a trial to define the $\delta_i(t)$ (e.g., Barto & Duff, 1994). Therefore in MC algorithms the estimated value of a state is unaffected by the estimated value of any other state. The main contribution of TD algorithms (Sutton, 1988) over MC algorithms is that they update the value of a state based not only on the terminal payoff but also on the estimated values of the intervening states. When a state is first visited, it initiates a short-term memory process, an eligibility trace, which then decays exponentially over time with parameter $\lambda$. The amount by which the value of an intervening state combines with the old estimate is determined in part by the magnitude of the eligibility trace at that point.

In general, the initial estimate $v(0)$ could be a random vector drawn from some distribution, but often $v(0)$ is fixed to some initial value such as zero. In either case, subsequent estimates $v(t)$, $t > 0$, will be random vectors because of the random $\delta$s. The random vector $v(t)$ has a bias vector $b(t) \stackrel{\mathrm{def}}{=} E\{v(t) - v^*\}$ and a covariance matrix $C(t) \stackrel{\mathrm{def}}{=} E\{(v(t) - E\{v(t)\})(v(t) - E\{v(t)\})^T\}$. The scalar quantity of interest for learning curves is the weighted MSE as a function of trial number $t$, defined as follows:

$$\mathrm{MSE}(t) = \sum_i p_i\, E\{(v_i(t) - v^*_i)^2\} = \sum_i p_i \left(b_i^2(t) + C_{ii}(t)\right),$$

where $p_i = (\mu^T[I-Q]^{-1})_i \,/\, \sum_j (\mu^T[I-Q]^{-1})_j$ is the weight for state $i$, which is the expected number of visits to $i$ in a trial divided by the expected length of a trial$^1$ ($\mu_i$ is the probability of starting in state $i$; $Q$ is the transition matrix of the chain).

In this paper we present results just for the standard TD($\lambda$) algorithm (Sutton, 1988), but we have analysed (Singh & Dayan, 1996) various other TD-like algorithms (e.g., Singh & Sutton, 1996) and comment on their behavior in the conclusions. Our analytical results are based on two non-trivial assumptions: first that lookup tables are used, and second that the algorithm parameters $\alpha$ and $\lambda$ are functions of the trial number alone rather than also depending on the state. We also make two assumptions that we believe would not change the general nature of the results obtained here: that the estimated values are updated offline (after the end of each trial), and that the only non-zero payoffs are on the transitions to the terminal states. With the above caveats, our analytical results allow rapid computation of exact mean squared error (MSE) learning curves as a function of trial number.

$^1$ Other reasonable choices for the weights, $p_i$, would not change the nature of the results presented here.
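As a concrete reference for the procedures being analysed, here is a minimal sketch, in Python with numpy, of the lookup-table setting described above: trials are sampled from an absorbing chain whose only non-zero payoffs come on transitions into terminal states, and TD($\lambda$) with accumulating traces is applied offline. The encoding of the chain (a row-stochastic matrix P with absorbing terminal states, a payoff vector, and a start distribution mu) is an illustrative assumption, not the representation used in the software tool described below.

    import numpy as np

    def sample_trial(P, payoffs, mu, rng):
        """Sample one trial from an absorbing Markov chain.

        P is row-stochastic over all states, with terminal states absorbing
        (P[s, s] == 1); payoffs[s] is the payoff received on the transition
        into state s, non-zero only for terminal states; mu is the
        start-state distribution. Returns the non-terminal states visited
        and the terminal payoff r.
        """
        terminal = np.isclose(np.diag(P), 1.0)
        s = rng.choice(len(mu), p=mu)
        states = []
        while not terminal[s]:
            states.append(s)
            s = rng.choice(P.shape[0], p=P[s])
        return states, payoffs[s]

    def td_lambda_offline(v, states, r, alpha, lam):
        """One offline TD(lambda) update with accumulating traces.

        Per-state increments are summed during the trial and applied only
        after it ends, matching the offline-update assumption.
        """
        e = np.zeros_like(v)   # eligibility traces
        dv = np.zeros_like(v)  # accumulated increments
        for t, s in enumerate(states):
            e *= lam           # traces decay exponentially with parameter lambda
            e[s] += 1.0        # the accumulate-trace (standard TD) variant
            # one-step TD error: successor estimate, or the terminal payoff r
            v_next = r if t + 1 == len(states) else v[states[t + 1]]
            dv += alpha * (v_next - v[s]) * e
        return v + dv

A run repeats v = td_lambda_offline(v, *sample_trial(P, payoffs, mu, rng), alpha, lam) once per trial. With $\lambda = 1$ the increments telescope so that each visit to a state contributes $\alpha(r - v_i)$, recovering the every-visit Monte Carlo update; with $\lambda < 1$ the estimated values of intervening states enter through the truncated traces.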
2.1 BIAS, VARIANCE, AND MSE UPDATE EQUATIONS

The analytical update equations for the bias, variance and MSE are complex and their details are in Singh & Dayan (1996); they take the following form in outline:

$$b(t) = A^m + B^m\, b(t-1) \qquad (1)$$
$$C(t) = A^S + B^S\, C(t-1) + f^S(b(t-1)) \qquad (2)$$

where the matrix $B^m$ depends linearly on $\alpha(t)$, and $B^S$ and $f^S$ depend at most quadratically on $\alpha(t)$. We coded this detail in the C programming language to develop a software tool$^2$ whose rapid computation of exact MSE curves allowed us to experiment with many different algorithm and problem parameters on many Markov chains. Of course, one could have averaged together many empirical MSE curves obtained via simulation of these Markov chains to get approximations to the analytical MSE curves, but in many cases MSE curves that take minutes to compute analytically take days to derive empirically on the same computer for five significant digit accuracy. Empirical simulation is particularly slow in cases where the variance converges to non-zero values (because of constant step-sizes) with long tails in the asymptotic distribution of estimated values (we present an example in Figure 1c). Our analytical method, on the other hand, computes exact MSE curves for $L$ trials in $O(|\text{state space}|^3 L)$ steps regardless of the behavior of the variance and bias curves.

$^2$ The analytical MSE curve software is available via anonymous ftp from the following address: ftp.cs.colorado.edu, /users/baveja/AMse.tar.Z

2.2 ANALYTICAL METHODS

Two consequences of having the analytical forms of the equations for the update of the mean and variance are that it is possible to optimize schedules for setting $\alpha$ and $\lambda$ and, for fixed $\lambda$ and $\alpha$, to work out terminal rates of convergence for $b$ and $C$.

Computing one-step optimal $\alpha$'s: Given a particular $\lambda$, the effect on the MSE of a single step for any of the algorithms is quadratic in $\alpha$. It is therefore straightforward to calculate the value of $\alpha$ that minimises MSE(t) at the next time step. This is called the greedy value of $\alpha$. It is not clear that if one were interested in minimising MSE(t + t'), one would choose successive $\alpha(u)$ that greedily minimise MSE(t), MSE(t+1), .... In general, one could use our formulae and dynamic programming to optimise a whole schedule for $\alpha(u)$, but this is computationally challenging.

Note that this technique for setting greedy $\alpha$ assumes complete knowledge about the Markov chain and the initial bias and covariance of $v(0)$, and is therefore not directly applicable to realistic applications of reinforcement learning. Nevertheless, it is a good analysis tool to approximate omniscient optimal step-size schedules, eliminating the effect of the choice of $\alpha$ when studying the effect of $\lambda$.

Computing one-step optimal $\lambda$'s: Calculating analytically the $\lambda$ that would minimize MSE(t) given the bias and variance at trial $t-1$ is substantially harder because terms such as $[I - \lambda(t)Q]^{-1}$ appear in the expressions. However, since it is possible to compute MSE(t) for any choice of $\lambda$, it is straightforward to find to any desired accuracy the $\lambda_g(t)$ that gives the lowest resulting MSE(t). This is possible only because MSE(t) can be computed very cheaply using our analytical equations.
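In outline, these computations look as follows. This is a sketch that assumes a function update_terms(alpha, lam) returning the quantities $A^m$, $B^m$, $A^S$, $B^S$ and $f^S$ of equations 1 and 2; their exact forms are derived in Singh & Dayan (1996) and are not reproduced here, and treating $B^S$ as acting on the vectorised covariance is our notational convenience. The weights follow the definition of $p_i$ above.

    import numpy as np

    def state_weights(Q, mu):
        """Weights p_i: expected visits to state i per trial divided by the
        expected trial length; Q is the transition matrix restricted to
        non-terminal states and mu the start distribution over them."""
        visits = mu @ np.linalg.inv(np.eye(Q.shape[0]) - Q)
        return visits / visits.sum()

    def analytical_mse_curve(update_terms, b0, C0, p, alpha, lam, n_trials):
        """Iterate equations (1) and (2) to produce the exact MSE curve."""
        A_m, B_m, A_S, B_S, f_S = update_terms(alpha, lam)
        b, C = b0, C0
        curve = []
        for _ in range(n_trials):
            # both updates use the previous-trial bias b and covariance C
            b, C = (A_m + B_m @ b,
                    A_S + (B_S @ C.ravel()).reshape(C.shape) + f_S(b))
            curve.append(float(p @ (b ** 2 + np.diag(C))))
        return curve

    def greedy_lambda(update_terms, b, C, p, alpha,
                      grid=np.linspace(0.0, 1.0, 101)):
        """Line search for lambda_g(t): since MSE(t) is cheap to evaluate,
        scan a grid to any desired accuracy."""
        def mse_after_one_step(lam):
            A_m, B_m, A_S, B_S, f_S = update_terms(alpha, lam)
            b1 = A_m + B_m @ b
            C1 = A_S + (B_S @ C.ravel()).reshape(C.shape) + f_S(b)
            return float(p @ (b1 ** 2 + np.diag(C1)))
        return min(grid, key=mse_after_one_step)

The greedy $\alpha$ can be handled the same way, although since MSE(t) is exactly quadratic in $\alpha$, fitting the quadratic from three evaluations and taking its minimum is cheaper than a grid scan.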
The caveats about greediness in choosing $\alpha_g(t)$ also apply to $\lambda_g(t)$. For one of the Markov chains, we used a stochastic gradient ascent method to optimise $\lambda(u)$ and $\alpha(u)$ to minimise MSE(t + t') and found that it was not optimal to choose $\lambda_g(t)$ and $\alpha_g(t)$ at the first step.

Computing terminal rates of convergence: In the update equations 1 and 2, $b(t)$ depends linearly on $b(t-1)$ through a matrix $B^m$, and $C(t)$ depends linearly on $C(t-1)$ through a matrix $B^S$. For the case of fixed $\alpha$ and $\lambda$, the maximal and minimal eigenvalues of $B^m$ and $B^S$ determine the fact and speed of convergence of the algorithms to finite endpoints. If the modulus of the real part of any of the eigenvalues is greater than 1, then the algorithms will not converge in general. We observed that the mean update is more stable than the mean square update, i.e., appropriate eigenvalues are obtained for larger values of $\alpha$ (we call the largest feasible $\alpha$ the largest learning rate for which TD will converge). Further, we know that the mean converges to $v^*$ if $\alpha$ is sufficiently small that it converges at all, and so we can determine the terminal covariance. Just like the delta rule, these algorithms converge at best to an $\epsilon$-ball for a constant finite step-size. This amounts to the MSE converging to a fixed value, which our equations also predict. Further, by calculating the eigenvalues of $B^m$, we can calculate an estimate of the rate of decrease of the bias.
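Continuing under the update_terms stand-in from the previous sketch, the terminal quantities can be computed directly from the fixed points of equations 1 and 2, assuming the relevant eigenvalue moduli are below 1 so that the fixed points exist.

    import numpy as np

    def terminal_quantities(update_terms, p, alpha, lam):
        """Terminal weighted MSE (the epsilon-ball) and bias reduction rate
        for constant alpha and lambda; valid only when the spectra of B^m
        and B^S are stable."""
        A_m, B_m, A_S, B_S, f_S = update_terms(alpha, lam)
        n = len(A_m)
        # fixed point of equation (1): b = A_m + B_m b
        b_inf = np.linalg.solve(np.eye(n) - B_m, A_m)
        # fixed point of equation (2), with B_S acting on vec(C)
        rhs = (A_S + f_S(b_inf)).ravel()
        C_inf = np.linalg.solve(np.eye(len(rhs)) - B_S, rhs).reshape(n, n)
        mse_inf = float(p @ (b_inf ** 2 + np.diag(C_inf)))
        # the largest eigenvalue modulus of B^m bounds the rate at which
        # the bias decreases towards its fixed point
        bias_rate = float(np.max(np.abs(np.linalg.eigvals(B_m))))
        return mse_inf, bias_rate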
3 LEARNING CURVES ON SPECIFIC MARKOV CHAINS

We applied our software to two problems: a symmetric random walk (SRW), and a Markov chain for which we can control the frequency of returns to each state in a single run (we call this the cyclicity of the chain).

Figure 1: Comparing analytical and empirical MSE curves. a) Analytical and empirical learning curves obtained on the 19 state SRW problem with parameters $\alpha = 0.01$, $\lambda = 0.9$. The empirical curve was obtained by averaging together more than three million simulation runs, and the analytical and empirical MSE curves agree up to the fourth decimal place. b) A case where the empirical method fails to match the analytical learning curve after more than 15 million runs on a 5 state SRW problem. The empirical learning curve is very spiky. c) Empirical distribution plot over 15.5 million runs for the MSE at trial 198. The inset shows impulses at actual sample values greater than 100. The largest value is greater than 200000.

Agreement: First, we present empirical confirmation of our analytical equations on the 19 state SRW problem. We ran TD($\lambda$) for specific choices of $\alpha$ and $\lambda$ for more than three million simulation runs and averaged the resulting empirical weighted MSE curves. Figure 1a shows the analytical and empirical learning curves, which agree to within four decimal places.

Long-Tails of the Empirical MSE distribution: There are cases in which the agreement is apparently much worse (see Figure 1b). This is because of the surprisingly long tails of the empirical MSE distribution; Figure 1c shows an example for a 5 state SRW. This points to interesting structure that our analysis is unable to reveal.
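For contrast, the empirical route can be sketched by reusing sample_trial and td_lambda_offline from the first sketch; the true values v_star and weights p are assumed precomputed for the chain. With long-tailed MSE distributions like that of Figure 1c, this average converges very slowly, which is why millions of runs were needed above.

    import numpy as np

    def empirical_mse_curve(P, payoffs, mu, v_star, p, alpha, lam,
                            n_trials, n_runs, seed=0):
        """Average the weighted squared error over many independent runs,
        each starting from the fixed zero initial estimate."""
        rng = np.random.default_rng(seed)
        mse = np.zeros(n_trials)
        for _ in range(n_runs):
            v = np.zeros_like(v_star)
            for t in range(n_trials):
                states, r = sample_trial(P, payoffs, mu, rng)
                v = td_lambda_offline(v, states, r, alpha, lam)
                mse[t] += p @ (v - v_star) ** 2
        return mse / n_runs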
We  had expected  the \n\nasymptotic greedy value of A to be 0, since once b(t) .....  0,  then  A = \u00b0 leads to lower \nvariance  updates.  However,  Figure  3a shows  a  non-zero  asymptote  - presumably \nbecause larger learning rates can be used for  A > 0,  because of covariance.  Figure 3b \nshows,  however,  that  there  is  little advantage in  choosing  A cleverly  except  in the \nfirst  few  trials, at least if good  values of a  are  available. \n\nEigenvalue  stability  analysis:  We  analysed  the  eigenvalues  of the  covariance \nupdate  matrix  BS  (c.f.  Equation  2)  to  determine  maximal fixed  a  as  a  function \nof A.  Note  that  larger  a  tends  to lead to faster  learning,  provided  that the values \nconverge.  Figure 4a shows the largest eigenvalue of B S  as a function of A for various \n\n\fAnalytical MSE Curvesfor TD Learning \n\n1059 \n\na) \n\n1.0  ,-----~-____  -----, \n\nb) \n\nAccumulate \n\n02 \n\n0.0  ':-------,~---:-:':\":__----:-::'. \nlSO.0 \n\n100.0 \n\nSO.O \n\n0.0 \n\n--\n\nFigure 3:  Greedy  A for  a  highly  cyclic  problem.  a)  Greedy  A for  high  and  low initial  bias \n(using  greedy a).  b)  Ratio  of MSE for  given  value  of A to  that for  greedy  A at each  trial. \nThe greedy  A is  used for  every  step. \n\na) \n\n4.0  ,---~-~-~-~---, \n\n1 3.0  \\ \\  \n, \n'. \ni \n\" 12.0 \n\n5 \n\n. \n10 \n\n...., \n\n\u2022\u2022\u2022 ~:.. \n\n... - . .  \n\n\"-\u2022\u2022 \n\n0.0  ':-----:\"0--......,...,.--:\":--,.__----' \n1.0 \n\n0.' \n\n0.4 \n\n0.2 \n\n0.0 \n\n0.' \n\nA \n\nb) \n\n0.10 \n\n0.00 \n\nLargest Feasible a \n\nc) \n\n.. \n\n'-0 \n\n\" 'I \n\n-\n... ,... ........ 0.1 \n\u00b7_\u00b7 .... ,... ....... \u00b7o,01 \n\nO\u00b7OIo.~O --------,,0A':\"\". - - - - - - : \".0 \n\n0\u00b7900.'-0--02--0~.4--0~.6--0.~8 --',.0 \n\nA \n\nFigure 4:  Eigenvalue  analysis  of covariance  reduction.  a)  Maximal  modulus  of the eigen(cid:173)\nvalues of B S \u2022  These determine the rate of convergence of the variance.  Values greater than \n1 lead  to instability.  b)  Largest a  such  that the covariance is  bounded.  The inset shows  a \nblowup  for  0.9  ;:;  A ;:;  1.  Note that  A = 1 is  not optimal.  c)  Maximal  bias  reduction  rates \nas  a  function  of A,  after controlling  for  asymptotic  variance  (to 0.1  and  0.01)  by  choosing \nappropriate  a's.  Again,  A < 1 is  optimal. \n\na.  If this eigenvalue  is  larger  than  1, then  the  algorithm will  diverge  - a  behavior \nthat we  observed in our simulations.  The effect  of hypothesis H3  above is evident -\nfor larger A,  only smaller a  can be used.  Figure 4b shows this in more graphic form, \nindicating the  largest  a  that leads to stable eigenvalues for  B S .  Note  the  reversal \nvery  dose to A = 1,  which  provides more evidence  against the pure  MC  algorithm. \nThe  choice of a  and  A control  both rate of convergence  and  the  asymptotic  MSE. \nIn  Figure 4c  we  control for  the asymptotic variance by  choosing appropriate as  as \na  function  of .x  and  plot  maximal eigenvalues  of 8 m  (c.f.  Equation  1;  it  controls \nthe terminal rate of convergence  of the bias to zero)  as a  function of A.  Again, we \nsee  evidence for T Dover Me. \n\n4  CONCLUSIONS \n\nWe  have  provided analytical expressions  for  calculating how  the bias and variance \nof various TD and Monte  Carlo algorithms change over iterations.  
4 CONCLUSIONS

We have provided analytical expressions for calculating how the bias and variance of various TD and Monte Carlo algorithms change over iterations. The expressions themselves seem not to be very revealing, but we have provided many illustrations of their behavior in some example Markov chains. We have also used the analysis to calculate one-step optimal values of the step-size $\alpha$ and eligibility trace $\lambda$ parameters. Further, we have calculated terminal mean squared errors and maximal bias reduction rates. Since all these results depend on the precise Markov chains chosen, it is hard to make generalisations.

We have posited four general conjectures: H1) for constant $\lambda$, the larger $\alpha$, the larger the terminal MSE; H2) the larger $\alpha$ or $\lambda$ (except for $\lambda$ very close to 1), the faster the convergence to the asymptotic MSE, provided that this is finite; H3) the smaller $\lambda$, the smaller the range of $\alpha$ for which the terminal MSE is not excessive; H4) higher values of $\lambda$ are good for cases with high initial biases. The third of these is somewhat surprising, because the effective value of the step-size is really $\alpha/(1 - \lambda)$. However, the lower $\lambda$, the more the value of a state is based on the value estimates for nearby states. We conjecture that with small $\lambda$, large $\alpha$ can quickly lead to high correlation in the value estimates of nearby states and result in runaway variance updates.

Two main lines of evidence suggest that using values of $\lambda$ other than 1 (i.e., using a temporal difference rather than a Monte Carlo algorithm) can be beneficial. First, the greedy value of $\lambda$ chosen to minimise the MSE at the end of the step (whilst using the associated greedy $\alpha$) remains away from 1 (see Figure 3). Second, the eigenvalue analysis of $B^S$ showed that the largest value of $\alpha$ that can be used is higher for $\lambda < 1$ (also, the asymptotic speed with which the bias can be guaranteed to decrease is higher for $\lambda < 1$).

Although in this paper we have only discussed results for the standard TD($\lambda$) algorithm (called Accumulate), we have also analysed the Replace TD($\lambda$) of Singh & Sutton (1996) and various others. This analysis clearly provides only an early step to understanding the course of learning for TD algorithms, and has focused exclusively on prediction rather than control. The analytical expressions for MSE might lend themselves to general conclusions over whole classes of Markov chains, and our graphs also point to interesting unexplained phenomena, such as the apparent long tails in Figure 1c and the convergence of greedy values of $\lambda$ in Figure 3. Stronger analyses, such as those providing large deviation rates, would be desirable.

References

Barto, A.G. & Duff, M. (1994). Monte Carlo matrix inversion and reinforcement learning. NIPS 6, pp 687-694.

Singh, S.P. & Dayan, P. (1996). Analytical mean squared error curves in temporal difference learning. Machine Learning, submitted.

Singh, S.P. & Sutton, R.S. (1996). Reinforcement learning with replacing eligibility traces. Machine Learning, to appear.

Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, pp 9-44.

Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD Thesis, University of Cambridge, England.
\n\n\f", "award": [], "sourceid": 1284, "authors": [{"given_name": "Satinder", "family_name": "Singh", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}