{"title": "Gradient Descent for General Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 968, "page_last": 974, "abstract": null, "full_text": "Gradient Descent for General Reinforcement Learning\n\nLeemon Baird\nleemon@cs.cmu.edu\nwww.cs.cmu.edu/~leemon\nComputer Science Department\n5000 Forbes Avenue\nCarnegie Mellon University\nPittsburgh, PA 15213-3891\n\nAndrew Moore\nawm@cs.cmu.edu\nwww.cs.cmu.edu/~awm\nComputer Science Department\n5000 Forbes Avenue\nCarnegie Mellon University\nPittsburgh, PA 15213-3891\n\nAbstract\n\nA simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms, it also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. And these algorithms converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed.\n\n1 CONVERGENCE OF GREEDY EXPLORATION\n\nMany reinforcement-learning algorithms are known that use a parameterized function approximator to represent a value function, and adjust the weights incrementally during learning. Examples include Q-learning, SARSA, and advantage learning. There are simple MDPs where the original form of these algorithms fails to converge, as summarized in Table 1.\n\nTable 1. Current convergence results for incremental, value-based RL algorithms. Residual algorithms changed every X in the first two columns to ✓. The new algorithms in this paper change every X to a ✓.\n\n[Table body not recoverable from the source text. Surviving row labels: Markov chain, MDP, POMDP. Surviving column headers: distribution, distribution, Usually-greedy distribution.]\n\n✓ = convergence guaranteed\nX = counterexample is known that either diverges or oscillates between the best and worst possible policies.\n\nFor the cases with ✓, the algorithms are guaranteed to converge under reasonable assumptions such as decaying learning rates. For the cases with X, there are known counterexamples where it will either diverge or oscillate between the best and worst possible policies, which have very different values. This can happen even with infinite training time and slowly-decreasing learning rates (Baird, 95, Gordon, 96). Each X in the first two columns can be changed to a ✓ and made to converge by using a modified form of the algorithm, the residual form (Baird 95). But this is only possible when learning with a fixed training distribution, and that is rarely practical. 
For most large problems, it is useful to explore with a policy that is usually-greedy with respect to the current value function, and that changes as the value function changes. In that case (the rightmost column of Table 1), the current convergence guarantees are not very good. One way to guarantee convergence in all three columns is to modify the algorithm so that it is performing stochastic gradient descent on some average error function, where the average is weighted by state-visitation frequencies for the current usually-greedy policy. Then the weighting changes as the policy changes. It might appear that this gradient is difficult to compute. Consider Q-learning exploring with a Boltzmann distribution that is usually greedy with respect to the learned Q function. It seems difficult to calculate gradients, since changing a single weight will change many Q values, changing a single Q value will change many action-choice probabilities in that state, and changing a single action-choice probability may affect the frequency with which every state in the MDP is visited. Although this might seem difficult, it is not. Surprisingly, unbiased estimates of the gradients of visitation distributions with respect to the weights can be calculated quickly, and the resulting algorithms can put a ✓ in every case in Table 1.\n\n2 DERIVATION OF THE VAPS EQUATION\n\nConsider a sequence of transitions observed while following a particular stochastic policy on an MDP. Let s_t = {x_0, u_0, R_0, x_1, u_1, R_1, ..., x_{t-1}, u_{t-1}, R_{t-1}, x_t, u_t, R_t} be the sequence of states, actions, and reinforcements up to time t, where performing action u_t in state x_t yields reinforcement R_t and a transition to state x_{t+1}. The stochastic policy may be a function of a vector of weights w. Assume the MDP has a single start state named x_0. If the MDP has terminal states, and x_t is a terminal state, then x_{t+1} = x_0. Let S_t be the set of all possible sequences from time 0 to t. Let e(s_t) be a given error function that calculates an error on each time step, such as the squared Bellman residual at time t, or some other error occurring at time t. If e is a function of the weights, then it must be a smooth function of the weights. Consider a period of time starting at time 0 and ending with probability P(end | s_t) after the sequence s_t occurs. The probabilities must be such that the expected squared period length is finite. Let B be the expected total error during that period, where the expectation is weighted according to the state-visitation frequencies generated by the given policy:\n\nB = \\sum_{T=0}^{\\infty} \\sum_{s_T \\in S_T} P(\\text{period ends at time } T \\text{ after trajectory } s_T) \\sum_{t=0}^{T} e(s_t) \\qquad (1)\n\n= \\sum_{t=0}^{\\infty} \\sum_{s_t \\in S_t} e(s_t) P(s_t) \\qquad (2)\n\nwhere:\n\nP(s_t) = P(u_t \\mid s_t) P(R_t \\mid s_t) \\prod_{i=0}^{t-1} P(u_i \\mid s_i) P(R_i \\mid s_i) P(s_{i+1} \\mid s_i) \\left[ 1 - P(\\text{end} \\mid s_i) \\right] \\qquad (3)\n\nNote that on the first line, for a particular s_t, the error e(s_t) will be added in to B once for every sequence that starts with s_t. Each of these terms will be weighted by the probability of a complete trajectory that starts with s_t. The sum of the probabilities of all trajectories that start with s_t is simply the probability of s_t being observed, since the period is assumed to end eventually with probability one. So the second line equals the first. The third line is the probability of the sequence, of which only the P(u_t | x_t) factor might be a function of w. 
If so, this probability must be a smooth function of the weights and nonzero everywhere. The partial derivative of B with respect to w, a particular (scalar) element of the weight vector, is:\n\n\\frac{\\partial B}{\\partial w} = \\sum_{t=0}^{\\infty} \\sum_{s_t \\in S_t} \\left[ P(s_t) \\frac{\\partial}{\\partial w} e(s_t) + e(s_t) \\frac{\\partial}{\\partial w} P(s_t) \\right] \\qquad (4)\n\n= \\sum_{t=0}^{\\infty} \\sum_{s_t \\in S_t} P(s_t) \\left[ \\frac{\\partial}{\\partial w} e(s_t) + e(s_t) \\sum_{j=1}^{t} \\frac{\\partial}{\\partial w} \\ln P(u_{j-1} \\mid s_{j-1}) \\right] \\qquad (5)\n\nSpace here is limited, and it may not be clear from the short sketch of this derivation, but summing the bracketed term of (5) over an entire period does give an unbiased estimate of the gradient of B, the expected total error during a period. An incremental algorithm to perform stochastic gradient descent on B is the weight update given on the left side of Table 2, where the summation over previous time steps is replaced with a trace T_t for each weight. This algorithm is more general than previously-published algorithms of this form, in that e can be a function of all previous states, actions, and reinforcements, rather than just the current reinforcement. This is what allows VAPS to do both value and policy search.\n\nTable 2. The general VAPS algorithm (left), and several instantiations of it (right). This single algorithm includes both value-based and policy-search approaches and their combination, and gives guaranteed convergence in every case.\n\nGeneral VAPS update (left):\n\n\\Delta w_t = -\\alpha \\left[ \\frac{\\partial}{\\partial w} e(s_t) + e(s_t) T_t \\right]\n\n\\Delta T_t = \\frac{\\partial}{\\partial w} \\ln P(u_{t-1} \\mid s_{t-1})\n\nInstantiations of e (right):\n\ne_{SARSA}(s_t) = \\tfrac{1}{2} E^2 \\left[ R_{t-1} + \\gamma Q(x_t, u_t) - Q(x_{t-1}, u_{t-1}) \\right]\n\ne_{\\text{Q-learning}}(s_t) = \\tfrac{1}{2} E^2 \\left[ R_{t-1} + \\gamma \\max_u Q(x_t, u) - Q(x_{t-1}, u_{t-1}) \\right]\n\ne_{\\text{advantage}}(s_t) = \\tfrac{1}{2} E^2 \\left[ R_{t-1} + \\gamma \\max_u A(x_t, u) - \\ldots \\right] (the remaining terms, which involve A(x_{t-1}, u_{t-1}) and \\max_{u'} A(x_{t-1}, u'), are illegible in the source)\n\ne_{\\text{value-iteration}}(s_t) = \\tfrac{1}{2} \\left[ \\max_{u_{t-1}} E \\left[ R_{t-1} + \\gamma V(x_t) \\right] - V(x_{t-1}) \\right]^2\n\ne_{\\text{SARSA-policy}}(s_t) = (1 - \\beta) e_{SARSA}(s_t) + \\beta \\left[ b - \\gamma^t R_t \\right]\n\nEvery algorithm proposed in this paper is a special case of the VAPS equation on the left side of Table 2. Note that no model is needed for this algorithm. The only probability needed in the algorithm is the policy, not the transition probability from the MDP. This is stochastic gradient descent on B, and the update rule is only correct if the observed transitions are sampled from trajectories found by following the current, stochastic policy. Both e and P should be smooth functions of w, and for any given w vector, e should be bounded. The algorithm is simple, but actually generates a large class of different algorithms depending on the choice of e and when the trace is reset to zero. For a single sequence, sampled by following the current policy, the sum of Δw along the sequence will give an unbiased estimate of the true gradient, with finite variance. Therefore, during learning, if weight updates are made at the end of each trial, and if the weights stay within a bounded region, and the learning rate approaches zero, then B will converge with probability one. Adding a weight-decay term (a constant times the 2-norm of the weight vector) onto B will prevent weight divergence for small initial learning rates. There is no guarantee that a global minimum will be found when using general function approximators, but at least it will converge. This is true for backprop as well.\n\n3 INSTANTIATING THE VAPS ALGORITHM\n\nMany reinforcement-learning algorithms are value-based; they try to learn a value function that satisfies the Bellman equation. Examples are Q-learning, which learns a value function, actor-critic algorithms, which learn a value function and the policy which is greedy with respect to it, and TD(1), which learns a value function based on future rewards. 
Other algorithms are pure policy-search algorithms; they directly learn a policy that returns high rewards. These include REINFORCE (Williams, 1988), backprop through time, learning automata, and genetic algorithms. The algorithms proposed here combine the two approaches: they perform Value And Policy Search (VAPS). The general VAPS equation is instantiated by choosing an expression for e. This can be a Bellman residual (yielding value-based), the reinforcement (yielding policy-search), or a linear combination of the two (yielding Value And Policy Search). The single VAPS update rule on the left side of Table 2 generates a variety of different types of algorithms, some of which are described in the following sections.\n\n3.1 REDUCING MEAN SQUARED RESIDUAL PER TRIAL\n\nIf the MDP has terminal states, and a trial is the time from the start until a terminal state is reached, then it is possible to minimize the expected total error per trial by resetting the trace to zero at the start of each trial. Then, a convergent form of SARSA, Q-learning, incremental value iteration, or advantage learning can be generated by choosing e to be the squared Bellman residual, as shown on the right side of Table 2. In each case, the expected value is taken over all possible (x_t, u_t, R_t) triplets, given s_{t-1}. The policy must be a smooth, nonzero function of the weights. So it could not be an ε-greedy policy that chooses the greedy action with probability (1-ε) and chooses uniformly otherwise. That would cause a discontinuity in the gradient when two Q values in a state were equal. 
But the policy could be something that approaches ε-greedy as a positive temperature c approaches zero:\n\nP(u \\mid x) = \\frac{\\epsilon}{n} + (1 - \\epsilon) \\frac{1 + e^{Q(x,u)/c}}{\\sum_{u'} \\left( 1 + e^{Q(x,u')/c} \\right)} \\qquad (6)\n\nwhere n is the number of possible actions in each state. For each instance in Table 2 other than value iteration, the gradient of e can be estimated using two independent, unbiased estimates of the expected value. For example:\n\n\\frac{\\partial}{\\partial w} e_{SARSA}(s_t) = \\left[ R_{t-1} + \\gamma Q(x_t, u_t) - Q(x_{t-1}, u_{t-1}) \\right] \\left( \\gamma \\phi \\frac{\\partial}{\\partial w} Q(x'_t, u'_t) - \\frac{\\partial}{\\partial w} Q(x_{t-1}, u_{t-1}) \\right) \\qquad (7)\n\nWhen φ=1, this is an estimate of the true gradient. When φ<1, this is a residual algorithm, as described in (Baird, 96), and it retains guaranteed convergence, but may learn more quickly than pure gradient descent for some values of φ. Note that the gradient of Q(x,u) at time t uses primed variables. That means a new state and action at time t were generated independently from the state and action at time t-1. Of course, if the MDP is deterministic, then the primed variables are the same as the unprimed. If the MDP is nondeterministic but the model is known, then the model must be evaluated one additional time to get the other state. If the model is not known, then there are three choices. First, a model could be learned from past data, and then evaluated to give this independent sample. Second, the issue could be ignored, simply reusing the unprimed variables in place of the primed variables. This may affect the quality of the learned function (depending on how random the MDP is), but does not stop convergence, and may be an acceptable approximation in practice. 
Third, all past transitions could be recorded, and the primed variables could be found by searching for all the times (x_{t-1}, u_{t-1}) has been seen before, and randomly choosing one of those transitions and using its successor state and action as the primed variables. This is equivalent to learning the certainty equivalence model, and sampling from it, and so is a special case of the first choice. For extremely large state-action spaces with many starting states, this is likely to give the same result in practice as simply reusing the unprimed variables as the primed variables. Note that when the weights do not affect the policy at all, these algorithms reduce to standard residual algorithms (Baird, 95).\n\nIt is also possible to reduce the mean squared residual per step, rather than per trial. This is done by making period lengths independent of the policy, so minimizing error per period will also minimize the error per step. For example, a period might be defined to be the first 100 steps, after which the traces are reset, and the state is returned to the start state. Note that if every state-action pair has a positive chance of being seen in the first 100 steps, then this will not just be solving a finite-horizon problem. It will actually be solving the discounted, infinite-horizon problem, by reducing the Bellman residual in every state. But the weighting of the residuals will be determined only by what happens during the first 100 steps. Many different problems can be solved by the VAPS algorithm by instantiating the definition of \"period\" in different ways.\n\n3.2 POLICY-SEARCH AND VALUE-BASED LEARNING\n\nIt is also possible to add a term that tries to maximize reinforcement directly. 
For example, e could be defined to be e_SARSA-policy rather than e_SARSA from Table 2, and the trace reset to zero after each terminal state is reached. The constant b does not affect the expected gradient, but does affect the noise distribution, as discussed in (Williams, 88). When β=0, the algorithm will try to learn a Q function that satisfies the Bellman equation, just as before. When β=1, it directly learns a policy that will maximize the expected total discounted reinforcement. The resulting \"Q function\" may not even be close to containing true Q values or to satisfying the Bellman equation; it will just give a good policy. When β is in between, this algorithm tries to both satisfy the Bellman equation and give good greedy policies. A similar modification can be made to any of the algorithms in Table 2. In the special case where β=1, this algorithm reduces to the REINFORCE algorithm (Williams, 1988). REINFORCE has been rederived for the special case of Gaussian action distributions (Tresp & Hofman, 1995), and extensions of it appear in (Marbach, 1998). This case of pure policy search is particularly interesting, because for β=1, there is no need for any kind of model or of generating two independent successors. Other algorithms have been proposed for finding policies directly, such as those given in (Gullapalli, 92) and the various algorithms from learning automata theory summarized in (Narendra & Thathachar, 89). The VAPS algorithm proposed here appears to be the first one unifying these two approaches to reinforcement learning, finding a value function that both approximates a Bellman-equation solution and directly optimizes the greedy policy.\n\n[Figure 1 plot: number of trials needed to learn (log scale, 100 to 10000) versus Beta (0 to 0.8).]\n\nFigure 1. A POMDP and the number of trials needed to learn it vs. β. A combination of policy-search and value-based RL outperforms either alone.\n\nFigure 1 shows simulation results for the combined algorithm. A run is said to have learned when the greedy policy is optimal for 1000 consecutive trials. The graph shows the average of 100 runs, with different initial random weights between ±10^-6. The learning rate was optimized separately for each β value. R=1 when leaving state A, R=2 when leaving state B or entering end, and R=0 otherwise. γ=0.9. The algorithm used was the modified Q-learning from Table 2, with exploration as in equation 6, and φ=c=1, b=0, ε=0.1. States A and B share the same parameters, so ordinary SARSA or greedy Q-learning could never converge, as shown in (Gordon, 96). When β=0 (pure value-based), the new algorithm converges, but of course it cannot learn the optimal policy in the start state, since those two Q values learn to be equal. When β=1 (pure policy-search), learning converges to optimality, but slowly, since there is no value function caching the results in the long sequence of states near the end. By combining the two approaches, the new algorithm learns much more quickly than either alone. 
\n\nIt is interesting that the VAPS algorithms described in the last three sections can be applied directly to a Partially Observable Markov Decision Process (POMDP), where the true state is hidden, and all that is available on each time step is an ambiguous \"observation\", which is a function of the true state. Normally, an algorithm such as SARSA only has guaranteed convergence when applied to an MDP. The VAPS algorithms will converge even when applied to a POMDP.\n\n4 CONCLUSION\n\nA new algorithm has been presented. Special cases of it give new algorithms similar to Q-learning, SARSA, and advantage learning, but with guaranteed convergence for a wider range of problems than was previously possible, including POMDPs. For the first time, these can be guaranteed to converge, even when the exploration policy changes during learning. Other special cases allow new approaches to reinforcement learning, where there is a tradeoff between satisfying the Bellman equation and improving the greedy policy. For one MDP, simulation showed that this combined algorithm learned more quickly than either approach alone. This unified theory, unifying for the first time both value-based and policy-search reinforcement learning, is of theoretical interest, and also was of practical value for the simulations performed. Future research with this unified framework may be able to empirically or analytically address the old question of when it is better to learn value functions and when it is better to learn the policy directly. It may also shed light on the new question of when it is best to do both at once.\n\nAcknowledgments\n\nThis research was sponsored in part by the U.S. Air Force. 
\n\nReferences\n\nBaird, L. C. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Armand Prieditis & Stuart Russell, eds. Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.\n\nGordon, G. (1996). \"Stable fitted reinforcement learning\". In G. Tesauro, M. Mozer, and M. Hasselmo (eds.), Advances in Neural Information Processing Systems 8, pp. 1052-1058. MIT Press, Cambridge, MA.\n\nGullapalli, V. (1992). Reinforcement Learning and Its Application to Control. Dissertation and COINS Technical Report 92-10, University of Massachusetts, Amherst, MA.\n\nKaelbling, L. P., Littman, M. L. & Cassandra, A. \"Planning and Acting in Partially Observable Stochastic Domains\". Artificial Intelligence, to appear. Available now at http://www.cs.brown.edu/people/lpk.\n\nMarbach, P. (1998). Simulation-Based Optimization of Markov Decision Processes. Thesis LIDS-TH 2429, Massachusetts Institute of Technology.\n\nMcCallum, A. (1995). Reinforcement learning with selective perception and hidden state. Dissertation, Department of Computer Science, University of Rochester, Rochester, NY.\n\nNarendra, K. & Thathachar, M.A.L. (1989). Learning automata: An introduction. Prentice Hall, Englewood Cliffs, NJ.\n\nTresp, V., & R. Hofman (1995). \"Missing and noisy data in nonlinear time-series prediction\". In Proceedings of Neural Networks for Signal Processing 5, F. Girosi, J. Makhoul, E. Manolakos and E. Wilson, eds., IEEE Signal Processing Society, New York, New York, 1995. pp. 1-10.\n\nWilliams, R. J. (1988). Toward a theory of reinforcement-learning connectionist systems. Technical report NU-CCS-88-3, Northeastern University, Boston, MA. 
\n\n\f", "award": [], "sourceid": 1576, "authors": [{"given_name": "Leemon", "family_name": "Baird", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}