{"title": "Multidimensional Triangulation and Interpolation for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1005, "page_last": 1011, "abstract": null, "full_text": "Multidimensional Triangulation and \n\nInterpolation for  Reinforcement  Learning \n\nScott Davies \n\nscottd@cs.cmu.edu \n\nDepartment of Computer Science,  Carnegie  Mellon  University \n\n5000  Forbes  Ave,  Pittsburgh,  PA  15213 \n\nAbstract \n\nDynamic  Programming,  Q-Iearning  and  other  discrete  Markov  Decision \nProcess  solvers  can be -applied to continuous d-dimensional state-spaces  by \nquantizing the state space into an array of boxes.  This is often problematic \nabove  two dimensions:  a  coarse  quantization can lead  to poor policies, and \nfine  quantization is too expensive.  Possible solutions are variable-resolution \ndiscretization,  or function  approximation  by  neural  nets.  A  third option, \nwhich  has  been  little  studied  in  the  reinforcement  learning  literature,  is \ninterpolation on  a  coarse  grid.  In  this  paper we  study  interpolation tech(cid:173)\nniques  that  can  result  in  vast  improvements in  the  online  behavior of the \nresulting  control  systems:  multilinear  interpolation,  and  an  interpolation \nalgorithm  based  on  an  interesting  regular  triangulation  of d-dimensional \nspace.  We  adapt  these  interpolators  under  three  reinforcement  learning \nparadigms:  (i)  offline value iteration with  a  known  model,  (ii)  Q-Iearning, \nand  (iii)  online  value  iteration  with  a  previously  unknown  model  learned \nfrom data.  We  describe empirical results,  and the resulting implications for \npractical learning of continuous non-linear dynamic control. \n\n1  GRID-BASED INTERPOLATION TECHNIQUES \nReinforcement  learning algorithms generate functions  that map states to  \"cost-t<r \ngo\"  values.  When  dealing  with  continuous  state  spaces  these  functions  must  be \napproximated.  The following  approximators are frequently  used: \n\n\u2022  Fine grids may be used  in one or two dimensions.  Above  two dimensions, \nfine  grids  are  too expensive.  Value  functions  can  be  discontinuous,  which \n(as we will see) can lead to su boptimalities even with very fine  discretization \nin two dimensions . \n\n\u2022  Neural  nets have  been  used  in conjunction  with TD  [Sutton,  1988]  and \nQ-Iearning [Watkins,  1989]  in very  high dimensional spaces [Tesauro,  1991, \nCrites  and  Barto,  1996].  While promising, it  is  not  always  clear  that they \nproduce  the  accurate  value  functions  that  might  be  needed  for  fine  near(cid:173)\noptimal control of dynamic systems, and the most commonly used  methods \nof applying value  iteration or policy iteration with a  neural-net  value func(cid:173)\ntion are often unstable.  [Boyan  and  Moore,  1995]. \n\n\f1006 \n\nS.  Davies \n\nInterpolation over points on a coarse grid is another potentially useful approximator \nfor  value  functions  that  has  been  little  studied  for  reinforcement  learning.  This \npaper attempts to rectify  this omission.  Interpolation schemes may be  particularly \nattractive  because  they  are  local  averagers,  and  convergence  has  been  proven  in \nsuch cases  for offline  value iteration [Gordon,  1995]. 
All of the interpolation methods discussed here split the state space into a regular grid of d-dimensional boxes; data points are associated with the centers or the corners of the resulting boxes. The value at a given point in the continuous state space is computed as a weighted average of neighboring data points.

1.1 MULTILINEAR INTERPOLATION

When using multilinear interpolation, data points are situated at the corners of the grid's boxes. The interpolated value within a box is an appropriately weighted average of the 2^d datapoints on that box's corners. The weighting scheme assures global continuity of the interpolated surface, and also guarantees that the interpolated value at any grid corner matches the given value of that corner.

In one-dimensional space, multilinear interpolation simply involves piecewise linear interpolation between the data points. In a higher-dimensional space, a recursive (though not terribly efficient) implementation can be described as follows:

• Pick an arbitrary axis. Project the query point along this axis to each of the two opposite faces of the box containing the query point.

• Use two (d-1)-dimensional multilinear interpolations over the 2^(d-1) datapoints on each of these two faces to calculate the values at both of these projected points.

• Linearly interpolate between the two values generated in the previous step.

Multilinear interpolation processes 2^d data points for every query, which becomes prohibitively expensive as d increases.

1.2 SIMPLEX-BASED INTERPOLATION

It is possible to interpolate over d+1 of the data points for any given query in only O(d log d) time and still achieve a continuous surface that fits the datapoints exactly. Each box is broken into d! hyperdimensional triangles, or simplexes, according to the Coxeter-Freudenthal-Kuhn triangulation [Moore, 1992].

Assume that the box is the unit hypercube, with one corner at (x_1, x_2, ..., x_d) = (0, 0, ..., 0) and the diagonally opposite corner at (1, 1, ..., 1). Then each simplex in the Kuhn triangulation corresponds to one possible permutation p of (1, 2, ..., d), and occupies the set of points satisfying

0 <= x_p(1) <= x_p(2) <= ... <= x_p(d) <= 1.

Triangulating each box into d! simplexes in this manner generates a conformal mesh: any two elements with a (d-1)-dimensional surface in common have entire faces in common, which ensures continuity across element boundaries when interpolating.

We use the Kuhn triangulation for interpolation as follows:

• Translate and scale to a coordinate system in which the box containing the query point is the unit hypercube. Let the new coordinates of the query point be (x'_1, ..., x'_d).

• Use a sorting algorithm to rank x'_1 through x'_d. This tells us the simplex of the Kuhn triangulation in which the query point lies.

• Express (x'_1, ..., x'_d) as a convex combination of the coordinates of the relevant simplex's (d+1) corners.

• Use the coefficients determined in the previous step as the weights for a weighted sum of the data values stored at the corresponding corners.

At no point do we explicitly represent the d! different simplexes. All of the above steps can be performed in O(d) time except the second, which can be done in O(d log d) time using conventional sorting routines.
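To make this procedure concrete, here is a minimal Python sketch of the four steps above. It assumes a helper value_at(corner) that returns the stored data value at a box corner given its tuple of 0/1 offsets; the function and variable names are illustrative rather than taken from the paper's implementation.

    import numpy as np

    def kuhn_simplex_interpolate(query, lo, hi, value_at):
        # `lo` and `hi` are the lower and upper corners of the grid box
        # containing `query`; `value_at(offsets)` returns the stored value at
        # a box corner, where `offsets` is a tuple of 0s and 1s.
        lo, hi, q = map(np.asarray, (lo, hi, query))
        x = (q - lo) / (hi - lo)      # step 1: rescale the box to the unit hypercube
        order = np.argsort(-x)        # step 2: sort the coordinates, O(d log d)
        # Steps 3 and 4: walk through the simplex's d+1 corners, starting at
        # (0,...,0) and switching on one coordinate at a time in sorted order.
        # The barycentric weights are 1 - x_(1), x_(1) - x_(2), ..., x_(d) - 0,
        # which are nonnegative and sum to one.
        xs = np.concatenate(([1.0], x[order], [0.0]))
        corner = np.zeros(len(x), dtype=int)
        result = 0.0
        for k in range(len(x) + 1):
            weight = xs[k] - xs[k + 1]
            result += weight * value_at(tuple(corner))
            if k < len(x):
                corner[order[k]] = 1
        return result

    # Example in two dimensions: data sampled from f(x1, x2) = x1 + 2*x2
    # on the unit square; the query is reproduced exactly (1.25).
    values = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
    print(kuhn_simplex_interpolate([0.25, 0.5], [0, 0], [1, 1], values.__getitem__))

Because the weights are the barycentric coordinates of the query point within its Kuhn simplex, only d+1 of the 2^d corner values contribute to any one query, which is what keeps the scheme cheap in high dimensions.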
2 PROBLEM DOMAINS

CAR ON HILL: In the Hillcar domain, the goal is to park a car near the top of a one-dimensional hill. The hill is steep enough that the driver needs to back up in order to gather enough speed to get to the goal. The state space is two-dimensional (position, velocity). See [Moore and Atkeson, 1995] for further details, but note that our formulation is harder than the usual one in that the goal region is restricted to a narrow range of velocities around 0, and trials start at random states. The task is specified by a reward of -1 for any action taken outside the goal region, and 0 inside the goal. No discounting is used, and two actions are available: maximum thrust backwards, and maximum thrust forwards.

ACROBOT: The Acrobot is a two-link planar robot acting in the vertical plane under gravity, with a weak actuator at its elbow joint. The shoulder is unactuated. The goal is to raise the hand to at least one link's height above the unactuated pivot [Sutton, 1996]. The state space is four-dimensional: two angular positions and two angular velocities. Trials always start from a stationary position hanging straight down. This task is formulated in the same way as the car-on-the-hill. The only actions allowed are the two extreme elbow torques.

3 APPLYING INTERPOLATION: THREE CASES

3.1 CASE I: OFFLINE VALUE ITERATION WITH A KNOWN MODEL

First, we precalculate the effect of taking each possible action from each state corresponding to a datapoint in the grid. Then, as suggested in [Gordon, 1995], we use these calculations to derive a completely discrete MDP. Taking any action from any state in this MDP results in c possible successor states, where c is the number of datapoints used per interpolation: without interpolation, c is 1; with multilinear interpolation, 2^d; with simplex-based interpolation, d+1.

We calculate the optimal policy for this derived MDP offline using value iteration [Ross, 1983]; because the value iteration can be performed on a completely discrete MDP, the calculations are much less computationally expensive than they would have been with many other kinds of function approximators. The value iteration gives us values for the datapoints of our grid, which we may then use to interpolate the values at other states during online control.
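As a rough sketch of this construction (the names are illustrative: simulate(s, a) stands in for the known model, and interp_weights(x) returns the (datapoint index, weight) pairs produced by either interpolator; neither helper is from the paper's code):

    import numpy as np

    def build_derived_mdp(grid_points, actions, simulate, interp_weights):
        # For every grid datapoint and action, run the known model once and
        # record the interpolation weights of the resulting state; the weights
        # act as transition probabilities of the derived discrete MDP.
        transitions = {}
        for i, s in enumerate(grid_points):
            for a in actions:
                next_state, reward = simulate(s, a)
                transitions[(i, a)] = (reward, interp_weights(next_state))
        return transitions

    def value_iteration(transitions, n_points, actions, tol=1e-6):
        # Undiscounted value iteration on the derived MDP; goal datapoints are
        # assumed absorbing with zero reward, as in the tasks above, so the
        # values stay bounded.
        V = np.zeros(n_points)
        while True:
            V_new = np.empty(n_points)
            for i in range(n_points):
                best = -np.inf
                for a in actions:
                    reward, successors = transitions[(i, a)]
                    best = max(best, reward + sum(w * V[j] for j, w in successors))
                V_new[i] = best
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

During online control, the converged grid values are then interpolated to evaluate the states actually visited.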
3.1.1 Hillcar Results: value iteration with known model

We tested the two interpolation methods at a variety of quantization levels by first performing value iteration offline, and then starting the car from 1000 random states and averaging the number of steps taken to the goal from those states. We also recorded the number of backups required before convergence, as well as the execution time required for the entire value iteration on an 85 MHz Sparc 5. See Figure 1 for the results. All steps-to-goal values are means with an expected error of 2 steps.

Figure 1: Hillcar: value iteration with known model.

                              Grid size
                      11^2     21^2     51^2     301^2
  None
    Steps to goal      237      131      133      120
    Backups          2.42K    15.4K     156K    14.3M
    Time (sec)         0.4      1.0      4.1      192
  Multilinear
    Steps to goal      134      116      105      107
    Backups          4.54K    18.1K     205K    17.8M
    Time (sec)         0.6      1.3      7.1      405
  Simplex
    Steps to goal      134      118      109      107
    Backups          6.17K    18.1K     195K    17.9M
    Time (sec)         0.5      1.2      5.7      328

Figure 2: Acrobot: value iteration with known model. (The original table reports steps to goal, backups, and execution time for the three interpolation methods at grid resolutions 8^4 through 15^4; several of the non-interpolated runs never converge.)

The interpolated functions require more backups for convergence, but this is amply compensated by dramatic improvement in the policy. Surprisingly, both interpolation methods provide improvements even at extremely high grid resolutions: the noninterpolated grid with 301 datapoints along each axis fared no better than the interpolated grids with only 21 datapoints along each axis(!).

3.1.2 Acrobot Results: value iteration with known model

We used the same value iteration algorithm in the acrobot domain. In this case our test trials always began from the same start state, but we ran tests for a larger set of grid sizes (Figure 2).

Grids with different resolutions place grid cell boundaries at different locations, and these boundary locations appear to be important in this problem: the performance varies unpredictably as the grid resolution changes. However, in all cases, interpolation was necessary to arrive at a satisfactory solution; without interpolation, the value iteration often failed to converge at all. With relatively coarse grids it may be that any trajectory to the goal passes through some grid box more than once, which would immediately spell disaster for any algorithm associating a constant value with that entire grid box.

Controllers using multilinear interpolation consistently fared better than those employing the simplex-based interpolation; the smoother value function provided by multilinear interpolation seems to help. However, value iteration with the simplex-based interpolation was about twice as fast as that with multilinear interpolation. In higher dimensions this speed ratio will increase.

3.2 CASE II: Q-LEARNING

Under a second reinforcement learning paradigm, we do not use any model. Rather, we learn a Q-function that directly maps state-action pairs to long-term rewards [Watkins, 1989]. Does interpolation help here too?

In this implementation we encourage exploration by optimistically initializing the Q-function to zero everywhere. After travelling a sufficient distance from our last decision point, we perform a single backup by changing the grid point values according to a perceptron-like update rule, and then we greedily select the action for which the interpolated Q-function is highest at the current state.
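The paper does not spell out the exact form of this rule, but one plausible Python rendering of an interpolated Q-learning backup is sketched below; q_table, interp_weights, the learning rate, and the handling of the accumulated segment reward are illustrative assumptions, not the paper's code.

    def interpolated_q(x, a, q_table, interp_weights):
        # Q(x, a): weighted average of the grid-point Q-values around state x.
        return sum(w * q_table[i][a] for i, w in interp_weights(x))

    def q_backup(x, a, segment_reward, x_next, q_table, interp_weights, actions, lr=0.5):
        # Perceptron-like backup (one plausible form): push the interpolated
        # Q(x, a) toward the undiscounted one-step target, distributing the
        # correction over the contributing grid points in proportion to their
        # interpolation weights.
        target = segment_reward + max(
            interpolated_q(x_next, b, q_table, interp_weights) for b in actions)
        error = target - interpolated_q(x, a, q_table, interp_weights)
        for i, w in interp_weights(x):
            q_table[i][a] += lr * w * error

    def greedy_action(x, q_table, interp_weights, actions):
        # Greedy action selection on the interpolated Q-function; with the
        # optimistic all-zero initialization this also drives exploration.
        return max(actions, key=lambda a: interpolated_q(x, a, q_table, interp_weights))

Here q_table would hold one zero-initialized value per (grid datapoint, action) pair and actions are integer indices; because the interpolated Q-value is linear in the grid-point values, scaling the correction by each point's weight is the natural gradient-style update.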
3.2.1 Hillcar Results: Q-Learning

We used Q-learning with a grid size of 11^2. Figure 3 shows learning curves for three learners using the three different interpolation techniques.

Both interpolation methods provided a significant improvement in both initial and final online performance. The learner without interpolation achieved a final average performance of about 175 steps to the goal; with multilinear interpolation, 119; with simplex-based interpolation, 122. Note that these are all significant improvements over the corresponding results for offline value iteration with a known model. Inaccuracies in the interpolated functions often cause controllers to enter cycles; because the Q-learning backups are being performed online, however, the Q-learning controller can escape from these control cycles by depressing the Q-values in the vicinities of such cycles.

3.2.2 Acrobot Results: Q-Learning

We used the same algorithms on the acrobot domain with a grid size of 15^4; results are shown in Figure 3.

Figure 3: Left: Cumulative performance of Q-learning hillcar on an 11^2 grid (multilinear interpolation comes out on top; no interpolation on the bottom). Right: Q-learning acrobot on a 15^4 grid (the two interpolations come out on top with nearly identical performance). For each learner, the y-axis shows the sum of rewards for all trials to date. The better the average performance, the shallower the gradient. Gradients are always negative because each state transition before reaching the goal results in a reward of -1.

Both Q-learners using interpolation improved rapidly, and eventually reached the goal in a relatively small number of steps per trial. The learner using multilinear interpolation eventually achieved an average of 1,529 steps to the goal per trial; the learner using simplex-based interpolation achieved 1,727 steps per trial. On the other hand, the learner not using any interpolation fared much worse, taking an average of more than 27,000 steps per trial. (A controller that chooses actions randomly typically takes about the same number of steps to reach the goal.)

Simplex-based interpolation provided online performance very close to that provided by multilinear interpolation, but at roughly half the computational cost.
3.3 CASE III: VALUE ITERATION WITH MODEL LEARNING

Here, we use a model of the system, but we do not assume that we have one to start with. Instead, we learn a model of the system as we interact with it; we assume this model is adequate and calculate a value function via the same algorithms we would use if we knew the true model. This approach may be particularly beneficial for tasks in which data is expensive and computation is cheap. Here, models are learned using very simple grid-based function approximators, without interpolation, for both the reward and transition functions of the model. The same grid resolution is used for the value function grid and the model approximator. We strongly encourage exploration by initializing the model so that every state is initially assumed to be an absorbing state with zero reward.

While making transitions through the state space, we update the model and use prioritized sweeping [Moore and Atkeson, 1993] to concentrate backups on relevant parts of the state space. We also occasionally stop to recalculate the effects of all actions under the updated model and then run value iteration to convergence. As this is fairly time-consuming, it is done rather rarely; we rely on the updates performed by prioritized sweeping to guide the system in the meantime.
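A compressed sketch of this loop under stated assumptions (a deterministic per-cell model, priorities taken directly from the size of each value change, and the periodic full value-iteration recalculation only indicated by a comment; class and method names are illustrative, not the paper's):

    import heapq

    class GridModelLearner:
        # Learns a deterministic grid-cell model online and keeps the grid-cell
        # values roughly consistent with it via prioritized sweeping.
        def __init__(self, n_cells, n_actions):
            self.n_actions = n_actions
            self.V = [0.0] * n_cells                        # optimistic initial values
            self.model = [dict() for _ in range(n_cells)]   # cell -> {action: (reward, next_cell)}
            self.preds = [set() for _ in range(n_cells)]    # cell -> {(predecessor, action)}
            self.queue = []                                 # max-priority queue via negated keys

        def _bellman(self, cell):
            best = 0.0   # untried actions keep the optimistic absorbing value of 0
            for a in range(self.n_actions):
                if a in self.model[cell]:
                    r, nxt = self.model[cell][a]
                    best = max(best, r + self.V[nxt])
            return best

        def observe(self, cell, action, reward, next_cell):
            # Record the observed transition and queue the cell for a backup.
            self.model[cell][action] = (reward, next_cell)
            self.preds[next_cell].add((cell, action))
            heapq.heappush(self.queue, (-abs(self._bellman(cell) - self.V[cell]), cell))

        def sweep(self, max_backups=5):
            # A handful of prioritized-sweeping backups after each real transition;
            # the occasional full value-iteration recalculation is omitted here.
            for _ in range(max_backups):
                if not self.queue:
                    return
                _, cell = heapq.heappop(self.queue)
                new_v = self._bellman(cell)
                change = abs(new_v - self.V[cell])
                self.V[cell] = new_v
                if change > 1e-6:
                    for pred, _a in self.preds[cell]:
                        heapq.heappush(self.queue, (-change, pred))

observe() and sweep() would be called once per real transition, while action selection still interpolates over the grid-cell values V as in the earlier cases.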
Figure 4: Left: Cumulative performance of model-learning on hillcar with an 11^2 grid. Right: Acrobot with a 15^4 grid. In both cases, multilinear interpolation comes out on top, while no interpolation winds up on the bottom.

3.3.1 Hillcar Results: value iteration with learned model

We used the algorithm described above with an 11-by-11 grid. An average of about two prioritized sweeping backups were performed per transition; the complete recalculations were performed every 1000 steps throughout the first two trials and every 5000 steps thereafter. Figure 4 shows the results for the first 500 trials.

Over the first 500 trials, the learner using simplex-based interpolation didn't fare much better than the learner using no interpolation. However, its performance on trials 1500-2500 (not shown) was close to that of the learner using multilinear interpolation, taking an average of 151 steps to the goal per trial while the learner using multilinear interpolation took 147. The learner using no interpolation did significantly worse than the others in these later trials, taking 175 steps per trial.

The model-learners' performance improved more quickly than the Q-learners' over the first few trials; on the other hand, their final performance was significantly worse than the Q-learners'.

3.3.2 Acrobot Results: value iteration with learned model

We used the same algorithm with a 15^4 grid on the acrobot domain, this time performing the complete recalculations every 10000 steps through the first two trials and every 50000 steps thereafter. Figure 4 shows the results. In this case, the learner using no interpolation took so much time per trial that the experiment was aborted early; after 100 trials, it was still taking an average of more than 45,000 steps to reach the goal. The learners using interpolation, however, fared much better. The learner using multilinear interpolation converged to a solution taking 938 steps per trial; the learner using simplex-based interpolation averaged about 2450 steps. Again, as the graphs show, these three learners initially improved significantly faster than did the Q-learners using similar grid sizes.

4 CONCLUSIONS

We have shown how two interpolation schemes, one based on a weighted average of the 2^d points in a square cell and the other on a d-dimensional triangulation, may be used in three reinforcement learning paradigms: optimal policy computation with a known model, Q-learning, and online value iteration while learning a model. In each case our empirical studies demonstrate that interpolation resoundingly decreases the grid resolution necessary for a satisfactory solution. Future extensions of this research will explore the use of variable resolution grids and triangulations, multiple low-dimensional interpolations in place of one high-dimensional interpolation in a manner reminiscent of CMAC [Albus, 1981], memory-based approximators, and more intelligent exploration.

This research was funded in part by a National Science Foundation Graduate Fellowship to Scott Davies, and a Research Initiation Award to Andrew Moore.

References

[Albus, 1981] J. S. Albus. Brains, Behaviour and Robotics. BYTE Books, McGraw-Hill, 1981.

[Boyan and Moore, 1995] J. A. Boyan and A. W. Moore. Generalization in Reinforcement Learning: Safely Approximating the Value Function. In Neural Information Processing Systems 7, 1995.

[Crites and Barto, 1996] R. H. Crites and A. G. Barto. Improving Elevator Performance using Reinforcement Learning. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Neural Information Processing Systems 8, 1996.

[Gordon, 1995] G. Gordon. Stable Function Approximation in Dynamic Programming. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, June 1995.

[Moore and Atkeson, 1993] A. W. Moore and C. G. Atkeson. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. Machine Learning, 13, 1993.

[Moore and Atkeson, 1995] A. W. Moore and C. G. Atkeson. The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. Machine Learning, 21, 1995.

[Moore, 1992] D. W. Moore. Simplicial Mesh Generation with Applications. PhD thesis, report no. 92-1322, Cornell University, 1992.

[Ross, 1983] S. Ross. Introduction to Stochastic Dynamic Programming. Academic Press, New York, 1983.

[Sutton, 1988] R. S. Sutton. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3:9-44, 1988.

[Sutton, 1996] R. S. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Neural Information Processing Systems 8, 1996.
[Tesauro, 1991] G. J. Tesauro. Practical Issues in Temporal Difference Learning. Technical Report RC 17223 (76307), IBM T. J. Watson Research Center, NY, 1991.

[Watkins, 1989] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge, May 1989.
", "award": [], "sourceid": 1229, "authors": [{"given_name": "Scott", "family_name": "Davies", "institution": null}]}