{"title": "The Parti-Game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 711, "page_last": 718, "abstract": null, "full_text": "The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces\n\nAndrew W. Moore\nSchool of Computer Science\nCarnegie-Mellon University\nPittsburgh, PA 15213\n\nAbstract\n\nParti-game is a new algorithm for learning from delayed rewards in high dimensional real-valued state-spaces. In high dimensions it is essential that learning does not explore or plan over state space uniformly. Parti-game maintains a decision-tree partitioning of state-space and applies game-theory and computational geometry techniques to efficiently and reactively concentrate high resolution only on critical areas. Many simulated problems have been tested, ranging from 2-dimensional to 9-dimensional state-spaces, including mazes, path planning, non-linear dynamics, and uncurling snake robots in restricted spaces. In all cases, a good solution is found in fewer than twenty trials and a few minutes.\n\n1 REINFORCEMENT LEARNING\n\nReinforcement learning [Samuel, 1959, Sutton, 1984, Watkins, 1989, Barto et al., 1991] is a promising method for control systems to program and improve themselves. This paper addresses its biggest stumbling block: the curse of dimensionality [Bellman, 1957], in which costs increase exponentially with the number of state variables.\n\nSome earlier work [Simons et al., 1982, Moore, 1991, Chapman and Kaelbling, 1991, Dayan and Hinton, 1993] has considered recursively partitioning state-space while learning from delayed rewards. 
The new ideas in the parti-game algorithm include (i) a game-theoretic splitting criterion to robustly choose spatial resolution, (ii) real-time incremental maintenance and planning with a database of all previous experiences, and (iii) using local greedy controllers for high-level \"funneling\" actions.\n\n2 ASSUMPTIONS\n\nThe parti-game algorithm applies to difficult learning control problems in which:\n\n1. State and action spaces are continuous and multidimensional.\n2. \"Greedy\" and hill-climbing techniques would become stuck, never attaining the goal.\n3. Random exploration would be hopelessly time-consuming.\n4. The system dynamics and control laws can have discontinuities and are unknown: they must be learned.\n\nThe experiments reported later all have properties 1-4. However, the initial algorithm, described and tested here, has the following restrictions:\n\n5. Dynamics are deterministic.\n6. The task is specified by a goal, not an arbitrary reward function.\n7. The goal state is known.\n8. A \"good\" solution is required, not necessarily the optimal path. This notion of goodness can be formalized as \"the optimal path to within a given resolution of state space\".\n9. A local greedy controller is available, which we can ask to move greedily towards any desired state. There is no guarantee that a request to the greedy controller will succeed. For example, in a maze a greedy path to the goal would quickly hit a wall.\n\nFuture developments may include relatively straightforward additions to the algorithm that would remove the need for restrictions 6-9. Restriction 5 is harder to remove. 
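
Restriction 9 can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the controller used in the paper: the names greedy_step, run_greedy and wall, the 2-d point robot, the fixed step length and the wall at x = 0.55 are all hypothetical. It shows a controller that always moves straight towards a requested state and reports failure when an obstacle stops it, which is the behaviour the maze example describes.

```python
import math

def greedy_step(state, target, step, blocked):
    # Move at most one step length straight towards the target.
    dx, dy = target[0] - state[0], target[1] - state[1]
    dist = math.hypot(dx, dy)
    if dist <= step:
        nxt = target
    else:
        nxt = (state[0] + step * dx / dist, state[1] + step * dy / dist)
    # If the move would cross an obstacle, stay where we are.
    return state if blocked(state, nxt) else nxt

def run_greedy(start, target, blocked, step=0.1, max_steps=1000):
    # Returns (final_state, reached); reached is False if the controller got stuck.
    state = start
    for _ in range(max_steps):
        if state == target:
            return state, True
        nxt = greedy_step(state, target, step, blocked)
        if nxt == state:
            return state, False
        state = nxt
    return state, False

def wall(a, b):
    # Hypothetical vertical wall at x = 0.55 covering y <= 0.8 (gap above).
    if (a[0] < 0.55) == (b[0] < 0.55):
        return False
    t = (0.55 - a[0]) / (b[0] - a[0])
    return a[1] + t * (b[1] - a[1]) <= 0.8
```

Calling run_greedy((0.1, 0.2), (0.9, 0.2), wall) stops short of the wall and reports failure, while the same request with no obstacle reaches the target. In this sketch the failure is simply returned to the caller, so a higher-level planner can record it as an outcome.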
\n\n3 ESSENTIALS OF THE PARTI-GAME ALGORITHM\n\nThe state space is broken into partitions by a kd-tree [Friedman et al., 1977]. The controller can always sense its current (continuous valued) state, and can cheaply compute which partition it is in. The space of actions is also discretized so that in a partition with N neighboring partitions, there are N high-level actions. Each high-level action corresponds to a local greedy controller, aiming for the center of the corresponding neighboring partition.\n\nEach partition keeps records of all the occasions on which the system state has passed through it. Along with each record is a memory of which high-level action was used (i.e. which neighbor was aimed for) and what the outcome was. Figure 1 provides an illustration.\n\nFigure 1: Three trajectories starting in partition 1, using high-level action \"Aim at partition 2\". Partition 1 remembers three outcomes.\n(Part 1, Aim 2 -> Part 2)\n(Part 1, Aim 2 -> Part 1)\n(Part 1, Aim 2 -> Part 3)\n\nGiven this database of (partition, high-level-action, outcome) triplets, and our knowledge of the partition containing the goal state, we can try to compute the best route to the goal. The standard approach would be to model the system as a Markov Decision Task in which we empirically estimate the partition transition probabilities. However, the probabilistic interpretation of coarse resolution partitions can lead to policies which get stuck. Instead, we use a game-theoretic approach, in which we imagine an adversary. 
This adversary sees our choice of high-level action, and is allowed to select any of the observed previous outcomes of the action in this partition. Partitions are scored by minimaxing: the adversary plays to delay or prevent us from getting to the goal and we play to get to the goal as quickly as possible.\n\nWhenever the system's continuous state passes between partitions, the database of state transitions is updated and, if necessary, the minimax scores of all partitions are updated. If real-time constraints do not permit full recomputation, the updates take place incrementally in a manner similar to prioritized sweeping [Moore and Atkeson, 1993].\n\nAs well as being robust to coarseness, the game-theoretic approach also tells us where we should increase the resolution. Whenever we compute that we are in a losing partition (one from which the adversary can prevent us from ever reaching the goal) we perform resolution increase. We first compute the complete set of connected partitions which are also losing partitions. We then find the subset of these partitions which border some non-losing region. We increase the resolution of all these border partitions by splitting them along their longest axes (see footnote 1).\n\n4 INITIAL EXPERIMENTS\n\nFigure 2 shows a 2-d continuous maze. Figure 3 shows the performance of the robot during the very first trial. It begins with intense exploration to find a route out of the almost entirely enclosed start region. Having eventually reached a sufficiently high resolution, it discovers the gap and proceeds greedily towards the goal, only to be stopped by the goal's barrier region. The next barrier is traversed at a much lower resolution, mainly because the gap is larger.\n\nFigure 4 shows the second trial, started from a slightly different position. 
The policy derived from the first trial gets us to the goal without further exploration. The trajectory has unnecessary bends. This is because the controller is discretized according to the current partitioning. If necessary, a local optimizer could be used to refine this trajectory (see footnote 2).\n\nFootnote 1: More intelligent splitting criteria are under investigation.\n\nFigure 2: A 2-d maze problem. The point robot must find a path from start to goal without crossing any of the barrier lines. Remember that initially it does not know where any obstacles are, and must discover them by finding impassable states.\n\nFigure 3: The path taken during the entire first trial. See text for explanation.\n\nThe system does not explore unnecessary areas. The barrier in the top left remains at low resolution because the system has had no need to visit there. Figures 5 and 6 show what happens when we now start the system inside this barrier.\n\nFigure 7 shows a 3-d state space problem. If a standard grid were used, this would need an enormous number of states because the solution requires detailed three-point-turns. Parti-game's total exploration took 18 times as much movement as one run of the final path obtained.\n\nFigure 8 shows a 4-d problem in which a ball rolls around a tray with steep edges. The goal is on the other side of a ridge. The maximum permissible force is low, and so greedy strategies, or globally linear control rules, get stuck in a limit cycle. Parti-game's solution runs to the other end of the tray, to build up enough velocity to make it over the ridge. The exploration-length versus final-path-length ratio is 24. 
\n\nFigure 9 shows a 9-joint snake-like robot manipulator which must move to a specified configuration on the other side of a barrier. Again, no initial model is given: the controller must learn it as it explores. It takes seven trials before fixing on the solution shown. The exploration-length versus final-path-length ratio is 60.\n\nFootnote 2: Another method is to increase the resolution along the trajectory [Moore, 1991].\n\nFigure 4: The second trial.\n\nFigure 5: Starting inside the top left barrier.\n\nFigure 6: The trial after that.\n\nFigure 7: A problem with a planar rod being guided past obstacles. The state space is three-dimensional: two values specify the position of the rod's center, and the third specifies the rod's angle from the horizontal. The angle is constrained so that the rod's dotted end must always be below the other end. The rod's center may be moved a short distance (up to 1/40 of the diagram width) and its angle may be altered by up to 5 degrees, provided it does not hit a barrier in the process. Parti-game converged to the path shown below after two trials. The partitioning lines on the solution diagram only show a 2-d slice of the full kd-tree.\n\n[Table with rows Trials, Steps, Partitions: cell layout lost in extraction; surviving entries are 149, 149, 149, no change, 10.]\n\nFigure 8: A puck sliding over a hilly surface (hills shown by contours below: the surface is bowl shaped, with the lowest points nearest the center, rising steeply at the edges). 
The state space is four-dimensional: two position and two velocity variables. The controls consist of a force which may be applied in any direction, but with bounded magnitude. Convergence time was two trials.\n\nTrials: 1, 2, 3\nSteps: 2609, 115, no change\nPartitions: 13, 13 (a further extracted entry, 10, cannot be placed reliably)\n\nFigure 9: A nine-degree-of-freedom planar robot must move from the shown start configuration to the goal. The solution entails curling, rotating and then uncurling. It may not intersect with any of the barriers, the edge of the workspace, or itself. Convergence occurred after seven trials.\n\nTrials: 1, 2, 3, 4, 5, 6, 7, 8\nSteps: 1090, 430, 353, 330, 739, 200, 52 (no entry extracted for trial 8)\nPartitions: 41, 66, 67, 69, 78, 85, 85 (no entry extracted for trial 8)\n\n5 DISCUSSION\n\nPossible extensions include:\n\n\u2022 Splitting criteria that lay down splits between trajectories with spatially distinct outcomes.\n\u2022 Allowing humans to provide hints by permitting user-specified controllers (\"behaviors\") as extra high-level actions.\n\u2022 Coalescing neighboring partitions that mutually agree.\n\nWe finish by noting a promising sign involving a series of snake robot experiments with different numbers of links (but fixed total length). Intuitively, the problem should get easier with more links, but the curse of dimensionality would mean that (in the absence of prior knowledge) it becomes exponentially harder. This is borne out by the observation that random exploration with the three-link arm will stumble on the goal eventually, whereas the nine-link robot cannot be expected to do so in tractable time. 
However, Figure 10 indicates that as the dimensionality rises, the amount of exploration (and hence computation) used by parti-game does not rise exponentially. Real-world tasks may often have the same property as the snake example: the complexity of the ultimate task remains roughly constant as the number of degrees of freedom increases. If so, we may have uncovered the Achilles' heel of the curse of dimensionality.\n\nFigure 10: The number of partitions finally created against degrees of freedom for a set of snake-like robots. The kd-trees built were all highly non-uniform, typically having maximum depth nodes of twice the dimensionality. The relation between exploration time and dimensionality (not shown) had a similar shape. [Plot axes: vertical axis ticks 0 to 180; horizontal axis Dimensionality, 3 to 9.]\n\nReferences\n\n[Barto et al., 1991] A. G. Barto, S. J. Bradtke, and S. P. Singh. Real-time Learning and Control using Asynchronous Dynamic Programming. Technical Report 91-57, University of Massachusetts at Amherst, August 1991.\n\n[Bellman, 1957] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.\n\n[Chapman and Kaelbling, 1991] D. Chapman and L. P. Kaelbling. Learning from Delayed Reinforcement In a Complex Domain. Technical Report, Teleos Research, 1991.\n\n[Dayan and Hinton, 1993] P. Dayan and G. E. Hinton. Feudal Reinforcement Learning. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993.\n\n[Friedman et al., 1977] J. H. Friedman, J. L. 
Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. on Mathematical Software, 3(3):209-226, September 1977.\n\n[Moore and Atkeson, 1993] A. W. Moore and C. G. Atkeson. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. Machine Learning, 13, 1993.\n\n[Moore, 1991] A. W. Moore. Variable Resolution Dynamic Programming: Efficiently Learning Action Maps in Multivariate Real-valued State-spaces. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop. Morgan Kaufmann, June 1991.\n\n[Samuel, 1959] A. L. Samuel. Some Studies in Machine Learning using the Game of Checkers. IBM Journal of Research and Development, 3, 1959. Reprinted in E. A. Feigenbaum and J. Feldman, editors, Computers and Thought, McGraw-Hill, 1963.\n\n[Simons et al., 1982] J. Simons, H. Van Brussel, J. De Schutter, and J. Verhaert. A Self-Learning Automaton with Variable Resolution for High Precision Assembly by Industrial Robots. IEEE Trans. on Automatic Control, 27(5):1109-1113, October 1982.\n\n[Singh, 1993] S. Singh. Personal Communication, 1993.\n\n[Sutton, 1984] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, 1984.\n\n[Watkins, 1989] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge, May 1989.\n", "award": [], "sourceid": 742, "authors": [{"given_name": "Andrew", "family_name": "Moore", "institution": null}]}