{"title": "Using Local Trajectory Optimizers to Speed Up Global Optimization in Dynamic Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 663, "page_last": 670, "abstract": null, "full_text": "Using Local Trajectory Optimizers To Speed Up Global Optimization In Dynamic Programming\n\nChristopher G. Atkeson\n\nDepartment of Brain and Cognitive Sciences and the Artificial Intelligence Laboratory\nMassachusetts Institute of Technology, NE43-771\n545 Technology Square, Cambridge, MA 02139\n617-253-0788, cga@ai.mit.edu\n\nAbstract\n\nDynamic programming provides a methodology to develop planners and controllers for nonlinear systems. However, general dynamic programming is computationally intractable. We have developed procedures that allow more complex planning and control problems to be solved. We use second order local trajectory optimization to generate locally optimal plans and local models of the value function and its derivatives. We maintain global consistency of the local models of the value function, guaranteeing that our locally optimal plans are actually globally optimal, up to the resolution of our search procedures.\n\nLearning to do the right thing at each instant in situations that evolve over time is difficult, as the future cost of actions chosen now may not be obvious immediately, and may only become clear with time. Value functions are a representational tool that makes the consequences of actions explicit. Value functions are difficult to learn directly, but they can be built up from learned models of the dynamics of the world and the cost function. This paper focuses on how fast optimizers that only produce locally optimal answers can play a useful role in speeding up the process of computing or learning a globally optimal value function. 
Consider a system with dynamics $x_{k+1} = f(x_k, u_k)$ and a cost function $L(x_k, u_k)$, where $x$ is the state of the system and $u$ is a vector of actions or controls. The subscript $k$ serves as a time index, but will be dropped in the equations that follow. A goal of reinforcement learning and optimal control is to find a policy that minimizes the total cost, which is the sum of the costs for each time step. One approach to doing this is to construct an optimal value function, $V(x)$. The value of this value function at a state $x$ is the sum of all future costs, given that the system started in state $x$ and followed the optimal policy $P(x)$ (chose optimal actions at each time step as a function of the state). A local planner or controller could choose globally optimal actions if it knew the future cost of each action. This cost is simply the sum of the cost of taking the action right now and the future cost of the state that the action leads to, which is given by the value function.\n\n$$u^* = \\arg\\min_u \\left( L(x, u) + V(f(x, u)) \\right) \\qquad (1)$$\n\nValue functions are difficult to learn. The environment does not provide training examples that pair states with their optimal cost $(x, V(x))$. In fact, it seems that the optimal policy depends on the optimal value function, which in turn depends on the optimal policy. Algorithms to compute value functions typically iteratively refine a candidate value function and/or a corresponding policy (dynamic programming). These algorithms are usually expensive. We use local optimization to generate locally optimal plans and local models of the value function and its derivatives. 
We maintain global consistency of the local models of the value function, guaranteeing that our locally optimal plans are actually globally optimal, up to the resolution of our search procedures.\n\n1 A SIMPLE EXAMPLE: A PENDULUM\n\nIn this paper we will present a simple example to make our ideas clear. Figure 1 shows a simulated set of locally optimal trajectories in phase space for a pendulum being driven by a motor at the joint from the stable to the unstable equilibrium position. S marks the start point, where the pendulum is hanging straight down, and G marks the goal point, where the pendulum is inverted (pointing straight up). The optimization criterion quadratically penalizes deviations from the goal point and the magnitude of the torques applied. In the three locally optimal trajectories shown the pendulum either swings directly up to the goal (1), moves initially away from the goal and then swings up to the goal (2), or oscillates to pump itself and then swings to the goal (3). In what follows we describe how to find these locally optimal trajectories and also how to find the globally optimal trajectory.\n\n2 LOCAL TRAJECTORY OPTIMIZATION\n\nWe base our local optimization process on dynamic programming within a tube surrounding our current best estimate of a locally optimal trajectory (Dyer and McReynolds 1970, Jacobson and Mayne 1970). 
We have a local quadratic model of the cost to get to the goal ($V$) at each time step along the optimal trajectory (assume a time step index $k$ in everything below unless otherwise indicated):\n\n$$V(x) \\approx V_0 + V_x x + \\frac{1}{2} x^T V_{xx} x \\qquad (2)$$\n\nFigure 1: Locally optimal trajectories for the pendulum swing up task.\n\nA locally optimal policy can be computed using local models of the plant (in this case local linear models) at each time step along the trajectory:\n\n$$x_{k+1} = f(x, u) \\approx A x + B u + c \\qquad (3)$$\n\nand local quadratic models of the one step cost at each time step along the trajectory:\n\n$$L(x, u) \\approx \\frac{1}{2} x^T Q x + \\frac{1}{2} u^T R u + x^T S u + t^T u \\qquad (4)$$\n\nAt each point along the trajectory the optimal policy is given by:\n\n$$u^{opt} = -(R + B^T V_{xx} B)^{-1} \\times (B^T V_{xx} A x + S^T x + B^T V_{xx} c + V_x B + t)$$\n\nOne can integrate the plant dynamics forward in time based on the above policy, and then integrate the value function and its first and second spatial derivatives backwards in time to compute an improved value function, policy, and trajectory. 
For a one step cost of the form:\n\n$$L(x, u) \\approx \\frac{1}{2} (x - x_d)^T Q (x - x_d) + \\frac{1}{2} (u - u_d)^T R (u - u_d) + (x - x_d)^T S (u - u_d)$$\n\nthe backward sweep takes the following form (in discrete time):\n\n$$Z_x = V_x A + Q (x - x_d) \\qquad (5)$$\n$$Z_u = V_x B + R (u - u_d) \\qquad (6)$$\n$$Z_{xx} = A^T V_{xx} A + Q \\qquad (7)$$\n$$Z_{ux} = B^T V_{xx} A + S \\qquad (8)$$\n$$Z_{uu} = B^T V_{xx} B + R \\qquad (9)$$\n$$K = Z_{uu}^{-1} Z_{ux} \\qquad (10)$$\n$$V_{x_{k-1}} = Z_x - Z_u K \\qquad (11)$$\n$$V_{xx_{k-1}} = Z_{xx} - Z_{xu} K \\qquad (12)$$\n\n3 STANDARD DYNAMIC PROGRAMMING\n\nA typical implementation of dynamic programming in continuous state spaces discretizes the state space into cells, and assigns a fixed control action to each cell. Larson's state increment dynamic programming (Larson 1968) is a good example of this type of approach. In Figure 2A we see the trajectory segments produced by applying the constant action in each cell, plotted on a phase space for the example problem of swinging up a pendulum.\n\n4 USING LOCAL TRAJECTORY OPTIMIZATION WITH DP\n\nWe want to minimize the number of cells used in dynamic programming by making the cells as large as possible. Combining local trajectory optimization with dynamic programming allows us to greatly reduce the resolution of the grid on which we do dynamic programming and still correctly estimate the cost to get to the goal from different parts of the space. Figure 2A shows a dynamic programming approach in which each cell contains a trajectory segment applied to the pendulum problem. Figure 2B shows our approach, which creates a set of locally optimal trajectories to the goal. By performing the local trajectory optimizations on a grid and forcing adjacent trajectories to be consistent, this local optimization process becomes a global optimization process. 
Forcing adjacent trajectories to be consistent means requiring that all trajectories can be generated from a single underlying policy. A trajectory can be made consistent with a neighbor by using the neighboring trajectory as an initial trajectory in the local optimization process, or by using the value function from the neighboring trajectory to generate the initial trajectory in the local optimization process. Each grid element stores the trajectory that starts at that point and achieves the lowest cost.\n\nThe trajectory segments in Figure 2A match the trajectories in 2B. Figures 2C and 2D are low resolution versions of the same problem. Figure 2C shows that some of the trajectory segments are no longer correct. In Figure 2D we see the locally optimal trajectories to the goal are still consistent with the trajectories in Figure 2B. Using locally optimal trajectories which go all the way to the goal as building blocks for our dynamic programming algorithm allows us to avoid the problem of correctly interpolating the cost to get to the goal function on a sparse grid. Instead, the cost to get to the goal is measured directly on the optimal trajectory from each node to the goal. We can use a much sparser grid and still converge.\n\n5 ADAPTIVE GRIDS BASED ON CONSTANT COST CONTOURS\n\nWe can limit the search by \"growing\" the volumes searched around the initial and goal states by gradually increasing a cost threshold $C_g$. We will only consider states around the goal that have a cost less than $C_g$ to get to the goal and states around the initial state that have a cost less than $C_g$ to get from the initial state to that state (Figure 3B). These two regions will increase in size as $C_g$ is increased. We stop 
We stop \n\n\fUsing Local Trajectory Optimizers to Speed Up Global Optimization \n\n667 \n\nA \n\nc \n\nFigure  2:  Different  dynamic programming techniques  (see  text). \n\nB \n\no \n\n\f668 \n\nAtkeson \n\nFigure 3:  Volumes  defined  by  a  cost  threshold. \n\nincreasing Cg  as soon  as  the two regions  come into contact.  The optimal trajectory \nhas  to be entirely  within  the  union of these  two  regions,  and has  a  cost  of 2Cg . \n\nInstead of having the initial conditions of the trajectories laid out on  a grid over the \nwhole space,  the initial conditions are laid out on  a  grid  over the surface separating \nthe inside and  the outside surfaces  of the volumes described  above.  The resolution \nof this grid  is  adaptively determined  by  checking whether  the value function  of one \ntrajectory  correctly  predicts  the  cost  of a  neighboring  trajectory.  If it  does  not, \nadditional grid  points are  added  between  the inconsistent  trajectories. \n\nDuring  this  global optimization we  separate  the state space  into  a  volume around \nthe goal which  has been  completely solved  and  the rest of the state space,  in  which \nno  exploration  or  computation  has  been  done.  Each  iteration  of  the  algorithm \nenlarges  the  completely solved  volume  by  performing dynamic programming from \na  surface  of slightly  increased  cost  to the  current  constant  cost  surface.  When  the \nsolved  volume includes  a  known  starting point or  contacts  a  similar solved  volume \nwith constant cost to get to the boundary from the starting point, a globally optimal \ntrajectory  from  the start to the goal  has  been found. \n\n6  DP BASED  ON  APPROXIMATING  CONSTANT \n\nCOST  CONTOURS \n\nUnfortunately,  adaptive grids  based  on  constant  cost  contours still suffer  from  the \ncurse  of dimensionality, having only  reduced  the  dimensionality of the  problem by \n1.  
We are currently exploring methods to approximate constant cost contours. For example, constant cost contours can be approximated by growing \"key\" trajectories. A version of this is illustrated in Figure 4. Here, trajectories were grown along the \"bottoms\" of the value function \"valleys\". The location of a constant cost contour can be estimated by using local quadratic models of the value function produced by the process which optimizes the trajectory. These approximate representations do not suffer from the curse of dimensionality. They require on the order of $T D^2$, where $T$ is the length of time the trajectory requires to get to the goal, and $D$ is the dimensionality of the state space.\n\nFigure 4: Approximate constant cost contours based on key trajectories.\n\n7 SUMMARY\n\nDynamic programming provides a methodology to plan trajectories and design controllers and estimators for nonlinear systems. However, general dynamic programming is computationally intractable. We have developed procedures that allow more complex planning problems to be solved. We have modified the State Increment Dynamic Programming approach of Larson (1968) in several ways:\n\n1. In State Increment DP, a constant action is integrated to form a trajectory segment from the center of a cell to its boundary. We use second order local trajectory optimization (Differential Dynamic Programming) to generate an optimal trajectory and form an optimal policy in a tube surrounding the optimal trajectory within a cell. The trajectory segment and local policy are globally optimal, up to the resolution of the representation of the value function on the boundary of the cell.\n\n2. 
We use the optimal policy within each cell to guide the local trajectory optimization to form a globally optimal trajectory from the center of each cell all the way to the goal. This helps us avoid the accumulation of interpolation errors as one moves from cell to cell in the state space, and avoid limitations caused by limited resolution of the representation of the value function over the state space.\n\n3. The second order trajectory optimization provides us with estimates of the value function and its first and second spatial derivatives along each trajectory. This provides a natural guide for adaptive grid approaches.\n\n4. During the global optimization we separate the state space into a volume around the goal which has been completely solved and the rest of the state space, in which no exploration or computation has been done. The surface separating these volumes is a surface of constant cost, with respect to achieving the goal.\n\n5. Each iteration of the algorithm enlarges the completely solved volume by performing dynamic programming from a surface of slightly increased cost to the current constant cost surface.\n\n6. When the solved volume includes a known starting point or contacts a similar solved volume with constant cost to get to the boundary from the starting point, a globally optimal trajectory from the start to the goal has been found. No optimal trajectory will ever leave the solved volumes. This would require the trajectory to increase rather than decrease its cost to get to the goal as it progressed.\n\n7. The surfaces of constant cost can be approximated by a representation that avoids the curse of dimensionality.\n\n8. 
The true test of this approach lies ahead: Can it produce reasonable solutions to complex problems?\n\nAcknowledgements\n\nSupport was provided under Air Force Office of Scientific Research grant AFOSR-89-0500, by the Siemens Corporation, and by the ATR Human Information Processing Research Laboratories. Support for CGA was provided by a National Science Foundation Presidential Young Investigator Award.\n\nReferences\n\nBellman, R., (1957) Dynamic Programming, Princeton University Press, Princeton, NJ.\n\nBertsekas, D.P., (1987) Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, Englewood Cliffs, NJ.\n\nDyer, P. and S.R. McReynolds, (1970) The Computation and Theory of Optimal Control, Academic Press, New York, NY.\n\nJacobson, D.H. and D.Q. Mayne, (1970) Differential Dynamic Programming, Elsevier, New York, NY.\n\nLarson, R.E., (1968) State Increment Dynamic Programming, Elsevier, New York, NY.\n", "award": [], "sourceid": 788, "authors": [{"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}