{"title": "Parallel Optimization of Motion Controllers via Policy Iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 996, "page_last": 1002, "abstract": null, "full_text": "Parallel Optimization of Motion \nControllers  via Policy Iteration \n\nJ. A.  Coelho Jr.,  R.  Sitaraman,  and  R.  A.  Grupen \n\nDepartment  of Computer Science \n\nUniversity  of Massachusetts,  Amherst,  01003 \n\nAbstract \n\nThis paper describes a  policy iteration algorithm for optimizing the \nperformance  of a  harmonic function-based  controller  with  respect \nto a  user-defined  index.  Value functions  are represented  as poten(cid:173)\ntial  distributions  over  the  problem  domain,  being  control  policies \nrepresented  as  gradient  fields  over  the  same domain.  All  interme(cid:173)\ndiate  policies  are intrinsically  safe,  i.e.  collisions  are not  promoted \nduring  the adaptation process.  The algorithm  has  efficient  imple(cid:173)\nmentation in parallel  SIMD  architectures.  One  potential  applica(cid:173)\ntion - travel distance  minimization - illustrates  its usefulness. \n\n1 \n\nINTRODUCTION \n\nHarmonic  functions  have  been  proposed  as  a  uniform  framework  for  the  solu(cid:173)\ntion  of  several  versions  of  the  motion  planning  problem.  Connolly  and  Gru(cid:173)\npen  [Connolly and Grupen,  1993]  have  demonstrated  how  harmonic  functions \ncan  be  used  to  construct  smooth,  complete  artificial  potentials  with  no  lo(cid:173)\ncal  minima. \n[Rimon and Koditschek,  1990]  for  navigation functions.  This implies  that  the gra(cid:173)\ndient  of harmonic functions  yields  smooth  (\"realizable\")  motion controllers. \n\nthese  potentials  meet  the  criteria  established  in \n\nIn  addition, \n\nBy construction,  harmonic function-based  motion controllers  will always command \nthe robot from  any initial  configuration  to a  goal configuration.  
The intermediate configurations adopted by the robot are determined by the boundary constraints and conductance properties set for the domain. Therefore, it is possible to tune both factors so as to extremize user-specified performance indices (e.g. travel time or energy) without affecting controller completeness.

Based on this idea, Singh et al. [Singh et al., 1994] devised a policy iteration method for combining two harmonic function-based control policies into a controller that minimized travel time in a given environment. The two initial control policies were derived from solutions to two distinct boundary constraints (Neumann and Dirichlet constraints). The policy space spanned by the two control policies was parameterized by a mixing coefficient that ultimately determined the obstacle avoidance behavior adopted by the robot. The resulting controller preserved obstacle avoidance, ensuring safety at every iteration of the learning procedure.

This paper addresses the question of how to adjust the conductance properties associated with the problem domain Ω so as to extremize a user-specified performance index. Initially, conductance properties are homogeneous across Ω, and the resulting controller is optimal in the sense that it minimizes collision probabilities at every step [Connolly, 1994]¹. The method proposed is a policy iteration algorithm in which the policy space is parameterized by the set of node conductances.

2 PROBLEM CHARACTERIZATION

The problem consists in constructing a path controller π₀ that maximizes an integral performance index P defined over the set of all possible paths on a lattice for a closed domain Ω ⊂ Rⁿ, subject to boundary constraints. 
The controller π₀ is responsible for generating the sequence of configurations from an initial configuration q₀ on the lattice to the goal configuration qG, therefore determining the performance index P. In formal terms, the performance index P can be defined as follows:

Def. 1 Performance index P:

P_{q₀} = Σ_{q = q₀}^{qG} f(q),  for all q₀ ∈ L(Ω),

where L(Ω) is a lattice over the domain Ω, q₀ denotes an arbitrary configuration on L(Ω), qG is the goal configuration, and f(q) is a function of the configuration q.

For example, one can define f(q) to be the available joint range associated with the configuration q of a manipulator; in this case, P would be measuring the available joint range associated with all paths generated within a given domain.

2.1 DERIVATION OF REFERENCE CONTROLLER

The derivation of π₀ is very laborious, requiring the exploration of the set of all possible paths. Out of this set, one is primarily interested in the subset of smooth paths. We propose to solve a simpler problem, in which the derived controller π̂ is a numerical approximation to the optimal controller π₀, and (1) generates smooth paths, (2) is admissible, and (3) locally maximizes P. To guarantee (1) and (2), it is assumed that the control actions of π̂ are proportional to the gradient of a harmonic function φ, represented as the voltage distribution across a resistive lattice that tessellates the domain Ω. Condition (3) is achieved through incremental changes in the set G of internodal conductances; such changes maximize P locally.

Necessary condition for optimality: Note that P_{q₀} defines a scalar field over L(Ω). 
It is assumed that there exists a well-defined neighborhood N(q) for each node q; in fact, it is assumed that every node q has two neighbors across each dimension. Therefore, it is possible to compute the gradient over the scalar field P_{q₀} by locally approximating its rate of change across all dimensions. The gradient ∇P_q defines a reference controller; in the optimal situation, the actions of the controller π̂ will parallel the actions of the reference controller.

¹This is exactly the control policy derived by the TD(0) reinforcement learning method, for the particular case of an agent travelling in a grid world with absorbing obstacle and goal states, and being rewarded only for getting to the goal states (see [Connolly, 1994]).

One can now formulate a policy iteration algorithm for the synthesis of the reference controller:

1. Compute π̂ = −∇φ, given conductances G;
2. Evaluate ∇P_q:
   - for each cell, compute P_q;
   - for each cell, compute ∇P_q.
3. Change G incrementally, minimizing the approximation error ε = f(π̂, ∇P_q);
4. If ε is below a threshold ε₀, stop. Otherwise, return to (1).

On convergence, the policy iteration algorithm will have derived a control policy that maximizes P globally and is capable of generating smooth paths to the goal configuration. The key step in the algorithm is step (3): how to reduce the current approximation error by changing the conductances G.

3 APPROXIMATION ALGORITHM

Given a set of internodal conductances, the approximation error ε is defined as

ε = − Σ_{q ∈ L(Ω)} cos(π̂, ∇P),   (1)

or the negated sum over L(Ω) of the cosine of the angle between the vectors π̂ and ∇P. 
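In code, the error of Eq. (1) between a candidate policy field and the reference gradient field can be sketched as follows (a minimal sketch of ours; the array shapes, the `eps` regularizer, and the function name are assumptions, not the paper's implementation):

```python
import numpy as np

def alignment_error(pi, grad_P, eps=1e-12):
    """Approximation error of Eq. (1): the negated sum, over all lattice
    cells, of the cosine of the angle between the policy vector pi[q]
    and the reference gradient grad_P[q].

    pi, grad_P: arrays of shape (cells, dims).  A perfectly aligned
    policy scores -cells (the minimum); an anti-aligned one scores +cells.
    """
    dot = np.sum(pi * grad_P, axis=1)                      # per-cell dot products
    norm = np.linalg.norm(pi, axis=1) * np.linalg.norm(grad_P, axis=1)
    return -np.sum(dot / (norm + eps))

# a field aligned with its reference has a lower (more negative) error
ref = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
assert alignment_error(ref, ref) < alignment_error(-ref, ref)
```

Minimizing this quantity over the conductances G therefore drives the policy π̂ toward the reference gradient field cell by cell.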
The approximation error ε is therefore a function of the set G of internodal conductances. There exist O(n dⁿ) conductances in an n-dimensional grid, where d is the discretization adopted for each dimension. Discrete search methods for the set of conductance values that minimizes ε are ruled out by the cardinality of the search space: O(k^{n dⁿ}), if k is the number of distinct values each conductance can assume. We will represent conductances as real values and use gradient descent to minimize ε, according to the approximation algorithm below:

1. Evaluate the approximation error ε;
2. Compute the gradient ∇ε = ∂ε/∂G;
3. Update conductances, making G = G − α∇ε;
4. Normalize conductances, such that the minimum conductance g_min = 1.

Step (4) guarantees that every conductance g ∈ G will be strictly positive. The conductances in a resistive grid can be normalized without constraining the voltage distribution across it, due to the linear nature of the underlying circuit. The complexity of the approximation algorithm is dominated by the computation of the gradient ∇ε(G). Each component of the vector ∇ε(G) can be expressed as

∂ε/∂g_i = − Σ_{q ∈ L(Ω)} ∂cos(π̂_q, ∇P_q)/∂g_i.   (2)

By assumption, π̂ is itself the gradient of a harmonic function φ that describes the voltage distribution across a resistive lattice. Therefore, the calculation of ∂ε/∂g_i involves the evaluation of ∂φ/∂g_i over the whole domain L(Ω), i.e. how the voltage φ_q is affected by changes in a given conductance g_i.

For n-dimensional grids, ∂φ/∂G is a matrix with dⁿ rows and O(n dⁿ) columns. 
We posit that the computation of every element of this matrix is unnecessary: the effects of changing g_i will be more pronounced in a certain grid neighborhood of i, and essentially negligible for nodes beyond that neighborhood. Furthermore, this simplification allows the original problem to be broken up into smaller, independent sub-problems suitable for simultaneous solution on parallel architectures.

3.1 THE LOCALITY ASSUMPTION

The first simplifying assumption considered in this work establishes bounds on the neighborhood affected by changes in the conductances at node i; specifically, we will assume that changes in the elements of g_i affect only the voltage at nodes in N(i), where N(i) is the set composed of node i and its direct neighbors. See [Coelho Jr. et al., 1995] for a discussion of the validity of this assumption. In particular, it is demonstrated that the effects of changing one conductance decay exponentially with grid distance, for infinite 2D grids. Local changes in resistive grids with higher dimensionality will be confined to even smaller neighborhoods.

The locality assumption simplifies the calculation of ∂ε/∂g_i to

∂ε/∂g_i ≈ − Σ_{q ∈ N(i)} ∂cos(π̂_q, ∇P_q)/∂g_i.

But

∂/∂g_i [ (π̂ · ∇P)/(|π̂| |∇P|) ] = (1/(|π̂| |∇P|)) [ (∂π̂/∂g_i) · ∇P − ((π̂ · ∇P)/|π̂|²) ((∂π̂/∂g_i) · π̂) ].

Note that in the derivation above it is assumed that changes in G affect primarily the control policy π̂, leaving ∇P relatively unaffected, at least in a first-order approximation.
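The locality assumption is easy to probe numerically. The sketch below is our own (not the paper's experiment; grid size, conductance values, and boundary placement are arbitrary): solve a small resistive grid for its node voltages via the weighted graph Laplacian, perturb a single conductance, and observe that the voltage change decays with grid distance from the perturbed link.

```python
import numpy as np

def solve_grid(n, g, fixed):
    """Node voltages on an n x n resistive grid.

    g: dict mapping undirected edges ((i,j),(k,l)) -> conductance.
    fixed: dict mapping node -> clamped voltage (Dirichlet constraints).
    Free-node voltages solve the Laplacian system L_ff phi_f = -L_fc phi_c.
    """
    nodes = [(i, j) for i in range(n) for j in range(n)]
    idx = {q: t for t, q in enumerate(nodes)}
    L = np.zeros((n * n, n * n))
    for (a, b), cond in g.items():
        ia, ib = idx[a], idx[b]
        L[ia, ia] += cond; L[ib, ib] += cond
        L[ia, ib] -= cond; L[ib, ia] -= cond
    free = [t for q, t in idx.items() if q not in fixed]
    clamped = [idx[q] for q in fixed]
    vc = np.array([fixed[q] for q in fixed])
    phi = np.zeros(n * n)
    phi[clamped] = vc
    phi[free] = np.linalg.solve(L[np.ix_(free, free)],
                                -L[np.ix_(free, clamped)] @ vc)
    return phi.reshape(n, n)

n = 9
g = {}
for i in range(n):
    for j in range(n):
        if i + 1 < n: g[((i, j), (i + 1, j))] = 1.0
        if j + 1 < n: g[((i, j), (i, j + 1))] = 1.0
fixed = {(0, 0): 0.0, (n - 1, n - 1): 1.0}
phi0 = solve_grid(n, g, fixed)
g[((4, 4), (4, 5))] = 2.0            # perturb one central conductance
dphi = np.abs(solve_grid(n, g, fixed) - phi0)
assert dphi[4, 4] > dphi[0, 8]       # effect shrinks with grid distance
```

The perturbation is felt strongly at the endpoints of the modified link and only faintly at the far corner, consistent with the exponential decay result cited above.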
\n\nGiven  that  if  =  - V~, it  follows  that  the  component  7r;  at  node  q  can  be  ap(cid:173)\nproximated by  the change of potential across  the dimension  j, as measured  by  the \npotential on the corresponding  neighboring nodes: \n\n7r\"1  =  \u00a2q- - \u00a2q+,  and  87r;  =  _1_  [8\u00a2q _ _  8\u00a2q+] \n,  q \n\n2b. 2 \n\n2b. 2 \n\n8gi' \n\n8g, \n\n8gi \n\nwhere  b.  is  the internodal distance  on the lattice L(n). \n3.2  DERIVATION  OF  G; \n\nThe  derivation  of  ~ involves  computing  the  Thevenin  equiValent  circuit  for  the \nresistive  lattice,  when  every  conductance  9  connected  to  node  i  is  removed.  For \nclarity,  a  2D resistive  grid  was  chosen  to illustrate  the  procedure.  Figure  1 depicts \nthe  equivalence  warranted  by  Thevenin's  theorem  [Chua et al.,  1987]  and  the  rel-\nevant  variables  for  the  derivation  of ~. As  shown,  the equivalent  circuit  for  the \nresistive  grid  consists  of a  four-port  resistor,  driven  by  four  independent  voltage \nsources.  The relation  between  the voltage vector i =  [\u00a2t \n\u00a24Y and the current \nvector r =  [it  ... i 4]T  is  expressed  as \n\n(3) \nwhere  R  is  the impedance matrix for  the grid equivalent  circuit  and w is  the vector \nof open-circuit  voltage sources.  The grid  equivalent  circuit  behaves exactly like  the \nwhole resistive  gridj  there is  no approximation error. \n\nRf+w, \n\n\f1000 \n\nJ. A. COELHO Jr., R. SITARAMAN, R. A. GRUPEN \n\n... + .............. . \n! \ncJl2 \n\ni cJl 3 \n\n1 i 3 \ni \n\ncJl4 \n... + .............. . \n\nGrid  Equivalent \n\nCircuit \n\nFigure  1:  Equivalence  established  by  Thevenin's  theorem. \n\ncJl o \n\nThe  derivation  of the  20  parameters  (the  elements  of Rand w)  of the  equivalent \ncircuit is  detailed in [Coelho  Jr.  et al.,  1995]j it involves a series  ofrelaxation opera(cid:173)\ntions that can be efficiently  implemented in SIMD architectures.  
The total number of relaxations for a grid with n² nodes is exactly 6n − 12, or an average of 1/(2n) relaxations per link. In the context of this paper, it is assumed that R and w are known. Our primary interest is to compute how changes in the conductances g_k affect the voltage vector Φ, i.e. the matrix

∂Φ/∂g_k = [ ∂φⱼ/∂g_k ],  for j = 1, …, 4 and k = 1, …, 4.

The elements of ∂Φ/∂g_k can be computed by differentiating each of the four equality relations in Equation 3 with respect to g_k, resulting in a system of 16 linear equations and 16 variables, the elements of ∂Φ/∂g_k. Notice that each element of I can be expressed as a linear function of the potentials Φ, by applying Kirchhoff's laws [Chua et al., 1987].

4 APPLICATION EXAMPLE

A robot moves repeatedly toward a goal configuration. Its initial configuration is not known in advance, and every configuration is equally likely to be the initial configuration. The problem is to construct a motion controller that minimizes the overall travel distance for the whole configuration space. If the configuration space Ω is discretized into a number of cells, define the combined travel distance D(π) as

D(π) = Σ_{q ∈ L(Ω)} d_{q,π},   (4)

where d_{q,π} is the travel distance from cell q to the goal configuration qG, and robot displacements are determined by the controller π. Figure 2 depicts an instance of the travel distance minimization problem, and the paths corresponding to its optimal solution, given the obstacle distribution and the goal configuration shown.

A resistive grid with 17 × 17 nodes was chosen to represent the control policies generated by our algorithm. Initially, the resistive grid is homogeneous, with all internodal resistances set to 10. 
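The combined travel distance of Eq. (4) can be sketched by following a policy's steepest-descent action from every free cell and summing the path lengths. This is our own minimal sketch, not the paper's code; the toy quadratic potential stands in for the harmonic solution, and the function name is an assumption.

```python
import numpy as np

def combined_travel_distance(phi, free, goal, max_steps=1000):
    """Eq. (4): sum over all free cells of the path length to the goal,
    where each step follows the lowest-potential neighbor of phi
    (the greedy policy induced by the potential)."""
    n = phi.shape[0]
    total = 0
    for start in zip(*np.nonzero(free)):
        q, steps = tuple(start), 0
        while q != goal and steps < max_steps:
            nbrs = [(q[0] + di, q[1] + dj)
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= q[0] + di < n and 0 <= q[1] + dj < n]
            q = min(nbrs, key=lambda p: phi[p])
            steps += 1
        total += steps
    return total

# toy potential: squared distance to the goal (its gradient points at it)
n, goal = 5, (2, 2)
ii, jj = np.meshgrid(range(n), range(n), indexing="ij")
phi = (ii - goal[0]) ** 2 + (jj - goal[1]) ** 2
free = np.ones((n, n), dtype=bool)
free[goal] = False
# with this potential each greedy step cuts the Manhattan distance by one,
# so D(pi) equals the sum of Manhattan distances to the goal
manhattan = np.abs(ii - goal[0]) + np.abs(jj - goal[1])
assert combined_travel_distance(phi, free, goal) == manhattan.sum()
```

Under the paper's setup the same routine would be driven by the harmonic potential instead of the toy quadratic; the policy iteration then tunes the conductances to shrink this sum.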
Figure 3 indicates the paths the robot takes when commanded by π̂₀, the initial control policy derived from a homogeneous resistive grid.

[Figure 2: Paths for the optimal solution of the travel distance minimization problem. Figure 3: Paths for the initial solution of the same problem.]

The conductances in the resistive grid were then adjusted over 400 steps of the policy iteration algorithm, and Figure 4 is a plot of the overall travel distance as a function of the number of steps. It also shows the optimal travel distance (horizontal line), corresponding to the optimal solution depicted in Figure 2. The plot shows that convergence is initially fast; in fact, the first 140 iterations are responsible for 90% of the overall improvement. After 400 iterations, the travel distance is within 2.8% of its optimal value. This residual error may be explained by the approximation incurred in using a discrete resistive grid to represent the potential distribution.

Figure 5 shows the paths taken by the robot after convergence. The final paths are straightened versions of the paths in Figure 3. Notice also that some of the final paths originating on the left of the I-shaped obstacle take the robot south of the obstacle, resembling the optimal paths depicted in Figure 2.

5 CONCLUSION

This paper presented a policy iteration algorithm for the synthesis of provably correct navigation functions that also extremize user-specified performance indices. 
\nThe algorithm  proposed solves  the optimal feedback  control problem,  in  which  the \nfinal control policy optimizes the performance index over the whole domain, assum(cid:173)\ning that every  state in the domain is  as likely  of being the initial state as any other \nstate. \n\nThe algorithm modifies  an existing  harmonic function-based  path controller  by in(cid:173)\ncrementally changing the conductances in a resistive  grid.  Departing from an homo(cid:173)\ngeneous  grid,  the algorithm  transforms an optimal controller  (i.e.  a  controller  that \nminimizes  collision  probabilities)  into  another  optimal  controller,  that  extremizes \nlocally  the  performance  index  of interest.  The  tradeoff  may require  reducing  the \nsafety margin between  the robot and obstacles,  but collision  avoidance is  preserved \nat each  step of the algorithm. \n\nOther  Applications:  The algorithm  presented  can  be  used  (1)  in  the  synthesis \nof time-optimal  velocity  controllers,  and  (2)  in  the optimization  of non-holonomic \npath controllers.  The algorithm can also  be a  component technology for  Intelligent \nVehicle  Highway Systems  (IVHS),  by combining  (1)  and  (2). \n\n\f1002 \n\nJ. A. COELHO Jr .\u2022  R.  SITARAMAN. R.  A. GRUPEN \n\n1170....----,----,----r----, \n\n16r-----.r-----.....---.,---., \n\n17.~ \n\n--------------~ \n\n12 \n\n1710 \n\n1680 \n\n16500L---IOO~-~200'\":--~300~-~400 \n\n12 \n\n16 \n\nFigure  4:  Overall  travel  distance,  as  a \nfunction  of iteration steps. \n\nFigure  5:  Final  paths,  after  800  policy \niteration steps. \n\nPerformance on Parallel Architectures: The proposed algorithm is  computa(cid:173)\ntionally  demandingj  however,  it is  suitable for implementation on parallel architec(cid:173)\ntures.  Its sequential implementation on a  SPARC 10  workstation requires  ~ 30 sec. \nper iteration,  for  the  example  presented.  
We estimate that a parallel implementation of the proposed example would require ≈ 4.3 ms per iteration, or 1.7 seconds for 400 iterations, given conservative speedups available on parallel architectures [Coelho Jr. et al., 1995].

Acknowledgements

This work was supported in part by grants NSF CCR-9410077, IRI-9116297, IRI-9208920, and CNPq 202107/90.6.

References

[Chua et al., 1987] Chua, L., Desoer, C., and Kuh, E. (1987). Linear and Nonlinear Circuits. McGraw-Hill, Inc., New York, NY.

[Coelho Jr. et al., 1995] Coelho Jr., J., Sitaraman, R., and Grupen, R. (1995). Control-oriented tuning of harmonic functions. Technical Report CMPSCI 95-112, Dept. of Computer Science, University of Massachusetts.

[Connolly, 1994] Connolly, C. I. (1994). Harmonic functions and collision probabilities. In Proc. 1994 IEEE Int. Conf. Robotics Automat., pages 3015-3019. IEEE.

[Connolly and Grupen, 1993] Connolly, C. I. and Grupen, R. (1993). The applications of harmonic functions to robotics. Journal of Robotic Systems, 10(7):931-946.

[Rimon and Koditschek, 1990] Rimon, E. and Koditschek, D. (1990). Exact robot navigation in geometrically complicated but topologically simple spaces. In Proc. 1990 IEEE Int. Conf. Robotics Automat., volume 3, pages 1937-1942, Cincinnati, OH.

[Singh et al., 1994] Singh, S., Barto, A., Grupen, R., and Connolly, C. (1994). Robust reinforcement learning in motion planning. In Advances in Neural Information Processing Systems 6, pages 655-662, San Francisco, CA. Morgan Kaufmann Publishers. 
\n\n\f", "award": [], "sourceid": 1117, "authors": [{"given_name": "Jefferson", "family_name": "Coelho", "institution": null}, {"given_name": "R.", "family_name": "Sitaraman", "institution": null}, {"given_name": "Roderic", "family_name": "Grupen", "institution": null}]}