{"title": "A Practice Strategy for Robot Learning Control", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 341, "abstract": null, "full_text": "A Practice Strategy for Robot Learning Control \n\nTerence D. Sanger \n\nDepartment of Electrical Engineering and Computer Science \n\nMassachusetts Institute of Technology, room E25-534 \n\nCambridge, MA 02139 \n\ntds@ai.mit.edu \n\nAbstract \n\n\"Trajectory Extension Learning\" is a new technique for Learning \nControl in Robots which assumes that there exists some parameter \nof the desired trajectory that can be smoothly varied from a region \nof easy solvability of the dynamics to a region of desired behavior \nwhich may have more difficult dynamics. By gradually varying the \nparameter, practice movements remain near the desired path while \na Neural Network learns to approximate the inverse dynamics. For \nexample, the average speed of motion might be varied, and the inverse \ndynamics can be \"bootstrapped\" from slow movements with \nsimpler dynamics to fast movements. This provides an example of \nthe more general concept of a \"Practice Strategy\" in which a sequence \nof intermediate tasks is used to simplify learning a complex \ntask. I show an example of the application of this idea to a real \n2-joint direct drive robot arm. \n\n1 INTRODUCTION \n\nThe most general definition of Adaptive Control is one which includes any controller \nwhose behavior changes in response to the controlled system's behavior. In practice, \nthis definition is usually restricted to modifying a small number of controller parameters \nin order to maintain system stability or global asymptotic stability of the \nerrors during execution of a single trajectory (Sastry and Bodson 1989, for review). 
\nLearning Control represents a second level of operation, since it uses Adaptive Control \nto modify parameters during repeated performance trials of a desired trajectory \nso that future trials result in greater accuracy (Arimoto et al. 1984). In this paper \nI present a third level called a \"Practice Strategy\", in which Learning Control is \napplied to a sequence of intermediate trajectories leading ultimately to the true \ndesired trajectory. I claim that this can significantly increase learning speed and \nmake learning possible for systems which would otherwise become unstable. \n\n1.1 LEARNING CONTROL \n\nDuring repeated practice of a single desired trajectory, the actual trajectory followed \nby the robot may be significantly different. Many Learning Control algorithms \nmodify the commands stored in a sequence memory to minimize this difference \n(Atkeson 1989, for review). However, the performance errors are usually measured \nin a sensory coordinate system, while command corrections must be made in the \nmotor coordinate system. If the relationship between these two coordinate systems \nis not known, then command corrections might be in the wrong direction and \ninadvertently worsen performance. However, if the practice trajectory is close to \nthe desired trajectory, then the errors will be small and the relationship between \ncommand and sensory errors can be approximated by the system Jacobian. \n\nAn alternative to a stored command sequence is to use a Neural Network to learn an \napproximation to the inverse dynamics in the region of interest (Sanner and Slotine \n1992, Yabuta and Yamada 1991, Atkeson 1989). \n
In this case, the commands and \nresults from the actual movement are used as training data for the network, and \nsmoothness properties are assumed such that the error on the desired trajectory \nwill decrease. However, a significant problem with this method is that if the actual \npractice trajectory is far from the desired trajectory, then its inverse dynamics \ninformation will be of little use in training the inverse dynamics for the desired \ntrajectory. In fact, the network may achieve perfect approximation on the actual \ntrajectory while still making significant errors on the desired trajectory. In this \ncase, learning will stop (since the training error is zero) leading to the phenomenon \nof \"learning lock-up\" (An et al. 1988). So whether Learning Control uses a sequence \nmemory or a Neural Network, learning may proceed poorly if large errors are made \nduring the initial practice movements. \n\n1.2 PRACTICE STRATEGIES \n\nI define a \"practice strategy\" as a sequence of trajectories such that the first element \nin the sequence is any previously learned trajectory, and the last element in the \nsequence is the ultimate desired trajectory. A well designed practice strategy will \nresult in a sequence for which learning control of the trajectory for any particular step \nis simplified if prior steps have already been learned. This will occur if learning of \nprior trajectories reduces the initial performance error for subsequent trajectories, \nso that a network will be less likely to experience learning lock-up. \n\nOne example of a practice strategy is a three-step sequence in which the intermediate \nstep is a set of independently executable subtasks which partition the desired \ntrajectory into discrete pieces. \n
Another example is a multi-step sequence in which \nintermediate steps are a set of trajectories which are somehow related to the desired \ntrajectory. In this paper I present a multi-step sequence which gradually \ntransforms some known trajectory into the desired trajectory by varying a single \nparameter. This method has the advantage of not requiring detailed knowledge of \nthe task structure in order to break it up into meaningful subtasks, and conditions \nfor convergence can be stated explicitly. It has a close relationship to Continuation \nMethods for solving differential equations, and can be considered to be a particular \napplication of the Banach Extension Theorem. \n\nFigure 1: Training signals for network learning. \n\n2 METHODS \n\nAs in (Sanger 1992), we need to specify four aspects of the use of a neural network \nwithin a control system: \n\n1. the network's function in the control system, \n\n2. the network learning algorithm which modifies the connection weights, \n\n3. the training signals used for network learning, and \n\n4. the practice strategy used to generate sample movements. \n\nThe network's function is to learn the inverse dynamics of an equilibrium-point controlled \nplant (Shadmehr 1990). The LMS-tree learning algorithm trains the network \n(Sanger 1991b, Sanger 1991a). The training signals are determined from the actual \npractice data using either \"Actual Trajectory Training\" or \"Desired Trajectory \nTraining\", as defined below. The practice strategy is \"Trajectory Extension \nLearning\", in which a parameter of the movement is gradually modified during \ntraining. 
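
The practice strategy named above can be illustrated with a minimal sketch. It is not the LMS-tree network or the robot plant used in this paper: the single-gain network, the toy plant u / (1 + alpha), the function names plant and practice, and the schedule of alpha values are all assumptions made for illustration. The loop raises alpha in small steps and reuses the weight learned at the previous step, so each new step begins with a small error.

```python
import math

def plant(u, alpha):
    # Hypothetical toy plant (an assumption, not the robot arm):
    # the output is attenuated more strongly as alpha grows.
    return [ui / (1.0 + alpha) for ui in u]

def practice(y_des, alphas, trials_per_step=20, gamma=0.5, w=1.0):
    # The 'network' is a single gain w; w = 1 is the exact inverse at alpha = 0.
    history = []
    for alpha in alphas:                      # gradually vary the parameter
        for _ in range(trials_per_step):      # repeated practice trials
            u_hat = [w * y for y in y_des]    # estimated command
            y_hat = plant(u_hat, alpha)       # actual trajectory
            # Training error: known command minus network response to y_hat.
            # Normalized LMS step applied at the inputs y_hat.
            num = sum((u - w * y) * y for u, y in zip(u_hat, y_hat))
            den = sum(y * y for y in y_hat)
            w += gamma * num / den
        y_final = plant([w * y for y in y_des], alpha)
        err = max(abs(a - b) for a, b in zip(y_final, y_des))
        history.append((alpha, err))          # tracking error after this step
    return w, history

y_des = [math.sin(2 * math.pi * k / 50) for k in range(50)]
alphas = [0.2 * k for k in range(11)]         # alpha from 0.0 up to 2.0
w, history = practice(y_des, alphas)
```

In this toy setting each increment of 0.2 in alpha leaves the learned gain only slightly wrong for the new plant, which is the property the practice strategy relies on. A single jump to the final alpha would begin from a much larger error.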
\n\n2.1 TRAINING SIGNALS \n\nFigure 1 shows the general structure of the network and training signals. A desired \ntrajectory y is fed into the network N to yield an estimated command û. This \ncommand is then applied to the plant P_α, where the subscript indicates that the \nplant is parameterized by the variable α. Although the true command u which \nachieves y is unknown, we do know that the estimated command û produces ŷ, so \nthese signals are used for training by comparing the network response to ŷ, given by \nũ = Nŷ, to the known value û and subtracting these to yield the training error δ_ŷ. \nNormally, network training would use this error signal to modify the network output \nfor inputs near ŷ, and I refer to this as \"Actual Trajectory Training\". However, if \nŷ is far from y then no change in response may occur at y and this may lead even \nmore quickly to learning lock-up. Therefore an alternative is to use the error δ_ŷ to \ntrain the network output for inputs near y. I refer to this as \"Desired Trajectory \nTraining\", and in the figure it is represented by the dotted arrow. \n\nThe following discussion will summarize the convergence conditions and theorems \npresented in (Sanger 1992). \n\nDefine \n\nR u ≜ (1 - N P(x)) u = u - û \n\nto be an operator which maps commands into command errors for states x on the \ndesired trajectory. Similarly, let \n\nR̂ u = (1 - N P(x̂)) u = u - ũ \n\nmap commands into command errors for states x̂ on the actual trajectory. \n\nConvergence depends upon the following assumptions: \n\nA1: The plant P is smooth and invertible with respect to both the state x and the \ninput u with Lipschitz constants k_x and k_u, and it has stable zero-dynamics. \n\nA2: The network N is smooth with Lipschitz constant k_N. \n\nA3: Network learning reduces the error in response to a pair (y, δ_y). 
\n\nA4: The change in network output in response to training is smooth with Lipschitz \nconstant k_L. \n\nA5: There exists a smoothly controllable parameter α such that an inverse dynamics \nsolution is available at α = α_0, and the desired performance occurs \nwhen α = α_d. \n\nA6: The change in command required to produce a desired output after any change \nin α is bounded by the change in α multiplied by a constant k_α. \n\nA7: The change in plant response for any fixed input is bounded by the change in \nα multiplied by a constant k_p. \n\nUnder assumptions A1-A3 we can prove convergence of Desired Trajectory Training: \n\nTheorem 1: \nIf there exists a k_Rn such that \n\n‖R_n u - R_n û‖ < k_Rn ‖u - û‖ \n\nthen if the learning rate 0 < γ ≤ 1, \n\nIf k_Rn < 1 and γ ≤ 1, then the network output û approaches the correct command \nu. \n\nUnder assumptions A1-A4, we can prove convergence of Actual Trajectory Training: \n\nTheorem 2: \nIf there exists a k_Rn such that \n\nthen if the learning rate 0 < γ ≤ 1, \n\n‖R_n u - R_n û‖ < k_Rn ‖u - û‖ \n\n2.2 TRAJECTORY EXTENSION LEARNING \n\nLet α be some modifiable parameter of the plant such that for α = α_0 there exists \na simple inverse dynamics solution, and we seek a solution when α = α_d. For example, \nif the plant uses Equilibrium Point Control (Shadmehr 1990), then at low \nspeeds the inverse dynamics behave like a perfect servo controller yielding desired \ntrajectories without the need to solve the dynamics. We can continue to train a \nlearning controller as the average speed of movement (α) is gradually increased. 
\nThe inverse dynamics learned at one speed provide an approximation to the inverse \ndynamics for a slightly faster speed, and thus the performance errors remain small \nduring practice. This leads to significantly faster learning rates and greater likelihood \nthat the conditions for convergence at any given speed will be satisfied. Note \nthat unlike traditional learning schemes, the error does not decrease monotonically \nwith practice, but instead maintains a steady magnitude as the speed increases, \nuntil the network is no longer able to approximate the inverse dynamics. \n\nThe following is a summary of a result from (Sanger 1992). Let α change from α_1 \nto α_2, and let P = P_α1 and P' = P_α2. Then under assumptions A1-A7 we can \nprove convergence of Trajectory Extension Learning: \n\nTheorem 3: \nIf there exists a k_R such that for α = α_1 \n\nthen for α = α_2 \n\n‖R'u' - R'û‖ < k_R ‖u' - û‖ + (2k_α + k_N k_p)|α_2 - α_1| \n\nThis shows that given the smoothness assumptions and a small enough change in \nα, the error will continue to decrease. \n\n3 EXAMPLE \n\nFigure 2 shows the result of 15 learning trials performed by a real direct-drive two-joint \nrobot arm on a sampled desired trajectory. The initial trial required 11.5 \nseconds to execute, and the speed was gradually increased until the final trial required \nonly 4.5 seconds. Simulated equilibrium point control was used (Bizzi et \nal. 1984) with stiffness and damping coefficients of 15 N·m/rad and 1.5 N·m/(rad/sec), \nrespectively. The grey line in figure 2 shows the equilibrium point control signal \nwhich generated the actual movement represented by the solid line. The difference \nbetween these two indicates the nontrivial nature of the dynamics calculations required \nto derive the control signal from the desired trajectory. \n
Note that without \nTrajectory Extension Learning, the network does not converge and the arm becomes \nunstable. The neural network was an LMS tree (Sanger 1991b, Sanger 1991a) with \n10 Gaussian basis functions for each of the 6 input dimensions, and a total of 15 \nsubtrees were grown per joint (see (Sanger 1992) for further explanation). \n\n4 CONCLUSION \n\nTrajectory Extension Learning is one example of the way in which a practice strategy \ncan be used to improve convergence for Learning Control. This or other types \nof practice strategies might be able to increase the performance of many different \ntypes of learning algorithms both within and outside the Control domain. Such \nstrategies may also provide a theoretical model for the practice strategies used by \nhumans to learn complex tasks, and the theoretical analysis and convergence conditions \ncould potentially lead to a deeper understanding of human motor learning \nand successful techniques for optimizing performance. \n\nAcknowledgements \n\nThanks are due to Simon Giszter, Reza Shadmehr, Sandro Mussa-Ivaldi, Emilio \nBizzi, and many people at the NIPS conference for their comments and criticisms. \nThis report describes research done within the laboratory of Dr. Emilio Bizzi in the \ndepartment of Brain and Cognitive Sciences at MIT. The author was supported during \nthis work by a National Defense Science and Engineering Graduate Fellowship, \nand by NIH grants 5R37 AR26710 and 5R01NS09343 to Dr. Bizzi. \n\nReferences \nAn C. H., Atkeson C. G., Hollerbach J. M., 1988, Model-Based Control of a Robot \nManipulator, MIT Press, Cambridge, MA. \nArimoto S., Kawamura S., Miyazaki F., 1984, Bettering operation of robots by \nlearning, Journal of Robotic Systems, 1(2):123-140. \nAtkeson C. G., 1989, Learning arm kinematics and dynamics, Ann. Rev. Neurosci., \n12:157-183. 
\nBizzi E., Accornero N., Chapple W., Hogan N., 1984, Posture control and trajectory \nformation during arm movement, J. Neurosci., 4:2738-2744. \nSanger T. D., 1991a, A tree-structured adaptive network for function approximation \nin high dimensional spaces, IEEE Trans. Neural Networks, 2(2):285-293. \nSanger T. D., 1991b, A tree-structured algorithm for reducing computation in \nnetworks with separable basis functions, Neural Computation, 3(1):67-78. \nSanger T. D., 1992, Neural network learning control of robot manipulators using \ngradually increasing task difficulty, submitted to IEEE Trans. Robotics and \nAutomation. \nSanner R. M., Slotine J.-J. E., 1992, Gaussian networks for direct adaptive control, \nIEEE Trans. Neural Networks, in press. Also MIT NSL Report 910303, 910503, \nMarch 1991 and Proc. American Control Conference, Boston pages 2153-2159, June \n1991. \nSastry S., Bodson M., 1989, Adaptive Control: Stability, Convergence, and Robustness, \nPrentice Hall, New Jersey. \nShadmehr R., 1990, Learning virtual equilibrium trajectories for control of a robot \narm, Neural Computation, 2:436-446. \nYabuta T., Yamada T., 1991, Learning control using neural networks, Proc. IEEE \nInt'l Conf. on Robotics and Automation, Sacramento, pages 740-745. \n\nFigure 2: Dotted line is the desired trajectory, solid line is the actual trajectory, \nand the grey line is the equilibrium point control trajectory. \n\n", "award": [], "sourceid": 699, "authors": [{"given_name": "Terence", "family_name": "Sanger", "institution": null}]}