{"title": "Learning from Demonstration", "book": "Advances in Neural Information Processing Systems", "page_first": 1040, "page_last": 1046, "abstract": null, "full_text": "Learning From Demonstration \n\nStefan Schaal \n\nsschaal@cc.gatech.edu; http://www.cc.gatech.edu/fac/Stefan.Schaal \n\nCollege of Computing, Georgia Tech, 801 Atlantic Drive, Atlanta, GA 30332-0280 \n\nATR Human Information Processing, 2-2 Hikaridai, Seika-cho, Soraku-gun, 619-02 Kyoto \n\nAbstract \n\nBy now it is widely accepted that learning a task from scratch, i.e., without any prior knowledge, is a daunting undertaking. Humans, however, rarely attempt to learn from scratch. They extract initial biases as well as strategies for how to approach a learning problem from instructions and/or demonstrations of other humans. For learning control, this paper investigates how learning from demonstration can be applied in the context of reinforcement learning. We consider priming the Q-function, the value function, the policy, and the model of the task dynamics as possible areas where demonstrations can speed up learning. In general nonlinear learning problems, only model-based reinforcement learning shows significant speed-up after a demonstration, while in the special case of linear quadratic regulator (LQR) problems, all methods profit from the demonstration. In an implementation of pole balancing on a complex anthropomorphic robot arm, we demonstrate that, when facing the complexities of real signal processing, model-based reinforcement learning offers the most robustness for LQR problems. Using the suggested methods, the robot learns pole balancing in just a single trial after a 30 second long demonstration of the human instructor. \n\n1.  
INTRODUCTION \nInductive supervised learning methods have reached a high level of sophistication. Given a data set and some prior information about its nature, a host of algorithms exist that can extract structure from this data by minimizing an error criterion. In learning control, however, the learning task is often less well defined. Here, the goal is to learn a policy, i.e., the appropriate actions in response to a perceived state, in order to steer a dynamical system to accomplish a task. As the task is usually described in terms of optimizing an arbitrary performance index, no direct training data exist which could be used to learn a controller in a supervised way. Even worse, the performance index may be defined over the long term behavior of the task, and a problem of temporal credit assignment arises in how to credit or blame actions in the past for the current performance. In such a setting, typical for reinforcement learning, learning a task from scratch can require a prohibitively time-consuming amount of exploration of the state-action space in order to find a good policy. On the other hand, learning without prior knowledge seems to be an approach that is rarely taken in human and animal learning. Knowledge of how to approach a new task can be transferred from previously learned tasks, and/or it can be extracted from the performance of a teacher. This raises the question of how learning control can profit from these kinds of information in order to accomplish a new task more quickly. In this paper we will focus on learning from demonstration. \nLearning from demonstration, also known as \"programming by demonstration\", \"imitation learning\", and \"teaching by showing\", received significant attention in automatic robot assembly over the last 20 years. 
The goal was to replace the time-consuming manual programming of a robot by an automatic programming process, solely driven by showing the robot the assembly task by an expert. In concert with the main stream of Artificial Intelligence at the time, research was driven by symbolic approaches: the expert's demonstration was segmented into primitive assembly actions and spatial relationships between manipulator and environment, and subsequently submitted to symbolic reasoning processes (e.g., Lozano-Perez, 1982; Dufay & Latombe, 1983; Segre & Dejong, 1985). More recent approaches to programming by demonstration started to include more inductive learning components (e.g., Ikeuchi, 1993; Dillmann, Kaiser, & Ude, 1995). In the context of human skill learning, teaching by showing was investigated by Kawato, Gandolfo, Gomi, & Wada (1994) and Miyamoto et al. (1996) for a complex manipulation task to be learned by an anthropomorphic robot arm. An overview of several other projects can be found in Bakker & Kuniyoshi (1996). \n\nFigure 1: a) pendulum swing up, b) cart pole balancing \n(a) $m l^2 \\ddot{\\theta} = -\\mu \\dot{\\theta} + m g l \\sin\\theta + \\tau$, $\\theta \\in [-\\pi, \\pi]$ \n$r(\\theta, \\tau) = (\\theta/\\pi)^2 - (2/\\pi)^2 \\log\\cos\\left(\\frac{\\pi}{2} \\frac{\\tau}{\\tau_{max}}\\right)$ \n$m = l = 1$, $g = 9.81$, $\\mu = 0.05$, $\\tau_{max} = 5$ Nm \ndefine: $x = (\\theta, \\dot{\\theta})^T$, $u = \\tau$ \n(b) $m l \\ddot{x} \\cos\\theta + m l^2 \\ddot{\\theta} - m g l \\sin\\theta = 0$ \n$(m + m_c) \\ddot{x} + m l \\ddot{\\theta} \\cos\\theta - m l \\dot{\\theta}^2 \\sin\\theta = F$ \ndefine: $x = (x, \\dot{x}, \\theta, \\dot{\\theta})^T$, $u = F$ \n$r(x, u) = x^T Q x + u^T R u$ \n$l = 0.75$ m, $m = 0.15$ kg, $m_c = 1.0$ kg \n$Q = \\mathrm{diag}(1.25, 1, 12, 0.25)$, $R = 0.01$ \n\nIn this paper, the focus lies on reinforcement learning and how learning from demonstration can be beneficial in this context. 
We divide reinforcement learning into two categories: reinforcement learning for nonlinear tasks (Section 2.1) and for (approximately) linear tasks (Section 2.2), and investigate how methods like Q-learning, value-function learning, and model-based reinforcement learning can profit from data from a demonstration. In Section 2.3, one example task, pole balancing, is placed in the context of using an actual, anthropomorphic robot to learn it, and we reconsider the applicability of learning from demonstration in this more complex situation. \n\n2.  REINFORCEMENT LEARNING FROM DEMONSTRATION \nTwo example tasks will be the basis of our investigation of learning from demonstration. The nonlinear task is the \"pendulum swing-up with limited torque\" (Atkeson, 1994; Doya, 1996), as shown in Figure 1a. The goal is to balance the pendulum in an upright position starting from hanging downward. As the maximal torque available is restricted such that the pendulum cannot be supported against gravity in all states, a \"pumping\" trajectory is necessary, similar to the mountain car example of Moore (1991), but more delicate in its timing since building up too much momentum during pumping will overshoot the upright position. The (approximately) linear example, Figure 1b, is the well-known cart-pole balancing problem (Widrow & Smith, 1964; Barto, Sutton, & Anderson, 1983). For both tasks, the learner is given information about the one-step reward r (Figure 1), and both tasks are formulated as continuous state and continuous action problems. 
The goal of each task is to find a policy which minimizes the infinite horizon discounted reward: \n\n$V(x(t)) = \\int_t^{\\infty} e^{-(s-t)/\\tau} r(x(s), u(s)) \\, ds$  or  $V(x(t)) = \\sum_{i=t}^{\\infty} \\gamma^{i-t} r(x(i), u(i))$  (1) \n\nwhere the left hand equation is the continuous time formulation, while the right hand equation is the corresponding discrete time version, and where x and u denote an n-dimensional state vector and an m-dimensional command vector, respectively. For the Swing-Up, we assume that a teacher provided us with 5 successful trials starting from different initial conditions. Each trial consists of a time series of data vectors $(\\theta, \\dot{\\theta}, \\tau)$ sampled at 60Hz. For the Cart-Pole, we have a 30 second demonstration of successful balancing, represented as a 60Hz time series of data vectors $(x, \\dot{x}, \\theta, \\dot{\\theta}, F)$. How can these demonstrations be used to speed up reinforcement learning? \n\n2.1  THE NONLINEAR TASK: SWING-UP \nWe applied reinforcement learning based on learning a value function (V-function) (Dyer & McReynolds, 1970) for the Swing-Up task, as the alternative method, Q-learning (Watkins, 1989), has so far received very limited research for continuous state-action spaces. The V-function assigns a scalar reward value V(x(t)) to each state x such that the entire V-function fulfills the consistency equation: \n\n$V(x(t)) = \\min_{u(t)} \\left( r(x(t), u(t)) + \\gamma V(x(t+1)) \\right)$  (2) \n\nFor clarity, this equation is given for a discrete state-action system; the continuous formulation can be found, e.g., in Doya (1996). The optimal policy, $u = \\pi(x)$, chooses the action u in state x such that (2) is fulfilled. Note that this computation involves an optimization step that includes knowledge of the subsequent state x(t+1).  
Hence, it requires a model of the dynamics of the controlled system, x(t+1) = f(x(t), u(t)). From the viewpoint of learning from demonstration, V-function learning offers three candidates which can be primed from a demonstration: the value function V(x), the policy $\\pi(x)$, and the model f(x,u). \n\n2.1.1  V-Learning \nIn order to assess the benefits of a demonstration for the Swing-Up, we implemented V-learning as suggested in Doya's (1996) continuous TD (CTD) learning algorithm. The V-function and the dynamics model were incrementally learned by a nonlinear function approximator, Receptive Field Weighted Regression (RFWR) (Schaal & Atkeson, 1996). Differing from Doya's (1996) implementation, we used the optimal action suggested by CTD to learn a model of the policy $\\pi$ (an \"actor\" as in Barto et al. (1983)), again represented by RFWR. The following learning conditions were tested empirically: \na)  Scratch: Trial by trial learning of value function V, model f, and actor $\\pi$ from scratch. \nb)  Primed Actor: Initial training of $\\pi$ from the demonstration, then trial by trial learning. \nc)  Primed Model: Initial training of f from the demonstration, then trial by trial learning. \nd)  Primed Actor&Model: Priming of $\\pi$ and f as in b) and c), then trial by trial learning. \n\n[Figure 2 plot: $T_{up}$ versus Trial for the conditions a) scratch, b) primed actor, c) primed model, d) primed actor&model.] \nFigure 2: Smoothed learning curves of the average of 10 learning trials for the learning conditions a) to d) (see text). Good performance is characterized by $T_{up} > 45$s; below this value the system is usually able to swing up properly but it does not know how to stop in the upright position. \n\nFigure 2 shows the results of learning the Swing-Up. Each trial lasted 60 seconds. The time $T_{up}$ the pole spent in the interval $\\theta \\in [-\\pi/2, \\pi/2]$ during each trial was taken as the performance measure (Doya, 1996). Comparing conditions a) and c), the results demonstrate that learning the pole model from the demonstration did not speed up learning. This is not surprising since learning the V-function is significantly more complicated than learning the model, such that the learning process is dominated by V-function learning. Interestingly, priming the actor from the demonstration had a significant effect on the initial performance (condition a) vs. b)). The system knew right away how to pump up the pendulum, but, in order to learn how to balance the pendulum in the upright position, it finally took the same amount of time as learning from scratch. This behavior is due to the fact that, theoretically, the V-function can only be approximated correctly if the entire state-action space is explored densely. Only if the demonstration covered a large fraction of the entire state space would one expect V-learning to profit from it. We also investigated using the demonstration to prime the V-function by itself or in combination with the other functions. The results were qualitatively the same as shown in Figure 2: if the policy was included in the priming, the learning traces were like b) and d), otherwise like a) and c). Again, this is not totally surprising. 
Approximating a V-function is not just supervised learning as for $\\pi$ and f; it requires an iterative procedure to ensure the validity of (2) and amounts to a complicated nonstationary function approximation process. Given the limited amount of data from the demonstration, it is generally very unlikely that a good value function can be approximated. \n\n2.1.2  Model-Based V-Learning \nIf learning a model f is required, one can make more powerful use of it. According to the certainty equivalence principle, f can substitute the real world, and planning can be run in \"mental simulations\" instead of interaction with the real world. In reinforcement learning, this idea was originally pursued by Sutton's (1990) DYNA algorithms for discrete state-action spaces. Here we will explore to what extent a continuous version of DYNA, DYNA-CTD, can help in learning from demonstration. The only difference compared to CTD in Section 2.1.1 is that after every real trial, DYNA-CTD performs five \"mental trials\" in which the model of the dynamics acquired so far replaces the actual pole dynamics. Two learning conditions were explored: \na)  Scratch:  Trial by trial learning of V, model f, and policy $\\pi$ from scratch. \nb)  Primed Model:  Initial training of f from the demonstration, then trial by trial learning. \nFigure 3 demonstrates that in contrast to V-learning in the previous section, learning from demonstration can make a significant difference now: after the demonstration, it only takes about 2-3 trials to accomplish a good swing-up with stable balancing, indicated by $T_{up} > 45$s. 
Note that learning from scratch is also significantly faster than in Figure 2. \n\n[Figure 3 plot: $T_{up}$ versus Trial for the conditions a) scratch and b) primed model.] \nFigure 3: Smoothed learning curves of the average of 10 learning trials for the learning conditions a) and b) (see text) of the Swing-Up problem using \"mental simulations\". See Figure 2 for an explanation of how to interpret the graph. \n\n2.2  THE LINEAR TASK: CART-POLE BALANCING \nOne might argue that applying reinforcement learning from demonstration to the Swing-Up task is premature, since reinforcement learning with nonlinear function approximators is not yet well understood scientifically. Thus, in this section we turn to an easier task: the cart-pole balancer. The task is approximately linear if the pole is started in a close to upright position, and the problem has been well studied in the dynamic programming literature in the context of linear quadratic regulation (LQR) (Dyer & McReynolds, 1970). \n\n2.2.1  Q-Learning \nIn contrast to V-learning, Q-learning (Watkins, 1989; Singh & Sutton, 1996) learns a more complicated value function, Q(x,u), which depends both on the state and the command. The analogue of the consistency equation (2) for Q-learning is: \n\n$Q(x(t), u(t)) = r(x(t), u(t)) + \\gamma \\min_{u(t+1)} Q(x(t+1), u(t+1))$  (3) \n\nAt every state x, picking the action u which minimizes Q is the optimal action under the reward function (1). As an advantage, evaluating the Q-function to find the optimal policy does not require a model f of the dynamical system that is to be controlled; only the value of the one-step reward r is needed. For learning from demonstration, priming the Q-function and/or the policy are the two candidates to speed up learning. 
\nFor LQR problems, Bradtke (1993) suggested a Q-learning method that is ideally suited for learning from demonstration, based on extracting a policy. He observed that for LQR the Q-function is quadratic in the states and commands: \n\n$Q(x, u) = [x^T, u^T] \\begin{bmatrix} H_{11} & H_{12} \\\\ H_{21} & H_{22} \\end{bmatrix} [x^T, u^T]^T$,  $H_{11} = n \\times n$, $H_{22} = m \\times m$, $H_{12} = H_{21}^T = n \\times m$  (4) \n\nand that the (linear) policy, represented as a gain matrix K, can be extracted from (4) as: \n\n$u_{opt} = -K x = -H_{22}^{-1} H_{21} x$  (5) \n\nConversely, given a stabilizing initial policy $K_{demo}$, the current Q-function can be approximated by a recursive least squares procedure, and it can be optimized by a policy iteration process with guaranteed convergence (Bradtke, 1993). As a demonstration allows one to extract an initial policy $K_{demo}$ by linearly regressing the observed command u against the corresponding observed states x, one-shot learning of pole balancing is achievable. As shown in Figure 4, after about 120 seconds (12 policy iteration steps), the policy is basically indistinguishable from the optimal policy. A caveat of this Q-learning, however, is that it cannot learn without a stabilizing initial policy. \n\n[Figure 4 plot: one-step reward versus Time [s]; annotated gains $K_{demo} = [-0.59, -1.81, -18.71, -6.67]$ and $K_{final} = [-5.76, -11.37, -83.05, -21.92]$.] \nFigure 4: Typical learning curve of a noisy simulation of the cart-pole balancer using Q-learning. The graph shows the value of the one-step reward over time for the first learning trial. The pole is never dropped. \n\n2.2.2  Model-Based V-Learning \nLearning an LQR task by learning the V-function is one of the classic forms of dynamic programming (Dyer & McReynolds, 1970).  
Using a stabilizing initial policy $K_{demo}$, the current V-function can be approximated by recursive least squares in analogy with Bradtke (1993). As for $K_{demo}$, a (linear) model $f_{demo}$ of the cart-pole dynamics can be extracted from a demonstration by linear regression of the cart-pole state x(t) vs. the previous state and command vector (x(t-1), u(t-1)), and the model can be refined with every new data point experienced during learning. The policy update becomes: \n\n$K = \\gamma (R + \\gamma B^T H B)^{-1} B^T H A$,  where $V(x) = x^T H x$, $f_{demo} = [A \\; B]$, $A = n \\times n$, $B = n \\times m$  (6) \n\nThus, a process similar to that in Bradtke (1993) can be used to find the optimal policy K, and the system accomplishes one shot learning, qualitatively indistinguishable from Figure 4. Again, as pointed out in Section 2.1.2, one can make more efficient use of the learned model by performing mental simulations. Given the model $f_{demo}$, the policy K can be calculated by off-line policy iteration from an initial estimate of H, e.g., taken to be the identity matrix (Dyer & McReynolds, 1970). Thus, no initial (stabilizing) policy is required, but rather an estimate of the task dynamics. This method also achieves one shot learning. \n\n2.3  POLE BALANCING WITH AN ACTUAL ROBOT \nAs a result of the previous section, it seems that there are no real performance differences between V-learning, Q-learning, and model-based V-learning for LQR problems. To explore the usefulness of these methods in a more realistic framework, we implemented learning from demonstration of pole balancing on an anthropomorphic robot arm. The robot is equipped with a 60 Hz video-based stereo vision system. The pole is marked by two color blobs which can be tracked in real-time.  
A 30 second long demonstration of pole balancing was provided by a human standing in front of the two robot cameras. \nThere are a few crucial differences in comparison with the simulations. First, as the demonstration is vision-based, only kinematic variables can be extracted from the demonstration. Second, visual signal processing has about 120ms time delay. Third, a command given to the robot is not executed with very high accuracy due to unknown nonlinearities of the robot. And lastly, humans use internal state for pole balancing, i.e., their policy is partially based on non-observable variables. These issues have the following impact: \n\nKinematic Variables: In this implementation, the robot arm replaces the cart of the Cart-Pole problem. Since we have an estimate of the inverse dynamics and inverse kinematics of the arm, we can use the acceleration of the finger in Cartesian space as command input to the task. The arm is also much heavier than the pole which allows us to neglect the interaction forces the pole exerts on the arm. Thus, the pole balancing dynamics of Figure 1b can be reformulated as: \n\n$u m l \\cos\\theta + \\ddot{\\theta} m l^2 - m g l \\sin\\theta = 0$,  $\\ddot{x} = u$  (7) \n\nAll variables in this equation can be extracted from a demonstration. We omit the 3D extension of these equations. \n\nFigure 5: Sketch of SARCOS anthropomorphic robot arm \n\nDelayed Visual Information: There are two possibilities of dealing with delayed variables. Either the state of the system is augmented by delayed commands corresponding to $7 \\times 1/60$ s $\\approx 120$ ms delay time, $x^T = (x, \\dot{x}, \\theta, \\dot{\\theta}, u_{t-1}, u_{t-2}, \\ldots, u_{t-7})$, or a state predictive controller is employed. The former method increases the complexity of a policy significantly, while the latter method requires a model f. 
\nInaccuracies of Command Execution: Given an acceleration command u, the robot will execute something close to u, but not u exactly. Thus, learning a function which includes u, e.g., the dynamics model (7), can be dangerous since the mapping $(x, \\dot{x}, \\theta, \\dot{\\theta}, u) \\to (\\ddot{x}, \\ddot{\\theta})$ is contaminated by the nonlinear dynamics of the robot arm. Indeed, it turned out that we could not learn such a model reliably. This could be remedied by \"observing\" the command u, i.e., by extracting $u = \\ddot{x}$ from visual feedback. \nInternal State in Demonstrated Policy: Investigations with human subjects have shown that humans use internal state in pole balancing. Thus, a policy cannot be observed as easily as claimed in Section 2.2: a regression analysis for extracting the policy of a teacher must find the appropriate time-alignment of observed current state and command(s) in the past. This can become a numerically involved process as regressing a policy based on delayed commands is endangered by singular regression matrices. Consequently, it easily happens that one extracts a nonstabilizing policy from the demonstration, which prevents the application of Q-learning and V-learning as described in Section 2.2. \nAs a result of these considerations, the most trustworthy item to extract from a demonstration is the model of the pole dynamics. In our implementation it was used in two ways, for calculating the policy as in (6), and in state-predictive control with a Kalman filter to overcome the delays in visual information processing. The model was learned incrementally in real-time by an implementation of RFWR (Schaal & Atkeson, 1996). Figure 6 shows the results of learning from scratch and learning from demonstration of the actual robot.  
Without a demonstration, it took about 10-20 trials before learning succeeded in reliable performance longer than one minute. With a 30 second long demonstration, learning was reliably accomplished in one single trial, using a large variety of physically different poles and using demonstrations from arbitrary people in the laboratory. \n\n[Figure 6 plot: learning curves versus Trial for the conditions a) scratch and b) primed model.] \nFigure 6: Smoothed average of 10 learning curves of the robot for pole balancing. The trials were aborted after successful balancing of 60 seconds. We also tested long term performance of the learning system by running pole balancing for over an hour; the pole was never dropped. \n\n3.  CONCLUSION \nWe discussed learning from demonstration in the context of reinforcement learning, focusing on Q-learning, value function learning, and model-based reinforcement learning. Q-learning and value function learning can theoretically profit from a demonstration by extracting a policy, by using the demonstration data to prime the Q/value function, or, in the case of value function learning, by extracting a predictive model of the world. Only in the special case of LQR problems, however, could we find a significant benefit of priming the learner from the demonstration. In contrast, model-based reinforcement learning was able to greatly profit from the demonstration by using the predictive model of the world for \"mental simulations\". In an implementation with an anthropomorphic robot arm, we illustrated that even in LQR problems, model-based reinforcement learning offers greater robustness to the complexity of real learning systems than Q-learning and value function learning. Using a model-based strategy, our robot learned pole-balancing from a demonstration in a single trial with great reliability. The important message of this work is that not every learning approach is equally suited to allow knowledge transfer and/or the incorporation of biases. This issue may serve as a critical additional constraint to evaluate artificial and biological models of learning. \n\nAcknowledgments \nSupport was provided by the ATR Human Information Processing Labs, the German Research Association, the Alexander v. Humboldt Foundation, and the German Scholarship Foundation. \n\nReferences \nAtkeson, C. G. (1994). \"Using local trajectory optimizers to speed up global optimization in dynamic programming.\" In: Moody, Hanson, & Lippmann (Eds.), Adv. in Neural Inf. Proc. Sys. 6. Morgan Kaufmann. \nBakker, P., & Kuniyoshi, Y. (1996). \"Robot see, robot do: An overview of robot imitation.\" Autonomous Systems Section, Electrotechnical Laboratory, Tsukuba Science City, Japan. \nBarto, A. G., Sutton, R. S., & Anderson, C. W. (1983). \"Neuronlike adaptive elements that can solve difficult learning control problems.\" IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 5. \nBradtke, S. J. (1993). \"Reinforcement learning applied to linear quadratic regulation.\" In: Hanson, J. S., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Inf. Processing Systems 5, pp.295-302. Morgan Kaufmann. \nDillmann, R., Kaiser, M., & Ude, A. (1995). \"Acquisition of elementary robot skills from human demonstration.\" In: International Symposium on Intelligent Robotic Systems (SIRS'95), Pisa, Italy. \nDoya, K. (1996). \"Temporal difference learning in continuous time and space.\" In: Touretzky, D. S., Mozer, M. C., & Hasselmo, M. E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press. \nDufay, B., & Latombe, J.-C. (1984). \"An approach to automatic robot programming based on inductive learning.\" In: Brady, M., & Paul, R. (Eds.), Robotics Research, pp.97-115. Cambridge, MA: MIT Press. \nDyer, P., & McReynolds, S. R. (1970). The computation and theory of optimal control. NY: Academic Press. \nIkeuchi, K. (1993b). \"Assembly plan from observation.\" School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. \nKawato, M., Gandolfo, F., Gomi, H., & Wada, Y. (1994b). \"Teaching by showing in kendama based on optimization principle.\" In: Proceedings of the International Conference on Artificial Neural Networks (ICANN'94), 1, pp.601-606. \nLozano-Perez, T. (1982). \"Task-Planning.\" In: Brady, M., Hollerbach, J. M., Johnson, T. L., Lozano-Perez, T., & Mason, M. T. (Eds.), pp.473-498. MIT Press. \nMiyamoto, H., Schaal, S., Gandolfo, F., Koike, Y., Osu, R., Nakano, E., Wada, Y., & Kawato, M. (in press). \"A Kendama learning robot based on bi-directional theory.\" Neural Networks. \nMoore, A. (1991a). \"Fast, robust adaptive control by learning only forward models.\" In: Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Inf. Proc. Systems 4. Morgan Kaufmann. \nSchaal, S., & Atkeson, C. G. (1996). \"From isolation to cooperation: An alternative of a system of experts.\" In: Touretzky, D. S., Mozer, M. C., & Hasselmo, M. E. (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press. \nSegre, A. B., & Dejong, G. (1985). \"Explanation-based manipulator learning: Acquisition of planning ability through observation.\" In: Conference on Robotics and Automation, pp.555-560. \nSingh, S. P., & Sutton, R. S. (1996). \"Reinforcement learning with eligibility traces.\" Machine Learning. \nSutton, R. S. (1990). \"Integrated architectures for learning, planning, and reacting based on approximating dynamic programming.\" In: Proceedings of the International Machine Learning Conference. \nWatkins, C. J. C. H. (1989). \"Learning with delayed rewards.\" Ph.D. thesis, Cambridge University (UK). \nWidrow, B., & Smith, F. W. (1964). \"Pattern recognizing control systems.\" In: 1963 Comp. and Inf. Sciences (COINS) Symp. Proc., 288-317, Washington: Spartan. \n", "award": [], "sourceid": 1224, "authors": [{"given_name": "Stefan", "family_name": "Schaal", "institution": null}]}