{"title": "Efficient Nonlinear Control with Actor-Tutor Architecture", "book": "Advances in Neural Information Processing Systems", "page_first": 1012, "page_last": 1018, "abstract": null, "full_text": "Efficient Nonlinear Control with Actor-Tutor Architecture\n\nKenji Doya*\n\nATR Human Information Processing Research Laboratories\n2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan.\n\nAbstract\n\nA new reinforcement learning architecture for nonlinear control is proposed. A direct feedback controller, or the actor, is trained by a value-gradient based controller, or the tutor. This architecture enables both efficient use of the value function and simple computation for real-time implementation. Good performance was verified in multi-dimensional nonlinear control tasks using Gaussian soft-max networks.\n\n1 INTRODUCTION\n\nIn the study of temporal difference (TD) learning in continuous time and space (Doya, 1996b), an optimal nonlinear feedback control law was derived using the gradient of the value function and the local linear model of the system dynamics. It was demonstrated in the simulation of a pendulum swing-up task that the value-gradient based control scheme requires far fewer learning trials than the conventional \"actor-critic\" control scheme (Barto et al., 1983).\n\nIn the actor-critic scheme, the actor, a direct feedback controller, improves its control policy stochastically using the TD error as the effective reinforcement (Figure 1a). Despite its relatively slow learning, the actor-critic architecture has the virtue of simple computation in generating the control command. 
In order to train a direct controller while making efficient use of the value function, we propose a new reinforcement learning scheme which we call the \"actor-tutor\" architecture (Figure 1b).\n\n*Current address: Kawato Dynamic Brain Project, JSTC. 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan. E-mail: doya@erato.atr.co.jp\n\nIn the actor-tutor scheme, the optimal control command based on the current estimate of the value function is used as the target output of the actor. With the use of supervised learning algorithms (e.g., LMSE), learning of the actor is expected to be faster than in the actor-critic scheme, which uses stochastic search algorithms (e.g., $A_{RP}$). The simulation results below confirm this prediction. This hybrid control architecture provides a model of functional integration of motor-related brain areas, especially the basal ganglia and the cerebellum (Doya, 1996a).\n\n2 CONTINUOUS TD LEARNING\n\nFirst, we summarize the theory of TD learning in continuous time and space (Doya, 1996b), which is basic to the derivation of the proposed control scheme.\n\n2.1 CONTINUOUS TD ERROR\n\nLet us consider a continuous-time, continuous-state dynamical system\n\n$$\\frac{dx(t)}{dt} = f(x(t), u(t)) \\tag{1}$$\n\nwhere $x \\in X \\subset R^n$ is the state and $u \\in U \\subset R^m$ is the control input (or the action). The reinforcement is given as a function of the state and the control\n\n$$r(t) = r(x(t), u(t)). \\tag{2}$$\n\nFor a given control law (or a policy)\n\n$$u(t) = \\mu(x(t)), \\tag{3}$$\n\nwe define the \"value function\" of the state $x(t)$ as\n\n$$V^{\\mu}(x(t)) = \\int_t^{\\infty} \\frac{1}{\\tau} e^{-\\frac{s-t}{\\tau}} r(x(s), u(s)) \\, ds, \\tag{4}$$\n\nwhere $x(s)$ and $u(s)$ ($t \\le s < \\infty$) follow the system dynamics (1) and the control law (3). 
Our goal is to find an optimal control law $\\mu^*$ that maximizes $V^{\\mu}(x)$ for any state $x \\in X$. Note that $\\tau$ is the time constant of imminence-weighting, which is related to the discount factor $\\gamma$ of the discrete-time TD as $\\gamma = 1 - \\frac{\\Delta t}{\\tau}$.\n\nBy differentiating (4) by $t$, we have a local consistency condition for the value function\n\n$$\\tau \\frac{dV^{\\mu}(x(t))}{dt} = V^{\\mu}(x(t)) - r(t). \\tag{5}$$\n\nLet $P(x(t))$ be the prediction of the value function $V^{\\mu}(x(t))$ from $x(t)$ by a neural network, or some function approximator that has enough capability of generalization. The prediction should be adjusted to minimize the inconsistency\n\n$$\\hat{r}(t) = r(t) - P(x(t)) + \\tau \\frac{dP(x(t))}{dt}, \\tag{6}$$\n\nwhich is a continuous version of the TD error. Because the boundary condition for the value function is given on the attractor set of the state space, correction of $P(x(t))$ should be made backward in time. The correspondence between continuous-time TD algorithms and discrete-time TD($\\lambda$) algorithms (Sutton, 1988) is shown in (Doya, 1996b).\n\nFigure 1: (a) Actor-critic. (b) Actor-tutor.\n\n2.2 OPTIMAL CONTROL BY VALUE GRADIENT\n\nAccording to the principle of dynamic programming (Bryson and Ho, 1975), the local constraint for the value function $V^*$ for the optimal control law $\\mu^*$ is given by the Hamilton-Jacobi-Bellman equation\n\n$$V^*(x(t)) = \\max_{u(t) \\in U} \\left[ r(x(t), u(t)) + \\tau \\frac{\\partial V^*(x(t))}{\\partial x} f(x(t), u(t)) \\right]. \\tag{7}$$\n\nThe optimal control $\\mu^*$ is given by solving the maximization problem in the HJB equation, i.e.,\n\n$$\\frac{\\partial r(x, u)}{\\partial u} + \\tau \\frac{\\partial V^*(x)}{\\partial x} \\frac{\\partial f(x, u)}{\\partial u} = 0. \\tag{8}$$\n\nWhen the cost for each control variable is given by a convex potential function $G_j()$\n\n$$r(x, u) = R(x) - \\sum_j G_j(u_j), \\tag{9}$$\n\nequation (8) can be solved using a monotonic function $g_j(x) = (G_j')^{-1}(x)$ as\n\n$$u_j = g_j\\left( \\tau \\frac{\\partial V^*(x)}{\\partial x} \\frac{\\partial f(x, u)}{\\partial u_j} \\right). \\tag{10}$$
\n\nIf the system is linear with respect to the input, which is the case with many mechanical systems, $\\partial f(x, u)/\\partial u_j$ is independent of $u$ and the above equation gives a closed-form optimal feedback control law $u = \\mu^*(x)$.\n\nIn practice, the optimal value function is unknown and we replace $V^*(x)$ with the current estimate of the value function $P(x)$\n\n$$u = g\\left( \\tau \\frac{\\partial P(x)}{\\partial x} \\frac{\\partial f(x, u)}{\\partial u} \\right). \\tag{11}$$\n\nWhile the system evolves with the above control law, the value function $P(x)$ is updated to minimize the TD error (6). In (11), the vector $\\partial P(x)/\\partial x$ represents the desired motion direction in the state space and the matrix $\\partial f(x, u)/\\partial u$ transforms it into the action space. The function $g$, which is specified by the control cost, determines the amplitude of control output. For example, if the control cost $G$ is quadratic, then (11) reduces to a linear feedback control. A practically important case is when $g$ is a sigmoid, because this gives a feedback control law for a system with limited control amplitude, as in the examples below.\n\n3 ACTOR-TUTOR ARCHITECTURE\n\nIt was shown in a task of a pendulum swing-up with limited torque (Doya, 1996b) that the above value-gradient based control scheme (11) can learn the task in far fewer trials than the actor-critic scheme. However, computation of the feedback command by (11) requires an on-line calculation of the gradient of the value function $\\partial P(x)/\\partial x$ and its multiplication with the local linear model of the system dynamics $\\partial f(x, u)/\\partial u$, which can be too demanding for real-time implementation.\n\nOne solution to this problem is to use a simple direct controller network, as in the case of the actor-critic architecture. 
The training of the direct controller, or the actor, can be performed by supervised learning instead of trial-and-error learning because the target output of the controller is explicitly given by (11). Although computation of the target output may involve a processing time that is not acceptable for immediate feedback control, it is still possible to use its output for training the direct controller provided that there is some mechanism of short-term memory (e.g., eligibility trace in the connection weights).\n\nFigure 1(b) is a schematic diagram of this \"actor-tutor\" architecture. The critic monitors the performance of the actor and estimates the value function. The \"tutor\" is a cascade of the critic, its gradient estimator, the local linear model of the system, and the differential model of control cost. The actor is trained to minimize the difference between its output and the tutor's output.\n\n4 SIMULATION\n\nWe tested the performance of the actor-tutor architecture in two nonlinear control tasks: a pendulum swing-up task (Doya, 1996b) and the global version of a cart-pole balancing task (Barto et al., 1983).\n\nThe network architecture we used for both the actor and the critic was a Gaussian soft-max network. The output of the network is given by\n\n$$y = \\sum_{k=1}^{K} w_k b_k(x), \\qquad b_k(x) = \\frac{\\exp\\left[ -\\sum_{i=1}^{n} \\left( \\frac{x_i - c_{ki}}{s_{ki}} \\right)^2 \\right]}{\\sum_{l=1}^{K} \\exp\\left[ -\\sum_{i=1}^{n} \\left( \\frac{x_i - c_{li}}{s_{li}} \\right)^2 \\right]},$$\n\nwhere $(c_{k1}, \\ldots, c_{kn})$ and $(s_{k1}, \\ldots, s_{kn})$ are the center and the size of the $k$-th basis function. It is in general possible to adjust the centers and sizes of the basis functions, but in order to assure predictable transient behaviors, we fixed them in a grid. In this case, computation can be drastically reduced by factorizing the activation of basis functions in each input dimension. 
\n\n4.1 PENDULUM SWING-UP TASK\n\nThe first task was to swing up a pendulum with a limited torque $|T| \\le T^{max}$, which was about one fifth of the torque that was required to statically bring the pendulum up (Figure 2 (a)). This is a nonlinear control task in which the controller has to swing the pendulum several times at the bottom to build up enough momentum.\n\nFigure 2: Pendulum swing-up task. The dynamics of the pendulum (a) is given by $m l^2 \\ddot{\\theta} = -\\mu \\dot{\\theta} + m g l \\sin\\theta + T$. The parameters were $m = l = 1$, $g = 9.8$, $\\mu = 0.01$, and $T^{max} = 2.0$. The learning curves for value-gradient based optimal control (b), actor-critic (c), and actor-tutor (d); t_up is the time during which $|\\theta| < 45^\\circ$.\n\nThe state space for the pendulum $x = (\\theta, \\omega)$ was 2D and we used 12 x 12 basis functions to cover the range $|\\theta| \\le 180^\\circ$ and $|\\omega| \\le 180^\\circ/s$. The reinforcement for the state was given by the height of the tip of the pendulum, i.e., $R(x) = \\cos\\theta$, and the cost for control $G$ and the corresponding output sigmoid function $g$ were selected to match the maximal output torque $T^{max}$.\n\nFigures 2 (b), (c), and (d) show the learning curves for the value-gradient based control (11), actor-critic, and actor-tutor control schemes, respectively. As we expected, the learning of the actor-tutor was much faster than that of the actor-critic and was comparable to that of the value-gradient based optimal control scheme.\n\n4.2 CART-POLE SWING-UP TASK\n\nNext we tested the learning scheme in a higher-dimensional nonlinear control task, namely, a cart-pole swing-up task (Figure 3). 
In the pioneering work of Barto et al. (1983), the actor-critic system successfully learned the task of balancing the pole within $\\pm 12^\\circ$ of the upright position while avoiding collision with the end of the cart track. The task we chose was to swing up the pole from an arbitrary angle and to balance it upright. The physical parameters of the cart-pole were the same as in (Barto et al., 1983) except that the length of the track was doubled to provide enough room for swinging.\n\nFigure 3: Cart-pole swing-up task. (a) An example of a swing-up trajectory. (b) Value function learned by the critic. (c) Feedback force learned by the actor. Each square in the plot shows a slice of the 4D state space parallel to the $(\\theta, \\omega)$ plane.\n\nFigure 3 (a) shows an example of a successful swing-up after 1500 learning trials with the actor-tutor architecture. We could not achieve a comparable performance with the actor-critic scheme within 3000 learning trials. Figures 3 (b) and (c) show the value function and the feedback force field, respectively, in the 4D state space $x = (x, v, \\theta, \\omega)$, which were implemented in 6 x 6 x 12 x 12 Gaussian soft-max networks. We imposed symmetric constraints on both actor and critic networks to facilitate generalization. It can be seen that the paths to the upright position in the center of the track are represented as ridges in the value function.\n\n5 DISCUSSION\n\nThe biggest problem in applying TD or DP to real-world control tasks is the curse of dimensionality, which makes both the computation for each data point and the number of data points necessary for training very high. 
The actor-tutor architecture provides a partial solution to the former problem in real-time implementation. The grid-based Gaussian soft-max basis function network was successfully used in a 4D state space. However, a more flexible algorithm that allocates basis functions only in the relevant parts of the state space may be necessary for dealing with higher-dimensional systems (Schaal and Atkeson, 1996).\n\nIn the above simulations, we assumed that the local linear model of the system dynamics $\\partial f(x, u)/\\partial u$ was available. In preliminary experiments, it was verified that the critic, the system model, and the actor can be trained simultaneously.\n\nThe actor-tutor architecture resembles \"feedback error learning\" (Kawato et al., 1987) in the sense that a nonlinear controller is trained by the output of another controller. However, the actor-tutor scheme can be applied to a highly nonlinear control task for which it is difficult to prepare a simple linear feedback controller.\n\nMotivated by the performance of the actor-tutor architecture and the recent physiological and fMRI experiments on the brain activity during the course of motor learning (Hikosaka et al., 1996; Imamizu et al., 1996), we proposed a framework of functional integration of the basal ganglia, the cerebellum, and cerebral motor areas (Doya, 1996a). In this framework, the basal ganglia learns the value function $P(x)$ (Houk et al., 1994) and generates the desired motion direction based on its gradient $\\partial P(x)/\\partial x$. This is transformed into a motor command by the \"transpose model\" of the motor system $(\\partial f(x, u)/\\partial u)^T$ in the lateral cerebellum (cerebrocerebellum). In early stages of learning, this output is used for control, although its feedback latency is long. 
As the subject repeats the same task, a direct controller is constructed in the medial and intermediate cerebellum (spinocerebellum) with the above motor command as the teacher. The direct controller enables quick, near-automatic performance with less cognitive load in other parts of the brain.\n\nReferences\n\nBarto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834-846.\n\nBryson, Jr., A. E. and Ho, Y.-C. (1975). Applied Optimal Control. Hemisphere Publishing, New York, 2nd edition.\n\nDoya, K. (1996a). An integrated model of basal ganglia and cerebellum in sequential control tasks. Society for Neuroscience Abstracts, 22:2029.\n\nDoya, K. (1996b). Temporal difference learning in continuous time and space. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 1073-1079. MIT Press, Cambridge, MA.\n\nHikosaka, O., Miyachi, S., Miyashita, K., and Rand, M. K. (1996). Procedural learning in monkeys - Possible roles of the basal ganglia. In Ono, T., McNaughton, B. L., Molotchnikoff, S., Rolls, E. T., and Nishijo, H., editors, Perception, Memory and Emotion: Frontiers in Neuroscience, pages 403-420. Pergamon, Oxford.\n\nHouk, J. C., Adams, J. L., and Barto, A. G. (1994). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In Houk, J. C., Davis, J. L., and Beiser, D. G., editors, Models of Information Processing in the Basal Ganglia, pages 249-270. MIT Press, Cambridge, MA.\n\nImamizu, H., Miyauchi, S., Sasaki, Y., Takino, R., Putz, B., and Kawato, M. (1996). 
A functional MRI study on internal models of dynamic transformations during learning a visuomotor task. Society for Neuroscience Abstracts, 22:898.\n\nKawato, M., Furukawa, K., and Suzuki, R. (1987). A hierarchical neural network model for control and learning of voluntary movement. Biological Cybernetics, 57:169-185.\n\nSchaal, S. and Atkeson, C. G. (1996). From isolation to cooperation: An alternative view of a system of experts. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 605-611. MIT Press, Cambridge, MA.\n\nSutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44.\n", "award": [], "sourceid": 1228, "authors": [{"given_name": "Kenji", "family_name": "Doya", "institution": null}]}