{"title": "Learning Unambiguous Reduced Sequence Descriptions", "book": "Advances in Neural Information Processing Systems", "page_first": 291, "page_last": 298, "abstract": null, "full_text": "LEARNING  UNAMBIGUOUS  REDUCED \n\nSEQUENCE DESCRIPTIONS \n\nJiirgen Schmidhuber \n\nDept.  of Computer Science \n\nUniversity  of Colorado \n\nCampus Box 430 \n\nBoulder,  CO 80309,  USA \nyirgan@cs.colorado.edu \n\nAbstract \n\nDo you  want  your  neural  net  algorithm  to learn  sequences?  Do not lim(cid:173)\nit  yourself to  conventional  gradient  descent  (or  approximations  thereof). \nInstead,  use  your sequence  learning  algorithm  (any will  do)  to implement \nthe  following  method  for  history  compression.  No  matter  what  your  fi(cid:173)\nnal goals  are,  train  a  network  to predict  its next input from  the  previous \nones.  Since  only  unpredictable  inputs  convey  new  information, ignore  all \npredictable  inputs  but let  all  unexpected  inputs  (plus  information  about \nthe  time  step  at  which  they  occurred)  become  inputs  to  a  higher-level \nnetwork of the same kind  (working on a  slower,  self-adjusting  time scale). \nGo on  building  a  hierarchy  of such  networks.  This  principle  reduces  the \ndescriptions  of event  sequences  without  1088  of information,  thus  easing \nsupervised  or  reinforcement  learning  tasks.  Alternatively,  you  may  use \ntwo recurrent  networks  to collapse  a  multi-level  predictor hierarchy  into a \nsingle  recurrent  net.  Experiments  show  that systems based on  these prin(cid:173)\nciples  can require less  computation per  time step and many fewer  training \nsequences  than conventional training algorithms for  recurrent  nets.  Final(cid:173)\nly you can modify the above method such that predictability is not defined \nin  a  yes-or-no fashion  but in a  continuous fashion. \n\n291 \n\n\f292 \n\nSchmidhuber \n\n1 \n\nINTRODUCTION \n\nThe following methods for supervised sequence learning have been proposed:  Simple \nrecurrent  nets [7][3],  time-delay nets (e.g.  [2]),  sequential recursive  auto-associative \nmemories  [16],  back-propagation through time  or BPTT [21]  [30]  [33],  Mozer's  'fo(cid:173)\ncused  back-prop'  algorithm  [10],  the  IID- or  RTRL-algorithm  [19][1][34],  its  ac(cid:173)\ncelerated  versions  [32][35][25],  the  recent  fast-weight  algorithm  [27],  higher-order \nnetworks  [5],  as  well  as  continuous  time  methods  equivalent  to some  of the  above \n[14)[15][4].  The following  methods for  sequence  learning  by  reinforcement  learning \nhave  been  proposed:  Extended  REINFORCE  algorithms  [31],  the  neural  bucket \nbrigade algorithm [22],  recurrent  networks adjusted by adaptive critics  [23](see  also \n[8]),  buffer-based  systems  [13],  and networks of hierarchically  organized neuron-like \n\"bions\"  [18]. \n\nWith  the  exception  of [18]  and  [13],  these  approaches  waste  resources  and  limit \nefficiency  by  focusing  on  every  input  instead  of focusing  only  on  relevant  inputs. \nMany of these  methods  have  a  second  drawback  as  well:  The  longer  the time  lag \nbetween an event and the occurrence of a related error the less information is carried \nby the corresponding error information wandering 'back into time' (see  [6]  for a more \ndetailed  analysis).  [11],  [12]  and [20]  have addressed  the latter problem but not the \nformer.  
The system described by [18], on the other hand, addresses both problems, but in a manner much different from the one presented here.

2 HISTORY COMPRESSION

A major contribution of this work is an adaptive method for removing redundant information from sequences. This principle can be implemented with the help of any of the methods mentioned in the introduction.

Consider a deterministic discrete time predictor (not necessarily a neural network) whose state at time t of sequence p is described by an environmental input vector x^p(t), an internal state vector h^p(t), and an output vector z^p(t). The environment may be non-deterministic. At time 0, the predictor starts with x^p(0) and an internal start state h^p(0). At time t ≥ 0, the predictor computes

z^p(t) = f(x^p(t), h^p(t)).

At time t > 0, the predictor furthermore computes

h^p(t) = g(x^p(t-1), h^p(t-1)).

All information about the input at a given time t_x can be reconstructed from t_x, f, g, x^p(0), h^p(0), and the pairs (t_s, x^p(t_s)) for which 0 < t_s ≤ t_x and z^p(t_s - 1) ≠ x^p(t_s). This is because if z^p(t) = x^p(t+1) at a given time t, then the predictor is able to predict the next input from the previous ones; the new input is derivable by means of f and g.

Information about the observed input sequence can be compressed even further, beyond just the unpredicted input vectors x^p(t_s): it suffices to know only those elements of the vectors x^p(t_s) that were not correctly predicted.

This observation implies that we can discriminate one sequence from another by knowing just the unpredicted inputs and the corresponding time steps at which they occurred. No information is lost if we ignore the expected inputs. We do not even have to know f and g. I call this the principle of history compression.

From a theoretical point of view it is important to know at what time an unexpected input occurs; otherwise there will be a potential for ambiguities: two different input sequences may lead to the same shorter sequence of unpredicted inputs. With many practical tasks, however, there is no need to know the critical time steps (see section 5).

3 SELF-ORGANIZING PREDICTOR HIERARCHY

Using the principle of history compression we can build a self-organizing hierarchical neural 'chunking' system^1. The basic task can be formulated as a prediction task. At a given time step the goal is to predict the next input from previous inputs. If there are external target vectors at certain time steps, they are simply treated as another part of the input to be predicted.

The architecture is a hierarchy of predictors; the input to each level of the hierarchy comes from the previous level. P_i denotes the i-th level network, which is trained to predict its own next input from its previous inputs^2. We take P_i to be one of the conventional dynamic recurrent neural networks mentioned in the introduction; however, it might be some other adaptive sequence processing device as well^3.

At each time step the input of the lowest-level recurrent predictor P_0 is the current external input.
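
To make the underlying reduction concrete before describing the hierarchy further, the following minimal sketch (in Python with numpy; an illustration, not code from the paper) maps an input sequence to the pairs (t_s, x^p(t_s)) of unpredicted inputs, assuming the fixed predictor of section 2 is given as functions f and g together with the start state h^p(0). The name compress_history is an assumption of the sketch.

    import numpy as np

    # Reduce a sequence to its unpredicted inputs plus their time steps.
    #   xs   : list of input vectors x^p(0), x^p(1), ...
    #   f, g : predictor functions, z^p(t) = f(x^p(t), h^p(t)) and
    #          h^p(t) = g(x^p(t-1), h^p(t-1))
    #   h0   : internal start state h^p(0)
    # Together with f, g, x^p(0) and h0, the returned pairs determine the
    # complete input sequence, so no information is lost.
    def compress_history(xs, f, g, h0, atol=1e-6):
        reduced = []
        h = h0
        z = f(xs[0], h)                     # z^p(0), the prediction of x^p(1)
        for t in range(1, len(xs)):
            h = g(xs[t - 1], h)             # h^p(t)
            if not np.allclose(z, xs[t], atol=atol):
                reduced.append((t, xs[t]))  # unexpected input: keep it and its time step
            z = f(xs[t], h)                 # z^p(t), the prediction of x^p(t+1)
        return reduced

As noted above, a further reduction would keep only the mispredicted components of each x^p(t_s) rather than the whole vector.
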
We create a new higher-level adaptive predictor P_{s+1} whenever the adaptive predictor at the previous level, P_s, stops improving its predictions. When this happens, the weight-changing mechanism of P_s is switched off (to exclude potential instabilities caused by ongoing modifications of the lower-level predictors). If at a given time step P_s (s ≥ 0) fails to predict its next input (or if we are at the beginning of a training sequence, which usually is not predictable either), then P_{s+1} will receive as input the concatenation of this next input of P_s plus a unique representation of the corresponding time step^4; the activations of P_{s+1}'s hidden and output units will be updated. Otherwise P_{s+1} will not perform an activation update. This procedure ensures that P_{s+1} is fed with an unambiguous reduced description^5 of the input sequence observed by P_s. This is theoretically justified by the principle of history compression.

Footnotes:
1 See also [18] for a different hierarchical connectionist chunking system based on similar principles.
2 Recently I became aware that Don Mathis had some related ideas (personal communication). A hierarchical approach to sequence generation was pursued by [9].
3 For instance, we might employ the more limited feed-forward networks and a 'time window' approach. In this case, the number of previous inputs to be considered as a basis for the next prediction will remain fixed.
4 A unique time representation is theoretically necessary to provide P_{s+1} with unambiguous information about when the failure occurred (see also the last paragraph of section 2). A unique representation of the time that went by since the last unpredicted input occurred will do as well.
5 In contrast, the reduced descriptions referred to by [11] are not unambiguous.

In general, P_{s+1} will receive fewer inputs over time than P_s. With existing learning algorithms, the higher-level predictor should have less difficulty in learning to predict the critical inputs than the lower-level predictor. This is because P_{s+1}'s 'credit assignment paths' will often be short compared to those of P_s. This will happen if the incoming inputs carry global temporal structure which has not yet been discovered by P_s. (See also [18] for a related approach to the problem of credit assignment in reinforcement learning.) This method is a simplification and an improvement of the recent chunking method described by [24].

A multi-level predictor hierarchy is a rather safe way of learning to deal with sequences with multi-level temporal structure (e.g. speech). Experiments have shown that multi-level predictors can quickly learn tasks which are practically unlearnable by conventional recurrent networks, e.g. [6].

4 COLLAPSING THE HIERARCHY

One disadvantage of a predictor hierarchy as above is that it is not known in advance how many levels will be needed. Another disadvantage is that levels are explicitly separated from each other. It may be possible, however, to collapse the hierarchy into a single network as outlined in this section. See details in [26].
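
Before turning to the collapsed version, here is a minimal sketch of the forwarding rule of section 3 under simplifying assumptions: each level wraps a black-box predictor with predict and observe methods (assumed names, not an interface from the paper), and the creation of new levels and the freezing of lower-level weights are only indicated in comments.

    import numpy as np

    # One illustrative choice of a unique representation of time step t.
    def binary_time_code(t, dim=32):
        return np.array([(t >> i) & 1 for i in range(dim)], dtype=float)

    # One predictor P_s of the hierarchy. 'net' is any adaptive sequence
    # predictor exposing predict(x) -> guess of the next input and
    # observe(x) -> internal state (and, while learning is switched on,
    # weight) update; both method names are assumptions of this sketch.
    class Level:
        def __init__(self, net, atol=1e-6):
            self.net = net
            self.atol = atol
            self.last_prediction = None  # output z from the previous step
            self.higher = None           # P_{s+1}; created once this level stops
                                         # improving (its weights are then frozen)

        def step(self, x, t):
            unexpected = (self.last_prediction is None or
                          not np.allclose(self.last_prediction, x, atol=self.atol))
            if unexpected and self.higher is not None:
                # Only unpredicted inputs, concatenated with a unique time
                # representation, reach the next level; on all other steps
                # P_{s+1} performs no activation update at all.
                self.higher.step(np.concatenate([x, binary_time_code(t)]), t)
            self.last_prediction = self.net.predict(x)
            self.net.observe(x)          # this level runs at every step it sees
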
We need two conventional recurrent networks: the automatizer A and the chunker C, which correspond to a distinction between automatic and attended events. (See also [13] and [17], which describe a similar distinction in the context of reinforcement learning.) At each time step A receives the current external input. A's error function is threefold: one term forces it to emit certain desired target outputs at certain times. If there is a target, then it becomes part of the next input. The second term forces A at every time step to predict its own next non-target input. The third (crucial) term will be explained below.

If and only if A makes an error concerning the first and second term of its error function, the unpredicted input (including a potentially available teaching vector) along with a unique representation of the current time step will become the new input to C. Before this new input can be processed, C (whose last input may have occurred many time steps earlier) is trained to predict this higher-level input from its current internal state and its last input (employing a conventional recurrent net algorithm). After this, C performs an activation update which contributes to a higher-level internal representation of the input history. Note that according to the principle of history compression C is fed with an unambiguous reduced description of the input history. The information deducible by means of A's predictions can be considered as redundant. (The beginning of an episode usually is not predictable, therefore it has to be fed to the chunking level, too.)

Since C's 'credit assignment paths' will often be short compared to those of A, C will often be able to develop useful internal representations of previous unexpected input events. Due to the final term of its error function, A will be forced to reproduce these internal representations, by predicting C's state. Therefore A will be able to create useful internal representations by itself in an early stage of processing a given sequence; it will often receive meaningful error signals long before errors of the first or second kind occur. These internal representations in turn must carry the discriminating information for enabling A to improve its low-level predictions. Therefore the chunker will receive fewer and fewer inputs, since more and more inputs become predictable by the automatizer. This is the collapsing operation. Ideally, the chunker will become obsolete after some time.

It must be emphasized that, unlike with the incremental creation of a multi-level predictor hierarchy described in section 3, there is no formal proof that the 2-net on-line version is free of instabilities. One can imagine situations where A unlearns previously learned predictions because of the third term of its error function. Relative weighting of the different terms in A's error function represents an ad-hoc remedy for this potential problem. In the experiments below, relative weighting was not necessary.
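
The per-time-step interplay of A and C described above can be sketched as follows. The sketch assumes that targets are simply concatenated to the input, that a scalar time stamp stands in for the unique time representation, and that both nets expose predict, train, update_state and state methods; all of these names and interfaces are assumptions for illustration, not the paper's implementation.

    import numpy as np

    # One time step of the collapsed 2-net system (illustrative sketch).
    #   A, C            : automatizer and chunker; predict(), train(),
    #                     update_state() and state() are assumed method names.
    #   prev_prediction : A's output from the previous step, i.e. its guess of
    #                     the current input and target (None at episode start).
    def two_net_step(A, C, x, target, t, prev_prediction, atol=1e-6):
        observed = np.concatenate([x, target])
        if prev_prediction is None or not np.allclose(prev_prediction, observed, atol=atol):
            # Unpredicted event (or start of an episode): feed it to the chunker,
            # together with a representation of the current time step.
            chunk_input = np.concatenate([observed, np.array([float(t)])])
            C.train(chunk_input)         # C first learns to predict this input ...
            C.update_state(chunk_input)  # ... and only then updates its state.
        # A's threefold error: emit targets, predict its next non-target input,
        # and (the crucial third term) predict the chunker's internal state.
        A.train(x, target, chunker_state=C.state())
        A.update_state(x)
        return A.predict(x)              # prediction used at the next time step
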
5 EXPERIMENTS

One experiment with a multi-level chunking architecture involved a grammar which produced strings of many a's and b's such that there was local temporal structure within the training strings (see [6] for details). The task was to differentiate between strings with long overlapping suffixes. The conventional algorithm completely failed to solve the task; it became confused by the great numbers of input sequences with similar endings. Not so the chunking system: it soon discovered certain hierarchical temporal structures in the input sequences and decomposed the problem such that it was able to solve it within a few hundred thousand training sequences.

The 2-net chunking system (the one with the potential for collapsing levels) was also tested against the conventional recurrent net algorithms. (See details in [26].) With the conventional algorithms, with various learning rates, and with more than 1,000,000 training sequences, performance did not improve in prediction tasks involving even as few as 20 time steps between relevant events.

The 2-net chunking system, in contrast, was able to solve the task rather quickly. An efficient approximation of the BPTT method was applied to both the chunker and the automatizer: only 3 iterations of error propagation 'back into the past' were performed at each time step. Most of the test runs required less than 5000 training sequences. Still, the final weight matrix of the automatizer often resembled what one would hope to get from the conventional algorithm. There were hidden units which learned to bridge the 20-step time lags by means of strong self-connections. The chunking system needed less computation per time step than the conventional method and required many fewer training sequences.

6 CONTINUOUS HISTORY COMPRESSION

The history compression technique formulated above defines expectation mismatches in a yes-or-no fashion: each input unit whose activation is not predictable at a certain time gives rise to an unexpected event. Each unexpected event provokes an update of the internal state of a higher-level predictor. The updates always take place according to the conventional activation spreading rules for recurrent neural nets. There is no concept of a partial mismatch or of a 'near-miss'. There is no possibility of updating the higher-level net 'just a little bit' in response to a 'nearly expected input'. In practical applications, some 'epsilon' has to be used to define an acceptable mismatch.

In reply to the above criticism, continuous history compression is based on the following ideas. In what follows, v_i(t) denotes the i-th component of vector v(t). We use a local input representation. The components of z^p(t) are forced to sum up to 1 and are interpreted as a prediction of the probability distribution of the possible x^p(t+1). z_j^p(t) is interpreted as the prediction of the probability that x_j^p(t+1) is 1.

The output entropy

- Σ_j z_j^p(t) log z_j^p(t)

can be interpreted as a measure of the predictor's confidence. In the worst case, the predictor will expect every possible event with equal probability.
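
As an illustration (with assumed names, not code from the paper): with such a local representation the output is a probability distribution over the possible next inputs, and its entropy is maximal exactly when every event is expected with equal probability, i.e. when the predictor is least confident.

    import numpy as np

    # Output entropy - sum_j z_j log z_j of the predicted distribution z.
    # Low entropy means a confident predictor; the maximum, log(len(z)), is
    # reached when every possible next event gets the same probability.
    def prediction_entropy(z, eps=1e-12):
        z = np.clip(np.asarray(z, dtype=float), eps, 1.0)
        return float(-np.sum(z * np.log(z)))

    # A confident prediction versus a maximally uncertain one:
    print(prediction_entropy([0.97, 0.01, 0.01, 0.01]))  # about 0.17
    print(prediction_entropy([0.25, 0.25, 0.25, 0.25]))  # log(4), about 1.39
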
How much information (relative to the current predictor) is conveyed by the event x_j^p(t+1) = 1, once it is observed? According to [29] it is

- log z_j^p(t).

[28] defines update procedures based on Mozer's recent update function [12] that let highly informative events have a stronger influence on the history representation than less informative (more likely) events. The 'strength' of an update in response to a more or less unexpected event is a monotonically increasing function of the information the event conveys. One of the update procedures uses Pollack's recursive auto-associative memories [16] for storing unexpected events, thus yielding an entirely local learning algorithm for learning extended sequences.

7 ACKNOWLEDGEMENTS

Thanks to Josef Hochreiter for conducting the experiments. Thanks to Mike Mozer and Mark Ring for useful comments on an earlier draft of this paper. This research was supported in part by NSF PYI award IRI-9058450, grant 90-21 from the James S. McDonnell Foundation, and DEC external research grant 1250 to Michael C. Mozer.

References

[1] J. Bachrach. Learning to represent state, 1988. Unpublished master's thesis, University of Massachusetts, Amherst.

[2] U. Bodenhausen and A. Waibel. The Tempo 2 algorithm: Adjusting time-delays by supervised learning. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 155-161. San Mateo, CA: Morgan Kaufmann, 1991.

[3] J. L. Elman. Finding structure in time. CRL Technical Report 8801, Center for Research in Language, University of California, San Diego, 1988.

[4] M. Gherrity. A learning algorithm for analog fully recurrent neural networks. In IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 1, pages 643-644, 1989.

[5] C. L. Giles and C. B. Miller. Learning and extracting finite state automata. Accepted for publication in Neural Computation, 1992.

[6] Josef Hochreiter. Diploma thesis, 1991. Institut für Informatik, Technische Universität München.

[7] M. I. Jordan. Serial order: A parallel distributed processing approach. ICS Report 8604, Institute for Cognitive Science, University of California, San Diego, 1986.

[8] G. Lukes. Review of Schmidhuber's paper 'Recurrent networks adjusted by adaptive critics'. Neural Network Reviews, 4(1):41-42, 1990.

[9] Y. Miyata. An unsupervised PDP learning model for action planning. In Proc. of the Tenth Annual Conference of the Cognitive Science Society, Hillsdale, NJ, pages 223-229. Erlbaum, 1988.

[10] M. C. Mozer. A focused back-propagation algorithm for temporal sequence recognition. Complex Systems, 3:349-381, 1989.

[11] M. C. Mozer. Connectionist music composition based on melodic, stylistic, and psychophysical constraints. Technical Report CU-CS-495-90, University of Colorado at Boulder, 1990.

[12] M. C. Mozer. Induction of multiscale temporal structure. In D. S. Lippman, J. E. Moody, and D. S.
Touretzky, editors, Advances in Neural Information Processing Systems 4, to appear. San Mateo, CA: Morgan Kaufmann, 1992.

[13] C. Myers. Learning with delayed reinforcement through attention-driven buffering. TR, Imperial College of Science, Technology and Medicine, 1990.

[14] B. A. Pearlmutter. Learning state space trajectories in recurrent neural networks. Neural Computation, 1:263-269, 1989.

[15] F. J. Pineda. Time dependent adaptive neural networks. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 710-718. San Mateo, CA: Morgan Kaufmann, 1990.

[16] J. B. Pollack. Recursive distributed representation. Artificial Intelligence, 46:77-105, 1990.

[17] M. A. Ring. PhD Proposal: Autonomous construction of sensorimotor hierarchies in neural networks. Technical report, Univ. of Texas at Austin, 1990.

[18] M. A. Ring. Incremental development of complex behaviors through automatic construction of sensory-motor hierarchies. In L. Birnbaum and G. Collins, editors, Machine Learning: Proceedings of the Eighth International Workshop, pages 343-347. Morgan Kaufmann, 1991.

[19] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.

[20] R. Rohwer. The 'moving targets' training method. In J. Kindermann and A. Linden, editors, Proceedings of 'Distributed Adaptive Neural Information Processing', St. Augustin, 24.-25.5. Oldenbourg, 1989.

[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume I, pages 318-362. MIT Press, 1986.

[22] J. H. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412, 1989.

[23] J. H. Schmidhuber. Recurrent networks adjusted by adaptive critics. In Proc. IEEE/INNS International Joint Conference on Neural Networks, Washington, D. C., volume I, pages 719-722, 1990.

[24] J. H. Schmidhuber. Adaptive decomposition of time. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 909-914. Elsevier Science Publishers B.V., North-Holland, 1991.

[25] J. H. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Accepted for publication in Neural Computation, 1992.

[26] J. H. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Accepted for publication in Neural Computation, 1992.

[27] J. H. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Accepted for publication in Neural Computation, 1992.

[28] J. H. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. Technical report, Dept. of Comp. Sci., University of Colorado at Boulder, 1992.

[29] C. E. Shannon. A mathematical theory of communication (parts I and II).
Bell System Technical Journal, XXVII:379-423, 1948.

[30] P. J. Werbos. Generalization of back propagation with application to a recurrent gas market model. Neural Networks, 1, 1988.

[31] R. J. Williams. Toward a theory of reinforcement-learning connectionist systems. Technical Report NU-CCS-88-3, College of Comp. Sci., Northeastern University, Boston, MA, 1988.

[32] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.

[33] R. J. Williams and J. Peng. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4:491-501, 1990.

[34] R. J. Williams and D. Zipser. Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1(1):87-111, 1989.

[35] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale, NJ: Erlbaum, 1992, in press.", "award": [], "sourceid": 523, "authors": [{"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}