{"title": "Graded Grammaticality in Prediction Fractal Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 52, "page_last": 58, "abstract": null, "full_text": "Graded grammaticality in Prediction Fractal Machines\n\nShan Parfitt, Peter Ti\u00f1o and Georg Dorffner\nAustrian Research Institute for Artificial Intelligence,\nSchottengasse 3, A-1010 Vienna, Austria.\n{shan,petert,georg}@ai.univie.ac.at\n\nAbstract\n\nWe introduce a novel method of constructing language models, which avoids some of the problems associated with recurrent neural networks. The method of creating a Prediction Fractal Machine (PFM) [1] is briefly described and some experiments are presented which demonstrate the suitability of PFMs for language modeling. PFMs distinguish reliably between minimal pairs, and their behavior is consistent with the hypothesis [4] that wellformedness is 'graded' not absolute. A discussion of their potential to offer fresh insights into language acquisition and processing follows.\n\n1 Introduction\n\nCognitive linguistics has seen the development in recent years of two important, related trends. Firstly, a widespread renewal of interest in the statistical, 'graded' nature of language (e.g. [2]-[4]) is showing that the traditional all-or-nothing notion of well-formedness may not present an accurate picture of how the congruity of utterances is represented internally. Secondly, the analysis of state space trajectories in artificial neural networks (ANNs) has provided new insights into the types of processes which may account for the ability of learning devices to acquire and represent language, without appealing to traditional linguistic concepts [5]-[7]. Despite the remarkable advances which have come out of connectionist research (e.g. 
[8]), and the now common use of recurrent networks, and Simple Recurrent Networks (SRNs) [9] especially, in the study of language (e.g. [10]), recurrent neural networks suffer from particular problems which make them imperfectly suited to language tasks. The vast majority of work in this field employs small networks and datasets (usually artificial), and although many interesting linguistic issues may be thus tackled, real progress in evaluating the potentials of state trajectories and graded 'grammaticality' to uncover the underlying processes responsible for overt linguistic phenomena must inevitably be limited whilst the experimental tasks remain so small. Nevertheless, there are certain obstacles to the scaling-up of networks trained by back-propagation (BP). Such networks tend towards ever longer training times as the sizes of the input set and of the network increase, and although Real-Time Recurrent Learning (RTRL) and Back-propagation Through Time are potentially better at modeling temporal dependencies, training times are longer still [11]. Scaling-up is also difficult due to the potential for catastrophic interference and lack of adaptivity and stability [12]-[14]. Other problems include the rapid loss of information about past events as the distance from the present increases [15] and the dependence of learned state trajectories not only on the training data, but also upon such vagaries as initial weight vectors, making their analysis difficult [16]. Other types of learning device also suffer problems. 
Standard Markov models require the allocation of memory for every n-gram, such that large values of n are impractical; variable-length Markov models are more memory-efficient, but become unmanageable when trained on large data sets [17]. Two important, related concerns in cognitive linguistics are thus (a) to find a method which allows language models to be scaled up, which is similar in spirit to recurrent neural networks, but which does not encounter the same problems of scale, and (b) to use such a method to evince new insights into graded grammaticality from the state trajectories which arise given genuinely large, naturally-occurring data sets.\n\nAccordingly, we present a new method of generating state trajectories which avoids most of these problems. Previously studied in a financial prediction task, the method creates a fractal map of the training data, from which state machines are built. The resulting models are known as Prediction Fractal Machines (PFMs) [18] and have some useful properties. 
The state trajectories in the fractal representation are fast and computationally efficient to generate, and are accurate and well-understood; it may be inferred that, even for very large vocabularies and training sets, catastrophic interference and lack of adaptivity and stability will not be a problem, given the way in which representations are built (demonstrating this is a topic for future work); training times are significantly less than for recurrent networks (in the experiments described below, the smallest models took a few minutes to build, while the largest ones took only around three hours; in comparison, all of the ANNs took longer - up to a day - to train); and there is little or no loss of information over the course of an input sequence (allowing for the finite precision of the computer). The scalability of the PFM was taken advantage of by training on a large corpus of naturally-occurring text. This enabled an assessment of what potential new insights might arise from the use of this method in truly large-scale language tasks.\n\n2 Prediction Fractal Machines (PFMs)\n\nA brief description of the method of creating a PFM will now be given. Interested readers should consult [1], since space constraints preclude a detailed examination here. The key idea behind our predictive model is a transformation F of symbol sequences from an alphabet (here, tagset) {1, 2, ..., N} into points in a hypercube H = [0, 1]^D. The dimensionality D of the hypercube H should be large enough for each symbol 1, 2, ..., N to be identified with a unique vertex of H. The particular assignment of symbols to vertices is arbitrary. The transformation F has the crucial property that symbol sequences sharing the same suffix (context) are mapped close to each other. 
Specifically, the longer the common suffix shared by two sequences, the smaller the (Euclidean) distance between their point representations. The transformation F used in this study corresponds to an Iterative Function System [19] consisting of N affine maps $i : H \\to H$, $i = 1, 2, \\ldots, N$,\n\n$$i(x) = \\frac{1}{2}(x + t_i), \\quad t_i \\in \\{0, 1\\}^D, \\quad t_i \\neq t_j \\text{ for } i \\neq j. \\qquad (1)$$\n\nGiven a sequence $s_1 s_2 \\ldots s_L$ of L symbols from the alphabet 1, 2, ..., N, we construct its point representation as\n\n$$F(s_1 s_2 \\ldots s_L) = s_L(s_{L-1}(\\ldots(s_2(s_1(x^*)))\\ldots)), \\qquad (2)$$\n\nwhere $x^* = \\{\\frac{1}{2}\\}^D$ is the center of the hypercube H. (Note that, as is common in the Iterative Function Systems literature, i refers either to a symbol or to a map, depending upon the context.) PFMs are constructed on point representations of subsequences appearing in the training sequence. First, we slide a window of length L > 1 over the training sequence. At each position we transform the sequence of length L appearing in the window into a point. The set of points obtained by sliding through the whole training sequence is then partitioned into several classes by k-means vector quantization (in the Euclidean space), each class represented by a particular codebook vector. The number of codebook vectors required is chosen experimentally. Since quantization classes group points lying close together, sequences having point representations in the same class (potentially) share long suffixes. The quantization classes may then be treated as prediction contexts, and the corresponding predictive symbol probabilities computed by sliding the window over the training sequence again and counting, for each quantization class, how often a sequence mapped to that class was followed by a particular symbol. 
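The point representation described above can be illustrated with a short sketch. The Python below is an illustration only, not the authors' implementation: the helper names make_vertices and point_representation are hypothetical, and taking D as the smallest dimension with at least N vertices is an assumption (the text only requires that D be large enough). It applies the contraction x -> (x + t_s) / 2 once per symbol, starting from the center of the hypercube, and checks that two windows sharing a long suffix land closer together than two windows that differ in their final symbol.

```python
from itertools import product
from math import ceil, dist, log2


def make_vertices(n_symbols):
    # Hypothetical helper: give each symbol 1..N a distinct vertex of [0, 1]^D.
    # Assumption: D = ceil(log2(N)), the smallest dimension with enough
    # vertices. The particular assignment is arbitrary, as the text notes.
    dim = max(1, ceil(log2(n_symbols)))
    corners = list(product((0.0, 1.0), repeat=dim))
    return {s: corners[s - 1] for s in range(1, n_symbols + 1)}


def point_representation(sequence, vertices):
    # Start at the center of the hypercube and apply one contraction
    # x -> (x + t_s) / 2 per symbol, oldest symbol first, so the most
    # recent symbol has the largest influence on the final position.
    dim = len(next(iter(vertices.values())))
    x = [0.5] * dim
    for s in sequence:
        t = vertices[s]
        x = [(xi + ti) / 2.0 for xi, ti in zip(x, t)]
    return tuple(x)


vertices = make_vertices(17)  # 17 tags, as in the experiments below
a = point_representation([3, 9, 4, 7, 2], vertices)
b = point_representation([11, 9, 4, 7, 2], vertices)  # same last four symbols as a
c = point_representation([3, 9, 4, 7, 5], vertices)   # differs from a in the last symbol
print(dist(a, b) < dist(a, c))
```

Because each map halves distances, two windows sharing a suffix of length k are at most sqrt(D) / 2^k apart, which is why nearby points can reasonably be grouped into prediction contexts.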
In test mode, upon seeing a new sequence of L symbols, the transformation F is again performed, the closest quantization center found, and the corresponding predictive probabilities used to predict the next symbol.\n\n3 An experimental comparison of PFMs and recurrent networks\n\nThe performance of the PFM was compared against that of an RTRL-trained recurrent network on a next-tag prediction task. Sixteen grammatical tags and a 'sentence start' character were used. The models were trained on a concatenated sequence (22781 tags) of the top three-quarters of each of the 14 sub-corpora of the University of Pennsylvania 'Brown' corpus (footnote 1). The remainder was used to create test data, as follows. Because in a large training corpus of naturally-occurring data, contexts in most cases have more than one possible correct continuation, simply counting correctly predicted symbols is insufficient to assess performance, since this fails to count correct responses which are not targets. The extent to which the models distinguished between grammatical and ungrammatical utterances was therefore additionally measured by generating minimal pairs and comparing their negative log likelihoods (NLLs) per symbol with respect to the model. Likelihood is computed by sliding through the test sequence and for each window position, determining the probability of the symbol that appears immediately beyond it. As processing progresses, these probabilities are multiplied. The negative of the natural logarithm is then taken and divided by the number of symbols. 
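The NLL-per-symbol measure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: nll_per_symbol and toy_model are hypothetical names, the toy probabilities are invented for the example, and the sum is divided by the number of predicted symbols (one reading of 'the number of symbols'). Logs are summed rather than probabilities multiplied, which gives the same value while avoiding underflow on long sequences.

```python
from math import log


def nll_per_symbol(sequence, window, next_symbol_probs):
    # Slide a window over the sequence; at each position look up the
    # probability the model assigns to the symbol just beyond the window.
    total = 0.0
    count = 0
    for start in range(len(sequence) - window):
        context = tuple(sequence[start:start + window])
        target = sequence[start + window]
        total += log(next_symbol_probs(context)[target])
        count += 1
    if count == 0:
        raise ValueError('sequence must be longer than the window')
    # Negative natural log of the product, averaged over predicted symbols.
    return -total / count


def toy_model(context):
    # Hypothetical stand-in for a trained model: after symbol 1 it strongly
    # expects symbol 2, otherwise it is uniform over the three symbols.
    if context[-1] == 1:
        return {1: 0.1, 2: 0.8, 3: 0.1}
    return {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}


expected_order = [1, 2, 1, 2, 1, 2]
swapped_order = [1, 2, 2, 1, 1, 2]  # two adjacent symbols switched
print(nll_per_symbol(expected_order, 2, toy_model) < nll_per_symbol(swapped_order, 2, toy_model))
```

A model that assigns zero probability to an observed symbol would make the logarithm undefined; how such cases are handled is not stated in the text, so the sketch leaves them unaddressed.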
Significant differences in NLLs are much harder to achieve between members of minimal pairs than between grammatical and random sequences, and are therefore a good measure of model validity. Minimal pairs generated by theoretically-motivated manipulations tend to be no longer ungrammatical given a small tagset, because the removal of grammatical sub-classes necessarily also removes a large amount of information. Manipulations were therefore performed by switching the positions of two symbols in each sentence in the test sets. Symbols switched could be any distance apart within the sentence, as long as the resulting sentence was ungrammatical under all surface instantiations. By changing as little as possible to make the sentence ungrammatical, the goal was retained that the task of distinguishing between grammatical and ungrammatical sequences be as difficult as possible. The test data then consisted of 28 paired grammatical/ungrammatical test sets (around 570 tags each), plus an ungrammatical, 'meaningless' test set containing all 17 codes listed several times over, used to measure baseline performance.\n\nFootnote 1: http://www.ldc.upenn.edu/\n\nTen 1st-order randomly-initialised networks were trained for 100 epochs using RTRL. The networks consisted of 1 input and 1 output layer, each with 17 units corresponding to the 17 tags, 2 hidden layers, each with 10 units, and 1 context layer of 10 units connected to the first hidden layer. The second hidden layer was used to increase the flexibility of the maps between the hidden representations in the recurrent portion and the tag activations at the output layer. 
A logistic sigmoid activation function was used, the learning rate and momentum were set to 0.05, and the training sequence was presented at the rate of one tag per clock tick. The PFMs were derived by clustering the fractal representation of the training data ten times for various numbers of codebook vectors between 5 and 200. More experiments were performed using PFMs than neural networks because in the former case, experience in choosing appropriate numbers of codebook vectors was initially lacking for this type of data.\n\nThe results which follow are given as averages, either over all neural networks, or else over all PFMs derived from a given number of codebook vectors. The networks correctly predicted 36.789% and 32.667% of next tags in the grammatical and ungrammatical test sets, respectively. The PFMs matched this performance at around 30 codebook vectors (37.134% and 32.814% respectively), and exceeded it for higher numbers of vectors (39.515% and 34.388% respectively at 200 vectors). The networks generated mean NLLs per symbol of 1.966 and 2.182 for the grammatical and ungrammatical test sets, respectively (a difference of 0.216) and 4.157 for the 'meaningless' test set (the difference between NLLs for grammatical and 'meaningless' data = 2.191). The PFMs matched this difference in NLLs at 40 codebook vectors (NLL grammatical = 1.999, NLL ungrammatical = 2.217; difference = 0.218). The NLL for the 'meaningless' data at 40 codebook vectors was 6.075 (difference between NLLs for grammatical and 'meaningless' data = 4.076). The difference between NLLs for grammatical and ungrammatical, and for grammatical and 'meaningless' data sets, became even larger with increased numbers of codebook vectors. 
The difference in performance between grammatical and ungrammatical test sets was thus highly significant in all cases (p < .0005): all the models distinguished what was grammatical from what was not. This conclusion is supported by the fact that the mean NLLs for the 'meaningless' test set were always noticeably higher than those for the minimal pair sets.\n\n4 Discussion\n\nThe PFMs exceeded the performance of the networks for larger numbers of codebook vectors, but it is possible that networks with more hidden nodes would also do better. In terms of ease of use, however, as well as in their scaling-up potential, PFMs are certainly superior. Their other great advantage is that the representations created are dependable (see section 1), making hypothesis creation and testing not just more rapid, but also more straightforward: the speed with which PFMs may be trained made it possible to make statistically significant observations for a large number of clustering runs. In the introduction, 'graded' wellformedness was spoken of as being productive of new hypotheses about the nature of language. Our use of minimal pairs, designed to make a clear-cut distinction between grammatical and ungrammatical utterances, appears to leave this issue to one side. But in reality, our results were rather pertinent to it, as the use of the likelihood measure might indeed imply. The Brown corpus consists of subcorpora representative of 14 different discourse types, from fiction to government documents. Whereas traditional notions of grammaticality would lead us to treat all of the 'ungrammatical' sentences in the minimal pair test sets as equally ungrammatical, the NLLs in our experiments tell a different story. 
The grammatical versions consistently had a lower associated NLL (higher probability) than the ungrammatical versions, but the difference between these was much smaller than that between the 'meaningless' data and either the grammatical or the ungrammatical data. This supports the concept of 'graded grammaticality', and NLLs for 'meaningless' data such as ours might be seen as a sort of benchmark by which to measure all lesser degrees of ungrammaticality. (Note incidentally that the PFMs appear to associate with the 'meaningless' data a significantly higher NLL than did the networks, even though the difference between the NLLs of the grammatical and ungrammatical data was the same. This is suggestive of PFMs having greater powers of discrimination between grades of wellformedness than the recurrent networks used, but further research will be needed to ascertain the validity of this.) Moreover, the NLL varied not just between grammatical and ungrammatical test sets, but also from sentence to sentence, from word to word and from discourse style to discourse style. While it increased, often dramatically, when the manipulated portion of an ungrammatical sentence was encountered, some words in grammatical sentences exhibited a similar effect: thus, if a subsequence in a well-formed utterance occurs only rarely - or never - in a training set, it will have a high associated NLL in the same way as an ungrammatical one does. This is likely to happen even for very large corpora, since some grammatical structures are very rare. This is consistent with recent findings that, during human sentence processing, well-formedness is linked to conformity with expectation [20] as measured by CLOZE scores. Interesting also was the remarkable variation in NLL between discourse styles. 
Although the mean NLL across all discourse styles (test sets) is lower for the grammatical than for the ungrammatical versions, it cannot be guaranteed that the grammatical version of one test set will have a lower NLL than the ungrammatical version of another. Indeed, the grammatical and ungrammatical NLLs interleave, as may be observed in figure 1, which shows the NLLs for the three discourse styles which lie at the bottom, middle and top of the range. Even more interestingly, if the NLLs for the grammatical versions of all discourse styles are ordered according to where they lie within this range, it becomes clear that NLL is a predictor of discourse style. Styles which linguists class as 'formal', e.g. those of the Learned and Government Document test sets, have the lowest NLLs, with the three Press test sets clustering just above, and the Fiction test sets, exemplifying creative language use, clustering at the high end. Similarly, that the Learned and Government test sets have the lowest NLLs conforms with the intuition that their usage lies closest to what is grammatically 'prototypical' - even though in the training set, 6 out of the 14 test sets are fiction and thus might be expected to contribute more to the prototype. That they do not, suggests that usage varies significantly across fiction test sets.\n\n[Figure 1 plot: NLLs associated with grammatical and ungrammatical versions of 3 discourse types (Learned text, Romantic fiction, Science fiction). x-axis: No. of codebook vectors (0 to 200); y-axis: NLL.]\n\nFigure 1: NLLs of minimal pair test sets containing different discourse styles suggest grades of wellformedness based upon prototypicality.\n\n5 Conclusion\n\nWork on the use of PFMs in language modeling is at an early stage, but as results to date show, they have a lot to offer. A much larger project is planned, which will examine further Allen and Seidenberg's hypothesis that 'graded grammaticality' (or wellformedness) applies not only to syntax, but also to other language subdomains such as semantics, an integral part of this being the use of larger corpora and tagsets, and the identification of vertices with semantic/syntactic features rather than atomic symbols. Identifying the possibilities of combining PFMs with ANNs, for example as a means of bypassing the normal method of creating state-space trajectories, is the subject of current study.\n\nAcknowledgments\n\nThis work was supported by the Austrian Science Fund (FWF) within the research project \"Adaptive Information Systems and Modeling in Economics and Management Science\" (SFB 010). The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Science and Transport.\n\nReferences\n\n[1] P. Ti\u00f1o & G. Dorffner (1998). 
Constructing finite-context sources from fractal representations of symbolic sequences. Technical Report TR-98-18, Austrian Research Institute for AI, Vienna.\n[2] J. R. Taylor (1995). Linguistic categorisation: Prototypes in linguistic theory. Clarendon, Oxford.\n[3] J. R. Saffran, R. N. Aslin & E. L. Newport (1996). Statistical cues in language acquisition: Word segmentation by infants. In Proc. of the Cognitive Science Society Conference, 376-380, La Jolla, CA.\n[4] J. Allen & M. S. Seidenberg (in press). The emergence of grammaticality in connectionist networks. In B. MacWhinney (ed.), Emergentist approaches to language: Proc. of the 28th Carnegie Symposium on cognition. Erlbaum.\n[5] S. Parfitt (1997). Aspects of anaphora resolution in artificial neural networks: Implications for nativism. PhD thesis, Imperial College, London.\n[6] D. Servan-Schreiber et al (1989). Graded state machines: The representation of temporal contingencies in Simple Recurrent Networks. In Advances in Neural Information Processing Systems, 643-652.\n[7] W. Tabor & M. Tanenhaus (to appear). Dynamical models of sentence processing. Cognitive Science.\n[8] J. L. Elman et al (1996). Rethinking innateness: A connectionist perspective on development. Bradford.\n[9] J. L. Elman (1990). Finding structure in time. In: Cognitive Science, 14: 179-211.\n[10] S. Lawrence, C. Lee Giles & S. Fong (in press). Natural language grammatical inference with recurrent neural networks. IEEE Trans. on knowledge and data engineering.\n[11] J. Hertz, A. Krogh & R. G. Palmer (1991). Introduction to the theory of neural computation. Addison Wesley.\n[12] M. McCloskey & N. J. Cohen (1989). 
Catastrophic interference in connectionist networks: The sequential learning problem. In G. Bower (ed.), The psychology of learning and motivation, vol 24. Academic, NY.\n[13] J. K. Kruschke (1991). ALCOVE: A connectionist model of human category learning. In R. P. Lippman et al (eds.), Advances in Neural Information Processing 9, 649-655. Kaufmann, San Mateo, CA.\n[14] S. Grossberg (ed.) (1988). Neural networks and natural intelligence. Bradford, MIT, Cambs, MA.\n[15] Y. Bengio, P. Simard & P. Frasconi (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on neural networks, 5(2).\n[16] M. P. Casey (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite-state machine extraction. Neural Computation, 8(6):1135-1178.\n[17] D. Ron, Y. Singer & N. Tishby (1996). The power of amnesia. Machine Learning, 25.\n[18] P. Ti\u00f1o, B. G. Horne, C. Lee Giles & P. C. Collingwood (1998). Finite state machines and recurrent neural networks - automata and dynamical systems approaches. In J. E. Dayhoff & O. Omidvar (eds.), Neural Networks and Pattern Recognition, 171-220. Academic.\n[19] M. F. Barnsley (1988). Fractals everywhere. Academic, NY.\n[20] S. Coulson, J. W. King & M. Kutas (1998). Expect the unexpected: Responses to morphosyntactic violations. Language and Cognitive Processes, 13(1).\n", "award": [], "sourceid": 1688, "authors": [{"given_name": "Shan", "family_name": "Parfitt", "institution": null}, {"given_name": "Peter", "family_name": "Ti\u00f1o", "institution": null}, {"given_name": "Georg", "family_name": "Dorffner", "institution": null}]}