{"title": "SARDNET: A Self-Organizing Feature Map for Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 584, "abstract": "", "full_text": "SARDNET: A Self-Organizing Feature Map for Sequences\n\nDaniel L. James and Risto Miikkulainen\n\nDepartment of Computer Sciences\nThe University of Texas at Austin\nAustin, TX 78712\n\ndljames,risto@cs.utexas.edu\n\nAbstract\n\nA self-organizing neural network for sequence classification called\nSARDNET is described and analyzed experimentally. SARDNET\nextends the Kohonen Feature Map architecture with activation retention\nand decay in order to create unique distributed response\npatterns for different sequences. SARDNET yields extremely dense\nyet descriptive representations of sequential input in very few training\niterations. The network has proven successful on mapping arbitrary\nsequences of binary and real numbers, as well as phonemic\nrepresentations of English words. Potential applications include\nisolated spoken word recognition and cognitive science models of\nsequence processing.\n\n1 INTRODUCTION\n\nWhile neural networks have proved a good tool for processing static patterns, classifying\nsequential information has remained a challenging task. The problem involves\nrecognizing patterns in a time series of vectors, which requires forming a good internal\nrepresentation for the sequences. Several researchers have proposed extending\nthe self-organizing feature map (Kohonen 1989, 1990), a highly successful static\npattern classification method, to sequential information (Kangas 1991; Samarabandu\nand Jakubowicz 1990; Scholtes 1991). Below, three of the most recent of\nthese networks are briefly described. 
The remainder of the paper focuses on a new\narchitecture designed to overcome the shortcomings of these approaches.\n\nRecently, Chappel and Taylor (1993) proposed the Temporal Kohonen Map (TKM)\narchitecture for classifying sequences. The TKM keeps track of the activation history\nof each node by updating a value called leaky integrator potential, inspired by\nthe membrane potential in biological neural systems. The activity of a node depends\nboth on the current input vector and the previous input vectors, represented by the\nnode's potential. A given sequence is processed by mapping one vector at a time,\nand the last winning node serves to represent the entire sequence. This way, there\nneeds to be a separate node for every possible sequence, which is a disadvantage\nwhen the number of sequences to be classified is large. The TKM also suffers from\nloss of context. Which node wins depends almost entirely upon the most recent\ninput vectors. For example, the string baaaa would most likely map to the same\nnode as aaaaa, making the approach applicable only to short sequences.\n\nThe SOFM-S network proposed by van Harmelen (1993) extends TKM such that\nthe activity of each map node depends on the current input vector and the past\nactivation of all map nodes. The SOFM-S is an improvement of TKM in that contextual\ninformation is not lost as quickly, but it still uses a single node to represent\na sequence.\n\nThe TRACE feature map (Zandhuis 1992) has two feature map layers. The first\nlayer is a topological map of the individual input vectors, and is used to generate\na trace (i.e. path) of the input sequence on the map. The second layer then maps\nthe trace pattern to a single node. 
In TRACE, the sequences are represented by\ndistributed patterns on the first layer, potentially allowing for larger capacity, but\nit is difficult to encode sequences where the same vectors repeat, such as baaaa. All\na-vectors would be mapped on the same unit in the first layer, and any number of\na-vectors would be indistinguishable.\n\nThe architecture described in this paper, SARDNET (Sequential Activation Retention\nand Decay NETwork), also uses a subset of map nodes to represent the\nsequence of vectors. Such a distributed approach allows a large number of representations\nto be \"packed\" into a small map, like sardines. In the following sections,\nwe will examine how SARDNET differs from conventional self-organizing maps and\nhow it can be used to represent and classify a large number of complex sequences.\n\n2 THE SARDNET ARCHITECTURE\n\nInput to SARDNET consists of a sequence of n-dimensional vectors S =\nV_1, V_2, V_3, ..., V_l (figure 1). The components of each vector are real values in\nthe interval [0,1]. For example, each vector might represent a sample of a speech\nsignal in n different frequencies, and the entire sequence might constitute a spoken\nword. The SARDNET input layer consists of n nodes, one for each component in\nthe input vector, and their values are denoted as A = (a_1, a_2, a_3, ..., a_n). The map\nconsists of m × m nodes with activation η_jk, 1 ≤ j, k ≤ m. Each node has an\nn-dimensional input weight vector w_jk, which determines the node's response to\nthe input activation.\n\nIn a conventional feature map network as well as in SARDNET, each input vector\nis mapped on a particular unit on the map, called the winner or the maximally\nresponding unit. 
In SARDNET, however, once a node wins an input, it is made\nineligible to respond to the subsequent inputs in the sequence. This way a different\nmap node is allocated for every vector in the sequence. As more vectors come in,\nthe activation of the previous winners decays. In other words, each sequence of\nlength l is represented by l active nodes on the map, with their activity indicating\nthe order in which they were activated. The algorithm is summarized in table 1.\n\n[Figure 1 diagram; labels: sequence of input vectors S, previous winners, winning unit jk, input weight vector w_jk]\n\nFigure 1: The SARDNET architecture. A sequence of input vectors activates\nunits on the map one at a time. The past winners are excluded from further\ncompetition, and their activation is decayed gradually to indicate position in the\nsequence.\n\nINITIALIZATION: Clear all map nodes to zero.\nMAIN LOOP: While not end of sequence\n1. Find unactivated weight vector that best matches the input.\n2. Assign 1.0 activation to that unit.\n3. Adjust weight vectors of the nodes in the neighborhood.\n4. Exclude the winning unit from subsequent competition.\n5. Decrement activation values for all other active nodes.\nRESULT: Sequence representation = activated nodes ordered by activation values\n\nTable 1: The SARDNET training algorithm.\n\nAssume the maximum length of the sequences we wish to classify is l, and each input\nvector component can take on p possible values. Since there are p^n possible input\nvectors, l p^n map nodes are needed to represent all possible vectors in all possible\npositions in the sequence, and a distributed pattern over the l p^n nodes can be used\nto represent all p^{nl} different sequences. 
This approach offers a significant advantage\nover methods in which p^{nl} nodes would be required for p^{nl} sequences.\n\nThe specific computations of the SARDNET algorithm are as follows: The winning\nnode (j, k) in each iteration is determined by the Euclidean distance D_jk between the\ninput vector A and the node's weight vector w_jk:\n\nD_jk = Σ_{i=1}^{n} (w_{jk,i} - a_i)^2. (1)\n\nThe unit with the smallest distance is selected as the winner and activated with 1.0.\nThe weights of this node and all nodes in its neighborhood are changed according\nto the standard feature map adaptation rule:\n\nΔw_{jk,i} = α (a_i - w_{jk,i}), (2)\n\nwhere α denotes the learning rate. As usual, the neighborhood starts out large\nand is gradually decreased as the map becomes more ordered. As the last step in\nprocessing an input vector, the activation η_jk of all active units in the map is\ndecayed in proportion to the decay parameter d:\n\n0 < d < 1. (3)\n\nAs in the standard feature map, as the weight vectors adapt, input vectors gradually\nbecome encoded in the weight vectors of the winning units. Because weights are\nchanged in local neighborhoods, neighboring weight vectors are forced to become\nas similar as possible, and eventually the network forms a topological layout of the\ninput vector space. In SARDNET, however, if an input vector occurs multiple times\nin the same input sequence, it will be represented multiple times on the map as well.\nIn other words, the map representation expands those areas of the input space that\nare visited most often during an input sequence.\n\n3 EXPERIMENTS\n\nSARDNET has proven successful in learning and recognizing arbitrary sequences\nof binary and real numbers, as well as sequences of phonemic representations for\nEnglish words. 
This section presents experiments on mapping three-syllable words.\nThis data was selected because it shows how SARDNET can be applied to complex\ninput derived from a real-world task.\n\n3.1 INPUT DATA\n\nThe phonemic word representations were obtained from the CELEX database of the\nMax Planck Institute for Psycholinguistics and converted into International Phonetic\nAlphabet (IPA)-compliant representation, which better describes similarities\namong the phonemes. The words vary from five to twelve phonemes in length. Each\nphoneme is represented by five values: place, manner, sound, chromacity and sonority.\nFor example, the consonant p is represented by a single vector (bilabial, stop,\nunvoiced, nil, nil), or in terms of real numbers, (.125, .167, .750, 0, 0). The diphthong\nsound ai, as in \"buy\", is represented by the two vectors (nil, vowel, voiced,\nfront, low) and (nil, vowel, voiced, front-center, hi-mid), or in real numbers,\n(0, 1, .25, .2, 1) and (0, 1, .25, .4, .286).\n\nFigure 2: Accuracy of SARDNET for different map and data set sizes.\nThe accuracy is measured as a percentage of unique representations out of all word\nsequences.\n\nThere are a total of 43 phonemes in this data set, including 23 consonants and 20\nvowels. To represent all phonemic sequences of length 12, TKM and SOFM-S would\nneed to have 45^12 ≈ 6.9 × 10^19 map nodes, whereas SARDNET would need only 45 × 12\n= 540 nodes. Of course, only a very small subset of the possible sequences actually\noccur in the data. Three data sets consisting of 713, 988, and 1628 words were used\nin the experiments. 
If the maximum number of occurrences of phoneme i in any\nsingle sequence is c_i, then the number of nodes SARDNET needs is C = Σ_{i=1}^{N} c_i,\nwhere N is the number of phonemes. This number of nodes will allow SARDNET\nto map each phoneme in each sequence to a unit with an exact representation of\nthat phoneme in its weights. Calculated this way, SARDNET should scale up very\nwell with the number of words: it would need 81 nodes for representing the 713\nword set, 84 for the 988 set and 88 for the 1628 set.\n\n3.2 DENSENESS AND ACCURACY\n\nA series of experiments with the above three data sets and maps of 16 to 81\nnodes was run to see how accurately SARDNET can represent the sequences.\nSelf-organization was quite fast: each simulation took only about 10 epochs, with\nα = 0.45 and the neighborhood radius decreasing gradually from 5-1 to zero. Figure\n2 shows the percentage of unique representations for each data set and map\nsize.\n\nSARDNET shows remarkable representational power: accuracy for all sets is better\nthan 97.7%, and SARDNET manages to pack 1592 unique representations even\non the smallest 16-node map. Even when there are not enough units to represent\neach phoneme in each sequence exactly, the map is sometimes able to \"reuse\" units\nto represent multiple similar phonemes. For example, assume units with exact\nrepresentations for the phonemes a and b exist somewhere on the map, and the\ninput data does not contain pairs of sequences such as aba-abb, in which it is\ncrucial to distinguish the second a from the second b. In this case, the second\noccurrence of both phonemes could be represented by the same unit with a weight\nvector that is the average of a and b. 
This is exactly what the map is doing: it is\nfinding the most descriptive representation of the data, given the available resources.\n\nNote that it would be possible to determine the needed C = Σ_{i=1}^{N} c_i phoneme\nrepresentation vectors directly from the input data set, and without any learning or\na map structure at all, establish distributed representations on these vectors with\nthe SARDNET algorithm. However, feature map learning is necessary if the number\nof available representation vectors is less than C. The topological organization of\nthe map allows finding a good set of reusable vectors that can stand for different\nphonemes in different sequences, making the representation more efficient.\n\n3.3 REPRESENTING SIMILARITY\n\nNot only are the representations densely packed on the map, they are also descriptive\nin the sense that similar sequences have similar representations. Figure 3 shows the\nfinal activation patterns on the 36-unit, 713-word map for six example words. The\nfirst two words, \"misplacement\" and \"displacement,\" sound very similar, and are\nrepresented by very similar patterns on the map. Because there is only one m in\n\"displacement\", it is mapped on the same unit as the initial m of \"misplacement.\"\nNote that the two m's are mapped next to each other, indicating that the map is\nindeed topological, and small changes in the input cause only small changes in\nthe map representation. Note also how the units in this small map are reused to\nrepresent several different phonemes in different contexts.\n\nThe other examples in figure 3 display different types of similarities with \"misplacement\".\nThe third word, \"miscarried\", also begins with \"mis\", and shares that\nsubpart of the representation exactly. 
Similarly, \"repayment\" shares a similar tail\nand \"pessimist\" the subsequence \"mis\" in a different part of the word. Because they\nappear in a different context, these subsequences are mapped on slightly different\nunits, but still very close to their positions with \"misplacement.\" The last word,\n\"burundi,\" sounds very different, as its representation on the map indicates.\n\nSuch descriptive representations are important when the map has to represent information\nthat is incomplete or corrupted with noise. Small changes in the input\nsequence cause small changes in the pattern, and the sequence can still be recognized.\nThis property should turn out extremely important in real-world applications\nof SARDNET, as well as in cognitive science models where confusing similar patterns\nwith each other is often plausible behavior.\n\n4 DISCUSSION AND FUTURE RESEARCH\n\nBecause the sequence representations on the map are distributed, the number of\npossible sequences that can be represented in m units is exponential in m, instead\nof linear as in most previous sequential feature map architectures. This denseness\ntogether with the tendency to map similar sequences to similar representations\nshould turn out useful in real-world applications, which often require scale-up to\nlarge and noisy data sets. For example, SARDNET could form the core of an\nisolated word recognition system. The word input would be encoded in duration-normalized\nsequences of sound samples such as a string of phonemes, or perhaps\nrepresentations of salient transitions in the speech signal. It might also be possible\nto modify SARDNET to form a more continuous trajectory on the map so that\nSARDNET itself would take care of variability in word duration. 
For example, a\nsequence of redundant inputs could be reduced to a single node if all these inputs\nfall within the same neighborhood.\n\nFigure 3: Example map representations.\n\nEven though the sequence representations are dense, they are also descriptive. Category\nmemberships are measured not by labels of the maximally responding units,\nbut by the differences in the response patterns themselves. This sort of distributed\nrepresentation should be useful in cognitive systems where sequential input must\nbe mapped to an internal static representation for later retrieval and manipulation.\nSimilarity-based reasoning on sequences should be easy to implement, and\nthe sequence can be easily recreated from the activity pattern on the map.\n\nGiven part of a sequence, SARDNET may also be modified to predict the rest of\nthe sequence. This can be done by adding lateral connections between the nodes\nin the map layer. The lateral connections between successive winners would be\nstrengthened during training. Thus, given part of a sequence, one could follow the\nstrongest lateral connections to complete the sequence.\n\n5 CONCLUSION\n\nSARDNET is a novel feature map architecture for classifying sequences of input\nvectors. Each sequence is mapped on a distributed representation on the map,\nmaking it possible to pack a remarkably large number of category representations\non a small feature map. The representations are not only dense, they also represent\nthe similarities of the sequences, which should turn out useful in cognitive science\nas well as real-world applications of the architecture. 
\n\nAcknowledgments\n\nThanks to Jon Hilbert for converting CELEX data into the International Phonetic\nAlphabet format used in the experiments. This research was supported in part by\nthe National Science Foundation under grant #IRI-9309273.\n\nReferences\n\nChappel, G. J., and Taylor, J. G. (1993). The temporal Kohonen map. Neural\nNetworks, 6:441-445.\n\nKangas, J. (1991). Time-dependent self-organizing maps for speech recognition.\nIn Proceedings of the International Conference on Artificial Neural Networks\n(Espoo, Finland), 1591-1594. Amsterdam; New York: North-Holland.\n\nKohonen, T. (1989). Self-Organization and Associative Memory. Berlin; Heidelberg;\nNew York: Springer. Third edition.\n\nKohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78:1464-1480.\n\nSamarabandu, J. K., and Jakubowicz, O. G. (1990). Principles of sequential feature\nmaps in multi-level problems. In Proceedings of the International Joint\nConference on Neural Networks (Washington, DC), vol. II, 683-686. Hillsdale,\nNJ: Erlbaum.\n\nScholtes, J. C. (1991). Recurrent Kohonen self-organization in natural language\nprocessing. In Proceedings of the International Conference on Artificial Neural\nNetworks (Espoo, Finland), 1751-1754. Amsterdam; New York: North-Holland.\n\nvan Harmelen, H. (1993). Time dependent self-organizing feature map for speech\nrecognition. Master's thesis, University of Twente, Enschede, the Netherlands.\n\nZandhuis, J. A. (1992). Storing sequential data in self-organizing feature maps.\nInternal Report MPI-NL-TG-4/92, Max-Planck-Institute für Psycholinguistik,\nNijmegen, the Netherlands. 
\n\n\f", "award": [], "sourceid": 936, "authors": [{"given_name": "Daniel", "family_name": "James", "institution": null}, {"given_name": "Risto", "family_name": "Miikkulainen", "institution": null}]}