{"title": "A Recurrent Neural Network for Word Identification from Continuous Phoneme Strings", "book": "Advances in Neural Information Processing Systems", "page_first": 206, "page_last": 212, "abstract": null, "full_text": "A Recurrent Neural Network for Word Identification \n\nfrom Continuous Phoneme Strings \n\nRobert B. Allen \nBellcore \nMorristown, NJ 07962-1910 \n\nCandace A. Kamm \nBellcore \nMorristown, NJ 07962-1910 \n\nAbstract \n\nA  neural  network  architecture  was  designed  for  locating  word  boundaries  and \nidentifying  words  from  phoneme  sequences.  This  architecture  was  tested  in \nthree  sets  of  studies.  First,  a  highly  redundant  corpus  with  a  restricted \nvocabulary was  generated and the network was trained with a limited number of \nphonemic variations for the words  in the corpus.  Tests of network performance \non a transfer set yielded a very low error rate.  In a second study, a network was \ntrained  to  identify  words  from  expert  transcriptions  of speech.  On a  transfer \ntest,  error  rate  for  correct  simultaneous  identification  of  words  and  word \nboundaries was  18%.  The third study used the output of a phoneme classifier as \nthe input to the word and  word boundary identification network.  The error rate \non a transfer test set was 49% for this task.  Overall, these studies provide a first \nstep at identifying words in connected discourse with a neural network. \n\n1  INTRODUCTION \nDuring  the past several  years,  researchers have explored the use of neural  networks  for \nclassifying spectro-temporal speech patterns into phonemes or other sub-word units (e.g., \nHarrison &  Fallside, 1989; Kamm &  Singhal, 1990; Waibel et al.,  1989).  Less effort has \nfocussed on the use of neural nets for identifying words from the phoneme sequences that \nthese  spectrum-to-phoneme  classifiers  might  produce.  
Several recent papers, however, \nhave combined the output of neural network phoneme recognizers with other techniques, \nincluding dynamic time warping (DTW) and hidden Markov models (HMM) (e.g., \nMiyatake et al., 1990; Morgan & Bourlard, 1990). \n\nSimple recurrent neural networks (Allen, 1990; Elman, 1990; Jordan, 1986) have been \nshown to be able to recognize simple sequences of features and have been applied to \nlinguistic tasks such as resolution of pronoun reference (Allen, 1990). We consider \nwhether they can be applied to the recognition of words from phoneme sequences. This \npaper presents the results of three sets of experiments using recurrent neural networks to \nlocate word boundaries and to identify words from phoneme sequences. The three \nexperiments differ primarily in the degree of similarity between the input phoneme \nsequences and the input information that would typically be generated by a \nspectrum-to-phoneme classifier. \n\n2  NETWORK ARCHITECTURE \nThe network architecture is shown in Figure 1. Sentence-length phoneme sequences are \nstepped past the network one phoneme at a time. The input to the network on a given \ntime step within a sequence consists of three 46-element vectors (corresponding to 46 \nphoneme classes) that identify the current phoneme and the two subsequent phonemes. The \nactivation of state unit $S_i$ on the step at time $t$ is a weighted sum of the activation of its \ncorresponding hidden unit ($H_i$) and the state unit's own activation on the previous time step, \nwhere $\\beta$ is the weighting factor for the hidden unit activation and $\\mu$ is the state memory \nweighting factor: $S_{i,t} = \\beta H_{i,t-1} + \\mu S_{i,t-1}$. In this research $\\beta = 1.0$ and $\\mu = 0.5$. 
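
The state-unit update above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name update_state and the list-based vectors are assumptions made for illustration, and only the values of beta and mu come from the text.

```python
# Hypothetical sketch of the state-unit update
#   S[i, t] = beta * H[i, t-1] + mu * S[i, t-1]
# Only the constants below come from the text; all names are illustrative.
BETA = 1.0  # weighting factor for the hidden-unit activation
MU = 0.5    # state memory weighting factor


def update_state(prev_state, prev_hidden, beta=BETA, mu=MU):
    '''Return the new state vector given the previous state and hidden vectors.'''
    if len(prev_state) != len(prev_hidden):
        raise ValueError('state and hidden vectors must have the same length')
    return [beta * h + mu * s for h, s in zip(prev_hidden, prev_state)]


# State units start at zero (they are reset at the end of each sentence).
# With a constant hidden activation of 1.0 the state approaches
# beta / (1 - mu) = 2.0, so older inputs fade geometrically.
state = [0.0]
for _ in range(4):
    state = update_state(state, [1.0])
```

With mu = 0.5, the contribution of a hidden activation halves on every subsequent step, which gives a rough sense of how quickly past context decays in the state units.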
The output of \nthe network consists of one unit for each word in the lexicon and an additional unit \nwhose activation indicates the presence of a word boundary. \nWeights from the hidden units to the word units were updated based on error observed \nonly at phoneme positions that corresponded to the end of a word (Allen, 1988). The end \nof the phoneme sequence was padded with codes representing silence. State unit \nactivations were reset to zero at the end of each sentence. The network was trained using \na momentum factor of $\\alpha = 0.9$ and an average learning rate of $\\eta = 0.05$. The learning rate \nwas adjusted for each output unit proportionally to the relative frequency of occurrence \nof the word corresponding to that unit. \n\nFigure 1:  Recurrent Network for Word Identification \n(Diagram labels: output units, comprising word units 1, 2, 3, ..., n and a word-boundary unit; \ninputs t, t+1, and t+2; 138 input units = 46 phoneme classes x 3 time steps.) \n\n3  EXPERIMENT 1: DICTIONARY TRANSCRIPTIONS \n\n3.1  PROCEDURE \nA corpus was constructed from a vocabulary of 72 words. The words appeared in a \nvariety of training contexts across sentences and the sentences were constrained to a very \nsmall set of syntactic constructions. The vocabulary set included a subset of rhyming \nwords. Transcriptions of each word were obtained from Webster's Seventh Collegiate \nDictionary and from an American-English orthographic-phonetic dictionary (Shoup, \n1973). These transcriptions describe words in isolation only and do not reflect \ncoarticulations that occur when the sentences are spoken. For this vocabulary, 26 of the \nwords had one pronunciation, 19 had two variations, and the remaining 27 had from 3 to \n17 variations. The corpus consisted of 18504 sentences of about 7 words each. 
6000 of \nthese sentences were randomly selected and reserved as transfer sentences. \n\nThe input to the network was a sequence of 46-element vectors designed to emulate the \nactivations that might be obtained from a neural network phoneme classifier with 46 \noutput classes (Kamm and Singhal, 1990). Since a phoneme classifier is likely to \ngenerate a set of phoneme candidates for each position in a sequence, we modified the \nactivations in each input vector to mimic this expected situation. Confusion data were \nobtained from a neural network classifier trained to map spectro-temporal input to an \noutput activation vector for the 46 phoneme classes. In this study, input activations for \nphonemes that accounted for fewer than 5% of the confusions with the correct phoneme \nremained set to 0.0, while input activations for phonemes accounting for higher \nproportions of the total confusions with the correct phoneme were set to twice those \nproportions, with an upper limit of 1.0. This resulted in relatively high activation levels \nfor one to three elements and activation of 0.0 for the others. Overall, the network had \n138 (46x3) input units, 80 hidden units, 80 state units, and 73 output units (one for each \nword and one boundary detection unit). The network was trained for 50000 sentence \nsequences chosen randomly (with replacement) from the training corpus. Each sequence \nwas prepared with randomly selected transcriptions from the available dialectal \nvariations. \n\n3.2  RESULTS \nIn all the experiments discussed in this paper, performance of the network was analyzed \nusing a sequential decision strategy. First, word boundaries were hypothesized at all \nlocations where the activation of the boundary unit exceeded a predefined threshold (0.5). 
\nThen, the activations of the word units were scanned and the unit with the highest \nactivation was selected as the identified word. By comparing the locations of true word \nboundaries with hypothesized boundaries, a false alarm rate (i.e., the number of spurious \nboundaries divided by the number of non-boundaries) was computed. Word error rate \nwas then computed by dividing the number of incorrect words at correctly-detected word \nboundaries by the total number of words in the transfer set. This word error rate includes \nboth deletions (i.e., missed boundaries) and word substitutions (i.e., incorrectly-identified \nwords at correct boundaries). Total error rate is obtained by summing the word error rate \nand the false alarm rate. \nOn the 6000 sentence test set, the network correctly located 99.3% of the boundaries, \nwith a word error rate of 1.7% and a false alarm rate of 0.3%. Overall, this yielded a total \nerror rate of 2.0%. To further test the robustness of this procedure to noisy input, three \nnetworks were trained with the same procedures as above except that the input phoneme \nsequences were distorted. In the first network, there was a 30% chance that input \nphonemes would be duplicated. In a second network, there was a 30% chance that input \nphonemes would be deleted. In the third network, there was a 70% chance that an input \nphoneme would be substituted with another closely-related phoneme. Total error rates \nwere 11.7% for the insertion network, 20.9% for the deletion network, and 10.0% for the \nsubstitution network. \nEven with these fairly severe distortions, the network is moderately successful at the \nboundary detection/word identification task. Experiment 2 was designed to study \nnetwork performance on a more diverse and realistic training set. 
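
The sequential decision strategy and the error measures of Section 3.2 can be sketched as follows. This is an illustrative reading of the scoring procedure, not the authors' code: the function names, the dictionary of counts, and the per-position data layout are assumptions, and the sketch presumes at least one word and one non-boundary position.

```python
# Hypothetical sketch of the scoring described in Section 3.2.
# All names and the data layout are assumptions made for illustration.

def score_sequence(boundary_act, word_act, true_words, threshold=0.5):
    '''Count outcomes for one phoneme sequence.

    boundary_act: boundary-unit activation at each phoneme position.
    word_act:     one list of word-unit activations per position.
    true_words:   index of the true word at word-final positions, else None.
    '''
    counts = {'words': 0, 'non_boundaries': 0,
              'false_alarms': 0, 'word_errors': 0}
    for b_act, w_act, truth in zip(boundary_act, word_act, true_words):
        hypothesized = b_act > threshold
        if truth is None:
            counts['non_boundaries'] += 1
            if hypothesized:
                counts['false_alarms'] += 1  # spurious boundary
            continue
        counts['words'] += 1
        best = max(range(len(w_act)), key=w_act.__getitem__)
        if not hypothesized:
            counts['word_errors'] += 1  # deletion: missed boundary
        elif best != truth:
            counts['word_errors'] += 1  # substitution at a correct boundary
    return counts


def error_rates(counts):
    '''Return (false alarm rate, word error rate, total error rate).'''
    false_alarm = counts['false_alarms'] / counts['non_boundaries']
    word_error = counts['word_errors'] / counts['words']
    return false_alarm, word_error, false_alarm + word_error


# Toy sequence: five phoneme positions, two words (indices 0 and 1).
counts = score_sequence(
    boundary_act=[0.1, 0.9, 0.7, 0.2, 0.8],
    word_act=[[0.1, 0.1], [0.8, 0.2], [0.3, 0.3], [0.2, 0.1], [0.6, 0.4]],
    true_words=[None, 0, None, None, 1],
)
rates = error_rates(counts)
```

In the toy sequence, position 2 is a false alarm and the word at position 4 is a substitution. Note that the two rates use different denominators (non-boundary positions and words), so the total error rate is a simple sum of the two, as in the text.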
\n\n4  EXPERIMENT 2: EXPERT TRANSCRIPTIONS OF SPEECH \n\n4.1  PROCEDURE \nTo provide a richer and more natural sample of transcriptions of continuous speech, \ntraining sequences were derived from phonetic transcriptions of 207 sentences from the \nDARPA Acoustic-Phonetic (AP) speech corpus (TIMIT) (Lamel et al., 1986). The \ntraining set consisted of 4-5 utterances of each of 50 sentences, spoken by different \ntalkers. One other utterance of each sentence (spoken by different talkers) was used for \ntransfer tests. This corpus contained 277 unique words. For training, the transcripts were \nalso segmented into words. When a word boundary was spanned by a single phoneme \nbecause of coarticulation (for example, the transcription /haejr/ for the phrase \"had \nyour\"), the coarticulated phoneme (in this example, /j/) was arbitrarily assigned to only \nthe first word of the phrase. These transcriptions differ from those used in Experiment 1 \nprimarily in the amount of phonemic variation observed at word boundaries, and so \nprovide a more difficult boundary detection task for the network. As in Experiment 1, \nthe input to the network on any time step was a set of three 46-element vectors. The \noriginal input vectors (obtained from the phonetic transcriptions) were modified based on \nthe phoneme confusion data described in Section 3.1. The network had 138 (3x46) input \nunits, 80 hidden units, 80 state units, and 278 output units. \n\n4.2  RESULTS \nAfter training on 80000 sentence sequences randomly selected from the 207 sentence \ntraining set (approximately 320,000 weight updates), the network was tested on the 50 \nsentence transfer set. With a threshold for the boundary detection unit of 0.5, 
the \nnetwork was 87.5% correct at identifying word boundaries and had a false alarm rate of \n2.3%. The word error rate was 15.5%. Thus, using the sequential decision strategy, the \ntotal error rate was 17.8%. \n\nConsidering all word boundaries (i.e., not just the correctly-detected boundaries), word \nidentification was 90.3% correct when the top candidate only (i.e., the output unit with \nthe highest activation) was evaluated and 96.3% for the top three word choices. Because \nthere were instances where boundaries were not detected, but the word unit activations \nindicated the correct word, a decision strategy that simultaneously considered both the \nactivation of the word boundary unit and the activation of the word units was also \nexplored. However, the distributions of the word unit activations at non-boundary \nlocations (i.e., within words) and at word boundaries overlapped significantly, so this \nstrategy was unsuccessful at improving performance. In retrospect, this result is not very \nsurprising, since the network was not trained for word identification at non-boundaries. \n\nMany of the transfer cases were similar to the training sentences, but some interesting \ngeneralizations were observed. For example, the word \"beautiful\" always appeared in \nthe training set as /bjuf;)fi/, but its appearance in the test set as /bjuf;)t1J1/ was correctly \nidentified. That is, the variation in the final syllable did not prevent correct identification \nof the word boundary or the word, despite the fact that the network had seen other \ninstances of the phoneme sequence /Ul/ in the words \"woolen\" and \"football\" (second \nsyllable). Of the 135 word transcriptions in the transfer set that were unique (i.e., they \ndid not appear in the training set), 
the net correctly identified 72% based on the top \ncandidate and 85% within the top 3 candidates. Not surprisingly, performance for the \n275 words in the transfer set with non-unique transcriptions was higher, with 96% correct \nfor the top candidate and 98% for the top 3 choices. \nThere was evidence that the network occasionally made use of phoneme context beyond \nword boundaries to distinguish among confusable transcriptions. For example, the words \n\"an\", \"and\", and \"in\" all appeared in the transfer set on at least one occasion as /~n/, but \neach was correctly identified. However, many word identification errors were confusions \nbetween words with similar final phonemes (e.g., confusing \"she\" with \"be\", \"pewter\" \nwith \"order\"). This result suggests that, in some instances, the model is not making \nsufficient use of prior context. \n\n5  EXPERIMENT 3: MACHINE-GENERATED TRANSCRIPTIONS \n\n5.1  CORPUS AND PROCEDURE \n\nIn this experiment, the input to the network was obtained by postprocessing the output of \na spectrum-to-phoneme neural network classifier to produce sequences of phoneme \ncandidates. The spectrum-to-phoneme classifier (Kamm and Singhal, 1990) generates a \nseries of 46-element vectors of output activations corresponding to 46 phoneme classes. \nThe spectro-temporal input speech patterns are stepped past the classifier in 5-ms \nincrements, and the classifier generates output vectors on each step. Since phonemes \ntypically have average durations longer than 5 ms, a postprocessing stage was required to \ncompress the output of the classifier into a sequence of phoneme candidates appropriate \nfor use as input to the boundary detection/word identification neural network. \nThe training set was a subset of the DARPA A-P corpus consisting of 2 sentences spoken \nby each of 106 talkers. 
The postprocessor provided output sequences for the training \nsentences that were quite noisy, inserting 2233 spurious phonemes and deleting 581 of \nthe 7848 phonemes identified by expert transcription. Furthermore, in 2914 instances, the \nphoneme candidate with highest average activation was not the correct phoneme. \nHowever, this result was not unexpected, since the postprocessing heuristics are still \nunder development. The primary purpose for using the postprocessor output was to \nprovide a difficult input set for the boundary detection/word identification network. \nAfter postprocessing, the highest-activation phoneme candidate sequences were mapped \nto the input vectors for the boundary detection/word identification network as follows: \nthe vector elements corresponding to the three highest-activation phoneme candidate \nclasses were set to their corresponding average activation values, and the other 43 \nphoneme classes were set to 0.0 activation. The network had 138 (i.e., 46x3) input units, \n40 hidden units, 40 state units and 22 output units (21 words and 1 boundary unit). The \nnetwork was trained for 40000 sentence sequences and then tested on a transfer set \nconsisting of each sentence spoken by a new set of 105 talkers. The sequences in the \ntransfer set were also quite noisy, with 2162 inserted phoneme positions and 775 of 7921 \nphonemes deleted. Further, the top candidate was not the correct phoneme in 3175 \npositions. \n\n5.2  RESULTS \nThe boundary detection performance of this network was 56%, much poorer than for the \nnetworks with less noisy input. 
Since the network sometimes identified the word \nboundary at a slightly different phoneme position than had been initially identified, we \nimplemented a more lenient scoring criterion that scored a \"correct\" boundary detection \nwhenever the activation of the boundary unit exceeded the threshold criterion at the true \nboundary position or at the position immediately preceding the true boundary. Even with \nthis looser scoring criterion, only 65% of the boundaries in the transfer set were correctly \ndetected using a boundary detection threshold of 0.5. The false alarm rate was 9% and \nthe word error rate was 40%, yielding a total error rate of 49%. This is much larger \nthan the error rate for the network in Experiment 2. This difference may be explained in \npart by the presence of insertions in the input stream in this experiment as compared to \nExperiment 2, which had no insertions. The results of Experiment 2 indicated that this \nrecurrent architecture has a limited capacity for considering past information (as \nevidenced by substitution errors such as \"she\" for \"be\" and \"pewter\" for \"order\"). As a \nresult, poorer performance might be expected when the information required for word \nboundary detection or word identification spans longer input sequences, as occurs when \nthe input contains extra vectors representing inserted phonemes. \n\n5.3  NON-RECURRENT NETWORK \nTo evaluate the utility of the recurrent network architecture for this task, a simple \nnon-recurrent network was trained using the same training set. In addition to the t, t+1 and \nt+2 input slots (Fig. 
1), the non-recurrent network also included t-1 and t-2 slots, in an \nattempt to match some of the information about past context that may be available to the \nrecurrent network through the state unit activations. Thus, the input required 230 (i.e., \n5x46) input units. The network had no state units, 40 hidden units and 22 output units, \nand was trained through 40000 sentence sequences. On the transfer set, using a boundary \ndetection threshold of 0.5, 75% of the word boundaries were correctly detected, with a \nfalse alarm rate of 31%. The word error rate was 60%. Thus, the recurrent net performed \nconsistently better than this non-recurrent network both in terms of fewer false alarms \nand fewer word errors, despite the fact that the non-recurrent network had more weights. \nThese results suggest that recurrence is important for the boundary and word \nidentification task with this noisy input set. \n\n6  DISCUSSION AND FUTURE DIRECTIONS \nThe current results suggest that this neural network model may provide a way of \nintegrating lower-level spectrum-to-phoneme classification networks and \nphoneme-to-word classification networks for automatic speech recognition. The results of these \ninitial experiments are moderately encouraging, demonstrating that this architecture can \nbe used successfully for boundary detection with moderately large (200-word) and noisy \ncorpora, although performance drops significantly when the input stream has many \ninserted and deleted phonemes. Furthermore, these experiments demonstrate the \nimportance of recurrence. \nMany unresolved questions about the application of this model for the word \nboundary/word identification task remain. 
The performance of this model needs to be \ncompared with that of other techniques for locating and identifying words from phoneme \nsequences (for example, the two-level dynamic programming algorithm described by \nLevinson et al., 1990). \nWord-identification performance of the model (based on the output class with highest \nactivation) is far from perfect, suggesting that additional strategies are needed to improve \nperformance. First, word identification errors substituting \"she\" for \"be\" and \"pewter\" for \n\"order\" suggest that the network sometimes uses information only from one or two \nprevious time steps to make word choices. Efforts to extend the persistence of \ninformation in the state units beyond this limit may improve performance, and may be \nespecially helpful when the input is corrupted by phoneme insertions. Another possible \nstrategy for improving performance would be to use the locations and identities of words \nwhose boundaries and identities can be hypothesized with high certainty as \"islands of \nreliability\". These anchor points could then help determine whether the word choice at a \nless certain boundary location is a reasonable one, based on features like word length (in \nphonemes) or semantic or syntactic constraints. In addition, an algorithm that considers \nmore than just the top candidate at each hypothesized word position and that uses \nsemantic and syntactic constraints for reducing ambiguity might prove more robust \nthan the single-best-choice word identification strategy used in the current experiments. \nSchemes that attempt to identify word sequences without specifically locating word \nboundaries should be explored. 
The question of whether this network architecture will \nscale to successfully handle still larger corpora and realistic applications also requires \nfurther study. These unresolved issues notwithstanding, the current work demonstrates \nthe feasibility of an integrated neural-based system for performing several levels of \nprocessing of speech, from spectro-temporal pattern classification to word identification. \n\nReferences \n\nAllen, R. B. Sequential connectionist networks for answering simple questions about a microworld. \nProceedings of the Cognitive Science Society, 489-495, 1988. \n\nAllen, R. B. Connectionist language users. Connection Science, 2, 279-311, 1990. \n\nElman, J. L. Finding structure in time. Cognitive Science, 14, 179-211, 1990. \n\nHarrison, T. and Fallside, F. A connectionist model for phoneme recognition in continuous speech. \nProc. ICASSP 89, 417-420, 1989. \n\nJordan, M. I. Serial order: A parallel distributed processing approach. (Tech. Rep. No. 8604). San \nDiego: University of California, Institute for Cognitive Science, 1986. \n\nKamm, C. and Singhal, S. Effect of neural network input span on phoneme classification. Proc. \nIJCNN June 1990, 1, 195-200, 1990. \n\nLamel, L., Kassel, R. and Seneff, S. Speech database development: Design and analysis of the \nacoustic-phonetic corpus. Proc. DARPA Speech Recognition Workshop, 100-109, 1986. \n\nLevinson, S. E., Ljolje, A. and Miller, L. G. Continuous speech recognition from a phonetic \ntranscription. Proc. ICASSP-90, 93-96, 1990. \n\nMiyatake, M., Sawai, H., Minami, Y. and Shikano, H. Integrated training for spotting Japanese \nphonemes using large phonemic time-delay neural networks. Proc. ICASSP 90, 449-452, 1990. \n\nMorgan, N. and Bourlard, H. Continuous speech recognition using multilayer perceptrons with \nhidden Markov models. Proc. ICASSP 90, 413-416, 1990. \n\nShoup, J. E. 
American English Orthographic-Phonemic Dictionary. NTIS Report AD763784, \n1973. \n\nWaibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K. Phoneme recognition using \ntime-delay neural networks. IEEE Trans. ASSP, 37, 328-339, 1989. \n\nWebster's Seventh Collegiate Dictionary. Springfield, MA: Merriam Company, 1972. \n", "award": [], "sourceid": 372, "authors": [{"given_name": "Robert", "family_name": "Allen", "institution": null}, {"given_name": "Candace", "family_name": "Kamm", "institution": null}]}