{"title": "Speech Recognition Using Demi-Syllable Neural Prediction Model", "book": "Advances in Neural Information Processing Systems", "page_first": 227, "page_last": 233, "abstract": null, "full_text": "Speech Recognition Using Demi-Syllable Neural Prediction Model \n\nKen-ichi Iso and Takao Watanabe \n\nC & C Information Technology Research Laboratories \nNEC Corporation \n4-1-1 Miyazaki, Miyamae-ku, Kawasaki 213, JAPAN \n\nAbstract \n\nThe Neural Prediction Model is a speech recognition model based on pattern prediction by multilayer perceptrons. Its effectiveness was confirmed by speaker-independent digit recognition experiments. This paper presents an improvement in the model and its application to large vocabulary speech recognition based on subword units. The improvement is the introduction of \"backward prediction,\" which further improves the prediction accuracy of the original model, which used only \"forward prediction\". In applying the model to speaker-dependent large vocabulary speech recognition, the demi-syllable is used as a subword recognition unit. Experimental results indicated a 95.2% recognition accuracy on a 5000 word test set and confirmed the effectiveness of the proposed model improvement and of the demi-syllable subword units. \n\n1 INTRODUCTION \n\nThe Neural Prediction Model (NPM) is a speech recognition model based on pattern prediction by multilayer perceptrons (MLPs). Its effectiveness was confirmed by speaker-independent digit recognition experiments (Iso, 1989; Iso, 1990; Levin, 1990). \n\nThe advantages of the NPM approach are as follows. The underlying process of speech production can be regarded as a nonlinear dynamical system. 
Therefore, it is expected that there are causal relations among adjacent speech feature vectors. In the NPM, this causality is represented by the nonlinear prediction mapping F_w, \n\n  â_t = F_w(a_{t-1}),    (1) \n\nwhere a_t is the speech feature vector at frame t, â_t is its prediction, and the subscript w represents the mapping parameters. This causality is not explicitly considered in the conventional speech recognition model, where adjacent speech feature vectors are treated as independent variables. \n\nAnother important characteristic of the model is its applicability to continuous speech recognition. By concatenating the recognition unit models, continuous speech recognition and model training from continuous speech can be implemented without the need for segmentation. \n\nThis paper presents an improvement in the NPM and its application to large vocabulary speech recognition based on subword units. The improvement is the introduction of \"backward prediction,\" which further improves the prediction accuracy of the original model, which used only \"forward prediction\". In Section 2, the improved predictor configuration and the NPM recognition and training algorithms are described in detail. Section 3 presents the definition of the demi-syllables used as subword recognition units. Experimental results obtained from speaker-dependent large vocabulary speech recognition are described in Section 4. \n\n2 NEURAL PREDICTION MODEL \n\n2.1 MODEL CONFIGURATION \n\nFigure 1 shows the MLP predictor architecture. It is given two groups of feature vectors as input. One is the feature vectors for \"forward prediction\"; the other is the feature vectors for \"backward prediction\". The former consists of the input speech feature vectors a_{t-TF}, ..., a_{t-1}, which were already used in the original formulation. 
The latter, a_{t+1}, ..., a_{t+TB}, are introduced in this paper to further improve the prediction accuracy over the original method, which used only \"forward prediction\". This is expected, for example, to improve the prediction accuracy for voiceless stop consonants, which are characterized by a closure interval followed by a sudden release. The MLP output, â_t, is used as the predicted feature vector for the input speech feature vector a_t. \n\nFigure 1: Multilayer perceptron predictor (forward and backward prediction input frames, hidden layer h_t, output layer â_t) \n\nThe difference between the input speech feature vector a_t and its prediction â_t is the prediction error. It can also be regarded as an error function for MLP training, based on the back-propagation technique. \n\nThe NPM for a recognition class, such as a subword unit, is constructed as a state transition network, where each state has an MLP predictor as described above (Figure 2). This configuration is similar in form to the Hidden Markov Model (HMM), in which each state has a vector emission probability distribution (Rabiner, 1989). The concatenation of these subword NPMs enables continuous speech recognition. \n\nFigure 2: Neural Prediction Model \n\n2.2 RECOGNITION ALGORITHM \n\nThis section presents the continuous speech recognition algorithm based on the NPM. The concatenation of subword NPMs, which is also a state transition network, is used as a reference model for the input speech. Figure 3 shows a diagram of the recognition algorithm. In recognition, the input speech is divided into segments, whose number is equal to the total number of states in the concatenated NPMs (= N). 
Each state makes a prediction for the corresponding segment. The local prediction error between the input speech at frame t and the n-th state is given by \n\n  d_t(n) = || a_t - â_t(n) ||^2,    (2) \n\nwhere n is the consecutive number of the state in the concatenated NPM and â_t(n) is the prediction made by the n-th state's MLP. The accumulation of local prediction errors defines the global distance between the input speech and the concatenated NPMs, \n\n  D = min_{n_t} Σ_{t=1}^{T} d_t(n_t),    (3) \n\nwhere n_t denotes the state number used for prediction at frame t, and the sequence {n_1, n_2, ..., n_t, ..., n_T} determines the segmentation of the input speech. The minimization means that the optimal segmentation, which gives the minimum accumulated prediction error, should be selected. This optimization problem can be solved by dynamic programming. As a result, the DP recursion formula is obtained: \n\n  g_t(n) = d_t(n) + min { g_{t-1}(n), g_{t-1}(n-1) }.    (4) \n\nAt the end of the recursive application of Equation (4), D = g_T(N) is obtained. Backtracking the result provides the input speech segmentation. \n\nFigure 3: Recognition algorithm based on DP (the input speech a_1 a_2 a_3 ... a_t ... a_T is aligned to the MLP sequence MLP 1, MLP 2, ..., MLP N) \n\nIn this algorithm, temporal distortion of the input speech is efficiently absorbed by DP based time alignment between the input speech and the MLP sequence. For simplicity, the reference model topology shown above is limited to a sequence of MLPs with no branches. It is obvious that the algorithm is applicable to more general topologies with branches. 
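The recognition procedure of Equations (2)-(4) can be sketched in code. The following is a minimal illustration, not the authors' implementation: the per-state MLP predictors are replaced by hypothetical fixed prediction vectors, so the local error d_t(n) reduces to a squared Euclidean distance, while the DP recursion and backtracking follow Equation (4) for a left-to-right model with no state skips.

```python
def local_error(frame, pred):
    # d_t(n) of Eq. (2): squared Euclidean distance between the input
    # frame a_t and the n-th state's prediction (a fixed vector here).
    return sum((f - p) ** 2 for f, p in zip(frame, pred))

def dp_align(frames, predictors):
    # Eq. (3)/(4): g_t(n) = d_t(n) + min(g_{t-1}(n), g_{t-1}(n-1)).
    # Returns (D, {n_t}): the global distance D = g_T(N) and the
    # optimal segmentation recovered by backtracking.
    T, N = len(frames), len(predictors)
    INF = float('inf')
    g = [[INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    g[0][0] = local_error(frames[0], predictors[0])
    for t in range(1, T):
        for n in range(N):
            stay = g[t - 1][n]                       # remain in state n
            move = g[t - 1][n - 1] if n > 0 else INF  # advance from n-1
            if min(stay, move) == INF:
                continue
            g[t][n] = local_error(frames[t], predictors[n]) + min(stay, move)
            back[t][n] = n if stay <= move else n - 1
    seg = [N - 1]                                    # backtrack from g_T(N)
    for t in range(T - 1, 0, -1):
        seg.append(back[t][seg[-1]])
    seg.reverse()
    return g[T - 1][N - 1], seg
```

For instance, aligning four frames to a two-state model whose predictions match the first and second halves of the input exactly yields D = 0 with the first two frames assigned to state 1 and the last two to state 2.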
\n\n2.3 TRAINING ALGORITHM \n\nThis section presents a training algorithm for estimating NPM parameters from continuous utterances. The training goal is to find the set of MLP predictor parameters which minimizes the accumulated prediction error over the training utterances. The objective function for the minimization is defined as the average of the accumulated prediction errors over all training utterances, \n\n  D̄ = (1/M) Σ_{m=1}^{M} D(m),    (5) \n\nwhere M is the number of training utterances and D(m) is the accumulated prediction error between the m-th training utterance and its concatenated NPM, whose expression is given by Equation (3). The optimization can be carried out by an iterative procedure combining the dynamic programming (DP) and back-propagation (BP) techniques. The algorithm is given as follows: \n\n1. Initialize all MLP predictor parameters. \n2. Set m = 1. \n3. Compute the accumulated prediction error D(m) by DP (Equation (4)) and determine the optimal segmentation {n_t*} by backtracking. \n4. Correct the parameters of each MLP predictor by BP, using the optimal segmentation {n_t*}, which determines the desired output a_t for the actual output â_t(n_t*) of the n_t*-th MLP predictor. \n5. Increase m by 1. \n6. Repeat 3 - 5 while m ≤ M. \n7. Repeat 2 - 6 until convergence occurs. \n\nA convergence proof for this iterative procedure was given in (Iso, 1989; Iso, 1990). It can be intuitively understood from the fact that both DP and BP decrease the accumulated prediction error and that they are applied successively. \n\n3 DEMI-SYLLABLE RECOGNITION UNITS \n\nIn applying the model to large vocabulary speech recognition, the demi-syllable is used as a subword recognition unit (Yoshida, 1989). 
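The iterative DP plus BP procedure of Section 2.3 can be sketched as follows. This is a simplified illustration under strong assumptions, not the paper's implementation: each state's MLP predictor is reduced to a single constant prediction vector, so the BP step in Step 4 degenerates to a plain gradient step on the squared prediction error of Equation (2), and the alignment step re-implements the DP recursion of Equation (4) to keep the sketch self-contained.

```python
def sq_err(x, y):
    # Squared prediction error, Eq. (2), for constant 'predictors'.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def align(frames, preds):
    # DP recursion of Eq. (4) with backtracking; returns {n_t}.
    T, N = len(frames), len(preds)
    INF = float('inf')
    g = [[INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    g[0][0] = sq_err(frames[0], preds[0])
    for t in range(1, T):
        for n in range(N):
            stay = g[t - 1][n]
            move = g[t - 1][n - 1] if n > 0 else INF
            if min(stay, move) == INF:
                continue
            g[t][n] = sq_err(frames[t], preds[n]) + min(stay, move)
            back[t][n] = n if stay <= move else n - 1
    seg = [N - 1]
    for t in range(T - 1, 0, -1):
        seg.append(back[t][seg[-1]])
    return seg[::-1]

def train(utterances, models, epochs=20, lr=0.5):
    # Steps 1-7 of Section 2.3: alternate DP segmentation with a
    # gradient ('BP') update of each predictor toward its aligned frames.
    for _ in range(epochs):
        for frames, state_ids in utterances:   # one training utterance
            seg = align(frames, [models[s] for s in state_ids])
            for t, n in enumerate(seg):        # step toward desired output a_t
                p = models[state_ids[n]]
                for i in range(len(p)):
                    p[i] += lr * (frames[t][i] - p[i])
    return models
```

With repeated passes, each predictor drifts toward the frames that DP assigns to its state, so the accumulated prediction error of Equation (5) decreases, mirroring the intuition behind the convergence argument above.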
The demi-syllable is a half-syllable unit, obtained by dividing a syllable at the center of the syllable nucleus. It can treat contextual variations caused by the co-articulation effect with a moderate number of units. The units consist of consonant-vowel (CV) and vowel-consonant (VC) segments. Word models are made by concatenating demi-syllable NPMs, as described in the transcription dictionary. Their segmentation boundaries are basically defined as a consonant start point and a vowel center point (Figure 4). In practice, they are determined automatically by the training algorithm, based on the minimum accumulated prediction error criterion (Section 2.3). \n\nFigure 4: Demi-syllable unit boundary definition (illustrated with the word [hakata]) \n\n4 EXPERIMENTS \n\n4.1 SPEECH DATA AND MODEL CONFIGURATION \n\nIn order to examine the validity of the proposed model, speaker-dependent Japanese isolated word recognition experiments were carried out. Phonetically balanced 250, 500 and 750 word sets were selected from a Japanese word lexicon as training vocabularies. For the word recognition experiments, a 250 word test set was prepared. All the words in the test set were different from those in the training sets. A Japanese male speaker uttered these word sets in a quiet environment. The speech data was sampled at a 16 kHz sampling rate and analyzed with a 10 msec frame period. As the feature vector for each time frame, 10 mel-scaled cepstral parameters, 10 mel-scaled delta cepstral parameters and a changing ratio parameter for amplitude were calculated from the FFT based spectrum. \n\nThe NPMs for the demi-syllable units were prepared. 
Their total number was 241, where each demi-syllable NPM consists of a sequence of four MLP predictors, except for the silence and long vowel NPMs, which have one MLP predictor each. Every MLP predictor has 20 hidden units and 21 output units, corresponding to the feature vector dimensions. The numbers of input speech feature vectors, denoted by TF for the forward prediction and by TB for the backward prediction in Figure 1, were chosen for two configurations, (TF, TB) = (2, 1) and (3, 0). The former, Type A, uses both the forward and backward predictions, while the latter, Type B, uses the forward prediction only. \n\n4.2 WORD RECOGNITION EXPERIMENTS \n\nAll possible combinations between training data amounts (250, 500 and 750 words) and MLP input layer configurations (Type A and Type B) were evaluated by 5000 word recognition experiments. \n\nTo reduce the computational load of the 5000 word recognition experiments, the similar word recognition method described below was employed. For every word in the 250 word recognition vocabulary, a 100 similar word set is chosen from the 5000 word recognition vocabulary, using a distance based on a manually defined phoneme confusion matrix. In the experiments, every word in the 250 word utterances is compared with its 100 similar word set. It has been confirmed that a result approximately equivalent to actual 5000 word recognition can be obtained by this similar word recognition method (Koga, 1989). \n\nFigure 5: Recognition accuracy vs. training data amount (Type A and Type B plotted against the 250, 500 and 750 word training sets) \n\nThe results of the 5000 word recognition experiments are shown in Figure 5. Consistently higher recognition accuracies were obtained for the input layer configuration with backward prediction (Type A) than for the configuration without it (Type B), and the absolute recognition accuracies become higher as the amount of training data increases. \n\n5 DISCUSSION AND CONCLUSION \n\nThis paper has presented an improvement in the Neural Prediction Model (NPM), the introduction of backward prediction, and its application to large vocabulary speech recognition based on demi-syllable units. The experimental results verified the applicability of the NPM to large vocabulary (5000 word) speech recognition. This suggests the usefulness of the recognition and training algorithms for concatenated subword unit NPMs, which require no segmentation. 
A related result (90% for 924 words) was reported in (Tebelskis, 1990), where the subword units (phonemes) were limited to a subset of the complete Japanese phoneme set and duration constraints were introduced heuristically. In this paper, the authors used demi-syllable units, which can cover any Japanese utterance, and no duration constraints. The high recognition accuracy (95.2%) obtained for 5000 words indicates the advantages of using demi-syllable units and of introducing backward prediction into the NPM. \n\nAcknowledgements \n\nThe authors wish to thank the members of the Media Technology Research Laboratory for their continuous support. \n\nReferences \n\nK. Iso. (1989), \"Speech Recognition Using Neural Prediction Model,\" IEICE Technical Report, SP89-23, pp. 81-87 (in Japanese). \nK. Iso and T. Watanabe. (1990), \"Speaker-Independent Word Recognition Using a Neural Prediction Model,\" Proc. ICASSP-90, S8.8, pp. 441-444. \nE. Levin. (1990), \"Word Recognition Using Hidden Control Neural Architecture,\" Proc. ICASSP-90, S8.6, pp. 433-436. \nJ. Tebelskis and A. Waibel. (1990), \"Large Vocabulary Recognition Using Linked Predictive Neural Networks,\" Proc. ICASSP-90, S8.7, pp. 437-440. \nL. R. Rabiner. (1989), \"A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,\" Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, February 1989. \nK. Yoshida, T. Watanabe and S. Koga. (1989), \"Large Vocabulary Word Recognition Based on Demi-Syllable Hidden Markov Model Using Small Amount of Training Data,\" Proc. ICASSP-89, S1.1, pp. 1-4. \nS. Koga, K. Yoshida and T. Watanabe. (1989), \"Evaluation of Large Vocabulary Speech Recognition Based on Demi-Syllable HMM,\" Proc. of ASJ Autumn Meeting (in Japanese). 
\n\n\f", "award": [], "sourceid": 360, "authors": [{"given_name": "Ken-ichi", "family_name": "Iso", "institution": null}, {"given_name": "Takao", "family_name": "Watanabe", "institution": null}]}