{"title": "Multi-State Time Delay Networks for Continuous Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 135, "page_last": 142, "abstract": "", "full_text": "Multi-State Time Delay Neural Networks for Continuous Speech Recognition \n\nPatrick Haffner \n\nCNET Lannion A TSSIRCP \n22301 LANNION, FRANCE \n\nhaffner@lannion.cnet.fr \n\nAlex Waibel \n\nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nahw@cs.cmu.edu \n\nAbstract \n\nWe present the \"Multi-State Time Delay Neural Network\" (MS-TDNN) as an extension of the TDNN to robust word recognition. Unlike most other hybrid methods, the MS-TDNN embeds an alignment search procedure into the connectionist architecture, and allows for word level supervision. The resulting system has the ability to manage the sequential order of subword units, while optimizing for the recognizer performance. In this paper we present extensive new evaluations of this approach over speaker-dependent and speaker-independent connected alphabet recognition. \n\n1  INTRODUCTION \n\nClassification based Neural Networks (NN) have been successfully applied to phoneme recognition tasks. Extending those classification capabilities to word recognition is an important research direction in speech recognition. However, connectionist architectures do not model time alignment properly, and they have to be combined with a Dynamic Programming (DP) alignment procedure to be applied to word recognition. Most of these \"hybrid\" systems (Bourlard, 1989) take advantage of the powerful and well tried probabilistic formalism provided by Hidden Markov Models (HMM) to make use of a reliable alignment procedure. However, the use of this HMM formalism strongly limits one's choice of word models and classification procedures. \n\nMS-TDNNs, which do not use this HMM formalism, 
suggest new ways to design speech recognition systems in a connectionist framework. Unlike most hybrid systems where connectionist procedures replace some parts of a pre-existing system, MS-TDNNs are designed from scratch as a global Neural Network that performs word recognition. No bootstrapping is required from an HMM, and we can apply learning procedures that correct the recognizer's errors explicitly. These networks have been successfully tested on difficult word recognition tasks, such as speaker-dependent connected alphabet recognition (Haffner et al., 1991a) and speaker-independent telephone digit recognition (Haffner and Waibel, 1991b). Section 2 presents an overview of hybrid Connectionist/HMM architectures and training procedures. Section 3 describes the MS-TDNN architecture. Section 4 presents our novel training procedure. In section 5, MS-TDNNs are tested on speaker-dependent and speaker-independent continuous alphabet recognition. \n\n2  HYBRID SYSTEMS \n\nHMMs are currently the most efficient and commonly used approach for large speech recognition tasks: their modeling capacity, however limited, fits many speech recognition problems fairly well (Lee, 1988). The main limit to the modeling capacity of HMMs is the fact that trainable parameters must be interpretable in a probabilistic framework to be reestimated using the Baum-Welch algorithm with the Maximum Likelihood Estimation training criterion (MLE). \n\nConnectionist learning techniques used in NNs (generally error back-propagation) allow for a much wider variety of architectures and parameterization possibilities. Unlike HMMs, NNs model discrimination surfaces between classes rather than the complete input/output distributions: their parameters are only trained to minimize some error criterion. 
This gain in data modeling capacity, associated with a more discriminant training procedure, has permitted improved performance on a number of speech tasks, especially those in which modeling sequential information is not necessary. For instance, Time Delay Neural Networks have been applied, with high performance, to phoneme classification (Waibel et al., 1989). To extend this performance to word recognition, one has to combine a front-end NN with a procedure performing time alignment, usually based on DP. A variety of alignment procedures and training methods have been proposed for those \"hybrid\" systems. \n\n2.1  TIME ALIGNMENT \n\nTo take into account the time distortions that may appear within its boundaries, a word is generally modeled by a sequence of states $(1, \\ldots, s, \\ldots, N)$ that can have variable durations. The score of a word in the vocabulary accumulates frame-level scores which are a function of the output $Y(t) = (Y_1(t), \\ldots, Y_I(t))$ of the front-end NN: \n\n$$O = \\sum_{s=1}^{N} \\sum_{t=T_s}^{T_{s+1}-1} \\mathrm{Score}_s(Y(t)) \\qquad (1)$$ \n\nThe DP algorithm finds the optimal alignment $\\{T_1, \\ldots, T_{N+1}\\}$ which maximizes this word score. A variety of Score functions have been proposed for Eq.(1). They are most often treated as likelihoods, to apply the probabilistic Viterbi alignment algorithm. \n\n2.1.1  NN outputs probabilities \n\nOutputs of classification based NNs have been shown to approximate Bayes probabilities, provided that they are trained with a proper objective function (Bourlard, 1989). For instance, we can train our front-end NN to output, at each time frame, state probabilities that can be used by a Viterbi alignment procedure (to each state s there corresponds an NN output i(s)). 
Eq.(1) gives the resulting word log(likelihood) as a sum of frame-level log(likelihoods) which are written\u00b9: \n\n$$\\mathrm{Score}_s(Y(t)) = \\log(Y_{i(s)}(t)) \\qquad (2)$$ \n\n2.1.2  Comparing NN output to a reference vector \n\nThe front-end NN can be interpreted as a system remapping the input to a single density continuous HMM (Bengio, 1991). In the case of identity covariance matrices, Eq.(1) gives the log(likelihood) for the k-th word (after Viterbi alignment) as a sum of distances between the NN frame-level output and a reference vector associated with the current state\u00b2: \n\n$$\\mathrm{Score}_s(Y(t)) = \\| Y(t) - Y^s \\|^2 \\qquad (3)$$ \n\nHere, the reference vectors $(Y^1, \\ldots, Y^s, \\ldots, Y^N)$ correspond to the means of Gaussian PDFs, and can be estimated with the Baum-Welch algorithm. \n\n2.2  TRAINING \n\nThe first hybrid models that were proposed (Bourlard, 1989; Franzini, 1991) optimized the state-level NN (with gradient descent) and the word-level HMM (with Baum-Welch) separately. Even though each level of the system may have reached a local optimum of its cost function, training is potentially suboptimal for the given complete system. Global optimization of hybrid connectionist/HMM systems requires a unified training algorithm, which makes use of global gradient descent (Bridle, 1990). \n\n3  THE MS-TDNN ARCHITECTURE \n\nMS-TDNNs have been designed to extend TDNNs classification performance to the word level, within the simplest possible connectionist framework. Unlike the hybrid methods presented in the previous section, the HMM formalism is not taken as the underlying framework here, but many of the models developed within this formalism are applicable to MS-TDNNs. \n\n3.1  FRAME-LEVEL TDNN ARCHITECTURE \n\nAll the MS-TDNN architectures described in this paper use the front-end TDNN architecture (Waibel et al., 1989), shown in Fig.1, at the state level. 
Each unit of the first hidden layer receives input from a 3-frame window of input coefficients. Similarly, each unit in the second hidden layer receives input from a 5-frame window of outputs of the first hidden layer. At this level of the system (2nd hidden layer), the network produces, at each time frame, the scores for the desired phonetic features. Phoneme recognition TDNNs are trained in a time-shift invariant way by integrating over time the output of a single state. \n\nFigure 1: Frame-Level TDNN \n\nFigure 2: MS-TDNN (state scores are copied from the 2nd hidden layer of the TDNN) \n\n1.  State prior probabilities would add a constant term to Eq.(2) \n2.  State transition probabilities add an offset to Eq.(3) \n\n3.2  BASELINE MS-TDNN \n\nWith MS-TDNNs, we have extended the formalism of TDNNs to incorporate time alignment. The front-end TDNN architecture has I output units, whose activations (ranging from 0 to 1) represent the frame-level scores. To each state s corresponds a TDNN output i(s). Different states may share the same output (for instance with phone models). 
The DP procedure, as described in Eq.(1), determines the sequence of states producing the maximum sum of activations\u00b3: \n\n$$\\mathrm{Score}_s(Y(t)) = Y_{i(s)}(t) \\qquad (4)$$ \n\nThe frame-level score used in the MS-TDNN combines the advantage of being simple with that of having a formal description as an extension of the TDNN accumulation process to multiple states. It becomes possible to model the accumulation process as a connectionist word unit that sums the activations from the best sequence of incoming state units, as shown in Fig.2. This is mostly useful during the back-propagation phase: at each time frame, we imagine a virtual connection between the active state unit and the word unit, which is used to backpropagate the error at the word level down to the state level.\u2074 \n\n3.3  EXTENDING MS-TDNNs \n\nIn the previous section, we presented the baseline MS-TDNN architecture. We now present extensions to the word-level architecture, which provide additional trainable parameters. Eq.(4) is extended as: \n\n$$\\mathrm{Score}_s(Y(t)) = \\mathrm{Weight}_i \\cdot Y_{i(s)}(t) + \\mathrm{Bias}_i \\qquad (5)$$ \n\n3.  This equation is not very different from Eq.(2) presented in the previous section; however, all attempts to use $\\log(Y_i(t))$ instead of $Y_i(t)$ have resulted in unstable learning runs that have never converged properly. During the test phase, the two approaches may be functionally not very different. Outputs that affect the error rate in a critical way are mostly those of the correct word and the best incorrect word, especially when they are close. We have observed that frame-level scores which play a key role in discrimination are close to 1.0: the two scores become asymptotically equivalent (up to the constant 1): $\\log(Y_i(t)) \\sim Y_i(t) - 1$. \n4.  The alignment path found by the DP routine during the forward phase is \"frozen\", so that it can be represented as a connectionist accumulation unit during the backward phase. 
The problem is that, after modification of the weights, this alignment path may no longer be the optimal one. Practical consequences of this seem minimal. \n\n$\\mathrm{Weight}_i$ allows the importance of each state belonging to the same word to be weighted differently. We do not have to assume that each part of a speech pattern contains an equal amount of information. \n\n$\\mathrm{Bias}_i$ is analogous to a transition log(probability) in an HMM. \n\nHowever, we have observed that a small variation in the value of those parameters may alter recognition performance a lot. The choice of a proper training procedure is critical. Our gradient back-propagation algorithm has been selected for its efficient training of the parameters of the front-end TDNN; as our training procedure is global, we have also applied it to train $\\mathrm{Weight}_i$ and $\\mathrm{Bias}_i$, but with some difficulty. \n\nIn section 4.1, we show that they are useful to shift the word scores so that a sigmoid function separates the correct words (output 1) properly from the incorrect ones (output 0). \n\n3.4  SEQUENCE MODELS \n\nWe design very simple state sequence models by hand that may use phonetic knowledge (phone models) or may not (word models). \n\nPhone Models: The phonetic representation of a word is transcribed as a sequence of states. As an example shown in Fig.3, the letter 'p' combines 3 phone units. P captures the closure and the burst of this stop consonant. P-IY is a co-articulation unit. The phone IY is recognized in a context independent way. This phone is shared with all the other E-set letters. States are duplicated to enforce minimal phone durations. \n\nFigure 3: Phone Model for 'p' \n\nWord Models: No specific phonemic meaning is associated with the states of a word. Those states cannot be shared with other words. 
\n\nTransition States: One can add specialized transition units that are trained to detect this transition more explicitly: the resulting stabilization in segmentation yields an increase in performance. This method is however sensitive to a good bootstrapping of our system on proper phone boundaries, and has so far only been applied to speaker dependent alphabet recognition. \n\n4  TRAINING \n\nIn many speech recognition systems, a large discrepancy is found between the training procedure and the testing procedure. The training criterion, generally Maximum Likelihood Estimation, is very far from the word accuracy the system is expected to maximize. Good performance depends on a large quantity of data, and on proper modeling. With MS-TDNNs, we suggest optimization procedures which explicitly attempt to minimize the number of word substitutions; this approach represents a move towards systems in which the training objective is maximum word accuracy. The same global gradient back-propagation is applied to the whole system, from the output word units down to the input units. Each desired word is associated with a segment of speech with known boundaries, and this association represents a learning sample. The DP alignment procedure is applied between the known word boundaries. We now describe three training procedures we have applied to MS-TDNNs. \n\n4.1  STANDARD BACK-PROPAGATION WITH SIGMOIDAL OUTPUTS \n\nWord outputs $Q_k = f(W_k \\cdot O_k + B_k)$ are compared to word targets (1 for the desired word, 0 for the other words), and the resulting error is back-propagated. $O_k$ is the DP sum given by Eq.(1) for the k-th word in the vocabulary, $f$ is the sigmoid function, $W_k$ gives the slope of the sigmoid and $B_k$ is a bias term, as shown in Fig.4. 
They are trained so that the sigmoid function separates the correct word (output 1) from the incorrect words (output 0) properly. When the network is trained with the additional parameters of Eq.(5), $\\mathrm{Weight}_i$ and $\\mathrm{Bias}_i$ can account for this sigmoid slope and bias. \n\nMS-TDNNs are applied to word recognition problems where classes are highly confusable. The score of the best incorrect word may be very close to the score of the correct word: in this case, the slope and the bias are parameters which are difficult to tune, and the learning procedure has difficulty attaining the 0 and 1 target values. To overcome those difficulties, we have developed new training techniques which do not require the use of a sigmoid function and of fixed word targets. \n\nFig.4. The sigmoid function (labels: other word scores, best incorrect, correct, slope, bias) \n\n4.2  ON-LINE CORRECTION OF CLASSIFICATION ERRORS \n\nThe testing procedure recognizes the word (or the string of words) with the largest output, and there is an error when this is not the correct word. As the goal of the training procedure is to minimize the number of errors, the \"ideal\" procedure would be, each time a classification error has occurred, to observe where it comes from, and to modify the parameters of the system so that it no longer happens. \n\nThe MS-TDNN has to recognize the correct word CoWo. There is a training error if, for an incorrect word InWo, one has $O_{InWo} > O_{CoWo} - m$. No sigmoid function is needed to compare these outputs; m is an additional margin to ensure the robustness of the training procedure. Only in the event of a training error do we modify the parameters of the MS-TDNN. The word targets are moving (for instance, the target score for an incorrect word is $O_{CoWo} - m$) instead of fixed (0 or 1). 
\n\nThis technique overcomes the difficulties due to the use of an output sigmoid function. Moreover, the number of incorrect words whose output is actually modified is greatly reduced: this is very helpful in training under-represented classes, as the numbers of positive and negative examples become much more balanced. \n\nCompared to the more traditional training technique (with a sigmoid) presented in the previous section, large increases in training speed and word accuracy were observed. \n\n4.3  FUZZY WORD BOUNDARIES \n\nThe training procedures we have presented so far do not take into account the fact that the sample words may come from continuous speech. The main difficulty is that their straightforward extension to continuous speech would not be computationally feasible, as the set of possible training classes would consist of all the possible strings of words. We have adopted a staged approach: we modify the training procedure, so that it matches the continuous recognition conditions more and more closely, while remaining computationally feasible. \n\nThe first step deals with the problem of word boundaries. During training, known word boundaries give additional information that the system uses to help recognition. But this information is not available when testing. To overcome this problem when learning a correct word (noted CoWo), we take as the correct training token the triplet PreWo-CoWo-NexWo (PreWo is the preceding correct word, NexWo is the next correct word in the sentence). All the other triplets PreWo-InWo-NexWo are considered as incorrect. These triplets are aligned between the beginning known boundary of PreWo and the ending known boundary of NexWo. What is important is that no precise boundary information is given for CoWo. 
\n\nThe word classification training criterion presented here only minimizes word substitutions. In connected speech, one has to deal with deletion and insertion errors: procedures to describe them as classification errors are currently being developed. \n\n5  EXPERIMENTS ON CONNECTED ALPHABET \n\nRecognizing spoken letters is considered one of the most challenging small-vocabulary tasks in speech recognition. The vocabulary, consisting of the 26 letters of the American English alphabet, is highly confusable, especially among subsets like the E-set ('B', 'C', 'D', 'E', 'G', 'P', 'T', 'V', 'Z') or ('M', 'N'). In all experiments, as input parameters, 16 filterbank melscale spectrum coefficients are computed at a 10 msec frame rate. Phone models are used. \n\n5.1  SPEAKER DEPENDENT ALPHABET \n\nOur database consists of 1000 connected strings of letters, some corresponding to grammatical words and proper names, others simply random. There is an average of five letters per string. The learning procedure is described in \u00a74.1 and applied to the extended MS-TDNN (\u00a73.3), with a bootstrapping phase where phone labels are used to give the alignment of the desired word. During testing, time alignment is performed over the whole sentence. A one-stage DP algorithm (Ney, 1984) for connected words (with no grammar) is used in place of the isolated word DP algorithm used in the training phase. The additional use of minimal word durations, word entrance penalties and word boundary detectors has reduced the number of word insertions and deletions (in the DP algorithm) to an acceptable level. On two speakers, the word error rates are respectively 2.4% and 10.3%. By comparison, SPHINX achieved error rates of 6% and 21.7%, respectively, when context-independent (as in our MS-TDNN) phone models were used. 
Using context-dependent models (as described in Lee, 1988), SPHINX achieves 4% and 16.1% error rates, respectively. No comparable results yet exist for the MS-TDNN for this case. \n\n5.2  SPEAKER INDEPENDENT ALPHABET (RMspell) \n\nOur database, a part of the DARPA Resource Management database (RM), consists of 120 speakers, spelling about 15 words each. 109 speakers (about 10,000 spelled letters) are used for training. 11 speakers (about 1000 spelled letters) are used for testing. 57 phone units, in the second hidden layer, account for the phonemes and the co-articulation units. We apply the training algorithms described in \u00a74.2 and \u00a74.3 to our baseline MS-TDNN architecture (\u00a73.2), without any additional procedure (for instance, no phonetic bootstrapping). An important difference from the experimental conditions described in the previous section is that we have kept training and testing conditions exactly the same (for instance, the same knowledge of the boundaries is used during training and testing). \n\nTable 1: Alphabet classification errors (we do not allow for insertion or deletion errors). \n\nAlgorithm                       % Error \nKnown Word Boundaries (\u00a74.2)    5.7% \nFuzzy Word Boundaries (\u00a74.3)    6.5% \n\n6  SUMMARY \n\nWe presented in this paper MS-TDNNs, which extend TDNNs classification performance to the sequence level. They integrate the DP alignment procedure within a straightforward connectionist framework. We developed training procedures which are computationally reasonable and train the MS-TDNN in a global way. Their only supervision is the minimization of the recognizer's error rate. Experiments were conducted on speaker independent continuous alphabet recognition. The word error rates are 5.7% with known word boundaries and 6.5% with fuzzy word boundaries. 
\n\nAcknowledgments \n\nThe authors would like to express their gratitude to Denis Jouvet and Michael Witbrock, for their help writing this paper, and to Cindy Wood, for gathering the databases. This work was partially performed at CNET laboratories, and at Carnegie Mellon University, under DARPA support. \n\nReferences \n\nBengio, Y. \"Artificial Neural Networks and their Application to Sequence Recognition\", Ph.D. Thesis, McGill University, Montreal, June 1991. \n\nBourlard, H. and Morgan, N. \"Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition\", TR-89-033, ICSI, Berkeley, CA, July 1989. \n\nBridle, J.S. \"Alphanets: a recurrent 'neural' network architecture with a hidden Markov model interpretation\", Speech Communication, \"Neurospeech\" issue, Feb 1990. \n\nFranzini, M.A., Lee, K.F., and Waibel, A.H. \"Connectionist Viterbi Training: A New Hybrid Method for Continuous Speech Recognition\", ICASSP, Albuquerque, 1990. \n\nHaffner, P., Franzini, M. and Waibel, A. \"Integrating Time Alignment and Neural Networks for High Performance Continuous Speech Recognition\", ICASSP, Toronto, 1991a. \n\nHaffner, P. and Waibel, A. \"Time-Delay Neural Networks Embedding Time Alignment: a Performance Analysis\", Eurospeech'91, Genova, September 1991b. \n\nLee, K.F. \"Large-Vocabulary Speaker-Independent Continuous Speech Recognition: the SPHINX system\", Ph.D. Thesis, Carnegie Mellon University, 1988. \n\nNey, H. \"The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition\", IEEE Transactions on Acoustics, Speech and Signal Processing, April 1984. \n\nWaibel, A.H., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. \"Phoneme Recognition using Time-Delay Neural Networks\", IEEE Transactions on Acoustics, Speech and Signal Processing 37(3):328-339, 1989. 
\n\n\f", "award": [], "sourceid": 580, "authors": [{"given_name": "Patrick", "family_name": "Haffner", "institution": null}, {"given_name": "Alex", "family_name": "Waibel", "institution": null}]}