{"title": "A Hybrid Neural Net System for State-of-the-Art Continuous Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 704, "page_last": 711, "abstract": null, "full_text": "A Hybrid Neural Net System for \n\nState-of-the-Art Continuous Speech Recognition \n\nG. Zavaliagkos \n\nNortheastern University \n\nBoston MA 02115 \n\nY. Zhao \n\nBBN Systems and Technologies \n\nCambridge, MA 02138 \n\nR. Schwartz \n\nJ. Makhoul \n\nBBN Systems and Technologies \n\nBBN Systems and Technologies \n\nCambridge, MA 02138 \n\nCambridge, MA 02138 \n\nAbstract \n\nUntill recently, state-of-the-art, large-vocabulary, continuous speech \nrecognition (CSR) has employed Hidden Markov Modeling (HMM) \nto model speech sounds. \nIn an attempt to improve over HMM we \ndeveloped a hybrid system that integrates HMM technology with neu(cid:173)\nral networks. We present the concept of a \"Segmental Neural Net\" \n(SNN) for phonetic modeling in CSR. By taking into account all the \nframes of a phonetic segment simultaneously, the SNN overcomes the \nwell-known conditional-independence limitation of HMMs. In several \nspeaker-independent experiments with the DARPA Resource Manage(cid:173)\nment corpus, the hybrid system showed a consistent improvement in \nperformance over the baseline HMM system. \n\n1 \n\nINTRODUCTION \n\nThe current state of the art in continuous speech recognition (CSR) is based on the use \nof hidden Markov models (HMM) to model phonemes in context. Two main reasons \nfor the popularity of HMMs are their high performance, in terms of recognition accu(cid:173)\nracy, and their computational efficiency However, the limitations of HMMs in modeling \nthe speech signal have been known for some time. 
Two such limitations are (a) the conditional-independence assumption, which prevents an HMM from taking full advantage of the correlation that exists among the frames of a phonetic segment, and (b) the awkwardness with which segmental features can be incorporated into HMM systems. We have developed the concept of Segmental Neural Nets (SNN) to overcome the two HMM limitations just mentioned for phonetic modeling in speech. A segmental neural net is a neural network that attempts to recognize a complete phonetic segment as a single unit, rather than a sequence of conditionally independent frames. \n\nNeural nets are known to require a large amount of computation, especially for training. Also, there is no known efficient search technique for finding the best scoring segmentation with neural nets in continuous speech. Therefore, we have developed a hybrid SNN/HMM system that is designed to take full advantage of the good properties of both methods. The two methods are integrated through a novel use of the N-best (multiple hypotheses) paradigm developed in conjunction with the BYBLOS system at BBN [1]. \n\n2 SEGMENTAL NEURAL NET MODELING \n\nThere have been several recent approaches to the use of neural nets in CSR. The SNN differs from these approaches in that it attempts to recognize each phoneme by using all the frames in a phonetic segment simultaneously to perform the recognition. By looking at a whole phonetic segment at once, we are able to take advantage of the correlation that exists among frames of a phonetic segment, thus ameliorating the limitations of HMMs. \n\nFigure 1: The SNN model samples the frames and produces a single segment score. \n\nThe structure of a typical SNN is shown in Figure 1. 
The input to the network is a fixed-length representation of the speech segment. This input is scored by the network. If the network was trained to minimize a mean square error (MSE) or a relative entropy distortion measure, the output of the network will be an estimate of the posterior probability P(C|x) of the phonetic class C given the segment x [2, 3]. This property of the SNN allows a natural extension to CSR: we segment the utterance into phonetic segments and score each one of them separately. The score of the utterance is simply the product of the scores of the individual segments. \n\nThe procedure described above requires the availability of some form of phonetic segmentation of the speech. We describe in Section 3 how we use the HMM to obtain likely candidate segmentations. Here, we shall assume that a phonetic segmentation has been made available and each segment is represented by a sequence of frames of speech features. The actual number of such frames in a phonetic segment is variable. However, for input to the neural network, we need a fixed-length representation. Therefore, we have to convert the variable number of frames in each segment to a fixed number of frames. We have considered two approaches to cope with this problem: time sampling and the Discrete Cosine Transform (DCT). \n\nIn the first approach, the requisite time warping is performed by a quasi-linear sampling of the feature vectors comprising the segment to a fixed number of frames (5 in our system). For example, in a 17-frame phonetic segment, we use frames 1, 5, 9, 13, and 17 as input to the SNN. The second approach uses the DCT, which can represent the frame sequence of a segment as follows. Consider the sequence of cepstral features across a segment as a time sequence and take its DCT. 
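The utterance-level scoring described above (a product of per-segment posteriors, best accumulated in the log domain to avoid underflow) can be sketched as follows; the function name and example posteriors are hypothetical:

```python
import math

def utterance_log_score(segment_posteriors):
    """Utterance score as the product of per-segment posteriors
    P(C|x), accumulated in the log domain for numerical stability."""
    return sum(math.log(p) for p in segment_posteriors)

# Hypothetical posteriors for a three-segment hypothesis.
log_score = utterance_log_score([0.9, 0.5, 0.8])  # log(0.9 * 0.5 * 0.8)
```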
\nFor an m-frame segment, this transform will result in a set of m DCT coefficients for each feature. Truncate this sequence to its first few coefficients (the more coefficients, the more precise the representation). To keep the number of features the same as in the quasi-linear sampling, we use only five coefficients. If the input segment has fewer than five frames, we initially interpolate in time so that a five-point DCT is possible. Compared to the quasi-linear sampling, the DCT has the advantage of using information from all input frames. \n\nDuration: Because of the time warping function, the SNN score for a segment is independent of the duration of the segment. In order to provide duration information to the SNN, we constructed a simple durational model. For each phoneme, a histogram was made of segment durations in the training data. This histogram was then smoothed by convolving with a triangular window, and probabilities falling below a floor level were reset to that level. The duration score was multiplied by the neural net score to give an overall segment score. \n\n3 THE N-BEST RESCORING PARADIGM \n\nOur hybrid system is based on the N-best rescoring paradigm [1], which allows us to design and test the SNN with little regard to the usual problem of searching for the segmentation when dealing with a large-vocabulary speech recognition system. \n\nFigure 2 illustrates the hybrid system. Each utterance is decoded using the BBN BYBLOS system [4]. The decoding is done in two steps: first, the N-best recognition is performed, producing a list of the candidate N best-scoring sentence hypotheses. In this stage, a relatively simple HMM is used for computation purposes. The length of the N-best list is chosen to be long enough to almost always include the correct answer. The second step is the HMM rescoring, where a more sophisticated HMM is used. 
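The two time-warping schemes of Section 2 (quasi-linear sampling to 5 frames, and the DCT truncated to five coefficients) can be sketched as follows; a minimal illustration with hypothetical function names:

```python
import math

def quasi_linear_sample(frames, k=5):
    """Quasi-linear sampling: pick k roughly evenly spaced frames.
    A 17-frame segment yields frames 1, 5, 9, 13, 17 (1-based)."""
    n = len(frames)
    idx = [round(i * (n - 1) / (k - 1)) for i in range(k)]
    return [frames[j] for j in idx]

def dct_truncate(series, k=5):
    """DCT-II of one feature's time series across the segment,
    truncated to its first k coefficients."""
    n = len(series)
    return [sum(x * math.cos(math.pi * c * (t + 0.5) / n)
                for t, x in enumerate(series))
            for c in range(k)]
```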
The recognition process may stop at this stage, selecting the top scoring utterance of the list (HMM 1-best output). \n\nTo incorporate the SNN in the N-best paradigm, we use the HMM system to generate a segmentation for each N-best hypothesis, and the SNN to generate a score for the hypothesis using the HMM segmentation. The N-best list may be reordered based on SNN scores alone. In this case the recognition process stops by selecting the top scoring utterance of the rescored list (NN 1-best output). \n\nFigure 2: Schematic diagram of the hybrid SNN/HMM system. \n\nThe last stage in the hybrid system is to combine several scores for each hypothesis, such as the SNN score, HMM score, grammar score, and the hypothesized number of words and phonemes. (The number of words and phonemes are included because they serve the same purpose as word and phoneme insertion penalties in an HMM CSR system.) We form a composite score by taking a linear combination of the individual scores. The linear combination is determined by selecting the weights that give the best performance over a development test set. These weights can be chosen automatically [5]. After we have rescored the N-best list, we can reorder it according to the new composite scores. If the CSR system is required to output just a single hypothesis, the highest scoring hypothesis is chosen (hybrid SNN/HMM top choice in Figure 2). \n\n4 SNN TRAINING \n\nThe training of the phonetic SNNs is done in two steps. 
In the first training step, we segment all of the training utterances into phonetic segments using the HMM models and the utterance transcriptions. Each segment then serves as a positive example of the SNN output corresponding to the phonetic label of the segment and as a negative example for all the other phonetic SNN outputs (we are using a total of 53 phonetic outputs). We call this training method 1-best training. \n\nThe SNN is trained using the log-error distortion measure [6], which is an extension of the relative entropy measure to an M-class problem. To ensure that the outputs are in fact probabilities, we use a sigmoidal nonlinearity to restrict their range to [0, 1] and an output normalization layer to make them sum to one. The models are initialized by removing the sigmoids and using the MSE measure. Then we reinstate the sigmoids and proceed with four iterations of a quasi-Newton [7] error minimization algorithm. For the adopted error measure, when the neural net nonlinearity is the usual sigmoid function, there exists a unique minimum for single-layer nets [6]. \n\nThe 1-best training described has one drawback: the training does not cover all the cases that the network will be required to encounter in the N-best rescoring paradigm. With 1-best training, given the correct segmentation, we train the network to discriminate between correct and incorrect labeling. However, the network will also be used to score N-best hypotheses with incorrect segmentations. Therefore, it is important to train based on the N-best lists in what we call N-best training. During N-best training, we produce the N-best lists for all of the training sentences, and we then train positively with all the correct hypotheses and negatively on the \"misrecognized\" parts of the incorrect hypotheses. 
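The output layer described above (sigmoids to keep each of the 53 outputs in [0, 1], followed by a normalization layer so they sum to one) can be sketched as follows; a minimal sketch with a hypothetical function name:

```python
import math

def snn_outputs(activations):
    """Sigmoid squashing followed by an output normalization layer,
    so the phonetic outputs lie in [0, 1] and sum to one."""
    squashed = [1.0 / (1.0 + math.exp(-a)) for a in activations]
    total = sum(squashed)
    return [s / total for s in squashed]
```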
\n\n4.1 Context Modelling \n\nSome of the largest gains in accuracy for HMM CSR systems have been obtained with the use of context (i.e., the phonetic identity of neighboring segments). Consequently, we implemented a version of the SNN that provides a simple model of left context. In addition to the SNN previously described, which only models a segment's phonetic identity and makes no reference to context, we trained 53 additional left-context networks. Each of these 53 networks was identical in structure to the context-independent SNN. In the recognition process, the segment score is obtained by combining the output of the context-independent SNN with the corresponding output of the SNN that models the left context of the segment. This combination is a weighted average of the two network values, where the weights are determined by the number of occurrences of the phoneme in the training data and the number of times the phoneme has its present context in the training data. \n\n4.1.1 Regularization Techniques for Context Models \n\nDuring neural net training of context models, a decrease of the distortion on the training set often causes an increase of the distortion on the test set. This problem is called overtraining, and it typically occurs when the number of training samples is on the order of the number of the model parameters. Regularization provides a class of smoothing techniques to ameliorate the overtraining problem. Instead of minimizing the distortion measure D(w) alone, we minimize the following objective function: \n\nE(w) = D(w) + (λ1/Nd) Σ_i |w_i|^η1 + (λ2/Nd) Σ_i |w_i − w0,i|^η2   (1) \n\nwhere w0 is the set of weights corresponding to the context-independent model, Nd is the number of data points, and λ1, λ2, η1, η2 are smoothing parameters. 
The first regularization term is used to control the excursion of the weights in general, and the other to control the degree to which the context-dependent model is allowed to deviate from the corresponding context-independent model (to achieve this, we first initialize the context-dependent models with the context-independent model). In our initial experiments, we used values of λ1 = λ2 = 1.0, η1 = 1, η2 = 2. \n\nWhen there are very few training data for a particular context model, the regularization terms in (1) prevail, constraining the model parameters to remain close to their initial estimates. The regularization term is gradually turned off with the presence of more data. What we accomplish in this way is an automatic mechanism that controls overtraining. \n\n4.2 Elliptical Basis Functions \n\nOur efforts to use multi-layer structures have been rather unsuccessful so far. The best improvement we got was a mere 5% reduction in error rate over the single-layer performance, but with a 10-fold increase in both the number of parameters and the computation time. We suspect that our training is getting trapped in bad local minima. Due to the above considerations, we considered an alternative multi-layer structure, the Elliptical Basis Function (EBF) network. EBFs are natural extensions of Radial Basis Functions, where a full covariance matrix is introduced in the basis functions. As many researchers have suggested, EBF networks provide modelling capabilities that are as powerful as multi-layer perceptrons. An advantage of EBFs is that there exist well-established techniques for estimating the elliptical basis layer. As a consequence, the problem of training an EBF network can be reduced to a one-layer problem, i.e., training the second layer only. \n\nOur approach with EBFs is to initialize them with Maximum Likelihood (ML). ML training allows us to use very detailed context models, such as triphones. 
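A single elliptical basis unit of the kind described above can be sketched as follows; a minimal sketch that assumes the ML-estimated mean and inverse covariance are given, uses hypothetical names, and omits the Gaussian normalizing constant:

```python
import math

def ebf_activation(x, mean, cov_inv):
    """One elliptical basis unit: a Gaussian-shaped response whose
    exponent is the Mahalanobis distance under a full covariance."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    quad = sum(d[i] * cov_inv[i][j] * d[j]
               for i in range(n) for j in range(n))
    return math.exp(-0.5 * quad)
```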
The next step, which is not yet implemented, is to either proceed with discriminative NN training, or use a nonlinearity at the output layer and treat the second layer as a single-layer feedforward model, or both. \n\n5 EXPERIMENTAL CONDITIONS AND RESULTS \n\nExperiments to test the performance of the hybrid system were performed on the speaker-independent (SI) portion of the DARPA 1000-word Resource Management speech corpus. The training set consisted of utterances from 109 speakers, 2830 utterances from male speakers and 1160 utterances from female speakers. We have tested our system with 5 different test sets. The Feb '89 set was used as a cross-validation set for the SNN system. Feb '89 and Oct '89 were used as development sets whenever the weights for the combination of two or more models were to be estimated. Feb '91 and the two Sep '92 sets were used as independent test sets. \n\nBoth the NN and the HMM systems had 3 separate models made from male, female, and combined data. During recognition all 3 models were used to score the utterances, and the recognition answer was decided by a 3-way gender selection: for each utterance, the model that produced the highest score was selected. The HMM used was the February '91 version of the BBN BYBLOS system. \n\nIn the experiments, we used SNNs with 53 outputs, each representing one of the phonemes in our system. The SNN was used to rescore N-best lists of length N = 20. The input to the net is a fixed number of frames of speech features (5 frames in our system). The features in each 10-ms frame consist of 16 scalar values: power, power difference, and 14 mel-warped cepstral coefficients. For the EBF, the differences of the cepstral parameters were also used. 
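The rescoring pipeline of Section 3 (a weighted linear combination of each hypothesis's scores, followed by reordering of the N-best list) can be sketched as follows; the dictionary keys, weights, and function name are hypothetical:

```python
def rescore_nbest(hypotheses, weights):
    """Reorder an N-best list by a composite score: a weighted linear
    combination of each hypothesis's individual scores (e.g. SNN, HMM,
    grammar log scores, and word/phoneme counts as insertion penalties)."""
    def composite(h):
        return sum(w * h[key] for key, w in weights.items())
    return sorted(hypotheses, key=composite, reverse=True)
```

In the paper the combination weights are tuned on a development test set [5]; here they would simply be supplied by the caller.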
\n\nTable 1: SNN development on February '89 test set \n\nSystem                      Word Error (%) \nOriginal SNN (MSE)          13.7 \n+ Log-Error Criterion       11.6 \n+ N-Best training            9.0 \n+ Left Context               7.4 \n+ Regularization             6.6 \n+ word, phoneme penalties    5.7 \nEBF                          4.9 \n\nTable 1 shows the word error rates at the various stages of development. All the experiments mentioned below used the Feb '89 test set. The original 1-layer SNN was trained using the 1-best training algorithm and the MSE criterion, and gave a word error rate of 13.7%. The incorporation of the duration model and the adoption of the log-error training criterion both resulted in some improvement, bringing the error rate down to 11.6%. With N-best training the error rate dropped to 9.0%; adding left-context models reduced the word error rate down to 7.4%. When the context models were trained with the regularization criterion the error rate dropped to 6.6%. All of the above results were obtained using the mean NN score (NN score divided by the number of segments). When we used word and phoneme penalties, the performance was even better, a 5.7% word error rate. For the same conditions, the performance of the EBF system was a 4.9% word error rate. We should mention that the implementation of training with regularization was not complete at the time the hybrid system was tested on the September '92 test, so we will exclude it from the NN results presented below. \n\nThe final hybrid system included the HMM, the SNN, and EBF models, and Table 2 summarizes its performance (in this table, NN stands for the combination of SNN and EBF). We notice that, with the exception of the Sep '92 test sets, the word error of the HMM was roughly around 3.5% (3.8, 3.7 and 3.4%). For the same test sets, the NN had a word error slightly higher than 4.0%, and the hybrid NN/HMM system a word error rate of 2.7%. 
We are very happy to see the performance of our neural net approaching the performance of the HMM. It is also worthwhile to mention that the performance of the hybrid system for Feb '89, Oct '89 and Feb '91 is the best performance reported so far for these sets. \n\nSpecial mention has to be made of the Sep '92 test sets. These test sets proved to be radically different from the previously released RM tests, resulting in almost a doubling of the HMM word error rate. The deterioration in performance of the hybrid system was bigger, and the improvement due to the hybrid system was less than 10% (compared to an improvement of about 25% for the other 3 sets). We have all been baffled by these unexpected results, and although we are continuously looking for an explanation of this strange behaviour, our efforts have not yet been successful. \n\nTable 2: Hybrid Neural Net/HMM system, word error (%) \n\nSystem    Feb '89   Oct '89   Feb '91   Sep '92 \nHMM         3.7       3.8       3.4       6.0 \nNN          4.0       4.2       4.1       7.4 \nNN+HMM      2.7       2.7       2.7       5.5 \n\n6 CONCLUSIONS \n\nWe have presented the Segmental Neural Net as a method for phonetic modeling in large-vocabulary CSR systems and have demonstrated that, when combined with a conventional HMM, the SNN gives a significant improvement over the performance of a state-of-the-art HMM CSR system. The hybrid system is based on the N-best rescoring paradigm which, by providing the HMM segmentation, drastically reduces the computation for our segmental models and provides a simple way of combining the best aspects of the two systems. The improvements achieved from the use of a hybrid system vary from less than 10% to about 25% reduction in word error rate, depending on the test set used. \n\nReferences \n\n[1] R. Schwartz and S. 
Austin, \"A Comparison of Several Approximate Algorithms for \nFinding Multiple (N-Best) Sentence Hypotheses,\" IEEE Int. Con[ Acoustics, Speech \nand Signal Processing, Toronto, Canada, May 1991, pp. 701-704. \n\n[2] A. Barron, \"Statistical properties of artificial neural networks,\" IEEE Conf. Decision \n\nand Control, Tampa, FL, pp. 280-285, 1989. \n\n[3] H. Gish, \"A probabilistic approach to the understanding and training of neural \nnetwork classifiers,\" IEEE Int. ConfAcoust., Speech, Signal Processing, April 1990. \n[4] M. Bates et. all, \"The BBN/HARC Spoken Language Understanding System\" IEEE \nInt. Con[ Acoust., Speech,Signal Processing, Apr 1992, Minneapolis, MI, Apr. 1993 \n[5] M. Ostendorf et. all, \"Integration of Diverse Recognition Methodologies Through \nReevaluation of N-Best Sentence Hypotheses,\" Proc. DARPA Speech and Natural \nLanguage Workshop, Pacific Grove, CA, Morgan Kaufmann Publishers, February \n1991. \n\n[6] A. El-Jaroudi and J. Makhoul, \"A New Error Criterion for Posterior Probability \nEstimation with Neural Nets,\" International Joint Conference on Neural Networks, \nSan Diego, CA, June 1990, Vol III, pp. 185-192. \n\n[7] D. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Mas(cid:173)\n\nsachusetts, 1984. \n\n[8] R. Schwartz et. all, \"Improved Hidden Markov Modeling of Phonemes for Continu(cid:173)\n\nous Speech Recognition,\" IEEE Int. Con[ Acoustics, Speech and Signal Processing, \nSan Diego, CA, March 1984, pp. 35.6.1-35.6.4. \n\n\f", "award": [], "sourceid": 598, "authors": [{"given_name": "G.", "family_name": "Zavaliagkos", "institution": null}, {"given_name": "Y.", "family_name": "Zhao", "institution": null}, {"given_name": "R.", "family_name": "Schwartz", "institution": null}, {"given_name": "J.", "family_name": "Makhoul", "institution": null}]}