{"title": "Segmental Neural Net Optimization for Continuous Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1059, "page_last": 1066, "abstract": null, "full_text": "Segmental Neural Net Optimization for Continuous Speech \n\nRecognition \n\nYmg Zhao Richard Schwartz \n\nJohn Makhoul George Zavaliagkos \n\nBBN System and Technologies \n\n70 Fawcett Street \n\nCambridge MA 02138 \n\nAbstract \n\nPreviously, we had developed the concept of a Segmental Neural Net (SNN) for \nphonetic modeling in continuous speech recognition (CSR). This kind of neu(cid:173)\nral network technology advanced the state-of-the-art of large-vocabulary CSR, \nwhich employs Hidden Marlcov Models (HMM), for the ARPA 1oo0-word Re(cid:173)\nsource Management corpus. More Recently, we started porting the neural net \nsystem to a larger, more challenging corpus - the ARPA 20,Ooo-word Wall Street \nJournal (WSJ) corpus. During the porting, we explored the following research \ndirections to refine the system: i) training context-dependent models with a reg(cid:173)\nularization method; ii) training SNN with projection pursuit; and ii) combining \ndifferent models into a hybrid system. When tested on both a development set \nand an independent test set, the resulting neural net system alone yielded a per(cid:173)\nfonnance at the level of the HMM system, and the hybrid SNN/HMM system \nachieved a consistent 10-15% word error reduction over the HMM system. This \npaper describes our hybrid system, with emphasis on the optimization methods \nemployed. \n\n1 \n\nINTRODUCTION \n\nHidden Martov Models (HMM) represent the state-of-the-art for large-vocabulary con(cid:173)\ntinuous speech recognition (CSR). Recently, neural network technology has been shown \nto advance the state-of-the-art for CSR by integrating neural nets and HMMs [1,2]. 
In principle, the advance is based on the fact that neural network modeling can avoid some limitations of HMM modeling, for example, the conditional-independence assumption of HMMs and the fact that segmental features are hard to incorporate. Our work has been based on the concept of a Segmental Neural Net (SNN) [2]. \n\n1059 \n\n1060 Zhao, Schwartz, Makhoul, and Zavaliagkos \n\nA segmental neural network is a neural network that attempts to recognize a complete phoneme segment as a single unit. Its basic structure is shown in Figure 1. The input to the network is a fixed-length representation of the speech segment, which is obtained from the warping (quasi-linear sampling) of a variable-length phoneme segment. If the network is trained to minimize a least squares error or a cross-entropy distortion measure, the output of the network can be shown to be an estimate of the posterior probability of the phoneme class given the input segment [3,4]. \n\nFigure 1: The SNN model samples the frames of a phonetic segment (warping), feeds them to a neural network, and produces a single segment score. \n\nOur initial SNN system comprised a set of one-layer sigmoidal nets. This system is trained to minimize a cross-entropy distortion measure by a quasi-Newton error minimization algorithm. A variable-length segment is warped into a fixed length of 5 input frames. Since each frame includes 16 features (14 mel cepstra, power, and difference of power), an input to the neural network forms a 16 x 5 = 80 dimensional vector. \n\nPreviously, our experimental domain was the ARPA 1000-word Resource Management (RM) corpus, where we used 53 output phoneme classes. When tested on three independent evaluation sets (Oct 89, Feb 91 and Sep 92), our system achieved a consistent 10-20% word error rate reduction over the state-of-the-art HMM system [2].
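The warping step just described can be illustrated with a short sketch: a variable-length segment of 16-dimensional feature frames is quasi-linearly sampled down to 5 frames and flattened into the 80-dimensional network input. The exact sampling rule and the function name here are our assumptions; the paper does not spell out the scheme:

```python
import numpy as np

def warp_segment(frames, n_out=5):
    """Quasi-linearly sample a variable-length phoneme segment down to a
    fixed number of frames, then flatten into one input vector.

    frames: (T x 16) array of feature frames.  The even-spacing sampling
    rule and the function name are illustrative assumptions."""
    frames = np.asarray(frames, dtype=float)
    T = frames.shape[0]
    # pick n_out frame indices spread evenly across the segment
    idx = np.linspace(0, T - 1, n_out).round().astype(int)
    # 5 frames x 16 features -> 80-dimensional network input
    return frames[idx].reshape(-1)
```

For a segment that is already 5 frames long the warping is the identity; longer segments are subsampled and shorter ones have frames repeated.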
\n\n2 THE WALL STREET JOURNAL CORPUS \n\nAfter the final September 92 RM corpus evaluation, we ported our neural network system to a larger corpus - the Wall Street Journal (WSJ) corpus. The WSJ corpus consists primarily of read speech, with a 5,000- to 20,000-word vocabulary. It is the current ARPA speech recognition research corpus. Compared to the RM corpus, it is a more challenging corpus for the neural net system due to the greater length of WSJ utterances and the higher perplexity of the WSJ task. So we would expect greater difficulty in improving performance on the WSJ corpus. \n\n3 TRAINING CONTEXT-DEPENDENT MODELS WITH REGULARIZATION \n\n3.1 WHY REGULARIZATION \n\nIn contrast to the context-independent modeling for the RM corpus, we are concentrating on context-dependent modeling for the WSJ corpus. In context-dependent modeling, instead of using a single neural net to recognize phonemes in all contexts, different neural networks are used to recognize phonemes in different contexts. Because of the paucity of training data for some context models, we found that we had an overfitting problem. \n\nRegularization provides a class of smoothing techniques to ameliorate the overfitting problem [5]. We started using regularization in our initial one-layer sigmoidal neural network system. The regularization term added here regulates how far the context-dependent parameters can move away from their initial estimates, which are context-independent parameters. This is different from the usual weight decay technique in the neural net literature, and it is designed specifically for our problem. The objective function is shown below: \n\nE(W) = -\frac{1}{N_d} \sum_n \Big[ \sum_{i \neq c} \log(1 - f_i) + \log f_c \Big] + \lambda \|W - W_0\|^2 \qquad (1) \n\nwhere the first term is the distortion measure E_r(W) and the second is the regularization term; f_i is the net output for class i (with c the correct class), ||W|| is the Euclidean norm of all weights in all the networks, W_0 is the initial estimate of the weights from a context-independent neural network, and N_d is the number of data points. \lambda is the regularization parameter, which controls the tradeoff between the \"smoothness\" of the solution, as measured by ||W - W_0||^2, and the deviation from the data, as measured by the distortion. \n\nThe optimal \lambda, which gives the best generalization to a test set, can be estimated by generalized cross-validation [5]. If the distortion measure, as in \n\nE_r(W) = \frac{1}{N_d} \|AW - b\|^2 \qquad (2) \n\nis a quadratic function in terms of the network weights W, the optimal \lambda is the one that gives the minimum of a generalized cross-validation index V(\lambda) [6]: \n\nV(\lambda) = \frac{\frac{1}{N_d} \|(A(\lambda) - I) b\|^2}{\big[ 1 - \frac{1}{N_d} \mathrm{tr}(A(\lambda)) \big]^2} \qquad (3) \n\nwhere A(\lambda) = A (A^T A + N_d \lambda I)^{-1} A^T. V(\lambda) is an easily calculated function based on the singular value decomposition (SVD): \n\nV(\lambda) = \frac{\frac{1}{N_d} \sum_j \big( \frac{N_d \lambda z_j}{d_j^2 + N_d \lambda} \big)^2}{\big[ 1 - \frac{1}{N_d} \sum_j \frac{d_j^2}{d_j^2 + N_d \lambda} \big]^2} \qquad (4) \n\nwhere A = U D V^T is the singular value decomposition of A and z = U^T b. Figure 2 shows an example plot of V(\lambda). A typical optimal \lambda has an inverse relation to the number of samples in each class, indicating that \lambda is gradually reduced with the presence of more data. \n\n[Figure 2: A typical V(\lambda), plotted over \lambda from 5 x 10^-7 to 2.5 x 10^-6.] \n\nJust as the linear least squares method can be generalized to a nonlinear least squares problem by an iterative procedure, so selecting the optimal value of the regularization parameter in a quadratic error criterion can be generalized to a non-quadratic error criterion iteratively.
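As a concrete illustration, the quadratic-case generalized cross-validation selection can be sketched as below. This is a minimal sketch assuming W_0 = 0 and a dense design matrix A; the function names and the grid search are ours, not the paper's implementation:

```python
import numpy as np

def gcv_index(A, b, lam):
    """Generalized cross-validation index V(lambda) for a quadratic
    (ridge-type) distortion ||A W - b||^2 / N_d with penalty
    lam * ||W||^2 (i.e. the special case W_0 = 0).  The influence
    matrix A(lam) = A (A^T A + N_d lam I)^{-1} A^T is evaluated
    through the SVD of A."""
    Nd = A.shape[0]
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    shrink = d**2 / (d**2 + Nd * lam)        # eigenvalues of A(lam)
    z = U.T @ b
    fit = U @ (shrink * z)                   # A(lam) b
    num = np.sum((b - fit) ** 2) / Nd        # ||(I - A(lam)) b||^2 / N_d
    den = ((Nd - np.sum(shrink)) / Nd) ** 2  # (tr(I - A(lam)) / N_d)^2
    return num / den

def best_lambda(A, b, grid):
    """Grid search for the lambda minimizing V(lambda)."""
    return min(grid, key=lambda lam: gcv_index(A, b, lam))
```

Reusing the SVD across the grid of lambda values is what makes the index cheap to evaluate, since only the scalar shrinkage factors change with lambda.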
We developed an iterative procedure to apply the cross-validation technique to a non-quadratic error function, for example, the cross-entropy criterion E_r(W) in (1), as follows: \n\n1. Compute the distortion E_r(W_n) for an estimate W_n. \n\n2. Compute the gradient g_n and Hessian H_n of the distortion E_r(W_n). \n\n3. Compute the singular value decomposition H_n = V_n^T D_n V_n. Set z_n = V_n g_n. \n\n4. Evaluate a generalized cross-validation index V_n(\lambda), analogous to (4), for a range of \lambda's, and select the \lambda_n that gives the minimum V_n: \n\nV_n(\lambda) = \frac{N_d \big[ E_r(W_n) - \sum_j \frac{d_j z_{n,j}^2}{(d_j + N_d \lambda)^2} \big]}{\big[ N_d - \sum_j \frac{d_j}{d_j + N_d \lambda} \big]^2} \qquad (5) \n\n5. Set W_{n+1} = W_n - (H_n + N_d \lambda_n I)^{-1} g_n. \n\n6. Go to 1 and iterate. \n\nNote that \lambda is adjusted at each iteration. The final value of \lambda_n is taken as the optimal \lambda. Iterative regularization parameter selection shows that \lambda converges; in one of our experiments, for example, it settled to a fixed value. \n\n3.2 A TWO-LAYER NEURAL NETWORK SYSTEM WITH REGULARIZATION \n\nWe then extended our regularization work from the one-layer sigmoidal network system to a two-layer sigmoidal network system. The first layer of the network works as a feature extractor and is shared by all phonetic classes. Theoretically, in order to benefit from its larger capability of representing phonetic segments, the number of hidden units of a two-layer network should be much greater than the number of input dimensions. However, a large number of hidden units can cause serious overfitting problems when the number of training samples is less than the number of parameters for some context models. Therefore, regularization is even more useful here. Because the second layer can be trained as a one-layer net, the regularization techniques we developed for a one-layer net can be applied here to train the second layer.
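Training the second layer as a regularized linear problem has a closed form. The sketch below assumes a plain (unweighted) least squares criterion and pulls the second-layer weights toward context-independent initial values W0; the names and the unweighted simplification are our assumptions:

```python
import numpy as np

def train_second_layer(H, T, W0, lam):
    """One regularized linear least squares solve for the second layer:
    minimize ||H W - T||^2 / N_d + lam * ||W - W0||^2, where H holds
    hidden-layer activations (N_d x n_h), T the targets, and W0 the
    context-independent initial weights the solution is pulled toward.
    A plain (unweighted) criterion is used here for simplicity."""
    Nd, nh = H.shape
    # normal equations: (H^T H + N_d lam I) W = H^T T + N_d lam W0
    lhs = H.T @ H + Nd * lam * np.eye(nh)
    rhs = H.T @ T + Nd * lam * W0
    return np.linalg.solve(lhs, rhs)
```

As lam grows, the solution is pinned to the context-independent weights W0; as lam shrinks, it approaches the unregularized least squares fit to the context-dependent data.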
\n\nIn our implementation, a weighted least squares error measure was used at the output layer. First, the weights for the two-layer system were initialized with random numbers between -1 and 1. Fixing the weights for the second layer, we trained the first layer by using gradient descent; then, fixing the weights for the first layer, we trained the second layer by linear least squares with a regularization term, without the sigmoidal function at the output. We stopped after one iteration for our initial experiment. \n\n4 TRAINING SNN WITH PROJECTION PURSUIT \n\n4.1 WHY PROJECTION PURSUIT \n\nAs we described in the previous section, regularization is especially useful in training the second layer of a two-layer network. In order to take full advantage of the two-layer structure, we want to train the first layer as well. However, once the number of hidden units is large, the number of weights in the first layer is huge, which makes the first layer very difficult to train. Projection pursuit presents a useful technique to use a large hidden layer while still keeping the number of weights in the first layer as small as possible. \n\nThe original projection pursuit is a nonparametric statistical technique to find interesting low-dimensional projections of high-dimensional data sets [7]. The parametric version of it, a projection pursuit learning network (PPLN), has a structure very similar to a two-layer sigmoidal network [7]. In a traditional two-layer neural network, the weights in the first layer can be viewed as hyperplanes in the input space. It has been proposed that a special function of the first layer is to partition the input space into cells through these hyperplanes [8]. The second layer groups these cells together to form decision regions.
\n\nThe accuracy or resolution of the decision regions is completely specified by the size and density of the cells, which is determined by the number and placement of the first-layer hyperplanes in the input space. \n\nIn a two-layer neural net, since the weights in the first layer can go anywhere, there are no restrictions on the placement of these hyperplanes. In contrast, a projection pursuit learning network restricts these hyperplanes to some major \"interesting\" directions. In other words, hidden units are grouped into several distinct directions. Of course, with this grouping, the number of cells in the input space is reduced somewhat. However, the interesting point here is that this restriction does not reduce the number of cells asymptotically [7]. In other words, grouping hidden units does not affect the number of cells much. Consequently, for a fixed number of hidden units, the number of parameters in the first layer of a projection pursuit learning network is much less than in a traditional neural network. Therefore, a projection pursuit learning network is easier to train and generalizes better. \n\n4.2 HOW TO TRAIN A PPLN \n\nIn our implementation, the distinct projection directions were shared by all context-dependent models, and they were trained context-independently. We then trained these direction parameters with back-propagation. The second layer was trained with regularization. Iterations can go back and forth between the two layers. \n\n5 COMBINATIONS OF DIFFERENT MODELS \n\nIn the last two sections, we talked about using regularization and projection pursuit to optimize our neural network system. In this section, we discuss another optimization method: combining different models into a hybrid system. The combining method is based on the N-best rescoring paradigm [2].
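The N-best rescoring idea can be illustrated with a small sketch: each hypothesis in the N-best list carries scores from several knowledge sources, which are combined with per-source weights and used to re-rank the list. The source names, weights, and sentences below are illustrative only, not the paper's trained values:

```python
def rescore_nbest(hypotheses, weights):
    """Re-rank an N-best list: each hypothesis carries log scores from
    several knowledge sources; combine them with per-source weights and
    sort best-first.  Sources and weights are illustrative only."""
    def combined(hyp):
        return sum(weights[src] * s for src, s in hyp["scores"].items())
    return sorted(hypotheses, key=combined, reverse=True)

# a toy 2-best list with made-up log scores from three knowledge sources
nbest = [
    {"text": "sell all that stock",
     "scores": {"hmm": -118.0, "snn": -99.0, "lm": -25.0}},
    {"text": "sell all the stock",
     "scores": {"hmm": -120.0, "snn": -95.0, "lm": -20.0}},
]
reranked = rescore_nbest(nbest, {"hmm": 1.0, "snn": 0.5, "lm": 2.0})
```

The attraction of this paradigm is that each knowledge source only needs to score whole hypotheses, so models with very different structure (segmental nets, HMMs, grammars) can be combined without a joint decoder.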
\n\nThe N-best rescoring paradigm is a mechanism that allows us to build a hybrid system by combining different knowledge sources. For example, in the RM corpus, we successfully combined the HMM system, the SNN system and a word-pair grammar into a single hybrid system which achieved state-of-the-art performance. We have been using this N-best rescoring paradigm to combine different models in the WSJ corpus as well. These different models include SNN left-context, right-context, and diphone models, HMM models, and a language model known as a statistical grammar. We will show how to obtain a reasonable combination of different systems from Bayes' rule. \n\nThe goal is to compute P(S|X), the probability of the sentence S given the observation sequence X. From Bayes' rule, \n\nP(S|X)_{SNN} = \frac{P(S) P(X|S)}{P(X)} \approx P(S) \prod_x \frac{P(x|S)}{P(x)} \approx P(S) \prod_x \frac{P(x|p,c)}{P(x)} \approx P(S) \prod_x \frac{P(p|x,c)}{P(p|c)} \n\nwhere X is a sequence of acoustic features x in each phonetic segment, and p and c are the phoneme class and context for the segment, respectively. The following three approximations are used here: \n\n- P(X|S) = \prod_x P(x|S). \n- P(x|S) = P(x|p,c). \n- P(c|x) = P(c). \n\nTherefore, in the SNN system, we use the following approximation from Bayes' rule: \n\nP(S|X)_{SNN} \approx P(S) \prod_x \frac{P(p|x,c)}{P(p|c)} \n\nwhere \n\nP(S): word grammar score. \n\prod_x P(p|x,c): neural net score. \n\prod_x P(p|c): phone grammar score. \n\nThese three scores, together with the HMM scores, are combined in the SNN/HMM hybrid system. \n\n6 EXPERIMENTAL RESULTS \n\nTable 1: Word Error Rates (%) for 5K, Bigram Grammar \n\nSystem | Development Set | Nov92 Test \nHMM | 11.0 | 8.5 \nBaseline SNN | 11.7 | - \nRegularization and Projection Pursuit SNN | 11.2 | 9.1 \nBaseline SNN/HMM | 10.3 | 7.7 \nRegularization and Projection Pursuit SNN/HMM | 9.5 | 7.2 \n\nTable 2: Word Error Rates (%) for 20K, Trigram Grammar \n\nSystem | Development Set | Nov93 Test \nHMM | 14.4 | 14.0 \nRegularization and Projection Pursuit SNN | 14.6 | - \nRegularization and Projection Pursuit SNN/HMM | 13.0 | 12.3 \n\nSpeaker-independent CSR tests were performed on the 5,000-word (5K) and 20,000-word (20K) ARPA Wall Street Journal corpus. Bigram and trigram statistical grammars were used. The basic neural network structure consists of 80 inputs, 500 hidden units and 46 outputs. There are 125 projection directions in the first layer. Context models consist of right-context models and left-diphone models. In the right-context models, we used 46 different networks to recognize each phoneme in each of the different right contexts. In the left-diphone models, a segment input consisted of the first half segment of the current phone plus the second half segment of the previous phone. Word error rates are shown in Tables 1 and 2. \n\nComparing the first two rows of Table 1 and Table 2, we can see that the two-layer neural network system alone is at the level of state-of-the-art HMM systems. As shown in rows 3 and 5 of Table 1, regularization and projection pursuit improve the performance of the neural net system. The hybrid SNN/HMM system reduces the word error rate 10-15% over the HMM system in both tables. \n\n7 CONCLUSIONS \n\nNeural net technology is useful in advancing the state of the art in continuous speech recognition systems. Optimization methods, such as regularization and projection pursuit, improve the performance of the neural net system. Our hybrid SNN/HMM system reduces the word error rate 10-15% over the HMM system on the 5,000-word and 20,000-word WSJ corpora.
\n\nAcknowledgments \n\nThis work was funded by the Advanced Research Projects Agency of the Department of Defense. \n\nReferences \n\n[1] M. Cohen, H. Franco, N. Morgan, D. Rumelhart and V. Abrash, \"Context-Dependent Multiple Distribution Phonetic Modeling with MLPs\", in Advances in Neural Information Processing Systems 5, eds. S. J. Hanson, J. D. Cowan and C. L. Giles, Morgan Kaufmann Publishers, San Mateo, 1993. \n\n[2] G. Zavaliagkos, Y. Zhao, R. Schwartz and J. Makhoul, \"A Hybrid Neural Net System for State-of-the-Art Continuous Speech Recognition\", in Advances in Neural Information Processing Systems 5, eds. S. J. Hanson, J. D. Cowan and C. L. Giles, Morgan Kaufmann Publishers, San Mateo, 1993. \n\n[3] A. Barron, \"Statistical properties of artificial neural networks\", IEEE Conf. Decision and Control, Tampa, FL, pp. 280-285, 1989. \n\n[4] H. Gish, \"A probabilistic approach to the understanding and training of neural network classifiers\", IEEE Int. Conf. Acoust., Speech, Signal Processing, April 1990. \n\n[5] G. Wahba, Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics, 1990. \n\n[6] D. M. Bates, M. J. Lindstrom, G. Wahba and B. S. Yandell, \"GCVPACK - Routines for Generalized Cross Validation\", Comm. Statist. - Simula., 16(4), 1247-1253, 1987. \n\n[7] Y. Zhao and C. G. Atkeson, \"Implementing Projection Pursuit Learning\", to appear in Neural Computation. \n\n[8] J. Makhoul, A. El-Jaroudi and R. Schwartz, \"Partitioning Capabilities of Two-layer Neural Networks\", IEEE Transactions on Signal Processing, 39, pp. 1435-1440, 1991.
\n\n\fPART X \n\nCOGNITIVE SCIENCE \n\n\f\f", "award": [], "sourceid": 763, "authors": [{"given_name": "Ying", "family_name": "Zhao", "institution": null}, {"given_name": "Richard", "family_name": "Schwartz", "institution": null}, {"given_name": "John", "family_name": "Makhoul", "institution": null}, {"given_name": "George", "family_name": "Zavaliagkos", "institution": null}]}