{"title": "Improved Hidden Markov Model Speech Recognition Using Radial Basis Function Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 159, "page_last": 166, "abstract": "", "full_text": "Improved Hidden Markov Model Speech Recognition Using Radial Basis Function Networks \n\nElliot Singer and Richard P. Lippmann \n\nLincoln Laboratory, MIT \n\nLexington, MA 02173-9108, USA \n\nAbstract \n\nA high performance speaker-independent isolated-word hybrid speech recognizer was developed which combines Hidden Markov Models (HMMs) and Radial Basis Function (RBF) neural networks. In recognition experiments using a speaker-independent E-set database, the hybrid recognizer had an error rate of 11.5% compared to 15.7% for the robust unimodal Gaussian HMM recognizer upon which the hybrid system was based. These results and additional experiments demonstrate that RBF networks can be successfully incorporated in hybrid recognizers and suggest that they may be capable of good performance with fewer parameters than required by Gaussian mixture classifiers. A global parameter optimization method designed to minimize the overall word error rather than the frame recognition error failed to reduce the error rate. \n\n1 HMM/RBF HYBRID RECOGNIZER \n\nA hybrid isolated-word speech recognizer was developed which combines neural network and Hidden Markov Model (HMM) approaches. The hybrid approach is an attempt to capitalize on the superior static pattern classification performance of neural network classifiers [6] while preserving the temporal alignment properties of HMM Viterbi decoding. Our approach is unique when compared to other studies [2, 5] in that we use Radial Basis Function (RBF) rather than multilayer sigmoidal networks. 
RBF networks were chosen because their static pattern classification performance is comparable to that of other networks and they can be trained rapidly using a one-pass matrix inversion technique [8]. \n\nThe hybrid HMM/RBF isolated-word recognizer is shown in Figure 1. \n\nFigure 1: Block diagram of the hybrid recognizer for a two word vocabulary. \n\nFor each pattern presented at the input layer, the RBF network produces nodal outputs which are estimates of Bayesian probabilities [9]. The RBF network consists of an input layer, a hidden layer composed of Gaussian basis functions, and an output layer. Connections from the input layer to the hidden layer are fixed at unity while those from the hidden layer to the output layer are trained by minimizing the overall mean-square error between actual and desired output values. Each RBF output node has a corresponding state in a set of HMM word models which represent the words in the vocabulary. HMM word models are left-to-right with no skip states and have a one-state background noise model at either end. The background noise models are identical for all words. In the simplified diagram of Figure 1, the vocabulary consists of 2 E-set words and the HMMs contain 3 states per word model. The number of RBF output nodes (classes) is thus equal to the total number of HMM non-background states plus one to account for background noise. In recognition, Viterbi decoders use the nodal outputs of the RBF network as observation probabilities to produce word likelihood scores. Since the outputs of the RBF network can take on any value, they were initially hard limited to 0.0 and 1.0. The transition probabilities estimated as part of HMM training are retained. 
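The recognition step just described can be sketched in a few lines: each word's Viterbi decoder scores its left-to-right model using the network outputs as observation probabilities. This is a minimal illustrative sketch, not the paper's implementation; the function and array names are assumptions.

```python
import numpy as np

def viterbi_score(obs_probs, log_trans):
    # obs_probs: (T, S) network outputs used as observation probabilities
    # log_trans: (S, S) log transition matrix of one left-to-right word model
    T, S = obs_probs.shape
    log_obs = np.log(np.clip(obs_probs, 1e-10, None))  # floor avoids log(0)
    delta = np.full(S, -np.inf)
    delta[0] = log_obs[0, 0]                 # decoding starts in the first state
    for t in range(1, T):
        # best predecessor for each state, plus that state's observation score
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_obs[t]
    return delta[-1]                         # score for ending in the last state
```

The recognizer's final response would then be the word model whose decoder returns the highest score.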
\nThe final response of the recognizer corresponds to that word model which produces the highest Viterbi likelihood. Note that the structure of the HMM/RBF hybrid recognizer is identical to that of a tied-mixture HMM recognizer. For a discussion and comparison of the two recognizers, see [10]. \n\nTraining of the hybrid recognizer begins with the preliminary step of training an HMM isolated-word recognizer. The robust HMM recognizer used provides good recognition performance on many standard difficult isolated-word speech databases [7]. It uses continuous density, unimodal diagonal-covariance Gaussian classifiers for each word state. Variances of all states are equal to the grand variance averaged over all words and states. The trained HMM recognizer is used to force an alignment of every training token and assign a label to each frame. Labels correspond to both states of HMM word models and output nodes of the RBF network. \n\nThe Gaussian centers in the RBF hidden layer are obtained by performing k-means clustering on speech frames and separate clustering on noise frames, where speech and noise frames are distinguished on the basis of the initial Viterbi alignment. The RBF weights from the hidden layer to the output layer are computed by presenting input frames to the RBF network and setting the desired network outputs to 1.0 for the output node corresponding to the frame label and 0.0 for all other nodes. The RBF hidden node outputs and their correlations are accumulated across all training tokens and are used to estimate weights to the RBF output nodes using a fast one-pass algorithm [8]. Unlike the performance of the system reported in [5], additional training iterations using the hybrid recognizer to label frames did not improve performance. 
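Under one natural reading of the one-pass procedure of [8], the accumulated hidden-node correlations and hidden/target cross-products define normal equations whose single linear solve yields the mean-square-error-optimal output weights. A sketch under that assumption (names are illustrative):

```python
import numpy as np

def rbf_output_weights(hidden_acts, labels, n_classes):
    # hidden_acts: (N, H) Gaussian hidden-node outputs over all training frames
    # labels: (N,) frame labels assigned by the forced Viterbi alignment
    targets = np.eye(n_classes)[labels]   # desired outputs: 1.0 at the label, 0.0 elsewhere
    # accumulate hidden-output correlations and hidden/target cross-products
    R = hidden_acts.T @ hidden_acts       # (H, H)
    P = hidden_acts.T @ targets           # (H, n_classes)
    # one linear solve of the normal equations R W = P gives the MSE-optimal weights
    W, *_ = np.linalg.lstsq(R, P, rcond=None)
    return W                              # (H, n_classes)
```

In practice R and P can be accumulated frame by frame across all tokens, so only one pass over the data is needed before the solve.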
\n\n2 DATABASE \n\nAll experiments were performed using a large, speaker-independent E-set (9 word) database derived from the ISOLET Spoken Letter Database [4]. The training set consisted of 1,080 tokens (120 tokens per word) spoken by 60 female and 60 male speakers for a total of 61,466 frames. The test set consisted of 540 tokens (60 tokens per word) spoken by a different set of 30 female and 30 male speakers for a total of 30,406 frames. Speech was sampled at 16 kHz and had an average SNR of 31.5 dB. Input vectors were based on a mel-cepstrum analysis of the speech waveform as described in [7]. The input analysis window was 20 ms wide and was advanced at 10 ms intervals. Input vectors were created by adjoining the first 12 non-energy cepstral coefficients, the first 13 first-difference cepstral coefficients, and the first 13 second-difference cepstral coefficients. Since the hybrid was based on an 8 state-per-word robust HMM recognizer, the RBF network contained a total of 73 output nodes (72 speech nodes and 1 background node). The error rate of the 8 state-per-word robust HMM recognizer on the speaker-independent E-set task was 15.7%. \n\n3 MODIFICATIONS TO THE HYBRID RECOGNIZER \n\nThe performance of the baseline HMM/RBF hybrid recognizer described in Section 1 is quite poor. We found it necessary to select the recognizer structure carefully and utilize intermediate outputs properly to achieve a higher level of performance. A full description of these modifications is presented in [10]. 
Briefly, they include normalizing the hidden node outputs to sum to 1.0, normalizing the RBF outputs by the corresponding a priori class probabilities as estimated from the initial Viterbi alignment, expanding the RBF network into three individually trained subnetworks corresponding to the cepstrum, first difference cepstrum, and second difference cepstrum data streams, setting a lower limit of 10^-5 on the values produced at the RBF output nodes, adjusting a global scaling factor applied to the variances of the RBF centers, and setting the number of centers to 33, 33, and 65 for the first, second, and third subnets, respectively. The structure of the final hybrid recognizer is shown in Figure 2. This recognizer has an error rate of 11.5% (binomial standard deviation = \u00b11.4) on the E-set test data compared to 15.7% (\u00b11.6) for the 8 state-per-word unimodal Gaussian HMM recognizer, and 9.6% (\u00b11.3) for a considerably more complex tied-mixture HMM recognizer [10]. The final hybrid system contained a total of 131 Gaussians and 9,563 weights. On a SUN SPARCstation 2, training time for the final hybrid recognizer was about 1 hour and testing time was about 10 minutes. \n\nFigure 2: Block diagram of multiple subnet hybrid recognizer. \n\n4 GLOBAL OPTIMIZATION \n\nIn the hybrid recognizer described above, discriminative training is performed at the frame level. A preliminary segmentation by the HMM recognizer assigns each speech frame to a specific RBF output node or, equivalently, an HMM word state. The RBF network weights are then computed to minimize the squared error between the network output and the desired output over all input frames. The goal of the recognizer, however, is to classify words. To meet this goal, discriminant training should be performed on word-level rather than frame-level outputs. 
Recently, several investigators have described techniques that optimize parameters based on word-level discriminant criteria [1, 3]. These techniques seek to maximize a mutual information type of criterion: \n\nC = log(L_c / L), \n\nwhere L_c is the likelihood score of the word model corresponding to the correct result and L = \u03a3_w L_w is the sum of the word likelihood scores for all models. By computing \u2202C/\u2202\u03b8, the gradient of C with respect to parameter \u03b8, we can optimize any parameter in the hybrid recognizer using the update equation \n\n\u03b8_new = \u03b8_old + \u03b7 \u2202C/\u2202\u03b8, \n\nwhere \u03b8_new is the new value of parameter \u03b8, \u03b8_old is the previous value, and \u03b7 is a gain term proportional to the learning rate. Following [1], we refer to the word-level optimization technique as \"global optimization.\" \n\nTo apply global optimization to the HMM/RBF hybrid recognizer, we derived the formulas for the gradient of C with respect to w_ij^k, the weight connecting RBF center i to RBF output node j in subnet k; P_j^k, the RBF output normalization factor for RBF output node j in subnet k; and m_il^k, the lth element of the mean of center i of subnet k. For each token of length T frames, the weight gradient is given by \n\n\u2202C/\u2202w_ij^k = (\u03b4_j - P_w) \u03a3_{t=1}^{T} (\u03b1_jt \u03b2_jt / L_w) (\u03b8_it^k / s_jt^k), \n\nwith analogous expressions for \u2202C/\u2202P_j^k and \u2202C/\u2202m_il^k, where \n\nL_w = likelihood score for word model w, \nP_w = L_w / \u03a3_w L_w is the normalized word likelihood, \n\u03b4_j = 1 if RBF output node j is a member of the correct word model, 0 otherwise, \n\u03b1_jt = forward partial probability of HMM state j at time t, \n\u03b2_jt = backward partial probability of HMM state j at time t, \ns_jt^k = unnormalized output of RBF node j of subnet k at time t, \n\u03b8_it^k = normalized output of the ith Gaussian center of subnet k at time t, with \u03a3_i \u03b8_it^k = 1, \nx_lt^k = lth element of the input vector for subnet k at time t, \nh^k = global scaling factor for the variances of subnet k, \n\u03c3_il^k = lth component of the standard deviation of the ith Gaussian center of subnet k, \nN^k = number of RBF output nodes in subnet k. \n\nIn implementing global optimization, the frame-level training procedure described earlier serves to initialize system parameters and hill climbing methods are used to reestimate parameters iteratively. Thus, weights are initialized to the values derived using the one-pass matrix inversion procedure, RBF output normalization factors are initialized to the class priors, and Gaussian means are initialized to the k-means clustering values. Note that while the priors sum to one, no such constraint was placed on the RBF output normalization factors during global optimization. \n\nIt is worth noting that since the RBF network outputs in the hybrid recognizer are a posteriori probabilities normalized by a priori class probabilities, their values may exceed 1. The accumulation of these quantities in the Viterbi decoders often leads to values of \u03b1_jt \u03b2_jt and L_w in the range of 10^80 or greater. Numerical problems with the implementation of the global optimization equations were avoided by using log arithmetic for intermediate operations and working with the quantity \u03b2_jt / L_w throughout. Values of \u03b7 which produced reasonable results were generally in the range of 10^-10 to 10^-6. \n\nThe results of using the global optimization technique to estimate the RBF weights are shown in Figure 3. Figure 3(a) shows the recognition performance on the training and test sets versus the number of training iterations and Figure 3(b) tracks the value of the criterion C = log(L_c / L) on the training and test set under the same conditions. 
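The criterion and the hill-climbing update can be sketched directly. This sketch uses log-sum-exp as a stand-in for the log arithmetic described above, so that word scores on the order of 10^80 stay representable; the function names are assumptions, not the paper's code.

```python
import numpy as np

def criterion(log_likelihoods, correct):
    # log_likelihoods: log L_w for each word model; correct: index of the true word
    # C = log(L_c / L), with L the sum of all word likelihoods; the
    # log-sum-exp reduction avoids ever forming the huge raw scores
    log_L = np.logaddexp.reduce(log_likelihoods)
    return log_likelihoods[correct] - log_L

def hill_climb_step(theta, grad_C, eta=1e-8):
    # gradient-ascent update: theta_new = theta_old + eta * dC/dtheta
    # (useful eta values were roughly 1e-10 to 1e-6)
    return theta + eta * grad_C
```

Iterating this update on the weights, normalization factors, or means reestimates them from their frame-level initial values.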
It is apparent that the method succeeds in iteratively increasing the value of the criterion and in significantly lowering the error rate on the training data. Unfortunately, this behavior does not extend to improved performance on the test data. This suggests that global optimization is overfitting the hybrid word models to the training data. Using global optimization to estimate the RBF output normalization factors and the Gaussian means produced similar results. \n\nFigure 3: (a) Error rates for training and test data. (b) Criterion C for training and test data. \n\n5 ACCURACY OF BAYES PROBABILITY ESTIMATION \n\nThree methods were used to determine how well RBF outputs estimate Bayes probabilities. First, since network outputs must sum to one if they are probabilities, we computed the RMS error between the sum of the RBF outputs and unity for all frames of the test data. The average RMS error was low (10^-4 or less for each subnet). Second, the average output of each RBF node was computed because this should equal the a priori probability of the class associated with the node [9]. This condition was true for each subnet with an average RMS error on the order of 10^-5. \n\nFor the final method, we partitioned the outputs into 100 equal size bins between 0.0 and 1.0. For each input pattern, we used the output values to select the appropriate bins and incremented the corresponding bin counts by one. In addition, we incremented the correct-class bin count for the one bin which corresponded to the class of the input pattern. 
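The binning procedure just described amounts to building a calibration histogram; a sketch under that reading, with illustrative names:

```python
import numpy as np

def calibration_counts(outputs, labels, n_bins=100):
    # outputs: (N, C) network output values; labels: (N,) true class per pattern
    # values outside [0, 1] are fixed at the end bins, as described in the text
    bins = np.clip((np.clip(outputs, 0.0, 1.0) * n_bins).astype(int), 0, n_bins - 1)
    total = np.zeros(n_bins)      # bin counts accumulated over all output nodes
    correct = np.zeros(n_bins)    # counts where the node matched the true class
    N, C = outputs.shape
    for c in range(C):
        np.add.at(total, bins[:, c], 1)
        np.add.at(correct, bins[labels == c, c], 1)
    # relative frequency of correct labeling per bin; for true Bayes
    # probabilities these values should lie near the diagonal
    with np.errstate(invalid='ignore', divide='ignore'):
        rel_freq = correct / total
    return rel_freq, total
```

Plotting rel_freq against the bin centers gives the diagonal comparison used in Figure 4.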
For example, data indicated that for the 61,466 frames of training tokens, nodal outputs of the cepstra subnet in the range 0.095-0.105 occurred 29,698 times and were correct classifications (regardless of class) 3,067 times. If the outputs of the network were true Bayesian probabilities, we would expect the relative frequency of correct labeling to be close to 0.1. Similarly, relative frequencies measured in other intervals would also be expected to be close to the value of the corresponding center of the interval. Thus, a plot of the relative frequencies for each bin versus the bin centers should show the measured values lying close to the diagonal. \n\nThe measured relative frequency data for the cepstra subnet and \u00b12\u03c3 bounds for the binomial standard deviations of the relative frequencies are shown in Figure 4. Outputs below 0.0 and above 1.0 are fixed at 0.0 and 1.0, respectively. Although the relative frequencies tend to be clustered around the diagonal, many values lie outside the bounds. Furthermore, goodness-of-fit measurements using the \u03c7\u00b2 test indicate that fits fail at significance levels well below .01. We conclude that although the system provides good recognition accuracy, better performance may be obtained with improved estimation of Bayesian probabilities. \n\nFigure 4: Relative frequency of correct class labeling and \u00b12\u03c3 bounds for the binomial standard deviation, cepstra subnet. 
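The \u00b12\u03c3 bounds plotted in Figure 4 follow from the binomial standard deviation of a proportion; a sketch, assuming each bin center is taken as the true probability p and the bin count as the number of trials n (an interpretation, not the paper's stated formula):

```python
import numpy as np

def binomial_bounds(bin_centers, counts):
    # +/-2 sigma bounds on the per-bin relative frequency, treating the bin
    # center p as the true Bayes probability and the bin count n as trials
    p = np.asarray(bin_centers, dtype=float)
    n = np.asarray(counts, dtype=float)
    sigma = np.sqrt(p * (1.0 - p) / n)   # binomial std dev of a proportion
    return p - 2.0 * sigma, p + 2.0 * sigma
```

A measured relative frequency falling outside these bounds suggests the corresponding outputs are not well calibrated as probabilities.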
\n\n6 SUMMARY AND CONCLUSIONS \n\nThis paper describes a hybrid isolated-word speech recognizer which successfully integrates Radial Basis Function neural networks and Hidden Markov Models. The hybrid's performance is better than that of a tied-mixture recognizer of comparable complexity and near that of a tied-mixture recognizer of considerably greater complexity. The structure of the RBF networks and the processing of network outputs had to be carefully selected to provide this level of performance. A global optimization technique designed to maximize a word discrimination criterion did not succeed in improving performance further. Statistical tests indicated that the accuracy of the Bayesian probability estimation performed by the RBF networks could be improved. We conclude that RBF networks can be used to provide good performance and short training times in hybrid recognizers and that these systems may require fewer parameters than Gaussian-mixture-based recognizers at comparable performance levels. \n\nAcknowledgements \n\nThis work was sponsored by the Defense Advanced Research Projects Agency. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Government. \n\nReferences \n\n[1] Yoshua Bengio, Renato De Mori, Giovanni Flammia, and Ralf Kompe. Global optimization of a neural network - Hidden Markov model hybrid. Technical Report TR-SOCS-90.22, McGill University School of Computer Science, Montreal, Qc., Canada, December 1990. \n\n[2] Herve Bourlard and Nelson Morgan. A continuous speech recognition system embedding MLP into HMM. In D. Touretzky, editor, Advances in Neural Information Processing 2, pages 186-193. Morgan Kaufmann, San Mateo, CA, 1990. \n\n[3] John S. Bridle. Alpha-nets: A recurrent neural network architecture with a hidden Markov model interpretation. 
Speech Communication, 9:83-92, 1990. \n\n[4] Ron Cole, Yeshwant Muthusamy, and Mark Fanty. The ISOLET spoken letter database. Technical Report CSE 90-004, Oregon Graduate Institute of Science and Technology, Beaverton, OR, March 1990. \n\n[5] Michael Franzini, Kai-Fu Lee, and Alex Waibel. Connectionist Viterbi training: A new hybrid method for continuous speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, April 1990. \n\n[6] Richard P. Lippmann. Pattern classification using neural networks. IEEE Communications Magazine, 27(11):47-54, November 1989. \n\n[7] Richard P. Lippmann and Ed A. Martin. Multi-style training for robust isolated-word speech recognition. In Proceedings International Conference on Acoustics, Speech and Signal Processing, pages 705-708. IEEE, April 1987. \n\n[8] Kenney Ng and Richard P. Lippmann. A comparative study of the practical characteristics of neural network and conventional pattern classifiers. In R. P. Lippmann, J. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing 3. Morgan Kaufmann, San Mateo, CA, 1991. \n\n[9] Mike D. Richard and Richard P. Lippmann. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, in press. \n\n[10] Elliot Singer and Richard P. Lippmann. A speech recognizer using radial basis function neural networks in an HMM framework. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1992. \n", "award": [], "sourceid": 527, "authors": [{"given_name": "Elliot", "family_name": "Singer", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}