{"title": "Links Between Markov Models and Multilayer Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 502, "page_last": 510, "abstract": null, "full_text": "502 \n\nLINKS BETWEEN MARKOV MODELS AND \n\nMULTILAYER PERCEPTRONS \n\nH. Bourlard t,t & C.J. Wellekens t \n\n(t)Philips Research Laboratory \nBrussels, B-1170 Belgium. \n\nmInt. Compo Science Institute \nBerkeley, CA 94704 USA. \n\nABSTRACT \n\nHidden Markov models are widely used for automatic speech recog(cid:173)\nnition. They inherently incorporate the sequential character of the \nspeech signal and are statistically trained. However, the a-priori \nchoice of the model topology limits their flexibility. Another draw(cid:173)\nback of these models is their weak discriminating power. Multilayer \nperceptrons are now promising tools in the connectionist approach \nfor classification problems and have already been successfully tested \non speech recognition problems. However, the sequential nature of \nthe speech signal remains difficult to handle in that kind of ma(cid:173)\nchine. In this paper, a discriminant hidden Markov model is de(cid:173)\nfined and it is shown how a particular multilayer perceptron with \ncontextual and extra feedback input units can be considered as a \ngeneral form of such Markov models. \n\nINTRODUCTION \n\nHidden Markov models (HMM) [Jelinek, 1976; Bourlard et al., 1985] are widely used \nfor automatic isolated and connected speech recognition. Their main advantages \nlie in the ability to take account of the time sequential order and variability of \nspeech signals. However, the a-priori choice of a model topology (number of states, \nprobability distributions and transition rules) limits the flexibility of the HMl\\l's, \nin particular speech contextual information is difficult to incorporate. Another \ndrawback of these models is their weak discriminating power. 
This fact is clearly illustrated in [Bourlard & Wellekens, 1989; Waibel et al., 1988] and several solutions have recently been proposed in [Bahl et al., 1986; Bourlard & Wellekens, 1989; Brown, 1987]. \n\nThe multilayer perceptron (MLP) is now a familiar and promising tool in the connectionist approach for classification problems [Rumelhart et al., 1986; Lippmann, 1987] and has already been widely tested on speech recognition problems [Waibel et al., 1988; Watrous & Shastri, 1987; Bourlard & Wellekens, 1989]. However, the sequential nature of the speech signal remains difficult to handle with MLPs. It is shown here how an MLP with contextual and extra feedback input units can be considered as a form of discriminant HMM. \n\nSTOCHASTIC MODELS \n\nTRAINING CRITERIA \n\nStochastic speech recognition is based on the comparison of an utterance to be recognized with a set of probabilistic finite state machines known as HMMs. These are trained such that the probability P(W_i|X) that model W_i has produced the associated utterance X is maximized, but the parameter space over which this optimization is performed makes the difference between independently trained models and discriminant ones. Indeed, the probability P(W_i|X) can be written as \n\nP(W_i|X) = P(X|W_i) P(W_i) / P(X) .   (1) \n\nIn a recognition phase, P(X) may be considered as a constant since the model parameters are fixed but, in a training phase, this probability depends on the parameters of all possible models. Taking account of the fact that the models are mutually exclusive and if Λ represents the parameter set (for all possible models), (1) may then be rewritten as: \n\nP(W_i|X, Λ) = P(X|W_i, Λ) P(W_i) / Σ_k P(X|W_k, Λ) P(W_k) .   (2) \n\nMaximization of P(W_i|X, Λ) as given by (2) is usually simplified by restricting it to the subspace of the parameters of W_i alone. This restriction leads to the Maximum Likelihood Estimators (MLE). 
The summation term in the denominator is then constant over the parameter space of W_i and thus maximization of P(X|W_i, Λ) implies that of (2). A language model provides the value of P(W_i) independently of the acoustic decoding [Jelinek, 1976]. \n\nOn the other hand, maximization of P(W_i|X, Λ) with respect to the whole parameter space (i.e. the parameters of all models W_1, W_2, ...) leads to discriminant models since it implies that the contribution of P(X|W_i, Λ) P(W_i) should be enhanced while that of the rival models, represented by \n\nΣ_{k≠i} P(X|W_k, Λ) P(W_k) , \n\nshould be reduced. This maximization with respect to the whole parameter space has been shown equivalent to the maximization of the Mutual Information (MMI) between a model and a vector sequence [Bahl et al., 1986; Brown, 1987]. \n\nSTANDARD HIDDEN MARKOV MODELS \n\nIn the regular discrete HMM, the acoustic vectors (e.g. corresponding to 10 ms speech frames) are generally quantized in a front-end processor where each one is replaced by the closest (e.g. according to a Euclidean norm) prototype vector y_i selected in a predetermined finite set Y of cardinality I. Let Q be a set of K different states q(k), with k = 1, ..., K. Markov models are then constituted by the association (according to a predefined topology) of some of these states. If HMMs are trained along the MLE criterion, the parameters of the models (defined hereunder) must be optimized for maximizing P(X|W), where X is a training sequence of quantized acoustic vectors x_n ∈ Y, with n = 1, ..., N, and W is its associated Markov model made up of L states q_ℓ ∈ Q with ℓ = 1, ..., L. Of course, the same state may occur several times with different indices ℓ, so that L ≠ K in general. Let us denote by q_ℓ^n the presence on state q_ℓ at a given time n ∈ [1, N]. 
Since events q_ℓ^n are mutually exclusive, probability P(X|W) can be written for any arbitrary n: \n\nP(X|W) = Σ_{ℓ=1}^{L} P(q_ℓ^n, X|W) ,   (3) \n\nwhere P(q_ℓ^n, X|W) denotes the probability that X is produced by W while associating x_n with state q_ℓ. Maximization of (3) can be worked out by the classical forward-backward recurrences of the Baum-Welch algorithm [Jelinek, 1976; Bourlard et al., 1985]. \n\nMaximization of P(X|W) is also usually approximated by the Viterbi criterion. It can be viewed as a simplified version of the MLE criterion where, instead of taking account of all possible state sequences in W capable of producing X, one merely considers the most probable one. To make all possible paths apparent, (3) can also be rewritten as \n\nP(X|W) = Σ_{ℓ_1=1}^{L} ... Σ_{ℓ_N=1}^{L} P(q_{ℓ_1}^1, ..., q_{ℓ_N}^N, X|W) , \n\nand the explicit formulation of the Viterbi criterion is obtained by replacing all summations by a \"max\" operator. Probability (3) is then approximated by: \n\nP̄(X|W) = max_{ℓ_1, ..., ℓ_N} P(q_{ℓ_1}^1, ..., q_{ℓ_N}^N, X|W) ,   (4) \n\nand can be calculated by the classical dynamic time warping (DTW) algorithm [Bourlard et al., 1985]. In that case, each training vector is then uniquely associated with only one particular transition. \n\nIn both cases (MLE and Viterbi), it can be shown that, according to classical hypotheses, P(X|W) and P̄(X|W) are estimated from the set of local parameters p[q(ℓ), y_i | q⁻(k), W], for i = 1, ..., I and k, ℓ = 1, ..., K. Notations q⁻(k) and q(ℓ) denote states ∈ Q observed at two consecutive instants. In the particular case of the Viterbi criterion, these parameters are estimated by: \n\np[q(ℓ), y_i | q⁻(k), W] = n_{ikℓ} / Σ_{i'=1}^{I} Σ_{ℓ'=1}^{K} n_{i'kℓ'} ,   ∀i ∈ [1, I], ∀k, ℓ ∈ [1, K],   (5) \n\nwhere n_{ikℓ} denotes the number of times each prototype vector y_i has been associated with a particular transition from q(k) to q(ℓ) during the training. 
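As an illustrative aside (this sketch and its names are ours, not the authors'), the count-based estimate (5) can be computed directly from a Viterbi alignment given as a list of (prototype index, state) pairs in time order:

```python
# Sketch: estimating the Viterbi local parameters of equation (5) by counting.
from collections import defaultdict

def estimate_local_params(alignment):
    """alignment: list of (prototype index i, state k) pairs in time order.
    Returns p[(i, k, l)] ~ p[q(l), y_i | q(k)]: counts n_ikl of prototype y_i
    emitted on a transition q(k) -> q(l), normalized over all (i', l')
    leaving state k, as in (5)."""
    n = defaultdict(int)           # n[(i, k, l)]
    total_from = defaultdict(int)  # sum over i', l' of n_{i'kl'}
    for (_, k), (i, l) in zip(alignment, alignment[1:]):
        n[(i, k, l)] += 1
        total_from[k] += 1
    return {key: c / total_from[key[1]] for key, c in n.items()}

# Toy usage: two states (0, 1), three prototypes (0..2).
align = [(0, 0), (1, 0), (1, 1), (2, 1), (0, 0)]
p = estimate_local_params(align)
```

By construction, the estimates leaving a given state k sum to one over all pairs (y_i, q(ℓ)), which is the normalization in (5).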
However, if the models are trained along this formulation of the Viterbi algorithm, no discrimination is taken into account. For instance, it is interesting to observe that the local probability (5) is not the suitable measure for the labeling of a prototype vector y_i, i.e. to find the most probable state given a current input vector and a specified previous state. Indeed, the decision should ideally be based on the Bayes rule. In that case, the most probable state q(ℓ_opt) is defined by \n\nℓ_opt = argmax_ℓ p[q(ℓ) | y_i, q⁻(k)] ,   (6) \n\nand not on the basis of (5). It can easily be proved that the estimates of the Bayes probabilities in (6) are: \n\np[q(ℓ) | y_i, q⁻(k)] = n_{ikℓ} / Σ_{ℓ'=1}^{K} n_{ikℓ'} .   (7) \n\nIn the last section, it is shown that these values can be generated at the output of a particular MLP. \n\nDISCRIMINANT HMM \n\nFor quantized acoustic vectors and the Viterbi criterion, an alternative HMM using discriminant local probabilities can also be described. Indeed, as the correct criterion should be based on (1), comparing with (4), the \"Viterbi formulation\" of this probability is \n\nP̄(W|X) = max_{ℓ_1, ..., ℓ_N} P(q_{ℓ_1}^1, ..., q_{ℓ_N}^N, W | X) .   (8) \n\nExpression (8) clearly puts the best path into evidence. The right hand side factorizes into \n\nP(q_{ℓ_1}^1, ..., q_{ℓ_N}^N, W | X) = P(q_{ℓ_1}^1, ..., q_{ℓ_N}^N | X) . P(W | q_{ℓ_1}^1, ..., q_{ℓ_N}^N, X) \n\nand suggests two separate steps for the recognition. The first factor represents the acoustic decoding in which the acoustic vector sequence is converted into a sequence of states. Then, the second factor represents a phonological and lexical step: once the sequence of states is known, the model W associated with X can be found from the state sequence without an explicit dependence on X, so that \n\nP(W | q_{ℓ_1}^1, ..., q_{ℓ_N}^N, X) = P(W | q_{ℓ_1}^1, ..., q_{ℓ_N}^N) . \n\nFor example, if the states represent phonemes, this probability must be estimated from phonological knowledge of the vocabulary, once for all, in a separate process without any reference to the input vector sequence. 
\nOn the contrary, P(q_{ℓ_1}^1, ..., q_{ℓ_N}^N | X) is immediately related to the discriminant local probabilities and may be factorized into \n\nP(q_{ℓ_1}^1, ..., q_{ℓ_N}^N | X) = Π_{n=1}^{N} P(q_{ℓ_n}^n | q_{ℓ_1}^1, ..., q_{ℓ_{n-1}}^{n-1}, X) .   (9) \n\nNow, each factor of (9) may be simplified by relaxing the conditional constraints. More specifically, the factors of (9) are assumed dependent on the previous state only and on a signal window of length 2p + 1 centered around the current acoustic vector. The current expression of these local contributions becomes \n\np(q_{ℓ_n}^n | q_{ℓ_{n-1}}^{n-1}, X_{n-p}^{n+p}) ,   (10) \n\nwhere input contextual information is now taken into account, X_m^n denoting the vector sequence x_m, x_{m+1}, ..., x_n. If input contextual information is neglected (p = 0), equation (10) represents nothing else but the discriminant local probability (7) and is at the root of a discriminant discrete HMM. Of course, as for (7), these local probabilities could also simply be estimated by counting on the training set, but the exponential increase of the number of parameters with the width 2p + 1 of the contextual window would require an exceedingly large storage capacity as well as an excessive size of training data to obtain statistically significant parameters. It is shown in the following section how this drawback is circumvented by using an MLP. It is indeed proved that, for the training vectors, the optimal outputs of a recurrent and context-sensitive MLP are the estimates of the local probabilities (10). Given its so-called \"generalization property\", the MLP can then be used for interpolating on the test set. \n\nOf course, from the local contributions (10), P(W|X) can still be obtained by the classical one-stage dynamic programming [Ney, 1984; Bourlard et al., 1985]. 
Indeed, inside the HMM, the following dynamic programming recurrence holds: \n\nP̄(q_ℓ | X_1^n) = max_k P̄(q_k | X_1^{n-1}) . p(q_ℓ^n | q_k^{n-1}, X_{n-p}^{n+p}) ,   (11) \n\nwhere parameter k runs over all possible states preceding q_ℓ and P̄(q_ℓ | X_1^n) denotes the cumulated best-path probability of reaching state q_ℓ while having emitted the partial sequence X_1^n. \n\nRECURRENT MLP AND DISCRIMINANT HMM \n\nLet q(k), with k = 1, ..., K, be the output units of an MLP associated with different classes (each of them corresponding to a particular state of Q) and I the number of prototype vectors y_i. Let v_i denote a particular binary input of the MLP. If no contextual information is used, v_i is the binary representation of the index i of prototype vector y_i and, more precisely, a vector with all zero components but the i-th one equal to 1. In the case of contextual input, vector v_i is obtained by concatenating several representations of prototype vectors belonging to a given contextual window centered on a current y_i. The architecture of the resulting MLP is then similar to NETtalk, initially described in [Sejnowski & Rosenberg, 1987] for mapping written texts to phoneme strings. The same kind of architecture has also been proved successful in performing the classification of acoustic vector strings into phoneme strings, where each current vector was classified by taking account of its surrounding vectors [Bourlard & Wellekens, 1989]. The input field is then constituted by several groups of units, each group representing a prototype vector. Thus, if 2p + 1 is the width of the contextual window, there are 2p + 1 groups of I units in the input layer. \n\nHowever, since each acoustic vector is classified independently of the preceding classifications in such feedforward architectures, the sequential character of the speech signal is not modeled. 
The system has no short-term memory from one classification to the next one and successive classifications may be contradictory. This phenomenon does not appear in HMMs since only some state sequences are permitted by the particular topology of the model. \n\nLet us assume that the training is performed on a sequence of N binary inputs {v_{i_1}, ..., v_{i_N}}, where each i_n represents the index of the prototype vector at time n (if no context) or the \"index\" of one of the I^{2p+1} possible inputs (in the case of a 2p + 1 contextual window). Sequential classification must rely on the previous decisions but the final goal remains the association of the current input vectors with their own classes. An MLP achieving this task will generate, for each current input vector v_{i_n} and each class q(ℓ), ℓ = 1, ..., K, an output value g(i_n, k_n, ℓ) depending on the class q(k_n) in which the preceding input vector v_{i_{n-1}} was classified. Supervision comes from the a-priori knowledge of the classification of each v_{i_n}. The training of the MLP parameters is usually based on the minimization of a least-mean-square criterion (LMSE) [Rumelhart et al., 1986] which, with our requirements, takes the form: \n\nE = (1/2) Σ_{n=1}^{N} Σ_{ℓ=1}^{K} [g(i_n, k_n, ℓ) - d(i_n, ℓ)]² ,   (12) \n\nwhere d(i_n, ℓ) represents the target value of the ℓ-th output associated with the input vector v_{i_n}. Since the purpose is to associate each input vector with a single class, the target outputs, for a vector v_i ∈ q(ℓ), are: \n\nd(i, ℓ) = 1 ,   d(i, m) = 0, ∀m ≠ ℓ , \n\nwhich can also be expressed, for each particular v_i ∈ q(ℓ), as d(i, m) = δ_{mℓ}. The target outputs d(i, ℓ) only depend on the current input vector v_i and the considered output unit, and not on the classification of the previous one. 
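As an illustrative sketch (ours, with hypothetical names), the contextual one-hot input coding and the one-hot targets d(i, ℓ) described above can be written as follows; clamping the window at the sequence edges is our assumption, not something specified in the paper:

```python
# Sketch: a 2p+1 window of prototype indices is encoded as 2p+1 concatenated
# one-hot groups of I units, and the target is the one-hot vector of the
# current class, as used in criterion (12).
import numpy as np

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def contextual_input(proto_indices, n, p, I):
    """Concatenate one-hot codes of prototypes n-p .. n+p (edges clamped)."""
    window = [min(max(m, 0), len(proto_indices) - 1)
              for m in range(n - p, n + p + 1)]
    return np.concatenate([one_hot(proto_indices[m], I) for m in window])

I, K, p = 4, 3, 1
protos = [0, 2, 1, 3]    # prototype index per frame
classes = [1, 1, 0, 2]   # state label per frame (supervision)
x = contextual_input(protos, n=1, p=p, I=I)  # (2p+1)*I = 12 input units
d = one_hot(classes[1], K)                   # target output vector
```

Exactly one unit is active in each of the 2p + 1 input groups, and exactly one target output is 1, matching d(i, m) = δ_{mℓ}.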
The difference between criterion (12) and that of a memoryless machine is the additional index k_n which takes account of the previous decision. Collecting all terms depending on the same indexes, (12) can thus be rewritten as: \n\nE = (1/2) Σ_{i=1}^{J} Σ_{k=1}^{K} Σ_{ℓ=1}^{K} Σ_{m=1}^{K} n_{ikℓ} . [g(i, k, m) - d(i, m)]² ,   (13) \n\nwhere J = I if the MLP input is context independent and J = I^{2p+1} if a 2p + 1 contextual window is used; n_{ikℓ} represents the number of times v_i has been classified in q(ℓ) while the previous vector was known to belong to class q(k). Thus, whatever the MLP topology is, i.e. the number of its hidden layers and of units per layer, the optimal output values g_opt(i, k, m) are obtained by canceling the partial derivative of E versus g(i, k, m). It can easily be proved that, doing so, the optimal values for the outputs are then \n\ng_opt(i, k, m) = n_{ikm} / Σ_{ℓ=1}^{K} n_{ikℓ} .   (14) \n\nFigure 1: Recurrent and Context-Sensitive MLP (⊠ = delay); the input field gathers the feedback units, the left context, the current vector and the right context, and the output decisions are fed back through a delay. \n\nThe optimal g(i, k, m)'s obtained from the minimization of the MLP criterion are thus the estimates of the Bayes probabilities, i.e. the discriminant local probabilities defined by (7) if no context is used and by (10) in the contextual case. It is important to keep in mind that these optimal values can be reached only provided the MLP contains enough parameters and does not get stuck in a local minimum during the training. \n\nA convenient way to generate the g(i, k, ℓ) is to modify the MLP input as follows. For each v_{i_n}, an extended vector v'_{i_n} = (v^f_n, v_{i_n}) is formed, where v^f_n is an extra input vector containing the information on the decision taken at time n - 1. Since output information is fed back into the input field, such an MLP has a recurrent topology. 
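The claim that the count ratios (14) minimize criterion (13) is easy to check numerically; the following sketch (ours, not the authors') builds random counts n_ikℓ and verifies that perturbing the optimal outputs can only increase E:

```python
# Numerical check: g_opt(i,k,m) = n_ikm / sum_l n_ikl, eq. (14), minimizes
# the quadratic criterion (13) over unconstrained outputs g.
import numpy as np

rng = np.random.default_rng(0)
J, K = 5, 4
n = rng.integers(1, 10, size=(J, K, K)).astype(float)  # counts n[i, k, l]

def E(g):
    # E = 1/2 sum_{i,k,l,m} n_ikl * (g(i,k,m) - d(i,m))^2, eq. (13),
    # with targets d(i, m) = delta_{m l} for each observed pair (i, k, l).
    total = 0.0
    for i in range(J):
        for k in range(K):
            for l in range(K):
                d = np.eye(K)[l]
                total += 0.5 * n[i, k, l] * np.sum((g[i, k] - d) ** 2)
    return total

g_opt = n / n.sum(axis=2, keepdims=True)               # eq. (14)
perturbed = g_opt + 0.01 * rng.standard_normal((J, K, K))
assert E(g_opt) <= E(perturbed)
```

The optimal outputs are also properly normalized: for each (i, k) they sum to one over m, as Bayes probability estimates must.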
\nThe final architecture of the corresponding MLP (with contextual information and output feedback) is represented in Figure 1 and is similar in design to the net developed in [Jordan, 1986] to produce output pattern sequences. The main advantage of this topology, when compared with other recurrent models proposed for sequential processing [Elman, 1988; Watrous, 1987], over and above its possible interpretation in terms of HMMs, is the control of the information fed back during the training. Indeed, since the training data consist of consecutive labeled speech frames, the correct sequence of output states is known and the training is supervised by providing the correct information. \n\nReplacing in (13) d(i, m) by the optimal values (14) provides a new criterion where the target outputs depend now on the current vector, the considered output and the classification of the previous vector: \n\nE* = (1/2) Σ_{i=1}^{J} Σ_{k=1}^{K} Σ_{ℓ=1}^{K} Σ_{m=1}^{K} n_{ikℓ} . [g(i, k, m) - n_{ikm} / Σ_{ℓ'=1}^{K} n_{ikℓ'}]² ,   (15) \n\nand it is clear (by canceling the partial derivative of E* versus g(i, k, m)) that the lower bound of E* is reached for the same optimal outputs as (14) but is now equal to zero, which provides a very useful control parameter during the training phase. \n\nIt is evident that these results directly follow from the minimized criterion and not from the topology of the model. In that respect, it is interesting to note that the same optimal values (14) may also result from other criteria such as, for instance, the entropy [Hinton, 1987] or the relative entropy [Solla et al., 1988] of the targets with respect to the outputs. Indeed, in the case of the relative entropy, criterion (12) is changed into: \n\nE_e = Σ_{n=1}^{N} Σ_{ℓ=1}^{K} [ d(i_n, ℓ) . ln( d(i_n, ℓ) / g(i_n, k_n, ℓ) ) + (1 - d(i_n, ℓ)) . ln( (1 - d(i_n, ℓ)) / (1 - g(i_n, k_n, ℓ)) ) ] ,   (16) \n\nand canceling its partial derivative versus g(i, k, m) yields the optimal values (14). In that case, the optimal outputs effectively correspond to E_{e,min} = 0. \n\nOf course, since these results are independent of the topology of the models, they remain also valid for linear discriminant functions but, in that case, it is not guaranteed that the optimal values (14) can be reached. However, it has to be noted that in some particular cases, even for not linearly separable classes, these optimal values are already obtained with linear discriminant functions (and thus with a one-layered perceptron trained according to an LMS criterion). \n\nIt is also important to point out that the same kind of recurrent MLP could also be used to estimate the local probabilities of higher-order Markov models, where the local contributions in (9) are no longer assumed dependent on the previous state only but also on several preceding ones. This is easily implemented by extending the input field to the information related to these preceding classifications. Another solution is to represent, in the same extra input vector, a weighted sum (e.g. exponentially decreasing with time) of the preceding outputs [Jordan, 1986]. \n\nCONCLUSION \n\nDiscrimination is an essential requirement in speech recognition and is not incorporated in the standard HMM. A discriminant HMM has been described and links between this new model and a recurrent MLP have been shown. Recurrence permits taking account of the sequential information in the output sequence. Moreover, input contextual information is also easily captured by extending the input field. It has finally been proved that the local probabilities of the discriminant HMM may be computed (or interpolated) by the particular MLP so defined. 
\n\nReferences \n\n[1] Bahl L.R., Brown P.F., de Souza P.V. & Mercer R.L. (1986). Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition, Proc. ICASSP-86, Tokyo, pp. 49-52. \n\n[2] Bourlard H., Kamp Y., Ney H. & Wellekens C.J. (1985). Speaker-Dependent Connected Speech Recognition via Dynamic Programming and Statistical Methods, Speech and Speaker Recognition, Ed. M.R. Schroeder, KARGER. \n\n[3] Bourlard H. & Wellekens C.J. (1989). Speech Pattern Discrimination and Multilayer Perceptrons, Computer, Speech and Language, 3, (to appear). \n\n[4] Brown P. (1987). The Acoustic-Modeling Problem in Automatic Speech Recognition, Ph.D. thesis, Comp. Sc. Dep., Carnegie-Mellon University. \n\n[5] Elman J.L. (1988). Finding Structure in Time, CRL Technical Report 8801, UCSD. \n\n[6] Hinton G.E. (1987). Connectionist Learning Procedures, Technical Report CMU-CS-87-115. \n\n[7] Jelinek F. (1976). Continuous Speech Recognition by Statistical Methods, Proceedings IEEE, vol. 64, no. 4, pp. 532-555. \n\n[8] Jordan M.I. (1986). Serial Order: A Parallel Distributed Processing Approach, UCSD, Tech. Report 8604. \n\n[9] Lippmann R.P. (1987). An Introduction to Computing with Neural Nets, IEEE ASSP Magazine, vol. 4, pp. 4-22. \n\n[10] Ney H. (1984). The Use of a One-Stage Dynamic Programming Algorithm for Connected Word Recognition, IEEE Trans. ASSP, vol. 32, pp. 263-271. \n\n[11] Rumelhart D.E., Hinton G.E. & Williams R.J. (1986). Learning Internal Representations by Error Propagation, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, Eds. D.E. Rumelhart & J.L. McClelland, MIT Press. \n\n[12] Sejnowski T.J. & Rosenberg C.R. (1987). Parallel Networks that Learn to Pronounce English Text, Complex Systems, vol. 1, pp. 145-168. \n\n[13] Solla S.A., Levin E. & Fleisher M. 
(1988). Accelerated Learning in Layered Neural Networks, AT&T Bell Labs, Manuscript. \n\n[14] Waibel A., Hanazawa T., Hinton G., Shikano K. & Lang K. (1988). Phoneme Recognition Using Time-Delay Neural Networks, Proc. ICASSP-88, New York. \n\n[15] Watrous R.L. & Shastri L. (1987). Learning Phonetic Features Using Connectionist Networks: an Experiment in Speech Recognition, Proceedings of the First International Conference on Neural Networks, IV-381-388, San Diego, CA. \n", "award": [], "sourceid": 163, "authors": [{"given_name": "Herv\u00e9", "family_name": "Bourlard", "institution": null}, {"given_name": "C.", "family_name": "Wellekens", "institution": null}]}