{"title": "An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1237, "page_last": 1244, "abstract": null, "full_text": "An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition \n\nSamy Bengio \n\nDalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) \nCP 592, rue du Simplon 4, 1920 Martigny, Switzerland \nbengio@idiap.ch, http://www.idiap.ch/~bengio \n\nAbstract \n\nThis paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence together with the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performance under various noise conditions. \n\n1 Introduction \n\nHidden Markov Models (HMMs) are statistical tools that have been used successfully over the last 30 years to model difficult tasks such as speech recognition [6] or biological sequence analysis [4]. They are very well suited to handling discrete or continuous sequences of varying sizes. Moreover, an efficient training algorithm (EM) is available, as well as an efficient decoding algorithm (Viterbi), which provides the optimal sequence of states (and the corresponding sequence of high-level events) associated with a given sequence of low-level data. 
\n\nOn the other hand, multimodal information processing is currently a very challenging framework of applications, including multimodal person authentication, multimodal speech recognition, multimodal event analyzers, etc. In that framework, the same sequence of events is represented not by a single sequence of data but by a series of sequences of data, each possibly coming from a different modality: video streams with various viewpoints, audio stream(s), etc. \n\nOne such task, which will be presented in this paper, is multimodal speech recognition using both a microphone and a camera recording a speaker simultaneously while he (she) speaks. It is indeed well known that seeing the speaker's face in addition to hearing his (her) voice can often improve speech intelligibility, particularly in noisy environments [7], mainly thanks to the complementarity of the visual and acoustic signals. Previous solutions proposed for this task can be subdivided into two categories [8]: early integration, where both signals are first modified to reach the same frame rate and are then modeled jointly, or late integration, where the signals are modeled separately and are combined later, during decoding. While in the former solution the alignment between the two sequences is decided a priori, in the latter there is no explicit learning of the joint probability of the two sequences. An example of late integration is presented in [3], where the authors present a multi-stream approach in which each stream is modeled by a different HMM, while decoding is done on a combined HMM (with various combination approaches proposed). \n\nIn this paper, we present a novel Asynchronous Hidden Markov Model (AHMM) that can learn the joint probability of pairs of sequences of data representing the same sequence of events, even when the events are not synchronized between the sequences. 
In fact, the model is able to desynchronize the streams by temporarily stretching one of them in order to obtain a better match between the corresponding frames. The model can thus be directly applied to the problem of audio-visual speech recognition, where, for instance, the lips sometimes start to move before any sound is heard. The paper is organized as follows: in the next section, the AHMM model is presented, followed by the corresponding EM training and Viterbi decoding algorithms. Related models are then presented and implementation issues are discussed. Finally, experiments on an audio-visual speech recognition task based on the M2VTS database are presented, followed by a conclusion. \n\n2 The Asynchronous Hidden Markov Model \n\nFor the sake of simplicity, let us present here the case where one is interested in modeling the joint probability of two asynchronous sequences, denoted x_1^T and y_1^S, with S <= T without loss of generality^1. We are thus interested in modeling p(x_1^T, y_1^S). As this is intractable if done directly by considering all possible combinations, we introduce a hidden variable q which represents the state, as in the classical HMM formulation, and which is synchronized with the longer sequence. Let N be the number of states. Moreover, in the model presented here, we always emit x_t at time t and sometimes emit y_s at time t. Let us first define eps(i, t) = P(tau_t = s | tau_{t-1} = s - 1, q_t = i, x_1^t, y_1^s) as the probability that the system emits the next observation of sequence y at time t while in state i. The additional hidden variable tau_t = s can be seen as the alignment between y and q (and x, which is aligned with q). Hence, we model p(x_1^T, y_1^S, q_1^T, tau_1^T). 
\n\n2.1 Likelihood Computation \n\nUsing classical HMM independence assumptions, a simple forward procedure can be used to compute the joint likelihood of the two sequences, by introducing the following intermediate variable alpha(i, s, t) = p(x_1^t, y_1^s, tau_t = s, q_t = i) for each state and each possible alignment between the sequences x and y: \n\nalpha(i, s, t) = eps(i, t) p(x_t, y_s | q_t = i) Sum_{j=1}^{N} P(q_t = i | q_{t-1} = j) alpha(j, s-1, t-1) + (1 - eps(i, t)) p(x_t | q_t = i) Sum_{j=1}^{N} P(q_t = i | q_{t-1} = j) alpha(j, s, t-1)   (1) \n\nwhich is very similar to the corresponding alpha variable used in normal HMMs^2. It can then be used to compute the joint likelihood of the two sequences as follows: \n\np(x_1^T, y_1^S) = Sum_{i=1}^{N} p(q_T = i, tau_T = S, x_1^T, y_1^S) = Sum_{i=1}^{N} alpha(i, S, T).   (2) \n\n^1 In fact, we assume that for all pairs of sequences (x, y), the sequence x is always at least as long as the sequence y. If this is not the case, a straightforward extension of the proposed model is then necessary. \n\n2.2 Viterbi Decoding \n\nUsing the same technique and replacing all the sums by max operators, a Viterbi decoding algorithm can be derived in order to obtain the most probable path along the sequence of states and alignments between x and y: \n\nV(i, s, t) = max_{q_1^{t-1}, tau_1^{t-1}} p(q_t = i, tau_t = s, x_1^t, y_1^s)   (3) \n= max( eps(i, t) p(x_t, y_s | q_t = i) max_j P(q_t = i | q_{t-1} = j) V(j, s-1, t-1), (1 - eps(i, t)) p(x_t | q_t = i) max_j P(q_t = i | q_{t-1} = j) V(j, s, t-1) ) \n\nThe best path is then obtained after having computed V(i, S, T)^3 for the best final state i and backtracking along the best path that could reach it. \n\n2.3 An EM Training Algorithm \n\nAn EM training algorithm can also be derived in the same fashion as in classical HMMs. We here sketch the resulting algorithm, without going into more details^4. 
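As a side note, the forward recursion of Eq. (1) and the termination of Eq. (2) are straightforward to implement in the log domain. The following NumPy sketch is purely illustrative: the function name, array layouts and 0-based indexing are assumptions, not the paper's implementation.

```python
import numpy as np

def ahmm_loglik(log_pi, log_A, log_bxy, log_bx, log_eps):
    """Forward recursion of Eq. (1) and termination of Eq. (2), log domain.

    Assumed (0-based) layouts:
      log_pi[i]        log P(q_0 = i), the initial state distribution
      log_A[i, j]      log P(q_t = i | q_{t-1} = j)
      log_bxy[i, t, s] log p(x_{t+1}, y_{s+1} | q = i)
      log_bx[i, t]     log p(x_{t+1} | q = i)
      log_eps[i, t]    log eps(i, t+1), probability of also emitting on y (< 1)
    Returns log p(x_1^T, y_1^S).
    """
    N, T = log_bx.shape
    S = log_bxy.shape[2]
    log_alpha = np.full((N, S + 1, T + 1), -np.inf)
    log_alpha[:, 0, 0] = log_pi                 # nothing emitted yet
    log_1m_eps = np.log1p(-np.exp(log_eps))     # log(1 - eps)

    def trans_in(prev):                         # logsum_j log_A[i, j] + prev[j]
        return np.logaddexp.reduce(log_A + prev[None, :], axis=1)

    for t in range(1, T + 1):
        for s in range(S + 1):
            # state emits x_t only: the alignment variable tau stays at s
            acc = log_1m_eps[:, t - 1] + log_bx[:, t - 1] \
                + trans_in(log_alpha[:, s, t - 1])
            if s >= 1:
                # state also emits y_s: tau advances from s-1 to s
                both = (log_eps[:, t - 1] + log_bxy[:, t - 1, s - 1]
                        + trans_in(log_alpha[:, s - 1, t - 1]))
                acc = np.logaddexp(acc, both)
            log_alpha[:, s, t] = acc
    return np.logaddexp.reduce(log_alpha[:, S, T])  # Eq. (2): sum over final states
```

Here log_bxy is assumed precomputed for all (t, s) pairs; in practice one would only fill the entries allowed by the alignment constraint discussed in Section 4.1.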
\n\nBackward Step: Similarly to the forward step based on the alpha variable used to compute the joint likelihood, a backward variable beta can also be derived as follows: \n\nbeta(i, s, t) = Sum_{j=1}^{N} eps(j, t+1) p(x_{t+1}, y_{s+1} | q_{t+1} = j) P(q_{t+1} = j | q_t = i) beta(j, s+1, t+1) + Sum_{j=1}^{N} (1 - eps(j, t+1)) p(x_{t+1} | q_{t+1} = j) P(q_{t+1} = j | q_t = i) beta(j, s, t+1).   (4) \n\n^2 The full derivations are not given in this paper but can be found in the appendix of [1]. \n^3 In the case where one is only interested in the best state sequence (no matter the alignment), the solution is then to marginalize over all the alignments during decoding (essentially keeping the sums on the alignments and the max on the state space). This solution has not yet been tested. \n^4 See the appendix of [1] for more details. \n\nE-Step: Using both the forward and backward variables, one can compute the posterior probabilities of the hidden variables of the system, namely the posterior on the state when it emits on both sequences, the posterior on the state when it emits on x only, and the posterior on transitions. Let alpha1(i, s, t) be the part of alpha(i, s, t) where state i emits on y at time t: \n\nalpha1(i, s, t) = eps(i, t) p(x_t, y_s | q_t = i) Sum_{j=1}^{N} P(q_t = i | q_{t-1} = j) alpha(j, s-1, t-1)   (5) \n\nand similarly, let alpha0(i, s, t) be the part of alpha(i, s, t) where state i does not emit on y at time t: \n\nalpha0(i, s, t) = (1 - eps(i, t)) p(x_t | q_t = i) Sum_{j=1}^{N} P(q_t = i | q_{t-1} = j) alpha(j, s, t-1).   (6) \n\nThen the posterior on state i when it emits joint observations of sequences x and y is \n\nP(q_t = i, tau_t = s | tau_{t-1} = s - 1, x_1^T, y_1^S) = alpha1(i, s, t) beta(i, s, t) / p(x_1^T, y_1^S),   (7) \n\nthe posterior on state i when it emits the next observation of sequence x only is \n\nP(q_t = i, tau_t = s | tau_{t-1} = s, x_1^T, y_1^S) = alpha0(i, s, t) beta(i, s, t) / p(x_1^T, y_1^S),   (8) \n\nand the posterior on the transition between states i and j is \n\nP(q_t = i, q_{t-1} = j | x_1^T, y_1^S) = [P(q_t = i | q_{t-1} = j) / p(x_1^T, y_1^S)] Sum_{s=0}^{S} [ alpha(j, s-1, t-1) p(x_t, y_s | q_t = i) eps(i, t) beta(i, s, t) + alpha(j, s, t-1) p(x_t | q_t = i) (1 - eps(i, t)) beta(i, s, t) ].   (9) \n\nM-Step: The Maximization step is performed exactly as in normal HMMs: when the distributions are modeled by exponential functions such as Gaussian Mixture Models, an exact maximization can be performed using the posteriors. Otherwise, a Generalized EM is performed by gradient ascent, back-propagating the posteriors through the parameters of the distributions. \n\n3 Related Models \n\nThe present AHMM model is related to the Pair HMM model [4], which was proposed to search for the best alignment between two DNA sequences. It was thus designed and used mainly for discrete sequences. Moreover, the architecture of the Pair HMM model is such that a given state is designed to always emit either one OR two vectors, while in the proposed AHMM model, each state can emit either one or two vectors, depending on eps(i, t), which is learned. In fact, when eps(i, t) is deterministic and solely depends on i, we can indeed recover the Pair HMM model by slightly transforming the architecture. \n\nIt is also very similar to the asynchronous version of Input/Output HMMs [2], which was proposed for speech recognition applications. The main difference here is that in AHMMs both sequences are considered as output, while in Asynchronous IOHMMs one of the sequences (the shorter one, the output) is conditioned on the other one (the input). The resulting Viterbi decoding algorithm is thus different, since in Asynchronous IOHMMs one of the sequences, the input, is known during decoding, which is not the case in AHMMs. 
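To make the decoding step concrete, here is a small NumPy sketch of the Viterbi recursion of Eq. (3), which jointly recovers the best state sequence and the best alignment between x and y. Names and array layouts are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ahmm_viterbi(log_pi, log_A, log_bxy, log_bx, log_eps):
    """Viterbi recursion of Eq. (3): the sums of the forward pass become max.

    Same (assumed) log-domain layouts as a forward pass:
      log_pi[i], log_A[i, j], log_bxy[i, t, s], log_bx[i, t], log_eps[i, t].
    Returns the best state sequence q_1..q_T and alignment tau_1..tau_T.
    """
    N, T = log_bx.shape
    S = log_bxy.shape[2]
    V = np.full((N, S + 1, T + 1), -np.inf)
    V[:, 0, 0] = log_pi
    back = np.zeros((N, S + 1, T + 1, 2), dtype=int)  # (best prev state, y emitted?)
    log_1m_eps = np.log1p(-np.exp(log_eps))
    for t in range(1, T + 1):
        for s in range(S + 1):
            # case 1: state i emits x_t only (tau stays at s)
            prev0 = log_A + V[:, s, t - 1][None, :]
            j0 = prev0.argmax(axis=1)
            s0 = log_1m_eps[:, t - 1] + log_bx[:, t - 1] + prev0.max(axis=1)
            if s >= 1:
                # case 2: state i emits both x_t and y_s (tau advances)
                prev1 = log_A + V[:, s - 1, t - 1][None, :]
                j1 = prev1.argmax(axis=1)
                s1 = log_eps[:, t - 1] + log_bxy[:, t - 1, s - 1] + prev1.max(axis=1)
            else:
                j1, s1 = j0, np.full(N, -np.inf)
            emit_y = s1 > s0
            V[:, s, t] = np.where(emit_y, s1, s0)
            back[:, s, t, 0] = np.where(emit_y, j1, j0)
            back[:, s, t, 1] = emit_y
    # backtrack from the best final state, requiring the full alignment s = S
    i, s = int(V[:, S, T].argmax()), S
    states, align = [], []
    for t in range(T, 0, -1):
        states.append(i)
        align.append(s)
        j, dy = back[i, s, t]
        i, s = int(j), s - int(dy)
    return states[::-1], align[::-1]
```

Keeping the max over states but summing over alignments, as suggested in footnote 3, would amount to replacing the alignment-wise max by a logaddexp in the recursion above.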
\n\n4 Implementation Issues \n\n4.1 Time and Space Complexity \n\nThe proposed algorithms (either training or decoding) have a complexity of O(N^2 S T), where N is the number of states (assuming the worst case with ergodic connectivity), S is the length of sequence y and T is the length of sequence x. This can quickly become intractable if both x and y are longer than, say, 1000 frames. It can however be shortened when a priori knowledge about possible alignments between x and y is available. For instance, one can force the alignment between x_t and y_s to be such that |t - s T/S| < k, where k is a constant representing the maximum stretching allowed between x and y, which should not depend on S nor T. In that case, the complexity (both in time and space) becomes O(N^2 T k), which is k times the usual HMM training/decoding complexity. \n\n4.2 Distributions to Model \n\nIn order to implement this system, we thus need to model the following distributions: \n\n\u2022 P(q_t = i | q_{t-1} = j): the transition distribution, as in normal HMMs; \n\n\u2022 p(x_t | q_t = i): the emission distribution in the case where only x is emitted, as in normal HMMs; \n\n\u2022 p(x_t, y_s | q_t = i): the emission distribution in the case where both sequences are emitted. This distribution could be implemented in various forms, depending on the assumptions made on the data: \n\n- x and y are independent given state i: \n\np(x_t, y_s | q_t = i) = p(x_t | q_t = i) p(y_s | q_t = i)   (10) \n\n- y is conditioned on x: \n\np(x_t, y_s | q_t = i) = p(y_s | x_t, q_t = i) p(x_t | q_t = i)   (11) \n\n- the joint probability is modeled directly, possibly forcing some common parameters of p(x_t | q_t = i) and p(x_t, y_s | q_t = i) to be shared. 
\n\nIn the experiments described later in the paper, we have chosen the latter implementation, with no sharing except during initialization; \n\n\u2022 eps(i, t) = P(tau_t = s | tau_{t-1} = s - 1, q_t = i, x_1^t, y_1^s): the probability of emitting on sequence y at time t in state i. Under various assumptions, this probability could be represented as independent of i, independent of s, or independent of x_t and y_s. In the experiments described later in the paper, we have chosen the latter implementation. \n\n5 Experiments \n\nAudio-visual speech recognition experiments were performed using the M2VTS database [5], which contains 185 recordings of 37 subjects, each containing the acoustic and video signals of the subject pronouncing the French digits from zero to nine. The video consisted of 286x360 pixel color images with a 25 Hz frame rate, while the audio was recorded at 48 kHz using 16 bit PCM coding. Although the M2VTS database is one of the largest databases of its type, it is still relatively small compared to reference audio databases used in speech recognition. Hence, in order to increase the significance level of the experimental results, a 5-fold cross-validation method was used. Note that all the subjects always pronounced the same sequence of words, but this information was not used during recognition^5. \n\nThe audio data was down-sampled to 8 kHz and, every 10 ms, a vector of 16 MFCC coefficients and their first derivatives, as well as the derivative of the log energy, was computed, for a total of 33 features. Each image of the video stream was coded using 12 shape features and 12 intensity features, as described in [3]. The first derivative of each of these features was also computed, for a total of 48 features. 
\n\nThe HMM topology was as follows: we used left-to-right HMMs for each word of the vocabulary, which consisted of the following 11 words: zero, un, deux, trois, quatre, cinq, six, sept, huit, neuf, silence. Each model had between 3 and 9 states, including non-emitting begin and end states. In each emitting state, there were 3 distributions: p(x_t | q_t), the emission distribution of audio-only data, which consisted of a mixture of 10 Gaussians (of dimension 33); p(x_t, y_s | q_t), the joint emission distribution of audio and video data, which also consisted of a mixture of 10 Gaussians (of dimension 33 + 48 = 81); and eps(i, t), the probability that the system should emit on the video sequence, which was implemented for these preliminary experiments as a simple table. \n\nTraining was done using the EM algorithm described in the paper. However, in order to keep the computational time tractable, a constraint was imposed on the alignment between the audio and video streams: we did not consider alignments where audio and video information were farther than 0.5 seconds from each other. \n\nComparisons were made between the AHMM (taking into account audio and video) and a normal HMM taking into account either the audio or the video only. We also compared the model with a normal HMM trained on both audio and video streams manually synchronized (each frame of the video stream was repeated in multiple copies in order to reach the same rate as the audio stream). Moreover, in order to show the interest of robust multimodal speech recognition, we injected various levels of noise into the audio stream during decoding (training was always done using clean audio). The noise was taken from the Noisex database [9], and was injected in order to reach signal-to-noise ratios of 10 dB, 5 dB and 0 dB. \n\n^5 Nevertheless, it can be argued that transitions between words could have been learned using the training data. 
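The 0.5 second constraint above is an instance of the banding idea of Section 4.1. Writing the band as |t - s T/S| < k (one possible form; an assumption here, as is the helper name), the set of admissible alignments per frame of x can be sketched as:

```python
def band(t, S, T, k):
    """Admissible alignment values s at time t under the band |t - s*T/S| < k.

    Illustrative helper: k is the maximum stretching allowed between the two
    streams, expressed in frames of the longer sequence x.
    """
    return [s for s in range(S + 1) if abs(t - s * T / S) < k]

# Replacing `for s in range(S + 1)` by `for s in band(t, S, T, k)` in the
# forward, backward or Viterbi recursions shrinks the alignment loop to O(k)
# values per frame, i.e. O(N^2 T k) overall instead of O(N^2 S T).
```

For instance, with audio frames every 10 ms, k = 50 would correspond to a 0.5 second window of this kind.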
\n\nNote that all the hyper-parameters of these systems, such as the number of Gaussians in the mixtures, the number of EM iterations, or the minimum value of the variances of the Gaussians, were not tuned on the M2VTS dataset. They were taken from a model previously trained on a different task, Numbers'95. \n\nFigure 1 and Table 1 present the results. As can be seen, the AHMM yielded better results as soon as the noise level was significant (for clean data, the performance using the audio stream only was almost perfect, hence no enhancement was expected). Moreover, it never deteriorated significantly (using a 95% confidence interval) below the level of the video stream, no matter the level of noise in the audio stream. \n\n[Plot: word error rate versus noise level for the audio HMM, audio+video HMM, video HMM and audio+video AHMM systems.] \n\nFigure 1: Word Error Rates (in percent, the lower the better) of various systems under various noise conditions during decoding (from 15 to 0 dB additive noise). The proposed model is the AHMM using both audio and video streams. \n\nObservations | Model | 15 dB | 10 dB | 5 dB | 0 dB \naudio | HMM | 2.9 (\u00b1 2.4) | 11.9 (\u00b1 4.7) | 38.7 (\u00b1 7.1) | 79.1 (\u00b1 5.9) \naudio+video | HMM | 21.5 (\u00b1 6.0) | 28.1 (\u00b1 6.5) | 35.3 (\u00b1 6.9) | 45.4 (\u00b1 7.2) \naudio+video | AHMM | 4.8 (\u00b1 3.1) | 11.4 (\u00b1 4.6) | 22.3 (\u00b1 6.0) | 41.1 (\u00b1 7.1) \n\nTable 1: Word Error Rates (WER, in percent, the lower the better) and corresponding 95% Confidence Intervals (CI, in parentheses) of various systems under various noise conditions during decoding (from 15 to 0 dB additive noise). 
The proposed model is the AHMM using both audio and video streams. An HMM using the clean video data only obtains 39.6% WER (\u00b1 7.1). \n\nAn interesting side effect of the model is that it provides an optimal alignment between the audio and the video streams. Figure 2 shows the alignment obtained while decoding sequence cd01 on data corrupted with 10 dB Noisex noise. It shows that the rate between video and audio is far from constant (otherwise the alignment would have followed the stepped line), and hence computing the joint probability using the AHMM appears more informative than using a naive alignment and a normal HMM. \n\n6 Conclusion \n\nIn this paper, we have presented a novel asynchronous HMM architecture to handle multiple sequences of data representing the same sequence of events. The model was inspired by two other well-known models, namely Pair HMMs and Asynchronous IOHMMs. An EM training algorithm was derived, as well as a Viterbi decoding algorithm, and speech recognition experiments were performed on a multimodal database, yielding significant improvements on noisy audio data. Various propositions were made to implement the model, but only the simplest ones were tested in this paper. Other solutions should thus be investigated. Moreover, other applications of the model should also be investigated, such as multimodal authentication. \n\n[Plot: alignment of the video stream against the audio stream for sequence cd01.] \n\nFigure 2: Alignment obtained by the model between the video and audio streams on sequence cd01 corrupted with 10 dB Noisex noise. The vertical lines show the obtained segmentation between the words. The stepped line represents a constant alignment. \n\nAcknowledgments \n\nThis research has been partially carried out in the framework of the European project LAVA, funded by the Swiss OFES project number 01.0412. The Swiss NCCR project IM2 has also partly funded this research. 
The author would like to thank Stephane Dupont for providing the extracted visual features and the experimental protocol used in the paper. \n\nReferences \n\n[1] S. Bengio. An asynchronous hidden Markov model for audio-visual speech recognition. Technical Report IDIAP-RR 02-26, IDIAP, 2002. \n\n[2] S. Bengio and Y. Bengio. An EM algorithm for asynchronous input/output hidden Markov models. In Proceedings of the International Conference on Neural Information Processing, ICONIP, Hong Kong, 1996. \n\n[3] S. Dupont and J. Luettin. Audio-visual speech modelling for continuous speech recognition. IEEE Transactions on Multimedia, 2:141-151, 2000. \n\n[4] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. \n\n[5] S. Pigeon and L. Vandendorpe. The M2VTS multimodal face database (release 1.00). In Proceedings of the First International Conference on Audio- and Video-based Biometric Person Authentication, AVBPA, 1997. \n\n[6] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989. \n\n[7] W. H. Sumby and I. Pollack. Visual contributions to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212-215, 1954. \n\n[8] A. Q. Summerfield. Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London, Series B, 335:71-78, 1992. \n\n[9] A. Varga, H. J. M. Steeneken, M. Tomlinson, and D. Jones. The Noisex-92 study on the effect of additive noise on automatic speech recognition. Technical report, DRA Speech Research Unit, 1992. ", "award": [], "sourceid": 2301, "authors": [{"given_name": "Samy", "family_name": "Bengio", "institution": null}]}