{"title": "A probabilistic model for generating realistic lip movements from speech", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": null, "full_text": "A Probabilistic Model for Generating\nRealistic Lip Movements from Speech\n\nGwenn Englebienne\n\nSchool of Computer Science\n\nUniversity of Manchester\n\nge@cs.man.ac.uk\n\nTim F. Cootes\n\nImaging Science and Biomedical Engineering\n\nUniversity of Manchester\n\nTim.Cootes@manchester.ac.uk\n\nMagnus Rattray\n\nSchool of Computer Science\n\nUniversity of Manchester\n\nmagnus.rattray@manchester.ac.uk\n\nAbstract\n\nThe present work aims to model the correspondence between facial motion\nand speech. The face and sound are modelled separately, with phonemes\nbeing the link between both. We propose a sequential model and evaluate\nits suitability for the generation of the facial animation from a sequence of\nphonemes, which we obtain from speech. We evaluate the results both by\ncomputing the error between generated sequences and real video, as well as\nwith a rigorous double-blind test with human subjects. Experiments show\nthat our model compares favourably to other existing methods and that\nthe sequences generated are comparable to real video sequences.\n\n1 Introduction\nGenerative systems that model the relationship between face and speech o\ufb00er a wide range\nof exciting prospects. Models combining speech and face information have been shown to\nimprove automatic speech recognition [4]. Conversely, generating video-realistic animated\nfaces from speech has immediate applications to the games and movie industries. 
There is a strong correlation between lip movements and speech [7, 10], and there have been multiple attempts at generating an animated face that realistically matches given speech [2, 3, 9, 13]. Studies have indicated that speech might be informative not only of lip movement but also of movement in the upper regions of the face [3]. Incorporating speech therefore seems crucial to the generation of true-to-life animated faces.

Our goal is to build a generative probabilistic model, capable of generating realistic facial animations in real time, given speech. We first use an Active Appearance Model (AAM [6]) to extract features from the video frames. The AAM itself is generative and allows us to produce video-realistic frames from the features. We then use a Hidden Markov Model (HMM [12]) to align phoneme labels to the audio stream of video sequences, and use this information to label the corresponding video frames. We propose a model which, when trained on these labelled video frames, is capable of generating new, realistic video from unseen phoneme sequences. Our model is a modification of Switching Linear Dynamical Systems (SLDS [1, 15]) and we show that it performs better at generation than other existing models. We compare its performance to two previously proposed models by comparing the sequences they generate to a gold standard, features from real video sequences, and by asking volunteers to select the “real” video in a forced-choice test.

The results of human evaluation of our generated sequences are extremely encouraging. Our system performs well with any speech, and since it can easily handle real-time generation of the facial animation, it brings a realistic-looking, talking avatar within reach.

2 The Data

We used sequences from the freely available on-line news broadcast Democracy Now!
The show is broadcast every weekday in a high-quality MP4 format, and as such constitutes a constant source of new data. The text transcripts are available on-line, thus greatly facilitating the training of a speech recognition system. We manually extracted short video sequences of the news presenter talking (removing any inserts, telephone interviews, etc.), cutting at “natural” positions in the stream, viz. during pauses for breath and silences. The sequences are all of the same person, albeit on different days within a period of slightly more than a month. There was no reason to restrict the data to a single person, other than the difficulty of obtaining sequences of similar quality from other sources.

All usable sequences were extracted from the data, that is, those where the face of the speaker was visible and the sound was not corrupted by external sound sources. The sequences do include hesitations, corrections, incomplete words, noticeable fatigue, breath, swallowing, etc. The speaker visibly makes an effort to speak clearly, but obviously makes no effort to reduce head motion or facial expression, and the data is hence probably as representative of the problem as can be hoped for.

In total, sequences totalling 1 hour and 7 minutes of video were extracted and annotated.1 The data was split into independent training and test sets for 10-fold cross-validation, based on the number of sequences in each set (rather than the total amount of data). This resulted in training sets of an average of 60 minutes of data, and test sets of approximately 7 minutes. All models evaluated here were trained and tested on the same data sets.

Sound features and labelling. The sequences are split into an audio and a video stream, which are treated separately (see Figure 1).
From the sound stream, we extract Mel Frequency Cepstrum Coefficients (MFCC) at a rate of 100 Hz, using tools from the HMM Tool Kit [16], resulting in 13-dimensional feature vectors. We train an HMM on these MFCC features, and use it to align phonetic labels to the sound. This is an easier task than unrestricted speech recognition, and is done satisfactorily by a simple HMM with monophones as hidden states, where mixtures of Gaussian distributions model the emission densities. The sound samples are labelled with the Viterbi path through the HMM that was “unrolled” with the phonetic transcription of the text.

Figure 1: Combining sound and face

The labels obtained from the sound stream are then used to label the corresponding video frames. The difference in rate (the video is processed at 29.97 frames per second while MFCC coefficients are computed at 100 Hz) is handled by simple voting: each video frame is labelled with the phoneme that labels most of the corresponding sound frames.

Face features. The feature extraction for the video was done using an Active Appearance Model (AAM [6]). The AAM represents both the shape and the texture of an object in an image. The shape of the lower part of the face is represented by the location of 23 points on key features of the eyes, mouth and jaw-line (see Figure 2). Given the position of the points in a set of training images, we align them to a common co-ordinate frame and apply PCA to learn a low-dimensional linear model capturing the shape change [5]. The intensities across the region in each example are warped to the mean shape using a simple triangulation of the region (Figure 2), and PCA is applied to the vectors of intensities sampled from each image. This leads to a low-dimensional linear model of the intensities in the mean frame. Efficient algorithms exist for matching such models to new images [6].
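As a minimal illustration of the linear shape model described above (an illustrative sketch, not the authors' implementation; function names and the number of components are assumptions):

```python
import numpy as np

def fit_shape_pca(shapes, n_components=8):
    """Fit a linear (PCA) shape model to aligned landmark sets.

    shapes: (n_examples, 2 * n_points) array of aligned (x, y) landmark
    co-ordinates, one flattened shape per row.  Returns the mean shape
    and the principal modes of shape variation.
    """
    mean = shapes.mean(axis=0)
    centred = shapes - mean
    # SVD of the centred data matrix yields the principal components in vt.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return mean, vt[:n_components]

def project(shape, mean, modes):
    """Encode a shape as a low-dimensional parameter vector."""
    return modes @ (shape - mean)

def reconstruct(params, mean, modes):
    """Map shape parameters back to landmark co-ordinates."""
    return mean + modes.T @ params
```

Because the parameters are a linear projection, `reconstruct(project(s, ...), ...)` recovers any shape that lies in the span of the retained modes, which is what makes the AAM features invertible back to images.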
By combining the shape and intensity models, a wide range of convincing synthetic faces can be generated [6]. In this case a 32-parameter model proves sufficient. This is closely related to eigenfaces [14] but gives far better results, as shape and texture are decoupled [8]. Since the AAM parameters are a low-dimensional linear projection of the original object, projecting those parameters back to the high-dimensional space allows us to reconstruct the modelled part of the original image.

1 The data is publicly available at http://www.cs.manchester.ac.uk/ai/public/demnow.

Figure 2: The face was modelled with an AAM. A set of training images is manually labelled as shown in the two leftmost images. A statistical model of the shape is then combined with a model of the texture within the triangles between feature points. Applying the model to a new image results in a vector of coefficients, which can be used to reconstruct the original image.

3 Modelling the dynamics of the face

We model the face using only phoneme labels to capture the shared information between speech and face. We use 41 distinct phoneme labels, two of which are reserved for breath and silence, the rest being the generally accepted phonemes of the English language. Most earlier techniques that use discrete labels to generate synthetic video sequences use some form of smooth interpolation between key frames [2, 9]. This requires finding the correct key frames, and lacks the flexibility of a probabilistic formulation. Brand uses an HMM where Gaussian distributions are fitted to a concatenation of the data features and “delta” features [3]. Since the distribution is fitted to both the features and the difference between features, the resulting “distribution” cannot be sampled, as doing so would result in a nonsensical mismatch between features and delta features.
It is therefore not genuinely generative, and obtaining new sequences from the model requires solving an optimisation problem.

Under Brand’s approach, new sequences are obtained by finding the most likely sequence of observations for a set of labels. This is done by setting the first derivative of the likelihood with respect to the observations to zero, resulting in a set of linear equations involving, at each time t, the observation y^s_t and the previous observation y^s_{t−1}. Such a set of linear equations can be solved relatively efficiently thanks to its block-band-diagonal structure. This requires the storage of O(d²T) elements and O(d³T) time to solve, where d is twice the dimensionality of the face features and T is the number of frames in a sequence. This becomes non-trivial for sequences exceeding a few tens of seconds. More important, however, is that this cannot be done in real time, as the last label of the sequence must be known before the first observation can be computed.

In this work, we consider more standard probabilistic models of sequential data, which are genuinely generative. These models are shown to outperform Brand’s approach for the generation of realistic sequences.

Switching Linear Dynamical Systems. Before introducing the SLDS, we introduce some notational conventions. We have a set of S video sequences, which we index with s ∈ [1 . . . S]. The feature vector of the frame at time t in video sequence s is indicated as y^s_t ∈ R^d, and the complete set of feature vectors for that sequence is denoted {y}^{T_s}_1, where T_s is the length of the sequence. Continuous hidden variables are indicated as x and discrete state labels are indicated with π, where π ∈ [1 . . . Π].

In an SLDS, the sequence of observations {y}^{T_s}_1, which depends on a sequence of discrete labels {π}^{T_s}_1, is modelled as being a noisy version of a hidden sequence {x}^{T_s}_1.
Each state π is associated with a transition matrix A_π and with distributions for the output noise v and the process noise w, such that

y^s_t = B_{π^s_t} x^s_t + v_t,    x^s_t = A_{π^s_t} x^s_{t−1} + ν_{π^s_t} + w_t    for 2 ≤ t ≤ T_s,

and x^s_1 ∼ N(μ_{π^s_1}, Σ_{π^s_1}). Both the output noise v_t and the process noise w_t are normally distributed with zero mean: v_t ∼ N(0, R_{π^s_t}) and w_t ∼ N(0, Q_{π^s_t}).

Figure 3: Graphical representation of the different models: figure (a) depicts the dependencies in an SLDS when the labels are known and (b) represents our proposed DPDS, where we assume the process is noiseless. The circles are discrete and the squares are multivariate continuous quantities. The shaded elements are observed and the random variables in the dashed box are conditioned on the quantities outside of it.

The states in our application are the phonemes, which are obtained from the sound. Notice that in general, when the state labels are not known, computing the likelihood in an SLDS is intractable as it requires the enumeration of all possible state sequences, which is exponential in T [1]. In our case, however, the state label π^s_t of each frame is known from the sound and the likelihood can be computed with the same algorithm as for a standard Linear Dynamical System (LDS), which is linear in T. Parameter optimisation can therefore be carried out efficiently with a standard EM algorithm.
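With the labels known, the generative equations above can be sampled by a direct forward pass. The following is an illustrative sketch (not the authors' implementation; the parameter container and its field names are assumptions):

```python
import numpy as np

def sample_slds(labels, params, rng=None):
    """Sample one observation sequence from an SLDS with known state labels.

    labels: list of state indices pi_t, one per frame.
    params: dict of per-state arrays mu, Sigma, A, nu, Q, B, R, following
            the SLDS equations above (x_1 ~ N(mu, Sigma);
            x_t = A x_{t-1} + nu + w_t;  y_t = B x_t + v_t).
    """
    rng = np.random.default_rng() if rng is None else rng
    p0 = labels[0]
    # Initial hidden state: x_1 ~ N(mu_pi, Sigma_pi).
    x = rng.multivariate_normal(params["mu"][p0], params["Sigma"][p0])
    ys = []
    for t, p in enumerate(labels):
        if t > 0:
            # Process update with noise w_t ~ N(0, Q_pi).
            w = rng.multivariate_normal(np.zeros(x.size), params["Q"][p])
            x = params["A"][p] @ x + params["nu"][p] + w
        # Emission with output noise v_t ~ N(0, R_pi).
        v = rng.multivariate_normal(np.zeros(x.size), params["R"][p])
        ys.append(params["B"][p] @ x + v)
    return np.array(ys)
```

Setting the noise covariances to zero makes the pass deterministic, which is exactly the mean-sequence generation used later for evaluation.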
Also note that neither the SLDS nor the LDS is commonly described with an explicit state bias ν_{π^s_t}, as this can easily be emulated by augmenting each latent vector x^s_t with a 1 and incorporating ν_{π^s_t} into A_{π^s_t}. However, doing so prevents us from using a diagonal matrix for A_{π^s_t}, and experience has shown that the state mean is crucial to good prediction, while the lack of sufficient data or, as is the case with our data, the a priori known approximate independence of the data dimensions may make reducing the complexity of A_{π^s_t}, Q_{π^s_t} and R_{π^s_t} warranted.

In this form, the model is over-parametrised; it can be simplified without any loss of generality either by fixing Q_{π^s_t} to the identity matrix I or, if there is no reason to use a different dimensionality for x and y, by setting B_{π^s_t} = I. We did the latter, as this makes the resulting {x}^T_1 easier to interpret and compare across the different models we evaluate here.

We trained an SLDS by maximum likelihood and used the model to generate new sequences of face observations for given sequences of labels. This was done by computing the most likely sequence of observations for the given set of labels. An in-depth evaluation of the trained SLDS model, when used to generate new video sequences, is given in Section 4. This evaluation shows that the SLDS is overly flexible: it appears to explain the data well and results in a very high likelihood, but does a poor job of generating realistic new sequences.

Deterministic Process Dynamical System. We reduced the complexity of the model by simplifying its covariance structure. If we set the output noise v_t of the SLDS to zero, leaving only process noise, we obtain the autoregressive hidden Markov model [11].
This model has the advantage that it can be trained using an EM algorithm when the state labels are unknown, but we find that it performs very poorly at data generation. If we set the process noise w_t = 0, however, we obtain a more useful model. The complete hidden sequence {x}^T_1 is then determined exactly by the labels {π}^T_1. The log-likelihood log p({y}|{π}) is given by

log p({y}|{π}) = −(1/2) Σ_{s=1}^{S} [ log|Σ_{π^s_1}| + (y^s_1 − x^s_1)^⊤ Σ_{π^s_1}^{−1} (y^s_1 − x^s_1) + Σ_{t=2}^{T_s} ( log|R_{π^s_t}| + (y^s_t − x^s_t)^⊤ R_{π^s_t}^{−1} (y^s_t − x^s_t) ) + d T_s log 2π ]    (1)

where x^s_1 = μ_{π^s_1} and x^s_t = A_{π^s_t} x^s_{t−1} + ν_{π^s_t} for t > 1. We will now refer to this model as the Deterministic Process Dynamical System (DPDS, see Figure 3). In our implementation we model all matrices R_{π^s_t}, Σ_{π^s_t} as diagonal, and further reduce the complexity by sharing the output noise covariance over all states. It is reasonable to assume this because the features are the result of PCA and are therefore uncorrelated.

Figure 4: Comparison of the multiple models on the test data of 10-fold cross-validation. Panels show (a) the mean L1 distance, (b) the RMS error, (c) the mean L∞ distance and (d) the log-likelihood. Each plot shows the mean error of the generated data with respect to the real data over the 10 folds. The error bars span the 95% confidence interval of the true error.

Since in this case the labels π^s_t are known, equation (1) does not contain any hidden variables. Applying EM is therefore not necessary. Deriving a closed-form solution for the ML estimates of the parameters, however, results in solving polynomial equations of order T_s, because x^s_t = f(A_{π^s_2} · · · A_{π^s_t}).
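The forward evaluation of eq. (1) can be sketched as follows (illustrative only; diagonal matrices are stored as vectors of variances, matching the diagonal parametrisation described above):

```python
import numpy as np

def dpds_loglik(ys, labels, mu, Sigma, A, nu, R):
    """Forward evaluation of the DPDS log-likelihood, eq. (1), one sequence.

    ys: (T, d) observed features; labels: state index per frame.
    mu, Sigma, A, nu, R: per-state parameters, all diagonal quantities
    stored as (n_states, d) arrays.
    """
    T, d = ys.shape
    ll = 0.0
    # x_1 = mu_pi; the t = 1 term uses the initial covariance Sigma_pi.
    x = mu[labels[0]]
    var = Sigma[labels[0]]
    ll -= 0.5 * (np.sum(np.log(var)) + np.sum((ys[0] - x) ** 2 / var))
    for t in range(1, T):
        p = labels[t]
        # Deterministic process: x_t = A_pi x_{t-1} + nu_pi (no process noise).
        x = A[p] * x + nu[p]
        var = R[p]
        ll -= 0.5 * (np.sum(np.log(var)) + np.sum((ys[t] - x) ** 2 / var))
    # Constant normalisation term d * T * log(2 pi) from eq. (1).
    ll -= 0.5 * d * T * np.log(2 * np.pi)
    return ll
```

Note that x_t is never resampled or stored beyond the previous step, so the whole likelihood is computed in one O(dT) pass.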
An efficient solution is to use a gradient-based method. The log-likelihood of a sequence is a sum of scaled quadratic terms in (y^s_t − x^s_t), where x^s_t = f({π}^t_1). The log-likelihood must thus be computed by a forward iteration over all time steps t, using x^s_{t−1} to compute x^s_t. The gradients of the likelihood with respect to A_{π^s_t} can be computed numerically in a similar fashion, by applying the chain rule iteratively at each time step and storing the result for the next step. The same could be done for the other parameters; however, for given values of A_{π^s_t}, the values of μ_{π^s_t}, ν_{π^s_t} and R_{π^s_t} that maximise the likelihood can be computed exactly by solving a set of linear equations. This markedly improves the rate of convergence. An algorithm for the computation of the gradients with respect to A_{π^s_t} and the exact evaluation of the other parameters is given in Appendix A.

Sequence generation. Since all models parametrise the distribution of the data, we can sample from them to generate new observation sequences. In order to evaluate the performance of the models and compare it to Brand’s model, it is however useful to generate the most likely sequence of observation features for a sequence of labels and compare it with the features of the corresponding real video sequence.

For both the SLDS (when B_{π^s_t} = I) and the DPDS, the mean for a given sequence of labels {π}^T_1 is found by a forward iteration starting with ŷ_1 = μ_{π^s_1} and iterating for t > 1 with ŷ_t = A_{π^s_t} ŷ_{t−1} + ν_{π^s_t}. This does not require the storage of the complete sequence in memory, as the current observation only depends on the previous one. In setups where artificial speech is generated, the video sequence can therefore be generated at the same time as the audio sequence and without length limitations, with O(d) space and O(dT) time complexity, where d is the dimensionality of the face features (without delta features).

4 Evaluation against real video

We evaluated the models in two ways: (1) by computing the error between generated face features and a ground truth (the features of real video), and (2) by asking human subjects to rate how they perceived the sequences. Both tests were done on the same real-world data, but partitioned differently: the comparison to the ground truth was done using 10-fold cross-validation, while the test on humans was done using a single partitioning, due to the limited availability of unbiased test subjects.

Test error and likelihood. In order to test the models against the ground truth, we use the sound to align the labels to the video and generate the corresponding face features. We use 10-fold cross-validation and evaluate the performance of the models using different metrics, see Figure 4. Plot (a) shows, for different models, the L1 error between the face features generated for the test sound sequences and the face features extracted from the real video.

A       | prefer A | undecided | prefer B | B
--------+----------+-----------+----------+--------
Brand   |    5     |     7     |    54    | DPDS
Brand   |    4     |     7     |    55    | reality
Brand   |   36     |    21     |     9    | SLDS
DPDS    |   29     |    11     |    26    | reality
DPDS    |   60     |     5     |     1    | SLDS
reality |   58     |     5     |     3    | SLDS

DPDS ≈ reality ≻ Brand ≻ SLDS

Table 1: Raw results of the psychophysical test conducted by human volunteers. Every model is compared to every other model; the order in which models are listed in this table is meaningless. See text for details.
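The three error measures plotted in Figure 4 can be sketched as follows (an illustrative computation, not the authors' evaluation code):

```python
import numpy as np

def generation_errors(generated, real):
    """Per-sequence error measures between generated and real face features.

    generated, real: (T, d) arrays of AAM feature vectors, frame by frame.
    Returns the mean L1 distance, the RMS error and the mean L-infinity
    distance over the frames, as in Figure 4 (a)-(c).
    """
    diff = generated - real
    l1 = np.abs(diff).sum(axis=1).mean()      # mean per-frame L1 distance
    rms = np.sqrt((diff ** 2).mean())         # root mean square error
    linf = np.abs(diff).max(axis=1).mean()    # mean per-frame L-inf distance
    return l1, rms, linf
```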
We compared the sequences generated by the DPDS, Brand’s model and the SLDS to the most likely observations under a standard HMM. This last model just generates the mean face for each phoneme, resulting in very unnatural sequences. It illustrates how an obviously incorrect model nevertheless performs very similarly to the other models in terms of generation error. Plots (b) and (c) respectively show the corresponding Root Mean Square (RMS) and L∞ error. We can see that, except for the SLDS, which performs worse than the other methods in terms of L1, RMS and L∞ error, the generation errors of the models considered are, under all metrics, consistently not statistically significantly different.

In terms of the log-likelihood of the test data under the different models, the opposite is true: the traditional HMM and the DPDS clearly perform worst, while the SLDS performs dramatically better. The model with the highest likelihood generates the sequences with the largest error. The likelihood under Brand’s model cannot be compared directly as it has twice the number of features. These results notwithstanding, great differences can be seen in the quality of the generated video sequences, and the models giving the lowest error or the highest likelihood are far from generating the most realistic sequences. We have therefore performed a rigorous test in which volunteers were asked to evaluate the quality of the sequences.

Psychophysical test. For this experiment, we trained the models on a training set of 642 sequences of an average of 5 seconds each. We then labelled the sequences in our test set, which consists of 80 sequences and 436 seconds of video, with phonemes obtained from the sound.
These are substantial amounts of data, showing the face in a wide variety of positions.

We set up a web-based test, in which 33 volunteers compared 12 pairs of video sequences. All video sequences had the original sound, but the video stream was generated by any one of four methods: (1) from the face features extracted from the corresponding real video, (2) from the SLDS, (3) from Brand’s model and (4) from the DPDS. A pool of 80 sequences was generated from previously unseen videos. The 12 pairs were chosen such that each generation method was pitted against each other generation method twice (once on each side, left or right, in order to eliminate bias towards a particular side) in random order. For each pair, corresponding sequences were chosen from the respective pools at random. The volunteers were only told that the sequences were either real or artificial, and were asked either to select the real video or to indicate that they could not decide. The test is kept available on-line for validation at http://www.cs.manchester.ac.uk/ai/public/dpdseval.

The results are shown in Table 1. The first row, e.g., shows that when comparing Brand’s model with the DPDS, people thought that the sequence generated with the former model was real in 5 cases, could not make up their minds in 7 cases, and thought the sequence generated with the DPDS was real in 54 instances. These results indicate that the DPDS performs very well at generation, clearly much better than the two other models. Note however that this test discriminates between the models very harshly. Despite the strong down-voting of Brand’s model in this test, the sequences generated with that model do not look all that bad. They are over-smoothed, however, and humans appear to be very sensitive to that. Also remember that Brand’s model is the only model considered here with a closed-form solution for the parameter estimation given the labels.
Contrary to the other two models, it can easily be trained in the absence of labelling, using an EM algorithm.

In order to correlate human judgement with the generation errors discussed at the start of this section, we computed the same error measures on the data as partitioned for the psychophysical test. These confirmed the earlier conclusions: the SLDS, which humans like least, gives the highest likelihood and the worst generation errors, while the DPDS and Brand’s model do not give significantly different errors.

5 Conclusion

In this work we have proposed a truly generative model, which allows real-time generation of talking faces given speech. We have evaluated it both using multiple error measures and with a thorough test of human perception. The latter test clearly shows that our method perceptually outperforms the others and is virtually indistinguishable from reality. Compared to Brand’s method it is slower during training, and cannot easily be trained in the absence of labelling. This is a trade-off for the very fast generation and visually much more appealing face animation.

In addition, we have shown that traditional metrics do not agree with human perception. The error measures do not necessarily favour our method, but the human preference for it is very significant. We believe this deserves deeper analysis. In future work, we plan to investigate different error measures, especially on the more directly interpretable video frames rather than on the extracted features. We also intend to experiment with a covariance matrix per state and an unrestricted matrix structure for the transition matrix A_{π^s_t}.

References

[1] David Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7:2515–2540, 2006.

[2] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video.
In Proceedings of ACM SIGGRAPH, Annual Conference Series, 2003.

[3] M. Brand. Voice puppetry. In SIGGRAPH ’99: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 21–28, New York, NY, USA, 1999. ACM Press/Addison-Wesley Publishing Co.

[4] C. Bregler, H. Hild, and S. Manke. Improving letter recognition by lipreading. In Proceedings of ICASSP, 1993.

[5] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models: their training and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.

[6] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.

[7] P. Duchnowski, U. Meier, and A. Waibel. See me, hear me: Integrating automatic speech recognition and lipreading. In Proceedings of ICSLP 94, 1994.

[8] G. Edwards, C. Taylor, and T. Cootes. Interpreting face images using active appearance models, 1998.

[9] T. F. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In SIGGRAPH ’02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 388–398, New York, NY, USA, 2002. ACM Press.

[10] H. McGurk and J. MacDonald. Hearing lips and seeing voices. Nature, pages 746–748, December 1976.

[11] Alan B. Poritz. Linear predictive hidden Markov models and the speech signal. Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 7:1291–1294, May 1982.

[12] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pages 267–296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.

[13] B. Theobald, G. Cawley, I. Matthews, J. Glauert, and J. Bangham. 2.5D visual speech synthesis using appearance models.
In Proceedings of the British Machine Vision Conference, 2003.

[14] M. A. Turk and A. P. Pentland. Face recognition using eigenfaces. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 586–591, 1991.

[15] Mike West and Jeff Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1999.

[16] S. Young. The HTK hidden Markov model toolkit: Design and philosophy, 1993.

A Parameter estimation in DPDS

The log-likelihood of a sequence is given by eq. (1), which is a multiplicative function of A (x^s_t = f(A_{π^s_2} · · · A_{π^s_t}), etc.). Applying the chain rule repeatedly gives us, for diagonal matrices and using L_t to denote the log-likelihood of a single observation at time t, that ∂L_1/∂A_n = 0 and ∂L_t/∂A_n = R_{π^s_t}^{−1} (y^s_t − x^s_t) (∂x^s_t/∂A_n) for 2 ≤ t ≤ T, where

∂x^s_t/∂A_n = x^s_{t−1} δ_{nπ^s_t} + A_{π^s_t} ∂x^s_{t−1}/∂A_n,   and   δ_{nπ^s_t} = 1 iff n = π^s_t    (2)

Here we give the gradients for diagonal matrices, for simplicity of notation and because we used diagonal matrices in this work, but the same principle applies to full matrices. The gradient of the likelihood is then ∂L/∂A_n = Σ_{s=1}^{S} Σ_{t=2}^{T_s} ∂L_{s,t}/∂A_n. In general the same is done for the other parameters of the model; however, when the covariance is shared by all states, the values of the other parameters can be maximised exactly as described below. In the following, superscripts differentiate between variables by indicating which parameter the variable is a coefficient of. The covariance is R = Σ_{s=1}^{S} Σ_{t=2}^{T_s} (y^s_t − x^s_t)(y^s_t − x^s_t)^⊤ / Σ_{s=1}^{S} (T_s − 1), where x^s_1 = μ_{π^s_1} and x^s_t = A_{π^s_t} x^s_{t−1} + ν_{π^s_t}, while μ and ν are found by solving the system of linear equations (3), for which the coefficients D and b are computed by Algorithm 1, which takes {π}, {y} and the current values of A_{1...Π} as input:

[ diag_{Π×Π}(D^{μμ}_n)   D^{μν}_{Π×Π} ] [ μ_{Π×1} ]   [ b^μ_{Π×1} ]
[ D^{νμ}_{Π×Π}           D^{νν}_{Π×Π} ] [ ν_{Π×1} ] = [ b^ν_{Π×1} ]    (3)

where X_{Π×Π} denotes the block matrix with blocks X_{n,m} for n, m ∈ [1 . . . Π].

Algorithm 1 Maximisation of L with respect to μ and ν

for n ∈ {1 . . . Π} do                          ▷ Compute coefficients D^{μμ}_n, D^{μν}_n, b^μ_n of μ_n
    b^μ_n ← 0, D^{μμ}_n ← 0, b^ν_n ← 0
    ∀m ∈ {1 . . . Π}: D^{μν}_{n,m}, D^{νμ}_{n,m}, D^{νν}_{n,m} ← 0
    for s ∈ {s | π^s_1 = n} do                  ▷ C^μ_m and D^μ below are temporary variables
        D^{μμ}_n ← D^{μμ}_n + I, D^μ ← I, b^μ_n ← b^μ_n + y^s_1
        ∀m ∈ {1 . . . Π}: C^μ_m ← 0
        for t ∈ {2 . . . T_s} do
            D^μ ← A_{π^s_t} D^μ, D^{μμ}_n ← D^{μμ}_n + D^μ D^μ, b^μ_n ← b^μ_n + D^μ y^s_t
            ∀m ∈ {1 . . . Π}: C^μ_m ← A_{π^s_t} C^μ_m, D^{μν}_{n,m} ← D^{μν}_{n,m} + D^μ C^μ_m
            C^μ_{π^s_t} ← C^μ_{π^s_t} + I
        end for
    end for
    for s ∈ {1 . . . S} do                      ▷ Compute coefficients D^{νμ}_n, D^{νν}_n, b^ν_n of ν_n
        ∀m ∈ {1 . . . Π}: C^ν_m ← 0             ▷ C^ν_m, D^ν, C^μ are temporary variables
        D^ν ← 0, C^μ ← I
        for t ∈ {2 . . . T_s} do
            D^ν ← A_{π^s_t} D^ν
            if π^s_t = n then
                D^ν ← D^ν + I
            end if
            ∀m ∈ {1 . . . Π}: C^ν_m ← A_{π^s_t} C^ν_m, D^{νν}_{n,m} ← D^{νν}_{n,m} + D^ν C^ν_m
            C^ν_{π^s_t} ← C^ν_{π^s_t} + I, C^μ ← A_{π^s_t} C^μ, D^{νμ}_{π^s_1} ← D^{νμ}_{π^s_1} + D^ν C^μ, b^ν_n ← b^ν_n + D^ν y^s_t
        end for
    end for
end for", "award": [], "sourceid": 442, "authors": [{"given_name": "Gwenn", "family_name": "Englebienne", "institution": null}, {"given_name": "Tim", "family_name": "Cootes", "institution": null}, {"given_name": "Magnus", "family_name": "Rattray", "institution": null}]}