{"title": "Audio-Visual Sound Separation Via Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1173, "page_last": 1180, "abstract": null, "full_text": "Audio-Visual Sound Separation Via Hidden Markov Models \n\nJohn Hershey \nDepartment of Cognitive Science \nUniversity of California San Diego \njhershey@cogsci.ucsd.edu \n\nMichael Casey \nMitsubishi Electric Research Labs \nCambridge, Massachusetts \ncasey@merl.com \n\nAbstract \n\nIt is well known that under noisy conditions we can hear speech \nmuch more clearly when we read the speaker's lips. This suggests \nthe utility of audio-visual information for the task of speech \nenhancement. We propose a method to exploit audio-visual cues \nto enable speech separation under non-stationary noise and with \na single microphone. We revise and extend HMM-based speech \nenhancement techniques, in which signal and noise models are factorially \ncombined, to incorporate visual lip information and employ \nnovel signal HMMs in which the dynamics of narrow-band \nand wide-band components are factorial. We avoid the combinatorial \nexplosion in the factorial model by using a simple approximate \ninference technique to quickly estimate the clean signals in \na mixture. We present a preliminary evaluation of this approach \nusing a small-vocabulary audio-visual database, showing promising \nimprovements in machine intelligibility for speech enhanced using \naudio and visual information. \n\n1 Introduction \n\nWe often take for granted the ease with which we can carry on a conversation in the \nproverbial cocktail party scenario: guests chatter, glasses clink, music plays in the \nbackground; the room is filled with ambient sound.  
The vibrations from different \nsources and their reverberations coalesce translucently, yielding a single time series at \neach ear, in which sounds largely overlap even in the frequency domain. Remarkably, \nthe human auditory system delivers high-quality impressions of sounds in conditions \nthat perplex our best computational systems. A variety of strategies appear to be at \nwork in this, including binaural spatial analysis and inference using prior knowledge \nof likely signals and their contexts. In speech perception, vision often plays a crucial \nrole, because we can follow in the lips and face the very mechanisms that modulate \nthe sound, even when the sound is obscured by acoustic noise. \n\nIt has been demonstrated that the addition of visual cues can enhance speech recognition \nas much as removing 15 dB of noise [1]. Vision provides speech cues that are \ncomplementary to audio cues, such as components of consonants and vowels that \nare likely to be obscured by acoustic noise [2]. Visual information is demonstrably \nbeneficial to HMM-based automatic speech recognition (ASR) systems, which \ntypically suffer tremendously under moderate acoustical noise [3]. \n\nWe introduce a method of audio-visual speech enhancement using factorial hidden \nMarkov models (fHMMs). We focus on speech enhancement rather than speech \nrecognition for two reasons: first, speech conveys useful paralinguistic information, \nsuch as prosody, emotion, and speaker identity, and second, speech contains useful \ncues for separation from noise, such as pitch. In ASR \nsystems, these cues are typically discarded in an effort to reduce irrelevant variance \namong speakers and utterances within a phonetic class. 
\n\nWhereas the benefit of vision to speech recognition is well known, we may well \nwonder if visual input offers similar benefits to speech enhancement. In [4] a non-parametric \ndensity estimator was used to adapt audio and video transforms to \nmaximize the mutual information between the face of a target speaker and an audio \nmixture containing both the target voice and a distracter voice. These transforms \nwere then used to construct a stationary filter for separating the target voice from \nthe mixture without any prior knowledge or training. In [5] a multi-layer perceptron \nis trained to map noisy estimates of formants to clean ones, employing lip parameters \n(width, height and area of the lip opening) extracted from video as additional input. \nThe re-estimated formant contours were used to filter the speech to enhance the \nsignal. In both cases video information improved signal separation. Neither system, \nhowever, made use of the dynamics of speech. \n\nIn speech recognition, HMMs are commonly used because of the advantages of \nmodeling signal dynamics. This suggests the following strategy: train an audio-visual \nHMM on clean speech, infer the likelihoods of its state sequences, and use \nthe inferred state probabilities of the signal and noise to estimate a sequence of filters \nto clean the data. In cases where background noise also has regularity, such as the \ncombination of two voices, another HMM can be used to model the background \nnoise. Ephraim [6] first proposed an approach to factorially combining two HMMs \nin such an enhancement system. In [7] an efficient variational learning rule for the \nfactorial HMM is formulated, and in [8, 9] fHMM speech enhancement was recently \nrevived using some clever tricks to allow more complex models. 
\n\nThe fHMM approach is amenable to audio-visual speech enhancement in many \ndifferent forms. In the simplest formulation, which we pursue here, the signal observation \nmodel includes visual features. These visual inputs constrain the signal \nHMM and produce more accurate filters. Below we present a prototype architecture \nfor such a system along with preliminary results. 1 \n\n1 We defer a detailed mathematical development to subsequent publications. Contact \njhershey@cogsci.ucsd.edu for further information. \n\n1.1 Factorial Speech Models \n\nOne of the challenges of using speech HMMs for enhancement is to model speech \nin sufficient detail. Typically, speech models, following the practice in ASR, ignore \nnarrow-band spectral details (corresponding to upper cepstral components), which \ncarry pitch information, because they tend to vary across speakers and utterances \nfor the same word or phoneme. Instead such systems focus on the smooth, or wide-band, \nspectral characteristics (corresponding to lower cepstral components) such as \nare produced by the articulation of the mouth. Such wide-band spectral patterns \nloosely represent formant patterns, a well-known cue for vowel discrimination. In \ncases where the pitch or other narrow-band properties of the background signals \ndiffer from the foreground speech, and have predictable dynamics, such as with \ntwo simultaneous speech signals, these components may be helpful in separating \nthe two signals. Figure 1 illustrates the analysis of two words into wide-band and \nnarrow-band components. \n\n[Figure 1: panels for the words \"one\" and \"two\"; rows show full band, narrow band, and wide band] \n\nFigure 1: full-band, narrow-band, and wide-band log spectrograms of two words. 
\nThe wide-band log spectrograms (bottom) are derived by low-pass filtering the \nlog spectra (across the frequency domain), and the narrow-band log spectrograms \n(middle) are derived by high-pass filtering the log spectra. The full log spectrogram \n(top) is the sum of the two. \n\nHowever, the wide-band and narrow-band variations in speech are only loosely coupled. \nFor instance, a given formant is likely to be uttered with many different \npitches and a given pitch may be used to utter any formant. Thus a model of the \nfull spectrum of speech would have to have enough states to represent every combination \nof pitches and formants. Such a model requires a large amount of training \ndata and imposes serious computational burdens. For instance, in [8] a model with \n8000 states is employed. When combined with a similarly complex noise model, the \ncomposite model has 64 million states. This is expensive in terms of computation \nas well as the number of data points required for inference. \n\nTo parsimoniously model the complexity of speech, we employ a factorial HMM for \na single speech signal, in which wide and narrow-band components are represented \nin sub-models with independent dynamics. We therefore train the two sub-models \nindependently using Gaussian observation probability density functions (p.d.f.s) on \nthe wide-band or narrow-band log spectra, with diagonal covariances for the sake of \nsimplicity. Figure 2(a) depicts the graphical model for a single wide or narrow-band \ncomponent. 
\n\n[Figure 2: (a) simple HMM, with discrete states and continuous observations; (b) factorial speech HMM, with a narrow-band state, a wide-band state, and combined observations] \n\nFigure 2: single HMMs are trained separately on wide-band and narrow-band speech \nsignals (a) and then combined factorially in (b) by adding the means and variances \nof their observation distributions. \n\nTo combine the sub-models, we have to specify the observation p.d.f. for a combination \nof a wide and a narrow-band state, over the log-spectrum of speech prior to \nliftering. Because the observation densities of each component are Gaussian, and \nthe log-spectra of the wide and narrow-band components add in the log spectrum, \nthe composite state has a Gaussian observation p.d.f., whose mean and variance are \nthe sums of the component observation means and variances. Although the states \nof the two sub-models are marginally independent, they are typically conditionally \ndependent given the observation sequence. In other words, we assume that the state \ndependencies between the sub-models for a given speech signal can be explained \nentirely via the observations. Figure 2(b) depicts the combination of the wide and \nnarrow-band models, where the observation p.d.f.s are a function of two state variables. \n\nWhen combining the signal and noise models (or two different speech models), in \ncontrast, the signals add in the frequency domain, and hence in the log spectral \ndomain they no longer simply add. In the spectral domain the amplitudes of the two \nsignals have log-normal distributions, and the relative phases are unknown. There \nis no closed form distribution for the sum of two random variables with log-normal \namplitudes and a uniformly distributed phase difference.  
Disregarding phase differences, \nwe apply a well-known approximation to the sum of two lognormal random \nvariables, in which we match the mean and variance of a lognormal random variable \nto the sum of the means and variances of the two component lognormal random \nvariables [10]. Phase uncertainty can also be incorporated into an approximation; \nhowever, in practice the costs appear to outweigh the benefits.2 Figure 3(a) depicts \nthe combination of two factorial speech models, where the observation p.d.f.s are a \nfunction of two state variables. \n\n[Figure 3: (a) dual factorial HMM; (b) speech fHMM with video, showing video observations and audio observations] \n\nFigure 3: combining two speech fHMMs (a) and adding video observations to a \nspeech fHMM (b). \n\n2 The uncertainty of the phase differences can be incorporated by modeling the sum as \na mixture of lognormals that uniformly samples phase differences. Each mixture element \nis approximated by taking as its mean the length of the sum of the mean amplitudes when \nadded in the complex plane according to a particular phase difference, and as its variance \nthe sum of the two variances. This estimation is facilitated by the assumption of diagonal \ncovariances in the log spectral domain. \n\nUsing the log-normal observation distribution of the composite model, we can estimate \nthe likelihood of the speech and noise states for each frame using the well-known \nforward-backward recursion. For each frame of the test data we can compute \nthe expected value of the amplitude of each model in each frequency bin. Taking \nthe expected value of the signal in the numerator and the expected value of the \nsignal plus noise in the denominator yields a Wiener filter, which is applied to the \noriginal noisy signal, enhancing the desired component.  
When we have two speech \nsignals, one person's noise is another's signal and we can separate both by the same \nmethod. \n\n2 Incorporating vision \n\nWe incorporate vision after training the audio models in order to test the improvement \nyielded by visual input while holding the audio model constant. A video observation \ndistribution is added to each state in the model by obtaining the probability \nof each state in each frame of the audio training data using the forward-backward \nprocedure, then estimating the parameters of the video observation distributions \nfor each state, in the manner of the Baum-Welch observation re-estimation formula. \nThis procedure is iterated until it converges. In this way we construct a system in \nwhich the visual observations are modular. Figure 3(b) depicts the structure of the \nresulting speech model. \n\nSuch a method, in which audio and visual features are integrated early in processing, \nis only one of several approaches. We envision other late integration approaches \nin which audio and visual dynamics are more loosely coupled. What method of \naudio-visual integration may be best for this task is an open question. \n\n3 Efficient inference \n\nIn the models described above, in which we factorially combine two speech models, \neach of which is itself factorial, the complexity of inference in the composite model, \nusing the forward-backward recursion, can easily become unmanageable. If K is \nthe number of states in each subcomponent, then K^4 is the number of states in \nthe composite HMM. In our experiments K is on the order of 40 states, so there \nare 2,560,000 states in the composite model. Naively each composite state must be \nsearched when computing the probabilities of state sequences necessary for inference.  
Interesting approximation schemes for similar models are developed in [8, 9]. \nWe develop an approximation as follows. \n\nRather than computing the forward-backward procedure on the composite HMM, \nwe compute it sequentially on each sub-HMM to derive the probability of each state \nin each frame. Of course, in order to evaluate the observation probabilities of the \ncurrent sub-HMM for a given frame, we need to consider the state probabilities of \nthe other three sub-HMMs, because their means and variances are combined in the \nobservation model. These state probabilities and their associated observation probabilities \ncomprise a mixture model for a given frame. The composite mixture model \nstill has K^4 states, so to defray this complexity during forward-backward analysis \nof the current sub-HMM, for each frame we approximate the observation mixtures \nof each of the other three sub-HMMs with a single Gaussian, whose mean and variance \nmatch those of the mixture. Thus we only have to consider the K states of \nthe current model, and use the summarized means and variances of the other three \nHMMs as auxiliary inputs to the observation model. We initialize the state probabilities \nin each frame with the equilibrium distribution for each sub-HMM. In our \nexperiments, after a handful of iterations, the composite state probabilities tend to \nconverge. This method is closely related to a structured variational approximation \nfor factorial HMMs [7] and can also be seen as an approximate belief propagation \nor sum-product algorithm [11]. \n\n4 Data \n\nWe used a small-vocabulary audio-visual speech database developed by Fu Jie \nHuang at Carnegie Mellon University3 [12].  
These data consist of audio and video \nrecordings of 10 subjects (7 males and 3 females) saying 78 isolated words commonly \nused for numbers and time, such as \"one\", \"Monday\", \"February\", \"night\", \netc. The sequence of 78 words is repeated in 10 different takes. Half of these takes \nwere used for training, and one of the remaining takes was used as the test set. \n\nThe data set included outer lip parameters extracted from video using an automatic \nlip tracker, including the height of the upper and lower lips relative to the corners and the \nwidth from corner to corner. We interpolated these lip parameters to match the \naudio frame rate, and calculated time derivatives. \n\nAudio consisted of 16-bit, 44.1 kHz recordings, which we resampled to 8 kHz. \nThe audio was framed at 60 frames per second, with an overlap of 50%, yielding \n264 samples per frame. 4 The frames were analyzed into cepstra: the wide-band log \nspectrum is derived from the lower 20 cepstral components and the narrow-band log \nspectrum from the upper cepstra. \n\n5 Results \n\nSpeaker-dependent wide and narrow-band HMMs having 40 states each were trained \non data from two subjects (\"Anne\" and \"Chris\") selected from the training set. A \nPCA basis was used to reduce the log spectrograms to a more manageable size of \n30 dimensions during training. This resulted in some non-zero covariances near \nthe diagonal in the learned observation covariance matrices, which we discarded. \nAn entropic prior and parameter extinction were used to sparsify the transition \nmatrices during training [13]. \n\nThe narrow-band model learned states that represented different pitches and had \ntransition probabilities that were non-zero mainly between neighboring pitches.  
The \nnarrow-band model's video observation probability distributions were largely overlapping, \nreflecting the fact that video tells us little about pitch. The wide-band \nmodel learned states that represented different formant structures. The video observation \ndistributions for several states in the wide-band model were clearly separated, \nreflecting the information that video provides about the formant structure. \n\nSubjectively the enhanced signals sound well separated from each other for the \nmost part. Figure 4(a) (bottom) shows the estimated spectrograms for a mixture \nof two different words spoken by the same speaker - an extremely difficult task. \nTo quantify these results we evaluate the system using a speech recognizer, on the \nslightly easier task of separating the speech of the two different speakers, whose \nvoices were in different but overlapping pitch ranges. \n\nA test set was generated by mixing together 39 randomly chosen pairs of words, one \nfrom each subject, such that no word was used twice. Each word pair was mixed \nat five different signal to noise ratios (SNRs), with the SNR provided to the system \nat test time. 5 The total number of test mixtures for each subject was thus 195. \nThe separated test sounds were estimated by the system under two conditions: with \nand without the use of video information. \n\n3 See http://amp.ece.cmu.edu/projects/AudioVisualSpeechProcessing/ \n4 Sine windows were used in analysis and synthesis such that their product forms windows \nthat sum to unity when overlapped 50%. The windowed frames were analyzed using \na 264-point fast Fourier transform (FFT). The phases of the resulting spectra were discarded. \n5 Estimation of the SNR is necessary in practice; however this subject has been treated 
\n\nWe evaluated the estimates on the test set using a speech recognition system developed \nby Bhiksha Raj, using the CMU Sphinx ASR engine. 6 Existing speech \nHMMs trained on 60 hours of broadcast news data were used for recognition. 7 The \nmodels were adapted in an unsupervised manner to clean speech from each speaker, \nby learning a single affine transformation of all the state means, using a maximum \nlikelihood linear regression procedure [14]. The recognizer adapted to each speaker \nwas tested with the enhanced speech produced by the speech model for that speaker, \nas well as with no enhancement. \n\nResults are shown in Figure 4(b). Recognition was greatly facilitated by the enhancement, \nwith additional gains resulting from the use of video. It is somewhat \nsurprising that the gains for video occur mostly in areas of higher SNR, whereas in \nhuman speech perception they occur under lower SNR. Little subjective difference \nwas noted with the use of video in the case of two speakers. However, in other \nexperiments, when both voices came from the same speaker, the video was crucial \nin disambiguating which signal came from which voice. \n\n[Figure 4: (a) signal separation spectrograms for the words \"one\" and \"two\", with rows for originals, mixture, and separated signals; (b) automatic speech recognition performance plotted against SNR (dB)] \n\nFigure 4: spectrograms of separated speech signals for a mixture of two words spoken \nby the same speaker (a), and speech recognition performance for 39 mixtures of two \nwords spoken by different speakers (b). \n\n6 Discussion \n\nWe have presented promising techniques for audio-visual speech enhancement. We \nintroduced a factorial HMM to track both formant and pitch information, as well \nas video, in a unified probabilistic model, and demonstrated its effectiveness in \nspeech enhancement.  
We are not aware of any other HMM-based audio-visual \nspeech enhancement systems in the literature. The results are tentative given the \nsmall sample of voices used; however, they suggest that further study with a larger \nsample of voices is warranted. It would be useful to compare the performance of \na factorial speech model to that of each factor in isolation, as well as to a full-spectrum \nmodel. Measures of quality and intelligibility by human listeners in terms \nof speech and emotion recognition, as well as speaker identity, will also be helpful \nin further demonstrating the utility of these techniques. We look forward to further \ndevelopment of these techniques in future research. \n\n5 (continued) elsewhere [6] and is beyond the scope of this paper. \n6 See http://www.speech.cs.cmu.edu/sphinx/. \n7 These models represented every combination of three phones (triphones) using 6000 \nstates tied across triphone models, with a 16-element Gaussian mixture observation model \nfor each state. The data were processed at 8 kHz in 25 ms windows overlapped by 15 ms, \nwith a frame rate of 100 frames per second, and analyzed into 31 Mel frequency components \nfrom which 13 cepstral coefficients were derived. These coefficients, with the mean vector \nremoved and supplemented with their time differences, comprised the observed features. \n\nAcknowledgments \n\nWe wish to thank Mitsubishi Electric Research Labs for hosting this research. Special \nthanks to Bhiksha Raj for devising and producing the evaluation using speech \nrecognition, and to Matt Brand for his entropic HMM toolkit. \n\nReferences \n\n[1] W. H. Sumby and I. Pollack. Visual contribution to speech intelligibility in noise. \nJournal of the Acoustical Society of America, 26:212-215, 1954. \n\n[2] Jordi Robert-Ribes, Jean-Luc Schwartz, Tahar Lallouache, and Pierre Escudier.  
Complementarity and synergy in bimodal speech. Journal of the Acoustical Society of \nAmerica, 103(6):3677-3689, 1998. \n\n[3] Stephane Dupont and Juergen Luettin. Audio-visual speech modeling for continuous \nspeech recognition. IEEE Transactions on Multimedia, 2(3):141-151, 2000. \n\n[4] John W. Fisher, Trevor Darrell, William T. Freeman, and Paul Viola. Learning joint \nstatistical models for audio-visual fusion and segregation. In Advances in Neural \nInformation Processing Systems 13. 2001. \n\n[5] Laurent Girin, Jean-Luc Schwartz, and Gang Feng. Audio-visual enhancement of \nspeech in noise. Journal of the Acoustical Society of America, 109(6):3007-3019, \n2001. \n\n[6] Yariv Ephraim. Statistical-model based speech enhancement systems. Proceedings of \nthe IEEE, 80(10):1526-1554, 1992. \n\n[7] Z. Ghahramani and M. Jordan. Factorial hidden Markov models. In David S. Touretzky, \nMichael C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information \nProcessing Systems 8, 1996. \n\n[8] Sam T. Roweis. One microphone source separation. In Advances in Neural Information \nProcessing Systems 13. 2001. \n\n[9] Hagai Attias, John C. Platt, Alex Acero, and Li Deng. Speech denoising and dereverberation \nusing probabilistic models. In Advances in Neural Information Processing \nSystems 13. 2001. \n\n[10] M. J. F. Gales. Model-Based Techniques for Noise Robust Speech Recognition. PhD \nthesis, Cambridge University, 1996. \n\n[11] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product \nalgorithm. IEEE Trans. Inform. Theory, 47(2):498-519, 2001. \n\n[12] F. J. Huang and T. Chen. Real-time lip-synch face animation driven by human voice. \nIn IEEE Workshop on Multimedia Signal Processing, Los Angeles, California, Dec \n1998. 
\n\n[13] Matt Brand. Structure learning in conditional probability models via an entropic \nprior and parameter extinction. Neural Computation, 11(5):1155-1182, 1999. \n\n[14] C. J. Leggetter and P. C. Woodland. Maximum likelihood linear regression for speaker \nadaptation of the parameters of continuous density hidden Markov models. Computer \nSpeech and Language, 9:171-185, 1995. \n\n", "award": [], "sourceid": 2005, "authors": [{"given_name": "John", "family_name": "Hershey", "institution": null}, {"given_name": "Michael", "family_name": "Casey", "institution": null}]}