John Hershey, Michael Casey
It is well known that under noisy conditions we can hear speech much more clearly when we read the speaker's lips. This suggests the utility of audio-visual information for the task of speech enhancement. We propose a method to exploit audio-visual cues to enable speech separation under non-stationary noise and with a single microphone. We revise and extend HMM-based speech enhancement techniques, in which signal and noise models are factorially combined, to incorporate visual lip information and employ novel signal HMMs in which the dynamics of narrow-band and wide-band components are factorial. We avoid the combinatorial explosion in the factorial model by using a simple approximate inference technique to quickly estimate the clean signals in a mixture. We present a preliminary evaluation of this approach using a small-vocabulary audio-visual database, showing promising improvements in machine intelligibility for speech enhanced using audio and visual information.
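To make the combinatorial explosion concrete, the following minimal sketch builds the joint transition matrix of two independent Markov chains, one standing in for the signal and one for the noise. It illustrates only the generic factorial construction, not the approximate inference technique proposed here; the function name `joint_transition` and the toy matrices are assumptions introduced for illustration.

```python
def joint_transition(a_signal, a_noise):
    """Transition matrix over joint states (i, j) of two independent chains.

    Joint state (i, j) is stored at index i * len(a_noise) + j, so the
    result is the Kronecker product of the two per-chain matrices.
    """
    n, m = len(a_signal), len(a_noise)
    joint = [[0.0] * (n * m) for _ in range(n * m)]
    for i in range(n):              # current signal state
        for j in range(m):          # current noise state
            for k in range(n):      # next signal state
                for l in range(m):  # next noise state
                    joint[i * m + j][k * m + l] = a_signal[i][k] * a_noise[j][l]
    return joint


# Toy, made-up transition matrices (each row sums to 1); not from the paper.
A_SIGNAL = [[0.9, 0.1],
            [0.2, 0.8]]
A_NOISE = [[0.5, 0.3, 0.2],
           [0.1, 0.8, 0.1],
           [0.3, 0.3, 0.4]]

A_JOINT = joint_transition(A_SIGNAL, A_NOISE)

# 2 signal states x 3 noise states give 6 joint states and 36 transitions.
print(len(A_JOINT), len(A_JOINT) * len(A_JOINT[0]))
```

With N signal states and M noise states the joint chain has N·M states and (N·M)² transition entries, so exact inference over the combined model grows quickly as further factors are added. This growth is what motivates an approximate inference scheme.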