An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition

Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)

Samy Bengio


This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performance under various noise conditions.
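To make the idea of jointly decoding a state sequence and an alignment concrete, the following is a minimal sketch rather than the paper's implementation. It assumes, for illustration only, that each state emits one element of the longer sequence at every time step and, with a per-state probability `eps`, also emits the next element of the shorter sequence. The function name `async_viterbi`, its arguments, and the table layout are all hypothetical; the emission log-probabilities are supplied by the caller as precomputed tables.

```python
import math

NEG_INF = float("-inf")


def _log1m_exp(log_p):
    """Return log(1 - exp(log_p)), guarding the p == 1 edge case."""
    p = math.exp(log_p)
    return NEG_INF if p >= 1.0 else math.log1p(-p)


def async_viterbi(log_init, log_trans, log_eps, log_emit_x, log_emit_xy):
    """Hypothetical Viterbi decoder for two asynchronous sequences.

    log_init[i]          : log P(q_0 = i)
    log_trans[j][i]      : log P(q_t = i | q_{t-1} = j)
    log_eps[i]           : log prob. that state i also emits the next y (assumed)
    log_emit_x[t][i]     : log p(x_t | i)
    log_emit_xy[t][s][i] : log p(x_t, y_s | i)
    Returns (best log score, state path, alignment), where alignment[s]
    is the time step t at which y_s was emitted.
    """
    n = len(log_init)
    T = len(log_emit_x)
    S = len(log_emit_xy[0])
    log_not_eps = [_log1m_exp(le) for le in log_eps]

    # delta[t][s][i]: best log score ending in state i at time t
    # with s elements of the shorter sequence already emitted.
    delta = [[[NEG_INF] * n for _ in range(S + 1)] for _ in range(T)]
    back = [[[None] * n for _ in range(S + 1)] for _ in range(T)]

    for i in range(n):
        delta[0][0][i] = log_init[i] + log_not_eps[i] + log_emit_x[0][i]
        if S >= 1:
            delta[0][1][i] = log_init[i] + log_eps[i] + log_emit_xy[0][0][i]

    for t in range(1, T):
        for s in range(min(S, t + 1) + 1):
            for i in range(n):
                for emitted in (False, True):
                    if emitted and s == 0:
                        continue
                    if emitted:
                        prev_s = s - 1
                        local = log_eps[i] + log_emit_xy[t][s - 1][i]
                    else:
                        prev_s = s
                        local = log_not_eps[i] + log_emit_x[t][i]
                    for j in range(n):
                        score = delta[t - 1][prev_s][j] + log_trans[j][i] + local
                        if score > delta[t][s][i]:
                            delta[t][s][i] = score
                            back[t][s][i] = (j, emitted)

    last = max(range(n), key=lambda i: delta[T - 1][S][i])
    best = delta[T - 1][S][last]
    if best == NEG_INF:
        raise ValueError("no valid path emits the whole shorter sequence")

    # Backtrack to recover both the state path and the alignment.
    states = [0] * T
    alignment = [0] * S
    i, s = last, S
    for t in range(T - 1, 0, -1):
        states[t] = i
        j, emitted = back[t][s][i]
        if emitted:
            alignment[s - 1] = t
            s -= 1
        i = j
    states[0] = i
    if s == 1:
        alignment[0] = 0
    return best, states, alignment
```

The sketch works in the log domain to avoid underflow and keeps a three-dimensional table (time, number of elements of the shorter sequence emitted so far, state), so its cost grows with the product of the two sequence lengths and the square of the number of states.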