Lipreading by neural networks: Visual preprocessing, learning, and sensory integration

Wolff, Gregory; Prasad, K.; Stork, David; Hennecke, Marcus

Gregory J. Wolff, K. Venkatesh Prasad, David G. Stork, Marcus Hennecke

Advances in Neural Information Processing Systems 6 (NIPS 1993)

Abstract

We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automatic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not-visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time delay neural networks (video and acoustic) and integrated their responses by means of independent opinion pooling (the Bayesian optimal method given conditional independence, which seems to hold for our data). This hybrid system had an error rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.
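
As an illustration of independent opinion pooling, the sketch below combines two posterior vectors over the same utterance classes. If the acoustic and visual observations are conditionally independent given the class, the joint posterior is proportional to the product of the two posteriors divided by the class prior. The function name, the equal-prior default, and the example probabilities are assumptions made for illustration; they are not taken from the paper, and the networks themselves are not modeled here.

```python
import numpy as np


def pool_independent_opinions(p_acoustic, p_visual, prior=None):
    """Combine two class-posterior vectors under conditional independence.

    Computes P(c | A, V) proportional to P(c | A) * P(c | V) / P(c).
    If no prior is given, equal class priors are assumed (an assumption
    of this sketch, not a statement about the original system).
    """
    p_a = np.asarray(p_acoustic, dtype=float)
    p_v = np.asarray(p_visual, dtype=float)
    if prior is None:
        prior = np.full(p_a.shape, 1.0 / p_a.size)
    prior = np.asarray(prior, dtype=float)

    unnormalized = p_a * p_v / prior
    return unnormalized / unnormalized.sum()


# Hypothetical network outputs for five utterance classes.
p_acoustic = [0.40, 0.35, 0.10, 0.10, 0.05]
p_visual = [0.15, 0.55, 0.10, 0.10, 0.10]

pooled = pool_independent_opinions(p_acoustic, p_visual)
best_class = int(np.argmax(pooled))
print(pooled, best_class)
```

In this made-up example the acoustic posterior alone favors class 0, while the pooled posterior favors class 1, showing how the visual channel can overturn a weakly supported acoustic decision.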
