Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch

Part of Advances in Neural Information Processing Systems 15 (NIPS 2002)

Bibtex Metadata Paper


Lawrence Saul, Daniel Lee, Charles Isbell, Yann Cun


We have implemented a real time front end for detecting voiced speech and estimating its fundamental frequency. The front end performs the signal processing for voice-driven agents that attend to the pitch contours of human speech and provide continuous audiovisual feedback. The al- gorithm we use for pitch tracking has several distinguishing features: it makes no use of FFTs or autocorrelation at the pitch period; it updates the pitch incrementally on a sample-by-sample basis; it avoids peak picking and does not require interpolation in time or frequency to obtain high res- olution estimates; and it works reliably over a four octave range, in real time, without the need for postprocessing to produce smooth contours. The algorithm is based on two simple ideas in neural computation: the introduction of a purposeful nonlinearity, and the error signal of a least squares fit. The pitch tracker is used in two real time multimedia applica- tions: a voice-to-MIDI player that synthesizes electronic music from vo- calized melodies, and an audiovisual Karaoke machine with multimodal feedback. Both applications run on a laptop and display the user’s pitch scrolling across the screen as he or she sings into the computer.