Part of Advances in Neural Information Processing Systems 16 (NIPS 2003)
Kannan Achan, Sam Roweis, Brendan J. Frey
Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers.
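The phase problem described above can be illustrated with a small sketch. It is not the method of the paper: it only shows, on a single toy frame, that a magnitude spectrum alone does not determine the time-domain signal, and how a signal-to-noise ratio can quantify the mismatch. The helper names (`dft`, `idft`, `snr_db`), the 16-sample sinusoidal frame, and the SNR definition (reference energy over error energy, in dB) are all assumptions made for illustration; the paper may define its SNR differently.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns only the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def snr_db(reference, estimate):
    """SNR in dB, assuming the common definition: 10*log10(signal energy / error energy)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return float("inf") if noise == 0 else 10 * math.log10(signal / noise)


# Toy frame (hypothetical data): a sinusoid with a non-zero phase offset.
n = 16
frame = [math.sin(2 * math.pi * 2 * t / n + 0.7) for t in range(n)]

spectrum = dft(frame)
magnitude = [abs(c) for c in spectrum]  # what a spectrogram column retains

# Inverting with the true phase recovers the frame up to rounding error.
with_phase = idft(spectrum)

# Inverting the magnitude with all phases set to zero yields a different signal
# that nevertheless has the same magnitude spectrum.
zero_phase = idft([complex(m, 0.0) for m in magnitude])

snr_with_phase = snr_db(frame, with_phase)
snr_zero_phase = snr_db(frame, zero_phase)
print("SNR with true phase (dB):", snr_with_phase)
print("SNR with zero phase (dB):", snr_zero_phase)
```

Both reconstructions are consistent with the same magnitude spectrum, yet only the one that uses the correct phase matches the original frame. This is the ambiguity that any spectrogram-inversion method has to resolve.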