Part of Advances in Neural Information Processing Systems 14 (NIPS 2001)
K. Yao, S. Nakamura
We present a sequential Monte Carlo method applied to additive noise compensation for robust speech recognition in time-varying noise. The method generates a set of samples according to the prior distribution given by clean speech models and a noise prior evolved from the previous estimation. An explicit model representing noise effects on speech features is used, so that an extended Kalman filter is constructed for each sample, generating the updated continuous state estimate as the estimate of the noise parameter, and the prediction likelihood for weighting each sample. Minimum mean square error (MMSE) inference of the time-varying noise parameter is carried out over these samples by fusing the estimates of the samples according to their weights. A residual resampling selection step and a Metropolis-Hastings smoothing step are used to improve calculation efficiency. Experiments were conducted on speech recognition in simulated non-stationary noises, where noise power changed artificially, and in highly non-stationary Machinegun noise. In all the experiments carried out, we observed that the method yields a significant improvement in recognition performance over noise compensation that assumes stationary noise.
1 Introduction
Speech recognition in noise is considered essential for real-world applications, and there have been active research efforts in this area. Among the many approaches, the model-based approach assumes explicit models representing noise effects on speech features. Within this approach, most research has focused on stationary or slowly varying noise conditions. In this situation, environment noise parameters are often estimated before speech recognition from a small set of environment adaptation data. The estimated environment noise parameters are then used to compensate noise effects in the feature or model space for recognition of noisy speech. However, it is well known that noise statistics may vary during recognition. In that case, the noise parameters estimated before recognition of the utterances may not be relevant to subsequent frames of input speech if the environment changes.
A number of techniques have been proposed to compensate time-varying noise effects. They can be categorized into two approaches. In the first approach, time-varying environment sources are modeled by Hidden Markov Models (HMMs) or Gaussian mixtures trained on prior measurements of environments, so that noise compensation becomes a task of identifying the underlying state sequences of the noise HMMs, e.g., by maximum a posteriori (MAP) decision. This approach requires a model representing different environment conditions (signal-to-noise ratio, types of noise, etc.), so that the statistics at some states or mixtures obtained before speech recognition are close to the real testing environments. In the second approach, environment model parameters are assumed to be time-varying, so the task is not only an inference problem but also involves estimating environment statistics during speech recognition. The parameters can be estimated by maximum likelihood estimation, e.g., the sequential EM algorithm. They can also be estimated by Bayesian methods. In the Bayesian methods, all relevant information on the set of environment parameters and speech parameters, denoted as $\Theta(t)$ at frame $t$, is included in the posterior distribution given the observation sequence $Y(0:t)$, i.e., $p(\Theta(t) \mid Y(0:t))$. Except for a few cases, including the linear Gaussian state space model (Kalman filter), it is formidable to evaluate the distribution update analytically, so approximation techniques are required. For example, a Laplace transform has been used to approximate the joint distribution of speech and noise parameters by vector Taylor series. The approximated joint distribution gives an analytical formula for updating the posterior distribution.
We report an alternative approach for Bayesian estimation and compensation of noise effects on speech features. The method is based on the sequential Monte Carlo method.
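To make concrete the idea of approximating an intractable posterior over time-varying parameters with weighted samples, here is a minimal sketch of a generic bootstrap particle filter with residual resampling. It is an illustration only, not the authors' scheme: the scalar random-walk state model, the Gaussian observation model, the values of `q` and `r`, and the function names `residual_resample` and `smc_step` are all assumptions made for this toy example.

```python
import numpy as np

def residual_resample(weights, rng):
    """Return particle indices chosen by residual resampling."""
    n = len(weights)
    counts = np.floor(n * weights).astype(int)       # deterministic copies
    n_left = n - counts.sum()                         # slots still to fill
    if n_left > 0:
        residual = n * weights - counts               # leftover fractional mass
        residual /= residual.sum()
        counts += rng.multinomial(n_left, residual)   # random remainder
    return np.repeat(np.arange(n), counts)

def smc_step(particles, y, rng, q=0.1, r=0.5):
    """One frame: propagate, weight, form the MMSE estimate, resample."""
    # Assumed toy prior: the state follows a random walk with variance q.
    particles = particles + rng.normal(0.0, np.sqrt(q), size=particles.shape)
    # Assumed toy observation model: y = state + Gaussian noise of variance r.
    log_w = -0.5 * (y - particles) ** 2 / r
    w = np.exp(log_w - log_w.max())                   # stabilised weights
    w /= w.sum()
    estimate = np.sum(w * particles)                  # weighted posterior mean
    particles = particles[residual_resample(w, rng)]
    return particles, estimate

rng = np.random.default_rng(0)
particles = rng.normal(0.0, 1.0, size=500)
true_state = 0.0
estimates = []
for t in range(50):
    true_state += 0.1                                 # slowly drifting level
    y = true_state + rng.normal(0.0, np.sqrt(0.5))
    particles, est = smc_step(particles, y, rng)
    estimates.append(est)
```

Residual resampling first keeps `floor(n * w_i)` copies of each sample deterministically and draws only the remaining slots at random, which typically gives lower resampling variance than drawing every slot from a multinomial distribution.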
In the method, a set of samples is generated hierarchically from the prior distribution given by speech models. A state space model representing noise effects on speech features is used explicitly, and an extended Kalman filter (EKF) is constructed in each sample. The prediction likelihood of the EKF in each sample gives its weight for selection, smoothing, and inference of the time-varying noise parameter, after which noise compensation is carried out. Since noise parameter estimation, noise compensation, and speech recognition are carried out frame by frame, we refer to this approach as sequential noise compensation.
2 Speech and noise model
Our work is on speech features derived from Mel Frequency Cepstral Coefficients (MFCC). These features are generated by transforming signal power into the log-spectral domain and then applying the discrete cosine transform (DCT) to reach the cepstral domain. The following derivation of the algorithm is in the log-spectral domain. Let $t$ denote the frame (time) index. In our work, speech and noise are modeled by HMMs and a Gaussian mixture, respectively. For speech recognition in stationary additive noise, the following formula has been shown to be effective in compensating noise effects. For Gaussian mixture $k_t$ at state $s_t$, the Log-Add method transforms the mean vector