S. Parveen, P. Green
In the `missing data' approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectral-temporal regions which are dominated by the speech source. The remaining regions are considered to be `missing'. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks (RNNs). In contrast to methods based on Hidden Markov Models (HMMs), RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
1. Introduction
Automatic Speech Recognition (ASR) systems perform reasonably well in controlled and matched training and recognition conditions. However, performance deteriorates when there is a mismatch between training and testing conditions, caused for instance by additive noise (Lippmann, 1997). Conventional techniques for improving recognition robustness (reviewed by Furui, 1997) seek to eliminate or reduce the mismatch, for instance by enhancement of the noisy speech, by adapting statistical models for speech units to the noise condition, or simply by training in different noise conditions.
Missing data techniques provide an alternative solution for speech corrupted by additive noise, one which makes minimal assumptions about the nature of the noise. They are based on identifying uncorrupted, reliable regions in the frequency domain and adapting recognition algorithms so that classification is based on these regions. Present missing data techniques developed at Sheffield (Barker et al., 2000a, Barker et al., 2000b, Cooke et al., 2001) and elsewhere (Drygajlo et al., 1998, Raj et al., 2000) adapt the prevailing technique for ASR based on Continuous Density Hidden Markov Models (CDHMMs).
CDHMMs are generative models which do not give direct estimates of posterior probabilities of the classes given the acoustics. Neural Networks, unlike HMMs, are discriminative models which do give direct estimates of posterior probabilities and have been used with success in hybrid ANN/HMM speech recognition systems (Bourlard et al., 1998). In this paper, we adapt a recurrent neural network architecture introduced by Gingras & Bengio (1998) for robust ASR with missing data.
2.1 Missing data masks
Speech recognition with missing data is based on the assumption that some regions in time/frequency remain uncorrupted for speech with added noise. See Cooke et al. (2001) for arguments to support this assumption. Initial processes, based on local signal-to-noise estimates, on auditory grouping cues, or a combination (Barker et al., 2001), define a binary `missing data mask': ones in the mask indicate reliable (or `present') features and zeros indicate unreliable (or `missing') features.
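As an illustration of the first of these options, a mask based purely on a local signal-to-noise estimate might be computed as in the minimal sketch below. The function name `snr_mask`, the 0 dB threshold and the crude subtraction-based speech estimate are assumptions made for this example; they are not taken from the systems cited above.

```python
import math

def snr_mask(noisy_energy, noise_estimate, threshold_db=0.0):
    """Binary missing data mask: 1 = reliable ('present'), 0 = unreliable ('missing').

    noisy_energy and noise_estimate are lists of frames, each frame a list of
    spectral energies (one value per frequency channel).
    """
    mask = []
    for frame, noise in zip(noisy_energy, noise_estimate):
        row = []
        for y, n in zip(frame, noise):
            # Crude speech-energy estimate: noisy energy minus estimated noise,
            # floored to keep the logarithm defined (an assumption of this sketch).
            speech = max(y - n, 1e-12)
            local_snr_db = 10.0 * math.log10(speech / max(n, 1e-12))
            row.append(1 if local_snr_db > threshold_db else 0)
        mask.append(row)
    return mask

# Two frames, three spectral channels, with a flat noise estimate.
noisy = [[10.0, 1.2, 4.0], [0.9, 8.0, 2.1]]
noise = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
mask = snr_mask(noisy, noise)
```

Channels whose estimated speech energy exceeds the noise estimate are marked reliable; the rest are treated as missing by the classifier.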
2.2 Classification with missing data
Techniques for classification with incomplete data can be divided into imputation and marginalisation. Imputation is a technique in which missing features are replaced by estimated values to allow the recognition process to proceed in the normal way. If the missing values are replaced by zeros, random values or their means based on training data, the approach is called unconditional imputation. In conditional imputation, on the other hand, conditional statistics are used to estimate the missing values given the present values. In the marginalisation approach, missing values are ignored (by integrating over their possible ranges) and recognition is performed with the reduced data vector which is considered reliable. For the multivariate mixture Gaussian distributions used in CDHMMs, marginalisation and conditional imputation can be formulated analytically (Cooke et al., 2001). For missing data ASR, further improvements in both techniques follow from using the knowledge that for spectral energy features the unreliable data is bounded between zero and the energy in the speech+noise mixture (Vizinho et al., 1999), (Josifovski et al., 1999). These techniques are referred to as bounded marginalisation and bounded imputation. Coupled with a `softening' of the reliable/unreliable decision, missing data techniques produce good results on a standard connected-digits-in-noise recognition task: performance using models trained on clean data is comparable, and in severe noise superior, to conventional systems trained across different noise conditions (Barker et al., 2001).
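The difference between these options can be made concrete for a single diagonal-covariance Gaussian. The sketch below is illustrative only: the function names are hypothetical, a single Gaussian stands in for the mixtures used in CDHMMs, and bounded marginalisation is assumed to integrate each missing component over the interval from zero to the observed speech+noise energy.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gauss_cdf(x, mean, var):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def marginal_loglik(x, mask, means, variances, bounded=False):
    """Frame log-likelihood under a diagonal Gaussian with missing components.

    Reliable components (mask 1) contribute their log density. Missing
    components are integrated out: over the full range for plain
    marginalisation (contributing log 1 = 0), or over [0, observed value]
    when bounded=True.
    """
    total = 0.0
    for xi, mi, mu, var in zip(x, mask, means, variances):
        if mi:
            total += log_gauss(xi, mu, var)
        elif bounded:
            mass = gauss_cdf(xi, mu, var) - gauss_cdf(0.0, mu, var)
            total += math.log(max(mass, 1e-300))  # floor avoids log(0)
    return total

def impute_unconditional(x, mask, train_means):
    """Unconditional imputation: replace missing components by training means."""
    return [xi if mi else mu for xi, mi, mu in zip(x, mask, train_means)]

x = [1.0, 5.0]            # second channel is dominated by noise
mask = [1, 0]
means, variances = [1.0, 2.0], [1.0, 1.0]
plain = marginal_loglik(x, mask, means, variances)
bounded = marginal_loglik(x, mask, means, variances, bounded=True)
filled = impute_unconditional(x, mask, means)
```

With bounds, the missing channel still carries some evidence: a class whose mean lies far above the observed energy receives little probability mass in the interval and is penalised accordingly.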
2.3 Why recurrent neural nets for missing data robust ASR?
Several neural net architectures have been proposed to deal with the missing data problem in general (Ahmad & Tresp, 1993), (Ghahramani & Jordan, 1994). The problem in using neural networks with missing data is to compute the output of a node/unit when some of its input values are unavailable. For marginalisation, this involves finding a way of integrating over the range of the missing values. A robust ASR system to deal with missing data using neural networks has recently been proposed by Morris et al. (2000). This is basically a radial basis function neural network with each hidden unit associated with a diagonal-covariance Gaussian. The marginal over the missing values can be computed in this case, and hence the resulting system is equivalent to the HMM-based missing data speech recognition system using marginalisation. Reported performance is also comparable to that of the HMM-based speech recognition system.
In this paper missing data is dealt with by imputation. We use recurrent neural networks to estimate missing values in the input vector. RNNs have the potential to capture long-term contextual effects over time, and hence to use temporal context to compensate for missing data, which CDHMM-based missing data techniques do not do. The only contextual information available in CDHMM decoding comes from the addition of temporal derivatives to the feature vector. RNNs also allow a single net to perform both imputation and classification, with the potential of combining these processes to mutual benefit.
The RNN architecture proposed by Gingras & Bengio (1998) is based on a fully-connected feedforward network with input, hidden and output layers using hyperbolic tangent activation functions. The output layer has one unit for each class and the network is trained with the correct classification as target. Recurrent links are added to the feedforward net with unit delay from the output to the hidden units, as in Jordan networks (Jordan, 1988). There are also recurrent links with unit delay from hidden units to missing input units to impute missing features. In addition, there are self-delayed terms with a fixed weight for each unit, which basically serve to stabilise RNN behaviour over time and help in imputation as well. Gingras & Bengio used this RNN both for a pattern classification task with static data (one input vector for each example) and with sequential data (a sequence of input values for each example).
Our aim is to adapt this architecture for robust ASR with missing data. Some preliminary static classification experiments were performed on vowel spectra (individual spectral slices excised from the TIMIT database). RNN performance on this task with missing data was better than that of standard MLP and Gaussian classifiers. In the next section we show how the net can be adapted for dynamic classification of the spectral sequences constituting words.
Figure 1 illustrates our modified version of the Gingras and Bengio architecture. Instead of taking feedback from the output to the hidden layer, we have chosen a fully connected or Elman RNN (Elman, 1990), in which there are full recurrent links from the past hidden layer to the present hidden layer. We have observed that these links produce faster convergence, in agreement with Pedersen (1997). The number of input units depends on the size of the feature vector, i.e. the number of spectral channels. The number of hidden units is determined by experimentation. There is one output unit for each pattern class. In our case the classes are taken to be whole words, so in the isolated digit recognition experiments we report, there are eleven output units, for
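`one' to `nine', `zero' and `oh'. In training, missing inputs are initialised with their unconditional means. The RNN is then allowed to impute missing values for the next frame through the recurrent links, after a feedforward pass:
$$\hat{x}(m,t) = (1-\gamma)\,\hat{x}(m,t-1) + \gamma \sum_{j} v_{jm}\, f(\mathrm{hid}(j,t-1))$$
where $\hat{x}(m,t)$ is the missing feature at time t, $\gamma$ is the learning rate, $v_{jm}$ indicates recurrent links from a hidden unit to the missing input, $\mathrm{hid}(j,t-1)$ is the activation of hidden unit j at time t-1 and f is the unit activation function.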
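The forward pass with imputation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function `rnn_forward`, the weight names, the absence of bias terms, and the exact form of the update (a leaky combination of the previous input value, weighted by 1 - gamma, and the hidden-to-input recurrent contribution, weighted by gamma) are all assumptions. Here `hidden[j]` holds the tanh output of hidden unit j from the previous frame.

```python
import math
import random

def rnn_forward(frames, masks, W_in, W_rec, W_out, V_imp, train_means, gamma=0.5):
    """Run one utterance through an Elman-style net with input imputation.

    Returns (frame-averaged output vector, list of filled-in input frames).
    """
    n_in, n_hid, n_out = len(train_means), len(W_rec), len(W_out)
    hidden = [0.0] * n_hid        # tanh outputs of the previous frame
    prev_x = list(train_means)    # previous input vector (observed or imputed)
    filled, out_sum = [], [0.0] * n_out
    for t, (frame, mask) in enumerate(zip(frames, masks)):
        if t == 0:
            # No history yet: fall back on the unconditional means.
            est = list(train_means)
        else:
            # Assumed update: self-delayed term plus hidden-to-input feedback.
            est = [(1.0 - gamma) * prev_x[m]
                   + gamma * sum(V_imp[j][m] * hidden[j] for j in range(n_hid))
                   for m in range(n_in)]
        # Reliable features are used as observed; missing ones are imputed.
        x = [frame[m] if mask[m] else est[m] for m in range(n_in)]
        hidden = [math.tanh(sum(W_in[j][m] * x[m] for m in range(n_in))
                            + sum(W_rec[j][k] * hidden[k] for k in range(n_hid)))
                  for j in range(n_hid)]
        out = [math.tanh(sum(W_out[c][j] * hidden[j] for j in range(n_hid)))
               for c in range(n_out)]
        out_sum = [s + o for s, o in zip(out_sum, out)]
        filled.append(x)
        prev_x = x
    return [s / len(frames) for s in out_sum], filled

def rand_matrix(rows, cols):
    """Small random weights, for demonstration only."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n_in, n_hid, n_out = 3, 4, 11     # 11 outputs, one per digit word
W_in, W_rec = rand_matrix(n_hid, n_in), rand_matrix(n_hid, n_hid)
W_out, V_imp = rand_matrix(n_out, n_hid), rand_matrix(n_hid, n_in)
frames = [[0.2, 0.0, 0.7], [0.1, 0.0, 0.0], [0.3, 0.4, 0.5]]
masks = [[1, 0, 1], [1, 0, 0], [1, 1, 1]]
means = [0.25, 0.35, 0.45]
mean_out, filled = rnn_forward(frames, masks, W_in, W_rec, W_out, V_imp, means)
```

In this sketch the recognised word would be the index of the largest entry of `mean_out`. Training of the weights, including the imputation links `V_imp`, is not shown.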
The average of the RNN output over all the frames of an example is taken after these frames have gone through a forward pass. The sum squared error between the correct targets and the RNN output for each frame is back-propagated through time and RNN