New Approaches Towards Robust and Adaptive Speech Recognition

Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)



Hervé Bourlard, Samy Bengio, Katrin Weber


In this paper, we discuss some new research directions in automatic speech recognition (ASR) that somewhat deviate from the usual approaches. More specifically, we motivate and briefly describe new approaches based on multi-stream and multi-band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent) "experts", each expert focusing on a different characteristic of the signal, and that the different stream likelihoods (or posteriors) are combined at some (temporal) stage to yield a global recognition output. As a further extension to multi-stream ASR, we finally introduce a new approach, referred to as HMM2, where the HMM emission probabilities are estimated via state-specific, feature-based HMMs responsible for merging the stream information and modeling their possible correlation.

1 Multi-Channel Processing in ASR

Current automatic speech recognition systems are based on (context-dependent or context-independent) phone models described in terms of a sequence of hidden Markov model (HMM) states, where each HMM state is assumed to be characterized by a stationary probability density function. Furthermore, time correlation, and consequently the dynamics of the signal, inside each HMM state is also usually disregarded (although the use of temporal delta and delta-delta features can capture some of this correlation). Consequently, only medium-term dependencies are captured via the topology of the HMM model, while short-term and long-term dependencies are usually very poorly modeled. Ideally, we want to design a particular HMM able to accommodate multiple time-scale characteristics so that we can capture phonetic properties, as well as syllable structures and (long-term) invariants that are more robust to noise. It is, however, clear that those different time-scale features will also exhibit different levels of stationarity and will require different HMM topologies to capture their dynamics. There are many potential advantages to such a multi-stream approach, including:

  1. The definition of a principled way to merge different temporal knowledge sources such as acoustic and visual inputs, even if the temporal sequences are not synchronous and do not have the same data rate - see [13] for further discussion about this.

  2. The possibility to incorporate multiple time resolutions (as part of a structure with multiple unit lengths, such as phone and syllable).

  3. As a particular case of multi-stream processing, multi-band ASR [2, 5], involving the independent processing and combination of partial frequency bands, has many potential advantages, briefly discussed below.
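As an aside on the delta and delta-delta features mentioned above: these are conventionally obtained by a linear regression over neighboring cepstral frames. A minimal sketch of this standard regression formula (scalar coefficients and a window half-width `N` chosen here for illustration; boundary frames are simply repeated):

```python
def delta(frames, N=2):
    """Regression-based delta features over a sequence of scalar
    coefficients; edges are handled by repeating boundary frames."""
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        acc = 0.0
        for n in range(1, N + 1):
            right = frames[min(t + n, T - 1)]  # clamp at sequence end
            left = frames[max(t - n, 0)]       # clamp at sequence start
            acc += n * (right - left)
        out.append(acc / denom)
    return out
```

Applying the same regression to the delta sequence itself yields the delta-delta (acceleration) features.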

In the following, we will not discuss the underlying algorithms (more or less "complex" variants of Viterbi decoding), nor detailed experimental results (see, e.g., [4] for recent results). Instead, we will mainly focus on the combination strategy and discuss different variants around the same formalism.
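A simple instance of such a combination strategy is a linear combination of per-stream log-likelihoods (or log-posteriors). The sketch below assumes a fixed reliability weight per stream; the weighting scheme is hypothetical here, and the paper's later formalism covers more elaborate variants:

```python
def combine_streams(stream_log_likelihoods, weights=None):
    """Linearly combine per-stream log-likelihoods into one score.

    stream_log_likelihoods: one log-likelihood per stream/expert.
    weights: hypothetical per-stream reliabilities (default: uniform).
    """
    K = len(stream_log_likelihoods)
    if weights is None:
        weights = [1.0 / K] * K
    return sum(w * ll for w, ll in zip(weights, stream_log_likelihoods))
```

With uniform weights this reduces to averaging the experts' log scores; non-uniform weights let more reliable streams (e.g. uncorrupted frequency bands) dominate the decision.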

2 Multiband-based ASR

2.1 General Formalism

As a particular case of the multi-stream paradigm, we have been investigating an ASR approach based on independent processing and combination of frequency subbands. The general idea, as illustrated in Fig. 1, is to split the whole frequency band (represented in terms of critical bands) into a few subbands on which different recognizers are independently applied. The resulting probabilities are then combined later in the process, at some segmental level, for recognition. Starting from critical bands, acoustic processing is now performed independently for each frequency band, yielding K input streams, each being associated with a particular frequency band.
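The split of the critical-band representation into K subband streams can be sketched as a simple partition of the critical-band energies into contiguous groups. The even split below is a toy assumption for illustration; actual subband boundaries would follow the design choices discussed in [2, 5]:

```python
def split_subbands(critical_bands, K):
    """Partition a list of critical-band energies into K contiguous
    subband streams (toy even split; real boundaries are a design choice)."""
    B = len(critical_bands)
    bounds = [round(k * B / K) for k in range(K + 1)]
    return [critical_bands[bounds[k]:bounds[k + 1]] for k in range(K)]
```

Each of the K resulting streams would then be processed by its own acoustic model, whose scores are merged at the chosen segmental level.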