Speech Modelling Using Subspace and EM Techniques

Part of Advances in Neural Information Processing Systems 12 (NIPS 1999)

Bibtex Metadata Paper


Gavin Smith, João de Freitas, Tony Robinson, Mahesan Niranjan


The speech waveform can be modelled as a piecewise-stationary linear stochastic state space system, and its parameters can be estimated using an expectation-maximisation (EM) algorithm. One problem is the ini(cid:173) tialisation of the EM algorithm. Standard initialisation schemes can lead to poor formant trajectories. But these trajectories however are impor(cid:173) tant for vowel intelligibility. The aim of this paper is to investigate the suitability of subspace identification methods to initialise EM. The paper compares the subspace state space system identification (4SID) method with the EM algorithm. The 4SID and EM methods are similar in that they both estimate a state sequence (but using Kalman fil(cid:173) ters and Kalman smoothers respectively), and then estimate parameters (but using least-squares and maximum likelihood respectively). The sim(cid:173) ilarity of 4SID and EM motivates the use of 4SID to initialise EM. Also, 4SID is non-iterative and requires no initialisation, whereas EM is itera(cid:173) tive and requires initialisation. However 4SID is sub-optimal compared to EM in a probabilistic sense. During experiments on real speech, 4SID methods compare favourably with conventional initialisation techniques. They produce smoother formant trajectories, have greater frequency res(cid:173) olution, and produce higher likelihoods.

1 Work done while in Cambridge Engineering Dept., UK.

Speech Modelling Using Subspace and EM Techniques