{"title": "Speech Modelling Using Subspace and EM Techniques", "book": "Advances in Neural Information Processing Systems", "page_first": 796, "page_last": 802, "abstract": null, "full_text": "Speech Modelling Using Subspace and EM \n\nTechniques \n\nGavin Smith \n\nCambridge University \nEngineering Department \n\nCambridge CB2 1PZ \n\nEngland \n\ngas1 oo3@eng.cam.ac.uk \n\nJoao FG de Freitas \n\nComputer Science Division \n\n487 Soda Hall \nUC Berkeley \n\nCA 94720-1776, USA. \njfgf@cs.berkeley.edu 1 \n\nTony Robinson \n\nCambridge University \nEngineering Department \n\nCambridge CB2 IPZ \n\nEngland \n\najr@eng.cam.ac.uk \n\nMahesan Niranjan \nComputer Science \nSheffield University \nSheffield. S 1 4DP \n\nEngland \n\nm.niranjan@dcs.shef.ac.uk \n\nAbstract \n\nThe speech waveform can be modelled as a piecewise-stationary linear \nstochastic state space system, and its parameters can be estimated using \nan expectation-maximisation (EM) algorithm. One problem is the ini(cid:173)\ntialisation of the EM algorithm. Standard initialisation schemes can lead \nto poor formant trajectories. But these trajectories however are impor(cid:173)\ntant for vowel intelligibility. The aim of this paper is to investigate the \nsuitability of subspace identification methods to initialise EM. \nThe paper compares the subspace state space system identification \n(4SID) method with the EM algorithm. The 4SID and EM methods are \nsimilar in that they both estimate a state sequence (but using Kalman fil(cid:173)\nters and Kalman smoothers respectively), and then estimate parameters \n(but using least-squares and maximum likelihood respectively). The sim(cid:173)\nilarity of 4SID and EM motivates the use of 4SID to initialise EM. Also, \n4SID is non-iterative and requires no initialisation, whereas EM is itera(cid:173)\ntive and requires initialisation. However 4SID is sub-optimal compared \nto EM in a probabilistic sense. 
During experiments on real speech, 4SID methods compare favourably with conventional initialisation techniques. They produce smoother formant trajectories, have greater frequency resolution, and produce higher likelihoods.

1 Work done while in the Cambridge University Engineering Department, UK.

1 Introduction

This paper models speech using a stochastic state space model, where the model parameters are estimated using the expectation-maximisation (EM) technique. One problem is the initialisation of the EM algorithm. Standard initialisation schemes can lead to poor formant trajectories. These trajectories are, however, important for vowel intelligibility. This paper investigates the suitability of subspace state space system identification (4SID) techniques [10, 11], which are popular in system identification, for EM initialisation.

Speech is split into fixed-length, overlapping frames. Overlap encourages temporally smoother parameter transitions between frames. Due to the slow non-stationary behaviour of speech, each frame of speech is assumed quasi-stationary and represented as a linear time-invariant stochastic state space (SS) model:

x_{t+1} = A x_t + w_t        (1)
y_t     = C x_t + v_t        (2)

The system order is p. x_t ∈ R^{p×1} is the state vector. A ∈ R^{p×p} and C ∈ R^{1×p} are system parameters. The output y_t ∈ R is the speech signal at the microphone. Process and observation noises are modelled as white, zero-mean, stationary Gaussian noises w_t ∈ R^{p×1} ~ N(0, Q) and v_t ∈ R ~ N(0, R) respectively. The problem is to estimate the parameters θ = (A, C, Q, R) from the speech y_t alone.

The structure of the paper is as follows. The theory section describes EM and 4SID applied to the parameter estimation of the above SS model. The similarity of 4SID and EM motivates the use of 4SID to initialise EM.
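To make the state space model of equations 1 and 2 concrete, the following minimal numpy sketch simulates an order-2 instance, with one lightly damped pole pair playing the role of a single formant. All names and numerical values here are illustrative, not taken from the paper.

```python
import numpy as np

def simulate_ss(A, C, Q, R, N, rng):
    """Draw y_1..y_N from the state space model of equations 1 and 2:
    x_{t+1} = A x_t + w_t,  y_t = C x_t + v_t,
    with w_t ~ N(0, Q) and v_t ~ N(0, R)."""
    p = A.shape[0]
    x = np.zeros(p)
    Lq = np.linalg.cholesky(Q)               # so Lq @ z has covariance Q
    y = np.empty(N)
    for t in range(N):
        y[t] = C @ x + np.sqrt(R) * rng.standard_normal()   # eq. (2)
        x = A @ x + Lq @ rng.standard_normal(p)             # eq. (1)
    return y

rng = np.random.default_rng(0)
theta = 2 * np.pi * 500.0 / 16000.0   # a 500 Hz resonance at fs = 16 kHz
A = 0.98 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
C = np.array([1.0, 0.0])
y = simulate_ss(A, C, Q=0.01 * np.eye(2), R=1e-4, N=240, rng=rng)
```

A rotation matrix scaled by a radius just below one gives a stable pole pair, so the simulated output is a noise-driven damped oscillation of the kind each speech frame is assumed to follow.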
Experiments on real speech then compare 4SID with more conventional initialisation methods. The discussion then compares 4SID with EM.

2 Theory

2.1 The Expectation-Maximisation (EM) Technique

Given a sequence of N observations y_{1:N} of a signal such as speech, the maximum likelihood estimate for the parameters is θ_ML = argmax_θ p(y_{1:N} | θ). EM breaks the maximisation of this potentially difficult likelihood function down into an iterative maximisation of a simpler likelihood function, generating a new estimate θ_k at each iteration. Rewriting p(y_{1:N} | θ) in terms of a hidden state sequence x_{1:N}, and taking expectations over p(x_{1:N} | y_{1:N}, θ_k):

log p(y_{1:N} | θ) = log p(x_{1:N}, y_{1:N} | θ) - log p(x_{1:N} | y_{1:N}, θ)            (3)
log p(y_{1:N} | θ) = E_k[log p(x_{1:N}, y_{1:N} | θ)] - E_k[log p(x_{1:N} | y_{1:N}, θ)]  (4)

Iterative maximisation of the first expectation in equation 4 guarantees an increase in log p(y_{1:N} | θ):

θ_{k+1} = argmax_θ E_k[log p(x_{1:N}, y_{1:N} | θ)]        (5)

This converges to a local or global maximum depending on the initial parameter estimate θ_0. Refer to [8] for more details. EM can thus be applied to the stochastic state space model of equations 1 and 2 to determine optimal parameters θ. An explanation is given in [3]. The EM algorithm applied to the SS system consists of two stages per iteration. Firstly, given current parameter estimates, states are estimated using a Kalman smoother. Secondly, given these states, new parameters are estimated by maximising the expected log likelihood function. We employ the Rauch-Tung-Striebel formulation of the Kalman smoother [2].

2.2 The State-Space Model

Equations 1 and 2 can be cast in block matrix form and are termed the state sequence and block output equations respectively [10]. Note that the use of blocking and fixed-length signals applies restrictions to the general model in section 1. i > p is the block size.
\n\nXi+I,i+j \nYI!i \n\nAiXl,j + arWI!i \nriXI,j + HrWI!i + VI!i \n\n(6) \n(7) \n\nXi+I,i+j is a state sequence matrix; its columns are the state vectors from time (i + 1) to \n(i+j). XI ,j is similarly defined. Y W is a Hankel matrix of outputs from time 1 to (i+j-1). \nW and V are similarly defined. a i \nis a reversed extended controllability-type matrix, r i \nis the extended observability matrix and Hi is a Toeplitz matrix. These are all defined \nbelow where IPxp is an identity matrix. \n\ndef \n\ndef \n\nXI,j \n\na w \n\n~ \n\n[Xl X2 X3 \u2022\u2022. Xj] \n\n[Ai- I A i- 2 ... I] \n\nr\u00b7~ ~ -\n\n[ci, 1 \n\nY \n\n[ y, \n~f Y2 \n: \n\nl!i -\n\nY2 \nY3 \n\nYj \n\nYj+1 \n\nYi Yi+l \n\nYi+j-l \n\nH~~f \n\n, \n\n1 \n\n[ c1-, \n\n0 \n\n:J \n\nC \n\nA sequence of outputs can be separated into two block output equations containing past \nand future outputs denoted with subscriptsp and! respectively. With Yp dg YI!i, Y, dg \nY i+1!2i and similarly for W and V, and Xp = XI,j and X, = Xi+I,i+j, past and \nfuture are related by the equations \n\nde, \n\nde, \n\nAiXp +arWp \nriXp + HiWp + Vp \nrix, + HiW, + V, \n\n(8) \n(9) \n(10) \n\n2.3 Subspace State Space System Identification (4SID) Techniques \n\nComments throughout this section on 4SIO are largely taken from the work of Van Over(cid:173)\nschee and Oe Moor [10]. 4SIO methods are related to instrumental variable (IV) methods \n[11]. 4SIO algorithms are composed of two stages. Stage one involves the low-rank ap(cid:173)\nproximation and estimation of the extended observability matrix directly from the output \n\n\fSpeech Modelling Using Subspace and EM Techniques \n\n799 \n\ndata. For example, consider the future output block equation 10. Y, undergoes an orthogo(cid:173)\nnal projection onto the row space ofY p' This is denoted by Y, /'J p = Y, YJ (Y p YJ) ty p, \nwhere t is the Moore-Penrose inverse. 
\n\nr iX, /'J p + HfW, /'J p + V, /'J p \nrix,/'J p \n\n(11) \n\nStage two involves estimation of system parameters. The singular value decomposition of \nY, /'Jp allows the observability and state sequence matrices to be estimated to within a \nsimilarity transform from the column and row spaces respectively. From these two matri(cid:173)\nces, system parameters (A, c, Q, R) can be determined by least-squares. \nThere are two interesting comments. Firstly, the orthogonal projection from stage one co(cid:173)\nincides with a minimum error between true data Y, and its linear prediction from Y p in \nthe Frobenius norm. Greater flexibility is obtained by weighting the projection with ma(cid:173)\ntrices WI and W 2 and analysing this: WI (YJi'J p )W2 \u2022 4SID and IV methods differ \nwith respect to these weighting matrices. Weighting is similar to prefiltering the observa(cid:173)\ntions prior to analysis to preferentially weight some frequency domain, as is common in \nidentification theory [6]. Secondly, the state estimates from stage two can be considered as \noutputs from a parallel bank of Kalman filters, each one estimating a state from the previous \ni observations, and initialised using zero conditions. \n\nThe particular subspace algorithm and software used in this paper is the sto-pos algorithm \nas detailed in [10]. Although this algorithm introduces a small bias into some of the pa(cid:173)\nrameter estimates, it guarantees positive realness of the covariance sequence, which in turn \nguarantees the definition of a forward innovations model. \n\n3 Experiments \n\nExperiments are conducted on the phrase \"in arithmetic\", spoken by an adult male. The \nspeech waveform is obtained from the Eurom 0 database [4] and sampled at 16 kHz. The \nspeech waveform is divided into fixed-length, overlapping frames, the mean is subtracted \nand then a hamming window is applied. Frames are 15 ms in duration, shifted 7.5 ms \neach frame. 
Speech is modelled as detailed in section 1. All models are of order 8. A frame is assumed silent, and no analysis is done, when the mean energy per sample is less than an empirically defined threshold.

For the EM algorithm, a modified version of the software in [3] is used. The initial state vector and covariance matrix are set to zero and identity respectively, and 50 iterations are applied. For numerical stability, only the diagonal of Q is retained in the M-step (see [3]).

In these experiments, three schemes for initialising the parameters of the EM algorithm, that is for estimating θ_0, are compared in terms of their formant trajectories relative to the spectrogram and their likelihoods. The three schemes are:

• 4SID. This is the subspace method of section 2.3 with block size 16.
• ARMA. This estimates θ_0 using the customised Matlab armax function 1, which models the speech waveform as an autoregressive moving average (ARMA) process with order 8 polynomials.
• AR(1). This uses a simplistic method that models the speech waveform as a first order autoregressive (AR) process with some randomness introduced into the estimation. It still initialises all parameters fully 2.

1 armax minimises a robustified quadratic prediction error criterion using an iterative Gauss-Newton algorithm, initialised using a four-stage least-squares instrumental variables algorithm [7].

Results are shown in Figures 1 and 2. Figure 1 shows the speech waveform, spectrogram and formant trajectories for EM under all three initialisation schemes. Here formant frequencies are derived from the phases of the positive-phase eigenvalues of A after 50 iterations of EM. Comparison with the spectrogram shows that, for this order 8 model, 4SID-EM produces the best formant trajectories.
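The mapping from the estimated A to formant frequencies, as used for Figure 1, can be sketched as follows; the helper name and default sampling rate are illustrative.

```python
import numpy as np

def formant_frequencies(A, fs=16000.0):
    """Read formant frequencies (Hz) off the eigenvalues of the state
    transition matrix A: keep the positive-phase eigenvalue of each
    conjugate pair and convert its phase from radians/sample to Hz."""
    phases = np.angle(np.linalg.eigvals(A))
    positive = phases[phases > 0]            # one per conjugate pair
    return np.sort(positive * fs / (2 * np.pi))

# an order-2 A with a pole pair at radius 0.95, angle 0.2*pi
theta = 0.2 * np.pi
A = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
freqs = formant_frequencies(A)               # one formant, near 1600 Hz
```

An order 8 model has at most four such conjugate pairs, hence at most four formant candidates per frame; the pole radius (here 0.95) governs the formant bandwidth rather than its frequency.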
Figure 2 shows mean average plots of likelihood against EM iteration number for each initialisation scheme. 4SID-EM gives greater likelihoods than ARMA-EM and AR(1)-EM. The difference in formant trajectories between subspace-EM and ARMA-EM, despite both achieving high likelihoods, demonstrates the multi-modality of the likelihood function. For AR(1)-EM, a few frames were not estimated due to numerical instability.

4 Discussion

Both the 4SID and EM algorithms employ similar methodologies: states are first estimated using a Kalman device, and these states are then used to estimate the system parameters according to similar criteria. However, in EM, states are estimated from past, present and future observations with a Kalman smoother, and the system parameters are then estimated using maximum likelihood (ML); whereas in 4SID, states are estimated from the previous i observations only, with non-steady-state Kalman filters, and the system parameters are then estimated using least-squares (LS) subject to a positive realness constraint on the covariance sequence. Refer also to [5] for a similar comparison.

4SID algorithms are sub-optimal for three reasons. Firstly, states are estimated using only partial observation sequences. Secondly, the LS criterion is only an approximation to the ML criterion. Thirdly, the positive realness constraint introduces bias. A positive realness constraint is necessary because of the finite amount of data and any shortcomings of the SS model. For these reasons, 4SID methods are used to initialise rather than replace EM in these experiments.

4SID methods also have some advantages. Firstly, they are linear and non-iterative, and so do not suffer from the disadvantages typical of iterative algorithms (including EM), such as sensitivity to initial conditions, convergence to local minima, and the definition of convergence criteria.
Secondly, they require little prior parameterisation beyond the definition of the system order, which can be determined in situ from the singular values of the orthogonal projection. Thirdly, the use of the SVD gives the algorithms numerical robustness. Fourthly, they have higher frequency resolution than prediction error minimisation methods such as ARMA and AR [1].

5 Conclusions

4SID methods can be used to initialise EM, giving better formant tracks, higher likelihoods and better frequency resolution than more conventional initialisation methods. In future we hope to compare 4SID methods with EM in a principled probabilistic manner, investigate weighting matrices further, and apply these methods to speech enhancement. Further work is presented by Smith et al. in [9], and similar work by Grivel et al. in [5].

Acknowledgements
We are grateful for the use of the 4SID software supplied with [10] and the EM software of

2 Presented in the software in [3], this method is best used when the dimensions of the state space and the observations are the same.