{"title": "Non-linear Prediction of Acoustic Vectors Using Hierarchical Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 835, "page_last": 842, "abstract": null, "full_text": "Non-linear Prediction of Acoustic Vectors Using Hierarchical Mixtures of Experts \n\nS. R. Waterhouse \nA. J. Robinson \nCambridge University Engineering Department, \nTrumpington St., Cambridge, CB2 1PZ, England. \nTel: [+44] 223 332800, Fax: [+44] 223 332662, \nEmail: srw1001.ajr@eng.cam.ac.uk \nURL: http://svr-www.eng.cam.ac.uk/~srw1001 \n\nAbstract \n\nIn this paper we consider speech coding as a problem of speech modelling. In particular, prediction of parameterised speech over short time segments is performed using the Hierarchical Mixture of Experts (HME) (Jordan & Jacobs 1994). The HME gives two advantages over traditional non-linear function approximators such as the Multi-Layer Perceptron (MLP): a statistical understanding of the operation of the predictor, and provision of information about the performance of the predictor in the form of likelihood information and local error bars. These two issues are examined on both toy and real-world problems of regression and time series prediction. In the speech coding context, we extend the principle of combining local predictions via the HME to a Vector Quantization scheme in which fixed local codebooks are combined on-line for each observation. \n\n1 INTRODUCTION \n\nWe are concerned in this paper with the application of multiple models, specifically the Hierarchical Mixtures of Experts, to time series prediction, specifically the problem of predicting acoustic vectors for use in speech coding. There have been a number of applications of multiple models in time series prediction. A classic example is the Threshold Autoregressive model (TAR), which was used by Tong & Lim (1980) to predict sunspot activity. More recently, Lewis, Kay and Stevens (in Weigend & Gershenfeld (1994)) describe the application of Multivariate Adaptive Regression Splines (MARS) to the prediction of future values of currency exchange rates. Finally, in speech prediction, Cuperman & Gersho (1985) describe the Switched Inter-frame Vector Prediction (SIVP) method, which switches between separate linear predictors trained on different statistical classes of speech. The form of time series prediction we shall consider in this paper is the single step prediction ŷ(t) of a future quantity y(t) by considering the previous p samples. This may be viewed as a regression problem over input-output pairs {x(t), y(t)}, where x(t) is the lag vector (y(t-1), y(t-2), ..., y(t-p)). We may perform this regression using standard linear models such as the Auto-Regressive (AR) model, or via nonlinear models such as connectionist feed-forward or recurrent networks. The HME overcomes a number of problems associated with traditional connectionist models via its architecture and statistical framework. Recently, Jordan & Jacobs (1994) and Waterhouse & Robinson (1994) have shown that via the EM algorithm and a 2nd order optimization scheme known as Iteratively Reweighted Least Squares (IRLS), the HME is faster than standard Multi-Layer Perceptrons (MLPs) by at least an order of magnitude on regression and classification tasks respectively. Jordan & Jacobs also describe various methods to visualise the learnt structure of the HME via 'deviance trees' and histograms of posterior probabilities. In this paper we provide further examples of the structural relationship of the trained HME and the input-output space in the form of expert activation plots. 
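As an illustration of the lag-vector regression setup described above, the following hypothetical Python sketch forms the input-output pairs {x(t), y(t)} and fits the linear AR baseline by least squares; it is a toy example for this section, not the paper's HME predictor, and the function names are our own.

```python
import numpy as np

def lag_pairs(y, p):
    """Form input-output pairs {x(t), y(t)} where x(t) is the
    lag vector (y(t-1), y(t-2), ..., y(t-p)), for t = p .. T-1."""
    X = np.column_stack([y[p - k : len(y) - k] for k in range(1, p + 1)])
    return X, y[p:]

def fit_ar(y, p):
    """Least-squares fit of a linear AR(p) single-step predictor
    yhat(t) = a . x(t); the simple linear baseline, not the HME."""
    X, targets = lag_pairs(y, p)
    a, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return a
```

Nonlinear predictors such as the HME act on the same lag-vector inputs, replacing the single linear map a . x(t) with a gated combination of local expert predictions.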
In addition we describe how the HME can be extended to give local error bars, or measures of confidence, in regression and time series prediction problems. Finally, we describe the extension of the HME to acoustic vector prediction, and a VQ coding scheme which utilises likelihood information from the HME. \n\n2 HIERARCHICAL MIXTURES OF EXPERTS \n\nThe HME architecture (Figure 1) is based on the principle of 'divide and conquer', in which a large, hard to solve problem is broken up into many smaller, easier to solve problems. It consists of a series of 'expert networks' which are trained on different parts of the input space. The outputs of the experts are combined by a 'gating network' which is trained to stochastically select the expert which is performing best at solving a particular part of the problem. The operation of the HME is as follows: the gating networks receive the input vector x(t) and produce as outputs the probabilities P(m_j | x(t), η_j) of assigning the current input to each branch m_j, where η_j are the gating network parameters. The expert networks sit at the leaves of the tree and each outputs a vector ŷ_j(t) given input vector x(t) and parameters θ_j. These outputs are combined in a weighted sum by P(m_j | x