{"title": "Time Series Prediction using Mixtures of Experts", "book": "Advances in Neural Information Processing Systems", "page_first": 309, "page_last": 318, "abstract": null, "full_text": "Time Series Prediction Using Mixtures of Experts

Assaf J. Zeevi
Information Systems Lab
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
azeevi@isl.stanford.edu

Ron Meir
Department of Electrical Engineering
Technion
Haifa 32000, Israel
rmeir@ee.technion.ac.il

Robert J. Adler
Department of Statistics
University of North Carolina
Chapel Hill, NC 27599
adler@stat.unc.edu

Abstract

We consider the problem of prediction of stationary time series, using the architecture known as mixtures of experts (MEM). Here we suggest a mixture which blends several autoregressive models. This study focuses on some theoretical foundations of the prediction problem in this context. More precisely, it is demonstrated that this model is a universal approximator, with respect to learning the unknown prediction function. This statement is strengthened as upper bounds on the mean squared error are established. Based on these results it is possible to compare the MEM to other families of models (e.g., neural networks and state dependent models). It is shown that a degenerate version of the MEM is in fact equivalent to a neural network, and the number of experts in the architecture plays a similar role to the number of hidden units in the latter model.

1 Introduction

In this work we pursue a new family of models for time series, substantially extending, but strongly related to and based on, the classic linear autoregressive moving average (ARMA) family.
We wish to exploit the linear autoregressive technique in a manner that will enable a substantial increase in modeling power, in a framework which is non-linear and yet mathematically tractable.

The novel model, whose main building blocks are linear AR models, deviates from linearity in the integration process, that is, in the way these blocks are combined. This model was first formulated in the context of a regression problem, and an extension to a hierarchical structure was also given [2]. It was termed the mixture of experts model (MEM).

Variants of this model have recently been used in prediction problems in both economics and engineering. Recently, some theoretical aspects of the MEM, in the context of non-linear regression, were studied by Zeevi et al. [8], and an equivalence to a class of neural network models has been noted.

The purpose of this paper is to extend the previous work regarding the MEM in the context of regression to the problem of prediction of time series. We shall demonstrate that the MEM is a universal approximator, and establish upper bounds on the approximation error, as well as on the mean squared error, in the setting of estimation of the predictor function.

It is shown that the MEM is intimately related to several existing, state of the art, statistical non-linear models, encompassing Tong's TAR (threshold autoregressive) model [7] and a certain version of Priestley's state dependent models (SDM) [6]. In addition, it is demonstrated that the MEM is equivalent (in a sense that will be made precise) to the class of feedforward, sigmoidal, neural networks.

2 Model Description

The MEM [2] is an architecture composed of n expert networks, each being an AR(d) linear model. The experts are combined via a gating network, which partitions the input space accordingly.
Considering a scalar time series $\{x_t\}$, we associate with each expert a probabilistic model (density function) relating input vectors $x_{t-d}^{t-1} \equiv [x_{t-1}, x_{t-2}, \ldots, x_{t-d}]$ to an output scalar $x_t \in \mathbb{R}$, and denote these probabilistic models by $p(x_t \mid x_{t-d}^{t-1}; \theta_j, \sigma_j)$, $j = 1, 2, \ldots, n$, where $(\theta_j, \sigma_j)$ is the expert parameter vector, taking values in a compact subset of $\mathbb{R}^{d+1}$. In what follows we will use upper case $X_t$ to denote random variables, and lower case $x_t$ to denote values taken by those r.v.'s.

Letting the parameters of each expert network be denoted by $(\theta_j, \sigma_j)$, $j = 1, 2, \ldots, n$, those of the gating network by $\theta_g$, and letting $\Theta = (\{\theta_j, \sigma_j\}_{j=1}^n, \theta_g)$ represent the complete set of parameters specifying the model, we may express the conditional distribution of the model, $p(x_t \mid x_{t-d}^{t-1}; \Theta)$, as

$$p(x_t \mid x_{t-d}^{t-1}; \Theta) = \sum_{j=1}^{n} g_j(x_{t-d}^{t-1}; \theta_g)\, p(x_t \mid x_{t-d}^{t-1}; \theta_j, \sigma_j), \qquad (1)$$

for all $x_{t-d}^{t-1}$. We assume that the parameter vector $\Theta \in \Omega$, a compact subset of $\mathbb{R}^{2n(d+1)}$.

Following the work of Jordan and Jacobs [2] we take the probability density functions to be Gaussian with mean $\theta_j^T x_{t-d}^{t-1} + \theta_{j,0}$ and variance $\sigma_j^2$ (representative of the underlying, local AR(d) model). The gating function is $g_j(x; \theta_g) \equiv \exp(\theta_{g_j}^T x + \theta_{g_j,0}) / \sum_{i=1}^{n} \exp(\theta_{g_i}^T x + \theta_{g_i,0})$, thus implementing a multiple output logistic regression function.

The underlying non-linear mapping (i.e., the conditional expectation, or $L_2$ prediction function) characterizing the MEM is described by using (1) to obtain the conditional expectation of $X_t$,

$$f_n^{\theta} = E[X_t \mid x_{t-d}^{t-1}; \mathcal{M}_n] = \sum_{j=1}^{n} g_j(x_{t-d}^{t-1}; \theta_g)\, [\theta_j^T x_{t-d}^{t-1} + \theta_{j,0}], \qquad (2)$$

where $\mathcal{M}_n$ denotes the MEM model. Here the subscript n stands for the number of experts. Thus, we have $\hat{X}_t = f_n(X_{t-d}^{t-1}; \Theta)$, where $f_n : \mathbb{R}^d \times \Omega \to \mathbb{R}$ and $\hat{X}_t$ denotes the projection of $X_t$ on the 'relevant past' given the model, thus defining the model predictor function.
We will use the notation MEM(n; d), where n is the number of experts in the model (proportional to the complexity, or number of parameters, of the model) and d is the lag size. In this work we assume that d is known and given.

3 Main results

3.1 Background

We consider a stationary time series, more precisely a discrete time stochastic process $\{X_t\}$ which is assumed to be strictly stationary. We define the $L_2$ predictor function

$$f = E[X_t \mid X_{-\infty}^{t-1}] = E[X_t \mid X_{t-d}^{t-1}] \quad \text{a.s.}$$

for some fixed lag size d. Markov chains are perhaps the most widely encountered class of probability models exhibiting this dependence. The NAR(d), that is, non-linear AR(d), model is another example, widely studied in the context of time series (see [4] for details). Assuming additive noise, the NAR(d) model may be expressed as

$$X_t = f(X_{t-d}^{t-1}) + \varepsilon_t. \qquad (3)$$

We note that in this formulation $\{\varepsilon_t\}$ plays the role of the innovation process for $X_t$, and the function $f(\cdot)$ describes the information on $X_t$ contained within its past history.

In what follows, we restrict the discussion to stochastic processes satisfying certain constraints on the memory decay; more precisely, we assume that $\{X_t\}$ is an exponentially $\alpha$-mixing process. Loosely stated, this assumption enables the process to have a law of large numbers associated with it, as well as a certain version of the central limit theorem. These results are the basis for analyzing the asymptotic behavior of certain parameter estimators (see [9] for further details), but other than that this assumption is merely stated here for the sake of completeness. We note in passing that this assumption may be substantially weakened while still allowing similar results to hold, but this requires more background and notation to be introduced, and is therefore not pursued in what follows (the reader is referred to [1] for further details).

3.2 Objectives

Knowing the $L_2$ predictor function, f, allows optimal prediction of future samples, where optimal is meant in the sense that the predicted value is the closest to the true value of the next sample point, in the mean squared error sense. It therefore seems a reasonable strategy to try to learn the optimal predictor function, based on some finite realization of the stochastic process, which we will denote $\mathcal{D}_N = \{x_t\}_{t=1}^{N+d}$. Note that for $N \gg d$, the number of sample points is approximately N.

We therefore define our objective as follows. Based on the data $\mathcal{D}_N$, we seek the 'best' approximation to f, the $L_2$ predictor function, using the MEM(n, d) predictor $f_n^{\theta} \in \mathcal{M}_n$ as the approximator model.

More precisely, define the least squares (LS) parameter estimator for the MEM(n, d) as

$$\hat{\theta}_{n,N} = \arg\min_{\theta \in \Omega} \sum_{t=d+1}^{N} \left[ x_t - f_n(x_{t-d}^{t-1}; \theta) \right]^2,$$

where $f_n(x_{t-d}^{t-1}; \theta)$ is $f_n^{\theta}$ evaluated at the point $x_{t-d}^{t-1}$, and define the LS functional estimator as

$$\hat{f}_{n,N} = f_n \big|_{\theta = \hat{\theta}_{n,N}},$$

where $\hat{\theta}_{n,N}$ is the LS parameter estimator.

Now, define the functional estimator risk as

$$\mathrm{MSE}[f, \hat{f}_{n,N}] \equiv E_{\mathcal{D}} \left[ \int |f - \hat{f}_{n,N}|^2 \, d\nu \right],$$

where $\nu$ is the d-fold probability measure of the process $\{X_t\}$. In this work we maintain that the integration is over some compact domain $I^d \subset \mathbb{R}^d$, though recent work [3] has shown that the results can be extended to $\mathbb{R}^d$, at the price of slightly slower convergence rates.

It is reasonable, and quite customary, to expect a 'good' estimator to be one that is asymptotically unbiased.
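The least-squares estimation step defined above can be sketched numerically. The function name fit_mem_ls, the flat parameter packing, and the use of a generic quasi-Newton optimizer (rather than the EM procedure of Jordan and Jacobs [2]) are our own illustrative assumptions; the paper itself does not prescribe an optimization algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def fit_mem_ls(series, n=2, d=1, seed=0):
    # Least-squares fit of an MEM(n, d) predictor: minimize
    # sum_{t=d+1}^{N} [x_t - f_n(x_{t-d}^{t-1}; theta)]^2 over theta.
    x = np.asarray(series, dtype=float)
    X = np.stack([x[t - d:t][::-1] for t in range(d, len(x))])  # lag vectors
    y = x[d:]

    def unpack(p):
        gw = p[:n * (d + 1)].reshape(n, d + 1)   # gating weights + biases
        ew = p[n * (d + 1):].reshape(n, d + 1)   # expert AR weights + biases
        return gw, ew

    def predict(p):
        gw, ew = unpack(p)
        logits = X @ gw[:, :d].T + gw[:, d]
        logits -= logits.max(axis=1, keepdims=True)
        g = np.exp(logits)
        g /= g.sum(axis=1, keepdims=True)        # softmax gating, row-wise
        preds = X @ ew[:, :d].T + ew[:, d]       # each expert's AR prediction
        return (g * preds).sum(axis=1)

    def ls_loss(p):
        return float(((y - predict(p)) ** 2).sum())

    rng = np.random.default_rng(seed)
    p0 = 0.1 * rng.standard_normal(2 * n * (d + 1))
    res = minimize(ls_loss, p0, method='BFGS')   # generic optimizer, not EM
    return res.x, ls_loss(res.x), ls_loss(p0)
```

On a realization of a linear AR(1) process this returns a parameter vector whose in-sample squared error is no worse than at the random initial point; it is a sketch of the LS objective only, with no claim about global optimality.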
However, growth of the sample size itself need not, and in general does not, mean that the estimator is 'becoming' unbiased. Consequently, as a figure of merit, we may restrict attention to the approximation capacity of the model. That is, we ask: what is the error in approximating a given class of predictor functions, using the MEM(n, d) (i.e., $\{\mathcal{M}_n\}$) as the approximator class?

To measure this figure, we define the optimal risk as

$$\mathrm{MSE}[f, f_n^*] \equiv \int |f - f_n^*|^2 \, d\nu,$$

where $f_n^* \equiv f_n^{\theta} \big|_{\theta = \theta_n^*}$ and

$$\theta_n^* = \arg\min_{\theta \in \Omega} \int |f - f_n^{\theta}|^2 \, d\nu,$$

that is, $\theta_n^*$ is the parameter minimizing the expected $L_2$ loss function. One may think of $f_n^*$ as the 'best' predictor function in the class of approximators, i.e., the closest approximation to the optimal predictor, given the finite complexity n (i.e., finite number of parameters) of the model. Here n is simply the number of experts (AR models) in the architecture.

3.3 Upper Bounds on the Mean Squared Error and Universal Approximation Results

Consider first the case where we are simply interested in approximating the function f, assuming it belongs to some class of functions. The question then arises as to how well one may approximate f by an MEM architecture comprising n experts. The answer to this question is given in the following proposition, the proof of which can be found in [8].

Proposition 3.1 (Optimal risk bounds) Consider the class of functions $\mathcal{M}_n$ defined in (2), and assume that the optimal predictor f belongs to a Sobolev class containing r continuous derivatives in $L_2$. Then the following bound holds:

$$\mathrm{MSE}[f, f_n^*] \le \frac{c}{n^{2r/d}}, \qquad (4)$$

where c is a constant independent of n.
PROOF SKETCH: The proof proceeds by first approximating the normalized gating functions $g_j(\cdot)$ by polynomials of finite degree, and then using the fact that polynomials can approximate functions in Sobolev space to within a known degree of approximation.

The following theorem, establishing upper bounds on the functional estimator risk, constitutes the main result of this paper. The proof is given in [9].

Theorem 3.1 (Upper bounds on the estimator risk) Suppose the stochastic process obeys the conditions set forth in the previous section. Assume also that the optimal predictor function, f, possesses r smooth derivatives in $L_2$. Then for N sufficiently large we have

$$\mathrm{MSE}[f, \hat{f}_{n,N}] \le \frac{c}{n^{2r/d}} + \frac{m_n^*}{2N} + o\!\left(\frac{1}{N}\right), \qquad (5)$$

where r is the number of continuous derivatives in $L_2$ that f is assumed to possess, d is the lag size, and N is the size of the data set $\mathcal{D}_N$.

PROOF SKETCH: The proof proceeds by a standard stochastic Taylor expansion of the loss around the point $\theta_n^*$. Making common regularity assumptions [1] and using the assumption on the $\alpha$-mixing nature of the process allows one to establish the usual asymptotic normality results, from which the result follows.

We use the notation $m_n^*$ to denote the effective number of parameters. More precisely, $m_n^* = \mathrm{Tr}\{B^*(A^*)^{-1}\}$, where the matrices $A^*$ and $B^*$ are related to the Fisher information matrix in the case of misspecified estimation (see [1] for further discussion). The upper bound presented in Theorem 3.1 is related to the classic bias-variance decomposition in statistics, and the obvious tradeoffs are evident by inspection.

3.4 Comments

It follows from Proposition 3.1 that the class of mixtures of experts is a universal approximator, with respect to the class of target functions defined for the optimal predictor.
Moreover, Proposition 3.1 establishes the rate of convergence of the approximator, and therefore relates the approximation error to the number of experts used in the architecture (n).

Theorem 3.1 strengthens this result, as it relates the sample complexity and the model complexity for this class of models. The upper bounds may be used in defining model selection criteria, based on upper bound minimization. In this setting, we may use an estimator of the stochastic error bound (i.e., the estimation error) to penalize the complexity of the model, in the spirit of AIC, MDL, etc. (see [8] for further discussion).

At first glance it may seem surprising to find that a combination of linear models is a universal function approximator. However, one must keep in mind that the global model is nonlinear, due to the gating network. Nevertheless, this result does imply, at least on theoretical grounds, that one may restrict the MEM(n, d) to be locally linear, without loss of generality. Thus, taking a simple local model, enabling efficient and tractable learning algorithms (see [2]), still results in a rich global model.

3.5 Comparison

Recently, Mhaskar [5] proved upper bounds for feedforward sigmoidal neural networks, for target functions in the same class as we consider herein, i.e., the Sobolev class. The bound we have obtained in Proposition 3.1, and its extension in [8], demonstrate that with respect to this particular target class, neural networks and mixtures of experts are equivalent. That is, both models attain optimal precision in the degree of approximation results (see [5] for the details of this argument). Keeping in mind the advantages of the MEM with respect to learning and generalization [2], we believe that our results lend further credence to the emerging view of the superiority of modular architectures over the more standard feedforward neural networks.
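The universal-approximation property can be illustrated numerically with a degenerate MEM whose experts are constants blended by the paper's linear-softmax gate. The helper name constant_expert_mem, the sharpness parameter beta, the grid of gate centers, and the sine target are all our own illustrative assumptions:

```python
import numpy as np

def constant_expert_mem(x, centers, consts, beta=4000.0):
    # Degenerate MEM: each expert is a constant c_j, blended by a linear-softmax
    # gate. The logits beta*(m_j*x - m_j^2/2) differ from -beta/2*(x - m_j)^2
    # only by a term common to all j, so the gate softly selects the expert
    # whose center m_j is nearest to x, while staying linear in x as in the
    # paper's gating network.
    logits = beta * (np.outer(x, centers) - centers ** 2 / 2.0)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    g = np.exp(logits)
    g /= g.sum(axis=1, keepdims=True)
    return g @ consts

# Approximate a smooth target on [0, 1] with n = 50 constant experts.
target = lambda t: np.sin(2.0 * np.pi * t)
centers = np.linspace(0.0, 1.0, 50)
consts = target(centers)
xs = np.linspace(0.0, 1.0, 501)
err = float(np.max(np.abs(constant_expert_mem(xs, centers, consts) - target(xs))))
```

Even with constant experts the sup-norm error over the grid is small, and it shrinks as the number of experts n grows, in the spirit of the rate in Proposition 3.1.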
Moreover, the detailed proof of Proposition 3.1 (see [8]) actually takes the MEM(n, d) to be made up of local constants. That is, the linear experts are degenerated to constant functions. Thus, one may conjecture that mixtures of experts are in fact a more general class than feedforward neural networks, though we have no proof of this as of yet.

Two nonlinear alternatives, generalizing standard statistical linear models, were pointed out in the introductory section. These are Tong's TAR (threshold autoregressive) model [7], and the more general SDM (state dependent models) introduced by Priestley. The latter models can be reduced to a TAR model by imposing a more restrictive structure (for further details see [6]). We have shown, based on the results described above (see [9]), that the MEM may be viewed as a generalization of the SDM (and consequently of the TAR model). The relation to the state dependent models is of particular interest, as the mixture of experts is structured on state dependence as well. Exact statements and proofs of these facts can be found in [9].

We should also note that we have conducted several numerical experiments comparing the performance of the MEM with other approaches. We tested the model on both synthetic and real-world data. Without any fine-tuning of parameters, we found the performance of the MEM, with linear expert functions, to compare very favorably with other approaches (such as TAR, ARMA and neural networks). Details of the numerical results may be found in [9]. Moreover, the model also provided a very natural and intuitive segmentation of the process.

4 Discussion

In this work we have pursued a novel non-linear model for prediction in stationary time series.
The mixture of experts model (MEM) has been demonstrated to be a rich model, endowed with a sound theoretical basis, which compares favorably with other state of the art nonlinear models.

We hope that the results of this study will aid in establishing the MEM as yet another powerful tool for the study of time series, applicable to the fields of statistics, economics, and signal processing.

References

[1] Domowitz, I. and White, H. \"Misspecified Models with Dependent Observations\", Journal of Econometrics, vol. 20, pp. 35-58, 1982.

[2] Jordan, M. and Jacobs, R. \"Hierarchical Mixtures of Experts and the EM Algorithm\", Neural Computation, vol. 6, pp. 181-214, 1994.

[3] Maiorov, V. and Meir, R. \"Approximation Bounds for Smooth Functions in $C(\mathbb{R}^d)$ by Neural and Mixture Networks\", submitted for publication, December 1996.

[4] Meyn, S.P. and Tweedie, R.L. Markov Chains and Stochastic Stability, Springer-Verlag, London, 1993.

[5] Mhaskar, H. \"Neural Networks for Optimal Approximation of Smooth and Analytic Functions\", Neural Computation, vol. 8(1), pp. 164-177, 1996.

[6] Priestley, M.B. Non-linear and Non-stationary Time Series Analysis, Academic Press, New York, 1988.

[7] Tong, H. Threshold Models in Non-linear Time Series Analysis, Springer-Verlag, New York, 1983.

[8] Zeevi, A.J., Meir, R. and Maiorov, V. \"Error Bounds for Functional Approximation and Estimation Using Mixtures of Experts\", EE Pub. CC-132, Electrical Engineering Department, Technion, 1995.

[9] Zeevi, A.J., Meir, R. and Adler, R.J. \"Non-linear Models for Time Series Using Mixtures of Experts\", EE Pub. CC-150, Electrical Engineering Department, Technion, 1996.
", "award": [], "sourceid": 1203, "authors": [{"given_name": "Assaf", "family_name": "Zeevi", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}, {"given_name": "Robert", "family_name": "Adler", "institution": null}]}