{"title": "Dynamical Modeling with Kernels for Nonlinear Time Series Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 129, "page_last": 136, "abstract": "", "full_text": "Dynamical Modeling with Kernels for Nonlinear\n\nTime Series Prediction\n\nLiva Ralaivola\n\nLaboratoire d\u2019Informatique de Paris 6\n\nUniversit\u00b4e Pierre et Marie Curie\n\n8, rue du capitaine Scott\nF-75015 Paris, FRANCE\n\nFlorence d\u2019Alch\u00b4e\u2013Buc\n\nLaboratoire d\u2019Informatique de Paris 6\n\nUniversit\u00b4e Pierre et Marie Curie\n\n8, rue du capitaine Scott\nF-75015 Paris, FRANCE\n\nliva.ralaivola@lip6.fr\n\nflorence.dalche@lip6.fr\n\nAbstract\n\nWe consider the question of predicting nonlinear time series. Kernel Dy-\nnamical Modeling (KDM), a new method based on kernels, is proposed\nas an extension to linear dynamical models. The kernel trick is used\ntwice: \ufb01rst, to learn the parameters of the model, and second, to compute\npreimages of the time series predicted in the feature space by means of\nSupport Vector Regression. Our model shows strong connection with the\nclassic Kalman Filter model, with the kernel feature space as hidden state\nspace. Kernel Dynamical Modeling is tested against two benchmark time\nseries and achieves high quality predictions.\n\n1\n\nIntroduction\n\nPrediction, smoothing and \ufb01ltering are traditional tasks applied to time series. The machine\nlearning community has recently paid a lot of attention to these problems and especially\nto nonlinear time series prediction in various areas such as biological signals, speech or\n\ufb01nancial markets. To cope with non linearities, extensions of the Kalman \ufb01lter [5, 4] have\nbeen proposed for \ufb01ltering and smoothing while recurrent arti\ufb01cial neural networks [2] and\nsupport vector regressors [7, 8] have been developed for prediction purposes. 
In this paper, we focus on prediction tasks and introduce a powerful method based on the kernel trick [1], which has been successfully used in tasks ranging from classification and regression to data analysis (see [13, 15] for details). Time series modeling is addressed by extending the framework of observable linear dynamical systems [12] to the feature space defined by a kernel. The predictions are realized in the feature space and are then transformed to obtain the corresponding preimages in the input space. While the proposed model could be used for smoothing as well as filtering, we here focus on the prediction task. A link to the Kalman filter can be drawn by noticing that, given the efficiency of our model for the prediction task, it can be used as a hidden transition process in the Kalman filter setting.\n\nThe paper is organized as follows. In the next section, we describe how the modeling of a time series can take place in the feature space and explain how to solve the preimage problem by a learning strategy. In the third section, we present prediction results achieved by our model. In the fourth section, the estimation algorithm is discussed and its link to the Kalman filter is highlighted. We finally conclude by giving some perspectives on our work.\n\n\f2 Principles of Dynamical Modeling with Kernels\n\n2.1 Basic Formulation\n\nThe problem we address is that of modeling d-dimensional nonlinear real-valued time series defined as\n\nx_{t+1} = h(x_t) + u    (1)\n\nfrom an observed sequence x_{1:T} = {x_1, ..., x_T} produced by this model, where h is a (possibly unknown) nonlinear function and u a noise vector.\n\nModeling such a series can be done with the help of recurrent neural networks [2] or support vector machines [7]. In this work, we instead propose to deal with this problem by extending linear dynamical modeling thanks to the kernel trick. 
Instead of considering the observation sequence x_{1:T} = {x_1, ..., x_T}, we consider the sequence x^φ_{1:T} = {φ(x_1), ..., φ(x_T)}, where φ is a mapping from R^d to H and k its associated kernel function [15] such that k(v_1, v_2) = ⟨φ(v_1), φ(v_2)⟩ for all v_1, v_2 in R^d, ⟨·, ·⟩ being the inner product of H. The Kernel Dynamical Model (KDM) obtained can be written as:\n\nx^φ_{t+1} = A^φ x^φ_t + μ^φ + ν^φ    (2)\n\nwhere A^φ is the process transition matrix, μ^φ an offset vector, ν^φ ∈ H a Gaussian isotropic noise of magnitude σ², and x^φ_t stands for φ(x_t). We are going to show that it is possible to apply the maximum likelihood principle to identify σ², A^φ and μ^φ, and to come back to the input space thanks to preimage determination.\n\n2.2 Estimation of the Model Parameters\n\nLearning the parameters of model (2) by maximum likelihood given an observation sequence x^φ_{1:T} merely consists in optimizing the associated log-likelihood L^φ(x^φ_{1:T}; θ^φ)¹:\n\nL^φ(x^φ_{1:T}; θ^φ) = ln ( P(x^φ_1) ∏_{t=2}^T P(x^φ_t | x^φ_{t−1}) )\n= g(μ^φ_1, Σ^φ_1) − (1/(2σ²)) Σ_{t=2}^T ‖x^φ_t − A^φ x^φ_{t−1} − μ^φ‖² − (1/2) p(T − 1) ln σ²\n\nwhere p is the dimension of H and g(μ^φ_1, Σ^φ_1) is a function straightforward to compute, which we leave aside as it does not add any complexity in setting the gradient of L^φ to 0. 
Indeed, performing this task leads to the equations:\n\nA^φ = ( Σ_{t=2}^T x^φ_t x^φ_{t−1}′ − (1/(T−1)) Σ_{t=2}^T x^φ_t Σ_{t=2}^T x^φ_{t−1}′ ) ( Σ_{t=2}^T x^φ_{t−1} x^φ_{t−1}′ − (1/(T−1)) Σ_{t=2}^T x^φ_{t−1} Σ_{t=2}^T x^φ_{t−1}′ )⁻¹    (3)\n\nμ^φ = (1/(T−1)) Σ_{t=2}^T ( x^φ_t − A^φ x^φ_{t−1} )    (4)\n\nσ² = (1/(p(T−1))) Σ_{t=2}^T ‖x^φ_t − A^φ x^φ_{t−1} − μ^φ‖²    (5)\n\n¹θ^φ := {A^φ, μ^φ, σ², μ^φ_1, Σ^φ_1}, and μ^φ_1 and Σ^φ_1 are the parameters of the Gaussian vector x^φ_1.\n\n\fwhich require addressing two problems: inverting a matrix which could be of infinite dimension (e.g., if a Gaussian kernel is used) and/or singular (equation (3)), and making a division by the dimension of the feature space (p in equation (5)).\n\nA general solution to circumvent these problems is to introduce an orthonormal basis U = {u^φ_1, ..., u^φ_m} for the subspace H_x of H spanned by x^φ_{1:T}. For instance, U can be obtained by computing the set of principal components with non-zero eigenvalues of x^φ_{1:T} following the procedure proposed in [6]. 
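Once the feature vectors are replaced by finite-dimensional coordinate vectors (as is done below with the basis U), equations (3)-(5) reduce to an affine least-squares fit followed by a residual-variance estimate. A minimal sketch of that computation, with hypothetical helper naming and a noise-free synthetic check:

```python
import numpy as np

def fit_linear_dynamics(z):
    """Maximum-likelihood estimates of (A, mu, sigma^2) for
    z[t] = A @ z[t-1] + mu + noise, following equations (3)-(5)
    with the feature vectors replaced by m-dimensional coordinates."""
    z = np.asarray(z, dtype=float)
    T, m = z.shape
    zp, zc = z[:-1], z[1:]          # z_{t-1} and z_t, for t = 2..T
    n = T - 1
    sp, sc = zp.sum(axis=0), zc.sum(axis=0)
    # Equation (3): centred cross-scatter times inverse centred auto-scatter.
    cross = zc.T @ zp - np.outer(sc, sp) / n
    auto = zp.T @ zp - np.outer(sp, sp) / n
    A = cross @ np.linalg.inv(auto)
    # Equation (4): offset is the mean one-step residual.
    mu = (zc - zp @ A.T).mean(axis=0)
    # Equation (5): isotropic noise variance over m * (T - 1) scalar residuals.
    resid = zc - zp @ A.T - mu
    sigma2 = (resid ** 2).sum() / (m * n)
    return A, mu, sigma2

# Noise-free check: data generated by a known model is recovered exactly.
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [-0.2, 0.8]])
mu_true = np.array([0.5, -0.3])
z = [rng.normal(size=2)]
for _ in range(50):
    z.append(A_true @ z[-1] + mu_true)
A_hat, mu_hat, s2 = fit_linear_dynamics(np.array(z))
```

On deterministic data the fit recovers A and μ up to numerical precision, with σ² near zero.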
Once such a set of vectors is available, trying to find good parameters for model (2) is equivalent to finding an m-dimensional linear dynamical model for the sequence z_{1:T} = {z_1, ..., z_T}, where z_t is the vector of coordinates of x^φ_t with respect to U, i.e.:\n\nz_t = [ ⟨x^φ_t, u^φ_1⟩  ⟨x^φ_t, u^φ_2⟩  ···  ⟨x^φ_t, u^φ_m⟩ ]′    ∀t = 1, ..., T.    (6)\n\nGiven z_{1:T}, the following linear dynamical model has to be considered:\n\nz_{t+1} = A_z z_t + μ_z + ν_z    (7)\n\nwhere ν_z is again a Gaussian noise vector of variance σ². Determining a basis of H_x makes it possible to learn the linear dynamical model (7). As it is based on the coordinates of the observed vectors x^φ_1, ..., x^φ_T with respect to the basis, it is equivalent to learning (2). The parameters are estimated thanks to equations (3), (4) and (5), where x^φ_t is replaced with z_t and p with m.\n\nFor the sake of generalization ability, it might be useful to choose A_z as simple as possible [15]. To do this, we penalize matrices A_z having large values by imposing a prior distribution p_A on A_z defined as p_A(A_z) ∝ exp(−(γ/2) trace(A_z′ A_z)), γ > 0. The computation of the maximum a posteriori values for A, μ and σ² is very similar to (3), (4) and (5), except that a few iterations of gradient ascent have to be done.\n\n2.3 Back to the Input Space: the Preimage Problem\n\nThe problem  Predicting the future observations with model (7) gives vectors in the feature space H, while vectors from the input space R^d are needed. Given a vector z^φ in H, finding a good vector x in R^d such that φ(x) is as close as possible to z^φ is known as the preimage problem.\n\nMika et al. 
[6] propose to tackle this problem by considering the optimization problem:\n\nmin_x ‖φ(x) − z^φ‖².\n\nThis problem can be solved efficiently by gradient descent techniques for Gaussian kernels. Nevertheless, it may require several optimization phases with different starting points to be run when other kernels are used (e.g., polynomial kernels of some particular degree).\n\nHere, we propose to use Support Vector Regression (SVR) to solve the preimage problem. This avoids any local minimum problem and benefits from the fact that we work with vectors from the inner product space H. In addition, using this strategy, there is no need to solve an optimization problem each time a preimage has to be computed.\n\nSVR and Preimage Learning  Given a sample dataset S = {(z_1, y_1), ..., (z_ℓ, y_ℓ)} with pairs in Z × R, the SVR algorithm assumes a structure on Z given by a kernel k_z and its associated mapping φ and feature space H (see [15]). It proceeds as follows (see [14] and [15] for further details). Given a real positive value ε, the algorithm determines a function f such that (a) it maps each z_i to a value deviating from y_i by no more than ε, and (b) it is as flat as possible. 
This function computes its output as f(z) = Σ_{i=1}^ℓ (α*_i − α_i) k_z(z_i, z) + b, where the vectors α* and α are the solutions of the problem:\n\nmax_{α*, α}  −ε′(α* + α) + y′(α* − α) − (1/2) ( (α* − α)′ K_Z (α* − α) + (1/C)(α*′α* + α′α) )\ns.t.  1′(α* − α) = 0,  α* ≥ 0,  α ≥ 0.\n\nThe vectors involved in this program are of dimension ℓ, with 1 = [1 ··· 1]′, 0 = [0 ··· 0]′, ε = [ε ··· ε]′, y = [y_1 ··· y_ℓ]′, and K_Z is the Gram matrix K_{Z,ij} = k_z(z_i, z_j). Here, ε is the parameter of Vapnik's ε-insensitive quadratic loss function and C is a user-defined constant penalizing data points which fail to meet the ε-deviation constraint.\n\nIn our case, we are interested in learning the mapping from H_x to R^d. In order to learn this mapping, we construct d (the dimension of the input space) SVR machines f_1, ..., f_d. Each f_i is trained to estimate the ith coordinate of the vector x_t given the coordinate vector z_t of x_t with respect to U. Denoting by z_u the function which maps a vector x to its coordinate vector z in U, the d machines provide the mapping:\n\nΨ : H_x → R^d,  x ↦ [f_1(z_u(x)) ··· f_d(z_u(x))]′    (8)\n\nwhich can be used to estimate the preimages. Using Ψ, and noting that the program involved in the SVR algorithm is convex, the estimation of the preimages does not suffer from any problem of local minima.\n\n3 Numerical Results\n\nIn this section we present experiments on highly nonlinear time series prediction with Kernel Dynamical Modeling. As the two series we consider are one-dimensional, we use the following setup. 
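As an illustration of the per-coordinate structure of the mapping Ψ in (8), the sketch below trains d regressors from coordinate vectors z back to input vectors x. To keep it dependency-free, each ε-SVR machine f_i is replaced by a ridge regression with an affine term standing in for the bias b; this substitution is an assumption made for illustration only, not the paper's choice of regressor.

```python
import numpy as np

def train_preimage_map(Z, X, lam=1e-6):
    """Z: (n, m) coordinates with respect to the basis U; X: (n, d) inputs.
    Returns a map Psi: R^m -> R^d built from d per-coordinate regressors
    (ridge regression here, in place of the d epsilon-SVR machines)."""
    n, m = Z.shape
    Za = np.hstack([Z, np.ones((n, 1))])   # affine column plays the role of b
    # One solve fits all d output coordinates at once (columns of W).
    W = np.linalg.solve(Za.T @ Za + lam * np.eye(m + 1), Za.T @ X)
    return lambda z: np.hstack([np.atleast_2d(z), [[1.0]]]) @ W

# Toy check: if z are exact linear "features" of x, Psi inverts them.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
B = rng.normal(size=(3, 5))                # hypothetical x -> z feature map
Z = X @ B
psi = train_preimage_map(Z, X)
x_rec = psi(Z[0])                          # reconstruction of X[0]
```

Because the regression problem is convex, the preimage estimate involves no local minima, mirroring the property claimed for the SVR-based construction.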
Each series of length T is referred to as x_{1:T}. In order to model it, we introduce an embedding dimension d and a step size κ such that vectors x_t = (x_t, x_{t−κ}, ..., x_{t−(d−1)κ})′ are considered. We compare the performances of KDM to those achieved by an SVR for nonlinear time series analysis [7, 8], where the mapping associating x_t to x_{t+κ} is learned. The hyperparameters (kernel parameter and SVR penalization constant C) are chosen with respect to the one-step prediction error measured on a test set, while the value of ε is set to 1e-4. Prediction quality is assessed on an independent validation sequence on which the root mean squared error (RMSE) is computed.\n\nTwo kinds of prediction capacity are evaluated. The first one is one-step prediction, where, after a prediction has been made, the true value is used to estimate the next time series output. The second one is multi-step or trajectory prediction, where the prediction made by the model serves as a basis for the future predictions.\n\nIn order to make a prediction for a time t > T, we suppose that we are provided with the vector x_{t−1}, which may have been observed or computed. We determine the coordinates z_{t−1} of x^φ_{t−1} with respect to U and infer the value of z_t by z_t = A_z z_{t−1} + μ_z (see equation (7)); Ψ is then used to recover an estimate of x_t (cf. equation (8)). 
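The prediction recursion just described can be sketched as a loop in the coordinate space: iterate equation (7), then map each state back with the preimage map. In the toy check below the identity stands in for Ψ (an assumption for illustration):

```python
import numpy as np

def predict_trajectory(z_last, A_z, mu_z, psi, n_steps):
    """Multi-step (trajectory) prediction: iterate z_t = A_z z_{t-1} + mu_z
    (equation (7)) and map each predicted state back to the input space
    with the preimage map psi (equation (8))."""
    preds, z = [], np.asarray(z_last, dtype=float)
    for _ in range(n_steps):
        z = A_z @ z + mu_z
        preds.append(psi(z))
    return np.array(preds)

# Toy 1-d example: contraction dynamics, identity preimage map.
A_z = np.array([[0.5]])
mu_z = np.array([1.0])
traj = predict_trajectory([0.0], A_z, mu_z, lambda z: z[0], 3)
# trajectory: [1.0, 1.5, 1.75]
```

One-step prediction is the special case n_steps = 1 restarted from the observed value at each time step.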
In all our experiments we have made the crude (yet efficient) choice of the linear kernel for k_z.\n\n\fFigure 1: (left) 100 points of the Mackey-Glass time series MG17, (right) the first 350 points of the Laser time series.\n\nTable 1: Error (RMSE) of one-step and trajectory predictions with Gaussian and polynomial kernels for the time series MG17. The regularizing values used for KDM are in subscript. The best results are italicized.\n\nAlgo.    | Gaussian 1S | Gaussian 100S | Polynomial 1S | Polynomial 100S\nSVR      | 0.0812      | 0.2361        | 0.1156        | -\nKDM_0    | 0.0864      | 0.2906        | 0.1112        | 0.2975\nKDM_0.1  | 0.0863      | 0.2893        | 0.1112        | 0.2775\nKDM_1    | 0.0859      | 0.2871        | 0.1117        | 0.2956\nKDM_10   | 0.0844      | 0.2140        | 0.1203        | 0.1964\nKDM_100  | 0.0899      | 0.1733        | 0.0970        | 0.1744\n\n3.1 Mackey-Glass Time Series Prediction\n\nThe Mackey-Glass time series comes from the modeling of the evolution of blood cell production. It is a one-dimensional signal determined by\n\ndx(t)/dt = −0.1 x(t) + 0.2 x(t − τ) / (1 + x(t − τ)¹⁰)\n\nwhich, for values of τ greater than 16.8, shows highly nonlinear chaotic behavior (see Figure 1, left).\n\nWe focus on MG17, for which τ = 17, and construct embedding vectors of size d = 6 and step size κ = 6. As x_t is used to predict x_{t+κ}, the whole dataset can be divided into six “independent” datasets, the first one S1 containing x_{1+(d−1)κ}, the second one S2, x_{2+(d−1)κ}, ..., and the sixth one S6, x_{dκ}. Learning is done as follows. The first 100 points of S1 are used for learning, while the first 100 points of S2 serve to choose the hyperparameters. 
The prediction error is measured with respect to the points in the range 201 to 300 of S1.\n\nTable 1 reports the RMSE obtained with Gaussian and polynomial kernels, where 1S and 100S respectively stand for one-step prediction and multi-step prediction over the 100 future observations.\n\nSVR one-step prediction with a Gaussian kernel gives the best RMSE. None of the tested regularizers allows KDM to perform better, even if the prediction error obtained with them is never more than 10% away from the SVR error.\n\n\fTable 2: Error (RMSE) of one-step and trajectory predictions with Gaussian and polynomial kernels for the time series Laser. The regularizing values used for KDM are in subscript.\n\nAlgo.    | Gaussian 1S | Gaussian 100S | Polynomial 1S | Polynomial 100S\nSVR      | 15.81       | 67.57         | 18.14         | 66.73\nKDM_0    | 67.95       | 416.2         | 43.92         | 68.90\nKDM_0.1  | 16.59       | 69.65         | 22.37         | 69.60\nKDM_1    | 13.96       | 70.16         | 18.13         | 70.65\nKDM_10   | 15.18       | 66.82         | 17.39         | 69.43\nKDM_100  | 18.65       | 56.53         | 17.61         | 53.84\n\nKDM trajectory prediction with a Gaussian kernel and regularizer γ = 100 leads to the best error. It is around 17% lower than that of SVR multi-step prediction, while KDM with no regularizer gives the poorest prediction, emphasizing the importance of the regularizer.\n\nRegarding one-step prediction with a polynomial kernel, there is no significant difference between the performance achieved by SVR and that of KDM when the regularizer is 0, 0.1, 1 or 10. For a regularizer γ = 100, KDM however leads to the best one-step prediction error, around 16% lower than that obtained by SVR prediction.\n\nThe dash '-' appearing in the first line of Table 1 means that the trajectory prediction made by the SVR with a polynomial kernel failed to give finite predictions. On the contrary, KDM never shows this kind of behavior. 
For a regularizer value of γ = 100, it even gives the best trajectory prediction error.\n\n3.2 Laser Time Series Prediction\n\nThe Laser time series is dataset A from the Santa Fe competition. It is a univariate time series from an experiment conducted in a physics laboratory (Figure 1, right, represents the first 350 points of the series). An embedding dimension d = 3 and a step size κ = 1 are used. The dataset is divided as follows. The first 100 points are used for training, whereas the points in the range 201 to 300 provide a test set to select hyperparameters. The validation error (RMSE) is evaluated on the points in the range 101 to 200.\n\nTable 2 reports the validation errors obtained for the two kinds of prediction. The most striking information provided by this table is the large error achieved by KDM with no regularizer when a Gaussian kernel is used. The other RMSE values, corresponding to different regularizers, underline the importance of penalizing transition matrices with large entries.\n\nBesides, when the regularizer γ is appropriately chosen, we see that KDM with a Gaussian kernel can achieve very good predictions, for one-step as well as multi-step prediction. KDM's best one-step prediction error is, however, not as far below its SVR counterpart (about 10% lower) as KDM's best multi-step error is below that of SVR (around 16% lower).\n\nWhen a polynomial kernel is used, we observe that KDM with no regularizer provides poor results with regard to the one-step prediction error. Contrary to what occurs with a Gaussian kernel, KDM with no regularization does not show bad multi-step prediction ability. The other entries of this table once again show that KDM can give very good predictions when a well-suited regularizer is chosen. 
Hence, we notice that the best multi-step prediction error of KDM is above 19% better than that obtained by SVR multi-step prediction.\n\n\f4 Discussion\n\n4.1 Another Way of Choosing the Parameters\n\nThe introduction of a basis U allows finding the parameters of KDM without computing any inversion of infinite-dimensional matrices or division by the dimension of H. There is, however, a more elegant way to find these parameters when σ² is assumed to be known. In this case, equation (5) need not be considered any longer. Considering the prior p_A(A^φ) ∝ exp(−(γ/(2σ²)) trace(A^φ′ A^φ)), for a user-defined γ, the maximum a posteriori value for A^φ is obtained as:\n\nA^φ = ( Σ_{t=2}^T x^φ_t x^φ_{t−1}′ − (1/(T−1)) Σ_{t=2}^T x^φ_t Σ_{t=2}^T x^φ_{t−1}′ ) ( γI + Σ_{t=2}^T x^φ_{t−1} x^φ_{t−1}′ − (1/(T−1)) Σ_{t=2}^T x^φ_{t−1} Σ_{t=2}^T x^φ_{t−1}′ )⁻¹.\n\nIntroducing the matrix X^φ = [x^φ_1 ··· x^φ_T], the T-dimensional vectors f := [0 1 ··· 1]′ and g := [1 ··· 1 0]′, and the T × T matrix P = (P_ij) = (δ_{i,j+1}) defining J = P − f g′/(T − 1) and M = diag(g) − g g′/(T − 1), A^φ can be rewritten as\n\nA^φ = ( X^φ J X^φ′ )( γI + X^φ M X^φ′ )⁻¹ = (1/γ) X^φ J [ I − (1/γ) K M ( I + (1/γ) M K M )⁻¹ M ] X^φ′\n\nthanks to the Sherman-Woodbury formula, K being the Gram matrix associated with x^φ_{1:T}. 
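The Sherman-Woodbury rewriting can be checked numerically on a small instance with an explicit (linear) feature map, so that X^φ, K, J and M are concrete matrices. A sketch of such a check, with arbitrary dimensions chosen for illustration:

```python
import numpy as np

# Numerical check that the kernelized (Woodbury) expression of A^phi
# matches the direct regularized solution (gamma I + X M X')^{-1} form.
rng = np.random.default_rng(2)
T, p, gamma = 8, 4, 0.7
X = rng.normal(size=(p, T))              # columns are x^phi_1 .. x^phi_T
K = X.T @ X                              # Gram matrix of the sequence
g = np.array([1.0] * (T - 1) + [0.0])    # g = [1 ... 1 0]'
f = np.array([0.0] + [1.0] * (T - 1))    # f = [0 1 ... 1]'
P = np.eye(T, k=-1)                      # P_ij = delta_{i,j+1}
J = P - np.outer(f, g) / (T - 1)
M = np.diag(g) - np.outer(g, g) / (T - 1)  # idempotent centering matrix
I_p, I_T = np.eye(p), np.eye(T)

# Direct MAP solution: (X J X') (gamma I + X M X')^{-1}.
A_direct = (X @ J @ X.T) @ np.linalg.inv(gamma * I_p + X @ M @ X.T)
# Woodbury form: only Gram-matrix quantities inside the bracket.
inner = I_T - (K @ M / gamma) @ np.linalg.inv(I_T + M @ K @ M / gamma) @ M
A_woodbury = (X @ J @ inner @ X.T) / gamma
```

The identity relies on M being idempotent (M² = M), which holds since M centers the first T − 1 coordinates; the two expressions agree to machine precision.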
It is thus possible to directly determine the matrix A^φ when σ² is known; the same holds for μ^φ, since equation (4) remains unchanged.\n\n4.2 Link to Kalman Filtering\n\nThe usual way to recover a noisy nonlinear signal is to use the Extended Kalman Filter (EKF) or the Unscented Kalman Filter (UKF) [4]. The use of these algorithms involves two steps. First, the clean dynamics, as given by h in equation (1), is learned by a regressor, e.g., a multilayer perceptron. Given a noisy time series from the same driving process h, EKF and UKF then process that series, respectively by a first-order linearization of h and by an efficient 'sampling' method, to determine the clean signal. Apart from these essential approximations, the core of EKF and UKF resembles that of classical Kalman filtering (and smoothing).\n\nGiven the ability of KDM to learn a complex dynamics, it could be used directly to model the process h. In addition, its matrix formulation is suited to the traditional matrix computations involved in the filtering task (see [5, 11] for details). Such a link between KDM and Kalman filtering has been the purpose of [9, 10], where a nonlinear Kalman filter based on the use of kernels is proposed: the ability of the proposed model to address the modeling of nonlinear dynamics is demonstrated, while the classical procedures (even the EM algorithm) associated with linear dynamical systems remain valid.\n\n5 Conclusion and Future Work\n\nThree main results are presented: first, we introduce KDM, a kernel extension of linear dynamical models, and show how the kernel trick allows a linear model to be learned in the feature space associated with a kernel. Second, an original and efficient solution based on learning is applied to the preimage problem. 
Third, the Kernel Dynamical Model can be linked to the Kalman filter model, with a hidden state process living in the feature space.\n\nIn the framework of time series prediction, KDM proves to work very well and to compete with the best time series predictors, particularly for long-range prediction.\n\nTo conclude, this work can lead to several future directions. All classic tasks involving a dynamic setting, such as filtering/prediction (e.g., tracking) and smoothing (e.g., time series denoising), can be tackled by our approach and have to be tested. As pointed out by [9, 10], the kernel approach can also be applied to linear dynamical models with hidden states to provide a kernelized version of the Kalman filter, in particular allowing the implementation of an exact nonlinear EM procedure (involving closed-form equations, as in the method proposed by [3]). Besides, the use of kernels opens the door to dealing with structured data, making KDM a very attractive tool in many areas such as bioinformatics, text and video applications. Lastly, from the theoretical point of view, a very interesting issue is that of the actual input-space noise corresponding to a Gaussian noise in a feature space.\n\nReferences\n\n[1] B. Boser, I. Guyon, and V. Vapnik. A Training Algorithm for Optimal Margin Classifiers. In Proc. of the 5th Annual Workshop on Comp. Learning Theory, volume 5, 1992.\n\n[2] G. Dorffner. Neural networks for time series processing. Neural Network World, 6(4):447-468, 1996.\n\n[3] Z. Ghahramani and S. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 599-605. MIT Press, 1999.\n\n[4] S. Julier and J. Uhlmann. A New Extension of the Kalman Filter to Nonlinear Systems. In Int. Symp. Aerospace/Defense Sensing, Simul. and Controls, 1997.\n\n[5] R. E. Kalman. 
A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME - Journal of Basic Engineering, 82(Series D):35-45, 1960.\n\n[6] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and De-Noising in Feature Spaces. In NIPS. MIT Press, 1999.\n\n[7] S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In Proc. of IEEE NNSP'97, 1997.\n\n[8] K. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting Time Series with Support Vector Machines. In W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicoud, editors, Artificial Neural Networks - ICANN'97, pages 999-1004. Springer, 1997.\n\n[9] L. Ralaivola. Modélisation et apprentissage de concepts et de systèmes dynamiques. PhD thesis, Université Paris 6, France, 2003.\n\n[10] L. Ralaivola and F. d'Alché-Buc. Filtrage de Kalman non linéaire à l'aide de noyaux. In Actes du 19ème Symposium GRETSI sur le traitement du signal et des images, 2003.\n\n[11] A.-V. I. Rosti and M. J. F. Gales. Generalised linear Gaussian models. Technical Report CUED/F-INFENG/TR.420, Cambridge University Engineering Department, 2001.\n\n[12] S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11(2):305-345, 1999.\n\n[13] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.\n\n[14] A. Smola and B. Schölkopf. A Tutorial on Support Vector Regression. Technical Report NC2-TR-1998-030, NeuroCOLT2, 1998.\n\n[15] V. Vapnik. Statistical Learning Theory. 
John Wiley and Sons, Inc., 1998.\n\f", "award": [], "sourceid": 2516, "authors": [{"given_name": "Liva", "family_name": "Ralaivola", "institution": null}, {"given_name": "Florence", "family_name": "d'Alch\u00e9-Buc", "institution": ""}]}