Sparse Representation for Gaussian Process Models

Advances in Neural Information Processing Systems, pp. 444-450

Lehel Csató and Manfred Opper
Neural Computing Research Group
School of Engineering and Applied Sciences
B4 7ET Birmingham, United Kingdom
{csatol,opperm}@aston.ac.uk

Abstract

We develop an approach for a sparse representation of Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.

1 Introduction

Gaussian processes (GP) [1; 15] provide promising non-parametric tools for modelling real-world statistical problems. Like other kernel based methods, e.g. Support Vector Machines (SVMs) [13], they combine the high flexibility of working in high (often infinite) dimensional feature spaces with the simplicity that all operations are "kernelized", i.e. they are performed in the (lower dimensional) input space using positive definite kernels.

An important advantage of GPs over other non-Bayesian models is the explicit probabilistic formulation of the model. This not only provides the modeller with (Bayesian) confidence intervals (for regression) or posterior class probabilities (for classification) but also immediately opens the possibility of treating other nonstandard data models (e.g. quantum inverse statistics [4]).
Unfortunately, the drawback of GP models (which was originally apparent in SVMs as well, but has since been overcome [6]) lies in the huge increase of the computational cost with the number of training data. This seems to preclude applications of GPs to large datasets.

This paper presents an approach to overcome this problem. It is based on a combination of an online learning approach requiring only a single sweep through the data and a method to reduce the number of parameters representing the model.

Making use of the proposed parametrisation, the method extracts a subset of the examples, and the prediction relies only on these basis vectors (BV). The memory requirement of the algorithm thus scales only with the size of this set. Experiments with real-world datasets confirm the good performance of the proposed method.¹

¹ A different approach for dealing with large datasets was suggested by V. Tresp [12]. His method is based on splitting the data set into smaller subsets and training individual GP predictors on each of them. The final prediction is achieved by a specific weighting of the individual predictors.

2 Gaussian Process Models

GPs belong to the Bayesian non-parametric models, in which likelihoods are parametrised by a Gaussian stochastic process (random field) a(x), indexed by the continuous input variable x. The prior knowledge about a is expressed in the prior mean and the covariance given by the kernel K_0(x, x') = Cov(a(x), a(x')) [14; 15]. In the following, only zero-mean GP priors are used.

In supervised learning the process a(x) is used as a latent variable in the likelihood P(y|a(x)), which denotes the probability of output y given the input x. Based on a set of input-output pairs (x_n, y_n) with x_n ∈ R^m and y_n ∈ R (n = 1, ..., N), the Bayesian learning method computes the posterior distribution of the process a(x) using the prior and the likelihood [14; 15; 3].

Although the prior is a Gaussian process, the posterior process usually is not Gaussian (except for the special case of regression with Gaussian noise).
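As a concrete illustration of a zero-mean GP prior (our own sketch, not code from the paper), the following numpy fragment draws sample paths by Cholesky-factorising the Gram matrix of a kernel K_0; the squared-exponential kernel and all parameter values are arbitrary choices for illustration:

```python
import numpy as np

def rbf_kernel(x, xp, lengthscale=0.2, variance=1.0):
    """Squared-exponential covariance K0(x, x') = v * exp(-(x - x')^2 / (2 l^2))."""
    d = x[:, None] - xp[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
K0 = rbf_kernel(x, x)                 # prior covariance (Gram matrix) on a grid
jitter = 1e-6 * np.eye(len(x))        # small diagonal term for numerical stability
L = np.linalg.cholesky(K0 + jitter)
samples = L @ rng.standard_normal((len(x), 3))   # three draws from the zero-mean prior
```

Each column of `samples` is one realisation of the latent process a(x) on the grid; the posterior process discussed next updates exactly this mean and covariance.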
Nevertheless, various approaches have been introduced recently to approximate the posterior averages [11; 9]. Our approach is based on the idea of approximating the true posterior process p{a} by a Gaussian process q{a} which is fully specified by a covariance kernel K_t(x, x') and a posterior mean ⟨a(x)⟩_t, where t is the number of training data processed by the algorithm so far. Such an approximation could be formulated within the variational approach, where q is chosen such that the relative entropy D(q, p) = E_q ln(q/p) is minimal [9]. However, in this formulation the expectation is over the approximate process q rather than over p. It seems intuitively better to minimise the other KL divergence, D(p, q) = E_p ln(p/q), because there the expectation is over the true distribution. Unfortunately, such a computation is generally not possible. The following online approach can be understood as an approximation to this task.

3 Online learning for Gaussian Processes

In this section we briefly review the main idea of the Bayesian online approach (see e.g. [5]) applied to GP models. We process the training data sequentially, one example after the other. Assume we have a Gaussian approximation q_t to the posterior process at time t. We use the next example, t + 1, to update the posterior via Bayes' rule:

    p̂(a) = P(y_{t+1} | a(x_{t+1})) q_t(a) / ⟨P(y_{t+1} | a(x_{t+1}))⟩_t

Since the resulting posterior p̂ is non-Gaussian, we project it to the closest Gaussian process q which minimises the KL divergence D(p̂, q). Note that now the new approximation q is on the "correct" side of the KL divergence. The minimisation can be performed exactly, leading to a match of the means and covariances of p̂ and q.
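The minimiser of D(p̂, q) over Gaussians q is exactly the Gaussian with the same first two moments as p̂. For a single new example this reduces to a computation on the one-dimensional marginal at the new input; a minimal numpy sketch of such moment matching (our illustration, with a sigmoid classification likelihood chosen purely as an example):

```python
import numpy as np

def match_moments(likelihood, m, v, n_grid=4001, half_width=8.0):
    """Match the first two moments of p(a) ∝ P(y|a) N(a; m, v) on a grid.
    This is the exact minimiser of D(p, q) over Gaussians q, computed via
    a one-dimensional integral over the marginal at the new input."""
    s = np.sqrt(v)
    a = np.linspace(m - half_width * s, m + half_width * s, n_grid)
    da = a[1] - a[0]
    prior = np.exp(-0.5 * (a - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)
    p = likelihood(a) * prior
    Z = p.sum() * da                          # the normaliser <P(y|a)>
    new_mean = (a * p).sum() * da / Z
    new_var = ((a - new_mean) ** 2 * p).sum() * da / Z
    return new_mean, new_var

# sigmoid likelihood for a positive class label, as an illustrative example
mean, var = match_moments(lambda a: 1.0 / (1.0 + np.exp(-a)), m=0.0, v=1.0)
```

Observing a positive label pulls the matched mean above the prior mean and shrinks the variance below the prior variance, as expected of a posterior update.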
Since p̂ is much less complex than the full posterior, it is possible to write down the changes in the first two moments analytically [2]:

    ⟨a(x)⟩_{t+1} = ⟨a(x)⟩_t + b1 K_t(x, x_{t+1})                               (1)

    K_{t+1}(x, x') = K_t(x, x') + b2 K_t(x, x_{t+1}) K_t(x_{t+1}, x')          (2)

where the scalar coefficients b1 and b2 are the first and second derivatives of the log-averaged likelihood,

    b1 = ∂/∂⟨a(x_{t+1})⟩_t ln ⟨P(y_{t+1} | a(x_{t+1}))⟩_t

    b2 = ∂²/∂⟨a(x_{t+1})⟩_t² ln ⟨P(y_{t+1} | a(x_{t+1}))⟩_t

with the averaging performed with respect to the marginal Gaussian distribution of the process variable a at input x_{t+1}. Note that this yields a one-dimensional integral! Derivatives are taken with respect to the mean ⟨a(x_{t+1})⟩_t.

is the projection to the linear span of the basis {
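For regression with Gaussian noise of variance s2, the averaged likelihood is itself Gaussian, so the coefficients are analytic: b1 = (y - m) / (v + s2) and b2 = -1 / (v + s2), where m and v are the current marginal mean and variance at the new input. A numpy sketch of a single sweep of updates (1)-(2) restricted to the training inputs (our illustration, not the paper's code; the kernel and noise level are arbitrary choices). In this Gaussian special case each update is exact conditioning, so the sweep reproduces the batch GP posterior:

```python
import numpy as np

def online_gp_regression(X, y, kernel, noise_var):
    """Single-sweep online GP updates, eqs. (1)-(2), for Gaussian noise,
    tracked on the training inputs only."""
    n = len(X)
    K = kernel(X, X)          # posterior covariance, initialised to the prior
    mean = np.zeros(n)        # posterior mean, initialised to the zero prior mean
    for t in range(n):
        k = K[:, t]           # K_t(., x_{t+1}) on the training inputs
        v = K[t, t]           # current marginal variance at x_{t+1}
        b1 = (y[t] - mean[t]) / (v + noise_var)   # d/dm  ln N(y; m, v + s2)
        b2 = -1.0 / (v + noise_var)               # d2/dm2 ln N(y; m, v + s2)
        mean = mean + b1 * k                      # eq. (1)
        K = K + b2 * np.outer(k, k)               # eq. (2)
    return mean, K

kern = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / 0.2 ** 2)
X = np.linspace(0.0, 1.0, 10)
y = np.sin(2.0 * np.pi * X)
post_mean, post_cov = online_gp_regression(X, y, kern, noise_var=0.01)
```

For non-Gaussian likelihoods b1 and b2 instead come from the one-dimensional integral above, and the single sweep is only an approximation; the sparse parametrisation developed in the paper keeps the memory cost of such sweeps bounded by the basis-vector set.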