{"title": "Inter-domain Gaussian Processes for Sparse Inference using Inducing Features", "book": "Advances in Neural Information Processing Systems", "page_first": 1087, "page_last": 1095, "abstract": "We present a general inference framework for inter-domain Gaussian Processes (GPs), focusing on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n^3) to O(m^2n), where m << n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency.", "full_text": "Inter-domain Gaussian Processes for\n\nSparse Inference using Inducing Features\n\nMiguel L\u00b4azaro-Gredilla and An\u00b4\u0131bal R. Figueiras-Vidal\n\nDep. Signal Processing & Communications\nUniversidad Carlos III de Madrid, SPAIN\n\n{miguel,arfv}@tsc.uc3m.es\n\nAbstract\n\nWe present a general inference framework for inter-domain Gaussian Processes\n(GPs) and focus on its usefulness to build sparse GP models. 
The state-of-the-art\nsparse GP model introduced by Snelson and Ghahramani in [1] relies on finding\na small, representative pseudo data set of m elements (from the same domain as\nthe n available data elements) which is able to explain existing data well, and\nthen uses it to perform inference. This reduces inference and model selection\ncomputation time from O(n³) to O(m²n), where m ≪ n. Inter-domain GPs can\nbe used to find a (possibly more compact) representative set of features lying in a\ndifferent domain, at the same computational cost. Being able to specify a different\ndomain for the representative features makes it possible to incorporate prior knowledge\nabout relevant characteristics of the data and decouples the functional forms of the covariance\nand basis functions. We show how previously existing models fit into this\nframework and use it to develop two new sparse GP models. Tests on large,\nrepresentative regression data sets suggest that significant improvement can be\nachieved, while retaining computational efficiency.\n\n1 Introduction and previous work\n\nOver the past decade there has been a growing interest in the application of Gaussian Processes\n(GPs) to machine learning tasks. GPs are probabilistic non-parametric Bayesian models that combine a number of attractive characteristics: they achieve state-of-the-art performance on supervised\nlearning tasks, provide probabilistic predictions, have a simple and well-founded model selection\nscheme, present no overfitting (since parameters are integrated out), etc.\nUnfortunately, the direct application of GPs to regression problems (with which we will be concerned here) is limited by their O(n³) training time. To overcome this limitation, several\nsparse approximations have been proposed [2, 3, 4, 5, 6]. 
In most of them, sparsity is achieved by\nprojecting all available data onto a smaller subset of size m ≪ n (the active set), which is selected\naccording to some specific criterion. This reduces computation time to O(m²n). However, active\nset selection interferes with hyperparameter learning, due to its non-smooth nature (see [1, 3]).\nThese proposals have been superseded by the Sparse Pseudo-inputs GP (SPGP) model, introduced\nin [1]. In this model, the constraint that the samples of the active set (which are called pseudo-inputs) must be selected among the training data is relaxed, allowing them to lie anywhere in the input\nspace. This allows both pseudo-inputs and hyperparameters to be selected in a joint continuous\noptimisation and increases flexibility, resulting in much superior performance.\nIn this work we introduce Inter-Domain GPs (IDGPs) as a general tool to perform inference across\ndomains. This removes the constraint that the pseudo-inputs must remain within the same\ndomain as the input data. The added flexibility results in increased performance and makes it possible to encode\nprior knowledge about other domains where the data can be represented more compactly.\n\n2 Review of GPs for regression\n\nWe will briefly state here the main definitions and results for regression with GPs. See [7] for a\ncomprehensive review.\nAssume we are given a training set with n samples D ≡ {x_j, y_j}_{j=1}^n, where each D-dimensional\ninput x_j is associated with a scalar output y_j. 
The regression task goal is, given a new input x∗, to predict\nthe corresponding output y∗ based on D.\nThe GP regression model assumes that the outputs can be expressed as some noiseless latent function\nplus independent noise, y = f(x) + ε, and then sets a zero-mean¹ GP prior on f(x), with covariance\nk(x, x′), and a zero-mean Gaussian prior on ε, with variance σ² (the noise power hyperparameter).\nThe covariance function encodes prior knowledge about the smoothness of f(x). The most common\nchoice for it is the Automatic Relevance Determination Squared Exponential (ARD SE):\n\nk(x, x′) = σ₀² exp[−(1/2) Σ_{d=1}^D (x_d − x′_d)² / ℓ_d²],   (1)\n\nwith hyperparameters σ₀² (the latent function power) and {ℓ_d}_{d=1}^D (the length-scales, defining how\nrapidly the covariance decays along each dimension). It is referred to as ARD SE because, when\ncoupled with a model selection method, non-informative input dimensions can be removed automatically by growing the corresponding length-scale. The set of hyperparameters that define the GP is\nθ = {σ², σ₀², {ℓ_d}_{d=1}^D}. We will omit the dependence on θ for the sake of clarity.\nIf we evaluate the latent function at X = {x_j}_{j=1}^n, we obtain a set of latent variables following a\njoint Gaussian distribution p(f|X) = N(f|0, Kff), where [Kff]_ij = k(x_i, x_j). Using this model\nit is possible to express the joint distribution of training and test cases and then condition on the\nobserved outputs to obtain the predictive distribution for any test case:\n\np_GP(y∗|x∗, D) = N(y∗ | k_f∗⊤ (Kff + σ²I_n)⁻¹ y, σ² + k∗∗ − k_f∗⊤ (Kff + σ²I_n)⁻¹ k_f∗),   (2)\n\nwhere y = [y₁, . . . , y_n]⊤, k_f∗ = [k(x₁, x∗), . . . , k(x_n, x∗)]⊤, and k∗∗ = k(x∗, x∗). I_n is used to\ndenote the identity matrix of size n. 
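The covariance (1) and predictive distribution (2) can be sketched numerically; the synthetic data, hyperparameter values and function names below are illustrative, not taken from the paper:

```python
import numpy as np

def ard_se(X1, X2, s0sq=1.0, ell=None):
    """ARD SE covariance (1): k(x, x') = s0^2 exp(-0.5 sum_d (x_d - x'_d)^2 / ell_d^2)."""
    ell = np.ones(X1.shape[1]) if ell is None else np.asarray(ell, dtype=float)
    diff = (X1[:, None, :] - X2[None, :, :]) / ell
    return s0sq * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_predict(X, y, Xs, s0sq, ell, noise):
    """Full-GP predictive mean/variance (2); O(n^3) via Cholesky of Kff + sigma^2 I."""
    Kff = ard_se(X, X, s0sq, ell) + noise * np.eye(len(X))
    Kfs = ard_se(X, Xs, s0sq, ell)
    L = np.linalg.cholesky(Kff)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Kfs.T @ alpha                              # kf*^T (Kff + s^2 I)^-1 y
    v = np.linalg.solve(L, Kfs)
    var = noise + s0sq - np.sum(v ** 2, axis=0)       # s^2 + k** - kf*^T (Kff + s^2 I)^-1 kf*
    return mean, var

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu, var = gp_predict(X, y, np.array([[0.0]]), s0sq=1.0, ell=[1.0], noise=0.01)
```

Near x∗ = 0 the posterior mean is close to sin(0) = 0 and the predictive variance is well below the prior variance σ² + σ₀², as expected from (2).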
The O(n³) cost of these equations arises from the inversion of\nthe n × n covariance matrix. Predictive distributions for additional test cases take O(n²) time each.\nThese costs make standard GPs impractical for large data sets.\nTo select the hyperparameters θ, Type-II Maximum Likelihood (ML-II) is commonly used. This\namounts to selecting the hyperparameters that correspond to a (possibly local) maximum of the\nlog-marginal likelihood, also called log-evidence.\n\n¹We follow the common approach of subtracting the sample mean from the outputs and then assuming a\nzero-mean model.\n\n3 Inter-domain GPs\n\nIn this section we will introduce Inter-Domain GPs (IDGPs) and show how they can be used as a\nframework for computationally efficient inference. Then we will use this framework to express two\nprevious relevant models and develop two new ones.\n\n3.1 Definition\n\nConsider a real-valued GP f(x) with x ∈ R^D and some deterministic real function g(x, z), with\nz ∈ R^H. We define the following transformation:\n\nu(z) = ∫_{R^D} f(x) g(x, z) dx.   (3)\n\nThere are many examples of transformations that take on this form, the Fourier transform being\none of the best known. We will discuss possible choices for g(x, z) in Section 3.3; for the moment\nwe will deal with the general form. Since u(z) is obtained by a linear transformation of the GP f(x),\nit is also a GP. This new GP may lie in a different domain of possibly different dimension. This\ntransformation is not invertible in general, its properties being defined by g(x, z).\nIDGPs arise when we jointly consider f(x) and u(z) as a single, “extended” GP. The mean and\ncovariance function of this extended GP are overloaded to accept arguments from both the input and\ntransformed domains and treat them accordingly. 
We refer to each version of an overloaded function\nas an instance, which will accept a different type of arguments. If the distribution of the original GP\nis f(x) ∼ GP(m(x), k(x, x′)), then it is possible to compute the remaining instances that define the\ndistribution of the extended GP over both domains. The transformed-domain instance of the mean\nis\n\nm(z) = E[u(z)] = ∫_{R^D} E[f(x)] g(x, z) dx = ∫_{R^D} m(x) g(x, z) dx.\n\nThe inter-domain and transformed-domain instances of the covariance function are:\n\nk(x, z′) = E[f(x)u(z′)] = E[f(x) ∫_{R^D} f(x′) g(x′, z′) dx′] = ∫_{R^D} k(x, x′) g(x′, z′) dx′   (4)\n\nk(z, z′) = E[u(z)u(z′)] = E[∫_{R^D} f(x) g(x, z) dx ∫_{R^D} f(x′) g(x′, z′) dx′] = ∫_{R^D} ∫_{R^D} k(x, x′) g(x, z) g(x′, z′) dx dx′.   (5)\n\nThe mean m(·) and covariance function k(·, ·) are therefore defined both by the values and domains of\ntheir arguments. This can be seen as if each argument had an additional domain indicator used to\nselect the instance. Apart from that, they define a regular GP, and all standard properties hold. In\nparticular k(a, b) = k(b, a). This approach is related to [8], but here the latent space is defined as\na transformation of the input space, and not the other way around. This makes it possible to pre-specify the\ndesired input-domain covariance. The transformation is also more general: any g(x, z) can be used.\nWe can sample an IDGP at n input-domain points f = [f₁, f₂, . . . , f_n]⊤ (with f_j = f(x_j)) and m\ntransformed-domain points u = [u₁, u₂, . . . , u_m]⊤ (with u_i = u(z_i)). 
With the usual assumption\nof f(x) being a zero-mean GP and defining Z = {z_i}_{i=1}^m, the joint distribution of these samples is:\n\np([f; u] | X, Z) = N([f; u] | 0, [Kff, Kfu; Kfu⊤, Kuu]),   (6)\n\nwith [Kff]_pq = k(x_p, x_q), [Kfu]_pq = k(x_p, z_q), [Kuu]_pq = k(z_p, z_q),\n\nwhich makes it possible to perform inference across domains. We will only be concerned with one input\ndomain and one transformed domain, but IDGPs can be defined for any number of domains.\n\n3.2 Sparse regression using inducing features\n\nIn the standard regression setting, we are asked to perform inference about the latent function f(x)\nfrom a data set D lying in the input domain. Using IDGPs, we can use data from any domain to\nperform inference in the input domain. Some latent functions might be better defined by a set of\ndata lying in some transformed space rather than in the input space. This idea is used for sparse\ninference.\nFollowing [1] we introduce a pseudo data set, but here we place it in the transformed domain:\nD = {Z, u}. The following derivation is analogous to that of SPGP. We will refer to Z as the\ninducing features and u as the inducing variables. 
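The joint covariance in (6) can be assembled numerically for a concrete choice of g. The 1-D sketch below uses a Gaussian-window feature (an illustrative choice; the constants ELL and C are assumed values) and approximates the integrals in (4) and (5) by quadrature; the resulting joint matrix must be a valid (symmetric, positive semi-definite) covariance:

```python
import numpy as np

ELL, C = 1.0, 1.5   # assumed input-domain length-scale and window width

def k(x, xp):
    """Input-domain SE covariance (sigma0^2 = 1 for simplicity)."""
    return np.exp(-0.5 * (x - xp) ** 2 / ELL ** 2)

def g(x, z):
    """Illustrative feature extraction function: Gaussian window centred at z."""
    return np.exp(-0.5 * (x - z) ** 2 / C ** 2) / np.sqrt(2 * np.pi * C ** 2)

# Quadrature grid for the integrals in (4) and (5).
xs = np.linspace(-10.0, 10.0, 2001)
dx = xs[1] - xs[0]
Kgrid = k(xs[:, None], xs[None, :])     # k(x', x'') on the grid, reused by (5)

def k_fu(x, z):
    """Inter-domain instance (4): int k(x, x') g(x', z) dx'."""
    return np.sum(k(x, xs) * g(xs, z)) * dx

def k_uu(z, zp):
    """Transformed-domain instance (5): double integral over the grid."""
    return g(xs, z) @ Kgrid @ g(xs, zp) * dx * dx

X = np.array([-1.0, 0.0, 1.0])          # input-domain sample locations
Z = np.array([-0.5, 0.5])               # transformed-domain (inducing) locations
Kff = k(X[:, None], X[None, :])
Kfu = np.array([[k_fu(x, z) for z in Z] for x in X])
Kuu = np.array([[k_uu(z, zp) for zp in Z] for z in Z])
joint = np.block([[Kff, Kfu], [Kfu.T, Kuu]])   # the joint covariance of (6)
```

For this particular g, the integral in (4) has the closed form √(ℓ²/(ℓ²+c²)) exp(−(x−z)²/(2(ℓ²+c²))) (a standard Gaussian convolution identity), which the quadrature reproduces.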
The key approximation leading to sparsity is to\nset m ≪ n and assume that f(x) is well-described by the pseudo data set D, so that any two samples\n(either from the training or test set) f_p and f_q with p ≠ q will be independent given x_p, x_q and D.\nWith this simplifying assumption², the prior over f can be factorised as a product of marginals:\n\np(f | X, Z, u) ≈ Π_{j=1}^n p(f_j | x_j, Z, u).   (7)\n\nMarginals are in turn obtained from (6): p(f_j | x_j, Z, u) = N(f_j | k_j Kuu⁻¹ u, λ_j), where k_j is the j-th\nrow of Kfu and λ_j is the j-th element of the diagonal of the matrix Λf = diag(Kff − Kfu Kuu⁻¹ Kuf).\nOperator diag(·) sets all off-diagonal elements to zero, so that Λf is a diagonal matrix.\nSince p(u|Z) is readily available and also Gaussian, the inducing variables can be integrated out\nfrom (7), yielding a new, approximate prior over f(x):\n\np(f | X, Z) = ∫ p(f, u | X, Z) du ≈ ∫ Π_{j=1}^n p(f_j | x_j, Z, u) p(u|Z) du = N(f | 0, Kfu Kuu⁻¹ Kuf + Λf).\n\nUsing this approximate prior, the posterior distribution for a test case is:\n\np_IDGP(y∗ | x∗, D, Z) = N(y∗ | k_u∗⊤ Q⁻¹ Kfu⊤ Λy⁻¹ y, σ² + k∗∗ + k_u∗⊤ (Q⁻¹ − Kuu⁻¹) k_u∗),   (8)\n\nwhere we have defined Q = Kuu + Kfu⊤ Λy⁻¹ Kfu and Λy = Λf + σ²I_n. The distribution (2)\nis approximated by (8) with the information available in the pseudo data set. \n\n²Alternatively, (7) can be obtained by proposing a generic factorised form for the approximate conditional p(f | X, Z, u) ≈ q(f | X, Z, u) = Π_{j=1}^n q_j(f_j | x_j, Z, u) and then choosing the set of functions {q_j(·)}_{j=1}^n so as to minimise the Kullback-Leibler (KL) divergence from the exact joint prior\nKL(p(f | X, Z, u) p(u|Z) || q(f | X, Z, u) p(u|Z)), as noted in [9], Section 2.3.6.\n
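The sparse predictive equations (8) can be sketched as follows. The sanity check at the end exploits the known fact that with delta features and Z = X (the SPGP/FITC special case of Section 3.3.1, with as many inducing points as data) the approximation becomes exact, so (8) must reproduce the full-GP predictive (2); the kernel choice and values are illustrative:

```python
import numpy as np

def se_kernel(A, B, ell=1.0):
    """1-D SE kernel matrix (sigma0^2 = 1, illustrative)."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

def idgp_predict(Kfu, Kuu, kff_diag, kus, kss, y, noise):
    """Sparse predictive distribution (8).
    Kfu: n x m cross-covariances k(x_j, z_i); Kuu: m x m; kff_diag: diag of Kff;
    kus: m-vector k(z_i, x*); kss: k(x*, x*); noise: sigma^2."""
    m = Kuu.shape[0]
    Kuu_inv = np.linalg.inv(Kuu + 1e-10 * np.eye(m))          # jitter for stability
    lam_f = kff_diag - np.sum((Kfu @ Kuu_inv) * Kfu, axis=1)  # diag(Kff - Kfu Kuu^-1 Kuf)
    lam_y = lam_f + noise                                     # Lambda_y = Lambda_f + sigma^2 I
    A = Kfu / lam_y[:, None]                                  # Lambda_y^-1 Kfu
    Q = Kuu + Kfu.T @ A                                       # Q = Kuu + Kuf Lambda_y^-1 Kfu
    Q_inv = np.linalg.inv(Q)
    mean = kus @ Q_inv @ (A.T @ y)                            # k_u*^T Q^-1 Kuf Lambda_y^-1 y
    var = noise + kss + kus @ (Q_inv - Kuu_inv) @ kus
    return mean, var

# Delta features with Z = X: Kfu = Kuu = Kff, so (8) must equal the full GP (2).
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.sin(X)
noise, xstar = 0.1, 0.5
Kff = se_kernel(X, X)
ksf = se_kernel(np.array([xstar]), X)[0]
mean_s, var_s = idgp_predict(Kff, Kff, np.ones(len(X)), ksf, 1.0, y, noise)
C = Kff + noise * np.eye(len(X))
mean_full = ksf @ np.linalg.solve(C, y)
var_full = noise + 1.0 - ksf @ np.linalg.solve(C, ksf)
```

Here Λf collapses to zero, so the heteroscedastic correction vanishes and the degenerate-GP part carries all the information, exactly as the text describes.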
After O(m²n) time\nprecomputations, predictive means and variances can be computed in O(m) and O(m²) time per\ntest case, respectively. This model is, in general, non-stationary, even when it is approximating a\nstationary input-domain covariance, and can be interpreted as a degenerate GP plus heteroscedastic\nwhite noise.\nThe log-marginal likelihood (or log-evidence) of the model, explicitly including the conditioning on\nthe kernel hyperparameters θ, can be expressed as\n\nlog p(y | X, Z, θ) = −(1/2) [y⊤Λy⁻¹y − y⊤Λy⁻¹ Kfu Q⁻¹ Kfu⊤ Λy⁻¹ y + log(|Q||Λy|/|Kuu|) + n log(2π)],\n\nwhich is also computable in O(m²n) time.\nModel selection will be performed by jointly optimising the evidence with respect to the hyperparameters and the inducing features. If analytical derivatives of the covariance function are available,\nconjugate gradient optimisation can be used with O(m²n) cost per step.\n\n3.3 On the choice of g(x, z)\n\nThe feature extraction function g(x, z) defines the transformed domain in which the pseudo data set\nlies. According to (3), the inducing variables can be seen as projections of the target function f(x)\nonto the feature extraction functions over the whole input space. Therefore, each of them summarises\ninformation about the behaviour of f(x) everywhere. The inducing features Z define the concrete set\nof functions over which the target function will be projected. It is desirable that this set captures the\nmost significant characteristics of the function. This can be achieved either by using prior knowledge\nabout the data to select {g(x, z_i)}_{i=1}^m or by using a very general family of functions and letting model\nselection automatically choose the appropriate set.\nAnother way to choose g(x, z) relies on the form of the posterior. The posterior mean of a GP is\noften thought of as a linear combination of “basis functions”. 
For full GPs and other approximations\nsuch as [1, 2, 3, 4, 5, 6], the basis functions must have the form of the input-domain covariance function.\nWhen using IDGPs, the basis functions have the form of the inter-domain instance of the covariance\nfunction, and can therefore be adjusted by choosing g(x, z), independently of the input-domain\ncovariance function.\nIf two feature extraction functions g(·, ·) and h(·, ·) can be related by g(x, z) = h(x, z)r(z) for some\nfunction r(·), then both yield the same sparse GP model. This property can be used to simplify the\nexpressions of the instances of the covariance function.\nIn this work we use the same functional form for every feature, i.e. our function set is {g(x, z_i)}_{i=1}^m,\nbut it is also possible to use sets with different functional forms for each inducing feature, i.e.\n{g_i(x, z_i)}_{i=1}^m, where each z_i may even have a different size (dimension). In the sections below\nwe will discuss different possible choices for g(x, z).\n\n3.3.1 Relation with Sparse GPs using pseudo-inputs\n\nThe sparse GP using pseudo-inputs (SPGP) was introduced in [1] and was later renamed the Fully\nIndependent Training Conditional (FITC) model to fit in the systematic framework of [10]. Since\nthe sparse model introduced in Section 3.2 also uses a fully independent training conditional, we\nwill stick to the first name to avoid possible confusion.\nThe innovation of IDGPs with respect to SPGP consists in letting the pseudo data set lie in a different domain. If we set g_SPGP(x, z) ≡ δ(x − z), where δ(·) is a Dirac delta, we force the pseudo data set to\nlie in the input domain. Thus there is no longer a transformed space and the original SPGP model is\nretrieved. In this setting, the inducing features of the IDGP play the role of SPGP's pseudo-inputs.\n\n3.3.2 Relation with Sparse Multiscale GPs\n\nSparse Multiscale GPs (SMGPs) are presented in [11]. 
Seeking to generalise the SPGP model with the\nARD SE covariance function, they propose to use a different set of length-scales for each basis\nfunction. The resulting model presents a defective variance that is healed by adding heteroscedastic\nwhite noise. SMGPs, including the variance improvement, can be derived in a principled way as\nIDGPs:\n\ng_SMGP(x, z) ≡ [1 / Π_{d=1}^D √(2π(c_d² − ℓ_d²))] exp[−Σ_{d=1}^D (x_d − µ_d)² / (2(c_d² − ℓ_d²))],  with z = [µ; c]   (9)\n\nk_SMGP(x, z′) = exp[−Σ_{d=1}^D (x_d − µ′_d)² / (2c′_d²)] Π_{d=1}^D √(ℓ_d²/c′_d²)   (10)\n\nk_SMGP(z, z′) = exp[−Σ_{d=1}^D (µ_d − µ′_d)² / (2(c_d² + c′_d² − ℓ_d²))] Π_{d=1}^D √(ℓ_d²/(c_d² + c′_d² − ℓ_d²)).   (11)\n\nWith this approximation, each basis function has its own centre µ = [µ₁, µ₂, . . . , µ_D]⊤ and its\nown length-scales c = [c₁, c₂, . . . , c_D]⊤, whereas the global length-scales {ℓ_d}_{d=1}^D are shared by all\ninducing features. Equations (10) and (11) are derived from (4) and (5) using (1) and (9). The\nintegrals defining k_SMGP(·, ·) converge if and only if c_d² ≥ ℓ_d² ∀d, which suggests that other values,\neven if permitted in [11], should be avoided for the model to remain well defined.\n\n3.3.3 Frequency Inducing Features GP\n\nIf the target function can be described more compactly in the frequency domain than in the input\ndomain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue\nthat possibility for the case where the input-domain covariance is the ARD SE. 
We will call the\nresulting sparse model the Frequency Inducing Features GP (FIFGP).\nDirectly applying the Fourier transform is not possible because the target function is not square\nintegrable (it has constant power σ₀² everywhere, so (5) does not converge). We will work around\nthis by windowing the target function in the region of interest. It is possible to use a square window,\nbut this results in the covariance being defined in terms of the complex error function, which is very\nslow to evaluate. Instead, we will use a Gaussian window³. Since multiplying by a Gaussian in\nthe input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be\nworking with a blurred version of the frequency space. This model is defined by:\n\ng_FIF(x, z) ≡ [1 / Π_{d=1}^D √(2πc_d²)] exp[−Σ_{d=1}^D x_d²/(2c_d²)] cos(ω₀ + Σ_{d=1}^D x_d ω_d),  with z = ω   (12)\n\nk_FIF(x, z′) = exp[−Σ_{d=1}^D (x_d² + c_d² ℓ_d² ω′_d²)/(2(c_d² + ℓ_d²))] cos(ω′₀ + Σ_{d=1}^D c_d² ω′_d x_d/(c_d² + ℓ_d²)) Π_{d=1}^D √(ℓ_d²/(c_d² + ℓ_d²))   (13)\n\nk_FIF(z, z′) = exp[−Σ_{d=1}^D c_d² ℓ_d² (ω_d² + ω′_d²)/(2(2c_d² + ℓ_d²))] (exp[−Σ_{d=1}^D c_d⁴ (ω_d − ω′_d)²/(2(2c_d² + ℓ_d²))] cos(ω₀ − ω′₀) + exp[−Σ_{d=1}^D c_d⁴ (ω_d + ω′_d)²/(2(2c_d² + ℓ_d²))] cos(ω₀ + ω′₀)) Π_{d=1}^D √(ℓ_d²/(2c_d² + ℓ_d²)).   (14)\n\n³A mixture of m Gaussians could also be used as window without increasing the 
complexity order.\n\nThe inducing features are ω = [ω₀, ω₁, . . . , ω_D]⊤, where ω₀ is the phase and the remaining components are frequencies along each dimension. In this model, both the global length-scales {ℓ_d}_{d=1}^D and the\nwindow length-scales {c_d}_{d=1}^D are shared, thus c′_d = c_d. Instances (13) and (14) are induced by (12)\nusing (4) and (5).\n\n3.3.4 Time-Frequency Inducing Features GP\n\nInstead of using a single window to select the region of interest, it is possible to use a different\nwindow for each feature. We will use windows of the same size but different centres. The resulting model combines SPGP and FIFGP, so we will call it the Time-Frequency Inducing Features\nGP (TFIFGP). It is defined by g_TFIF(x, z) ≡ g_FIF(x − µ, ω), with z = [µ⊤ ω⊤]⊤. The implied\ninter-domain and transformed-domain instances of the covariance function are:\n\nk_TFIF(x, z′) = k_FIF(x − µ′, ω′),   k_TFIF(z, z′) = k_FIF(z, z′) exp[−Σ_{d=1}^D (µ_d − µ′_d)²/(2(2c_d² + ℓ_d²))].\n\nFIFGP is trivially obtained by setting every centre to zero {µ_i = 0}_{i=1}^m, whereas SPGP is obtained\nby setting the window length-scales c, frequencies and phases {ω_i}_{i=1}^m to zero. If the window length-scales were individually adjusted, SMGP would be obtained.\nWhile TFIFGP has the modelling power of both FIFGP and SPGP, it might perform worse in practice due to it having roughly twice as many hyperparameters, thus making the optimisation problem\nharder. The same problem also exists in SMGP. A possible workaround is to initialise the hyperparameters using a simpler model, as done in [11] for SMGP, though we will not do this here.\n
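The feature extraction functions of Sections 3.3.2 and 3.3.3 can be written down directly. The 1-D sketch below also checks the closed-form inter-domain instance (13) against direct numerical integration of (4); the parameter values are illustrative, not from the paper:

```python
import numpy as np

ELL = 1.0   # assumed global length-scale ell
C = 1.5     # assumed window length-scale c

def g_smgp(x, mu, c):
    """Gaussian feature (9), 1-D: recovers SMGP (requires c^2 > ell^2)."""
    v = c ** 2 - ELL ** 2
    return np.exp(-0.5 * (x - mu) ** 2 / v) / np.sqrt(2 * np.pi * v)

def g_fif(x, w0, w):
    """Windowed cosine feature (12), 1-D: recovers FIFGP."""
    win = np.exp(-0.5 * x ** 2 / C ** 2) / np.sqrt(2 * np.pi * C ** 2)
    return win * np.cos(w0 + w * x)

def k_fif_xz(x, w0, w):
    """Closed-form inter-domain instance (13), 1-D case with sigma0^2 = 1."""
    s2 = C ** 2 + ELL ** 2
    amp = np.sqrt(ELL ** 2 / s2)
    return (amp * np.exp(-(x ** 2 + C ** 2 * ELL ** 2 * w ** 2) / (2 * s2))
            * np.cos(w0 + C ** 2 * w * x / s2))

# Quadrature of (4) with k(x, x') = exp(-(x - x')^2 / (2 ell^2)).
xs = np.linspace(-15.0, 15.0, 3001)
dx = xs[1] - xs[0]
x, w0, w = 0.7, 0.3, 1.2
quad = np.sum(np.exp(-0.5 * (x - xs) ** 2 / ELL ** 2) * g_fif(xs, w0, w)) * dx
```

The agreement between `quad` and `k_fif_xz` illustrates how each inter-domain instance is induced by its feature extraction function through (4).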
4 Experiments\n\nIn this section we will compare the proposed approximations FIFGP and TFIFGP with the current\nstate of the art, SPGP, on some large data sets, for the same number of inducing features/inputs\nand therefore roughly equal computational cost. Additionally, we provide results using a full GP,\nwhich is expected to provide top performance (though requiring an impractically large amount of\ncomputation). In all cases, the (input-domain) covariance function is the ARD SE (1).\nWe use four large data sets: Kin-40k, Pumadyn-32nm⁴ (describing the dynamics of a robot arm,\nused with SPGP in [1]), Elevators and Pole Telecomm⁵ (related to the control of the elevators of an\nF16 aircraft and a telecommunications problem, and used in [12, 13, 14]). Input dimensions that\nremained constant throughout the training set were removed. Input data was additionally centred for\nuse with FIFGP (the remaining methods are translation invariant). Pole Telecomm outputs actually\ntake discrete values in the 0-100 range, in multiples of 10. This was taken into account by using the\ncorresponding quantization noise variance (10²/12) as a lower bound for the noise hyperparameter⁶.\nHyperparameters are initialised as follows: σ₀² = (1/n) Σ_{j=1}^n y_j², σ² = σ₀²/4, and {ℓ_d}_{d=1}^D to one half of\nthe range spanned by the training data along each dimension. For SPGP, pseudo-inputs are initialised\nto a random subset of the training data; for FIFGP, the window size c is initialised to the standard\ndeviation of the input data, frequencies are randomly chosen from a zero-mean ℓ_d⁻²-variance Gaussian\ndistribution, and phases are obtained from a uniform distribution in [0, 2π). TFIFGP uses the\nsame initialisation as FIFGP, with window centres set to zero. 
Final values are selected by evidence\nmaximisation.\nDenoting the output average over the training set as ȳ and the predictive mean and variance for test\nsample y∗l as µ∗l and σ²∗l respectively, we define the following quality measures: Normalized Mean\nSquare Error (NMSE) ⟨(y∗l − µ∗l)²⟩/⟨(y∗l − ȳ)²⟩ and Mean Negative Log-Probability (MNLP)\n(1/2)⟨(y∗l − µ∗l)²/σ²∗l + log σ²∗l + log 2π⟩, where ⟨·⟩ averages over the test set.\n\n⁴Kin-40k: 8 input dimensions, 10000/30000 samples for train/test; Pumadyn-32nm: 32 input dimensions,\n7168/1024 samples for train/test, using exactly the same preprocessing and train/test splits as [1, 3]. Note that\ntheir error measure is actually one half of the Normalized Mean Square Error defined here.\n\n⁵Pole Telecomm: 26 non-constant input dimensions, 10000/5000 samples for train/test. Elevators:\n17 non-constant input dimensions, 8752/7847 samples for train/test. Both have been downloaded from\nhttp://www.liaad.up.pt/~ltorgo/Regression/datasets.html\n\n⁶If unconstrained, similar plots are obtained; in particular, no overfitting is observed.\n\nFor Kin-40k (Fig. 1, top), all three sparse methods perform similarly, though for high sparseness\n(the most useful case) FIFGP and TFIFGP are slightly superior. In Pumadyn-32nm (Fig. 1, bottom),\nonly 4 out of the 32 input dimensions are relevant to the regression task, so it can be used as an ARD\ncapabilities test. We follow [1] and use a full GP on a small subset of the training data (1024 data\npoints) to obtain the initial length-scales. 
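The NMSE and MNLP measures defined above can be computed as follows (a minimal sketch; the function names are mine, and ybar is the training-set output average as in the text):

```python
import numpy as np

def nmse(y_test, mu, ybar):
    """Normalized Mean Square Error: <(y - mu)^2> / <(y - ybar)^2>."""
    return np.mean((y_test - mu) ** 2) / np.mean((y_test - ybar) ** 2)

def mnlp(y_test, mu, var):
    """Mean Negative Log-Probability of the Gaussian predictive distributions."""
    return 0.5 * np.mean((y_test - mu) ** 2 / var + np.log(var) + np.log(2 * np.pi))

y = np.array([1.0, 2.0, 3.0])
```

A perfect predictor gives NMSE = 0, while simply predicting the training mean gives NMSE ≈ 1; MNLP additionally penalises over- or under-confident predictive variances.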
This allows better minima to be found during optimisation.\nThough all methods are able to properly find a good solution, FIFGP and especially TFIFGP are\nbetter in the sparser regime. Roughly the same considerations can be made about Pole Telecomm\nand Elevators (Fig. 2), but in these data sets the superiority of FIFGP and TFIFGP is more dramatic.\nThough not shown here, we have additionally tested these models on smaller, overfitting-prone data\nsets, and have found no noticeable overfitting even using m > n, despite the relatively high number\nof parameters being adjusted. This is in line with the results and discussion of [1].\n\nFigure 1: Performance of the compared methods on Kin-40k and Pumadyn-32nm. (a) Kin-40k NMSE (log-log plot); (b) Kin-40k MNLP (semilog plot); (c) Pumadyn-32nm NMSE (log-log plot); (d) Pumadyn-32nm MNLP (semilog plot).\n\n5 Conclusions and extensions\n\nIn this work we have introduced IDGPs, which are able to combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different\ndomain. This provides a general framework for sparse models, which are defined by a feature extraction\nfunction. Using this framework, SMGPs can be reinterpreted as fully principled models using a\ntransformed space of local features, without any need for post-hoc variance improvements. 
Furthermore, it is possible to develop new sparse models of practical use, such as the proposed FIFGP and\nTFIFGP, which are able to outperform the state-of-the-art SPGP on some large data sets, especially\nin high-sparsity regimes.\n\nFigure 2: Performance of the compared methods on Elevators and Pole Telecomm. (a) Elevators NMSE (log-log plot); (b) Elevators MNLP (semilog plot); (c) Pole Telecomm NMSE (log-log plot); (d) Pole Telecomm MNLP (semilog plot).\n\nChoosing a transformed space for the inducing features makes it possible to use domains where the target\nfunction can be expressed more compactly, or where the evidence (which is a function of the features) is easier to optimise. This added flexibility translates into a decoupling of the functional form of\nthe input-domain covariance and the set of basis functions used to express the posterior mean.\nIDGPs approximate full GPs optimally in the KL sense noted in Section 3.2, for a given set of\ninducing features. Using ML-II to select the inducing features means that models providing a good\nfit to data are given preference over models that might approximate the full GP more closely. This,\nthough rarely, might lead to harmful overfitting. 
To more faithfully approximate the full GP and\navoid overfitting altogether, our proposal can be combined with the variational approach from [15],\nin which the inducing features would be regarded as variational parameters. This would result in\nmore constrained models, which would be closer to the full GP but might show reduced performance.\nWe have explored the case of regression with Gaussian noise, which is analytically tractable, but it\nis straightforward to apply the same model to other tasks such as robust regression or classification,\nusing approximate inference (see [16]). Also, IDGPs as a general tool can be used for other purposes,\nsuch as modelling noise in the frequency domain, aggregating data from different domains or even\nimposing constraints on the target function.\n\nAcknowledgments\n\nWe would like to thank the anonymous referees for helpful comments and suggestions. This work\nhas been partly supported by the Spanish government under grant TEC2008-02473/TEC, and by\nthe Madrid Community under grant S-505/TIC/0223.\n\nReferences\n\n[1] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural\nInformation Processing Systems 18, pages 1259–1266. MIT Press, 2006.\n\n[2] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. 
In Advances in Neural Information\nProcessing Systems 13, pages 619–625. MIT Press, 2001.\n\n[3] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian\nprocess regression. In Proceedings of the 9th International Workshop on AI Stats, 2003.\n\n[4] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.\n\n[5] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641–669, 2002.\n\n[6] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances\nin Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.\n\n[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.\n\n[8] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In\nAdvances in Neural Information Processing Systems 21, pages 57–64, 2009.\n\n[9] E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, University\nof Cambridge, 2007.\n\n[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process\nregression. Journal of Machine Learning Research, 6:1939–1959, 2005.\n\n[11] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. In 25th International Conference on Machine Learning. ACM Press, New York, 2008.\n\n[12] G. Potgieter and A. P. Engelbrecht. Evolving model trees for mining data sets with continuous-valued\nclasses. Expert Systems with Applications, 35:1513–1532, 2007.\n\n[13] L. Torgo and J. Pinto da Costa. Clustered partial linear regression. In Proceedings of the 11th European\nConference on Machine Learning, pages 426–436. Springer, 2000.\n\n[14] G. Potgieter and A. P. Engelbrecht. 
Pairwise classification as an ensemble technique. In Proceedings of\nthe 13th European Conference on Machine Learning, pages 97–110. Springer-Verlag, 2002.\n\n[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of\nthe 12th International Workshop on AI Stats, 2009.\n\n[16] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, 2008.\n", "award": [], "sourceid": 537, "authors": [{"given_name": "Miguel", "family_name": "L\u00e1zaro-Gredilla", "institution": ""}, {"given_name": "An\u00edbal", "family_name": "Figueiras-Vidal", "institution": ""}]}