{"title": "Using the Equivalent Kernel to Understand Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1313, "page_last": 1320, "abstract": null, "full_text": "Using the Equivalent Kernel to Understand Gaussian Process Regression\n\nPeter Sollich, Dept of Mathematics, King's College London, Strand, London WC2R 2LS, UK. peter.sollich@kcl.ac.uk\nChristopher K. I. Williams, School of Informatics, University of Edinburgh, 5 Forrest Hill, Edinburgh EH1 2QL, UK. c.k.i.williams@ed.ac.uk\n\nAbstract\n\nThe equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.\n\nConsider the supervised regression problem for a dataset D with entries (x_i, y_i) for i = 1, ..., n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x is given by\n\n    f̄(x) = k^T(x) (K + σ²I)^{-1} y,    (1)\n\nwhere K denotes the n × n matrix of covariances between the training points with entries k(x_i, x_j), k(x) is the vector of covariances with entries k(x_i, x), σ² is the noise variance on the observations and y is an n × 1 vector holding the training targets. See e.g. [2] for further details.\n\nWe can define a vector of functions h(x) = (K + σ²I)^{-1} k(x). Thus we have f̄(x) = h^T(x) y, making it clear that the mean prediction at a point x is a linear combination of the target values y. Gaussian process regression is thus a linear smoother; see [3, section 2.8] for further details. For a fixed test point x, h(x) gives the vector of weights applied to the targets y.
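As a concrete illustration, equation 1 and the weight function h(x) can be computed in a few lines. The following NumPy sketch uses an SE kernel; the data, lengthscale and noise values are our own illustrative choices, not from the paper:

```python
import numpy as np

def se_kernel(a, b, ell=0.1):
    """Squared exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(0)
n = 50
x_train = np.sort(rng.uniform(-1, 1, n))
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(n)
sigma2 = 0.01                                   # noise variance (illustrative)

K = se_kernel(x_train, x_train)                 # n x n covariance matrix
x_star = 0.3                                    # test point
k_star = se_kernel(x_train, np.array([x_star]))[:, 0]

# Weight function h(x) = (K + sigma^2 I)^{-1} k(x); the predictive mean
# f(x) = h(x)^T y is a linear combination of the training targets.
h = np.linalg.solve(K + sigma2 * np.eye(n), k_star)
f_mean = h @ y_train
```

Plotting `h` against `x_train` reproduces the kind of weight-function curve shown in Figure 1 below.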
Silverman [1] called h^T(x) the weight function.\n\nUnderstanding the form of the weight function is made complicated by the matrix inversion of K + σ²I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation, one can consider the observations to be \"smeared out\" in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing, Silverman [1] called the idealized weight function the equivalent kernel (EK).\n\nThe structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how to use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK.\n\n\f\n1 Gaussian Process Regression and the Equivalent Kernel\n\nIt is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional\n\n    J[f] = (1/2)||f||²_H + (1/(2σ²)) Σ_{i=1}^n (y_i - f(x_i))²,    (2)\n\nwhere ||f||_H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.)\n\nLet η(x) = E[y|x] be the target function for our regression problem and write E[(y - f(x))²] = E[(y - η(x))²] + (η(x) - f(x))². The fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2,\n\n    J_ρ[f] = (ρ/(2σ²)) ∫ (η(x) - f(x))² dx + (1/2)||f||²_H,\n\nwhere ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate).
If we consider kernels that are stationary, k(x, x′) = k(x - x′), the natural basis in which to analyse equation 1 is the Fourier basis of complex sinusoids, so that f(x) is represented as ∫ f̃(s) e^{2πis·x} ds and similarly for η(x). Thus we obtain\n\n    J_ρ[f] = (ρ/(2σ²)) ∫ |f̃(s) - η̃(s)|² ds + (1/2) ∫ (|f̃(s)|²/S(s)) ds,\n\nas ||f||²_H = ∫ (|f̃(s)|²/S(s)) ds, where S(s) is the power spectrum of the kernel k, S(s) = ∫ k(x) e^{-2πis·x} dx. J_ρ[f] can be minimized using calculus of variations to obtain f̃(s) = S(s)η̃(s)/(σ²/ρ + S(s)), which is recognized as the convolution f(x) = ∫ h(x - x′)η(x′) dx′. Here the Fourier transform of the equivalent kernel h(x) is\n\n    h̃(s) = S(s)/(S(s) + σ²/ρ) = 1/(1 + σ²/(ρS(s))).    (3)\n\nThe term σ²/ρ in the first expression for h̃(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g., [5, §14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform, the analysis above should be interpreted as computing the equivalent kernel for np(x) = ρ. This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel.\n\n2 The EK for the Squared Exponential and Related Kernels\n\nFor certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e^{-|x|} (see [5, p. 326]), splines in D = 1 corresponding to the regularizer ||Pf||² = ∫ (f^{(m)})² dx [1, 6], and the regularizer ||Pf||² = ∫ (∇²f)² dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7].\n\nWe now consider the commonly used squared exponential (SE) kernel k(r) = exp(-r²/(2ℓ²)), where r² = ||x - x′||². (This is sometimes called the Gaussian or radial basis function kernel.)
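Equation 3 can be evaluated numerically for any spectrum. The sketch below (our own illustration; the lengthscale, noise and density values are arbitrary) computes the 1-D EK of the SE kernel by quadrature of the inverse Fourier integral:

```python
import numpy as np

ell, sigma2, rho = 0.1, 0.1, 100.0   # lengthscale, noise variance, data density

def S(s):
    """Power spectrum of the SE kernel in D = 1."""
    return np.sqrt(2 * np.pi * ell ** 2) * np.exp(-2 * np.pi ** 2 * ell ** 2 * s ** 2)

s = np.linspace(-60, 60, 4001)
# Equation 3, written as h~ = rho S / (rho S + sigma^2) for numerical stability
h_tilde = rho * S(s) / (rho * S(s) + sigma2)

# Inverse Fourier transform by quadrature: h(x) = int h_tilde(s) e^{2 pi i s x} ds;
# h_tilde is even, so only the cosine part survives.
x = np.linspace(-0.5, 0.5, 201)
ds = s[1] - s[0]
h = (h_tilde[None, :] * np.cos(2 * np.pi * s[None, :] * x[:, None])).sum(axis=1) * ds
```

Since h̃(0) ≈ 1 for large ρ, the resulting h(x) integrates to approximately one, and its peak height is close to 2s_c, consistent with the step-function picture developed next.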
Its Fourier transform is given by S(s) = (2πℓ²)^{D/2} exp(-2π²ℓ²|s|²), where D denotes the dimensionality of x (and s) space.\n\n\f\nFrom equation 3 we obtain\n\n    h̃_SE(s) = 1/(1 + b exp(2π²ℓ²|s|²)),\n\nwhere b = σ²/(ρ(2πℓ²)^{D/2}). We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small s = |s| we have that h̃_SE ≈ 1, but for large s it is approximately 0. The change takes place around the point s_c where b exp(2π²ℓ²s_c²) = 1, i.e. s_c² = log(1/b)/(2π²ℓ²). As exp(2π²ℓ²s²) grows quickly with s, the transition of h̃_SE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function.\n\nProposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D dimensions is given by\n\n    h_SE(r) = (s_c/r)^{D/2} J_{D/2}(2πs_c r).\n\nProof: h̃_SE(s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give\n\n    h_SE(r) = 2πr^{-ν} ∫_0^∞ s^{ν+1} J_ν(2πrs) h̃_SE(s) ds    (4)\n           ≈ 2πr^{-ν} ∫_0^{s_c} s^{ν+1} J_ν(2πrs) ds = (s_c/r)^{D/2} J_{D/2}(2πs_c r),\n\nwhere ν = D/2 - 1, J_ν(z) is a Bessel function of the first kind and we have used the identity z^{ν+1} J_ν(z) = (d/dz)[z^{ν+1} J_{ν+1}(z)].\n\nNote that in D = 1, by computing the Fourier transform of the boxcar function, we obtain h_SE(x) = 2s_c sinc(2πs_c x), where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J_{1/2}(z) = (2/(πz))^{1/2} sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below.\n\nNotice that s_c scales as (log(ρ))^{1/2}, so that the width of the EK (which is proportional to 1/s_c) will decay very slowly as ρ increases. In contrast, for a spline of order m (with power spectrum ∝ |s|^{-2m}) the width of the EK scales as ρ^{-1/2m} [1].\n\nIf instead of R^D we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction k_p(x, x′) = Σ_{n∈Z} k(x - x′ + 2πn).
This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel).\n\nWe now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s_c suggests writing h_SE(r) = (2πs_c)^D g(2πs_c r). Then from equation 4 and using the definition of s_c,\n\n    g(z) = 2π(2πs_c)^{ν-D} z^{-ν} ∫_0^∞ s^{ν+1} J_ν(zs/s_c) {1 + exp[2π²ℓ²(s² - s_c²)]}^{-1} ds\n         = (2π)^{-D/2} z^{-ν} ∫_0^∞ u^{ν+1} J_ν(zu) {1 + exp[2π²ℓ²s_c²(u² - 1)]}^{-1} du    (5)\n\nwhere we have rescaled s = s_c u in the second step. The value of s_c, and hence ρ, now enters only in the exponential via a = 2π²ℓ²s_c². For a → ∞, the exponential tends to zero for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(...)] is therefore a step function θ(1 - u) in the limit, and Proposition 1 becomes exact, with g_∞(z) ≡ lim_{a→∞} g(z) = (2πz)^{-D/2} J_{D/2}(z). To calculate corrections to this, one uses that for large but finite a the difference Δ(u) = {1 + exp[a(u² - 1)]}^{-1} - θ(1 - u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give\n\n    g(z) = g_∞(z) + (2π)^{-D/2} z^{-ν} Σ_{k=0}^∞ (I_k/k!) [d^k/du^k (u^{ν+1} J_ν(zu))]_{u=1},    I_k = ∫_0^∞ Δ(u)(u - 1)^k du.\n\nThe problem is thus reduced to calculating the integrals I_k. Setting u = 1 + v/a one has\n\n    a^{k+1} I_k = -∫_{-a}^0 [1 - 1/(1 + exp(v²/a + 2v))] v^k dv + ∫_0^∞ v^k/(1 + exp(v²/a + 2v)) dv\n               = ∫_0^a (-1)^{k+1} v^k/(1 + exp(-v²/a + 2v)) dv + ∫_0^∞ v^k/(1 + exp(v²/a + 2v)) dv.\n\nIn the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I_0 = c_0/a² and I_1 = c_1/a², while all I_k with k ≥ 2 are smaller by at least a factor 1/a².
The numerical constants are -c_0 = c_1 = π²/24. This gives, using (d/dz)[z^{ν+1}J_ν(z)] = z^ν J_ν(z) + z^{ν+1}J_{ν-1}(z) = (2ν + 1)z^ν J_ν(z) - z^{ν+1}J_{ν+1}(z):\n\nProposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by h_SE(r) = (2πs_c)^D g(2πs_c r) with\n\n    g(z) = (2πz)^{-D/2} { J_{D/2}(z) + (z/a²)[(c_0 + c_1(D - 1)) J_{D/2-1}(z) - c_1 z J_{D/2}(z)] } + O(1/a⁴).\n\nFor e.g. D = 1 this becomes g(z) = π^{-1}{sin(z)/z - (π²/(24a²))[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a² correction with the leading order term, one estimates that the correction is of relative size z²/a². It will therefore provide a useful improvement as long as z = 2πs_c r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order.\n\n2.1 Accuracy of the approximation\n\nTo evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in R^D with a sampling density ρ_grid. For making predictions at the grid points we obtain the smoother matrix K(K + σ²_grid I)^{-1}, where¹ σ²_grid = σ²ρ_grid/ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. Note that in theory one should use a grid over the whole of R^D, but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest, as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects.\n\n    ¹To understand this scaling of σ²_grid, consider the case where ρ_grid > ρ, which means that the effective variance at each of the grid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out.
This can be understood by imagining the situation where there are ρ_grid/ρ independent Gaussian observations with variance σ²_grid at a single x-point; this would be equivalent to one Gaussian observation with variance σ². In effect the ρ observations per unit x-space have been smoothed out uniformly.\n\n\f\nFigure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 10⁴ (right).\n\nFigure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of ℓ² = 0.004 and σ² = 0.1, with ρ_grid/ρ = 5/3. To reduce edge effects, the interval [-3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less \"ringing\" than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK.
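The grid-based numerical EK just described can be reproduced in a few lines. This sketch (our own code, using the parameter values quoted above for Figure 1) extracts the central row of the smoother matrix and compares it with the sinc approximation of Proposition 1:

```python
import numpy as np

ell2, sigma2, rho = 0.004, 0.1, 100.0      # ell^2, sigma^2, rho as in Figure 1
rho_grid = rho * 5.0 / 3.0                 # grid density, rho_grid / rho = 5/3
sigma2_grid = sigma2 * rho_grid / rho      # rescaled noise variance (footnote 1)

N = 501
x = np.linspace(-1.5, 1.5, N)              # grid over [-3/2, 3/2]
dx = x[1] - x[0]                           # grid spacing ~ 1/rho_grid
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell2))

# Smoother matrix K (K + sigma_grid^2 I)^{-1}; each row is the response to a
# target vector with a single 1, i.e. a numerical approximation to the EK.
smoother = K @ np.linalg.inv(K + sigma2_grid * np.eye(N))
ek = smoother[N // 2] / dx                 # rescale central row to a density

# Sinc approximation from Proposition 1 in D = 1: h(x) = 2 s_c sinc(2 pi s_c x)
b = sigma2 / (rho * np.sqrt(2 * np.pi * ell2))
sc = np.sqrt(np.log(1.0 / b) / (2 * np.pi ** 2 * ell2))
sinc_approx = 2 * sc * np.sinc(2 * sc * x)  # np.sinc(t) = sin(pi t)/(pi t)
```

Plotting `ek` and `sinc_approx` together reproduces the kind of agreement seen in Figure 1.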
The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.07, left) and ρ = 10⁴ (a = 9.67, right). As expected, the addition of the 1/a² correction gives better agreement with the numerical result for z < a. Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/log(ρ). This is larger than the naive estimate (1/a²)² ∝ 1/(log(ρ))⁴ based on the first correction term from Proposition 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down.\n\n2.2 Other kernels\n\nOur analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff s_c on the range of s in the EK via σ²/ρ = S(s_c), so that h̃(s) = [1 + S(s_c)/S(s)]^{-1}. The EK will then have the limiting form given in Proposition 1 if h̃(s) approaches a step function θ(s_c - s), i.e. if it becomes infinitely \"steep\" around the point s = s_c for s_c → ∞. A quantitative criterion for this is that the slope |h̃′(s_c)| should become much larger than 1/s_c, the inverse of the range of the step function. Since h̃′(s) = S′(s)S(s_c)S^{-2}(s)[1 + S(s_c)/S(s)]^{-2}, this is equivalent to requiring that -s_c S′(s_c)/(4S(s_c)) ∝ -d log S(s_c)/d log s_c must diverge for s_c → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s.\n\nA trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales ℓ; the asymptotic behaviour of s_c is then governed by the smallest ℓ. A less obvious case is the \"rational quadratic\" k(r) = [1 + (r/ℓ)²]^{-(D+1)/2}, which has an exponentially decaying power spectrum S(s) ∝ exp(-2πℓs).
(This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(-r/ℓ).) Proposition 1 then applies, with the width of the EK now scaling as 1/s_c ∝ 1/log(ρ).\n\nThe previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p(ℓ) of lengthscales ℓ, k(r) = ∫ exp(-r²/(2ℓ²)) p(ℓ) dℓ. This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D; see [9, §2.10]. Such a kernel has power spectrum\n\n    S(s) = (2π)^{D/2} ∫_0^∞ ℓ^D exp(-2π²ℓ²s²) p(ℓ) dℓ,    (6)\n\nand one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(-2πℓ_0 s), is obtained for p(ℓ) ∝ ℓ^{-D-2} exp(-ℓ_0²/(2ℓ²)). More generally, because the exponential factor in equation 6 acts like a cutoff for ℓ > 1/s, one estimates S(s) ≈ ∫_0^{1/s} ℓ^D p(ℓ) dℓ for large s. This will decay more strongly than any power of 1/s for s → ∞ if p(ℓ) itself decreases more strongly than any power of ℓ for ℓ → 0. Any such choice of p(ℓ) will therefore yield a kernel to which Proposition 1 applies.\n\n3 Understanding GP Learning Using the Equivalent Kernel\n\nWe now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η, we obtain the average MSE ε as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves.\n\nTo understand the asymptotic behaviour of ε for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by f_EK(x) = ∫ h(x - x′)y(x′) dx′ in the continuum limit of \"smoothed out\" input locations.
We assume as before that y = target + noise, i.e. y(x) = η(x) + ν(x), where E[ν(x)ν(x′)] = (σ*²/ρ)δ(x - x′). Here σ*² denotes the true noise variance, as opposed to the noise variance σ² assumed in the EK; the scaling of σ*² with ρ is explained in footnote 1. For a fixed target η, the MSE is ε = (∫ dx)^{-1} ∫ [η(x) - f_EK(x)]² dx. Averaging over the noise process ν and the target function η gives in Fourier space\n\n    ε = ∫ { S_η(s)[1 - h̃(s)]² + (σ*²/ρ)h̃²(s) } ds = ∫ [ (σ⁴/ρ²) S_η(s)/S²(s) + σ*²/ρ ] / [1 + σ²/(ρS(s))]² ds    (7)\n\nwhere S_η(s) is the power spectrum of the prior over target functions. In the case S_η(s) = S(s) and σ*² = σ², where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error ε_B and simplifies to ε_B = (σ²/ρ) ∫ [1 + σ²/(ρS(s))]^{-1} ds (see also [5, eq. 14-16]). Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10]\n\n\f\nFigure 2: Left: plot of the asymptotic form of the EK (s_c/r)J_1(2πs_c r) for D = 2 and ρ = 1225. Right: log-log plot of ε against log(ρ) for the OU and Matern-class processes (α = 2, 4 respectively). The dashed lines have gradients of -1/2 and -3/2 which are the predicted rates.\n\non the MSE of standard GP prediction from finite datasets.
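As a numerical check (our own sketch, not from the paper), the Bayes error ε_B and its hard-cutoff estimate 2σ²s_c/ρ can be compared directly for an SE kernel in D = 1; the parameter values are illustrative:

```python
import numpy as np

ell, sigma2 = 0.1, 0.1

def S(s):
    """Power spectrum of the SE kernel in D = 1."""
    return np.sqrt(2 * np.pi * ell ** 2) * np.exp(-2 * np.pi ** 2 * ell ** 2 * s ** 2)

def bayes_error(rho, smax=100.0, n=20001):
    """eps_B = int (sigma^2/rho) [1 + sigma^2/(rho S(s))]^{-1} ds by quadrature."""
    s = np.linspace(-smax, smax, n)
    # algebraically identical integrand, stable where S(s) underflows to 0
    integrand = sigma2 * S(s) / (rho * S(s) + sigma2)
    return integrand.sum() * (s[1] - s[0])

def cutoff_approx(rho):
    """Hard-cutoff estimate 2 sigma^2 s_c / rho, with b exp(2 pi^2 ell^2 s_c^2) = 1."""
    b = sigma2 / (rho * np.sqrt(2 * np.pi * ell ** 2))
    sc = np.sqrt(np.log(1.0 / b) / (2 * np.pi ** 2 * ell ** 2))
    return 2 * sigma2 * sc / rho
```

For ρ between 10² and 10⁴ the two agree to within a few percent, and both decay essentially as (log ρ)^{1/2}/ρ.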
In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression.\n\nTreating the denominator in the expression for ε_B again as a hard cutoff at s = s_c, which is justified for large ρ, one obtains for an SE target and learner ε ≈ 2σ²s_c/ρ ∝ (log(ρ))^{D/2}/ρ. To get analogous predictions for the mismatched case, one can write equation 7 as\n\n    ε = (σ*²/ρ) ∫ { [1 + σ²/(ρS(s))] - σ²/(ρS(s)) } / [1 + σ²/(ρS(s))]² ds + ∫ S_η(s)/[ρS(s)/σ² + 1]² ds.\n\nThe first integral is smaller than (σ*²/σ²)ε_B and can be neglected as long as ε ≫ ε_B. In the second integral we can again make the cutoff approximation -- though now with s having to be above s_c -- to get the scaling ε ∝ ∫_{s_c}^∞ s^{D-1} S_η(s) ds. For target functions with a power-law decay S_η(s) ∝ s^{-α} of the power spectrum at large s, this predicts ε ∝ s_c^{D-α} ∝ (log(ρ))^{(D-α)/2}. So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (α = 2) we obtain ε ∝ (log(ρ))^{-1/2}, and for the Matern-class covariance function k(r) = (1 + r/ℓ)exp(-r/ℓ) (which has power spectrum ∝ (3/ℓ² + 4π²s²)^{-2}, so α = 4) we get ε ∝ (log(ρ))^{-3/2}. These predictions were tested experimentally using a GP learner with SE covariance function (ℓ = 0.1 and assumed noise level σ² = 0.1) against targets from the OU and Matern-class priors (with ℓ = 0.05) and with noise level σ*² = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of ε on log(ρ), in Figure 2(right) we make a log-log plot of ε against log(ρ). The dashed lines show the gradients of -1/2 and -3/2 and we observe good agreement between experimental and theoretical results for large ρ.\n\n3.1 Using the Equivalent Kernel in Kernel Regression\n\nAbove we have used the EK to understand how standard GP regression works.
One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction (1/ρ) Σ_i h(x* - x_i)y_i at x*. Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction.\n\nLetting ⟨...⟩ denote averaging over noise, training input points and the test point, and setting f_η(x*) = ∫ h(x*, x)η(x) dx, the average MSE of the EK predictor is\n\n    ε_pred = ⟨ [η(x*) - (1/ρ) Σ_i h(x*, x_i)y_i]² ⟩\n           = ⟨ [η(x*) - f_η(x*)]² ⟩ + (σ*²/ρ) ∫ h²(x*, x) dx + (1/ρ) ∫ h²(x*, x)η²(x) dx - (1/ρ) f_η²(x*)\n           = ∫ [ (σ⁴/ρ²) S_η(s)/S²(s) + σ*²/ρ ] / [1 + σ²/(ρS(s))]² ds + (η̄²/ρ) ∫ ds/[1 + σ²/(ρS(s))]²\n\nHere we have set η̄² = (∫ dx)^{-1} ∫ η²(x) dx = ∫ S_η(s) ds for the spatial average of the squared target amplitude. Taking the matched case (S_η(s) = S(s) and σ*² = σ²) as an example, the first term (which is the one we get for the prediction from \"smoothed out\" training inputs, see eq. 7) is of order σ²s_c^D/ρ, while the second one is ∝ η̄²s_c^D/ρ. Thus both terms scale in the same way, but the ratio of the second term to the first is the signal-to-noise ratio η̄²/σ², which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x*) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets.\n\nIn summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels.
We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner's covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression, as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves.\n\nAcknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.\n\nReferences\n\n[1] B. W. Silverman. Annals of Statistics, 12:898-916, 1984.\n\n[2] C. K. I. Williams. In M. I. Jordan, editor, Learning in Graphical Models, pages 599-621. Kluwer Academic, 1998.\n\n[3] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman and Hall, 1990.\n\n[4] F. Girosi, M. Jones, and T. Poggio. Neural Computation, 7(2):219-269, 1995.\n\n[5] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, New York, third edition, 1991.\n\n[6] C. Thomas-Agnan. Numerical Algorithms, 13:21-32, 1996.\n\n[7] T. Poggio, H. Voorhees, and A. Yuille. Tech. Report AI Memo 833, MIT AI Laboratory, 1985.\n\n[8] B. Scholkopf and A. Smola. Learning with Kernels. MIT Press, 2002.\n\n[9] M. L. Stein. Interpolation of Spatial Data. Springer-Verlag, New York, 1999.\n\n[10] M. Opper and F. Vivarelli. In NIPS 11, pages 302-308, 1999.\n\n[11] P. Sollich and A. Halees. Neural Computation, 14:1393-1428, 2002.\n\n[12] P. Sollich.
In NIPS 14, pages 519-526, 2002.\n", "award": [], "sourceid": 2676, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}]}