Using the Equivalent Kernel to Understand Gaussian Process Regression

Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)


Authors

Peter Sollich, Christopher Williams

Abstract

The equivalent kernel [1] is a way of understanding how Gaussian process regression works for large sample sizes based on a continuum limit. In this paper we show (1) how to approximate the equivalent kernel of the widely-used squared exponential (or Gaussian) kernel and related kernels, and (2) how analysis using the equivalent kernel helps to understand the learning curves for Gaussian processes.

Consider the supervised regression problem for a dataset D with entries (x_i, y_i) for i = 1, ..., n. Under Gaussian Process (GP) assumptions the predictive mean at a test point x is given by

    f(x) = k^T(x) (K + σ²I)^{-1} y,    (1)

where K denotes the n × n matrix of covariances between the training points with entries k(x_i, x_j), k(x) is the vector of covariances k(x_i, x), σ² is the noise variance on the observations and y is a n × 1 vector holding the training targets. See e.g. [2] for further details.

We can define a vector of functions h(x) = (K + σ²I)^{-1} k(x). Thus we have f(x) = h^T(x) y, making it clear that the mean prediction at a point x is a linear combination of the target values y. Gaussian process regression is thus a linear smoother, see [3, section 2.8] for further details. For a fixed test point x, h(x) gives the vector of weights applied to targets y. Silverman [1] called h^T(x) the weight function.
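To make the linear-smoother view concrete, here is a minimal NumPy sketch with synthetic 1-D data and an SE kernel; all names and parameter values are illustrative and not taken from the paper:

    import numpy as np

    def se_kernel(a, b, ell=0.1):
        # squared exponential kernel k(r) = exp(-r^2 / (2 ell^2))
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2 * ell ** 2))

    rng = np.random.default_rng(0)
    n, sigma2 = 200, 0.1
    x = rng.uniform(-0.5, 0.5, n)
    y = np.sin(10 * x) + np.sqrt(sigma2) * rng.standard_normal(n)

    K = se_kernel(x, x)                                   # n x n covariance matrix
    k_star = se_kernel(x, np.array([0.0]))[:, 0]          # vector k(x) at test point x = 0
    h = np.linalg.solve(K + sigma2 * np.eye(n), k_star)   # weight function h(x) = (K + sigma^2 I)^{-1} k(x)
    f_star = h @ y                                        # mean prediction f(x) = h^T(x) y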

Understanding the form of the weight function is made complicated by the matrix inversion of K + 2I and the fact that K depends on the specific locations of the n datapoints. Idealizing the situation one can consider the observations to be "smeared out" in x-space at some constant density of observations. In this case analytic tools can be brought to bear on the problem, as shown below. By analogy to kernel smoothing Silverman [1] called the idealized weight function the equivalent kernel (EK).

The structure of the remainder of the paper is as follows: In section 1 we describe how to derive the equivalent kernel in Fourier space. Section 2 derives approximations for the EK for the squared exponential and other kernels. In section 3 we show how to use the EK approach to estimate learning curves for GP regression, and compare GP regression to kernel regression using the EK.

1 Gaussian Process Regression and the Equivalent Kernel

It is well known (see e.g. [4]) that the posterior mean for GP regression can be obtained as the function which minimizes the functional

    J[f] = (1/2)‖f‖²_H + 1/(2σ²) Σ_{i=1}^n (y_i − f(x_i))²,    (2)

where ‖f‖_H is the RKHS norm corresponding to kernel k. (However, note that the GP framework gives much more than just this mean prediction, for example the predictive variance and the marginal likelihood p(y) of the data under the model.)

Let η(x) = E[y|x] be the target function for our regression problem and write E[(y − f(x))²] = E[(y − η(x))²] + (η(x) − f(x))². Using the fact that the first term on the RHS is independent of f motivates considering a smoothed version of equation 2,

    J[f] = ρ/(2σ²) ∫ (η(x) − f(x))² dx + (1/2)‖f‖²_H,

where ρ has dimensions of the number of observations per unit of x-space (length/area/volume etc. as appropriate). If we consider kernels that are stationary, k(x, x′) = k(x − x′), the natural basis in which to analyse equation 1 is the Fourier basis of complex sinusoids so that f(x) is represented as ∫ f̃(s) e^{2πis·x} ds and similarly for η(x). Thus we obtain

    J[f] = ρ/(2σ²) ∫ |f̃(s) − η̃(s)|² ds + (1/2) ∫ |f̃(s)|²/S(s) ds,

as ‖f‖²_H = ∫ |f̃(s)|²/S(s) ds, where S(s) is the power spectrum of the kernel k, S(s) = ∫ k(x) e^{−2πis·x} dx. J[f] can be minimized using calculus of variations to obtain f̃(s) = S(s)η̃(s)/(σ²/ρ + S(s)), which is recognized as the convolution f(x) = ∫ h(x − x′) η(x′) dx′. Here the Fourier transform of the equivalent kernel h(x) is

    h̃(s) = S(s)/(S(s) + σ²/ρ) = 1/(1 + σ²/(ρ S(s))).    (3)

The term σ²/ρ in the first expression for h̃(s) corresponds to the power spectrum of a white noise process, whose delta-function covariance function becomes a constant in the Fourier domain. This analysis is known as Wiener filtering; see, e.g. [5, 14-1]. Notice that as ρ → ∞, h(x) tends to the delta function. If the input density is non-uniform the analysis above should be interpreted as computing the equivalent kernel for ρ = n p(x). This approximation will be valid if the scale of variation of p(x) is larger than the width of the equivalent kernel.
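As an aside, equation 3 can be evaluated numerically for any stationary 1-D kernel by discretizing Fourier space and applying an inverse FFT; a sketch assuming the SE power spectrum introduced in the next section (parameter values are purely illustrative):

    import numpy as np

    ell, sigma2, rho = 0.1, 0.1, 100.0
    dx = 1e-3                                   # spacing of the x-grid on which h(x) is returned
    s = np.fft.fftfreq(4096, d=dx)              # matching frequency grid (cycles per unit x)
    S = np.sqrt(2 * np.pi * ell**2) * np.exp(-2 * np.pi**2 * ell**2 * s**2)   # SE power spectrum, D = 1
    h_tilde = S / (S + sigma2 / rho)            # equation (3)
    h = np.fft.ifft(h_tilde).real / dx          # equivalent kernel h(x), centred at x = 0 (wraps around)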

2 The EK for the Squared Exponential and Related Kernels

For certain kernels/covariance functions the EK h(x) can be computed exactly by Fourier inversion. Examples include the Ornstein-Uhlenbeck process in D = 1 with covariance k(x) = e^{−|x|} (see [5, p. 326]), splines in D = 1 corresponding to the regularizer ‖Pf‖² = ∫ (f^{(m)})² dx [1, 6], and the regularizer ‖Pf‖² = ∫ (∇²f)² dx in two dimensions, where the EK is given in terms of the Kelvin function kei [7].

We now consider the commonly used squared exponential (SE) kernel k(r) = exp(−r²/2ℓ²), where r² = ‖x − x′‖². (This is sometimes called the Gaussian or radial basis function kernel.) Its Fourier transform is given by S(s) = (2πℓ²)^{D/2} exp(−2π²ℓ²|s|²), where D denotes the dimensionality of x (and s) space.

From equation 3 we obtain

    h̃_SE(s) = 1/(1 + b exp(2π²ℓ²|s|²)),

where b = σ²/(ρ(2πℓ²)^{D/2}). We are unaware of an exact result in this case, but the following initial approximation is simple but effective. For large ρ, b will be small. Thus for small s = |s| we have that h̃_SE ≈ 1, but for large s it is approximately 0. The change takes place around the point s_c where b exp(2π²ℓ²s_c²) = 1, i.e. s_c² = log(1/b)/2π²ℓ². As exp(2π²ℓ²s²) grows quickly with s, the transition of h̃_SE between 1 and 0 can be expected to be rapid, and thus be well-approximated by a step function.

Proposition 1 The approximate form of the equivalent kernel for the squared-exponential kernel in D dimensions is given by

    h_SE(r) = (s_c/r)^{D/2} J_{D/2}(2πs_c r).

Proof: h̃_SE(s) is a function of s = |s| only, and for D > 1 the Fourier integral can be simplified by changing to spherical polar coordinates and integrating out the angular variables to give

    h_SE(r) = 2πr ∫_0^∞ (s/r)^{ν+1} J_ν(2πrs) h̃_SE(s) ds    (4)
            ≈ 2πr ∫_0^{s_c} (s/r)^{ν+1} J_ν(2πrs) ds = (s_c/r)^{D/2} J_{D/2}(2πs_c r),

where ν = D/2 − 1, J_ν(z) is a Bessel function of the first kind and we have used the identity z^{ν+1} J_ν(z) = (d/dz)[z^{ν+1} J_{ν+1}(z)].

Note that in D = 1 by computing the Fourier transform of the boxcar function we obtain h_SE(x) = 2s_c sinc(2πs_c x), where sinc(z) = sin(z)/z. This is consistent with Proposition 1 and J_{1/2}(z) = (2/πz)^{1/2} sin(z). The asymptotic form of the EK in D = 2 is shown in Figure 2(left) below.
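In D = 1 the cutoff frequency and the resulting sinc approximation are easy to evaluate directly; a short sketch with illustrative parameter values:

    import numpy as np

    ell, sigma2, rho = 0.1, 0.1, 100.0
    b = sigma2 / (rho * np.sqrt(2 * np.pi * ell**2))        # b = sigma^2 / (rho (2 pi ell^2)^{D/2}), D = 1
    s_c = np.sqrt(np.log(1.0 / b) / (2 * np.pi**2 * ell**2))

    x = np.linspace(-0.5, 0.5, 1001)
    # np.sinc(t) = sin(pi t)/(pi t), so this evaluates 2 s_c sinc(2 pi s_c x)
    h_approx = 2 * s_c * np.sinc(2 * s_c * x)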

Notice that s_c scales as (log(ρ))^{1/2}, so that the width of the EK (which is proportional to 1/s_c) will decay very slowly as ρ increases. In contrast, for a spline of order m (with power spectrum ∝ |s|^{−2m}) the width of the EK scales as ρ^{−1/2m} [1].

If instead of R^D we consider the input set to be the unit circle, a stationary kernel can be periodized by the construction k_p(x, x′) = Σ_{n∈Z} k(x − x′ + 2πn). This kernel will be represented as a Fourier series (rather than with a Fourier transform) because of the periodicity. In this case the step function in Fourier space approximation would give rise to a Dirichlet kernel as the EK (see [8, section 4.4.3] for further details on the Dirichlet kernel).

We now show that the result of Proposition 1 is asymptotically exact for ρ → ∞, and calculate the leading corrections for finite ρ. The scaling of the width of the EK as 1/s_c suggests writing h_SE(r) = (2πs_c)^D g(2πs_c r). Then from equation 4 and using the definition of s_c,

    g(z) = z/(s_c(2πs_c)^D) ∫_0^∞ (2πs_c s/z)^{ν+1} J_ν(zs/s_c) {1 + exp[2π²ℓ²(s² − s_c²)]}^{-1} ds
         = z ∫_0^∞ (u/2πz)^{ν+1} J_ν(zu) {1 + exp[2π²ℓ²s_c²(u² − 1)]}^{-1} du,    (5)

where we have rescaled s = s_c u in the second step. The value of s_c, and hence ρ, now enters only in the exponential via a = 2π²ℓ²s_c². For a → ∞, the exponential tends to zero

for u < 1 and to infinity for u > 1. The factor 1/[1 + exp(...)] is therefore a step function Θ(1 − u) in the limit and Proposition 1 becomes exact, with g_∞(z) ≡ lim_{a→∞} g(z) = (2πz)^{−D/2} J_{D/2}(z). To calculate corrections to this, one uses that for large but finite a the difference Δ(u) = {1 + exp[a(u² − 1)]}^{−1} − Θ(1 − u) is non-negligible only in a range of order 1/a around u = 1. The other factors in the integrand of equation 5 can thus be Taylor-expanded around that point to give

    g(z) = g_∞(z) + z Σ_{k=0}^∞ (I_k/k!) [d^k/du^k {(u/2πz)^{ν+1} J_ν(zu)}]_{u=1},    I_k = ∫_0^∞ Δ(u)(u − 1)^k du.

The problem is thus reduced to calculating the integrals Ik. Setting u = 1 + v/a one has

    a^{k+1} I_k = ∫_{−a}^0 [1/(1 + exp(v²/a + 2v)) − 1] v^k dv + ∫_0^∞ v^k/(1 + exp(v²/a + 2v)) dv
                = ∫_0^a (−1)^{k+1} v^k/(1 + exp(−v²/a + 2v)) dv + ∫_0^∞ v^k/(1 + exp(v²/a + 2v)) dv.

In the first integral, extending the upper limit to ∞ gives an error that is exponentially small in a. Expanding the remaining 1/a-dependence of the integrand one then gets, to leading order in 1/a, I_0 = c_0/a², I_1 = c_1/a², while all I_k with k ≥ 2 are smaller by at least a factor 1/a². The numerical constants are −c_0 = c_1 = π²/24. This gives, using that (d/dz)[z^{ν+1} J_ν(z)] = z^ν J_ν(z) + z^{ν+1} J_{ν−1}(z) = (2ν + 1) z^ν J_ν(z) − z^{ν+1} J_{ν+1}(z):

Proposition 2 The equivalent kernel for the squared-exponential kernel is given for large ρ by h_SE(r) = (2πs_c)^D g(2πs_c r) with

    g(z) = (2πz)^{−D/2} { J_{D/2}(z) + (z/a²) [(c_0 + c_1(D − 1)) J_{D/2−1}(z) − c_1 z J_{D/2}(z)] } + O(1/a⁴).

For e.g. D = 1 this becomes g(z) = π^{−1}{sin(z)/z − π²/(24a²)[cos(z) + z sin(z)]}. Here and in general, by comparing the second part of the 1/a² correction with the leading order term, one estimates that the correction is of relative size z²/a². It will therefore provide a useful improvement as long as z = 2πs_c r < a; for larger z the expansion in powers of 1/a becomes a poor approximation because the correction terms (of all orders in 1/a) are comparable to the leading order.
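As a check of the D = 1 formula, the sketch below compares the plain sinc limit with the 1/a²-corrected g(z); the value of a is the one quoted for ρ = 100 in section 2.1, and the code is illustrative only:

    import numpy as np

    def g_d1(z, a):
        # D = 1 result of Proposition 2: (1/pi) { sin(z)/z - pi^2/(24 a^2) [cos(z) + z sin(z)] }
        lead = np.sinc(z / np.pi)                       # sin(z)/z
        corr = -(np.pi**2 / (24 * a**2)) * (np.cos(z) + z * np.sin(z))
        return (lead + corr) / np.pi

    a = 5.67                                            # value corresponding to rho = 100 (section 2.1)
    z = np.linspace(0.1, 15.0, 500)
    g_sinc = np.sinc(z / np.pi) / np.pi                 # Proposition 1 (sinc) limit
    g_corr = g_d1(z, a)                                 # with the 1/a^2 correction; useful for z < a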

2.1 Accuracy of the approximation

To evaluate the accuracy of the approximation we can compute the EK numerically as follows: Consider a dense grid of points in R^D with a sampling density ρ_grid. For making predictions at the grid points we obtain the smoother matrix K(K + σ²_grid I)^{−1}, where¹ σ²_grid = σ²ρ_grid/ρ, as per equation 1. Each row of this matrix is an approximation to the EK at the appropriate location, as this is the response to a y vector which is zero at all points except one. Note that in theory one should use a grid over the whole of R^D but in practice one can obtain an excellent approximation to the EK by only considering a grid around the point of interest as the EK typically decays with distance. Also, by only considering a finite grid one can understand how the EK is affected by edge effects.

   ¹To understand this scaling of σ²_grid consider the case where ρ_grid > ρ, which means that the effective variance at each of the grid points per unit x-space is larger, but as there are correspondingly more points this effect cancels out. This can be understood by imagining the situation where there are ρ_grid/ρ independent Gaussian observations with variance σ²_grid at a single x-point; this would be equivalent to one Gaussian observation with variance σ². In effect the ρ observations per unit x-space have been smoothed out uniformly.
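A sketch of this grid construction in 1-D (illustrative parameter values, roughly following those used for Figure 1 below but not identical):

    import numpy as np

    ell2, sigma2, rho, rho_grid = 0.004, 0.1, 100.0, 500.0
    dx = 1.0 / rho_grid
    xg = np.arange(-1.5, 1.5, dx)                         # dense grid over [-3/2, 3/2]
    K = np.exp(-(xg[:, None] - xg[None, :])**2 / (2 * ell2))
    sigma2_grid = sigma2 * rho_grid / rho                 # noise scaling from footnote 1
    # smoother matrix K (K + sigma2_grid I)^{-1}; each row approximates the EK at that grid point
    smoother = np.linalg.solve(K + sigma2_grid * np.eye(len(xg)), K).T
    i0 = np.argmin(np.abs(xg))                            # grid point closest to x = 0
    ek_at_zero = smoother[i0] * rho_grid                  # rescale weights by rho_grid to get a density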




Figure 1: Main figure: plot of the weight function corresponding to ρ = 100 training points/unit length, plus the numerically computed equivalent kernel at x = 0.0 and the sinc approximation from Proposition 1. Insets: numerically evaluated g(z) together with sinc and Proposition 2 approximations for ρ = 100 (left) and ρ = 10⁴ (right).

Figure 1 shows plots of the weight function for ρ = 100, the EK computed on the grid as described above and the analytical sinc approximation. These are computed for parameter values of ℓ² = 0.004 and σ² = 0.1, with ρ_grid/ρ = 5/3. To reduce edge effects, the interval [−3/2, 3/2] was used for computations, although only the centre of this is shown in the figure. There is quite good agreement between the numerical computation and the analytical approximation, although the sidelobes decay more rapidly for the numerically computed EK. This is not surprising because the absence of a truly hard cutoff in Fourier space means one should expect less "ringing" than the analytical approximation predicts. The figure also shows good agreement between the weight function (based on the finite sample) and the numerically computed EK. The insets show the approximation of Proposition 2 to g(z) for ρ = 100 (a = 5.67, left) and ρ = 10⁴ (a = 9.67, right). As expected, the addition of the 1/a² correction gives better agreement with the numerical result for z < a. Numerical experiments also show that the mean squared error between the numerically computed EK and the sinc approximation decreases like 1/log(ρ). This is larger than the naive estimate (1/a²)² ∝ 1/(log(ρ))⁴ based on the first correction term from Proposition 2, because the dominant part of the error comes from the region z > a where the 1/a expansion breaks down.

2.2 Other kernels

Our analysis is not in fact restricted to the SE kernel. Consider an isotropic kernel, for which the power spectrum S(s) depends on s = |s| only. Then we can again define from equation 3 an effective cutoff s_c on the range of s in the EK via σ²/ρ = S(s_c), so that h̃(s) = [1 + S(s_c)/S(s)]^{−1}. The EK will then have the limiting form given in Proposition 1 if h̃(s) approaches a step function Θ(s_c − s), i.e. if it becomes infinitely "steep" around the point s = s_c for s_c → ∞. A quantitative criterion for this is that the slope

|h̃′(s_c)| should become much larger than 1/s_c, the inverse of the range of the step function. Since h̃′(s) = S′(s)S(s_c)S^{−2}(s)[1 + S(s_c)/S(s)]^{−2}, this is equivalent to requiring that −s_c S′(s_c)/4S(s_c) ∝ −d log S(s_c)/d log s_c must diverge for s_c → ∞. The result of Proposition 1 therefore applies to any kernel whose power spectrum S(s) decays more rapidly than any positive power of 1/s.
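A quick numerical illustration of this criterion, contrasting the SE spectrum (for which −d log S/d log s = 4π²ℓ²s² grows without bound) with a spline-like power-law spectrum (for which it is constant); the code is purely illustrative:

    import numpy as np

    ell = 0.1
    s = np.logspace(0.0, 3.0, 400)
    # work with log spectra to avoid underflow at large s
    log_S_se = 0.5 * np.log(2 * np.pi * ell**2) - 2 * np.pi**2 * ell**2 * s**2
    log_S_pow = -4.0 * np.log(s)                             # spline-like power law |s|^{-4}
    slope_se = -np.gradient(log_S_se, np.log(s))             # = 4 pi^2 ell^2 s^2: diverges, Prop. 1 applies
    slope_pow = -np.gradient(log_S_pow, np.log(s))           # stays at the constant 4: criterion fails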

A trivial example of a kernel obeying this condition would be a superposition of finitely many SE kernels with different lengthscales ℓ²; the asymptotic behaviour of s_c is then governed by the smallest ℓ. A less obvious case is the "rational quadratic" k(r) = [1 + (r/ℓ₀)²]^{−(D+1)/2}, which has an exponentially decaying power spectrum S(s) ∝ exp(−2πℓ₀s). (This relationship is often used in the reverse direction, to obtain the power spectrum of the Ornstein-Uhlenbeck (OU) kernel exp(−r/ℓ₀).) Proposition 1 then applies, with the width of the EK now scaling as 1/s_c ∝ 1/log(ρ).

The previous example is a special case of kernels which can be written as superpositions of SE kernels with a distribution p(ℓ) of lengthscales ℓ, k(r) = ∫ exp(−r²/2ℓ²) p(ℓ) dℓ. This is in fact the most general representation for an isotropic kernel which defines a valid covariance function in any dimension D, see [9, 2.10]. Such a kernel has power spectrum

    S(s) = (2π)^{D/2} ∫_0^∞ ℓ^D exp(−2π²ℓ²s²) p(ℓ) dℓ    (6)

and one easily verifies that the rational quadratic kernel, which has S(s) ∝ exp(−2πℓ₀s), is obtained for p(ℓ) ∝ ℓ^{−D−2} exp(−ℓ₀²/2ℓ²). More generally, because the exponential factor in equation 6 acts like a cutoff for ℓ > 1/s, one estimates S(s) ≈ ∫_0^{1/s} ℓ^D p(ℓ) dℓ for large s. This will decay more strongly than any power of 1/s for s → ∞ if p(ℓ) itself decreases more strongly than any power of ℓ for ℓ → 0. Any such choice of p(ℓ) will therefore yield a kernel to which Proposition 1 applies.
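The power spectrum in equation 6 can be checked numerically for a given p(ℓ) by one-dimensional quadrature; a sketch using the (unnormalized) lengthscale density quoted above for the rational quadratic case, with D = 1 and illustrative values:

    import numpy as np
    from scipy.integrate import quad

    D, l0 = 1, 0.1

    def p(l):
        # unnormalized p(l) ~ l^{-D-2} exp(-l0^2 / (2 l^2)), which should give S(s) ~ exp(-2 pi l0 s)
        return l ** (-D - 2) * np.exp(-l0**2 / (2 * l**2))

    def S(s):
        # equation (6): S(s) = (2 pi)^{D/2} \int_0^inf l^D exp(-2 pi^2 l^2 s^2) p(l) dl
        integrand = lambda l: (2 * np.pi) ** (D / 2) * l**D * np.exp(-2 * np.pi**2 * l**2 * s**2) * p(l)
        return quad(integrand, 0.0, np.inf)[0]

    # log S(s) should fall off roughly linearly in s with slope about -2 pi l0
    vals = [np.log(S(s)) for s in (5.0, 10.0, 20.0)]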

3 Understanding GP Learning Using the Equivalent Kernel

We now turn to using EK analysis to get a handle on average case learning curves for Gaussian processes. Here the setup is that a function η is drawn from a Gaussian process, and we obtain ρ noisy observations of η per unit x-space at random x locations. We are concerned with the mean squared error (MSE) between the GP prediction f and η. Averaging over the noise process, the x-locations of the training data and the prior over η we obtain the average MSE ε as a function of ρ. See e.g. [10] and [11] for an overview of earlier work on GP learning curves.

To understand the asymptotic behaviour of ε for large ρ, we now approximate the true GP predictions with the EK predictions from noisy data, given by f_EK(x) = ∫ h(x − x′) y(x′) dx′ in the continuum limit of "smoothed out" input locations. We assume as before that y = target + noise, i.e. y(x) = η(x) + ν(x), where E[ν(x)ν(x′)] = (σ_*²/ρ) δ(x − x′). Here σ_*² denotes the true noise variance, as opposed to the noise variance σ² assumed in the EK; the scaling of σ_*² with ρ is explained in footnote 1. For a fixed target η, the MSE is ε = (∫dx)^{−1} ∫ [η(x) − f_EK(x)]² dx. Averaging over the noise process and target function gives in Fourier space

    ε = ∫ { S_η(s)[1 − h̃(s)]² + (σ_*²/ρ) h̃²(s) } ds = (σ²/ρ) ∫ [(σ²/ρ) S_η(s)/S²(s) + σ_*²/σ²] [1 + σ²/(ρS(s))]^{−2} ds,    (7)

where S_η(s) is the power spectrum of the prior over target functions. In the case S_η(s) = S(s) and σ_*² = σ², where the kernel is exactly matched to the structure of the target, equation 7 gives the Bayes error ε_B and simplifies to ε_B = (σ²/ρ) ∫ [1 + σ²/(ρS(s))]^{−1} ds (see also [5, eq. 14-16]). Interestingly, this is just the analogue (for a continuous power spectrum of the kernel rather than a discrete set of eigenvalues) of the lower bound of [10] on the MSE of standard GP prediction from finite datasets.
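For the matched SE case in D = 1, the Bayes error ε_B can be evaluated by direct numerical quadrature, which gives a convenient check on the asymptotic scaling discussed below; a sketch with illustrative parameter values:

    import numpy as np
    from scipy.integrate import quad

    ell, sigma2 = 0.1, 0.1

    def bayes_error(rho):
        S = lambda s: np.sqrt(2 * np.pi * ell**2) * np.exp(-2 * np.pi**2 * ell**2 * s**2)
        integrand = lambda s: rho * S(s) / (rho * S(s) + sigma2)   # = [1 + sigma2/(rho S(s))]^{-1}
        val, _ = quad(integrand, -50.0, 50.0)                      # integrand is negligible well before |s| = 50 here
        return (sigma2 / rho) * val

    eps_B = [bayes_error(r) for r in (1e2, 1e3, 1e4)]              # decays roughly like sqrt(log rho) / rho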




Figure 2: Left: plot of the asymptotic form of the EK (s_c/r) J_1(2πs_c r) for D = 2 and ρ = 1225. Right: log-log plot of ε against log(ρ) for the OU and Matern-class processes (β = 2, 4 respectively). The dashed lines have gradients of −1/2 and −3/2 which are the predicted rates.

In experiments this bound provides a good approximation to the actual average MSE for large dataset size n [11]. This supports our approach of using the EK to understand the learning behaviour of GP regression.

Treating the denominator in the expression for ε_B again as a hard cutoff at s = s_c, which is justified for large ρ, one obtains for an SE target and learner ε ∝ σ²s_c^D/ρ ∝ (log(ρ))^{D/2}/ρ. To get analogous predictions for the mismatched case, one can write equation 7 as

    ε = (σ_*²/ρ) ∫ { [1 + σ²/(ρS(s))] − σ²/(ρS(s)) } [1 + σ²/(ρS(s))]^{−2} ds + ∫ S_η(s) [ρS(s)/σ² + 1]^{−2} ds.

The first integral is smaller than (σ_*²/σ²) ε_B and can be neglected as long as ε ≫ ε_B. In the second integral we can again make the cutoff approximation, though now with s having to be above s_c, to get the scaling ε ∝ ∫_{s_c}^∞ s^{D−1} S_η(s) ds. For target functions with a power-law decay S_η(s) ∝ s^{−β} of the power spectrum at large s this predicts ε ∝ s_c^{D−β} ∝ (log(ρ))^{(D−β)/2}. So we generically get slow logarithmic learning, consistent with the observations in [12]. For D = 1 and an OU target (β = 2) we obtain ε ∝ (log(ρ))^{−1/2}, and for the Matern-class covariance function k(r) = (1 + r/ℓ) exp(−r/ℓ) (which has power spectrum ∝ (3/ℓ² + 4π²s²)^{−2}, so β = 4) we get ε ∝ (log(ρ))^{−3/2}. These predictions were tested experimentally using a GP learner with SE covariance function (ℓ = 0.1 and assumed noise level σ² = 0.1) against targets from the OU and Matern-class priors (with ℓ = 0.05) and with noise level σ_*² = 0.01, averaging over 100 replications for each value of ρ. To demonstrate the predicted power-law dependence of ε on log(ρ), in Figure 2(right) we make a log-log plot of ε against log(ρ). The dashed lines show the gradients of −1/2 and −3/2 and we observe good agreement between experimental and theoretical results for large ρ.
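A sketch of how one might check the predicted scaling for the mismatched case numerically: evaluate the dominant integral ∫_{s_c}^∞ s^{D−1} S_η(s) ds for an OU-type target spectrum (β = 2, D = 1) at several values of ρ and compare with (log ρ)^{−1/2}. Parameter values and normalizations are illustrative only:

    import numpy as np
    from scipy.integrate import quad

    ell, sigma2, ell_target = 0.1, 0.1, 0.05

    def dominant_term(rho):
        # cutoff frequency of the SE learner, s_c^2 = log(1/b) / (2 pi^2 ell^2)
        b = sigma2 / (rho * np.sqrt(2 * np.pi * ell**2))
        s_c = np.sqrt(np.log(1.0 / b) / (2 * np.pi**2 * ell**2))
        S_eta = lambda s: 1.0 / (1.0 + (2 * np.pi * ell_target * s)**2)   # OU-type spectrum, beta = 2 (unnormalized)
        return quad(S_eta, s_c, np.inf)[0]                                # predicted to scale like (log rho)^{-1/2}

    vals = [dominant_term(10.0**k) for k in (2, 3, 4, 5)]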

3.1 Using the Equivalent Kernel in Kernel Regression

Above we have used the EK to understand how standard GP regression works. One could alternatively envisage using the EK to perform kernel regression, on given finite data sets, producing a prediction ρ^{−1} Σ_i h(x − x_i) y_i at x. Intuitively this seems appealing as a cheap alternative to full GP regression, particularly for kernels such as the SE where the EK can be calculated analytically, at least to a good approximation. We now analyze briefly how such an EK predictor would perform compared to standard GP prediction.
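A minimal sketch of such an EK predictor in 1-D, using the sinc approximation to the EK and comparing against the full GP mean; the synthetic data and all parameter values are illustrative and not the authors' experimental setup:

    import numpy as np

    rng = np.random.default_rng(1)
    ell, sigma2, rho = 0.1, 0.1, 200.0
    n = int(rho)                                       # rho points per unit length on [0, 1]
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(6 * x) + 0.1 * rng.standard_normal(n)

    b = sigma2 / (rho * np.sqrt(2 * np.pi * ell**2))
    s_c = np.sqrt(np.log(1.0 / b) / (2 * np.pi**2 * ell**2))
    h = lambda r: 2 * s_c * np.sinc(2 * s_c * r)       # sinc approximation to the EK (section 2)

    x_star = 0.5
    f_ek = np.sum(h(x_star - x) * y) / rho             # EK prediction (1/rho) sum_i h(x - x_i) y_i

    # full GP prediction at the same point, for comparison
    K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))
    k_star = np.exp(-(x_star - x)**2 / (2 * ell**2))
    f_gp = k_star @ np.linalg.solve(K + sigma2 * np.eye(n), y)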

Letting ⟨·⟩ denote averaging over noise, training input points and the test point, and setting f_η(x) = ∫ h(x, x′) η(x′) dx′, the average MSE of the EK predictor is

    ε_pred = ⟨ [η(x) − (1/ρ) Σ_i h(x, x_i) y_i]² ⟩
           = ⟨ [η(x) − f_η(x)]² ⟩ + (σ_*²/ρ) ∫ h²(x, x′) dx′ + (1/ρ) ∫ h²(x, x′) η²(x′) dx′ − (1/ρ) f_η²(x)
           = (σ²/ρ) ∫ [(σ²/ρ) S_η(s)/S²(s) + σ_*²/σ²] [1 + σ²/(ρS(s))]^{−2} ds + (η̄²/ρ) ∫ [1 + σ²/(ρS(s))]^{−2} ds.

Here we have set η̄² = (∫dx)^{−1} ∫ η²(x) dx = ∫ S_η(s) ds for the spatial average of the squared target amplitude. Taking the matched case (S_η(s) = S(s) and σ_*² = σ²) as an example, the first term (which is the one we get for the prediction from "smoothed out" training inputs, see eq. 7) is of order σ²s_c^D/ρ, while the second one is ∼ η̄²s_c^D/ρ. Thus both terms scale in the same way, but the ratio of the second term to the first is the signal-to-noise ratio η̄²/σ², which in practice is often large. The EK predictor will then perform significantly worse than standard GP prediction, by a roughly constant factor, and we have confirmed this prediction numerically. This result is somewhat surprising given the good agreement between the weight function h(x) and the EK that we saw in figure 1, leading to the conclusion that the detailed structure of the weight function is important for optimal prediction from finite data sets.

In summary, we have derived accurate approximations for the equivalent kernel (EK) of GP regression with the widely used squared exponential kernel, and have shown that the same analysis in fact extends to a whole class of kernels. We have also demonstrated that EKs provide a simple means of understanding the learning behaviour of GP regression, even in cases where the learner's covariance function is not well matched to the structure of the target function. In future work, it will be interesting to explore in more detail the use of the EK in kernel smoothing. This is suboptimal compared to standard GP regression as we saw. However, it does remain feasible even for very large datasets, and may then be competitive with sparse methods for approximating GP regression. From the theoretical point of view, the average error of the EK predictor which we calculated may also provide the basis for useful upper bounds on GP learning curves.

Acknowledgments: This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.