Learning Curves for Gaussian Processes

Part of Advances in Neural Information Processing Systems 11 (NIPS 1998)

Peter Sollich*
Abstract

I consider the problem of calculating learning curves (i.e., average generalization performance) of Gaussian processes used for regression. A simple expression for the generalization error in terms of the eigenvalue decomposition of the covariance function is derived, and used as the starting point for several approximation schemes. I identify where these become exact, and compare with existing bounds on learning curves; the new approximations, which can be used for any input space dimension, generally get substantially closer to the truth.
1 INTRODUCTION: GAUSSIAN PROCESSES
Within the neural networks community, there has in the last few years been a good deal of excitement about the use of Gaussian processes as an alternative to feedforward networks [1]. The advantages of Gaussian processes are that prior assumptions about the problem to be learned are encoded in a very transparent way, and that inference (at least in the case of regression that I will consider) is relatively straightforward. One crucial question for applications is then how 'fast' Gaussian processes learn, i.e., how many training examples are needed to achieve a certain level of generalization performance. The typical (as opposed to worst case) behaviour is captured in the learning curve, which gives the average generalization error ε as a function of the number of training examples n. Several workers have derived bounds on ε(n) [2, 3, 4] or studied its large n asymptotics. As I will illustrate below, however, the existing bounds are often far from tight; and asymptotic results will not necessarily apply for realistic sample sizes n. My main aim in this paper is therefore to derive approximations to ε(n) which get closer to the true learning curves than existing bounds, and apply both for small and large n.
*Present address: Department of Mathematics, King's College London, Strand, London WC2R 2LS, U.K. Email: peter.sollich@kcl.ac.uk
In its simplest form, the regression problem that I am considering is this: We are trying to learn a function θ* which maps inputs x (real-valued vectors) to (real-valued scalar) outputs θ*(x). We are given a set of training data D, consisting of n input-output pairs (x_l, y_l); the training outputs y_l may differ from the 'clean' target outputs θ*(x_l) due to corruption by noise. Given a test input x, we are then asked to come up with a prediction θ(x) for the corresponding output, expressed either in the simple form of a mean prediction θ̂(x) plus error bars, or more comprehensively in terms of a 'predictive distribution' P(θ(x)|x, D). In a Bayesian setting, we do this by specifying a prior P(θ) over our hypothesis functions, and a likelihood P(D|θ) with which each θ could have generated the training data; from this we deduce the posterior distribution P(θ|D) ∝ P(D|θ)P(θ). In the case of feedforward networks, where the hypothesis functions θ are parameterized by a set of network weights, the predictive distribution then needs to be extracted by integration over this posterior, either by computationally intensive Monte Carlo techniques or by approximations which lead to analytically tractable integrals. For a Gaussian process, on the other hand, obtaining the predictive distribution is trivial (see below); one reason for this is that the prior P(θ) is defined directly over input-output functions θ.

How is this done? Any θ is uniquely determined by its output values θ(x) for all x from the input domain, and for a Gaussian process, these are simply assumed to have a joint Gaussian distribution (hence the name). This distribution can be specified by the mean values ⟨θ(x)⟩_θ (which I assume to be zero in the following, as is commonly done), and the covariances ⟨θ(x)θ(x')⟩_θ = C(x, x'); C(x, x') is called the covariance function of the Gaussian process. It encodes in an easily interpretable way prior assumptions about the function to be learned. Smoothness, for example, is controlled by the behaviour of C(x, x') for x' → x: The Ornstein-Uhlenbeck (OU) covariance function C(x, x') ∝ exp(−|x − x'|/l) produces very rough (non-differentiable) functions, while functions sampled from the squared exponential (SE) prior with C(x, x') ∝ exp(−|x − x'|²/(2l²)) are infinitely differentiable. The 'length scale' parameter l, on the other hand, corresponds directly to the distance in input space over which we expect our function to vary significantly. More complex properties can also be encoded; by replacing l with different length scales for each input component, for example, relevant (small l) and irrelevant (large l) inputs can be distinguished.
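To make these smoothness properties concrete, the following sketch (not part of the original paper) draws sample functions from zero-mean Gaussian process priors with the OU and SE covariance functions for one-dimensional inputs. The kernel forms and the length scale l follow the definitions above; the input grid, the jitter term and the use of NumPy are implementation choices for illustration only.

```python
# Illustrative sketch only: sample functions from zero-mean GP priors with the
# OU and SE covariance functions, for one-dimensional inputs x in [0, 1].
import numpy as np

def ou_cov(x, x_prime, l=0.3):
    """Ornstein-Uhlenbeck covariance C(x, x') = exp(-|x - x'| / l): rough samples."""
    return np.exp(-np.abs(x[:, None] - x_prime[None, :]) / l)

def se_cov(x, x_prime, l=0.3):
    """Squared exponential covariance C(x, x') = exp(-|x - x'|^2 / (2 l^2)): smooth samples."""
    return np.exp(-(x[:, None] - x_prime[None, :]) ** 2 / (2 * l**2))

def sample_prior(cov, x, n_samples=3, jitter=1e-8, seed=0):
    """Draw n_samples functions theta from the zero-mean GP prior on the grid x."""
    rng = np.random.default_rng(seed)
    C = cov(x, x) + jitter * np.eye(len(x))  # small jitter for numerical stability
    L = np.linalg.cholesky(C)                # C = L L^T
    return (L @ rng.standard_normal((len(x), n_samples))).T

x = np.linspace(0.0, 1.0, 200)
rough = sample_prior(ou_cov, x)   # OU prior: jagged, non-differentiable-looking paths
smooth = sample_prior(se_cov, x)  # SE prior: smooth paths varying on the length scale l
```

Plotting such samples and varying l makes the roles of the covariance function and of the length scale directly visible.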
How does inference with Gaussian processes work? I only give a brief summary here and refer to existing reviews on the subject (see e.g. [5, 1]) for details. It is simplest to assume that outputs y are generated from the 'clean' values of a hypothesis function θ(x) by adding Gaussian noise of x-independent variance σ². The joint distribution of a set of training outputs {y_l} and the function values θ(x) is then also Gaussian, with covariances given by

⟨y_l y_m⟩ = C(x_l, x_m) + σ² δ_lm ≡ (K)_lm,    ⟨y_l θ(x)⟩ = C(x_l, x) ≡ (k(x))_l;

here I have defined an n × n matrix K and x-dependent n-component vectors k(x). The posterior distribution P(θ|D) is then obtained by simply conditioning on the {y_l}. It is again Gaussian and has mean and variance