{"title": "Nonstationary Covariance Functions for Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 273, "page_last": 280, "abstract": "", "full_text": "Nonstationary Covariance Functions for\n\nGaussian Process Regression\n\nChristopher J. Paciorek and Mark J. Schervish\n\nDepartment of Statistics\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\npaciorek@alumni.cmu.edu,mark@stat.cmu.edu\n\nAbstract\n\nWe introduce a class of nonstationary covariance functions for Gaussian\nprocess (GP) regression. Nonstationary covariance functions allow the\nmodel to adapt to functions whose smoothness varies with the inputs.\nThe class includes a nonstationary version of the Mat\u00e9rn stationary co-\nvariance, in which the differentiability of the regression function is con-\ntrolled by a parameter, freeing one from \ufb01xing the differentiability in\nadvance.\nIn experiments, the nonstationary GP regression model per-\nforms well when the input space is two or three dimensions, outperform-\ning a neural network model and Bayesian free-knot spline models, and\ncompetitive with a Bayesian neural network, but is outperformed in one\ndimension by a state-of-the-art Bayesian free-knot spline model. The\nmodel readily generalizes to non-Gaussian data. Use of computational\nmethods for speeding GP \ufb01tting may allow for implementation of the\nmethod on larger datasets.\n\n1\n\nIntroduction\n\nGaussian processes (GPs) have been used successfully for regression and classi\ufb01cation\ntasks. Standard GP models use a stationary covariance, in which the covariance between\nany two points is a function of Euclidean distance. However, stationary GPs fail to adapt\nto variable smoothness in the function of interest [1, 2]. This is of particular importance in\ngeophysical and other spatial datasets, in which domain knowledge suggests that the func-\ntion may vary more quickly in some parts of the input space than in others. 
For example, in mountainous areas, environmental variables are likely to be much less smooth than in flat regions. Spatial statistics researchers have made some progress in defining nonstationary covariance structures for kriging, a form of GP regression. We extend the nonstationary covariance structure of [3], of which [1] gives a special case, to a class of nonstationary covariance functions. The class includes a Matérn form, which in contrast to most covariance functions has the added flexibility of a parameter that controls the differentiability of sample functions drawn from the GP distribution. We use the nonstationary covariance structure for one-, two-, and three-dimensional input spaces in a standard GP regression model, as done previously only for one-dimensional input spaces [1].

The problem of variable smoothness has been attacked in spatial statistics by mapping the original input space to a new space in which stationarity is assumed, but that research has focused on multiple noisy replicates of the regression function, with no development or assessment of the method in the standard regression setting [4, 5]. The issue has been addressed in regression spline models by choosing the knot locations during the fitting [6], and in smoothing splines by choosing an adaptive penalizer on the integrated squared derivative [7]. The general approach in spline and other models involves learning the underlying basis functions, either explicitly or implicitly, rather than fixing the functions in advance. One alternative to a nonstationary GP model is a mixture of stationary GPs [8, 9]. Such methods adapt to variable smoothness by using different stationary GPs in different parts of the input space. The main difficulty is that the class membership is a function of the inputs; this introduces additional unknown functions into the hierarchy of the model.
One possibility is to use stationary GPs for these additional unknown functions [8], while [9] reduce computational complexity by using a local estimate of the class membership, though it is not known whether the resulting model is well-defined probabilistically. While the mixture approach is intriguing, neither [8] nor [9] compares their model to other methods. In our model, there are unknown functions in the hierarchy of the model that determine the nonstationary covariance structure. We choose to fully model these functions as Gaussian processes themselves, but recognize the computational cost and suggest that simpler representations are worth investigating.

2 Covariance functions and sample function differentiability

The covariance function is crucial in GP regression because it controls how much the data are smoothed in estimating the unknown function. GP distributions are distributions over functions; the covariance function determines the properties of sample functions drawn from the distribution. The stochastic process literature gives conditions for determining sample function properties of GPs based on the covariance function of the process, summarized in [10] for several common covariance functions. Stationary, isotropic covariance functions are functions only of Euclidean distance, τ. Of particular note, the squared exponential (also called the Gaussian) covariance function, C(τ) = σ² exp(−(τ/κ)²), where σ² is the variance and κ is a correlation scale parameter, has sample functions with infinitely many derivatives. In contrast, spline regression models have sample functions that are typically only twice differentiable. In addition to being of theoretical concern from an asymptotic perspective [11], other covariance forms might better fit real data, for which it is unlikely that the unknown function is so highly differentiable.
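To make the smoothness discussion concrete, here is a minimal Python sketch (ours, not from the paper) that draws sample paths from a zero-mean GP with the squared exponential covariance C(τ) = σ² exp(−(τ/κ)²) on a one-dimensional grid; the grid size and parameter values are illustrative choices.

```python
import numpy as np

def sq_exp_cov(x, sigma2=1.0, kappa=1.0):
    """Squared exponential covariance C(tau) = sigma2 * exp(-(tau/kappa)^2)."""
    tau = np.abs(x[:, None] - x[None, :])  # pairwise distances for 1-D inputs
    return sigma2 * np.exp(-((tau / kappa) ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
C = sq_exp_cov(x, kappa=0.2)
# Small jitter on the diagonal keeps the Cholesky factorization numerically stable.
L = np.linalg.cholesky(C + 1e-8 * np.eye(x.size))
paths = L @ rng.standard_normal((x.size, 3))  # three GP sample paths, one per column
```

Shrinking κ makes the paths wiggle faster, but every path drawn from this covariance is still infinitely differentiable in the limit of a dense grid, which is exactly the concern the text raises.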
In spatial statistics, the exponential covariance, C(τ) = σ² exp(−τ/κ), is commonly used, but this form gives sample functions that, while continuous, are not differentiable. Recent work in spatial statistics has focused on the Matérn form,

C(τ) = σ² (1 / (Γ(ν) 2^(ν−1))) (2√ν τ/κ)^ν K_ν(2√ν τ/κ),

where K_ν(·) is the modified Bessel function of the second kind, whose order is the differentiability parameter, ν > 0. This form has the desirable property that sample functions are ⌈ν⌉ − 1 times differentiable. As ν → ∞, the Matérn approaches the squared exponential form, while for ν = 0.5 the Matérn takes the exponential form. Standard covariance functions require one to place all of one's prior probability on a particular degree of differentiability; use of the Matérn allows one to more accurately, yet easily, express prior lack of knowledge about sample function differentiability. One application for which this may be of particular interest is geophysical data.

[12] suggest using the squared exponential covariance but with anisotropic distance, τ(x_i, x_j) = √((x_i − x_j)ᵀ Σ⁻¹ (x_i − x_j)), where Σ is an arbitrary positive definite matrix, rather than the standard diagonal matrix. This allows the GP model to more easily model interactions between the inputs. The nonstationary covariance function we introduce next builds on this more general form.

3 Nonstationary covariance functions

One nonstationary covariance is C(x_i, x_j) = ∫_{ℝ²} k_{x_i}(u) k_{x_j}(u) du, where x_i, x_j, and u are locations in ℝ², and k_x(·) is a kernel function centered at x. One can show directly that C(x_i, x_j) is positive definite in
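Returning to the stationary Matérn form above, the following sketch (ours, not the authors') evaluates C(τ) with SciPy's modified Bessel function and checks the ν = 0.5 case numerically. Note one subtlety of this parameterization: at ν = 0.5 the covariance is σ² exp(−√2 τ/κ), i.e. exponential in τ with the constant √2 absorbed into the effective correlation scale.

```python
import numpy as np
from scipy.special import gamma, kv  # kv(nu, z): modified Bessel fn of the 2nd kind

def matern_cov(tau, sigma2=1.0, kappa=1.0, nu=0.5):
    """Matern covariance, paper's parameterization:
    C(tau) = sigma2 / (Gamma(nu) 2^(nu-1)) * (2 sqrt(nu) tau/kappa)^nu
             * K_nu(2 sqrt(nu) tau/kappa)."""
    tau = np.asarray(tau, dtype=float)
    arg = 2.0 * np.sqrt(nu) * tau / kappa
    with np.errstate(invalid="ignore"):  # K_nu diverges at 0; patch that entry below
        c = sigma2 / (gamma(nu) * 2.0 ** (nu - 1.0)) * arg ** nu * kv(nu, arg)
    return np.where(tau == 0.0, sigma2, c)  # C(0) = sigma2 by continuity

tau = np.linspace(0.0, 2.0, 5)
c_half = matern_cov(tau, nu=0.5)
# Under this parameterization, nu = 0.5 gives sigma2 * exp(-sqrt(2) tau / kappa):
c_exp = np.exp(-np.sqrt(2.0) * tau)
```

Increasing ν interpolates toward the squared exponential limit, which is what lets the model express uncertainty about differentiability rather than fixing it in advance.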