{"title": "Function-Space Distributions over Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 14965, "page_last": 14976, "abstract": "Gaussian processes are flexible function approximators, with inductive biases controlled by a covariance kernel. Learning the kernel is the key to representation learning and strong predictive performance. In this paper, we develop functional kernel learning (FKL) to directly infer functional posteriors over kernels. In particular, we place a transformed Gaussian process over a spectral density, to induce a non-parametric distribution over kernel functions. The resulting approach enables learning of rich representations, with support for any stationary kernel, uncertainty over the values of the kernel, and an interpretable specification of a prior directly over kernels, without requiring sophisticated initialization or manual intervention. We perform inference through elliptical slice sampling, which is especially well suited to marginalizing posteriors with the strongly correlated priors typical to function space modeling. We develop our approach for non-uniform, large-scale, multi-task, and multidimensional data, and show promising performance in a wide range of settings, including interpolation, extrapolation, and kernel recovery experiments.", "full_text": "Function-Space Distributions over Kernels\n\nGregory W. Benton\u22171 Wesley J. Maddox\u22172\n\nJ\u00falio Albinati\u20213 Andrew Gordon Wilson1,2\n\nJayson P. Salkey\u22171\n\n1Courant Institute of Mathematical Sciences, New York University\n\n2Center for Data Science, New York University\n\n3Microsoft\n\nAbstract\n\nGaussian processes are \ufb02exible function approximators, with inductive biases\ncontrolled by a covariance kernel. Learning the kernel is the key to representation\nlearning and strong predictive performance. 
In this paper, we develop functional kernel learning (FKL) to directly infer functional posteriors over kernels. In particular, we place a transformed Gaussian process over a spectral density, to induce a non-parametric distribution over kernel functions. The resulting approach enables learning of rich representations, with support for any stationary kernel, uncertainty over the values of the kernel, and an interpretable specification of a prior directly over kernels, without requiring sophisticated initialization or manual intervention. We perform inference through elliptical slice sampling, which is especially well suited to marginalizing posteriors with the strongly correlated priors typical to function space modeling. We develop our approach for non-uniform, large-scale, multi-task, and multidimensional data, and show promising performance in a wide range of settings, including interpolation, extrapolation, and kernel recovery experiments.

1 Introduction

Practitioners typically follow a two-step modeling procedure: (1) choosing the functional form of a model, such as a neural network; (2) focusing learning efforts on training the parameters of that model. While inferring these parameters consumes our efforts, the parameters themselves are rarely interpretable, and are only of interest insofar as they combine with the functional form of the model to make predictions. Gaussian processes (GPs) provide an alternative function-space approach to machine learning, directly placing a distribution over functions that could fit data [25]. This approach enables great flexibility, and also provides a compelling framework for controlling the inductive biases of the model, such as whether we expect the solutions to be smooth, periodic, or have conditional independence properties. These inductive biases, and thus the generalization properties of the GP, are determined by a kernel function.
The performance of the GP, and what representations it can learn, therefore crucially depend on what we can learn about the kernel function itself. Accordingly, kernel functions are becoming increasingly expressive and heavily parametrized [15, 31, 34]. There is, however, no a priori reason to assume that the true data generating process is driven by a particular parametric family of kernels.

We propose extending the function-space view to kernel learning itself: to represent uncertainty over the kernel function, and to reflect the belief that the kernel does not have a simple parametric form. Just as one uses GPs to directly specify a prior and infer a posterior over functions that can fit data, we propose to directly reason about priors and posteriors over kernels. In Figure 1, we illustrate the shift from standard function-space GP regression to a function-space view of kernel learning. Specifically, our contributions are as follows:

∗Equal contribution. ‡Work done while interning with AGW.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Above: A function-space view of regression on data. We show draws from a GP prior and posterior over functions in the left and right panels, respectively. Below: With FKL, we apply the function-space view to kernels, showing prior kernel draws on the left, and posterior kernel draws on the right. In both cases, prior and posterior means are in thick black, two standard deviations about the mean in grey shade, and data points are given by crosses. With FKL, one can specify the prior mean over kernels to be any parametric family, such as an RBF kernel, to provide a useful inductive bias, while still retaining support for any stationary kernel.

• We model a spectral density as a transformed Gaussian process, providing a non-parametric function-space distribution over kernels.
Our approach, functional kernel learning (FKL), has several key properties: (1) it is highly flexible, with support for any stationary covariance function; (2) it naturally represents uncertainty over all values of the kernel; (3) it can easily be used to incorporate intuitions about what types of kernels are a priori likely; (4) despite its flexibility, it does not require sophisticated initialization or manual intervention; (5) it provides a conceptually appealing approach to kernel learning, where we reason directly about prior and posterior kernels, rather than about parameters of these kernels.

• We further develop FKL to handle multidimensional and irregularly spaced data, and multi-task learning.

• We demonstrate the effectiveness of FKL in a wide range of settings, including interpolation, extrapolation, and kernel recovery experiments, demonstrating strong performance compared to state-of-the-art methods.

• Code is available at https://github.com/wjmaddox/spectralgp .

Our work is intended as a step towards developing Gaussian processes for representation learning. By pursuing a function-space approach to kernel learning, we can discover rich representations of data, enabling strong predictive performance, and new interpretable insights into our modeling problems.

2 Related Work

We assume some familiarity with Gaussian processes [e.g., 25]. The vast majority of kernels and kernel learning methods are parametric. Popular kernels include the parametric RBF, Matérn, and periodic kernels. The standard multiple kernel learning approaches [11, 12, 16, 24] typically involve additive compositions of RBF kernels with different bandwidths. More recent methods model the spectral density (the Fourier transform) of stationary kernels to construct kernel learning procedures. Lázaro-Gredilla et al. [17] model the spectrum as independent point masses.
Wilson and Adams [34] model the spectrum as a scale-location mixture of Gaussians, referred to as a spectral mixture (SM) kernel. Yang et al. [39] combine these approaches, using a random feature expansion for a spectral mixture kernel, for scalability. Oliva et al. [23] consider a Bayesian non-parametric extension of Yang et al. [39], using a random feature expansion for a Dirichlet process mixture. Alternatively, Jang et al. [15] model the parameters of an SM kernel with prior distributions, and infer the number of mixture components. While these approaches provide strong performance improvements over standard kernels, they often suffer from difficulty in specifying a prior expectation over the value of the kernel, and from multi-modal learning objectives, requiring sophisticated manual intervention and initialization procedures [13].

A small collection of pioneering works [30, 31, 38] have considered various approaches to modeling the spectral density of a kernel with a Gaussian process. Unlike FKL, these methods are constrained to one-dimensional time series, and still require significant intervention to achieve strong performance, such as choices of windows for convolutional kernels. Moreover, we demonstrate that even in this constrained setting, FKL provides improved performance over these state-of-the-art methods.

3 Functional Kernel Learning

In this section, we introduce the prior model for functional kernel learning (FKL). FKL induces a distribution over kernels by modeling a spectral density (Section 3.1) with a transformed Gaussian process (Section 3.2).
Initially we consider one-dimensional inputs x and outputs y, and then generalize the approach to multiple input dimensions (Section 3.3) and multiple output dimensions (multi-task) (Section 3.4). We consider inference within this model in Section 4.

3.1 Spectral Transformations of Kernel Functions

Bochner's Theorem [5, 25] specifies that k(·) is the covariance of a stationary process on R if and only if

k(τ) = ∫_R e^{2πiωτ} S(ω) dω,    (1)

where τ = |x − x′| is the difference between any pair of inputs x and x′, for a positive, finite spectral density S(ω). This relationship is reversible: if S(ω) is known, k(τ) can be computed via inverse Fourier transformation.

For k(τ) to be real-valued, S(ω) must be symmetric. Furthermore, for finitely sampled τ we are only able to identify angular frequencies up to 2π/∆, where ∆ is the minimum absolute difference between any two inputs. Equation 1 simplifies to

k(τ) = ∫_{[0, 2π/∆)} cos(2πτω) S(ω) dω,    (2)

by expanding the complex exponential and using the oddness of sine (see Eqs. 4.7 and 4.8 in Rasmussen and Williams [25]), and then truncating the integral to the point of identifiability.

For an arbitrary function S(ω), Fourier inversion does not produce an analytic form for k(τ); however, we can use simple numerical integration schemes like the trapezoid rule to approximate the integral in Equation 2 as

k(τ) ≈ (∆ω/2) Σ_{i=1}^{I} [ cos(2πτω_i) S(ω_i) + cos(2πτω_{i−1}) S(ω_{i−1}) ],    (3)

where the spectrum is sampled at I evenly spaced frequencies ω_i that are ∆ω units apart in the frequency domain.

Figure 2: Graphical model for the FKL framework. Observed data is y_n, corresponding to the GP output f_n. The transformed latent GP is denoted with outputs S_i for observed frequencies ω_i. Hyper-parameters are denoted by φ = {θ, γ}.

The covariance k(τ) in Equation (3) is periodic. In practice, frequencies can be chosen such that the period is beyond the bounds that would need to be evaluated in τ. As a simple heuristic we choose the period P to be 8τ_max, where τ_max is the maximum distance between training inputs. We then choose frequencies ω_n = 2πn/P to ensure k(τ) is P-periodic. We have found that choosing 100 frequencies (n = 0, ..., 99) in this way leads to good performance over a range of experiments in Section 5.

Figure 3: Forward sampling from the hierarchical FKL model of Equation (4). Left: Using randomly initialized hyper-parameters φ, we draw functions g(ω) from the latent GP modeling the log spectral density. Center: We use the latent realizations of g(ω) with Bochner's Theorem and Eq. (3) to compose kernels. Right: We sample from a mean-zero Gaussian process with a kernel given by each of the kernel samples. Shaded regions show 2 standard deviations above and below the mean in dashed blue. Notice that the shapes of the prior kernel samples have significant variation but are clearly influenced by the prior mean, providing a controllable inductive bias.

3.2 Specification of Latent Density Model

Uniqueness of the relationship in Equation 1 is guaranteed by the Wiener-Khintchine Theorem (see Eq. 4.6 of Rasmussen and Williams [25]); thus learning the spectral density of a kernel is sufficient to learn the kernel. We propose modeling the log-spectral density of kernels using GPs. The log-transformation ensures that the spectral representation is non-negative.
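The spectral-to-kernel map of Eqs. (2)–(3) can be checked against a known Fourier pair: a one-sided Gaussian spectral density corresponds to the kernel k(τ) = exp(−2π²σ²τ²). Below is a minimal numpy sketch of the trapezoid rule in Eq. (3); the grid size and the width σ are illustrative choices, not the paper's settings:

```python
import numpy as np

def kernel_from_spectrum(tau, omegas, S):
    """Approximate k(tau) = int cos(2 pi tau w) S(w) dw by the trapezoid rule of Eq. (3)."""
    integrand = np.cos(2 * np.pi * np.outer(tau, omegas)) * S  # shape (len(tau), I)
    dw = omegas[1] - omegas[0]                                  # evenly spaced frequencies
    return dw / 2 * (integrand[:, 1:] + integrand[:, :-1]).sum(axis=1)

sigma = 0.5                            # width of the spectral density (illustrative)
omegas = np.linspace(0.0, 4.0, 2000)   # truncated frequency grid, as in Eq. (2)
# One-sided Gaussian density; the factor 2 folds the symmetric density onto [0, inf)
S = 2 * np.exp(-omegas**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

tau = np.array([0.0, 0.3, 1.0])
k_numeric = kernel_from_spectrum(tau, omegas, S)
k_exact = np.exp(-2 * np.pi**2 * sigma**2 * tau**2)  # analytic inverse Fourier transform
```

The numerical kernel agrees with the analytic one to the accuracy of the trapezoid rule, and k(0) recovers the total mass of the spectral density.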
We let φ = {θ, γ} be the set of all hyper-parameters (including those in both the data, γ, and latent spaces, θ), to simplify the notation of Section 4.

Using Equation 3 to produce a kernel k(τ) through S(ω), the hierarchical model over the data is

{Hyperprior}          p(φ) = p(θ, γ)
{Latent GP}           g(ω)|θ ∼ GP(µ(ω; θ), k_g(ω, ω′; θ))
{Spectral Density}    S(ω) = exp{g(ω)}
{Data GP}             f(x_n)|S(ω), γ ∼ GP(γ_0, k(τ; S(ω))).    (4)

We let f(x) be a noise-free function that forms part of an observation model. For regression, we can let y(x) = f(x) + ε(x), ε ∼ N(0, α²) (in future equations we implicitly condition on hyper-parameters of the noise model, e.g., α², for succinctness, but learn these as part of φ). The approach can easily be adapted to classification through a different observation model; e.g., p(y(x)) = σ(y(x)f(x)) for binary classification with labels y ∈ {−1, 1}. Full hyper-parameter prior specification is given in Appendix 2. Note that unlike logistic Gaussian process density estimators [1, 32] we need not worry about the normalization factor of S(ω), since it is absorbed by the scale of the kernel over data, k(0). The hierarchical model in Equation 4 defines the functional kernel learning (FKL) prior, with corresponding graphical model in Figure 2. Figure 3 displays the hierarchical model, showing the connection between spectral and data spaces.

A compelling feature of FKL is the ability to conveniently specify a prior expectation for the kernel by specifying a mean function for g(ω), and to encode smoothness assumptions by the choice of covariance function.
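The generative process in Equation (4) can be forward-sampled directly, mirroring Figure 3: draw g from the latent GP, exponentiate to obtain S(ω), transform to a kernel with Eq. (3), and then draw data functions. The sketch below uses numpy; the negative-quadratic mean, the Matérn-3/2 latent covariance, and all numeric constants are illustrative assumptions, not the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent GP over the log-spectral density g(omega)
omegas = np.linspace(0.0, 2.0, 100)
mu = -2.0 * (omegas - 0.5) ** 2                    # negative-quadratic mean (illustrative constants)
d = np.abs(omegas[:, None] - omegas[None, :])
ell = 0.2                                          # latent lengthscale (illustrative)
Kg = (1 + np.sqrt(3) * d / ell) * np.exp(-np.sqrt(3) * d / ell)  # Matern-3/2 covariance
g = rng.multivariate_normal(mu, Kg + 1e-8 * np.eye(omegas.size))
S = np.exp(g)                                      # spectral density sample (non-negative)

# Eq. (3): turn S into a covariance over data-space lags tau
x = np.linspace(0.0, 1.0, 50)
tau = np.abs(x[:, None] - x[None, :])
dw = omegas[1] - omegas[0]
integrand = np.cos(2 * np.pi * tau[..., None] * omegas) * S
K = dw / 2 * (integrand[..., 1:] + integrand[..., :-1]).sum(axis=-1)

# Data GP draw f ~ GP(0, K); small jitter for numerical stability
f = rng.multivariate_normal(np.zeros(x.size), K + 1e-6 * np.eye(x.size))
```

Because the discretized kernel is a non-negative combination of cosine kernels, K is positive semi-definite by construction, so only a small jitter is needed before sampling.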
For example, if we choose the mean of the latent process g(ω) to be a negative quadratic, then prior kernels are concentrated around RBF kernels, encoding the inductive bias that function values close in input space are likely to have high covariance. In many cases the spectral density contains sharp peaks around dominant frequencies, so we choose a Matérn 3/2 kernel for the covariance of g(ω) to capture this behaviour.

3.3 Multiple Input Dimensions

We extend FKL to multiple input dimensions either by pairing each one-dimensional kernel in a product of kernels with its own latent GP with distinct hyper-parameters (FKL separate), or by having all one-dimensional kernels be draws from a single latent process with one set of hyper-parameters (FKL shared). The hierarchical Bayesian model over the D dimensions is described in the following manner:

{Hyperprior}                     p(φ) = p(θ, γ)
{Latent GP ∀d ∈ {1, ..., D}}     g_d(ω_d)|θ ∼ GP(µ(ω_d; θ), k_{g_d}(ω_d, ω′_d; θ))
{Product Kernel GP}              f(x)|{g_d(ω_d)}_{d=1}^{D}, γ ∼ GP(γ_0, ∏_{d=1}^{D} k(τ_d; S(ω_d)))    (5)

Tying the kernels over each dimension while considering their spectral densities to be draws from the same latent process (FKL shared) provides multiple benefits. Under these assumptions, we have more information to learn the underlying latent GP g(ω).
We also gain the helpful inductive bias that the covariance functions across dimensions share some high-order properties, and the shared latent process enables linear time scaling with dimensionality.

3.4 Multiple Output Dimensions

FKL additionally provides a natural way to view multi-task GPs. We assume that each task (or output), indexed by t ∈ {1, . . . , T}, is generated by a GP with a distinct kernel. The kernels are tied together by assuming each of those T kernels is constructed from realizations of a single shared latent GP. Notationally, we let g(ω) denote the latent GP, and use subscripts g_t(ω) to indicate independent realizations of this latent GP. The hierarchical model can then be described in the following manner:

{Hyperprior}                    p(φ) = p(θ, γ)
{Latent GP}                     g(ω)|θ ∼ GP(µ(ω; θ), k_g(ω, ω′; θ))
{Task GP ∀t ∈ {1, ..., T}}      f_t(x)|g_t(ω), γ ∼ GP(γ_{0,t}, k(τ; S_t(ω)))    (6)

In this setup, rather than having to learn the kernel from a single realization of a process (a single task), we can learn the kernel from multiple realizations, which provides a wealth of information for kernel learning [37]. While sharing individual hyper-parameters across multiple tasks is standard (see e.g. Section 9.2 of MacKay [18]), these approaches can only learn limited structure. The information provided by multiple tasks is distinctly amenable to FKL, which shares a flexible process over kernels across tasks. FKL can use this information to discover unconventional structure in data, while retaining computational efficiency (see Appendix 1).

4 Inference and Prediction

When considering the hierarchical model defined in Equation 4, one needs to learn both the hyper-parameters, φ, and an instance of the latent Gaussian process, g(ω).
We employ alternating updates in which the hyper-parameters φ and draws of the latent GP are updated separately. A full description of the method is in Algorithm 1 in Appendix 2.

Updating Hyper-Parameters: Considering the model specification in Eq. 4, we can define a loss as a function of φ = {θ, γ} for an observation of the density, g̃(ω), and data observations y(x). This loss combines the log hyper-prior, the marginal log-likelihood of the latent GP with fixed data GP, and the marginal log-likelihood of the data GP:

L(φ) = −( log p(φ) + log p(g̃(ω)|θ, ω) + log p(y(x)|g̃(ω), γ, x) ).    (7)

This objective can be optimized using any procedure; we use the AMSGRAD variant of Adam as implemented in PyTorch [26]. For GPs with D input dimensions (and similarly for D output dimensions), we extend Eq. 7 as

L(φ) = −( log p(φ) + Σ_{d=1}^{D} log p(g̃_d(ω_d)|θ, ω) + log p(y(x)|{g̃_d(ω_d)}_{d=1}^{D}, γ, x) ).    (8)

Updating Latent Gaussian Process: With fixed hyper-parameters φ, the posterior of the latent GP is

p(g(ω)|φ, x, y(x), f(x)) ∝ N(µ(ω; θ), k_g(ω; θ)) p(f(x)|g(ω), γ).    (9)

We sample from this posterior using elliptical slice sampling (ESS) [21, 20], which is specifically designed to sample from posteriors with highly correlated Gaussian priors. Note that we must reparametrize the prior by removing the mean before using ESS; we then consider it part of the likelihood afterwards.

Taken together, these two updates can be viewed as a single-sample Monte Carlo expectation maximization (EM) algorithm [33], where only the final g(ω) sample is used in the Monte Carlo expectation.
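A single ESS transition, as introduced by Murray et al. [21], can be written in a few lines. The sketch below assumes a zero-mean prior (with the mean folded into the likelihood, as described above); the toy likelihood in the demo is purely illustrative:

```python
import numpy as np

def ess_step(f, prior_chol, log_lik, rng):
    """One elliptical slice sampling transition for prior N(0, Sigma), Sigma = L L^T."""
    nu = prior_chol @ rng.standard_normal(f.size)        # auxiliary draw from the prior
    log_y = log_lik(f) + np.log(rng.uniform())           # slice height under current state
    theta = rng.uniform(0.0, 2.0 * np.pi)                # initial proposal angle
    lo, hi = theta - 2.0 * np.pi, theta                  # shrinking angle bracket
    while True:
        f_prop = f * np.cos(theta) + nu * np.sin(theta)  # proposal stays on the prior ellipse
        if log_lik(f_prop) > log_y:
            return f_prop
        if theta < 0.0:                                  # shrink the bracket toward theta = 0
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# Toy check: prior N(0, 1), Gaussian likelihood centered at 1  =>  posterior N(0.5, 0.5)
rng = np.random.default_rng(0)
prior_chol = np.eye(1)
log_lik = lambda f: -0.5 * float((f[0] - 1.0) ** 2)
f = np.zeros(1)
samples = []
for _ in range(4000):
    f = ess_step(f, prior_chol, log_lik, rng)
    samples.append(f[0])
```

The update has no tuning parameters and always terminates, since shrinking the bracket toward θ = 0 recovers the current state, which lies above the slice by construction.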
Using the alternating updates (following Algorithm 1) and transforming the spectral densities into kernels, samples of predictions on the training and testing data can be taken. We generate posterior estimates of kernels by fixing φ after updating and drawing samples from the posterior distribution, p(g(ω)|f, y, φ), taken from ESS (using y as short for y(x), the training data indexed by inputs x).

Prediction: The predictive distribution for any test input x∗ is given by

p(f∗|x∗, x, y, φ) = ∫ p(f∗|x∗, x, y, φ, k) p(k|x∗, x, y, φ) dk,    (10)

where we condition only on the data x, y and hyper-parameters φ determined from optimization, marginalizing over the whole posterior distribution over kernels k given by FKL. We use simple Monte Carlo to approximate this integral as

p(f∗|x∗, x, y, φ) ≈ (1/J) Σ_{j=1}^{J} p(f∗|x∗, x, y, φ, k_j),    k_j ∼ p(k|x∗, x, y, φ).    (11)

We sample from the posterior over g(ω) using elliptical slice sampling as above. We then transform these samples S(ω) = exp{g(ω)} to form posterior samples from the spectral density. We then sample k_j ∼ p(k|x∗, x, y, φ) by evaluating the trapezoidal approximation in Eq. (3) (at a collection of frequencies ω) for each sample of the spectral density.
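Each term p(f∗|x∗, x, y, φ, k_j) in the Monte Carlo average is a standard GP predictive under one sampled kernel. A numpy sketch of that conditional; the RBF stand-in for k_j and the noise level are illustrative, since in FKL the kernel samples come from the spectral density:

```python
import numpy as np

def gp_predict(K_xx, K_sx, K_ss, y, noise_var=1e-6):
    """Predictive mean and covariance of f* under one sampled kernel."""
    L = np.linalg.cholesky(K_xx + noise_var * np.eye(y.size))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K_xx + noise I)^{-1} y
    mean = K_sx @ alpha
    V = np.linalg.solve(L, K_sx.T)
    cov = K_ss - V.T @ V                                  # posterior covariance at test inputs
    return mean, cov

# Illustrative stand-in kernel (RBF); in FKL, k_j is built from a spectral-density sample
k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
x = np.linspace(-2, 2, 8)
y = np.sin(x)
x_star = np.array([0.0, 1.5])
mean, cov = gp_predict(k(x, x), k(x_star, x), k(x_star, x_star), y)
```

Running this once per kernel sample k_j yields the J predictive means and covariances that are then combined into the mixture.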
For regression with Gaussian noise, p(f∗|x∗, x, y, φ, k) is Gaussian, and our expression for the predictive distribution becomes

p(f∗|x∗, x, y, φ) ≈ (1/J) Σ_{j=1}^{J} N( f̄∗_j(x∗), Cov_j(f∗) ),    (12)

f̄∗_j(x∗) = k_{f_j}(x∗, x; γ) k_{f_j}(x, x; γ)^{−1} y,
Cov_j(f∗) = k_{f_j}(x∗, x∗; γ) − k_{f_j}(x∗, x; γ) k_{f_j}(x, x; γ)^{−1} k_{f_j}(x, x∗; γ),

where k_{f_j} is the kernel associated with sample g_j from the posterior over g, after transformation to a spectral density and evaluation of the trapezoidal approximation (suppressing the dependence on ω used in Eq. (3)). Here y is an n × 1 vector of training data; k_{f_j}(x, x; γ) is an n × n matrix formed by evaluating k_{f_j} at all pairs of the n training inputs x; and similarly k_{f_j}(x∗, x∗; γ) is a scalar and k_{f_j}(x∗, x; γ) is 1 × n for a single test input x∗. This distribution is a mixture of Gaussians with J components.

Following the above procedure, we obtain J samples from the unconditional distribution in Eq. (12). We can compute the sample mean for point predictions and twice the sample standard deviation for a credible set. We use the mixture of Gaussians representation in conjunction with the laws of total mean and total variance to approximate the moments of the predictive distribution in Eq. (12), which is what we do for the experiments.

5 Experiments

We demonstrate the practicality of FKL over a wide range of experiments: (1) recovering known kernels from data (Section 5.1); (2) extrapolation (Section 5.2); (3) multi-dimensional inputs and irregularly spaced data (Section 5.3); (4) multi-task precipitation data (Section 5.4); and (5) multidimensional pattern extrapolation (Section 5.5).
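The moment computation for the mixture in Eq. (12) is a direct application of the laws of total mean and total variance. A minimal sketch with illustrative numbers (not results from the paper):

```python
import numpy as np

# Per-sample predictive means m_j and variances v_j at one test input (illustrative values)
m = np.array([1.0, 2.0, 3.0])
v = np.array([0.5, 0.5, 0.5])

mix_mean = m.mean()                                    # law of total mean: average of component means
mix_var = v.mean() + (m ** 2).mean() - mix_mean ** 2   # law of total variance: E[Var] + Var[E]
credible_halfwidth = 2 * np.sqrt(mix_var)              # "twice the standard deviation" credible set
```

Note that the mixture variance exceeds the average component variance whenever the component means disagree, so uncertainty over the kernel widens the credible sets.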
We compare to the standard RBF and Matérn kernels, as well as spectral mixture kernels [34], and the Bayesian nonparametric spectral estimation (BNSE) of Tobar [30].

For FKL experiments, we use g(ω) with a negative quadratic mean function (to induce an RBF-like prior mean in the distribution over kernels), and a Matérn kernel with ν = 3/2 (to capture the typical sharpness of spectral densities). We use the heuristic for frequencies in the trapezoid rule described in Section 3.1. Using J = 10 samples from the posterior over kernels, we evaluate the sample mean and twice the sample standard deviation from the unconditional predictive distribution in Eq. (12) for point predictions and credible sets. We perform all experiments in GPyTorch [10].

Figure 4: Left: Samples from the FKL posterior over the spectral density capture the shape of the true spectrum. Right: Many of the FKL predictions on the held-out data are nearly on par with the ground-truth model (SM in dashed red). GPs using the other kernels perform poorly on extrapolation away from the training points.

5.1 Recovery of Spectral Mixture Kernels

Here we test the ability of FKL to recover known ground-truth kernels. We generate 150 data points, x_i ∼ U(−7, 7), randomly and then draw a random function from a GP with a two-component spectral mixture kernel with weights 1 and 0.5, spectral means of 0.2 and 0.9, and standard deviations of 0.05. As shown in Figure 4, FKL accurately reconstructs the underlying spectral density, which enables accurate in-filling of data in a held-out test region, alongside reliable credible sets. A GP with a spectral mixture kernel is suited for this task and closely matches the withheld data.
GP regression with the RBF or Matérn kernels is unable to predict accurately very far from the training points. BNSE similarly interpolates the training data well, but performs poorly in the extrapolation region away from the data. In Appendix 5.1 we illustrate an additional kernel recovery experiment, with similar results.

5.2 Interpolation and Extrapolation

Airline Passenger Data: We next consider the airline passenger dataset [14], consisting of 96 monthly observations of numbers of airline passengers from 1949 to 1961, and attempt to extrapolate the next 48 observations. We standardize the dataset to have zero mean and unit standard deviation before modeling. The dataset is difficult for Gaussian processes with standard stationary kernels, due to the rising trend and the difficulty of extrapolating quasi-periodic structure.

Sinc: We model a pattern of three sinc functions, replicating the experiment of Wilson and Adams [34]. Here y(x) = sinc(x + 10) + sinc(x) + sinc(x − 10) with sinc(x) = sin(πx)/(πx). This has been shown previously [34] to be a case for which parametric kernels fail to pick up on the correct periodic structure of the data.

Figures 5a and 5b show that FKL outperforms simple parametric kernels on complex datasets. Performance of FKL is on par with that of SM kernels, while requiring less manual tuning and being more robust to initialization.

5.3 Multiple Dimensions: Interpolation on UCI Datasets

We use the product kernel described in Section 3.3 with both separate and shared latent GPs for regression tasks on UCI datasets. Figure 6 visually depicts the model with respect to prior and posterior products of kernels. We standardize the data to zero mean and unit variance and randomly split the training and test sets, corresponding to 90% and 10% of the full data, respectively. We conduct experiments over 10 random splits and show the average RMSE and standard deviation.
We compare to the RBF, ARD, and ARD Matérn kernels. Furthermore, we compare the results of sharing a single latent GP across the kernels of the product decomposition (Eq. 5) with independent latent GPs for each kernel in the decomposition.

Figure 5: (a) Extrapolation on the airline passenger dataset [14]. (b) Prediction on sinc data. FKL is on par with a carefully tuned SM kernel (dashed pink) in (a) and shows the best performance in (b); BNSE (brown) performs well on the training data, but quickly reverts to the mean in the testing set.

Figure 6: Samples of prior (a) and posterior (b) kernels displayed alongside the sample mean (thick lines) and ±2 standard deviations (shade). Each color corresponds to a kernel, k(·), for a dimension of the airfoil dataset.

5.4 Multi-Task Extrapolation

We use the multi-task version of FKL in Section 3.4 to model precipitation data sourced from the United States Historical Climatology Network [19]. The data contain daily precipitation measurements over 115 years collected at 1218 locations in the US. Average positive precipitation by day of the year is taken for three climatologically similar recording locations in Colorado: Boulder, Telluride, and Steamboat Springs, as shown in Figure 8. The data for these locations have similar seasonal variations, motivating a shared latent GP across tasks, with a flexible kernel process capable of learning this structure. Following the procedure outlined in Section 4 and detailed in Algorithm 2 in the Appendix, FKL provides predictive distributions that accurately interpolate and extrapolate the data, with appropriate credible sets.
In Appendix 6 we extend these multi-task precipitation results to large-scale experimentation with datasets containing tens of thousands of points.

Figure 7: Standardized log losses on five of the 12 UCI datasets used. Here, we can see that FKL typically outperforms parametric kernels, even with a shared latent GP. See Table 2 in the Appendix for the full results.

Figure 8: Posterior predictions generated using latent GP samples. 10 samples of the latent GP for each site are used to construct covariance matrices and posterior predictions of the GPs over the data.

Figure 9: Texture extrapolation: training data is shown to the left of the blue line, and predicted extrapolations according to each model are to the right.

5.5 Scalability and Texture Extrapolation

Large datasets typically provide additional information to learn rich covariance structure. Following the setup in [36], we exploit the underlying structure in images and scale FKL to learn such a rich covariance, enabling extrapolation on textures. When the inputs X form a Cartesian product multidimensional grid, the covariance matrix decomposes as the Kronecker product of the covariance matrices over each input dimension, i.e. K(X, X) = K(X_1, X_1) ⊗ K(X_2, X_2) ⊗ ··· ⊗ K(X_P, X_P), where X_i are the elements of the grid in the ith dimension [28].
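The Kronecker identities behind this speedup can be illustrated in a few lines: solves and log-determinants of K_1 ⊗ K_2 require only the eigendecompositions of the small per-dimension factors. A numpy sketch on a 4 × 5 grid with random SPD factors (an illustration of the algebra, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)      # symmetric positive definite factor

K1, K2 = random_spd(4), random_spd(5)   # per-dimension grid covariances
y = rng.standard_normal(4 * 5)

# Solve (K1 kron K2) alpha = y using only the factor eigendecompositions
e1, Q1 = np.linalg.eigh(K1)
e2, Q2 = np.linalg.eigh(K2)
Y = y.reshape(4, 5)
T = (Q1.T @ Y @ Q2) / np.outer(e1, e2)  # rotate into the joint eigenbasis, divide by eigenvalues
alpha = (Q1 @ T @ Q2.T).ravel()

# Log-determinant of the Kronecker product from the factor eigenvalues alone
logdet = 5 * np.log(e1).sum() + 4 * np.log(e2).sum()
```

Neither operation ever forms the full 20 × 20 Kronecker matrix, which is what makes exact inference on large grids tractable.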
Using the eigendecompositions of Kronecker matrices, solutions to linear systems and log determinants of covariance matrices with Kronecker structure can be computed exactly in O(P N^{(P+1)/P}) time, instead of the standard cubic scaling in N [36].

We train FKL on a 10,000-pixel image of a steel tread-plate and extrapolate the pattern beyond the training domain. As shown in Figure 9, FKL uncovers the underlying structure with no sophisticated initialization procedure. While the spectral mixture kernel performs well on these tasks [36], it requires involved initialization procedures. By contrast, standard kernels, such as the RBF kernel, are unable to discover the covariance structure needed to extrapolate on these tasks.

6 Discussion

In an era where the number of model parameters often exceeds the number of available data points, the function-space view provides a more natural representation of our models. It is the complexity and inductive biases of functions that affect generalization performance, rather than the number of parameters in a model. Moreover, we can interpretably encode our assumptions over functions, whereas parameters are often inscrutable. We have shown that the function-space approach to learning covariance structure is flexible and convenient, able to automatically discover rich representations of data, without over-fitting.

There are many exciting directions for future work: (i) interpreting the learned covariance structure across multiple heterogeneous tasks to gain new scientific insights; (ii) developing function-space distributions over non-stationary kernels; and (iii) developing deep hierarchical functional kernel learning models, where we consider function-space distributions over distributions of kernels.
Acknowledgements

GWB, WJM, JPS, and AGW were supported by an Amazon Research Award, Facebook Research, NSF IIS-1563887, and NSF IIS-1910266. WJM was additionally supported by an NSF Graduate Research Fellowship under Grant No. DGE-1650441.

References

[1] Ryan Prescott Adams, Iain Murray, and David J. C. MacKay. Nonparametric Bayesian density modeling with Gaussian processes. arXiv:0912.4896 [math, stat], December 2009. URL http://arxiv.org/abs/0912.4896.

[2] Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil D Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 25–32, 2010.

[3] Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(May):1459–1500, 2011.

[4] Yu K Belyaev. Continuity and Hölder's conditions for sample functions of stationary Gaussian processes. In Proc. Fourth Berkeley Symp. Math. Statist. Prob., volume 2, pages 23–33, 1961.

[5] Salomon Bochner. Lectures on Fourier Integrals. Princeton University Press, 1959.

[6] Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pages 153–160, 2008.

[7] D Cruz-Uribe and CJ Neugebauer. Sharp error bounds for the trapezoidal rule and Simpson's rule. J. Inequal. Pure Appl. Math, 3, 2002.

[8] Andreas Damianou and Neil Lawrence. Deep Gaussian processes.
In Artificial Intelligence and Statistics, pages 207–215, 2013.

[9] Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew Gordon Wilson. Scalable log determinants for Gaussian process kernel learning. arXiv:1711.03481 [cs, stat], November 2017. URL http://arxiv.org/abs/1711.03481.

[10] Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586, 2018.

[11] Marc G Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2(Dec):299–312, 2001.

[12] Mehmet Gönen and Ethem Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211–2268, 2011.

[13] William Herlands, Daniel B Neill, Hannes Nickisch, and Andrew Gordon Wilson. Change surfaces for expressive multidimensional changepoints and counterfactual prediction. arXiv preprint arXiv:1810.11861, 2018.

[14] R J Hyndman. Time series data library, 2005. URL http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/.

[15] Phillip A Jang, Andrew Loeb, Matthew Davidow, and Andrew G Wilson. Scalable Lévy process priors for spectral kernel learning. In Advances in Neural Information Processing Systems, pages 3940–3949, 2017.

[16] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27–72, 2004.

[17] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.

[18] David JC MacKay. Introduction to Gaussian processes.
NATO ASI Series F Computer and Systems Sciences, 168:133–166, 1998.

[19] M J Menne, C N Williams, and R S Vose. United States Historical Climatology Network daily temperature, precipitation, and snow data. Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 2015.

[20] Iain Murray and Ryan P Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pages 1732–1740, 2010.

[21] Iain Murray, Ryan Prescott Adams, and David JC MacKay. Elliptical slice sampling. In Artificial Intelligence and Statistics, 2010.

[22] Trung V Nguyen, Edwin V Bonilla, et al. Collaborative multi-output Gaussian processes. In Uncertainty in Artificial Intelligence, pages 643–652, 2014.

[23] Junier B Oliva, Avinava Dubey, Andrew G Wilson, Barnabás Póczos, Jeff Schneider, and Eric P Xing. Bayesian nonparametric kernel-learning. In Artificial Intelligence and Statistics, pages 1078–1086, 2016.

[24] Alain Rakotomamonjy, Francis Bach, Stéphane Canu, and Yves Grandvalet. More efficiency in multiple kernel learning. In Proceedings of the 24th International Conference on Machine Learning, pages 775–782. ACM, 2007.

[25] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA, 2006.

[26] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2019.

[27] James Requeima, William Tebbutt, Wessel Bruinsma, and Richard E Turner. The Gaussian process autoregressive regression model (GPAR). In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1860–1869, 2019.

[28] Yunus Saatçi. Scalable inference for structured Gaussian process models.
PhD thesis, University of Cambridge, 2012.

[29] Zheyang Shen, Markus Heinonen, and Samuel Kaski. Harmonizable mixture kernels with variational Fourier features. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3273–3282, 2019.

[30] Felipe Tobar. Bayesian nonparametric spectral estimation. In Advances in Neural Information Processing Systems, pages 10127–10137, 2018.

[31] Felipe Tobar, Thang D Bui, and Richard E Turner. Learning stationary time series using Gaussian processes with nonparametric kernels. In Advances in Neural Information Processing Systems, 2015.

[32] Surya T Tokdar and Jayanta K Ghosh. Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.

[33] Greg C. G. Wei and Martin A. Tanner. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704, 1990. ISSN 0162-1459. doi: 10.2307/2290005. URL https://www.jstor.org/stable/2290005.

[34] Andrew Wilson and Ryan Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pages 1067–1075, 2013.

[35] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.

[36] Andrew G Wilson, Elad Gilboa, Arye Nehorai, and John P Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626–3634, 2014.

[37] Andrew G Wilson, Christoph Dann, Chris Lucas, and Eric P Xing. The human kernel. In Advances in Neural Information Processing Systems, pages 2854–2862, 2015.

[38] Andrew Gordon Wilson.
Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014.

[39] Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. À la carte – learning fast kernels. In Artificial Intelligence and Statistics, pages 1098–1106, 2015.