{"title": "Gaussian Process Random Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 3357, "page_last": 3365, "abstract": "Gaussian processes have been successful in both supervised and unsupervised machine learning tasks, but their computational complexity has constrained practical applications. We introduce a new approximation for large-scale Gaussian processes, the Gaussian Process Random Field (GPRF), in which local GPs are coupled via pairwise potentials. The GPRF likelihood is a simple, tractable, and parallelizeable approximation to the full GP marginal likelihood, enabling latent variable modeling and hyperparameter selection on large datasets. We demonstrate its effectiveness on synthetic spatial data as well as a real-world application to seismic event location.", "full_text": "Gaussian Process Random Fields\n\nDavid A. Moore and Stuart J. Russell\n\n{dmoore, russell}@cs.berkeley.edu\n\nBerkeley, CA 94709\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nAbstract\n\nGaussian processes have been successful in both supervised and unsupervised\nmachine learning tasks, but their computational complexity has constrained prac-\ntical applications. We introduce a new approximation for large-scale Gaussian\nprocesses, the Gaussian Process Random Field (GPRF), in which local GPs are\ncoupled via pairwise potentials. The GPRF likelihood is a simple, tractable, and\nparallelizeable approximation to the full GP marginal likelihood, enabling latent\nvariable modeling and hyperparameter selection on large datasets. We demonstrate\nits effectiveness on synthetic spatial data as well as a real-world application to\nseismic event location.\n\n1\n\nIntroduction\n\nMany machine learning tasks can be framed as learning a function given noisy information about\nits inputs and outputs. 
In regression and classification, we are given inputs and asked to predict the outputs; by contrast, in latent variable modeling we are given a set of outputs and asked to reconstruct the inputs that could have produced them. Gaussian processes (GPs) are a flexible class of probability distributions on functions that allow us to approach function-learning problems from an appealingly principled and clean Bayesian perspective. Unfortunately, the time complexity of exact GP inference is O(n^3), where n is the number of data points. This makes exact GP calculations infeasible for real-world data sets with n > 10000.

Many approximations have been proposed to escape this limitation. One particularly simple approximation is to partition the input space into smaller blocks, replacing a single large GP with a multitude of local ones. This gains tractability at the price of a potentially severe independence assumption.

In this paper we relax the strong independence assumptions of independent local GPs, proposing instead a Markov random field (MRF) of local GPs, which we call a Gaussian Process Random Field (GPRF). A GPRF couples local models via pairwise potentials that incorporate covariance information. This yields a surrogate for the full GP marginal likelihood that is simple to implement and can be tractably evaluated and optimized on large datasets, while still enforcing a smooth covariance structure. The task of approximating the marginal likelihood is motivated by unsupervised applications such as the GP latent variable model [1], but examining the predictions made by our model also yields a novel interpretation of the Bayesian Committee Machine [2].

We begin by reviewing GPs and MRFs, and some existing approximation methods for large-scale GPs. In Section 3 we present the GPRF objective and examine its properties as an approximation to the full GP marginal likelihood.
We then evaluate it on synthetic data as well as an application to seismic event location.

Figure 1: Predictive distributions on a toy regression problem. (a) Full GP. (b) Local GPs. (c) Bayesian committee machine.

2 Background

2.1 Gaussian processes

Gaussian processes [3] are distributions on real-valued functions. GPs are parameterized by a mean function μ_θ(x), typically assumed without loss of generality to be μ(x) = 0, and a covariance function (sometimes called a kernel) k_θ(x, x'), with hyperparameters θ. A common choice is the squared exponential covariance, k_SE(x, x') = σ_f^2 exp( −(1/2) ||x − x'||^2 / ℓ^2 ), with hyperparameters σ_f^2 and ℓ specifying respectively a prior variance and correlation lengthscale.

We say that a random function f(x) is Gaussian process distributed if, for any n input points X, the vector of function values f = f(X) is multivariate Gaussian, f ~ N(0, k_θ(X, X)). In many applications we have access only to noisy observations y = f + ε for some noise process ε. If the noise is iid Gaussian, i.e., ε ~ N(0, σ_n^2 I), then the observations are themselves Gaussian, y ~ N(0, K_y), where K_y = k_θ(X, X) + σ_n^2 I.

The most common application of GPs is to Bayesian regression [3], in which we attempt to predict the function values f* at test points X* via the conditional distribution given the training data, p(f* | y; X, X*, θ). Sometimes, however, we do not observe the training inputs X, or we observe them only partially or noisily. This setting is known as the Gaussian Process Latent Variable Model (GP-LVM) [1]; it uses GPs as a model for unsupervised learning and nonlinear dimensionality reduction. The GP-LVM setting typically involves multi-dimensional observations, Y = (y^(1), . . .
, y^(D)), with each output dimension y^(d) modeled as an independent Gaussian process. The input locations and/or hyperparameters are typically sought via maximization of the marginal likelihood

L(X, θ) = log p(Y; X, θ) = Σ_{i=1}^{D} [ −(1/2) log |K_y| − (1/2) y_i^T K_y^{-1} y_i ] + C = −(D/2) log |K_y| − (1/2) tr(K_y^{-1} Y Y^T) + C,   (1)

though some recent work [4, 5] attempts to recover an approximate posterior on X by maximizing a variational bound. Given a differentiable covariance function, this maximization is typically performed by gradient-based methods, although local maxima can be a significant concern as the marginal likelihood is generally non-convex.

2.2 Scalability and approximate inference

The main computational difficulty in GP methods is the need to invert or factor the kernel matrix K_y, which requires time cubic in n. In GP-LVM inference this must be done at every optimization step to evaluate (1) and its derivatives.

This complexity has inspired a number of approximations. The most commonly studied are inducing-point methods, in which the unknown function is represented by its values at a set of m inducing points, where m << n. These points can be chosen by maximizing the marginal likelihood in a surrogate model [6, 7] or by minimizing the KL divergence between the approximate and exact GP posteriors [8].
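As a concrete sketch of the quantities defined above, the SE covariance and the two equivalent forms of the marginal likelihood (1) can be written in a few lines of NumPy. This is our own illustration rather than the authors' code, and names such as `k_se` and `gplvm_loglik` are made up:

```python
import numpy as np

def k_se(X, X2, sigma_f=1.0, ell=1.0):
    """Squared-exponential covariance: sigma_f^2 exp(-||x - x'||^2 / (2 ell^2))."""
    d2 = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gplvm_loglik(Y, X, sigma_f=1.0, ell=1.0, sigma_n=0.1):
    """Marginal log-likelihood (1) for D independent output dimensions,
    dropping the constant C = -(nD/2) log(2 pi)."""
    n, D = Y.shape
    Ky = k_se(X, X, sigma_f, ell) + sigma_n ** 2 * np.eye(n)
    L = np.linalg.cholesky(Ky)
    logdet = 2.0 * np.log(np.diag(L)).sum()
    alpha = np.linalg.solve(Ky, Y)              # Ky^{-1} Y, shared across outputs
    per_dim = sum(-0.5 * logdet - 0.5 * Y[:, d] @ alpha[:, d] for d in range(D))
    trace_form = -0.5 * D * logdet - 0.5 * np.trace(alpha @ Y.T)
    assert np.isclose(per_dim, trace_form)      # the two forms in (1) agree
    return trace_form
```

The Cholesky factor gives both the log-determinant and the solves, avoiding an explicit inverse; the internal assertion checks that the per-dimension sum and the trace form of (1) coincide.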
Inference in such models can typically be done in O(nm^2) time, but this comes at the price of reduced representational capacity: while smooth functions with long lengthscales may be compactly represented by a small number of inducing points, for quickly-varying functions with significant local structure it may be difficult to find any faithful representation more compact than the complete set of training observations.

A separate class of approximations, so-called "local" GP methods [3, 9, 10], involves partitioning the inputs into blocks of m points each, then modeling each block with an independent Gaussian process. If the partition is spatially local, this corresponds to a covariance function that imposes independence between function values in different regions of the input space. Computationally, each block requires only O(m^3) time; the total time is linear in the number of blocks. Local approximations preserve short-lengthscale structure within each block, but their harsh independence assumptions can lead to predictive discontinuities and inaccurate uncertainties (Figure 1b). These assumptions are problematic for GP-LVM inference because the marginal likelihood becomes discontinuous at block boundaries. Nonetheless, local GPs sometimes work very well in practice, achieving results comparable to more sophisticated methods in a fraction of the time [11].

The Bayesian Committee Machine (BCM) [2] attempts to improve on independent local GPs by averaging the predictions of multiple GP experts. The model is formally equivalent to an inducing-point model in which the test points are the inducing points, i.e., it assumes that the training blocks are conditionally independent given the test data.
The BCM can yield high-quality predictions that avoid the pitfalls of local GPs (Figure 1c), while maintaining scalability to very large datasets [12]. However, as a purely predictive approximation, it is unhelpful in the GP-LVM setting, where we are interested in the likelihood of our training set irrespective of any particular test data. The desire for a BCM-style approximation to the marginal likelihood was part of the motivation for the present work; in Section 3.2 we show that the GPRF proposed in this paper can be viewed as such a model.

Mixture-of-experts models [13, 14] extend the local GP concept in a different direction: instead of deterministically assigning points to GP models based on their spatial locations, they treat the assignments as unobserved random variables and do inference over them. This allows the model to adapt to different functional characteristics in different regions of the space, at the price of a more difficult inference task. We are not aware of mixture-of-experts models being applied in the GP-LVM setting, though this should in principle be possible.

Simple building blocks are often combined to create more complex approximations. The PIC approximation [15] blends a global inducing-point model with local block-diagonal covariances, thus capturing a mix of global and local structure, though with the same boundary discontinuities as in "vanilla" local GPs. A related approach is the use of covariance functions with compact support [16] to capture local variation in concert with global inducing points.
[11] surveys and compares several approximate GP regression methods on synthetic and real-world datasets.

Finally, we note here the similar title of [17], which is in fact orthogonal to the present work: they use a random field as a prior on input locations, whereas this paper defines a random field decomposition of the GP model itself, which may be combined with any prior on X.

2.3 Markov Random Fields

We recall some basic theory regarding Markov random fields (MRFs), also known as undirected graphical models [18]. A pairwise MRF consists of an undirected graph (V, E), along with node potentials ψ_i and edge potentials ψ_ij, which define an energy function on a random vector y,

E(y) = Σ_{i∈V} ψ_i(y_i) + Σ_{(i,j)∈E} ψ_ij(y_i, y_j),   (2)

where y is partitioned into components y_i identified with nodes in the graph. This energy in turn defines a probability density, the "Gibbs distribution", given by p(y) = (1/Z) exp(−E(y)) where Z = ∫ exp(−E(z)) dz is a normalizing constant.

Gaussian random fields are the special case of pairwise MRFs in which the Gibbs distribution is multivariate Gaussian. Given a partition of y into sub-vectors y_1, y_2, . . . , y_M, a zero-mean Gaussian distribution with covariance K and precision matrix J = K^{-1} can be expressed by potentials

ψ_i(y_i) = −(1/2) y_i^T J_ii y_i,    ψ_ij(y_i, y_j) = −y_i^T J_ij y_j,   (3)

where J_ij is the submatrix of J corresponding to the sub-vectors y_i, y_j. The normalizing constant Z = (2π)^{n/2} |K|^{1/2} involves the determinant of the covariance matrix. Since edges whose potentials are zero can be dropped without effect, the nonzero entries of the precision matrix can be seen as specifying the edges present in the graph.

3 Gaussian Process Random Fields

We consider a vector of n real-valued^1 observations y ~ N(0, K_y) modeled by a GP, where K_y is implicitly a function of input locations X and hyperparameters θ. Unless otherwise specified, all probabilities p(y_i), p(y_i, y_j), etc., refer to marginals of this full GP. We would like to perform gradient-based optimization on the marginal likelihood (1) with respect to X and/or θ, but suppose that the cost of doing so directly is prohibitive.

In order to proceed, we assume a partition y = (y_1, y_2, . . . , y_M) of the observations into M blocks of size at most m, with an implied corresponding partition X = (X_1, X_2, . . . , X_M) of the (perhaps unobserved) inputs. The source of this partition is not a focus of the current work; we might imagine that the blocks correspond to spatially local clusters of input points, assuming that we have noisy observations of the X values or at least a reasonable guess at an initialization. We let K_ij = cov_θ(y_i, y_j) denote the appropriate submatrix of K_y, and J_ij denote the corresponding submatrix of the precision matrix J_y = K_y^{-1}; note that J_ij ≠ (K_ij)^{-1} in general.

3.1 The GPRF Objective

Given the precision matrix J_y, we could use (3) to represent the full GP distribution in factored form as an MRF. This is not directly useful, since computing J_y requires cubic time. Instead we propose approximating the marginal likelihood via a random field in which local GPs are connected by pairwise potentials.
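As a quick numerical check of the factored representation (our sketch, not from the paper; `gaussian_mrf_logpot` is a made-up name), summing Gaussian potentials of the form ψ_i = −(1/2) y_i^T J_ii y_i and ψ_ij = −y_i^T J_ij y_j over a block partition recovers the unnormalized Gaussian log-density −(1/2) y^T J y:

```python
import numpy as np

def gaussian_mrf_logpot(y, J, blocks):
    """Sum pairwise-MRF Gaussian potentials over a block partition:
    psi_i(y_i) = -1/2 y_i^T J_ii y_i and psi_ij(y_i, y_j) = -y_i^T J_ij y_j
    for i < j. By symmetry of J the total equals -1/2 y^T J y."""
    total = 0.0
    for a, bi in enumerate(blocks):
        total += -0.5 * y[bi] @ J[np.ix_(bi, bi)] @ y[bi]
        for bj in blocks[a + 1:]:
            total -= y[bi] @ J[np.ix_(bi, bj)] @ y[bj]
    return total
```

Each edge (i, j) is counted once with i < j, which absorbs the factor of 2 from the symmetric off-diagonal blocks of J.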
Given an edge set which we will initially take to be the complete graph, E = {(i, j) | 1 ≤ i < j ≤ M}, our approximate objective is

q_GPRF(y; X, θ) = Π_{i=1}^{M} p(y_i) Π_{(i,j)∈E} [ p(y_i, y_j) / (p(y_i) p(y_j)) ] = Π_{i=1}^{M} p(y_i)^{1−|E_i|} Π_{(i,j)∈E} p(y_i, y_j),   (4)

where E_i denotes the neighbors of i in the graph, and p(y_i) and p(y_i, y_j) are marginal probabilities under the full GP; equivalently they are the likelihoods of local GPs defined on the points X_i and X_i ∪ X_j respectively. Note that these local likelihoods depend implicitly on X and θ. Taking the log, we obtain the energy function of an unnormalized MRF

log q_GPRF(y; X, θ) = Σ_{i=1}^{M} (1 − |E_i|) log p(y_i) + Σ_{(i,j)∈E} log p(y_i, y_j),   (5)

with potentials

ψ_i^GPRF(y_i) = (1 − |E_i|) log p(y_i),    ψ_ij^GPRF(y_i, y_j) = log p(y_i, y_j).   (6)

We refer to the approximate objective (5) as q_GPRF rather than p_GPRF to emphasize that it is not in general a normalized probability density. It can be interpreted as a "Bethe-type" approximation [19], in which a joint density is approximated via overlapping pairwise marginals. In the special case that the full precision matrix J_y induces a tree structure on the blocks of our partition, q_GPRF recovers the exact marginal likelihood. (This is shown in the supplementary material.) In general this will not be the case, but in the spirit of loopy belief propagation [20], we consider the tree-structured case as an approximation for the general setting.

Before further analyzing the nature of the approximation, we first observe that as a sum of local Gaussian log-densities, the objective (5) is straightforward to implement and fast to evaluate. Each
Each\nof the O(M 2) pairwise densities requires O((2m)3) = O(m3) time, for an overall complexity of\n\n1The extension to multiple independent outputs is straightforward.\n\n4\n\n\fO(M 2m3) = O(n2m) when M = n/m. The quadratic dependence on n cannot be avoided by any\nalgorithm that computes similarities between all pairs of training points; however, in practice we\nwill consider \u201clocal\u201d modi\ufb01cations in which E is something smaller than all pairs of blocks. For\nexample, if each block is connected only to a \ufb01xed number of spatial neighbors, the complexity\nreduces to O(nm2), i.e., linear in n. In the special case where E is the empty set, we recover the\nexact likelihood of independent local GPs.\nIt is also straightforward to obtain the gradient of (5) with respect to hyperparameters \u03b8 and inputs\nX, by summing the gradients of the local densities. The likelihood and gradient for each term in the\nsum can be evaluated independently using only local subsets of the training data, enabling a simple\nparallel implementation.\nHaving seen that qGP RF can be optimized ef\ufb01ciently, it remains for us to argue its validity as a proxy\nfor the full GP marginal likelihood. Due to space constraints we defer proofs to the supplementary\nmaterial, though our results are not dif\ufb01cult. We \ufb01rst show that, like the full marginal likelihood (1),\nqGP RF has the form of a Gaussian distribution, but with a different precision matrix.\nTheorem 1. 
The objective q_GPRF has the form of an unnormalized Gaussian density with precision matrix J̃, with blocks J̃_ij given by

J̃_ii = K_ii^{-1} + Σ_{j∈E_i} ( Q^(ij)_11 − K_ii^{-1} ),    J̃_ij = { Q^(ij)_12 if (i, j) ∈ E;  0 otherwise },   (7)

where Q^(ij) is the local precision matrix defined as the inverse of the marginal covariance,

Q^(ij) = ( Q^(ij)_11  Q^(ij)_12 ; Q^(ij)_21  Q^(ij)_22 ) = ( K_ii  K_ij ; K_ji  K_jj )^{-1}.

Although the Gaussian density represented by q_GPRF is not in general normalized, we show that it is approximately normalized in a certain sense.

Theorem 2. The objective q_GPRF is approximately normalized in the sense that the optimal value of the Bethe free energy [19],

F_B(b) = Σ_{i∈V} ∫ b_i(y_i) [ (1 − |E_i|) ln b_i(y_i) − ln ψ_i(y_i) ] dy_i + Σ_{(i,j)∈E} ∫ b_ij(y_i, y_j) ln [ b_ij(y_i, y_j) / ψ_ij(y_i, y_j) ] dy_i dy_j ≈ −log Z,   (8)

the approximation to the normalizing constant found by loopy belief propagation, is precisely zero. Furthermore, this optimum is obtained when the pseudomarginals b_i, b_ij are taken to be the true GP marginals p_i, p_ij.

This implies that loopy belief propagation run on a GPRF would recover the marginals of the true GP.

3.2 Predictive equivalence to the BCM

We have introduced q_GPRF as a surrogate model for the training set (X, y); however, it is natural to extend the GPRF to make predictions at a set of test points X*, by including the function values f* = f(X*) as an (M+1)st block, with an edge to each of the training blocks.
The resulting predictive distribution,

p_GPRF(f* | y) ∝ q_GPRF(f*, y) = p(f*) ( Π_{i=1}^{M} p(y_i) Π_{(i,j)∈E} [ p(y_i, y_j) / (p(y_i) p(y_j)) ] ) ( Π_{i=1}^{M} [ p(y_i, f*) / (p(y_i) p(f*)) ] ) ∝ p(f*)^{1−M} Π_{i=1}^{M} p(f* | y_i),   (9)

corresponds exactly to the prediction of the Bayesian Committee Machine (BCM) [2]. This motivates the GPRF as a natural extension of the BCM as a model for the training set, providing an alternative to the standard transductive interpretation of the BCM.^2 A similar derivation shows that the conditional distribution of any block y_i given all other blocks y_{j≠i} also takes the form of a BCM prediction, suggesting the possibility of pseudolikelihood training [21], i.e., directly optimizing the quality of BCM predictions on held-out blocks (not explored in this paper).

Figure 2: Inferred locations on synthetic data (n = 10000), colored by the first output dimension y_1. (a) Noisy observed locations: mean error 2.48. (b) Full GP: 0.21. (c) GPRF-100: 0.36 (showing grid cells). (d) FITC-500: 4.86 (with inducing points; note contraction).

4 Experiments

4.1 Uniform Input Distribution

We first consider a 2D synthetic dataset intended to simulate spatial location tasks such as WiFi-SLAM [22] or seismic event location (below), in which we observe high-dimensional measurements but have only noisy information regarding the locations at which those measurements were taken. We sample n points uniformly from the square of side length √n to generate the true inputs X, then sample 50-dimensional output Y from independent GPs with SE kernel k(r) = exp(−(r/ℓ)^2) for ℓ = 6.0 and noise standard deviation σ_n = 0.1.
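For Gaussian experts, the product of local predictives in the BCM combination (9) has a closed form: the experts' predictive precisions add, with M−1 copies of the prior precision subtracted. A minimal sketch (our own code, with the made-up name `bcm_combine`, not the authors' implementation):

```python
import numpy as np

def bcm_combine(prior_cov, experts):
    """Combine local Gaussian predictions as in (9):
    p(f*|y) ∝ p(f*)^{1-M} Π_i p(f*|y_i).
    experts: list of (mean_i, cov_i) predictions from the M local GPs."""
    M = len(experts)
    prec = -(M - 1) * np.linalg.inv(prior_cov)  # the p(f*)^{1-M} factor
    h = np.zeros(len(prior_cov))
    for mu_i, C_i in experts:
        P_i = np.linalg.inv(C_i)
        prec += P_i                             # expert precisions add
        h += P_i @ mu_i
    cov = np.linalg.inv(prec)
    return cov @ h, cov
```

With a single expert (M = 1) the prior correction vanishes and the combination reduces to that expert's own posterior, as expected from (9).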
The observed points X_obs ~ N(X, σ_obs^2 I) arise by corrupting X with isotropic Gaussian noise of standard deviation σ_obs = 2. The parameters ℓ, σ_n, and σ_obs were chosen to generate problems with interesting short-lengthscale structure for which GP-LVM optimization could nontrivially improve the initial noisy locations. Figure 2a shows a typical sample from this model.

For local GPs and GPRFs, we take the spatial partition to be a grid with n/m cells, where m is the desired number of points per cell. The GPRF edge set E connects each cell to its eight neighbors (Figure 2c), yielding linear time complexity O(nm^2). During optimization, a practical choice is necessary: do we use a fixed partition of the points, or re-assign points to cells as they cross spatial boundaries? The latter corresponds to a coherent (block-diagonal) spatial covariance function, but introduces discontinuities to the marginal likelihood. In our experiments the GPRF was not sensitive to this choice, but local GPs performed more reliably with fixed spatial boundaries (in spite of the discontinuities), so we used this approach for all experiments.

For comparison, we also evaluate the Sparse GP-LVM, implemented in GPy [23], which uses the FITC approximation to the marginal likelihood [7]. (We also considered the Bayesian GP-LVM [4], but found it to be more resource-intensive with no meaningful difference in results on this problem.) Here the approximation parameter m is the number of inducing points.

We ran L-BFGS optimization to recover maximum a posteriori (MAP) locations, or local optima thereof. Figure 3a shows mean location error (Euclidean distance) for n = 10000 points; at this size it is tractable to compare directly to the full GP-LVM.
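The grid partition with its eight-neighbor edge set, feeding the GPRF objective (5), can be sketched as follows (our code, not the authors'; `grid_partition`, `mvn_logpdf`, and `gprf_loglik` are made-up names). With an empty edge set the objective reduces to independent local GPs, and, per Section 3.1, it is exact when the blocks form a tree; an Ornstein-Uhlenbeck kernel on sorted 1-D inputs has a block-tridiagonal precision, so a chain edge set recovers the exact log-likelihood there:

```python
import numpy as np

def grid_partition(X, cell):
    """Partition 2-D points into square grid cells and connect each cell
    to its (up to) eight occupied neighbors."""
    keys = [tuple(k) for k in np.floor(X / cell).astype(int)]
    cells = {}
    for idx, k in enumerate(keys):
        cells.setdefault(k, []).append(idx)
    order = {k: a for a, k in enumerate(cells)}
    edges = {tuple(sorted((order[k], order[(k[0] + di, k[1] + dj)])))
             for k in cells for di in (-1, 0, 1) for dj in (-1, 0, 1)
             if (di, dj) != (0, 0) and (k[0] + di, k[1] + dj) in order}
    return list(cells.values()), sorted(edges)

def mvn_logpdf(y, K):
    """log N(y; 0, K) via a Cholesky factorization."""
    L = np.linalg.cholesky(K)
    z = np.linalg.solve(L, y)
    return -0.5 * z @ z - np.log(np.diag(L)).sum() - 0.5 * len(y) * np.log(2 * np.pi)

def gprf_loglik(y, K, blocks, edges):
    """Objective (5): sum_i (1-|E_i|) log p(y_i) + sum_{(i,j) in E} log p(y_i, y_j),
    where every term is a Gaussian over a small submatrix of the full covariance K."""
    deg = np.zeros(len(blocks), dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    ll = sum((1 - deg[i]) * mvn_logpdf(y[b], K[np.ix_(b, b)])
             for i, b in enumerate(blocks))
    for i, j in edges:
        b = blocks[i] + blocks[j]
        ll += mvn_logpdf(y[b], K[np.ix_(b, b)])
    return ll
```

Each term touches only a local submatrix of K, which is what makes the objective parallelizable over blocks and edges.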
The GPRF with a large block size (m = 1111, corresponding to a 3x3 grid) nearly matches the solution quality of the full GP while requiring less time, while the local methods are quite fast to converge but become stuck at inferior optima. The FITC optimization exhibits an interesting pathology: it initially moves towards a good solution but then diverges towards what turns out to correspond to a contraction of the space (Figure 2d); we conjecture this is because there are not enough inducing points to faithfully represent the full GP distribution over the entire space. A partial fix is to allow FITC to jointly optimize over locations and the correlation lengthscale ℓ; this yielded a biased lengthscale estimate ℓ̂ ≈ 7.6 but more accurate locations (FITC-500-ℓ in Figure 3a).

Figure 3: Results on synthetic data. (a) Mean location error over time for n = 10000, including comparison to full GP. (b) Mean error at convergence as a function of n, with learned lengthscale. (c) Mean location error over time for n = 80000.

To evaluate scaling behavior, we next considered problems of increasing size up to n = 80000.^3 Out of generosity to FITC we allowed each method to learn its own preferred lengthscale. Figure 3b reports the solution quality at convergence, showing that even with an adaptive lengthscale, FITC requires increasingly many inducing points to compete in large spatial domains. This is intractable for larger problems due to O(m^3) scaling; indeed, attempts to run at n > 55000 with 2000 inducing points exceeded 32GB of available memory.

^2 The GPRF is still transductive, in the sense that adding a test block f* will change the marginal distribution on the training observations y, as can be seen explicitly in the precision matrix (7). The contribution of the GPRF is that it provides a reasonable model for the training-set likelihood even in the absence of test points.
Recently, more sophisticated inducing-point methods have claimed scalability to very large datasets [24, 25], but they do so with m ≤ 1000; we expect that they would hit the same fundamental scaling constraints for problems that inherently require many inducing points.

On our largest synthetic problem, n = 80000, inducing-point approximations are intractable, as is the full GP-LVM. Local GPs converge more quickly than GPRFs of equal block size, but the GPRFs find higher-quality solutions (Figure 3c). After a short initial period, the best performance always belongs to a GPRF, and at the conclusion of 24 hours the best GPRF solution achieves mean error 42% lower than the best local solution (0.18 vs 0.31).

4.2 Seismic event location

We next consider an application to seismic event location, which formed the motivation for this work. Seismic waves can be viewed as high-dimensional vectors generated from an underlying three-dimensional manifold, namely the Earth's crust. Nearby events tend to generate similar waveforms; we can model this spatial correlation as a Gaussian process. Prior information regarding the event locations is available from traditional travel-time-based location systems [26], which produce an independent Gaussian uncertainty ellipse for each event.

A full probability model of seismic waveforms, accounting for background noise and performing joint alignment of arrival times, is beyond the scope of this paper. To focus specifically on the ability to approximate GP-LVM inference, we used real event locations but generated synthetic waveforms by sampling from a 50-output GP using a Matérn kernel [3] with ν = 3/2 and a lengthscale of 40km. We also generated observed location estimates X_obs, by corrupting the true locations with

^3 The astute reader will wonder how we generated synthetic data on problems that are clearly too large for an exact GP.
For these synthetic problems as well as the seismic example below, the covariance matrix is relatively sparse, with only ~2% of entries corresponding to points within six kernel lengthscales of each other. By considering only these entries, we were able to draw samples using a sparse Cholesky factorization, although this required approximately 30GB of RAM. Unfortunately, this approach does not straightforwardly extend to GP-LVM inference under the exact GP, as the standard expression for the marginal likelihood derivatives

∂/∂x_i log p(y) = (1/2) tr( ( (K_y^{-1} y)(K_y^{-1} y)^T − K_y^{-1} ) ∂K_y/∂x_i )

involves the full precision matrix K_y^{-1}, which is not sparse in general. Bypassing this expression via automatic differentiation through the sparse Cholesky decomposition could perhaps allow exact GP-LVM inference to scale to somewhat larger problems.

Figure 4: Seismic event location task. (a) Event map for seismic dataset. (b) Mean location error over time.

Gaussian noise of standard deviation 20km in each dimension. Given the observed waveforms and noisy locations, we are interested in recovering the latitude, longitude, and depth of each event.

Our dataset consists of 107556 events detected at the Mankachi array station in Kazakhstan between 2004 and 2012. Figure 4a shows the event locations, colored to reflect a principal axis tree partition [27] into blocks of 400 points (tree construction time was negligible). The GPRF edge set contains all pairs of blocks for which any two points had initial locations within one kernel lengthscale (40km) of each other.
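The lengthscale-threshold edge set just described can be sketched as follows (our code, with the made-up name `lengthscale_edges`; this brute-force version compares all block pairs, whereas a spatial tree partition like the paper's would permit faster neighbor queries):

```python
import numpy as np

def lengthscale_edges(X, blocks, ell):
    """Connect blocks i < j whenever some pair of their points lies within one
    kernel lengthscale, i.e. the minimum inter-block distance is <= ell."""
    edges = []
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            d = np.linalg.norm(X[blocks[i]][:, None, :] - X[blocks[j]][None, :, :],
                               axis=-1)
            if d.min() <= ell:
                edges.append((i, j))
    return edges
```

Shrinking `ell` prunes edges, which both speeds up each optimization step and keeps the pairwise approximation local, matching the tradeoff discussed next.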
We also evaluated longer-distance connections, but found that this relatively local edge set had the best performance/time tradeoffs: eliminating edges not only speeds up each optimization step, but in some cases actually yielded faster per-step convergence (perhaps because denser edge sets tended to create large cliques, for which the pairwise GPRF objective is a poor approximation).

Figure 4b shows the quality of recovered locations as a function of computation time; we jointly optimized over event locations as well as two lengthscale parameters (surface distance and depth) and the noise variance σ_n^2. Local GPs perform quite well on this task, but the best GPRF achieves 7% lower mean error than the best local GP model (12.8km vs 13.7km, respectively) given equal time. An even better result can be obtained by using the results of a local GP optimization to initialize a GPRF. Using the same partition (m = 800) for both local GPs and the GPRF, this "hybrid" method gives the lowest final error (12.2km), and is dominant across a wide range of wall clock times, suggesting it as a promising practical approach for large GP-LVM optimizations.

5 Conclusions and Future Work

The Gaussian process random field is a tractable and effective surrogate for the GP marginal likelihood. It has the flavor of approximate inference methods such as loopy belief propagation, but can be analyzed precisely in terms of a deterministic approximation to the inverse covariance, and provides a new training-time interpretation of the Bayesian Committee Machine. It is easy to implement and can be straightforwardly parallelized.

One direction for future work involves finding partitions for which a GPRF performs well, e.g., partitions that induce a block near-tree structure.
A perhaps related question is identifying when the GPRF objective defines a normalizable probability distribution (beyond the case of an exact tree structure) and under what circumstances it is a good approximation to the exact GP likelihood.

The evaluation in this paper focuses on spatial data; however, both local GPs and the BCM have been successfully applied to high-dimensional regression problems [11, 12], so exploring the effectiveness of the GPRF for dimensionality reduction tasks would also be interesting. Another useful avenue is to integrate the GPRF framework with other approximations: since the GPRF and inducing-point methods have complementary strengths – the GPRF is useful for modeling a function over a large space, while inducing points are useful when the density of available data in some region of the space exceeds what is necessary to represent the function – an integrated method might enable new applications for which neither approach alone would be sufficient.

Acknowledgements

We thank the anonymous reviewers for their helpful suggestions. This work was supported by DTRA grant #HDTRA-11110026, and by computing resources donated by Microsoft Research under an Azure for Research grant.

References

[1] Neil D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. Advances in Neural Information Processing Systems (NIPS), 2004.
[2] Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719-2741, 2000.
[3] Carl Rasmussen and Chris Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[4] Michalis K. Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
[5] Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes.
Journal of Machine Learning Research (JMLR), 2015.
[6] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research (JMLR), 6:1939–1959, 2005.
[7] Neil D Lawrence. Learning for larger datasets with the Gaussian process latent variable model. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
[8] Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
[9] Duy Nguyen-Tuong, Matthias Seeger, and Jan Peters. Model learning with local Gaussian process regression. Advanced Robotics, 23(15):2015–2034, 2009.
[10] Chiwoo Park, Jianhua Z Huang, and Yu Ding. Domain decomposition approach for fast Gaussian process regression of large spatial data sets. Journal of Machine Learning Research (JMLR), 12:1697–1728, 2011.
[11] Krzysztof Chalupka, Christopher KI Williams, and Iain Murray. A framework for evaluating approximation methods for Gaussian process regression. Journal of Machine Learning Research (JMLR), 14:333–350, 2013.
[12] Marc Peter Deisenroth and Jun Wei Ng. Distributed Gaussian processes. In International Conference on Machine Learning (ICML), 2015.
[13] Carl Edward Rasmussen and Zoubin Ghahramani. Infinite mixtures of Gaussian process experts. Advances in Neural Information Processing Systems (NIPS), pages 881–888, 2002.
[14] Trung Nguyen and Edwin Bonilla. Fast allocation of Gaussian process experts. In International Conference on Machine Learning (ICML), pages 145–153, 2014.
[15] Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In Artificial Intelligence and Statistics (AISTATS), 2007.
[16] Jarno Vanhatalo and Aki Vehtari.
Modelling local and global phenomena with sparse Gaussian processes. In Uncertainty in Artificial Intelligence (UAI), 2008.
[17] Guoqiang Zhong, Wu-Jun Li, Dit-Yan Yeung, Xinwen Hou, and Cheng-Lin Liu. Gaussian process latent random field. In AAAI Conference on Artificial Intelligence, 2010.
[18] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[19] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems (NIPS), 13, 2001.
[20] Kevin P Murphy, Yair Weiss, and Michael I Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence (UAI), pages 467–475, 1999.
[21] Julian Besag. Statistical analysis of non-lattice data. The Statistician, pages 179–195, 1975.
[22] Brian Ferris, Dieter Fox, and Neil D Lawrence. WiFi-SLAM using Gaussian process latent variable models. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2480–2485, 2007.
[23] The GPy authors. GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy, 2012–2015.
[24] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI), page 282, 2013.
[25] Yarin Gal, Mark van der Wilk, and Carl Rasmussen. Distributed variational inference in sparse Gaussian process regression and latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2014.
[26] International Seismological Centre. On-line Bulletin. Int. Seis. Cent., Thatcham, United Kingdom, 2015. http://www.isc.ac.uk.
[27] James McNames. A fast nearest-neighbor algorithm based on a principal axis search tree.
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(9):964–976, 2001.