{"title": "Tree-structured Gaussian Process Approximations", "book": "Advances in Neural Information Processing Systems", "page_first": 2213, "page_last": 2221, "abstract": "Gaussian process regression can be accelerated by constructing a small pseudo-dataset to summarise the observed data. This idea sits at the heart of many approximation schemes, but such an approach requires the number of pseudo-datapoints to be scaled with the range of the input space if the accuracy of the approximation is to be maintained. This presents problems in time-series settings or in spatial datasets where large numbers of pseudo-datapoints are required since computation typically scales quadratically with the pseudo-dataset size. In this paper we devise an approximation whose complexity grows linearly with the number of pseudo-datapoints. This is achieved by imposing a tree or chain structure on the pseudo-datapoints and calibrating the approximation using a Kullback-Leibler (KL) minimisation. Inference and learning can then be performed efficiently using the Gaussian belief propagation algorithm. We demonstrate the validity of our approach on a set of challenging regression tasks including missing data imputation for audio and spatial datasets. We trace out the speed-accuracy trade-off for the new method and show that the frontier dominates those obtained from a large number of existing approximation techniques.", "full_text": "Tree-structured Gaussian Process Approximations\n\nThang Bui\n\ntdb40@cam.ac.uk\n\nRichard Turner\n\nret26@cam.ac.uk\n\nComputational and Biological Learning Lab, Department of Engineering\nUniversity of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, UK\n\nAbstract\n\nGaussian process regression can be accelerated by constructing a small pseudo-\ndataset to summarize the observed data. 
This idea sits at the heart of many approximation schemes, but such an approach requires the number of pseudo-datapoints to be scaled with the range of the input space if the accuracy of the approximation is to be maintained. This presents problems in time-series settings or in spatial datasets where large numbers of pseudo-datapoints are required since computation typically scales quadratically with the pseudo-dataset size. In this paper we devise an approximation whose complexity grows linearly with the number of pseudo-datapoints. This is achieved by imposing a tree or chain structure on the pseudo-datapoints and calibrating the approximation using a Kullback-Leibler (KL) minimization. Inference and learning can then be performed efficiently using the Gaussian belief propagation algorithm. We demonstrate the validity of our approach on a set of challenging regression tasks including missing data imputation for audio and spatial datasets. We trace out the speed-accuracy trade-off for the new method and show that the frontier dominates those obtained from a large number of existing approximation techniques.

1 Introduction

Gaussian Processes (GPs) provide a flexible nonparametric prior over functions which can be used as a probabilistic module in both supervised and unsupervised machine learning problems. The applicability of GPs is, however, severely limited by a burdensome computational complexity. For example, this paper will consider non-linear regression on a dataset of size N for which training scales as O(N³) and prediction as O(N²). This represents a prohibitively large computational cost for many applications. Consequently, a substantial research effort has sought to develop efficient approximation methods that side-step these significant computational demands [1–9]. 
Many of these approximation methods are based upon an intuitive idea, which is to use a smaller pseudo-dataset of size M ≪ N to summarize the observed dataset, reducing the cost for training and prediction (typically to O(NM²) and O(M²)). The methods can be usefully categorized into two non-exclusive classes according to the way in which they arrive at the pseudo-dataset. Indirect posterior approximations employ a modified generative model that is carefully constructed to be calibrated to the original, but for which inference is computationally cheaper. In practice this leads to parametric probabilistic models that inherit some of the GP's robustness to over-fitting. Direct posterior approximations, on the other hand, cut to the chase and directly calibrate an approximate posterior distribution, chosen to have favourable computational properties, to the true posterior distribution. In other words, the non-parametric model is retained, but the pseudo-datapoints provide a bottleneck at the inference stage, rather than at the modelling stage.

Pseudo-datapoint approximations have enabled GPs to be deployed in a far wider range of problems than was previously possible. However, they have a severe limitation which means many challenging datasets still remain far out of their reach. The problem arises from the fact that pseudo-dataset methods are functionally local in the sense that each pseudo-datapoint sculpts out the approximate posterior in a small region of the input space around it [10]. Consequently, when the range of the inputs is large compared to the range of the dependencies in the posterior, many pseudo-datapoints are required to maintain the accuracy of the approximation. 
In time-series settings [11–13], such as the audio denoising and missing data imputation tasks considered later in the paper, this means that the number of pseudo-datapoints must grow with the number of datapoints if restoration accuracy is to be maintained. In other words, M must be scaled with N and so pseudo-datapoint schemes have not reduced the scaling of the computational complexity. In this context, approximation methods built from a series of local GPs are perhaps more appropriate, but they suffer from discontinuities at the boundaries that are problematic in many contexts; in the audio restoration example they lead to audible artifacts. The limitations of pseudo-datapoint approximations are not restricted to the time-series setting. Many datasets in geostatistics, climate science, astronomy and other fields have large, and possibly growing, spatial extent compared to the posterior dependency length. This puts them well out of the reach of all current pseudo-datapoint approximation methods.

The purpose of this paper is to develop a new pseudo-datapoint approximation scheme which can be applied to these challenging datasets. Since the need to scale the number of pseudo-datapoints with the range of the inputs appears to be unavoidable, the approach instead focuses on reducing the computational cost of training and inference so that it is truly linear in N. This reduction in computational complexity comes from an indirect posterior approximation method which imposes additional structural restrictions on the pseudo-dataset so that it has a chain or tree structure. The paper is organized as follows: In the next section we will briefly review GP regression together with some well known pseudo-datapoint approximation methods. The tree-structured approximation is then proposed, related to previous methods, and developed in section 2. 
We demonstrate in section 3 that this new approximation is able to tractably handle far larger datasets whilst maintaining the accuracy of prediction and learning.

1.1 Regression using Gaussian Processes

This section provides a concise introduction to GP regression [14]. Suppose we have a training set comprising N D-dimensional input vectors {x_n}_{n=1}^N and corresponding real valued scalar observations {y_n}_{n=1}^N. The GP regression model assumes that each observation y_n is formed from an unknown function f(·), evaluated at input x_n, which is corrupted by independent Gaussian noise. That is, y_n = f(x_n) + ε_n where p(ε_n) = N(ε_n; 0, σ²). Typically a zero mean GP is used to specify a prior over the function f so that any finite set of function values is distributed under the prior according to a multivariate Gaussian p(f) = N(f; 0, K_ff).¹ The covariance of this Gaussian is specified by a covariance function or kernel, (K_ff)_{n,n′} = k_θ(x_n, x_{n′}), which depends upon a small number of hyper-parameters θ. The form of the covariance function and the values of the hyper-parameters encapsulate prior knowledge about the unknown function. Having specified the probabilistic model, we now consider regression tasks which typically involve predicting the function value f∗ at some unseen input x∗ (also known as missing data imputation) or estimating the function value f at a training input x_n (also known as denoising). 
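The regression model just specified admits closed-form prediction by Gaussian conditioning. As a concrete illustration, here is a minimal sketch of our own (not the paper's code) of exact GP regression with an exponentiated quadratic kernel standing in for k_θ; the Cholesky step is the source of the cubic cost discussed below:

```python
import numpy as np

def kernel(a, b, ls=1.0, var=1.0):
    # exponentiated quadratic covariance k_theta(x, x')
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def gp_predict(x, y, xs, ls=1.0, var=1.0, noise=0.1):
    """Exact GP posterior mean/variance: O(N^3) training, O(N^2) per prediction."""
    K = kernel(x, x, ls, var) + noise ** 2 * np.eye(len(x))
    L = np.linalg.cholesky(K)                 # the cubic-cost step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = kernel(xs, x, ls, var)
    mean = Ks @ alpha                         # posterior mean m_f(x*)
    V = np.linalg.solve(L, Ks.T)
    var_s = var - np.sum(V ** 2, axis=0)      # posterior marginal variances
    return mean, var_s
```

For noiseless observations of a smooth function the posterior mean tracks the data closely and the posterior variance collapses near the training inputs, as expected.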
Both of these prediction problems can be handled elegantly in the GP regression framework by noting that the posterior distribution over the function values is another Gaussian process with a mean and covariance function given by

m_f(x) = K_xf (K_ff + σ²I)⁻¹ y,    k_f(x, x′) = k(x, x′) − K_xf (K_ff + σ²I)⁻¹ K_fx′.    (1)

Here K_ff is the covariance matrix on the training set defined above and K_xf is the covariance function evaluated at pairs of test and training inputs. The hyperparameters θ and the noise variance σ² can be learnt by finding a (local) maximum of the marginal likelihood of the parameters, p(y|θ, σ) = N(y; 0, K_ff + σ²I). The origin of the cubic computational cost of GP regression is the need to compute the Cholesky decomposition of the matrix K_ff + σ²I. Once this step has been performed a subsequent prediction can be made in O(N²).

1.2 Review of Gaussian process approximation methods

There are a plethora of methods for accelerating learning and inference in GP regression. Here we provide a brief and inexhaustive survey that focuses on indirect posterior approximation schemes based on pseudo-datasets. These approximations can be understood in terms of a three stage process. In the first stage the generative model is augmented with pseudo-datapoints, that is a set of pseudo-input points {x̄_m}_{m=1}^M and (noiseless) pseudo-observations {u_m}_{m=1}^M. In the second stage some of the dependencies in the model prior distribution are removed so that inference becomes computationally tractable. In the third stage the parameterisation of the new model is chosen in such a way that it is calibrated to the old one.

¹Here and in what follows, the dependence on the input values x has been suppressed to lighten the notation.

This last stage can seem mysterious, but it can often be usefully understood as a KL divergence minimization between the true and the modified model. Perhaps the simplest example of this general approach is the Fully Independent Training Conditional (FITC) approximation [4] (see table 1). FITC removes direct dependencies between the function values f (see fig. 1) and calibrates the modified prior using the KL divergence KL(p(f, u)||q(f, u)), yielding q(f, u) = p(u) ∏_{n=1}^N p(f_n|u). That this model leads to computational advantages can perhaps most easily be seen by recognising that it is essentially a factor analysis model, with an admittedly clever parameterisation in terms of the covariance function. FITC has since been extended so that the pseudo-datapoints can have a different covariance function to the data [6] and so that some subset of the direct dependencies between the function values f are retained, as in the Partially Independent Conditional (PIC) approximation [3, 5] which generalizes the Bayesian Committee Machine [15].

There are indirect approximation methods which do not naturally fall into this general scheme. Stationary covariance functions can be approximated using a sum of M cosines, which leads to the Sparse Spectrum Gaussian Process (SSGP) [7] which has identical computational cost to FITC. An alternative prior approximation method for stationary covariance functions in the multi-dimensional time-series setting designs a linear Gaussian state space model (LGSSM) so that it approximates the prior power spectrum using a connection to stochastic differential equations (SDEs) [16]. The Kalman smoother can then be used to perform inference and learning in the new representation with a linear complexity. 
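To make the pseudo-dataset idea concrete, here is a minimal sketch of our own (not the authors' code) of FITC-style prediction: the dense covariance is replaced by a low-rank-plus-diagonal form built from M pseudo-inputs, so training costs O(NM²) rather than O(N³).

```python
import numpy as np

def rbf(a, b, ls=1.0, var=1.0):
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def fitc_predict(x, y, xu, xs, ls=1.0, var=1.0, noise=0.1):
    """FITC predictive mean in O(N M^2), via the low-rank-plus-diagonal form."""
    Kuu = rbf(xu, xu, ls, var) + 1e-8 * np.eye(len(xu))
    Kuf = rbf(xu, x, ls, var)
    Ksu = rbf(xs, xu, ls, var)
    # diag(Q_ff) where Q_ff = K_fu Kuu^{-1} K_uf
    L = np.linalg.cholesky(Kuu)
    V = np.linalg.solve(L, Kuf)               # M x N
    q_diag = np.sum(V ** 2, axis=0)
    lam = var - q_diag + noise ** 2           # diag(K_ff - Q_ff) + sigma^2
    # predictive mean K_*u (Kuu + K_uf Lam^{-1} K_fu)^{-1} K_uf Lam^{-1} y
    B = Kuu + (Kuf / lam) @ Kuf.T
    return Ksu @ np.linalg.solve(B, Kuf @ (y / lam))
```

With pseudo-inputs spaced densely relative to the kernel length-scale the prediction is close to the exact GP; the accuracy degrades exactly when the input range grows relative to M, which is the limitation the paper targets.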
This technique, however, only reduces the computational complexity for the temporal axis; the spatial complexity is still cubic, and the extension beyond the time-series setting requires a second layer of approximations, such as variational free-energy methods [17] which are known to introduce significant biases [18].

In contrast to the methods mentioned above, direct posterior approximation methods do not alter the generative model, but rather seek computational savings through a simplified representation of the posterior distribution. Examples of this type of approach include the Projected Process (PP) method [1, 2], which has since been interpreted as the expectation step in a variational free energy (VFE) optimisation scheme [8], enabling stochastic versions [19]. Similarly, the Expectation Propagation (EP) framework can also be used to devise posterior approximations with an associated hyper-parameter learning scheme [9]. All of these methods employ a pseudo-dataset to parameterize the approximate posterior.

Method | KL minimization | Result
FITC* | KL(p(f, u) || q(u) ∏_n q(f_n|u)) | q(u) = p(u), q(f_n|u) = p(f_n|u)
PIC* | KL(p(f, u) || q(u) ∏_k q(f_Ck|u)) | q(u) = p(u), q(f_Ck|u) = p(f_Ck|u)
PP | KL((1/Z) p(u) p(f|u) q(y|u) || p(f, u|y)) | q(y|u) = N(y; K_fu K_uu⁻¹ u, σ²I)
VFE | KL(p(f|u) q(u) || p(f, u|y)) | q(u) ∝ p(u) exp(⟨log p(y|f)⟩_{p(f|u)})
EP | KL(q(f; u) p(y_n|f_n)/q_n(f; u) || q(f; u)) | q(f; u) ∝ p(f) ∏_m p(u_m|f_m)
Tree* | KL(p(f, u) || ∏_k q(f_Ck|u_Bk) q(u_Bk|u_par(Bk))) | q(f_Ck|u_Bk) = p(f_Ck|u_Bk), q(u_Bk|u_par(Bk)) = p(u_Bk|u_par(Bk))

Table 1: GP approximations as KL minimization. C_k and B_k are disjoint subsets of the function values and pseudo-datapoints respectively. 
Indirect posterior approximations are indicated *.

1.3 Limitations of current pseudo-dataset approximations

There is a conflict at the heart of current pseudo-dataset approximations. Whilst the effect of each pseudo-datapoint is local, the computations involving them are global. The local characteristic means that large numbers of pseudo-datapoints are required to accurately approximate complex posterior distributions. If l_d is the range of the dependencies in the posterior in dimension d and L_d is the data-range in each dimension, then approximation accuracy will be retained when M ≈ ∏_{d=1}^D L_d/l_d. Critically, for many applications this condition means that large numbers of pseudo-points are required, such as time series (L_1 ∝ N) and large spatial datasets (L_d ≫ l_d). Unfortunately, the global graphical structure means that it is computationally costly to handle such large pseudo-datasets. The obvious solution to this conflict is to use the so-called local approximation which splits the observations into disjoint blocks and models each one with a GP. This is a severe approach and this paper proposes a more elegant and accurate alternative that retains more of the graphical structure whilst still enabling local computation.

Figure 1: Graphical models of the GP model and different prior approximation schemes using pseudo-datapoints: (a) Full GP, (b) FITC, (c) PIC, (d) Tree (chain). Thick edges indicate full pairwise connections and boldface fonts denote sets of variables. The chain structured version of the new approximation is shown for clarity.

2 Tree-structured prior approximations

In this section we develop an indirect posterior approximation in the same family as FITC and PIC. In order to reduce the computational overhead of these approximations, the global graphical structure is replaced by a local one via two modifications. First, the M pseudo-datapoints are divided into K disjoint blocks of potentially different cardinality {u_Bk}_{k=1}^K and the blocks are then arranged into a tree. Second, the function values are also divided into K disjoint blocks of potentially different cardinality {f_Ck}_{k=1}^K and the blocks are assumed to be conditionally independent given the corresponding subset of pseudo-datapoints. The new graphical model is shown in fig. 1d and it can be described mathematically as follows,

q(u) = ∏_{k=1}^K q(u_Bk|u_par(Bk)),   q(f|u) = ∏_{k=1}^K q(f_Ck|u_Bk),   p(y|f) = ∏_{n=1}^N p(y_n; f_n, σ²).    (2)

Here u_par(Bk) denotes the pseudo-datapoints in the parent node of u_Bk. This is an example of prior approximation as the original likelihood function has been retained.

The next step is to calibrate the new approximate model by choosing suitable values for the distributions {q(u_Bk|u_par(Bk)), q(f_Ck|u_Bk)}_{k=1}^K. Taking an identical approach to that employed by FITC and PIC, we minimize a forward KL divergence between the true model prior and the approximation, KL(p(f, u) || ∏_k q(f_Ck|u_Bk) q(u_Bk|u_par(Bk))) (see table 1). The optimal distributions are found to be the corresponding conditional distributions in the unapproximated augmented model,

q(u_Bk|u_par(Bk)) = p(u_Bk|u_par(Bk)) = N(u_Bk; A_k u_par(Bk), Q_k),    (3)
q(f_Ck|u_Bk) = p(f_Ck|u_Bk) = N(f_Ck; C_k u_Bk, R_k).    (4)

The parameters depend upon the covariance function. 
Letting u_k = u_Bk, u_l = u_par(Bk) and f_k = f_Ck we find that,

A_k = K_{uk,ul} K_{ul,ul}⁻¹,   Q_k = K_{uk,uk} − K_{uk,ul} K_{ul,ul}⁻¹ K_{ul,uk},    (5)
C_k = K_{fk,uk} K_{uk,uk}⁻¹,   R_k = K_{fk,fk} − K_{fk,uk} K_{uk,uk}⁻¹ K_{uk,fk}.    (6)

As shown in the graphical model, the local pseudo-data separate test and training latent functions. The marginal posterior distribution of the local pseudo-data is then sufficient to obtain the approximate predictive distribution: p(f∗|y) = ∫ du_Bk p(f∗, u_Bk|y) = ∫ du_Bk p(f∗|u_Bk) p(u_Bk|y). In other words, once inference has been performed, prediction is local and therefore fast. The important question of how to assign test and training points to blocks is discussed in the next section.

We note that the tree-based prior approximation includes as special cases the full GP, PIC, FITC, the local method and local versions of PIC and FITC (see table 1 in the supplementary material). Importantly, in a time-series setting the blocks can be organized into a chain and the approximate model becomes a LGSSM. This provides a new method for approximating GPs using LGSSMs in which the state is a set of pseudo-observations, rather than, for instance, the derivatives of function values at the input locations [16].

2.1 Inference and learning

Exact inference in this approximate model proceeds efficiently using the up-down algorithm for Gaussian beliefs (see [20, Ch. 14]). The inference scheme has the same complexity as forming the model, O(KD³) ≈ O(ND²) (where D is the average number of observations per block).

Selecting the pseudo-inputs and constructing the tree. First we consider the method for dividing the observed data into blocks and selecting the pseudo-inputs. 
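Before turning to block selection, note that the per-block parameters in eqs. (5)-(6) are ordinary Gaussian conditioning operations on kernel matrices. A minimal sketch of our own (an exponentiated quadratic kernel stands in for the covariance function; names are ours):

```python
import numpy as np

def rbf(a, b, ls=1.0, var=1.0):
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def block_params(xu_k, xu_l, xf_k, ls=1.0, var=1.0):
    """Tree parameters for block k with parent l, following eqs. (5)-(6):
    u_k | u_l ~ N(A u_l, Q)  and  f_k | u_k ~ N(C u_k, R)."""
    jit = 1e-8
    Kll = rbf(xu_l, xu_l, ls, var) + jit * np.eye(len(xu_l))
    Kuu = rbf(xu_k, xu_k, ls, var) + jit * np.eye(len(xu_k))
    Kul = rbf(xu_k, xu_l, ls, var)
    Kfu = rbf(xf_k, xu_k, ls, var)
    A = Kul @ np.linalg.inv(Kll)
    Q = Kuu - A @ Kul.T                        # conditional cov of u_k | u_l
    C = Kfu @ np.linalg.inv(Kuu)
    R = rbf(xf_k, xf_k, ls, var) - C @ Kfu.T   # conditional cov of f_k | u_k
    return A, Q, C, R
```

As a sanity check, when a block's pseudo-inputs coincide with its parent's, A reduces to the identity and Q to (numerically) zero, i.e. the child state simply copies the parent.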
Typically, the block sizes will be chosen to be fairly small in order to accelerate learning and inference. For data which lie on a grid, such as the regularly sampled time-series considered later in the paper, it may be simplest to use regular blocks. An alternative, which might be more appropriate for non-regularly sampled data, is to use a k-means algorithm with the Euclidean distance score. Having blocked the observations, a random subset of the data in each block is chosen to set the pseudo-inputs. Whilst it would be possible in principle to optimize the locations of the pseudo-inputs, in practice the new approach can tractably handle a very large number of pseudo-datapoints (e.g. M ≈ N), and so optimisation is less critical than for previous approaches. Once the blocks are formed, they are fixed during hyperparameter training and prediction. Second, we consider how to construct the tree. The pair-wise distances between the cluster centers are used to define the weights between candidate edges in a graph. Kruskal's algorithm uses this information to construct an acyclic graph: the algorithm starts with a fully disconnected graph and recursively adds the edge with the smallest weight that does not introduce loops. A tree is formed from this acyclic subgraph by randomly choosing one node to be the root. This choice is arbitrary and does not affect the results of inference. The parameters of the model {A_k, Q_k, C_k, R_k}_{k=1}^K (state transitions and noise) are computed by traversing down the tree from the root to the leaves. These matrices must be recomputed at each step during learning.

Inference. It is straightforward to marginalize out the latent functions f in the graphical model, in which case the effective local likelihood becomes p(y_k|u_k) = N(y_k; C_k u_k, R_k + σ²I). The model can be recognized from the graphical model as a tree-structured Gaussian model with latent variables u and observations y. 
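In the chain (time-series) special case the construction above yields exactly a linear Gaussian state space model, and the quantities needed for inference and learning can be computed with a Kalman filter. The following is a minimal sketch of our own (not the authors' code) that builds the transition and emission parameters of eqs. (3)-(6) block by block and accumulates the recursive marginal likelihood log p(y) = Σ_k log p(y_k|y_{1:k−1}); an exponentiated quadratic kernel stands in for the covariance function:

```python
import numpy as np

def rbf(a, b, ls=1.0, var=1.0):
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def chain_loglik(blocks, noise=0.1, ls=1.0, var=1.0):
    """log p(y) for the chain-structured approximation via a Kalman filter.
    blocks: list of (x_pseudo, x_obs, y_obs) triples ordered along the input axis.
    Each block's pseudo-points form the LGSSM state; eqs. (3)-(6) supply the
    transition (A, Q) and emission (C, R + noise^2 I)."""
    total, m, P, xu_prev, Kll_prev = 0.0, None, None, None, None
    for xu, xf, y in blocks:
        Kuu = rbf(xu, xu, ls, var) + 1e-8 * np.eye(len(xu))
        if m is None:                        # root block: u_1 ~ N(0, Kuu)
            mp, Pp = np.zeros(len(xu)), Kuu
        else:                                # eq. (5): predict from parent block
            Kul = rbf(xu, xu_prev, ls, var)
            A = Kul @ np.linalg.inv(Kll_prev)
            Q = Kuu - A @ Kul.T
            mp, Pp = A @ m, A @ P @ A.T + Q
        Kfu = rbf(xf, xu, ls, var)           # eq. (6): emission parameters
        C = Kfu @ np.linalg.inv(Kuu)
        R = rbf(xf, xf, ls, var) - C @ Kfu.T
        S = C @ Pp @ C.T + R + noise ** 2 * np.eye(len(xf))
        e = y - C @ mp                       # innovation: p(y_k | y_1:k-1)
        total += -0.5 * (len(y) * np.log(2 * np.pi)
                         + np.linalg.slogdet(S)[1]
                         + e @ np.linalg.solve(S, e))
        G = Pp @ C.T @ np.linalg.inv(S)      # Kalman gain; update p(u_k | y_1:k)
        m, P = mp + G @ e, Pp - G @ C @ Pp
        xu_prev, Kll_prev = xu, Kuu
    return total
```

One pass is linear in the number of blocks; with a single block whose pseudo-points coincide with the observed inputs the result recovers the exact GP marginal likelihood, matching the claim that the full GP is a special case.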
As is shown in the supplementary material, the posterior distribution can be found using the Gaussian belief propagation algorithm (for more see [20]). The passing of messages can be scheduled so that the marginals are found after two passes (asynchronous scheduling: upwards from leaves to root and then downwards). For chain structures inference can be performed using the Kalman smoother at the same cost.

Hyperparameter learning. The marginal likelihood can be efficiently computed by the same belief propagation algorithms due to its recursive form, p(y_{1:K}|θ) = ∏_{k=1}^K p(y_k|y_{1:k−1}, θ). The derivatives can also be tractably computed as they involve only local moments:

d/dθ log p(y|θ) = Σ_{k=1}^K [ ⟨d/dθ log p(u_k|u_l)⟩_{p(u_k,u_l|y)} + ⟨d/dθ log p(y_k|u_k)⟩_{p(u_k|y)} ].    (7)

For concreteness, the explicit form of the marginal likelihood and its derivative are included in the supplementary material. We obtain point estimates of the hyperparameters by finding a (local) maximum of the marginal likelihood using the BFGS algorithm.

3 Experiments

We test the new approximation method on three challenging real-world prediction tasks² via a speed-accuracy trade-off as recommended in [21]. Following that work, we did not investigate the effects of pseudo-input optimisation. We used different datasets that had less limited spatial/temporal extent.

Experiment 1: Audio sub-band data (exponentiated quadratic kernel). In the first experiment we consider imputation of missing data in a sub-band of a speech signal. The speech signal was taken from the TIMIT database (see fig. 4), a short time Fourier transform was applied (20ms Gaussian window), and the real part of the 152Hz channel was selected for the experiments. The signal was T = 50000 samples long and 25 sections of length 80 samples were removed. 
An exponentiated quadratic kernel, k_θ(t, t′) = σ² exp(−(t − t′)²/(2l²)), was used for prediction. We compare the chain structured pseudo-datapoint approximation to FITC, VFE, SSGP, local versions of PIC (corresponding to setting A_k = 0, Q_k = K_{uk,uk} in the tree-structured approximation) and the SDE method.³ Only 20000 datapoints were used for the SDE method due to the long run times. The size of the pseudo-dataset and the number of blocks in the chain and local approximations, and the order of approximation in SDE, were varied to trace out speed-accuracy frontiers. Accuracy of the imputation was quantified using the standardized mean squared errors (SMSEs) (for other metrics, see the supplementary material). Hyperparameter learning proceeded until a convergence criterion or a maximum number of function evaluations was reached. Learning and prediction (imputation) times were recorded. We found that the chain structured method outperforms all of the other methods (see fig. 2). For example, for a fixed training time of 100s, the best performing chain provided a three-fold increase in accuracy over the local method which was the next best. A typical imputation is shown in fig. 4 (left hand side). The chain structured method was able to accurately impute the missing data whilst the local method is less accurate and more uncertain as information is not propagated between the blocks.

²Synthetic data experiments can be found in the supplementary material.

Figure 2: Experiment 1. Audio sub-band reconstruction error as a function of training time (a) and test time (b) for different approximations. The numerical labels for the chain and local methods are the number of pseudo-datapoints per block and the number of observations per block respectively, and for the SDE method are the order of approximation. For the other methods they are the size of the pseudo-dataset. 
Faster and more accurate approximations are located towards the bottom left hand corners of the plots.

Experiment 2: Audio filter data (spectral mixture). The second experiment tested the performance of the chain based approximation when more complex kernels are employed. We filtered the same speech signal using a 152Hz filter with a 50Hz bandwidth, producing a signal of length T = 50000 samples from which missing sections of length 150 samples were removed. Since the complete signal had a complex bandpass spectrum we used a spectral mixture kernel containing two components [22], k_θ(t, t′) = Σ_{k=1}² σ_k² cos(ω_k(t − t′)) exp(−(t − t′)²/(2l_k²)). We compared a chain based approximation to FITC, VFE and the local PIC method, finding it to be substantially more accurate (see fig. 3 for SMSE results and the right hand side of fig. 4 for a typical example). Results with more components showed identical trends (see supplementary material).

Experiment 3: Terrain data (two dimensional input space, exponentiated quadratic kernel). In the final experiment we tested the tree based approximation using a spatial dataset in which terrain altitude was measured as a function of geographical position.⁴ We considered a 20km by 30km region (400×600 datapoints) and tested prediction on 80 randomly positioned missing blocks of size 1km by 1km (20×20 datapoints). 
In total, this translates into about 200k/40k training/test points. We used an exponentiated quadratic kernel with different length-scales in the two input dimensions, comparing a tree-based approximation, which was constructed as described in section 2.1, to the pseudo-point approximation methods considered in the first experiment. Figure 5 shows the speed-accuracy trade-off for the various approximation methods at the test and training stages.

³Code is available at http://www.tsc.uc3m.es/~miguel/downloads.php [SSGP], http://www.gaussianprocess.org/gpml/code/matlab/doc/ [FITC], http://becs.aalto.fi/en/research/bayes/gpstuff/ [SDE] and http://mlg.eng.cam.ac.uk/thang/ [Tree+VFE].
⁴Dataset is available at http://data.gov.uk/dataset/os-terrain-50-dtm.

Figure 3: Experiment 2. Filtered audio signal reconstruction error as a function of training time (a) and test time (b) for different approximations. See caption of fig. 2 for full details.

Figure 4: Missing data imputation for experiment 1 (audio sub-band data, (a)) and experiment 2 (filtered audio data, (b)). Imputation using the chain-structured approximation (top) is more accurate and less uncertain than the predictions obtained from the local method (bottom). Blocks consisted of 5 pseudo-datapoints and 50 observations respectively. 
We found that the global approximation techniques such as FITC or SSGP could not tractably handle a sufficient number of pseudo-datapoints to support accurate imputation. The local variant of our method outperformed the other techniques, but compared poorly to the tree. Typical reconstructions from the tree, local and FITC approximations are shown in fig. 6.

Summary of experimental results. The speed-accuracy frontier for the new approximation scheme dominates those produced by the other methods over a wide range for each of the three datasets. Similar results were found for additional datasets (see supplementary material). It is perhaps not surprising that the tree approximation performs so favourably. Consider the rule-of-thumb estimate for the number of pseudo-datapoints required. Using the length-scales l_d learned by the tree-approximation as a proxy for the posterior dependency length, the estimated pseudo-dataset size required for the three datasets is M ≈ ∏_d L_d/l_d ≈ {1400, 1000, 5000}. This is at the upper end of what can be tractably handled using standard approximations. Moreover, these approximation schemes can be made arbitrarily poor by expanding the region further. The most accurate tree-structured approximation for the three datasets used {2500, 10000, 20000} datapoints respectively. The local PIC method performs more favourably than the standard approximations and is generally faster than the tree since it involves a single pass through the dataset and simpler matrix computations. However, blocking the data into independent chunks results in artifacts at the block boundaries which reduces the approximation's accuracy significantly when compared to the tree (e.g. 
if they happen to coincide with a missing region).

Figure 5: Experiment 3. Terrain data reconstruction. SMSE as a function of training time (a) and test time (b). See caption of fig. 2 for full details.

Figure 6: Experiment 3. Terrain data reconstruction. The blocks in this region of input space are organized into a tree-structure (a) with missing regions shown by the black squares. The complete terrain altitude data for the region (b). Prediction errors from three methods (c).

4 Conclusion

This paper has presented a new pseudo-datapoint approximation scheme for Gaussian process regression problems which imposes a tree or chain structure on the pseudo-dataset that is calibrated using a KL divergence. Inference and learning in the resulting approximate model proceed efficiently via Gaussian belief propagation. The computational cost of the approximation is linear in the pseudo-dataset size, improving upon the quadratic scaling of typical approaches, and opening the door to more challenging datasets than have previously been considered. Importantly, the method does not require the input data or the covariance function to have special structure (stationarity, regular sampling, time-series settings etc. are not a requirement). 
We showed that the approximation obtained superior performance in both predictive accuracy and runtime on challenging regression tasks, including audio missing-data imputation and spatial terrain prediction.

There are several directions for future work. First, the new approximation scheme should be tested on datasets with higher-dimensional input spaces, since it is not clear how well the approximation will generalize to this setting. Second, the tree structure naturally leads to (possibly distributed) online stochastic inference procedures in which gradients computed at a local block, or a collection of local blocks, are used to update hyperparameters directly, as opposed to waiting for a full pass up and down the tree. Third, the tree structure used for prediction can be decoupled from the tree structure used for training, whilst still employing the same pseudo-datapoints, potentially improving prediction.

Acknowledgements

We would like to thank the EPSRC (grant numbers EP/G050821/1 and EP/L000776/1) and Google for funding.

References

[1] M. Seeger, C. K. I. Williams, and N. D. Lawrence, “Fast forward selection to speed up sparse Gaussian process regression,” in International Conference on Artificial Intelligence and Statistics, 2003.

[2] M. Seeger, Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. PhD thesis, University of Edinburgh, 2003.

[3] J. Quiñonero-Candela and C. E.
Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” The Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.

[4] E. Snelson and Z. Ghahramani, “Sparse Gaussian processes using pseudo-inputs,” in Advances in Neural Information Processing Systems 19, pp. 1257–1264, MIT Press, 2006.

[5] E. Snelson and Z. Ghahramani, “Local and global sparse Gaussian process approximations,” in International Conference on Artificial Intelligence and Statistics, pp. 524–531, 2007.

[6] M. Lázaro-Gredilla and A. R. Figueiras-Vidal, “Inter-domain Gaussian processes for sparse inference using inducing features,” in Advances in Neural Information Processing Systems 22, pp. 1087–1095, Curran Associates, Inc., 2009.

[7] M. Lázaro-Gredilla, J. Quiñonero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal, “Sparse spectrum Gaussian process regression,” The Journal of Machine Learning Research, vol. 11, pp. 1865–1881, 2010.

[8] M. K. Titsias, “Variational learning of inducing variables in sparse Gaussian processes,” in International Conference on Artificial Intelligence and Statistics, pp. 567–574, 2009.

[9] Y. Qi, A. H. Abdel-Gawad, and T. P. Minka, “Sparse-posterior Gaussian processes for general likelihoods,” in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pp. 450–457, AUAI Press, 2010.

[10] E. Snelson, Flexible and efficient Gaussian process models for machine learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2007.

[11] R. E. Turner and M. Sahani, “Time-frequency analysis as probabilistic inference,” IEEE Transactions on Signal Processing, 2014 (early access).

[12] R. E. Turner and M.
Sahani, “Probabilistic amplitude and frequency demodulation,” in Advances in Neural Information Processing Systems 24, pp. 981–989, 2011.

[13] R. E. Turner, Statistical Models for Natural Sounds. PhD thesis, Gatsby Computational Neuroscience Unit, UCL, 2010.

[14] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2005.

[15] V. Tresp, “A Bayesian committee machine,” Neural Computation, vol. 12, no. 11, pp. 2719–2741, 2000.

[16] S. Särkkä, A. Solin, and J. Hartikainen, “Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering,” IEEE Signal Processing Magazine, vol. 30, pp. 51–61, July 2013.

[17] E. Gilboa, Y. Saatci, and J. Cunningham, “Scaling multidimensional inference for structured Gaussian processes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013 (early access).

[18] R. E. Turner and M. Sahani, “Two problems with variational expectation maximisation for time-series models,” in Bayesian Time Series Models (D. Barber, T. Cemgil, and S. Chiappa, eds.), ch. 5, pp. 109–130, Cambridge University Press, 2011.

[19] J. Hensman, N. Fusi, and N. Lawrence, “Gaussian processes for big data,” in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI-13), pp. 282–290, AUAI Press, 2013.

[20] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.

[21] K. Chalupka, C. K. Williams, and I. Murray, “A framework for evaluating approximation methods for Gaussian process regression,” The Journal of Machine Learning Research, vol. 14, no. 1, pp.
333–350, 2013.

[22] A. G. Wilson and R. P. Adams, “Gaussian process kernels for pattern discovery and extrapolation,” in Proceedings of the 30th International Conference on Machine Learning, pp. 1067–1075, 2013.