{"title": "A matching pursuit approach to sparse Gaussian process regression", "book": "Advances in Neural Information Processing Systems", "page_first": 643, "page_last": 650, "abstract": null, "full_text": "A Matching Pursuit Approach to Sparse Gaussian Process Regression\n\nS. Sathiya Keerthi Yahoo! Research Labs 210 S. DeLacey Avenue Pasadena, CA 91105 selvarak@yahoo-inc.com\n\nWei Chu Gatsby Computational Neuroscience Unit University College London London, WC1N 3AR, UK chuwei@gatsby.ucl.ac.uk\n\nAbstract\nIn this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while in generalization it greatly outperforms the information gain approach proposed by Seeger et al., especially on the quality of predictive distributions.\n\n1 Introduction\n\nBayesian Gaussian processes provide a promising probabilistic kernel approach to supervised learning tasks. The advantage of Gaussian process (GP) models over non-Bayesian kernel methods, such as support vector machines, comes from the explicit probabilistic formulation that yields predictive distributions for test instances and allows standard Bayesian techniques for model selection. The cost of training GP models is O(n³), where n is the number of training instances, which results in a huge computational cost for large data sets. Furthermore, when predicting a test case, a GP model requires O(n) cost for computing the mean and O(n²) cost for computing the variance. These heavy scaling properties obstruct the use of GPs in large scale problems.\n\nRecently, sparse GP models which bring down the complexity of training as well as testing have attracted considerable attention. Williams and Seeger (2001) applied the Nyström method to calculate a reduced rank approximation of the original n × n kernel matrix. 
Csató and Opper (2002) developed an on-line algorithm to maintain a sparse representation of the GP models. Smola and Bartlett (2001) proposed a forward selection scheme to approximate the log posterior probability. Candela (2004) suggested a promising alternative criterion by maximizing the approximate model evidence. Seeger et al. (2003) presented a very fast greedy selection method for building sparse GP regression models. All of these methods make efforts to select an informative subset of the training instances for the predictive model. This subset is usually referred to as the set of basis vectors, denoted as I. The maximal size of I is usually limited by a value d_max. As d_max ≪ n, the sparseness greatly alleviates the computational burden in both training and prediction of the GP models. The performance of the resulting sparse GP models crucially depends on the criterion used in the basis vector selection. Motivated by the ideas of Matching Pursuit (Vincent and Bengio, 2002), we propose a new criterion of greedy forward selection for sparse GP models. Our algorithm is closely related to that of Smola and Bartlett (2001), but the criterion we propose is much more efficient. Compared with the information gain method of Seeger et al. (2003), our approach yields clearly better generalization performance, while essentially having the same algorithmic complexity. We focus only on regression in this paper, but the main ideas are applicable to other supervised learning tasks. 
The paper is organized as follows: in Section 2 we present the probabilistic framework for sparse GP models; in Section 3 we describe our method of greedy forward selection after motivating it via the previous methods; in Section 4 we discuss some issues in model adaptation; in Section 5 we report results of numerical experiments that demonstrate the effectiveness of our new method.\n\n2 Sparse GPs for regression\n\nIn regression problems, we are given a training data set composed of n samples. Each sample is a pair of an input vector x_i ∈ R^m and its corresponding target y_i ∈ R. The true function value at x_i is represented as an unobservable latent variable f(x_i) and the target y_i is a noisy measurement of f(x_i). The goal is to construct a predictive model that estimates the relationship x → f(x).\n\nGaussian process regression. In standard GPs for regression, the latent variables {f(x_i)} are random variables in a zero mean Gaussian process indexed by {x_i}. The prior distribution of {f(x_i)} is a multivariate joint Gaussian, denoted as P(f) = N(f; 0, K), where f = [f(x_1), ..., f(x_n)]^T and K is the n × n covariance matrix whose ij-th element is K(x_i, x_j), K being the kernel function. The likelihood is essentially a model of the measurement noise, which is usually evaluated as a product of independent Gaussian noises, P(y|f) = N(y; f, σ²I), where y = [y_1, ..., y_n]^T and σ² is the noise variance. The posterior distribution P(f|y) ∝ P(y|f)P(f) is also exactly a Gaussian:\n\nP(f|y) = N(f; Kα*, σ²K(K + σ²I)^{-1})    (1)\n\nwhere α* = (K + σ²I)^{-1} y. For any test instance x, the predictive distribution is N(f(x); μ_x, σ_x²), where μ_x = k^T (K + σ²I)^{-1} y = k^T α*, σ_x² = K(x, x) − k^T (K + σ²I)^{-1} k, and k = [K(x_1, x), ..., K(x_n, x)]^T. The computational cost of training is O(n³), which mainly comes from the need to invert the matrix (K + σ²I) and obtain the vector α*. 
For doing predictions of a test instance the cost is O(n) to compute the mean and O(n²) for computing the variance. This heavy scaling with respect to n makes the use of standard GP computationally prohibitive on large datasets.\n\nProjected latent variables. Seeger et al. (2003) gave a neat method for working with a reduced number of latent variables, laying the foundation for forming sparse GP models. In this section we review their ideas. Instead of assuming n latent variables for all the training instances, sparse GP models assume only d latent variables placed at some chosen basis vectors {x̃_i}, denoted as a column vector f_I = [f(x̃_1), ..., f(x̃_d)]^T. The prior distribution of the sparse GP is a joint Gaussian over f_I only, i.e.,\n\nP(f_I) = N(f_I; 0, K_I)    (2)\n\nwhere K_I is the d × d covariance matrix of the basis vectors whose ij-th element is K(x̃_i, x̃_j). These latent variables are then projected to all the training instances. Under the imposed joint Gaussian prior, the conditional mean at the training instances is K_{I,·}^T K_I^{-1} f_I, where K_{I,·} is a d × n matrix of the covariance functions between the basis vectors and all the training instances. The likelihood can be evaluated by these projected latent variables as follows:\n\nP(y|f_I) = N(y; K_{I,·}^T K_I^{-1} f_I, σ²I)    (3)\n\nThe posterior is P(f_I|y) = N(f_I; K_I α*_I, σ² K_I (σ²K_I + K_{I,·} K_{I,·}^T)^{-1} K_I), where α*_I = (σ²K_I + K_{I,·} K_{I,·}^T)^{-1} K_{I,·} y. The predictive distribution at any test instance x is N(f(x); μ̃_x, σ̃_x²), where μ̃_x = k̃^T α*_I, σ̃_x² = K(x, x) − k̃^T K_I^{-1} k̃ + σ² k̃^T (σ²K_I + K_{I,·} K_{I,·}^T)^{-1} k̃, and k̃ is a column vector of the covariance functions between the basis vectors and the test instance x, i.e. k̃ = [K(x̃_1, x), ..., K(x̃_d, x)]^T.\n\nWhile the cost of training the full GP model is O(n³), the training complexity of sparse GP models is only O(n d_max²). This corresponds to the cost of forming K_I^{-1}, (σ²K_I + K_{I,·} K_{I,·}^T)^{-1} and α*_I. 
Thus, if d_max is not big, learning on large datasets is feasible via sparse GP models. Also, for these sparse models, prediction for each test instance costs O(d_max) for the mean and O(d_max²) for the variance.\n\nGenerally the basis vectors can be placed anywhere in the input space R^m. Since training instances usually cover the input space of interest quite well, it is quite reasonable to select basis vectors from just the set of training instances. For a given problem d_max is chosen to be as large as possible subject to constraints on computational time in training and/or testing. Then we use some basis selection method to find I of size d_max. This important step is taken up in Section 3.\n\nA useful optimization formulation. As pointed out by Smola and Bartlett (2001), it is useful to view the determination of the mean of the posterior as coming from an optimization problem. This viewpoint helps in the selection of basis vectors. The mean of the posterior distribution is exactly the maximum a posteriori (MAP) estimate, and it is possible to give an equivalent parametric representation of the latent variables as f = Kα, where α = [α_1, ..., α_n]^T. The MAP estimate of the full GP is equivalent to minimizing the negative logarithm of the posterior (1):\n\nmin_α π(α) := (1/2) α^T (σ²K + K^T K) α − y^T K α    (4)\n\nSimilarly, using f_I = K_I α_I for sparse GP models, the MAP estimate of the sparse GP is equivalent to minimizing the negative logarithm of the posterior, P(f_I|y):\n\nmin_{α_I} π̃(α_I) := (1/2) α_I^T (σ²K_I + K_{I,·} K_{I,·}^T) α_I − y^T K_{I,·}^T α_I    (5)\n\nSuppose α in (4) is composed of two parts, α = [α_I; α_R], where I denotes the set of basis vectors and R denotes the remaining instances. Interestingly, as pointed out by Seeger et al. (2003), the optimization problem (5) is the same as minimizing π(α) in (4) using α_I only, i.e., with the constraint α_R = 0. 
In other words, the basis vectors of the sparse GPs can be selected to minimize the negative log-posterior of the full GPs, π(α) defined as in (4).\n\n3 Selection of basis functions\n\nThe most crucial element of the sparse GP approach of the previous section is the choice of I, the set of basis vectors, which we take to be a subset of the training vectors. The cheapest method is to select the basis vectors at random from the training data set. But such a choice will not work well when d_max is much smaller than n. A principled approach is to select the I that makes the corresponding sparse GP approximate well the posterior distribution of the full GP. The optimization formulation of the previous section is useful here. It would be ideal to choose, among all subsets I of size d_max, the one that gives the best value of π̃ in (5). But this requires a combinatorial search that is infeasible for large problems. A practical approach is to do greedy forward selection. This is the approach used in previous methods as well as in our method.\n\nBefore we go into the details of the methods, let us give a brief discussion of the time complexities associated with forward selection. There are two costs involved. (1) There is a basic cost associated with updating of the sparse GP solution, given a sequence of chosen basis functions. Let us refer to this cost as T_basic. This cost is the same for all forward selection methods, and is O(n d_max²). (2) Then, depending on the basis selection method, there is the cost associated with basis selection. We will refer to the accumulated value of this cost for choosing all d_max basis functions as T_selection. Forward basis selection methods differ in the way they choose effective basis functions while keeping T_selection small. It is useful to note that the total cost associated with the random basis selection method mentioned earlier is just T_basic = O(n d_max²). This cost forms a baseline for comparison.\n\n
Smola and Bartlett's method. Consider the typical situation in forward selection where we have a current working set I and we are interested in choosing the next basis vector, x_i. The method of Smola and Bartlett (2001) evaluates each given x_i ∉ I by trying its complete inclusion, i.e., set I' = I ∪ {x_i} and optimize π(α) using α_{I'} = [α_I; α_i]. Thus, their selection criterion for the instance x_i ∉ I is the decrease in π(α) that can be obtained by allowing both α_I and α_i as variables to be non-zero. The minimal value of π(α) can be obtained by solving min_{α_{I'}} π̃(α_{I'}) defined in (5). This costs O(nd) time for each candidate x_i, where d is the size of the current set I. If all x_i ∉ I need to be tried, it will lead to O(n²d) cost. Accumulated till d_max basis functions are added, this leads to a T_selection that has O(n² d_max²) complexity, which is disproportionately higher than T_basic. Therefore, Smola and Bartlett (2001) resorted to a randomized scheme by considering only κ basis elements randomly chosen from outside I during one basis selection. They used a value of κ = 59. For this randomized method, the complexity of T_selection is O(n d_max²). Although, from a complexity viewpoint, T_basic and T_selection are the same, it should be noted that the overall cost of the method is about 60 times that of T_basic.\n\nSeeger et al.'s information gain method. Seeger et al. (2003) proposed a novel and very cheap heuristic criterion for basis selection. The \"informativeness\" of an input vector x_i ∉ I is scored by the information gain between the true posterior distribution, P(f_{I'}|y), and a posterior approximation, Q(f_{I'}|y), where I' denotes the new set of basis vectors after including a new element x_i into the current set I. The posterior approximation Q(f_{I'}|y) ignores the dependencies between the latent variable f(x_i) and the targets other than y_i. 
Due to this simplification, this value of information gain is computed in O(1) time, given the current predictive model represented by I. Thus, the scores of all instances outside I can be efficiently evaluated in O(n) time, which makes this algorithm almost as fast as using random selection! The potential weakness of this algorithm might be the non-use of the correlation in the remaining instances {x_i : x_i ∉ I}.\n\nPost-backfitting approach. The two methods presented above are extremes in efficiency: in Smola and Bartlett's method T_selection is disproportionately larger than T_basic, while in Seeger et al.'s method T_selection is very much smaller than T_basic. In this section we introduce a moderate method that is effective and whose complexity is in between the two earlier methods. Our method borrows an idea from kernel matching pursuit.\n\nKernel Matching Pursuit (Vincent and Bengio, 2002) is a sparse method for ordinary least squares that consists of two general greedy sparse approximation schemes, called pre-backfitting and post-backfitting. It is worth pointing out that the same methods were also considered much earlier in Adler et al. (1996). Both methods can be generalized to select the basis vectors for sparse GPs. The pre-backfitting approach is very similar in spirit to Smola and Bartlett's method. Our method is an efficient selection criterion that is based on the post-backfitting idea. Recall that, given the current I, the minimal value of π(α) when it is optimized using only α_I as variables is equivalent to min_{α_I} π̃(α_I) as in (5). The minimizer, denoted as α*_I, is given by\n\nα*_I = (σ²K_I + K_{I,·} K_{I,·}^T)^{-1} K_{I,·} y    (6)\n\nOur scoring criterion for an instance x_i ∉ I is based on optimizing π(α) by fixing α_I = α*_I and changing α_i only. The one-dimensional minimizer can be easily found as\n\nα*_i = [K_{i,·}^T (y − K_{I,·}^T α*_I) − σ² k̃_i^T α*_I] / [σ² K(x_i, x_i) + K_{i,·}^T K_{i,·}]    (7)\n\nwhere K_{i,·} is the n × 1 matrix of covariance functions between x_i and all the training data, and k̃_i is a d dimensional vector having K(x_j, x_i), x_j ∈ I. 
The selection score of the instance x_i is the decrease in π(α) achieved by the one dimensional optimization of α_i, which can be written in closed form as\n\nΔ_i = (1/2) (α*_i)² (σ² K(x_i, x_i) + K_{i,·}^T K_{i,·})    (8)\n\nwhere α*_i is defined as in (7). Note that a full kernel column K_{i,·} is required and so it costs O(n) time to compute (8). In contrast, for scoring one instance, Smola and Bartlett's method requires O(nd) time and Seeger et al.'s method requires O(1) time.\n\nIdeally we would like to run over all x_i ∉ I and choose the instance which gives the largest decrease. This will need O(n²) effort. Summing the cost till d_max basis vectors are selected, we get an overall complexity of O(n² d_max), which is much higher than T_basic.\n\nTo restrict the overall complexity of T_selection to O(n d_max²), we resort to a randomization scheme that selects a relatively good instance rather than the best one. Since it costs only O(n) time to evaluate our selection criterion in (8) for one instance, we can choose the next basis vector from a set of d_max instances randomly selected from outside of I. Such a scheme keeps the overall complexity of T_selection to O(n d_max²). But, from a practical point of view, the scheme is expensive because the selection criterion (8) requires computing a full kernel row K_{i,·} for each instance to be evaluated.\n\nAs kernel evaluations could be very expensive, we propose a modified scheme to keep the number of such evaluations small. Let us maintain a matrix cache, C of size c × n, that contains c rows of the full kernel matrix K. At the beginning of the algorithm (when I is empty) we initialize C by randomly choosing c training instances, computing the full kernel row K_{i,·} for the chosen i's and putting them in the rows of C. Each step corresponding to a new basis vector selection proceeds as follows. First we compute Δ_i for the c instances corresponding to the rows of C and select the instance with the highest score for inclusion in I. Let x_j denote the chosen basis vector. 
Then we sort the remaining instances (that define C) according to their Δ_i values. Finally, we randomly select κ fresh instances (from outside of I and the vectors that define C) to replace x_j and the κ − 1 cached instances with the lowest score. Thus, in each basis selection step, we compute the criterion scores for c instances, but evaluate full kernel rows only for κ fresh instances.\n\nAn important advantage of the above scheme is that those basis elements which have very good scores, but are overtaken by another better element in a particular step, continue to remain in C and probably get to be selected in future basis selection steps. Like in Smola and Bartlett's method we use κ = 59. The value of c can be set to be any integer between κ and d_max. For any c in this range, the complexity of T_selection remains at most O(n d_max²). The above cache scheme is special to our method and cannot be used with Smola and Bartlett's method without unduly increasing its complexity. If available, it is also useful to have an extra cache for storing kernel rows of instances which get discarded in one step, but which get to be considered again in a future step. Smola and Bartlett's method can also gain from such a cache.\n\n4 Model adaptation\n\nIn this section we address the problem of model adaptation for a given number of basis functions, d_max. Seeger (2003) and Seeger et al. (2003) give the details together with a very good discussion of various issues associated with gradient based model adaptation. Since the same ideas hold for all basis selection methods, we will not discuss them in detail. The sparse GP model is conditional on the parameters in the kernel function and the Gaussian noise level σ², which can all be collected together in θ, the hyperparameter vector. 
The optimal values of θ can be inferred by minimizing the negative log of the marginal likelihood, φ(θ) = −log P(y|θ), using gradient based techniques, where P(y|θ) = ∫ P(y|f_I) P(f_I) df_I = N(y | 0, σ²I + K_{I,·}^T K_I^{-1} K_{I,·}). One of the problems in doing this is the dependence of I on θ that makes φ a non-differentiable function. This problem can be handled by repeating the following alternating steps: (1) fix θ and select I by the given basis selection algorithm; and (2) fix I and do a (short) gradient based adaptation of θ. For the cache-based post-backfitting method of basis selection we also do the following for adding some stability to the model adaptation process. After we do step (2) using some I and obtain a θ, we set the initial kernel cache C using the rows of K_{I,·} at θ.\n\n5 Numerical experiments\n\nIn this section, we compare our method against other sparse GP methods to verify the usefulness of our algorithm. To evaluate generalization performance, we utilize Normalized Mean Square Error (NMSE), given by (1/t) Σ_{i=1}^t (y_i − μ_i)² / Var(y), and Negative Logarithm of Predictive Distribution (NLPD), defined as (1/t) Σ_{i=1}^t −log P(y_i | μ_i, σ_i²), where t is the number of test cases, and y_i, μ_i and σ_i² are, respectively, the target, the predictive mean and the predictive variance of the i-th test case. NMSE uses only the mean while NLPD measures the quality of predictive distributions as it penalizes over-confident predictions as well as under-confident ones. For all experiments, we use the ARD Gaussian kernel defined by K(x_i, x_j) = υ_0 exp(−Σ_{ℓ=1}^m υ_ℓ (x_i^ℓ − x_j^ℓ)²) + υ_b, where υ_0, υ_ℓ, υ_b > 0 and x_i^ℓ denotes the ℓ-th element of x_i. The ARD parameters {υ_ℓ} give variable weights to input features, which leads to a type of feature selection.\n\nQuality of Basis Selection in KIN40K Data Set. We use the KIN40K data set,1 composed of 40,000 samples, to evaluate and compare the performance of the various basis selection criteria. 
We first trained a full GPR model with the ARD Gaussian kernel on a subset of 2000 samples randomly selected in the dataset. The optimal values of the hyperparameters that we obtained were fixed and used for all the sparse GP models in this experiment. We compare the following five basis selection methods:\n1. the baseline algorithm (RAND) that selects I at random;\n2. the information gain algorithm (INFO) proposed by Seeger et al. (2003);\n3. our algorithm described in Section 3 with cache size c = κ = 59 (KAPPA), in which we evaluate the selection scores of κ instances at each step;\n4. our algorithm described in Section 3 with cache size c = d_max (DMAX);\n5. the algorithm (SB) proposed by Smola and Bartlett (2001) with κ = 59.\nWe randomly selected 10,000 samples for training, and kept the remaining 30,000 samples as test cases. For the purpose of studying variability the methods were run on ten such random partitions. We varied d_max from 100 to 1200. The test performances of the five methods are presented in Figure 1. From the upper plot of Figure 1 we can see that INFO yields much worse NMSE results than KAPPA, DMAX and SB when d_max is less than 600. When the size is around 100, INFO is even worse than RAND. DMAX is always better than KAPPA. Interestingly, DMAX is even slightly better than SB when d_max is less than 200. This is probably because DMAX has a bigger set of basis functions to choose from than SB. SB generally yields slightly better results than KAPPA. From the middle plot of Figure 1 we can note that INFO always gives poor NLPD results, even worse than RAND. The performances of KAPPA, DMAX and SB are close. The lower plot of Figure 1 gives the CPU time consumed by the five algorithms for training, as a function of d_max, in log-log scale. 
The scaling exponents of RAND, INFO and SB are around 2.0 (i.e., cost is proportional to d_max²), which is consistent with our analysis. INFO is almost as fast as RAND, while SB is about 60 times slower than INFO. The gap between KAPPA and INFO is the O(n d_max) time in computing the score (8) for κ candidates.2 As d_max increases, the cost of KAPPA asymptotically gets close to INFO. The gap between DMAX and KAPPA is the O(n d_max² − n d_max) cost in computing the score (8) for the additional (d_max − κ) instances. Thus, as d_max increases, the curve of DMAX asymptotically becomes parallel to the curve of INFO. Asymptotically, the ratio of the computational times of DMAX and INFO is only about 3, whereas SB is about 60 times slower than INFO. DMAX therefore achieves excellent generalization while also being quite efficient.\n\n[Figure 1 plots omitted: three panels (NMSE, NLPD, CPU TIME) against d_max; legend: SB, DMAX, KAPPA, INFO, RAND.]\nFigure 1: The variations of test set NMSE, test set NLPD and CPU time (in seconds) for training of the five algorithms as a function of d_max. In the NMSE and NLPD plots, at each value of d_max, the results of the five algorithms are presented as a boxplot group. From left to right, they are RAND (blue), INFO (red), KAPPA (green), DMAX (black), and SB (magenta). Note that the CPU time plot is on a log-log scale.\n\n1 The dataset is available at http://www.igi.tugraz.at/aschwaig/data.html.\n\nModel Adaptation on Benchmark Data Sets. 
Next, we compare model adaptation abilities of the following three algorithms for d_max = 500.\n1. The SB algorithm is applied to build a sparse GPR model with fixed hyperparameters (FIXED-SB). The values of these hyperparameters were obtained by training via a standard full GPR model on a manageable subset of 2000 samples randomly selected from the training data. FIXED-SB serves as a baseline.\n2. The model adaptation scheme is coupled with the INFO basis selection algorithm (ADAPT-INFO).\n3. The model adaptation scheme is coupled with our DMAX basis selection algorithm (ADAPT-DMAX).\n\n2 If we want to take kernel evaluations also into account, the cost of KAPPA is O(m n d_max), where m is the number of input variables. Note that INFO does not require any kernel evaluations for computing its selection criterion.\n\nTable 1: Test results of the three algorithms on the seven benchmark regression datasets. The results are the averages over 20 trials, along with the standard deviation. d denotes the number of input features, ntrg denotes the training data size and ntst denotes the test data size. We use bold face to indicate the lowest average value among the results of the three algorithms. The symbol is used to indicate the cases significantly worse than the winning entry; a p-value threshold of 0.01 in Wilcoxon rank sum test was used to decide this.\nNMSE NLPD A DA P T- D M A X\n- 0 .67  0.53 0.88  0.03 3 .04  0.17 3 .09  0.20 1 3.03  0.30 1 1.71  0.03 1 2.13  0.04\n\nDATA S E T d ntrg ntst FI X E D - S B A DA P T- I N F O A DA P T- D M A X FI X E D - S B A DA P T- I N F O 1 BA N K 8 F M 8 4 5 0 0 3 6 9 2 3.52  0.08 3.54  0.08 3.56  0.09 3.11  0.65 .37  0.34 C 4 BA N K 3 2 N H 3 2 4 5 0 0 3 6 9 2 4 8 . 0 8  2 . 9 2 4 9 . 0 4  1 . 3 4 7.41  1.35 -1.02  0.21 -0 . 7 9  0 . 0 6 3 PUSMALL 12 4500 3692 2.45  0.16 2.45  0.15 2 . 4 6  0 . 1 4 5 . 1 8  0 . 
6 1 .70  0.46 3 C P UAC T 2 1 4 5 0 0 3 6 9 2 1.58  0.13 1.61  0.14 1.61  0.11 4.49  0.26 .68  0.40 2 CALHOUSE 8 10000 10640 22.58  0.34 2 . 8 2  0 . 4 6 20.02  0.88 3 1 . 8 3  3 . 3 5 2 1 . 2 0  1 . 4 7 HOUSE8L 8 1 0 0 0 0 1 2 7 8 4 4 2 . 2 7  2 . 1 4 3 7 . 3 0  1 . 2 9 35.87  0.94 1 2 . 0 6  0 . 6 7 1 2 . 0 6  0 . 0 7 4 HOUSE16H 16 10000 12784 53.45  7.05 5 . 7 2  1 . 1 5 44.29  0.76 1 2 . 7 2  1 . 6 9 1 2 . 4 8  0 . 0 6\n\nWe selected seven large regression datasets.3 Each of them was randomly partitioned into training/test splits. For the purpose of analyzing statistical significance, the partition was repeated 20 times independently. Test set performances (NMSE and NLPD) of the three methods on the seven datasets are presented in Table 1. On the four datasets with 4500 training instances, the NMSE results of the three methods are quite comparable. ADAPT-DMAX yields significantly better NLPD results on three of those four datasets. On the three larger datasets with 10,000 training instances, ADAPT-DMAX is significantly better than ADAPT-INFO on both NMSE and NLPD. We also tested our algorithm on the Outaouais dataset, which consists of 29000 training samples and 20000 test cases whose targets are held by the organizers of the \"Evaluating Predictive Uncertainty Challenge\".4 The results of NMSE and NLPD we obtained in this blind test are 0.014 and -1.037 respectively, which are much better than the results of other participants.\n\nReferences\nAdler, J., B. D. Rao, and K. Kreutz-Delgado. Comparison of basis selection methods. In Proceedings of the 30th Asilomar Conference on Signals, Systems and Computers, pages 252–257, 1996.\nCandela, J. Q. Learning with uncertainty - Gaussian processes and relevance vector machines. PhD thesis, Technical University of Denmark, 2004.\nCsató, L. and M. Opper. Sparse online Gaussian processes. Neural Computation, The MIT Press, 14:641–668, 2002.\nSeeger, M. 
Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and sparse approximations. PhD thesis, University of Edinburgh, July 2003.\nSeeger, M., C. K. I. Williams, and N. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Workshop on AI and Statistics 9, 2003.\nSmola, A. J. and P. Bartlett. Sparse greedy Gaussian process regression. In Leen, T. K., T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001.\nVincent, P. and Y. Bengio. Kernel matching pursuit. Machine Learning, 48:165–187, 2002.\nWilliams, C. K. I. and M. Seeger. Using the Nyström method to speed up kernel machines. In Leen, T. K., T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.\n\n3 These datasets are available at http://www.liacc.up.pt/ltorgo/Regression/DataSets.html.\n4 The dataset and the results contributed by other participants can be found at the web site of the challenge http://predict.kyb.tuebingen.mpg.de/.\n", "award": [], "sourceid": 2862, "authors": [{"given_name": "Sathiya", "family_name": "Keerthi", "institution": null}, {"given_name": "Wei", "family_name": "Chu", "institution": null}]}