{"title": "Efficient Kernel Machines Using the Improved Fast Gauss Transform", "book": "Advances in Neural Information Processing Systems", "page_first": 1561, "page_last": 1568, "abstract": null, "full_text": "Efficient Kernel Machines Using the Improved Fast Gauss Transform\n\nChangjiang Yang, Ramani Duraiswami and Larry Davis\nDepartment of Computer Science, Perceptual Interfaces and Reality Laboratory\nUniversity of Maryland, College Park, MD 20742\n{yangcj,ramani,lsd}@umiacs.umd.edu\n\nAbstract\n\nThe computation and memory required for kernel machines with N training samples are at least O(N^2). Such a complexity is significant even for moderate size problems and is prohibitive for large datasets. We present an approximation technique based on the improved fast Gauss transform to reduce the computation to O(N). We also give an error bound for the approximation, and provide experimental results on the UCI datasets.\n\n\n1    Introduction\n\nKernel based methods, including support vector machines [16], regularization networks [5] and Gaussian processes [18], have attracted much attention in machine learning. The solid theoretical foundations and good practical performance of kernel methods make them very popular. However, one major drawback of the kernel methods is their scalability. Kernel methods require O(N^2) storage and O(N^3) operations for direct methods, or O(N^2) operations per iteration for iterative methods, which is impractical for large datasets.\n\nTo deal with this scalability problem, many approaches have been proposed, including the Nyström method [19], sparse greedy approximation [13, 12], low rank kernel approximation [3] and reduced support vector machines [9]. 
All of these try to find a reduced subset of the original dataset using either random selection or greedy approximation. In these methods there is no guarantee on the approximation of the kernel matrix in a deterministic sense. An assumption made in these methods is that most eigenvalues of the kernel matrix are zero. This is not always true and its violation results in either performance degradation or negligible reduction in computational time or memory.\n\nWe explore a deterministic method to speed up kernel machines using the improved fast Gauss transform (IFGT) [20, 21]. The kernel machine is solved iteratively using the conjugate gradient method, where the dominant computation is the matrix-vector product which we accelerate using the IFGT. Rather than approximating the kernel matrix by a low-rank representation, we approximate the matrix-vector product by the improved fast Gauss transform to any desired precision. The total computational and storage costs are of linear order in the size of the dataset. 
We present the application of the IFGT to kernel methods in the context of the Regularized Least-Squares Classification (RLSC) [11, 10], though the approach is general and can be extended to other kernel methods.\n\n\n2    Regularized Least-Squares Classification\n\nThe RLSC algorithm [11, 10] solves binary classification problems in a Reproducing Kernel Hilbert Space (RKHS) [17]: given N training samples in d-dimensional space $x_i \\in \\mathbb{R}^d$ and the labels $y_i \\in \\{-1, 1\\}$, find $f \\in H$ that minimizes the regularized risk functional\n\n$$\\min_{f \\in H} \\frac{1}{N} \\sum_{i=1}^{N} V(y_i, f(x_i)) + \\lambda \\|f\\|_K^2, \\tag{1}$$\n\nwhere H is an RKHS with reproducing kernel K, V is a convex cost function and $\\lambda$ is the regularization parameter controlling the tradeoff between the cost and the smoothness. Based on the Representer Theorem [17], the solution has a representation as\n\n$$f(x) = \\sum_{i=1}^{N} c_i K(x, x_i). \\tag{2}$$\n\nIf the loss function V is the hinge function, $V(y, f) = (1 - yf)_+$, where $(\\tau)_+ = \\tau$ for $\\tau > 0$ and 0 otherwise, then the minimization of (1) leads to the popular Support Vector Machines which can be solved using quadratic programming.\n\nIf the loss function V is the square-loss function, $V(y, f) = (y - f)^2$, the minimization of (1) leads to the so-called Regularized Least-Squares Classification which requires only the solution of a linear system. The algorithm has been rediscovered several times and has many different names [11, 10, 4, 15]. In this paper, we stick to the term \"RLSC\" for consistency. It has been shown in [11, 4] that RLSC achieves accuracy comparable to the popular SVMs for binary classification problems.\n\nIf we substitute (2) into (1), and denote $c = [c_1, \\ldots, c_N]^T$, $K_{ij} = K(x_i, x_j)$, we can find the solution of (1) by solving the linear system\n\n$$(K + \\lambda' I) c = y \\tag{3}$$\n\nwhere $\\lambda' = \\lambda N$, I is the identity matrix, and $y = [y_1, \\ldots, y_N]^T$.\n\nThere are many choices for the kernel function K. The Gaussian is a good kernel for classification and is used in many applications. If a Gaussian kernel is applied, as shown in [10], the classification problem can be solved by the solution of a linear system, i.e., Regularized Least-Squares Classification. A direct solution of the linear system will require O(N^3) computation and O(N^2) storage, which is impractical even for problems of moderate size.\n\n\nAlgorithm 1 Regularized Least-Squares Classification\n\nRequire: Training dataset $S_N = (x_i, y_i)_{i=1}^{N}$.\n  1. Choose the Gaussian kernel: $K(x, x') = e^{-\\|x - x'\\|^2/\\sigma^2}$.\n  2. Find the solution as $f(x) = \\sum_{i=1}^{N} c_i K(x, x_i)$, where c satisfies the linear system (3).\n  3. Solve the linear system (3).\n\n\nAn effective way to solve the large-scale linear system (3) is to use iterative methods. Since the matrix K is symmetric, we consider the well-known conjugate gradient method. The conjugate gradient method solves the linear system (3) by iteratively performing the matrix-vector multiplication Kc. If rank(K) = r, then the conjugate gradient algorithm converges in at most r + 1 steps. Only one matrix-vector multiplication and 10N arithmetic operations are required per iteration. Only four N-vectors are required for storage. 
So the computational complexity is O(N^2) for low-rank K and the storage requirement is O(N^2). While this represents an improvement for most problems, the rank of the matrix may not be small, and moreover the quadratic storage and computational complexity are still too high for large datasets. In the following sections, we present an algorithm to reduce the computational and storage complexity to linear order.\n\n\n3    Fast Gauss Transform\n\nThe matrix-vector product Kc can be written in the form of the so-called discrete Gauss transform [8]\n\n$$G(y_j) = \\sum_{i=1}^{N} c_i e^{-\\|x_i - y_j\\|^2/\\sigma^2}, \\tag{4}$$\n\nwhere $c_i$ are the weight coefficients, $\\{x_i\\}_{i=1}^{N}$ are the centers of the Gaussians (called \"sources\"), and $\\sigma$ is the bandwidth parameter of the Gaussians. The sum of the Gaussians is evaluated at each of the \"target\" points $\\{y_j\\}_{j=1}^{M}$. Direct evaluation of the Gauss transform at M target points due to N sources requires O(MN) operations.\n\nThe Fast Gauss Transform (FGT) was invented by Greengard and Strain [8] for efficient evaluation of the Gauss transform in O(M + N) operations. It is an important variant of the more general Fast Multipole Method [7].\n\nThe FGT [8] expands the Gaussian function into Hermite functions. 
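The expansion of the univariate Gaussian is\n\n$$e^{-\\|y_j - x_i\\|^2/\\sigma^2} = \\sum_{n=0}^{p-1} \\frac{1}{n!} \\left(\\frac{x_i - x_*}{\\sigma}\\right)^n h_n\\left(\\frac{y_j - x_*}{\\sigma}\\right) + \\epsilon(p), \\tag{5}$$\n\nwhere $h_n(x)$ are the Hermite functions defined by $h_n(x) = (-1)^n \\frac{d^n}{dx^n}\\left(e^{-x^2}\\right)$, and $x_*$ is the expansion center. The d-dimensional Gaussian function is treated as a Kronecker product of d univariate Gaussians. For simplicity, we adopt the multi-index notation of the original FGT papers [8]. A multi-index $\\alpha = (\\alpha_1, \\ldots, \\alpha_d)$ is a d-tuple of nonnegative integers. For any multi-index $\\alpha \\in \\mathbb{N}^d$ and any $x \\in \\mathbb{R}^d$, we have the monomial $x^\\alpha = x_1^{\\alpha_1} x_2^{\\alpha_2} \\cdots x_d^{\\alpha_d}$. The length and the factorial of $\\alpha$ are defined as $|\\alpha| = \\alpha_1 + \\alpha_2 + \\ldots + \\alpha_d$, $\\alpha! = \\alpha_1! \\alpha_2! \\cdots \\alpha_d!$. 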
The multidimensional Hermite functions are defined by\n\n$$h_\\alpha(x) = h_{\\alpha_1}(x_1) h_{\\alpha_2}(x_2) \\cdots h_{\\alpha_d}(x_d).$$\n\nThe sum (4) is then equal to the Hermite expansion about center $x_*$:\n\n$$G(y_j) = \\sum_{\\alpha \\ge 0} C_\\alpha h_\\alpha\\left(\\frac{y_j - x_*}{\\sigma}\\right), \\qquad C_\\alpha = \\frac{1}{\\alpha!} \\sum_{i=1}^{N} c_i \\left(\\frac{x_i - x_*}{\\sigma}\\right)^\\alpha, \\tag{6}$$\n\nwhere $C_\\alpha$ are the coefficients of the Hermite expansions.\n\nIf we truncate each of the Hermite series (6) after p terms (or equivalently order p - 1), then each of the coefficients $C_\\alpha$ is a d-dimensional matrix with $p^d$ terms. The total computational complexity for a single Hermite expansion is $O((M + N)p^d)$. The factor $O(p^d)$ grows exponentially as the dimensionality d increases. Despite this defect in higher dimensions, the FGT is quite effective for two and three-dimensional problems, and has achieved success in some physics, computer vision and pattern recognition applications.\n\nIn practice a single expansion about one center is not always valid or accurate over the entire domain. A space subdivision scheme is applied in the FGT and the Gaussian functions are expanded at multiple centers. 
The original FGT subdivides space into uniform boxes, which is simple, but highly inefficient in higher dimensions. The number of boxes grows exponentially with dimensionality, which makes it inefficient for storage and for searching nonempty neighbor boxes. Most important, since the ratio of the volume of the hypercube to that of the inscribed sphere grows exponentially with dimension, points have a high probability of falling into the area inside the box and outside the sphere, where the truncation error of the Hermite expansion is much larger than inside the sphere.\n\n\n3.1    Improved Fast Gauss Transform\n\nIn brief, the original FGT suffers from the following two defects:\n\n    1. The exponential growth of computational complexity with dimensionality.\n    2. The use of the box data structure in the FGT is inefficient in higher dimensions.\n\nWe introduced the improved FGT [20, 21] to address these deficiencies, and it is summarized below.\n\n3.1.1    Multivariate Taylor Expansions\n\nInstead of expanding the Gaussian into Hermite functions, we factorize it as\n\n$$e^{-\\|y_j - x_i\\|^2/\\sigma^2} = e^{-\\|\\Delta y_j\\|^2/\\sigma^2} e^{-\\|\\Delta x_i\\|^2/\\sigma^2} e^{2 \\Delta y_j \\cdot \\Delta x_i/\\sigma^2}, \\tag{7}$$\n\nwhere $x_*$ is the center of the sources, $\\Delta y_j = y_j - x_*$, $\\Delta x_i = x_i - x_*$. The first two exponential terms can be evaluated individually at the source points or target points. In the third term, the sources and the targets are entangled. 
Here we break the entanglement by expanding it into a multivariate Taylor series\n\n$$e^{2 \\Delta y_j \\cdot \\Delta x_i/\\sigma^2} = \\sum_{n=0}^{\\infty} \\frac{2^n}{n!} \\left(\\frac{\\Delta x_i}{\\sigma} \\cdot \\frac{\\Delta y_j}{\\sigma}\\right)^n = \\sum_{|\\alpha| \\ge 0} \\frac{2^{|\\alpha|}}{\\alpha!} \\left(\\frac{\\Delta x_i}{\\sigma}\\right)^\\alpha \\left(\\frac{\\Delta y_j}{\\sigma}\\right)^\\alpha. \\tag{8}$$\n\nIf we truncate the series after total order p - 1, then the number of terms is $r_{p-1,d} = \\binom{p+d-1}{d}$, which is much less than $p^d$ in higher dimensions. For d = 12 and p = 10, the original FGT needs $10^{12}$ terms, while the multivariate Taylor expansion needs only 293930. For $d \\to \\infty$ and moderate p, the number of terms is $O(d^p)$, a substantial reduction.\n\nFrom Eqs. (7) and (8), the weighted sum of Gaussians (4) can be expressed as a multivariate Taylor expansion about center $x_*$:\n\n$$G(y_j) = \\sum_{|\\alpha| \\ge 0} C_\\alpha e^{-\\|y_j - x_*\\|^2/\\sigma^2} \\left(\\frac{y_j - x_*}{\\sigma}\\right)^\\alpha, \\tag{9}$$\n\nwhere the coefficients $C_\\alpha$ are given by\n\n$$C_\\alpha = \\frac{2^{|\\alpha|}}{\\alpha!} \\sum_{i=1}^{N} c_i e^{-\\|x_i - x_*\\|^2/\\sigma^2} \\left(\\frac{x_i - x_*}{\\sigma}\\right)^\\alpha. \\tag{10}$$\n\nThe coefficients $C_\\alpha$ can be efficiently evaluated with $r_{nd}$ storage and $r_{nd} - 1$ multiplications using the multivariate Horner's rule [20].\n\n3.1.2    Spatial Data Structures\n\nTo efficiently subdivide the space, we need a scheme that adaptively subdivides the space according to the distribution of points. It is also desirable to generate cells as compact as possible. Based on these considerations, we model the space subdivision task as a k-center problem [1]: given a set of N points and a predefined number of clusters k, find a partition of the points into clusters $S_1, \\ldots, S_k$, with cluster centers $c_1, \\ldots, c_k$, that minimizes the maximum radius of any cluster:\n\n$$\\max_i \\max_{v \\in S_i} \\|v - c_i\\|.$$\n\nThe k-center problem is known to be NP-hard. Gonzalez [6] proposed a very simple greedy algorithm, called farthest-point clustering. Initially, pick an arbitrary point $v_0$ as the center of the first cluster and add it to the center set C. Then, for i = 1 to k - 1, do the following: in iteration i, for every point, compute its distance to the set C: $d_i(v, C) = \\min_{c \\in C} \\|v - c\\|$. Let $v_i$ be a point that is farthest away from C, i.e., a point for which $d_i(v_i, C) = \\max_v d_i(v, C)$. Add $v_i$ to the center set C. After these iterations, report the points $v_0, v_1, \\ldots, v_{k-1}$ as the cluster centers. 
Each point is then assigned to its nearest center.\n\nGonzalez [6] proved that farthest-point clustering is a 2-approximation algorithm, i.e., it computes a partition with maximum radius at most twice the optimum. The direct implementation of farthest-point clustering has running time O(Nk). Feder and Greene [2] give a two-phase algorithm with optimal running time O(N log k). In practice, we used circular lists to index the points and achieve the complexity O(N log k) empirically.\n\n3.1.3    The Algorithm and Error Bound\n\nThe improved fast Gauss transform consists of the following steps:\n\nAlgorithm 2 Improved Fast Gauss Transform\n\n     1. Assign N sources into k clusters using the farthest-point clustering algorithm such that the radius is less than $\\sigma \\rho_x$.\n     2. Choose p sufficiently large such that the error estimate (11) is less than the desired precision $\\epsilon$.\n     3. For each cluster $S_k$ with center $c_k$, compute the coefficients given by (10).\n     4. For each target $y_j$, find its neighbor clusters whose centers lie within the range $\\sigma \\rho_y$. Then the sum of Gaussians (4) can be evaluated by the expression (9).\n\n\nThe amount of work required in step 1 is O(N log k) using Feder and Greene's algorithm [2]. The amount of work required in step 3 is $O(N r_{pd})$. The work required in step 4 is $O(M n r_{pd})$, where $n \\le k$ is the maximum number of neighbor clusters for each target. So, the improved fast Gauss transform achieves linear running time. The algorithm needs to store the k coefficients of size $r_{pd}$, so the storage complexity is reduced to $O(k r_{pd})$. To verify the linear order of our algorithm, we generate N source points and N target points in 4, 6, 8, 10 dimensional unit hypercubes using a uniform distribution. The weights on the source points are generated from a uniform distribution in the interval [0, 1] and $\\sigma = 1$. 
The results of the IFGT and the direct evaluation are displayed in Figure 1(a), (b), and confirm the linear order of the IFGT.\n\nThe error of the improved fast Gauss transform (Algorithm 2) is bounded by\n\n$$|E(G(y_j))| \\le \\sum_{i=1}^{N} |c_i| \\left( \\frac{2^p}{p!} \\rho_x^p \\rho_y^p + e^{-(\\rho_y - \\rho_x)^2} \\right). \\tag{11}$$\n\nThe details are in [21]. The comparison between the maximum absolute errors in the simulation and the estimated error bound (11) is displayed in Figure 1(c) and (d). It shows that the error bound is very conservative compared with the real errors. Empirically we can obtain the parameters on a randomly selected subset and use them on the entire dataset.\n\n4    IFGT Accelerated RLSC: Discussion and Experiments\n\nThe key idea of all acceleration methods is to reduce the cost of the matrix-vector product. In reduced subset methods, this is performed by evaluating the product at a few points, assuming that the matrix is low rank. The general Fast Multipole Methods (FMM) seek to analytically approximate the possibly full-rank matrix as a sum of low rank approximations with a tight error bound [14] (the FGT is a variant of the FMM with a Gaussian kernel). It is expected that these methods can be more robust, while at the same time achieving significant acceleration.\n\nThe problems to which kernel methods are usually applied are in higher dimensions, though the intrinsic dimensionality of the data is expected to be much smaller. The original FGT does not scale well to higher dimensions. Its cost is of linear order in the number of samples, but exponential order in the number of dimensions. 
The improved FGT uses new data structures and a modified expansion to reduce this to polynomial order.\n\nDespite this improvement, at first glance, even with the use of the IFGT, it is not clear if the reduction in complexity will be competitive with the other approaches proposed. Reason for hope is provided by the fact that in high dimensions we expect that the IFGT with very low order expansions will converge rapidly (because of the sharply vanishing exponential terms multiplying the expansion in factorization (7)). Thus we expect that combined with a dimensionality reduction technique, we can achieve very competitive solutions.\n\n[Figure 1 plots. (a) CPU time vs. N for the direct method and the fast method in 4D, 6D, 8D and 10D. (b) Max abs error vs. N in 4D, 6D, 8D and 10D. (c) Real max abs error and estimated error bound vs. p. (d) Real max abs error and estimated error bound vs. $r_x$.]\n\nFigure 1: (a) Running time and (b) maximum absolute error w.r.t. N in d = 4, 6, 8, 10. The comparison between the real maximum absolute errors and the estimated error bound (11) w.r.t. (c) the order of the Taylor series p, and (d) the radius of the farthest-point clustering algorithm $r_x = \\sigma \\rho_x$. The uniformly distributed sources and target points are in 4 dimensions.\n\nIn this paper we explore the application of the IFGT accelerated RLSC to certain standard problems that have already been solved by the other techniques. While dimensionality reduction would be desirable, here we do not perform such a reduction for fair comparison. We use small order expansions (p = 1 and p = 2) in the IFGT and run the iterative solver.\n\nIn the first experiment, we compared the performance of the IFGT on approximating the sums (4) with the Nyström method [19]. The experiments were carried out on a Pentium 4 1.4GHz PC with 512MB memory. We generate N source points and N target points in 100 dimensional unit hypercubes using a uniform distribution. The weights on the source points are generated using a uniform distribution in the interval [0, 1]. We directly evaluate the sums (4) as the ground truth, where $\\sigma^2 = 0.5d$ and d is the dimensionality of the data. Then we estimate it using the improved fast Gauss transform and Nyström method. To compare the results, we use the maximum relative error to measure the precision of the approximations. 
Given a precision of 0.5%, we use the error bound (11) to find the parameters of the IFGT, and use a trial and error method to find the parameter of the Nyström method. Then we vary the number of points, N, from 500 to 5000 and plot the time against N in Figure 2 (a). The results show the IFGT is much faster than the Nyström method. We also fix the number of points to N = 1000 and vary the number of centers (or the random subset size) k from 10 to 1000 and plot the results in Figure 2 (b). The results show that the errors of the IFGT are not sensitive to the number of the centers, which means we can use a very small number of centers to achieve a good approximation. The accuracy of the Nyström method catches up at large k, where the direct evaluation may be even faster. The intuition is that the use of expansions improves the accuracy of the approximation and relaxes the requirement of the centers.\n\n[Figure 2 plots. (a) Time (s) vs. N for IFGT p=1, IFGT p=2 and Nystrom. (b) Max relative error vs. k for IFGT p=1, IFGT p=2 and Nystrom.]\n\nFigure 2: Performance comparison between the approximation methods. (a) Running time against N and (b) maximum relative error against k for fixed N = 1000 in 100 dimensions.\n\nTable 1: Ten-fold training and testing accuracy in percentage and training time in seconds using the four classifiers on the five UCI datasets. Same value of $\\sigma^2 = 0.5d$ is used in all the classifiers. 
A\nrectangular kernel matrix with a random subset size of 20% of N was used in PSVM on the Galaxy Dim\nand Mushroom datasets.\n\nDataset (Size, Dimension)  Metric     RLSC+FGT   RLSC       Nyström    PSVM\nIonosphere (251, 34)       Train (%)  94.8400    97.7209    91.8656    95.1250\n                           Test (%)   91.7302    90.6032    88.8889    94.0079\n                           Time (s)   0.3711     1.1673     0.4096     0.8862\nBUPA Liver (345, 6)        Train (%)  79.6789    81.7318    76.7488    75.8134\n                           Test (%)   71.0336    67.8403    69.2857    71.4874\n                           Time (s)   0.1279     0.4833     0.1475     0.3468\nTic-Tac-Toe (958, 9)       Train (%)  88.7263    88.6917    88.4945    92.9715\n                           Test (%)   86.9507    85.4890    84.1272    87.2680\n                           Time (s)   0.3476     2.9676     1.8326     3.9891\nGalaxy Dim (4192, 14)      Train (%)  93.2967    93.3206    93.7023    93.6705\n                           Test (%)   93.2014    93.2258    93.7020    93.5589\n                           Time (s)   2.0972     78.3526    3.1081     44.5143\nMushroom (8124, 22)        Train (%)  88.2556    87.9001    failed     85.5955\n                           Test (%)   87.9615    87.6658    failed     85.4629\n                           Time (s)   14.7422    341.7148   failed     285.1126\n\n\nIn the second experiment, five datasets from the UCI repository are used to compare the\nperformance of four different methods for classification: RLSC with the IFGT, RLSC with\nfull kernel evaluation, RLSC with the Nyström method and the Proximal Support Vector\nMachines (PSVM) [4]. The Gaussian kernel is used for all these methods. We use the\nsame value of 2 = (0.5)d for a fair comparison. The ten-fold cross validation accuracy\non training and testing and the training time are listed in Table 1. The RLSC with the\nIFGT is the fastest of the four classifiers on all five datasets, while its training and testing\naccuracy is close to that of the RLSC with full kernel evaluation. The RLSC with the\nNyström approximation is nearly as fast, but its accuracy is generally lower than that of the\nother methods. Worst of all, it is not always feasible to solve the linear systems, which results\nin the failure on the Mushroom dataset. The PSVM is accurate in both training and testing, but\nit is slow and memory-demanding for large datasets, even with subset reduction.\n\n5    Conclusions and Discussion\n\nWe presented an improved fast Gauss transform to speed up kernel machines with the Gaussian\nkernel, reducing the cost to linear order. The simulations and the classification experiments show\nthat the algorithm is in general faster and more accurate than other matrix approximation methods.\nAt present, we do not consider reduction of the support vector set or dimensionality\nreduction. The combination of the improved fast Gauss transform with these techniques\nshould bring even more reduction in computation. 
Another improvement to the algorithm\nwould be an automatic procedure to tune the parameters. A possible solution is to run a\nseries of test problems and tune the parameters accordingly. If the bandwidth is very\nsmall compared with the data range, nearest neighbor search algorithms could be a\nbetter solution to these problems.\n\nAcknowledgments\nWe would like to thank Dr. Nail Gumerov for many discussions. We also gratefully acknowledge\nthe support of NSF awards 9987944, 0086075 and 0219681.\n\nReferences\n [1] M. Bern and D. Eppstein. Approximation algorithms for geometric problems. In D. Hochbaum,\n     editor, Approximation Algorithms for NP-Hard Problems, chapter 8, pages 296-345. PWS\n     Publishing Company, Boston, 1997.\n [2] T. Feder and D. Greene. Optimal algorithms for approximate clustering. In Proc. 20th ACM\n     Symp. Theory of Computing, pages 434-444, Chicago, Illinois, 1988.\n [3] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations.\n     Journal of Machine Learning Research, 2:243-264, Dec. 2001.\n [4] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In Proceedings\n     KDD-2001: Knowledge Discovery and Data Mining, pages 77-86, San Francisco, CA, 2001.\n [5] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures.\n     Neural Computation, 7(2):219-269, 1995.\n [6] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer\n     Science, 38:293-306, 1985.\n [7] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. J. Comput. Phys.,\n     73(2):325-348, 1987.\n [8] L. Greengard and J. Strain. The fast Gauss transform. SIAM J. Sci. Statist. Comput.,\n     12(1):79-94, 1991.\n [9] Y.-J. Lee and O. Mangasarian. RSVM: Reduced support vector machines. In First SIAM\n     International Conference on Data Mining, Chicago, 2001.\n[10] T. Poggio and S. Smale. 
The mathematics of learning: Dealing with data. Notices of the\n     American Mathematical Society (AMS), 50(5):537-544, 2003.\n[11] R. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches in Machine\n     Learning. PhD thesis, MIT, Cambridge, MA, 2002.\n[12] A. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural\n     Information Processing Systems, pages 619-625. MIT Press, 2001.\n[13] A. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In\n     Proc. Int'l Conf. Machine Learning, pages 911-918. Morgan Kaufmann, 2000.\n[14] X. Sun and N. P. Pitsianis. A matrix version of the fast multipole method. SIAM Review,\n     43(2):289-300, 2001.\n[15] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural\n     Processing Letters, 9(3):293-300, 1999.\n[16] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.\n[17] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, PA, 1990.\n[18] C. K. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Trans.\n     Pattern Anal. Mach. Intell., 20(12):1342-1351, Dec. 1998.\n[19] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In\n     Advances in Neural Information Processing Systems, pages 682-688. MIT Press, 2001.\n[20] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform and\n     efficient kernel density estimation. In Proc. ICCV 2003, pages 464-471, 2003.\n[21] C. Yang, R. Duraiswami, and N. A. Gumerov. Improved fast Gauss transform. Technical Report\n     CS-TR-4495, UMIACS, Univ. 
of Maryland, College Park, 2003.\n\n\f\n", "award": [], "sourceid": 2550, "authors": [{"given_name": "Changjiang", "family_name": "Yang", "institution": null}, {"given_name": "Ramani", "family_name": "Duraiswami", "institution": null}, {"given_name": "Larry", "family_name": "Davis", "institution": null}]}