{"title": "Random Features for Large-Scale Kernel Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 1177, "page_last": 1184, "abstract": null, "full_text": "Random Features for Large-Scale Kernel Machines\n\nAli Rahimi\n\nIntel Research Seattle\nSeattle, WA 98105\n\nali.rahimi@intel.com\n\nBenjamin Recht\n\nCaltech IST\n\nPasadena, CA 91125\n\nbrecht@ist.caltech.edu\n\nAbstract\n\nTo accelerate the training of kernel machines, we propose to map the input data\nto a randomized low-dimensional feature space and then apply existing fast linear\nmethods. The features are designed so that the inner products of the transformed\ndata are approximately equal to those in the feature space of a user speci\ufb01ed shift-\ninvariant kernel. We explore two sets of random features, provide convergence\nbounds on their ability to approximate various radial basis kernels, and show\nthat in large-scale classi\ufb01cation and regression tasks linear machine learning al-\ngorithms applied to these features outperform state-of-the-art large-scale kernel\nmachines.\n\n1 Introduction\n\nKernel machines such as the Support Vector Machine are attractive because they can approximate\nany function or decision boundary arbitrarily well with enough training data. Unfortunately, meth-\nods that operate on the kernel matrix (Gram matrix) of the data scale poorly with the size of the\ntraining dataset. For example, even with the most powerful workstation, it might take days to\ntrain a nonlinear SVM on a dataset with half a million training examples. On the other hand, lin-\near machines can be trained very quickly on large datasets when the dimensionality of the data is\nsmall [1, 2, 3]. One way to take advantage of these linear training algorithms for training nonlinear\nmachines is to approximately factor the kernel matrix and to treat the columns of the factor matrix\nas features in a linear machine (see for example [4]). 
Instead, we propose to factor the kernel function itself. This factorization does not depend on the data, and allows us to convert the training and evaluation of a kernel machine into the corresponding operations of a linear machine by mapping data into a relatively low-dimensional randomized feature space. Our experiments show that these random features, combined with very simple linear learning techniques, compete favorably in speed and accuracy with state-of-the-art kernel-based classification and regression algorithms, including those that factor the kernel matrix.
The kernel trick is a simple way to generate features for algorithms that depend only on the inner product between pairs of input points. It relies on the observation that any positive definite function k(x, y) with x, y ∈ R^d defines an inner product and a lifting φ so that the inner product between lifted datapoints can be quickly computed as ⟨φ(x), φ(y)⟩ = k(x, y). The cost of this convenience is that the algorithm accesses the data only through evaluations of k(x, y), or through the kernel matrix consisting of k applied to all pairs of datapoints. As a result, large training sets incur large computational and storage costs.
Instead of relying on the implicit lifting provided by the kernel trick, we propose explicitly mapping the data to a low-dimensional Euclidean inner product space using a randomized feature map z : R^d → R^D so that the inner product between a pair of transformed points approximates their kernel evaluation:

k(x, y) = ⟨φ(x), φ(y)⟩ ≈ z(x)′z(y).   (1)

Unlike the kernel's lifting φ, z is low-dimensional. Thus, we can simply transform the input with z, and then apply fast linear learning methods to approximate the answer of the corresponding nonlinear kernel machine.
In what follows, we show how to construct feature spaces that uniformly approximate popular shift-invariant kernels k(x − y) to within ε with only D = O(d ε⁻² log(1/ε²)) dimensions, and empirically show that excellent regression and classification performance can be obtained for even smaller D.
In addition to giving us access to extremely fast learning algorithms, these randomized feature maps also provide a way to quickly evaluate the machine. With the kernel trick, evaluating the machine at a test point x requires computing f(x) = Σ_{i=1}^N c_i k(x_i, x), which requires O(Nd) operations to compute and requires retaining much of the dataset unless the machine is very sparse. This is often unacceptable for large datasets. On the other hand, after learning a hyperplane w, a linear machine can be evaluated by simply computing f(x) = w′z(x), which, with the randomized feature maps presented here, requires only O(D + d) operations and storage.
We demonstrate two randomized feature maps for approximating shift-invariant kernels. Our first randomized map, presented in Section 3, consists of sinusoids randomly drawn from the Fourier transform of the kernel function we seek to approximate. Because this map is smooth, it is well-suited for interpolation tasks. Our second randomized map, presented in Section 4, partitions the input space using randomly shifted grids at randomly chosen resolutions. This mapping is not smooth, but leverages the proximity between input points, and is well-suited for approximating kernels that depend on the L1 distance between datapoints. Our experiments in Section 5 demonstrate that combining these randomized maps with simple linear learning algorithms competes favorably with state-of-the-art training algorithms in a variety of regression and classification scenarios.

2 Related Work

The most popular methods for large-scale kernel machines are decomposition methods for solving Support Vector Machines (SVM).
These methods iteratively update a subset of the kernel machine's coefficients using coordinate ascent until KKT conditions are satisfied to within a tolerance [5, 6]. While such approaches are versatile workhorses, they do not always scale to datasets with more than hundreds of thousands of datapoints for non-linear problems. To extend learning with kernel machines to these scales, several approximation schemes have been proposed for speeding up operations involving the kernel matrix.
The evaluation of the kernel function can be sped up using linear random projections [7]. Throwing away individual entries [7] or entire rows [8, 9, 10] of the kernel matrix lowers the storage and computational cost of operating on the kernel matrix. These approximations either preserve the separability of the data [8], or produce good low-rank or sparse approximations of the true kernel matrix [7, 9]. Fast multipole and multigrid methods have also been proposed for this purpose, but, while they appear to be effective on small and low-dimensional problems, they have not been demonstrated on large datasets. Further, the quality of the Hermite or Taylor approximation that these methods rely on degrades exponentially with the dimensionality of the dataset [11]. Fast nearest neighbor lookup with KD-Trees has been used to approximate multiplication with the kernel matrix, and in turn, a variety of other operations [12]. The feature map we present in Section 4 is reminiscent of KD-trees in that it partitions the input space using multi-resolution axis-aligned grids similar to those developed in [13] for embedding linear assignment problems.

3 Random Fourier Features

Our first set of random features projects data points onto a randomly chosen line, and then passes the resulting scalar through a sinusoid (see Figure 1 and Algorithm 1).
The random lines are drawn from a distribution chosen to guarantee that the inner product of two transformed points approximates the desired shift-invariant kernel.
The following classical theorem from harmonic analysis provides the key insight behind this transformation:
Theorem 1 (Bochner [15]). A continuous kernel k(x, y) = k(x − y) on R^d is positive definite if and only if k(δ) is the Fourier transform of a non-negative measure.

Figure 1: Random Fourier Features. Each component of the feature map z(x) projects x onto a random direction ω drawn from the Fourier transform p(ω) of k(Δ), and wraps this line onto the unit circle in R². After transforming two points x and y in this way, their inner product is an unbiased estimator of k(x, y). The table lists some popular shift-invariant kernels and their Fourier transforms. To deal with non-isotropic kernels, the data may be whitened before applying one of these kernels.

Kernel Name | k(Δ) | p(ω)
Gaussian | e^{−‖Δ‖₂²/2} | (2π)^{−D/2} e^{−‖ω‖₂²/2}
Laplacian | e^{−‖Δ‖₁} | Π_d 1/(π(1 + ω_d²))
Cauchy | Π_d 2/(1 + Δ_d²) | e^{−‖ω‖₁}

If the kernel k(δ) is properly scaled, Bochner's theorem guarantees that its Fourier transform p(ω) is a proper probability distribution. Defining ζ_ω(x) = e^{jω′x}, we have

k(x − y) = ∫_{R^d} p(ω) e^{jω′(x−y)} dω = E_ω[ζ_ω(x) ζ_ω(y)*],   (2)

so ζ_ω(x) ζ_ω(y)* is an unbiased estimate of k(x, y) when ω is drawn from p.
To obtain a real-valued random feature for k, note that both the probability distribution p(ω) and the kernel k(Δ) are real, so the integrand e^{jω′(x−y)} may be replaced with cos ω′(x − y).
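As a quick numerical check of this unbiased estimator, the following sketch draws ω from the Gaussian kernel's Fourier transform (the standard normal distribution, per the table in Figure 1), builds the paired cosine-sine features described next, and compares the feature inner product to the exact kernel value. The dimensions, seed, and function name are illustrative choices, not taken from the paper.

```python
# Minimal sketch of random Fourier features for the Gaussian kernel
# k(x - y) = exp(-||x - y||^2 / 2), whose Fourier transform p(w) is the
# standard normal distribution. D controls the Monte Carlo accuracy.
import numpy as np

def random_fourier_features(X, omega):
    # Map each row x of X to [cos(w_1'x) ... cos(w_D'x) sin(w_1'x) ...] / sqrt(D).
    D = omega.shape[1]
    proj = X @ omega                         # N x D random projections
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(0)
d, D = 5, 2000
omega = rng.standard_normal((d, D))          # D iid samples from p(w) = N(0, I)

x = rng.standard_normal(d)
y = rng.standard_normal(d)
zx = random_fourier_features(x[None, :], omega)[0]
zy = random_fourier_features(y[None, :], omega)[0]

exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
approx = zx @ zy                             # concentrates around exact as D grows
print(exact, approx)
```

Note that zx @ zx is exactly 1 regardless of the draw, since cos² + sin² sums to D before normalization; the error of the cross inner product shrinks at the usual O(1/√D) Monte Carlo rate.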
Defining z_ω(x) = [cos(ω′x) sin(ω′x)]′ gives a real-valued mapping that satisfies the condition E[z_ω(x)′z_ω(y)] = k(x, y), since z_ω(x)′z_ω(y) = cos ω′(x − y). Other mappings such as z_ω(x) = √2 cos(ω′x + b), where ω is drawn from p(ω) and b is drawn uniformly from [0, 2π], also satisfy the condition E[z_ω(x)′z_ω(y)] = k(x, y).
We can lower the variance of z_ω(x)′z_ω(y) by concatenating D randomly chosen z_ω into a column vector z and normalizing each component by √D. The inner product of points featurized by the 2D-dimensional random feature z, z(x)′z(y) = (1/D) Σ_{j=1}^D z_{ω_j}(x)′z_{ω_j}(y), is a sample average of the z_{ω_j}(x)′z_{ω_j}(y) and is therefore a lower variance approximation to the expectation (2).
Since z_ω(x)′z_ω(y) is bounded between −1 and 1, for a fixed pair of points x and y, Hoeffding's inequality guarantees exponentially fast convergence in D between z(x)′z(y) and k(x, y): Pr[|z(x)′z(y) − k(x, y)| ≥ ε] ≤ 2 exp(−Dε²/2). Building on this observation, a much stronger assertion can be proven for every pair of points in the input space simultaneously:
Claim 1 (Uniform convergence of Fourier features). Let M be a compact subset of R^d with diameter diam(M).
Then, for the mapping z defined in Algorithm 1, we have

Pr[ sup_{x,y∈M} |z(x)′z(y) − k(x, y)| ≥ ε ] ≤ 2^8 (σ_p diam(M)/ε)² exp(−Dε²/(4(d + 2))),

where σ_p² ≡ E_p[ω′ω] is the second moment of the Fourier transform of k. Further, sup_{x,y∈M} |z(x)′z(y) − k(y, x)| ≤ ε with any constant probability when D = Ω((d/ε²) log(σ_p diam(M)/ε)).

The proof of this assertion first guarantees that z(x)′z(y) is close to k(x − y) for the centers of an ε-net over M × M. This result is then extended to the entire space using the fact that the feature map is smooth with high probability. See the Appendix for details.
By a standard Fourier identity, the scalar σ_p² is equal to the trace of the Hessian of k at 0. It quantifies the curvature of the kernel at the origin. For the spherical Gaussian kernel, k(x, y) = exp(−γ‖x − y‖²), we have σ_p² = 2dγ.

Algorithm 1 Random Fourier Features.
Require: A positive definite shift-invariant kernel k(x, y) = k(x − y).
Ensure: A randomized feature map z(x) : R^d → R^{2D} so that z(x)′z(y) ≈ k(x − y).
  Compute the Fourier transform p of the kernel k: p(ω) = (1/(2π)) ∫ e^{−jω′Δ} k(Δ) dΔ.
  Draw D iid samples ω₁, ..., ω_D ∈ R^d from p.
  Let z(x) ≡ √(1/D) [cos(ω₁′x) ··· cos(ω_D′x) sin(ω₁′x) ··· sin(ω_D′x)]′.

4 Random Binning Features

Our second random map partitions the input space using randomly shifted grids at randomly chosen resolutions and assigns to an input point a binary bit string that corresponds to the bin in which it falls (see Figure 2
and Algorithm 2). The grids are constructed so that the probability that two points x and y are assigned to the same bin is proportional to k(x, y). The inner product between a pair of transformed points is proportional to the number of times the two points are binned together, and is therefore an unbiased estimate of k(x, y).

k(x_i, x_j) ≈ z₁(x_i)′z₁(x_j) + z₂(x_i)′z₂(x_j) + z₃(x_i)′z₃(x_j) + ··· = z(x_i)′z(x_j)

Figure 2: Random Binning Features. (left) The algorithm repeatedly partitions the input space using a randomly shifted grid at a randomly chosen resolution and assigns to each point x the bit string z(x) associated with the bin to which it is assigned. (right) The binary adjacency matrix that describes this partitioning has z(x_i)′z(x_j) in its ijth entry and is an unbiased estimate of the kernel matrix.

We first describe a randomized mapping to approximate the “hat” kernel k_hat(x, y; δ) = max(0, 1 − |x − y|/δ) on a compact subset of R × R, then show how to construct mappings for more general separable multi-dimensional kernels. Partition the real number line with a grid of pitch δ, and shift this grid randomly by an amount u drawn uniformly at random from [0, δ]. This grid partitions the real number line into intervals [u + nδ, u + (n + 1)δ] for all integers n. The probability that two points x and y fall in the same bin of this grid is max(0, 1 − |x − y|/δ) [13]. In other words, if we number the bins of the grid so that a point x falls in bin x̂ = ⌊(x − u)/δ⌋ and y falls in bin ŷ = ⌊(y − u)/δ⌋, then Pr_u[x̂ = ŷ | δ] = k_hat(x, y; δ).
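The collision probability just stated can be checked numerically by averaging over many random shifts of a single grid; the points and pitch below are arbitrary illustrative values.

```python
# Numeric check that a randomly shifted grid of pitch delta bins two
# scalars x and y together with probability max(0, 1 - |x - y| / delta),
# i.e. the hat kernel described in the text.
import numpy as np

rng = np.random.default_rng(1)
x, y, delta = 0.3, 0.9, 1.5

# Draw many uniform shifts and count how often x and y share a bin.
shifts = rng.uniform(0.0, delta, size=200_000)
same_bin = np.floor((x - shifts) / delta) == np.floor((y - shifts) / delta)

empirical = same_bin.mean()
exact = max(0.0, 1.0 - abs(x - y) / delta)
print(empirical, exact)
```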
If we encode x̂ as a binary indicator vector z(x) over the bins, then z(x)′z(y) = 1 if x and y fall in the same bin and zero otherwise, so Pr_u[z(x)′z(y) = 1 | δ] = E_u[z(x)′z(y) | δ] = k_hat(x, y; δ). Therefore z is a random map for k_hat.
Now consider shift-invariant kernels that can be written as convex combinations of hat kernels on a compact subset of R × R: k(x, y) = ∫₀^∞ k_hat(x, y; δ) p(δ) dδ. If the pitch δ of the grid is sampled from p, z again gives a random map for k because E_{δ,u}[z(x)′z(y)] = E_δ[E_u[z(x)′z(y) | δ]] = E_δ[k_hat(x, y; δ)] = k(x, y). That is, if the pitch δ of the grid is sampled from p, and the shift u is drawn uniformly from [0, δ], the probability that x and y are binned together is k(x, y). Lemma 1 in the appendix shows that p can be easily recovered from k by setting p(δ) = δ k̈(δ). For example, in the case of the Laplacian kernel, k_Laplacian(x, y) = exp(−|x − y|), p(δ) is the Gamma distribution δ exp(−δ). For the Gaussian kernel, k̈ is not everywhere positive, so this procedure does not yield a random map.
Random maps for separable multivariate shift-invariant kernels of the form k(x − y) = Π_{m=1}^d k_m(|x_m − y_m|) (such as the multivariate Laplacian kernel) can be constructed in a similar way if each k_m can be written as a convex combination of hat kernels. We apply the above binning process over each dimension of R^d independently. The probability that x_m and y_m are binned together in dimension m is k_m(|x_m − y_m|).
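Combining the per-dimension binning just described gives a random map for the multivariate Laplacian kernel. The sketch below (the function name, P, seed, and test points are illustrative choices, not from the paper) estimates k(x − y) = exp(−‖x − y‖₁) as the fraction of random grids that bin x and y together, drawing each per-dimension pitch from the Gamma distribution δ exp(−δ) given above.

```python
# Sketch of random binning for the multivariate Laplacian kernel
# k(x - y) = exp(-||x - y||_1). Per the text, each pitch delta is drawn
# from p(delta) = delta * exp(-delta), a Gamma(2, 1) distribution.
import numpy as np

def binned_together(x, y, P, rng):
    # Fraction of P random grids that put x and y in the same
    # d-dimensional bin; its expectation is k(x - y).
    d = x.shape[0]
    hits = 0
    for _ in range(P):
        delta = rng.gamma(shape=2.0, scale=1.0, size=d)   # grid pitches
        u = rng.uniform(0.0, delta)                       # per-dimension shifts
        same = np.floor((x - u) / delta) == np.floor((y - u) / delta)
        hits += same.all()                                # binned together in every dim
    return hits / P

rng = np.random.default_rng(2)
x = np.array([0.1, 0.5, -0.2])
y = np.array([0.3, 0.4, 0.1])
estimate = binned_together(x, y, P=30_000, rng=rng)
exact = np.exp(-np.sum(np.abs(x - y)))
print(estimate, exact)
```

A practical implementation would retain the bin identities as sparse indicator features rather than recompute collisions per pair, as Algorithm 2 does with a hash of the bin coordinates.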
Since the binning process is independent across dimensions, the probability that x and y are binned together in every dimension is Π_{m=1}^d k_m(|x_m − y_m|) = k(x − y). In this multivariate case, z(x) encodes the integer vector [x̂₁, ..., x̂_d] corresponding to each bin of the d-dimensional grid as a binary indicator vector. In practice, to prevent overflows when computing z(x) when d is large, our implementation eliminates unoccupied bins from the representation. Since there are never more bins than training points, this ensures no overflow is possible.
We can again reduce the variance of the estimator z(x)′z(y) by concatenating P random binning functions z into a larger list of features z and scaling by √(1/P). The inner product z(x)′z(y) = (1/P) Σ_{p=1}^P z_p(x)′z_p(y) is the average of P independent z(x)′z(y) and therefore has lower variance. Since z(x)′z(y) is binary, Hoeffding's inequality guarantees that for a fixed pair of points x and y, z(x)′z(y) converges exponentially quickly to k(x, y) as a function of P. Again, a much stronger claim is that this convergence holds simultaneously for all points:
Claim 2. Let M be a compact subset of R^d with diameter diam(M). Let α = E[1/δ] and let L_k denote the Lipschitz constant of k with respect to the L1 norm. With z as above, we have

Pr[ sup_{x,y∈M} |z(x)′z(y) − k(x, y)| ≤ ε ] ≥ 1 − 36 d P α diam(M) exp( −(Pε²/8 + ln(ε/L_k)) / (d + 1) ).

Note that α = ∫₀^∞ (1/δ) p(δ) dδ = ∫₀^∞ k̈(δ) dδ is 1, and L_k = 1 for the Laplacian kernel. The proof of the claim (see the appendix) partitions M × M into a few small rectangular cells over which k(x, y) does not change much and z(x) and z(y) are constant.
With high probability, at the centers of these cells z(x)′z(y) is close to k(x, y), which guarantees that k(x, y) and z(x)′z(y) are close throughout M × M.

Algorithm 2 Random Binning Features.
Require: A point x ∈ R^d. A kernel function k(x, y) = Π_{m=1}^d k_m(|x_m − y_m|), so that p_m(Δ) ≡ Δ k̈_m(Δ) is a probability distribution on Δ ≥ 0.
Ensure: A randomized feature map z(x) so that z(x)′z(y) ≈ k(x − y).
  for p = 1 ... P do
    Draw grid parameters δ, u ∈ R^d with the pitch δ_m ∼ p_m, and shift u_m from the uniform distribution on [0, δ_m].
    Let z return the coordinate of the bin containing x as a binary indicator vector z_p(x) ≡ hash(⌈(x₁ − u₁)/δ₁⌉, ..., ⌈(x_d − u_d)/δ_d⌉).
  end for
  z(x) ≡ √(1/P) [z₁(x) ··· z_P(x)]′.

5 Experiments

The experiments summarized in Table 1 show that ridge regression with our random features is a fast way to approximate the training of supervised kernel machines. We focus our comparisons against the Core Vector Machine [14] because it was shown in [14] to be both faster and more accurate than other known approaches for training kernel machines, including, in most cases, random sampling of datapoints [8]. The experiments were conducted on the five standard large-scale datasets evaluated in [14], excluding the synthetic datasets.
We replicated the results in the literature pertaining to the CVM, SVMlight, and libSVM using binaries provided by the respective authors.¹ For the random feature experiments, we trained regressors and classifiers by solving the ridge regression problem

¹ We include KDDCUP99 results for completeness, but note this dataset is inherently oversampled: training an SVM (or least squares with random features) on a random sampling of 50 training examples (0.001% of the training dataset) is sufficient to consistently yield a test-error on the order of 8%. Also, while we were able to replicate the CVM's 6.2% error rate with the parameters supplied by the authors, retraining after randomly shuffling the training set results in 18% error and increases the computation time by an order of magnitude. Even on the original ordering, perturbing the CVM's regularization parameter by a mere 15% yields a 49% error rate on the test set [16].

Dataset | Fourier+LS | Binning+LS | CVM | Exact SVM
CPU (regression, 6500 instances, 21 dims) | 3.6%, 20 secs, D = 300 | 5.3%, 3 mins, P = 350 | 5.5%, 51 secs | 11%, 31 secs (ASVM)
Census (regression, 18,000 instances, 119 dims) | 5%, 36 secs, D = 500 | 7.5%, 19 mins, P = 30 | 8.8%, 7.5 mins | 9%, 13 mins (SVMTorch)
Adult (classification, 32,000 instances, 123 dims) | 14.9%, 9 secs, D = 500 | 15.3%, 1.5 mins, P = 30 | 14.8%, 73 mins | 15.1%, 7 mins (SVMlight)
Forest Cover (classification, 522,000 instances, 54 dims) | 11.6%, 71 mins, D = 5000 | 2.2%, 25 mins, P = 50 | 2.3%, 7.5 hrs | 2.2%, 44 hrs (libSVM)
KDDCUP99 (classification, 4,900,000 instances, 127 dims; see footnote) | 7.3%, 1.5 min, D = 50 | 7.3%, 35 mins, P = 10 | 6.2% (18%), 1.4 secs (20 secs) | 8.3%, < 1 s (SVM+sampling)

Table 1: Comparison of testing error and training time between ridge regression with random features, Core Vector Machine, and various state-of-the-art exact methods reported in the literature.
For classification tasks, the percent of testing points incorrectly predicted is reported, and for regression tasks, the RMS error normalized by the norm of the ground truth.

Figure 3: Accuracy on test data continues to improve as the training set grows. On the Forest dataset, using random binning, doubling the dataset size reduces testing error by up to 40% (left). Error decays quickly as P grows (middle). Training time grows slowly as P grows (right).

min_w ‖Z′w − y‖₂² + λ‖w‖₂², where y denotes the vector of desired outputs and Z denotes the matrix of random features. To evaluate the resulting machine on a datapoint x, we can simply compute w′z(x). Despite its simplicity, ridge regression with random features is faster than, and provides competitive accuracy with, alternative methods. It also produces very compact functions because only w and a set of O(D) random vectors or a hash-table of partitions need to be retained.
Random Fourier features perform better on the tasks that largely rely on interpolation. On the other hand, random binning features perform better on memorization tasks (those for which the standard SVM requires many support vectors), because they explicitly preserve locality in the input space. This difference is most dramatic in the Forest dataset.
Figure 3 (left) illustrates the benefit of training classifiers on larger datasets, where accuracy continues to improve as more data are used in training. Figure 3 (middle) and (right) show that good performance can be obtained even from a modest number of features.

6 Conclusion

We have presented randomized features whose inner products uniformly approximate many popular kernels.
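To make the overall recipe concrete, here is a minimal end-to-end sketch of random Fourier features followed by the ridge regression described in Section 5, applied to a synthetic one-dimensional task. The data, kernel bandwidth, regularizer, and D are illustrative choices and do not reproduce the paper's experiments; here Z stores one example per row, so the objective reads ‖Zw − y‖² + λ‖w‖².

```python
# Pipeline sketch: random Fourier features + ridge regression
# min_w ||Z w - y||^2 + lam * ||w||^2, with Z holding one example per row.
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: a noisy sine, a task that rewards smooth interpolation.
X = rng.uniform(-3.0, 3.0, size=(500, 1))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(500)

# Fourier features for the Gaussian kernel exp(-gamma * (x - x_)^2):
# in one dimension its Fourier transform is N(0, 2 * gamma).
gamma, D, lam = 1.0, 200, 1e-3
omega = rng.standard_normal((1, D)) * np.sqrt(2.0 * gamma)

def z(X):
    proj = X @ omega
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

Z = z(X)                                                  # N x 2D feature matrix
w = np.linalg.solve(Z.T @ Z + lam * np.eye(2 * D), Z.T @ y)

X_test = np.linspace(-3.0, 3.0, 200)[:, None]
pred = z(X_test) @ w                                      # evaluate with w'z(x)
rmse = np.sqrt(np.mean((pred - np.sin(2.0 * X_test[:, 0])) ** 2))
print(rmse)
```

Training touches only a 2D × 2D linear system, and evaluation is a single inner product per test point, which is the compactness advantage discussed above.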
We showed empirically that providing these features as input to a standard linear learning algorithm produces results that are competitive with state-of-the-art large-scale kernel machines in accuracy, training time, and evaluation time.
It is worth noting that hybrids of Fourier features and binning features can be constructed by concatenating these features. While we have focused on regression and classification, our features can be applied to accelerate other kernel methods, including semi-supervised and unsupervised learning algorithms. In all of these cases, a significant computational speed-up can be achieved by first computing random features and then applying the associated linear technique.

7 Acknowledgements

We thank Eric Garcia for help on early versions of these features, Sameer Agarwal and James R. Lee for helpful discussions, and Erik Learned-Miller and Andres Corrada-Emmanuel for helpful corrections.

References
[1] T. Joachims. Training linear SVMs in linear time. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.
[2] M. C. Ferris and T. S. Munson. Interior-point methods for massive Support Vector Machines. SIAM Journal on Optimization, 13(3):783–804, 2003.
[3] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In International Conference on Machine Learning (ICML), 2007.
[4] D. DeCoste and D. Mazzoni. Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. In International Conference on Machine Learning (ICML), 2003.
[5] J. Platt. Using sparseness and analytic QP to speed training of Support Vector Machines. In Advances in Neural Information Processing Systems (NIPS), 1999.
[6] C.-C. Chang and C.-J. Lin.
LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] D. Achlioptas, F. McSherry, and B. Schölkopf. Sampling techniques for kernel methods. In Advances in Neural Information Processing Systems (NIPS), 2001.
[8] A. Blum. Random projection, margins, kernels, and feature-selection. LNCS, 3940:52–68, 2006.
[9] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Foundations of Computer Science (FOCS), pages 378–390, 1998.
[10] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. In COLT, pages 323–337, 2005.
[11] C. Yang, R. Duraiswami, and L. Davis. Efficient kernel machines using the improved fast Gauss transform. In Advances in Neural Information Processing Systems (NIPS), 2004.
[12] Y. Shen, A. Y. Ng, and M. Seeger. Fast Gaussian process regression using KD-Trees. In Advances in Neural Information Processing Systems (NIPS), 2005.
[13] P. Indyk and N. Thaper. Fast image retrieval via embeddings. In International Workshop on Statistical and Computational Theories of Vision, 2003.
[14] I. W. Tsang, J. T. Kwok, and P.-M. Cheung. Core Vector Machines: Fast SVM training on very large data sets. Journal of Machine Learning Research (JMLR), 6:363–392, 2005.
[15] W. Rudin. Fourier Analysis on Groups. Wiley Classics Library. Wiley-Interscience, New York, reprint edition, 1994.
[16] G. Loosli and S. Canu. Comments on the ‘Core Vector Machines: Fast SVM training on very large data sets’. Journal of Machine Learning Research (JMLR), 8:291–301, February 2007.
[17] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc., 39:1–49, 2001.

A Proofs

Lemma 1.
Suppose a function k(Δ) : R → R is twice differentiable and has the form k(Δ) = ∫₀^∞ p(δ) max(0, 1 − Δ/δ) dδ. Then p(δ) = δ k̈(δ).

Proof. We want p so that

k(Δ) = ∫₀^∞ p(δ) max(0, 1 − Δ/δ) dδ = ∫₀^Δ p(δ) · 0 dδ + ∫_Δ^∞ p(δ)(1 − Δ/δ) dδ   (3)
= ∫_Δ^∞ p(δ) dδ − Δ ∫_Δ^∞ p(δ)/δ dδ.   (4)

To solve for p, differentiate twice w.r.t. Δ to find that k̇(Δ) = −∫_Δ^∞ p(δ)/δ dδ and k̈(Δ) = p(Δ)/Δ.

Proof of Claim 1. Define s(x, y) ≡ z(x)′z(y) and f(x, y) ≡ s(x, y) − k(y, x). Since f and s are shift invariant, as their argument we use Δ ≡ x − y ∈ M_Δ for notational simplicity.
M_Δ is compact and has diameter at most twice diam(M), so we can find an ε-net that covers M_Δ using at most T = (4 diam(M)/r)^d balls of radius r [17]. Let {Δ_i}_{i=1}^T denote the centers of these balls, and let L_f denote the Lipschitz constant of f. We have |f(Δ)| < ε for all Δ ∈ M_Δ if |f(Δ_i)| < ε/2 and L_f < ε/(2r) for all i. We bound the probability of these two events.
Since f is differentiable, L_f = ‖∇f(Δ*)‖, where Δ* = arg max_{Δ∈M_Δ} ‖∇f(Δ)‖. We have E[L_f²] = E‖∇f(Δ*)‖² = E‖∇s(Δ*)‖² − ‖∇k(Δ*)‖² ≤ E‖∇s(Δ*)‖² ≤ E_p‖ω‖² = σ_p², so by Markov's inequality, Pr[L_f² ≥ t] ≤ E[L_f²]/t, or

Pr[ L_f ≥ ε/(2r) ] ≤ (2rσ_p/ε)².   (5)

The union bound followed by Hoeffding's inequality applied to the anchors in the ε-net gives

Pr[ ∪_{i=1}^T |f(Δ_i)| ≥ ε/2 ] ≤ 2T exp(−Dε²/8).   (6)

Combining (5) and (6) gives a bound in terms of the free variable r:

Pr[ sup_{Δ∈M_Δ} |f(Δ)| ≤ ε ] ≥ 1 − 2 (4 diam(M)/r)^d exp(−Dε²/8) − (2rσ_p/ε)².   (7)

This has the form 1 − κ₁r^{−d} − κ₂r². Setting r = (κ₁/κ₂)^{1/(d+2)} turns this into 1 − 2 κ₁^{2/(d+2)} κ₂^{d/(d+2)}, and, assuming that σ_p diam(M)/ε ≥ 1 and diam(M) ≥ 1, proves the first part of the claim. To prove the second part of the claim, pick any probability for the RHS and solve for D.

Proof of Claim 2. M can be covered by rectangles over each of which z is constant. Let δ_pm be the pitch of the pth grid along the mth dimension. Each grid has at most ⌈diam(M)/δ_pm⌉ bins, and P overlapping grids produce at most N_m = Σ_{p=1}^P ⌈diam(M)/δ_pm⌉ ≤ P + diam(M) Σ_{p=1}^P 1/δ_pm partitions along the mth dimension. The expected value of the right hand side is P + P diam(M)α. By Markov's inequality and the union bound, Pr[∀m: N_m ≤ t(P + P diam(M)α)] ≥ 1 − d/t. That is, with probability 1 − d/t, along every dimension, we have at most t(P + P diam(M)α) one-dimensional cells. Denote by d_mi the width of the ith cell along the mth dimension and observe that Σ_{i=1}^{N_m} d_mi ≤ diam(M).
We further subdivide these cells into smaller rectangles of some small width r to ensure that the kernel k varies very little over each of these cells. This results in at most Σ_{i=1}^{N_m} ⌈d_mi/r⌉ ≤ N_m + diam(M)/r small one-dimensional cells over each dimension. Plugging in the upper bound for N_m, setting t ≥ 1/(αP) and assuming α diam(M) ≥ 1, with probability 1 − d/t, M can be covered with T ≤ (3tPα diam(M)/r)^d rectangles of side r centered at {x_i}_{i=1}^T.
The condition |z(x, y) − k(x, y)| ≤ ε on M × M holds if |z(x_i, y_j) − k(x_i, y_j)| ≤ ε − L_k r d and z(x) is constant throughout each rectangle. With rd = ε/(2L_k), the union bound followed by Hoeffding's inequality gives

Pr[ ∪_{ij} |z(x_i, y_j) − k(x_i, y_j)| ≥ ε/2 ] ≤ 2T² exp(−Pε²/8).   (8)

Combining this with the probability that z(x) is constant in each cell gives a bound in terms of t:

Pr[ sup_{x,y∈M×M} |z(x, y) − k(x, y)| ≤ ε ] ≥ 1 − d/t − 2(3tPα diam(M))^d (2L_k/ε)^d exp(−Pε²/8).

This has the form 1 − κ₁t^{−1} − κ₂t^d. To prove the claim, set t = (κ₁/κ₂)^{1/(d+1)}, which results in an upper bound of 1 − 3κ₁^{d/(d+1)} κ₂^{1/(d+1)}.", "award": [], "sourceid": 833, "authors": [{"given_name": "Ali", "family_name": "Rahimi", "institution": null}, {"given_name": "Benjamin", "family_name": "Recht", "institution": null}]}