{"title": "Spherical Random Features for Polynomial Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 1846, "page_last": 1854, "abstract": "Compact explicit feature maps provide a practical framework to scale kernel methods to large-scale learning, but deriving such maps for many types of kernels remains a challenging open problem. Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials. Meanwhile, because polynomial kernels are unbounded, they are frequently applied to data that has been normalized to unit l2 norm. The question we address in this work is: if we know a priori that data is so normalized, can we devise a more compact map? We show that a putative affirmative answer to this question based on Random Fourier Features is impossible in this setting, and introduce a new approximation paradigm, Spherical Random Fourier (SRF) features, which circumvents these issues and delivers a compact approximation to polynomial kernels for data on the unit sphere. Compared to prior work, SRF features are less rank-deficient, more compact, and achieve better kernel approximation, especially for higher-order polynomials. The resulting predictions have lower variance and typically yield better classification accuracy.", "full_text": "Spherical Random Features for Polynomial Kernels\n\nJeffrey Pennington\n\nFelix X. Yu\n\nSanjiv Kumar\n\n{jpennin, felixyu, sanjivk}@google.com\n\nGoogle Research\n\nAbstract\n\nCompact explicit feature maps provide a practical framework to scale kernel meth-\nods to large-scale learning, but deriving such maps for many types of kernels\nremains a challenging open problem. 
Among the commonly used kernels for nonlinear classification are polynomial kernels, for which low approximation error has thus far necessitated explicit feature maps of large dimensionality, especially for higher-order polynomials. Meanwhile, because polynomial kernels are unbounded, they are frequently applied to data that has been normalized to unit ℓ2 norm. The question we address in this work is: if we know a priori that data is normalized, can we devise a more compact map? We show that a putative affirmative answer to this question based on Random Fourier Features is impossible in this setting, and introduce a new approximation paradigm, Spherical Random Fourier (SRF) features, which circumvents these issues and delivers a compact approximation to polynomial kernels for data on the unit sphere. Compared to prior work, SRF features are less rank-deficient, more compact, and achieve better kernel approximation, especially for higher-order polynomials. The resulting predictions have lower variance and typically yield better classification accuracy.

1 Introduction

Kernel methods such as nonlinear support vector machines (SVMs) [1] provide a powerful framework for nonlinear learning, but they often come with significant computational cost. Their training complexity varies from O(n²) to O(n³), which becomes prohibitive when the number of training examples, n, grows to the millions. Testing also tends to be slow, with an O(nd) complexity for d-dimensional vectors.
Explicit kernel maps provide a practical alternative for large-scale applications since they rely on properties of linear methods, which can be trained in O(n) time [2, 3, 4] and applied in O(d) time, independent of n. The idea is to determine an explicit nonlinear map Z(·) : R^d → R^D such that K(x, y) ≈ ⟨Z(x), Z(y)⟩, and to perform linear learning in the resulting feature space. 
This procedure can utilize the fast training and testing of linear methods while still preserving much of the expressive power of the nonlinear methods.
Following this reasoning, Rahimi and Recht [5] proposed a procedure for generating such a nonlinear map, derived from the Monte Carlo integration of an inverse Fourier transform arising from Bochner's theorem [6]. Explicit nonlinear random feature maps have also been proposed for other types of kernels, such as intersection kernels [7], generalized RBF kernels [8], skewed multiplicative histogram kernels [9], additive kernels [10], and semigroup kernels [11].
Another type of kernel that is used widely in many application domains is the polynomial kernel [12, 13], defined by K(x, y) = (⟨x, y⟩ + q)^p, where q is the bias and p is the degree of the polynomial. Approximating polynomial kernels with explicit nonlinear maps is a challenging problem, but substantial progress has been made in this area recently. Kar and Karnick [14] catalyzed this line of research by introducing the Random Maclaurin (RM) technique, which approximates ⟨x, y⟩^p by the product ∏_{i=1}^p ⟨wᵢ, x⟩ ∏_{i=1}^p ⟨wᵢ, y⟩, where wᵢ is a vector consisting of Bernoulli random variables. Another technique, Tensor Sketch [15], offers further improvement by instead writing ⟨x, y⟩^p as ⟨x^(p), y^(p)⟩, where x^(p) is the p-level tensor product of x, and then estimating this tensor product with a convolution of count sketches.
Although these methods are applicable to any real-valued input data, in practice polynomial kernels are commonly used on ℓ2-normalized input data [15] because they are otherwise unbounded. 
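As a concrete illustration of the RM estimator just described (a minimal sketch, not the reference implementation; the block count D, test vectors, and random seed are illustrative):

```python
import math
import random

random.seed(0)

def random_maclaurin_estimate(x, y, p, D):
    """Estimate <x, y>^p as an average over D independent blocks.

    Each block draws p random sign vectors w_i and forms
    prod_i <w_i, x> * prod_i <w_i, y>; one block is an unbiased
    estimate of <x, y>^p because E[w w^T] = I for Rademacher w.
    """
    d = len(x)
    total = 0.0
    for _ in range(D):
        block = 1.0
        for _ in range(p):
            w = [random.choice((-1.0, 1.0)) for _ in range(d)]
            block *= sum(wi * xi for wi, xi in zip(w, x)) * \
                     sum(wi * yi for wi, yi in zip(w, y))
        total += block
    return total / D

# two unit-norm vectors in R^4
x = [0.5, 0.5, 0.5, 0.5]
y = [1.0, 0.0, 0.0, 0.0]
exact = sum(a * b for a, b in zip(x, y)) ** 3   # <x, y>^3 = 0.125
approx = random_maclaurin_estimate(x, y, p=3, D=50000)
print(abs(approx - exact))  # small, but shrinks only like 1/sqrt(D)
```

The relatively large per-block variance of this estimator is one motivation for the more compact maps developed later in the paper.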
Moreover, much of the theoretical analysis developed in former work is based on normalized vectors [16], and it has been shown that utilizing norm information improves the estimates of random projections [17]. Therefore, a natural question to ask is: if we know a priori that data is ℓ2-normalized, can we come up with a better nonlinear map?¹ Answering this question is the main focus of this work and will lead us to the development of a new form of kernel approximation.
Restricting the input domain to the unit sphere implies that ⟨x, y⟩ = 1 − ‖x − y‖²/2 for all x, y ∈ S^{d−1}, so that a polynomial kernel can be viewed as a shift-invariant kernel in this restricted domain. As such, one might expect the random feature maps developed in [5] to be applicable in this case. Unfortunately, this expectation turns out to be false because Bochner's theorem cannot be applied in this setting. The obstruction is an inherent limitation of polynomial kernels and is examined extensively in Section 3.1. In Section 3.2, we propose an alternative formulation that overcomes these limitations by approximating the Fourier transform of the kernel function as the positive projection of an indefinite combination of Gaussians. We provide a bound on the approximation error of these Spherical Random Fourier (SRF) features in Section 4, and study their performance on a variety of standard datasets, including a large-scale experiment on ImageNet, in Section 5 and in the Supplementary Material.
Compared to prior work, the SRF method is able to achieve lower kernel approximation error with compact nonlinear maps, especially for higher-order polynomials. The variance in kernel approximation error is much lower than that of existing techniques, leading to more stable predictions. In addition, it does not suffer from the rank deficiency problem seen in other methods. 
Before describing the SRF method in detail, we begin by reviewing the method of Random Fourier Features.

2 Background: Random Fourier Features

In [5], a method for the explicit construction of compact nonlinear randomized feature maps was presented. The technique relies on two important properties of the kernel: i) the kernel is shift-invariant, i.e. K(x, y) = K(z) where z = x − y, and ii) the function K(z) is positive definite on R^d. Property (ii) guarantees that the Fourier transform of K(z),

k(w) = (2π)^{−d/2} ∫ d^d z K(z) e^{i⟨w,z⟩},

admits an interpretation as a probability distribution. This fact follows from Bochner's celebrated characterization of positive definite functions:
Theorem 1. (Bochner [6]) A function K ∈ C(R^d) is positive definite on R^d if and only if it is the Fourier transform of a finite non-negative Borel measure on R^d.
A consequence of Bochner's theorem is that the inverse Fourier transform of k(w) can be interpreted as the computation of an expectation, i.e.,

K(z) = (2π)^{−d/2} ∫ d^d w k(w) e^{−i⟨w,z⟩}
     = E_{w∼p(w)} [e^{−i⟨w, x−y⟩}]
     = 2 E_{w∼p(w), b∼U(0,2π)} [cos(⟨w, x⟩ + b) cos(⟨w, y⟩ + b)],    (1)

where p(w) = (2π)^{−d/2} k(w) and U(0, 2π) is the uniform distribution on [0, 2π). If the above expectation is approximated using Monte Carlo with D random samples wᵢ, then K(x, y) ≈ ⟨Z(x), Z(y)⟩ with Z(x) = √(2/D) [cos(w₁ᵀx + b₁), ..., cos(w_Dᵀx + b_D)]ᵀ. 
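A minimal sketch of this construction for the Gaussian kernel K(z) = exp(−‖z‖²/2), whose spectral distribution p(w) is standard normal (the dimensions, seed, and test points below are illustrative):

```python
import math
import random

random.seed(1)

d, D = 3, 20000  # input and feature dimensionality (illustrative)
# w_i ~ N(0, I) is exactly p(w) for the Gaussian kernel exp(-||z||^2 / 2)
W = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]
b = [random.uniform(0.0, 2.0 * math.pi) for _ in range(D)]

def Z(x):
    # Z(x) = sqrt(2/D) [cos(w_1^T x + b_1), ..., cos(w_D^T x + b_D)]
    return [math.sqrt(2.0 / D) * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
            for w, bi in zip(W, b)]

x = [0.6, 0.8, 0.0]
y = [0.0, 0.6, 0.8]
approx = sum(zx * zy for zx, zy in zip(Z(x), Z(y)))
exact = math.exp(-sum((a - c) ** 2 for a, c in zip(x, y)) / 2.0)
print(abs(approx - exact))  # Monte Carlo error, O(1/sqrt(D))
```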
This identification is made possible by property (i), which guarantees that the functional dependence on x and y factorizes multiplicatively in frequency space.

¹We are not claiming total generality of this setting; nevertheless, in cases where the vector length carries useful information and should be preserved, it could be added as an additional feature before normalization.

Such Random Fourier Features have been used to approximate different types of positive-definite shift-invariant kernels, including the Gaussian kernel, the Laplacian kernel, and the Cauchy kernel. However, they have not yet been applied to polynomial kernels, because this class of kernels does not satisfy the positive-definiteness prerequisite for the application of Bochner's theorem. This statement may seem counter-intuitive given the known result that polynomial kernels K(x, y) are positive definite kernels. The subtlety is that this statement does not necessarily imply that the associated single-variable functions K(z) = K(x − y) are positive definite on R^d for all d. We will prove this fact in the next section, along with the construction of an efficient and effective modification of the Random Fourier method that can be applied to polynomial kernels defined on the unit sphere.

3 Polynomial kernels on the unit sphere

In this section, we consider approximating the polynomial kernel defined on S^{d−1} × S^{d−1},

K(x, y) = (1 − ‖x − y‖²/a²)^p = α (q + ⟨x, y⟩)^p,    (2)

with q = a²/2 − 1 and α = (2/a²)^p. 
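The equality in eqn. (2) holds exactly for unit-norm inputs, since ‖x − y‖² = 2 − 2⟨x, y⟩ on the sphere; a quick numerical check (the dimension and kernel parameters below are arbitrary):

```python
import math
import random

random.seed(3)

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

d, a, p = 8, 4.0, 3
q, alpha = a * a / 2.0 - 1.0, (2.0 / (a * a)) ** p  # q = a^2/2 - 1, alpha = (2/a^2)^p

x = unit([random.gauss(0, 1) for _ in range(d)])
y = unit([random.gauss(0, 1) for _ in range(d)])

dot = sum(xi * yi for xi, yi in zip(x, y))
z2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))

lhs = (1.0 - z2 / (a * a)) ** p   # shift-invariant form K(z)
rhs = alpha * (q + dot) ** p      # polynomial form alpha * (q + <x, y>)^p
print(abs(lhs - rhs))  # agree up to floating-point rounding on the unit sphere
```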
We will restrict our attention to p ≥ 1, a ≥ 2.
The kernel is a shift-invariant radial function of the single variable z = x − y, which with a slight abuse of notation we write as K(x, y) = K(z) = K(z), with z = ‖z‖.² In Section 3.1, we show that the Fourier transform of K(z) is not a non-negative function, so a straightforward application of Bochner's theorem to produce Random Fourier Features as in [5] is impossible in this case. Nevertheless, in Section 3.2, we propose a fast and accurate approximation of K(z) by a surrogate positive definite function which enables us to construct compact Fourier features.

3.1 Obstructions to Random Fourier Features

Because z = ‖x − y‖ = √(2 − 2 cos θ) ≤ 2, the behavior of K(z) for z > 2 is undefined and arbitrary, since it does not affect the original kernel function in eqn. (2). On the other hand, it must be specified in order to perform the Fourier transform, which requires an integration over all values of z. We first consider the natural choice of K(z) = 0 for z > 2 before showing that all other choices lead to the same conclusion.
Lemma 1. The Fourier transform of {K(z), z ≤ 2; 0, z > 2} is not a non-negative function of w for any values of a, p, and d.

Proof. (See the Supplementary Material for details.) A direct calculation gives

k(w) = Σ_{i=0}^{p} [p!/(p − i)!] (1 − 4/a²)^{p−i} (2/a²)^i (2/w)^{d/2+i} J_{d/2+i}(2w),

where J_ν(z) is the Bessel function of the first kind. Expanding for large w yields

k(w) ∼ (1/√(πw)) (1 − 4/a²)^p (2/w)^{d/2} cos(2w − π(d + 1)/4),    (3)

which takes negative values for some w for all a > 2, p, and d.

So a Monte Carlo approximation of K(z) as in eqn. (1) is impossible in this case. 
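For intuition, the failure is easy to check numerically in the simplest case d = 1, where the radial Fourier transform reduces to a cosine transform (a sketch; the values of a and p below are arbitrary, and any a > 2 exhibits the problem):

```python
import math

A, P = 3.0, 5  # kernel parameters (arbitrary, a > 2)

def K(z):
    # truncated polynomial kernel profile: K(z) for z <= 2, zero beyond
    return (1.0 - z * z / (A * A)) ** P if z <= 2.0 else 0.0

def k_hat(w, n=4000):
    # d = 1 radial Fourier transform: sqrt(2/pi) * integral_0^2 K(z) cos(w z) dz
    # evaluated with the trapezoid rule on n subintervals
    h = 2.0 / n
    s = 0.5 * (K(0.0) + K(2.0) * math.cos(2.0 * w))
    for i in range(1, n):
        z = i * h
        s += K(z) * math.cos(w * z)
    return math.sqrt(2.0 / math.pi) * h * s

vals = [k_hat(0.1 * j) for j in range(400)]
print(min(vals) < 0 < k_hat(0.0))  # the transform dips negative: not a probability density
```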
However, there is still the possibility of defining the behavior of K(z) for z > 2 differently, and in such a way that the Fourier transform is positive and integrable on R^d. The latter condition should hold for all d, since the vector dimensionality d can vary arbitrarily depending on the input data.
We now show that such a function cannot exist. To this end, we first recall a theorem due to Schoenberg regarding completely monotone functions.

²We also follow this practice in frequency space, i.e. if k(w) is radial, we also write k(w) = k(w).

Definition 1. A function f is said to be completely monotone on an interval [a, b] ⊂ R if it is continuous on the closed interval, f ∈ C([a, b]), infinitely differentiable in its interior, f ∈ C∞((a, b)), and (−1)^l f^(l)(x) ≥ 0 for x ∈ (a, b), l = 0, 1, 2, . . .
Theorem 2. (Schoenberg [18]) A function φ is completely monotone on [0, ∞) if and only if Φ ≡ φ(‖ · ‖²) is positive definite and radial on R^d for all d.

Together with Theorem 1, Theorem 2 shows that φ(z) = K(√z) must be completely monotone on [0, ∞) if k(w) is to be interpreted as a probability distribution. We now establish that φ(z) cannot be completely monotone on [0, ∞) and simultaneously satisfy φ(z) = K(√z) for z ≤ 2.

Proposition 1. The function φ(z) = K(√z) is completely monotone on [0, a²].

Proof. From the definition of φ, φ(z) = (1 − z/a²)^p; φ is continuous on [0, a²], infinitely differentiable on (0, a²), and its derivatives vanish for l > p. They obey (−1)^l φ^(l)(z) = [p!/(p − l)!] φ(z)/(a² − z)^l ≥ 0, where the inequality follows since z < a². Therefore φ is completely monotone on [0, a²].

Theorem 3. Suppose f is a completely monotone polynomial of degree n on the interval [0, c], c < ∞, with f(c) = 0. 
Then there is no completely monotone function on [0, ∞) that agrees with f on [0, a] for any nonzero a < c.

Proof. Let g ∈ C([0, ∞)) ∩ C∞((0, ∞)) be a non-negative function that agrees with f on [0, a], and let h = g − f. We show that for all non-negative integers m there exists a point χ_m satisfying a < χ_m ≤ c such that h^(m)(χ_m) > 0. For m = 0, the point χ₀ = c obeys h(χ₀) = g(χ₀) − f(χ₀) = g(χ₀) > 0 by the definition of g. Now, suppose there is a point χ_m such that a < χ_m ≤ c and h^(m)(χ_m) > 0. The mean value theorem then guarantees the existence of a point χ_{m+1} such that a < χ_{m+1} < χ_m and h^(m+1)(χ_{m+1}) = [h^(m)(χ_m) − h^(m)(a)]/(χ_m − a) = h^(m)(χ_m)/(χ_m − a) > 0, where we have utilized the fact that h^(m)(a) = 0 and the induction hypothesis. Noting that f^(m) = 0 for all m > n, this result implies that g^(m)(χ_m) > 0 for all m > n. Therefore g cannot be completely monotone.

Corollary 1. There does not exist a finite non-negative Borel measure on R^d whose Fourier transform agrees with K(z) on [0, 2].

3.2 Spherical Random Fourier features

From the section above, we see that Bochner's theorem cannot be directly applied to the polynomial kernel. In addition, it is impossible to construct a positive integrable k̂(w) whose inverse Fourier transform K̂(z) equals K(z) exactly on [0, 2]. Despite this result, it is nevertheless possible to find a K̂(z) that is a good approximation of K(z) on [0, 2], which is all that is necessary given that we will be approximating K̂(z) by Monte Carlo integration anyway. We present our method of Spherical Random Fourier (SRF) features in this section.
We recall a characterization of radial functions that are positive definite on R^d for all d, due to Schoenberg.
Theorem 4. 
(Schoenberg [18]) A continuous function f : [0, ∞) → R is positive definite and radial on R^d for all d if and only if it is of the form f(r) = ∫_0^∞ e^{−r²t²} dμ(t), where μ is a finite non-negative Borel measure on [0, ∞).

This characterization motivates an approximation of K(z) as a sum of N Gaussians, K̂(z) = Σ_{i=1}^N cᵢ e^{−σᵢ² z²}. To increase the accuracy of the approximation, we allow the cᵢ to take negative values. Doing so enables its Fourier transform (which is also a sum of Gaussians) to become negative. We circumvent this problem by mapping those negative values to zero,

k̂(w) = max( 0, Σ_{i=1}^N cᵢ (1/(√2 σᵢ))^d e^{−w²/(4σᵢ²)} ),    (4)

and simply defining K̂(z) as its inverse Fourier transform. Owing to the max in eqn. (4), it is not possible to calculate an analytical expression for K̂(z). Thankfully, this isn't necessary, since we can evaluate it numerically by performing a one-dimensional numerical integral,

K̂(z) = ∫_0^∞ dw w k̂(w) (w/z)^{d/2−1} J_{d/2−1}(wz),

which is well-approximated using a fixed-width grid in w and z, and can be computed via a single matrix multiplication. We then optimize the following cost function, which is just the MSE between K(z) and our approximation of it,

L = (1/2) ∫_0^2 dz [K(z) − K̂(z)]²,    (5)

which defines an optimal probability distribution p(w) through eqn. (4) and the relation p(w) = (2π)^{−d/2} k̂(w). We can then follow the Random Fourier Feature method [5] to generate the nonlinear maps. The entire SRF process is summarized in Algorithm 1. Note that for any given set of kernel parameters (a, p, d), p(w) can be pre-computed, independently of the data.

Figure 1: K(z), its approximation K̂(z), and the corresponding pdf p(w) for d = 256, a = 2, for polynomial orders (a) 10 and (b) 20. Higher-order polynomials are approximated better; see eqn. (6).

Algorithm 1 Spherical Random Fourier (SRF) Features
Input: A polynomial kernel K(x, y) = K(z), z = ‖x − y‖₂, ‖x‖₂ = 1, ‖y‖₂ = 1, with bias a ≥ 2, order p ≥ 1, input dimensionality d, and feature dimensionality D.
Output: A randomized feature map Z(·) : R^d → R^D such that ⟨Z(x), Z(y)⟩ ≈ K(x, y).
1. Solve argmin_{K̂} ∫_0^2 dz [K(z) − K̂(z)]² for k̂(w), where K̂(z) is the inverse Fourier transform of k̂(w), whose form is given in eqn. (4). Let p(w) = (2π)^{−d/2} k̂(w).
2. Draw D iid samples w₁, ..., w_D from p(w).
3. Draw D iid samples b₁, ..., b_D from the uniform distribution on [0, 2π].
4. Z(x) = √(2/D) [cos(w₁ᵀx + b₁), ..., cos(w_Dᵀx + b_D)]ᵀ.

4 Approximation error

The total MSE comes from two sources: error in approximating the function, i.e. L from eqn. (5), and error from Monte Carlo sampling. The expected MSE of Monte Carlo converges at a rate of O(1/D), and a bound on the supremum of the absolute error was given in [5]. Therefore, we focus on analyzing the first type of error.
We describe a simple method to obtain an upper bound on L. Consider the function K̂(z) = e^{−pz²/a²}, which is a special case of eqn. (4) obtained by setting N = 1, c₁ = 1, and σ₁ = √(p/a²). The MSE between K(z) and this function thus provides an upper bound on our approximation error,

L = (1/2) ∫_0^2 dz [K̂(z) − K(z)]² ≤ (1/2) ∫_0^a dz [K̂(z) − K(z)]²
  = (1/2) ∫_0^a dz [ exp(−2pz²/a²) + (1 − z²/a²)^{2p} − 2 exp(−pz²/a²)(1 − z²/a²)^p ]
  = (a/4)√(π/(2p)) erf(√(2p)) + (a/4)√π Γ(2p + 1)/Γ(2p + 3/2) − (a/2)√π [Γ(p + 1)/Γ(p + 3/2)] M(1/2, p + 3/2, −p).

(a) p = 3 (b) p = 7 (c) p = 10 (d) p = 20
Figure 2: Comparison of MSE of kernel approximation on different datasets with various polynomial orders (p) and feature map dimensionalities. The first to third rows show results on usps, gisette, and adult, respectively. SRF gives better kernel approximation, especially for large p.

In the first line we have used the fact that the integrand is positive and a ≥ 2. The three terms on the second line are integrated using the standard integral definitions of the error function, beta function, and Kummer's confluent hypergeometric function [19], respectively. To expose the functional dependence of this result more clearly, we perform an expansion for large p. 
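In this N = 1 special case the surrogate spectrum is a single Gaussian, so the sampling step has a closed form: k̂(w) ∝ exp(−w²a²/4p) corresponds to drawing w ~ N(0, (2p/a²) I). A minimal sketch of the resulting features (the dimensions, seed, and test points are illustrative, and the residual includes both the surrogate error and the Monte Carlo error):

```python
import math
import random

random.seed(2)

d, D = 16, 20000            # input / feature dimensionality (illustrative)
a, p = 4.0, 10              # kernel parameters, a >= 2, p >= 1
s = math.sqrt(2.0 * p) / a  # w ~ N(0, (2p/a^2) I): the N = 1 surrogate spectrum

W = [[random.gauss(0.0, s) for _ in range(d)] for _ in range(D)]
b = [random.uniform(0.0, 2.0 * math.pi) for _ in range(D)]

def features(x):
    return [math.sqrt(2.0 / D) * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
            for w, bi in zip(W, b)]

def unit(v):
    n = math.sqrt(sum(vi * vi for vi in v))
    return [vi / n for vi in v]

x = unit([random.gauss(0, 1) for _ in range(d)])
y = unit([random.gauss(0, 1) for _ in range(d)])
z2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
exact = (1.0 - z2 / a ** 2) ** p  # K(x, y) on the unit sphere, eqn. (2)
approx = sum(u * v for u, v in zip(features(x), features(y)))
print(abs(approx - exact))  # surrogate error + O(1/sqrt(D)) Monte Carlo error
```

The full method instead fits N Gaussians and clips the spectrum as in eqn. (4), sampling w from the resulting numerical density.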
We use the asymptotic expansions of the error function and the Gamma function,

erf(z) = 1 − (e^{−z²}/(z√π)) Σ_{k=0}^∞ (−1)^k (2k − 1)!!/(2z²)^k ,

log Γ(z) = z log z − z − (1/2) log(z/2π) + Σ_{k=2}^∞ [B_k/(k(k − 1))] z^{1−k} ,

where B_k are Bernoulli numbers. For the third term, we write the series representation of M(a, b, z),

M(a, b, z) = [Γ(b)/Γ(a)] Σ_{k=0}^∞ [Γ(a + k)/Γ(b + k)] z^k/k! ,

expand each term for large p, and sum the result. All together, we obtain the following bound,

L ≤ (105/4096) √(π/2) a/p^{5/2} ,    (6)

which decays at a rate of O(p^{−2.5}) and becomes negligible for higher-order polynomials. This is remarkable, as the approximation error of previous methods increases as a function of p. Figure 1 shows two kernel functions K(z), their approximations K̂(z), and the corresponding pdfs p(w).

5 Experiments

We compare the SRF method with Random Maclaurin (RM) [14] and Tensor Sketch (TS) [15], the other polynomial kernel approximation approaches. Throughout the experiments, we choose the number of Gaussians, N, to equal 10, though the specific number had a negligible effect on the results. The bias term is set as a = 4. Other choices such as a = 2, 3 yield similar performance; results with a variety of parameter settings can be found in the Supplementary Material. 
The error bars and standard deviations are obtained by conducting experiments 10 times across the entire dataset.

Dataset, order       Method  D = 2^9       D = 2^10      D = 2^11      D = 2^12      D = 2^13      D = 2^14      Exact
usps, p = 3          RM      87.29 ± 0.87  89.11 ± 0.53  90.43 ± 0.49  91.09 ± 0.44  91.48 ± 0.31  91.78 ± 0.32  96.21
                     TS      89.85 ± 0.35  90.99 ± 0.42  91.37 ± 0.19  91.68 ± 0.19  91.85 ± 0.18  91.90 ± 0.23
                     SRF     90.91 ± 0.32  92.08 ± 0.32  92.50 ± 0.48  93.10 ± 0.26  93.31 ± 0.16  93.28 ± 0.24
usps, p = 7          RM      88.86 ± 1.08  91.01 ± 0.44  92.70 ± 0.38  94.03 ± 0.30  94.54 ± 0.30  94.97 ± 0.26  96.51
                     TS      92.30 ± 0.52  93.59 ± 0.20  94.53 ± 0.20  94.84 ± 0.10  95.06 ± 0.23  95.27 ± 0.12
                     SRF     92.44 ± 0.31  93.85 ± 0.32  94.79 ± 0.19  95.06 ± 0.21  95.37 ± 0.12  95.51 ± 0.17
usps, p = 10         RM      88.95 ± 0.60  91.41 ± 0.46  93.27 ± 0.28  94.29 ± 0.34  95.19 ± 0.21  95.53 ± 0.25  96.56
                     TS      92.41 ± 0.48  93.85 ± 0.34  94.75 ± 0.26  95.31 ± 0.28  95.55 ± 0.25  95.91 ± 0.17
                     SRF     92.63 ± 0.46  94.33 ± 0.33  95.18 ± 0.26  95.60 ± 0.27  95.78 ± 0.23  95.85 ± 0.16
usps, p = 20         RM      88.67 ± 0.98  91.09 ± 0.42  93.22 ± 0.39  94.32 ± 0.27  95.24 ± 0.27  95.62 ± 0.24  96.81
                     TS      91.73 ± 0.88  93.92 ± 0.28  94.68 ± 0.28  95.26 ± 0.31  95.90 ± 0.20  96.07 ± 0.19
                     SRF     92.27 ± 0.48  94.30 ± 0.46  95.48 ± 0.39  95.97 ± 0.32  96.18 ± 0.23  96.28 ± 0.15
gisette, p = 3       RM      89.53 ± 1.43  92.77 ± 0.40  94.49 ± 0.48  95.90 ± 0.31  96.69 ± 0.33  97.01 ± 0.26  98.00
                     TS      93.52 ± 0.60  95.28 ± 0.71  96.12 ± 0.36  96.76 ± 0.40  97.06 ± 0.19  97.12 ± 0.27
                     SRF     91.72 ± 0.92  94.39 ± 0.65  95.62 ± 0.47  96.50 ± 0.40  96.91 ± 0.36  97.05 ± 0.19
gisette, p = 7       RM      89.44 ± 1.44  92.77 ± 0.57  95.15 ± 0.60  96.37 ± 0.46  96.90 ± 0.46  97.27 ± 0.22  97.90
                     TS      92.89 ± 0.66  95.29 ± 0.39  96.32 ± 0.47  96.66 ± 0.34  97.16 ± 0.25  97.58 ± 0.25
                     SRF     92.75 ± 1.01  94.85 ± 0.53  96.42 ± 0.49  97.07 ± 0.30  97.50 ± 0.24  97.53 ± 0.15
gisette, p = 10      RM      89.91 ± 0.58  93.16 ± 0.40  94.94 ± 0.72  96.19 ± 0.49  96.88 ± 0.23  97.15 ± 0.40  98.10
                     TS      92.48 ± 0.62  94.61 ± 0.60  95.72 ± 0.53  96.60 ± 0.58  96.99 ± 0.28  97.41 ± 0.20
                     SRF     92.42 ± 0.85  95.10 ± 0.47  96.35 ± 0.42  97.15 ± 0.34  97.57 ± 0.23  97.75 ± 0.14
gisette, p = 20      RM      89.40 ± 0.98  92.46 ± 0.67  94.37 ± 0.55  95.67 ± 0.43  96.14 ± 0.55  96.63 ± 0.40  98.00
                     TS      90.49 ± 1.07  92.88 ± 0.42  94.43 ± 0.69  95.41 ± 0.71  96.24 ± 0.44  96.97 ± 0.28
                     SRF     92.12 ± 0.62  94.22 ± 0.45  95.85 ± 0.54  96.94 ± 0.29  97.47 ± 0.24  97.75 ± 0.32

Table 1: Comparison of classification accuracy (in %) on different datasets for different polynomial orders (p) and varying feature map dimensionality (D). The Exact column refers to the accuracy of the exact polynomial kernel trained with libSVM. More results are given in the Supplementary Material.

Figure 3: Comparison of CRAFT features on the usps dataset with polynomial order p = 10 and feature maps of dimension D = 2^12. (a) Logarithm of the ratio of the ith-leading eigenvalue of the approximate kernel to that of the exact kernel, constructed using 1,000 points. CRAFT features are projected from 2^14-dimensional maps. (b) Mean squared error. (c) Classification accuracy.

Kernel approximation. The main focus of this work is to improve the quality of kernel approximation, which we measure by computing the mean squared error (MSE) between the exact kernel and its approximation across the entire dataset. Figure 2 shows MSE as a function of the dimensionality (D) of the nonlinear maps. SRF provides lower MSE than other methods, especially for higher-order polynomials. This observation is consistent with our theoretical analysis in Section 4. As a corollary, SRF provides more compact maps with the same kernel approximation error. Furthermore, SRF is stable in terms of the MSE, whereas TS and RM have relatively large variance.
Classification with linear SVM. We train linear classifiers with liblinear [3] and evaluate classification accuracy on various datasets, two of which are summarized in Table 1; additional results are available in the Supplementary Material. As expected, accuracy improves with higher-dimensional nonlinear maps and higher-order polynomials. 
It is important to note that better kernel approximation does not necessarily lead to better classification performance, because the original kernel might not be optimal for the task [20, 21]. Nevertheless, we observe that SRF features tend to yield better classification performance in most cases.
Rank-Deficiency. Hamid et al. [16] observe that RM and TS produce nonlinear features that are rank deficient. Their approximation quality can be improved by first mapping the input to a higher-dimensional feature space, and then randomly projecting it to a lower-dimensional space. This method is known as CRAFT. Figure 3(a) shows the logarithm of the ratio of the ith eigenvalue of the various approximate kernel matrices to that of the exact kernel. For a full-rank, accurate approximation, this value should be constant and equal to zero, which is close to the case for SRF. RM and TS deviate from zero significantly, demonstrating their rank-deficiency.
Figures 3(b) and 3(c) show the effect of the CRAFT method on MSE and classification accuracy. CRAFT improves RM and TS, but it has no or even a negative effect on SRF. These observations all indicate that SRF is less rank-deficient than RM and TS.

Figure 4: Computational time to generate the randomized feature map for 1,000 random samples on fixed hardware with p = 3. (a) d = 1,000. (b) d = D.

Figure 5: Doubly stochastic gradient learning curves with RFF and SRF features on ImageNet.

Computational Efficiency. 
Both RM and SRF have computational complexity O(ndD), whereas TS scales as O(np(d + D log D)), where D is the number of nonlinear maps, n is the number of samples, d is the original feature dimension, and p is the polynomial order. Therefore the scalability of TS is better than that of SRF when D is of the same order as d (O(D log D) vs. O(D²)). However, the computational cost of SRF does not depend on p, making SRF more efficient for higher-order polynomials. Moreover, there is little computational overhead involved in the SRF method, which enables it to outperform TS for practical values of D, even though it is asymptotically inferior. As shown in Figure 4(a), even for the low-order case (p = 3), SRF is more efficient than TS for a fixed d = 1000. In Figure 4(b), where d = D, SRF is still more efficient than TS up to D ≲ 4000.
Large-scale Learning. We investigate the scalability of the SRF method on the ImageNet 2012 dataset, which consists of 1.3 million 256 × 256 color images from 1000 classes. We employ the doubly stochastic gradient method of Dai et al. [22], which utilizes two stochastic approximations: one from random training points and the other from random features associated with the kernel. We use the same architecture and parameter settings as [22] (including the fixed convolutional neural network parameters), except we replace the RFF kernel layer with an ℓ2 normalization step and an SRF kernel layer with parameters a = 4 and p = 10. The learning curves in Figure 5 suggest that SRF features may perform better than RFF features on this large-scale dataset. We also evaluate the model with multi-view testing, in which max-voting is performed on 10 transformations of the test set. We obtain a Top-1 test error of 44.4%, which is comparable to the 44.5% error reported in [22]. 
These results demonstrate that the unit-norm restriction does not negatively impact performance in this case, and that polynomial kernels can be successfully scaled to large datasets using the SRF method.
6 Conclusion
We have described a novel technique for generating compact nonlinear features for polynomial kernels applied to data on the unit sphere. It approximates the Fourier transform of the kernel function as the positive projection of an indefinite combination of Gaussians. It achieves more compact maps than previous approaches, especially for higher-order polynomials. SRF features also show less redundancy, leading to lower kernel approximation error, and their performance is more stable than that of previous approaches due to reduced variance. Moreover, the proposed approach extends easily beyond polynomial kernels: the same techniques apply equally well to any shift-invariant radial kernel function, positive definite or not. In the future, we would also like to explore adaptive sampling procedures tuned to the training data distribution in order to further improve kernel approximation accuracy, especially when D is large, i.e., when the Monte Carlo error is low and the kernel approximation error dominates.
Acknowledgments. We thank the anonymous reviewers for their valuable feedback and Bo Xie for facilitating experiments with the doubly stochastic gradient method.

References
[1] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[2] T. Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217–226.
ACM, 2006.

[3] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.

[4] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[5] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

[6] Salomon Bochner. Harmonic Analysis and the Theory of Probability. Dover Publications, 1955.

[7] Subhransu Maji and Alexander C. Berg. Max-margin additive classifiers for detection. In International Conference on Computer Vision, pages 40–47. IEEE, 2009.

[8] V. Sreekanth, Andrea Vedaldi, Andrew Zisserman, and C. Jawahar. Generalized RBF feature maps for efficient detection. In British Machine Vision Conference, 2010.

[9] Fuxin Li, Catalin Ionescu, and Cristian Sminchisescu. Random Fourier approximations for skewed multiplicative histogram kernels. In Pattern Recognition, pages 262–271. Springer, 2010.

[10] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.

[11] Jiyan Yang, Vikas Sindhwani, Quanfu Fan, Haim Avron, and Michael Mahoney. Random Laplace feature maps for semigroup kernels on histograms. In Computer Vision and Pattern Recognition (CVPR), pages 971–978. IEEE, 2014.

[12] Hideki Isozaki and Hideto Kazawa. Efficient support vector classifiers for named entity recognition. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1–7. Association for Computational Linguistics, 2002.

[13] Kwang In Kim, Keechul Jung, and Hang Joon Kim.
Face recognition using kernel principal component analysis. IEEE Signal Processing Letters, 9(2):40–42, 2002.

[14] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In International Conference on Artificial Intelligence and Statistics, pages 583–591, 2012.

[15] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247. ACM, 2013.

[16] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 19–27, 2014.

[17] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In Learning Theory, pages 635–649. Springer, 2006.

[18] Isaac J. Schoenberg. Metric spaces and completely monotone functions. Annals of Mathematics, pages 811–841, 1938.

[19] E. E. Kummer. De integralibus quibusdam definitis et seriebus infinitis. Journal für die reine und angewandte Mathematik, 17:228–242, 1837.

[20] Felix X. Yu, Sanjiv Kumar, Henry Rowley, and Shih-Fu Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.

[21] Dmitry Storcheus, Mehryar Mohri, and Afshin Rostamizadeh. Foundations of coupled nonlinear dimensionality reduction. arXiv preprint arXiv:1509.08880, 2015.

[22] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients.
In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.