{"title": "Quadrature-based features for kernel approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 9147, "page_last": 9156, "abstract": "We consider the problem of improving kernel approximation via randomized feature maps. These maps arise as Monte Carlo approximation to integral representations of kernel functions and scale up kernel methods for larger datasets. Based on an efficient numerical integration technique, we propose a unifying approach that reinterprets the previous random features methods and extends to better estimates of the kernel approximation. We derive the convergence behavior and conduct an extensive empirical study that supports our hypothesis.", "full_text": "Quadrature-based features for kernel approximation\n\nMarina Munkhoeva\u2020\n\nYermek Kapushev\u2020\n\nEvgeny Burnaev\u2020\n\nIvan Oseledets\u2020,\u2021\n\n\u2020Skolkovo Institute of Science and Technology\n\nMoscow, Russia\n\n\u2021Institute of Numerical Mathematics of the Russian Academy of Sciences\n\nMoscow, Russia\n\nAbstract\n\nWe consider the problem of improving kernel approximation via randomized\nfeature maps. These maps arise as Monte Carlo approximation to integral\nrepresentations of kernel functions and scale up kernel methods for larger datasets.\nBased on an ef\ufb01cient numerical integration technique, we propose a unifying\napproach that reinterprets the previous random features methods and extends to\nbetter estimates of the kernel approximation. We derive the convergence behaviour\nand conduct an extensive empirical study that supports our hypothesis1.\n\n1\n\nIntroduction\n\nKernel methods proved to be an ef\ufb01cient technique in numerous real-world problems. 
The core idea of kernel methods is the kernel trick: compute an inner product in a high-dimensional (or even infinite-dimensional) feature space by means of a kernel function k:

k(x, y) = ⟨ψ(x), ψ(y)⟩,   (1)

where ψ : X → F is a non-linear feature map transporting elements of the input space X into a feature space F. It is common knowledge that kernel methods incur space and time complexity that is infeasible for direct use with large-scale datasets. For example, kernel regression has O(N³ + Nd²) training time, O(N²) memory and O(Nd) prediction time complexity for N data points in the original d-dimensional space X.

One of the most successful techniques to handle this problem, Random Fourier Features (RFF), proposed by [29], introduces a low-dimensional randomized approximation to feature maps:

k(x, y) ≈ Ψ̂(x)⊤Ψ̂(y).   (2)

This is essentially carried out by using Monte Carlo sampling to approximate the scalar product in (1). A randomized D-dimensional mapping Ψ̂(·) applied to the original data input allows employing standard linear methods, i.e. reverting the kernel trick. In doing so one reduces the complexity to that of linear methods, e.g. a D-dimensional approximation admits O(ND²) training time, O(ND) memory and O(N) prediction time.

It is well known that as D → ∞, the inner product in (2) converges to the exact kernel k(x, y). Recent research [35; 14; 9] aims to improve the convergence of the approximation so that a smaller D can be used to obtain the same quality of approximation.

This paper considers kernels that admit the following integral representation

k(x, y) = E_{p(w)} f_xy(w) = I(f_xy),   f_xy = φ(w⊤x)φ(w⊤y).   (3)

¹The code for this paper is available at https://github.com/maremun/quffka.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

For example, the popular Gaussian kernel admits such a representation with

p(w) = (2π)^{−d/2} e^{−‖w‖²/2},   f_xy(w) = φ(w⊤x)⊤φ(w⊤y),   where φ(·) = [cos(·) sin(·)]⊤.

The class of kernels admitting the form in (3) covers shift-invariant kernels (e.g. radial basis function (RBF) kernels) and Pointwise Nonlinear Gaussian (PNG) kernels. They are widely used in practice and have interesting connections with neural networks [8; 34].

The main challenge in the construction of low-dimensional feature maps is the approximation of the expectation in (3), which is a d-dimensional integral with Gaussian weight. Unlike other research studies we refrain from using a simple Monte Carlo estimate of the integral; instead, we propose to use specific quadrature rules. We now list our contributions:

• We propose to use spherical-radial quadrature rules to improve kernel approximation accuracy. We show that these quadrature rules generalize the RFF-based techniques.
We also provide an analytical estimate of the error for the used quadrature rules that implies better approximation quality.

• We use structured orthogonal matrices (so-called butterfly matrices) when designing the quadrature rule; they allow fast matrix-vector multiplication. As a result, we speed up the approximation of the kernel function and reduce memory requirements.

• We carry out an extensive empirical study comparing our methods with the state-of-the-art ones on a set of different kernels, in terms of both kernel approximation error and downstream task performance. The study supports our hypothesis on the superior accuracy of the method.

2 Quadrature Rules and Random Features

We start by rewriting the expectation in Equation (3) as an integral of f_xy with respect to p(w):

I(f_xy) = (2π)^{−d/2} ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} e^{−w⊤w/2} f_xy(w) dw.

Integration can be performed by means of quadrature rules. The rules usually take the form of an interpolating function that is easy to integrate. Given such a rule, one may sample points from the domain of integration and calculate the value of the rule at these points. The sample average of the rule values then yields the approximation of the integral.

The connection between integral approximation and the mapping ψ is straightforward. In what follows we show a brief derivation of the quadrature rules that allow for an explicit mapping of the form ψ(x) = [ a₀φ(0) a₁φ(w₁⊤x) . . . a_Dφ(w_D⊤x) ], where the choice of the weights aᵢ and the points wᵢ is dictated by the quadrature.

We use the average of sampled quadrature rules developed by [18] to yield unbiased estimates of I(f_xy). A change of coordinates is the first step to facilitate stochastic spherical-radial rules.
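As a concrete reference point for what follows, the plain Monte Carlo estimate of I(f_xy) — the RFF-style baseline that the rules below improve upon — can be sketched in a few lines of NumPy. This is a toy illustration assuming a unit-bandwidth Gaussian kernel and arbitrary sample counts, not the released quffka code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x, y = rng.normal(size=d), rng.normal(size=d)

# Exact Gaussian (RBF) kernel with unit bandwidth: k(x, y) = exp(-||x - y||^2 / 2).
exact = np.exp(-0.5 * np.sum((x - y) ** 2))

# Plain Monte Carlo estimate of I(f_xy): draw w ~ N(0, I_d) and average
# f_xy(w) = phi(w^T x)^T phi(w^T y) with phi(t) = [cos(t), sin(t)]^T.
D = 200_000
W = rng.normal(size=(D, d))
fxy = np.cos(W @ x) * np.cos(W @ y) + np.sin(W @ x) * np.sin(W @ y)
estimate = fxy.mean()  # equals the average of cos(w^T (x - y))

print(exact, estimate)  # the two agree up to O(1/sqrt(D)) Monte Carlo noise
```

The slow O(1/√D) decay of this error is precisely what motivates replacing the naive sample average with a quadrature rule over the same Gaussian-weighted integral.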
Now, let w = rz, with z⊤z = 1, so that w⊤w = r² for r ∈ [0, ∞), leaving us with (to ease the notation we substitute f_xy with f)

I(f) = (2π)^{−d/2} ∫_{U_d} ∫_0^∞ e^{−r²/2} r^{d−1} f(rz) dr dz = (2π)^{−d/2}/2 ∫_{U_d} ∫_{−∞}^{∞} e^{−r²/2} |r|^{d−1} f(rz) dr dz.   (4)

I(f) is now a double integral over the unit d-sphere U_d = {z : z⊤z = 1, z ∈ R^d} and over the radius. To account for both integration regions we apply a combination of spherical (S) and radial (R) rules known as spherical-radial (SR) rules. To provide an intuition of how the rules work, we briefly state and discuss their form².

²Please see [18] for a detailed derivation of the stochastic radial (Section 2), spherical (Section 3) and spherical-radial (Section 4) rules.

Stochastic radial rules of degree 2l + 1 have the form of the weighted symmetric sums R(h) = Σ_{i=0}^{l} ŵᵢ [h(ρᵢ) + h(−ρᵢ)]/2 and approximate the infinite-range integral T(h) = ∫_{−∞}^{∞} e^{−r²/2} |r|^{d−1} h(r) dr. Note
Remember that the outer integral in (4) has Ud as its integration region.\n\nStochastic spherical rules SQ(s) =\n\nand gives unbiased estimate for other functions.\nStochastic spherical-radial rules SR of degree (2l + 1, p) are given by the following expression\n\np(cid:80)\n\nj=1\n\np(cid:88)\n\n(cid:101)wj\n\nl(cid:88)\n\nj=1\n\ni=1\n\nSR(2l+2,p)\n\nQ,\u03c1\n\n=\n\n\u02c6wi\n\nf (\u03c1Qzi) + f (\u2212\u03c1Qzi)\n\n2\n\n,\n\nwhere the distributions of weights are such that if degrees of radial rules and spherical rules coincide,\ni.e. 2l + 1 = p, then the rule is exact for polynomials of degree 2l + 1 and gives unbiased estimate of\nthe integral for other functions.\n\n2.1 Spherical-radial rules of degree (1, 1) is RFF\n\n2\n\nQ,\u03c1 = f (\u03c1Qz)+f (\u2212\u03c1Qz)\n\nIf we take radial rule of degree 1 and spherical rule of degree 1, we obtain the following rule\nSR(1,1)\n, where \u03c1 \u223c \u03c7(d). It is easy to see that \u03c1Qz \u223c N (0, I), and for shift\ninvariant kernel f (w) = f (\u2212w), thus, the rule reduces to SR(1,1)\nQ,\u03c1 = f (w), where w \u223c N (0, I).\nNow, RFF [29] makes approximation of the RBF kernel in exactly the same way: it generates random\nvector from Gaussian distribution and calculates the corresponding feature map.\nProposition 2.1. Random Fourier Features for RBF kernel are SR rules of degree (1, 1).\n\n2.2 Spherical-radial rules of degree (1, 3) is ORF\n\nQ,\u03c1 =(cid:80)d\n\nlet\u2019s take radial rule of degree 1 and spherical rule of degree 3.\n\nNow,\nwe get the following spherical-radial rule SR1,3\nei = (0, . . . , 0, 1, 0, . . . , 0)(cid:62) is an i-th column of the identity matrix.\nLet us compare SR1,3 rules with Orthogonal Random Features [14] for the RBF kernel. In the\nORF approach, the weight matrix W = SQ is generated, where S is a diagonal matrix with the\nentries drawn independently from \u03c7(d) distribution and Q is a random orthogonal matrix. 
The approximation of the kernel is then given by k_ORF(x, y) = Σ_{i=1}^{d} f(wᵢ), where wᵢ is the i-th row of the matrix W. As the rows of Q are orthonormal, they can be represented as Qeᵢ.

Proposition 2.2. Orthogonal Random Features for the RBF kernel are SR rules of degree (1, 3).

2.3 Spherical-radial rules of degree (3, 3)

We go further and take both spherical and radial rules of degree 3, where we use the original and reflected vertices vⱼ of a randomly rotated unit-vertex regular d-simplex V as the points on the unit sphere:

SR^{3,3}_{Q,ρ}(f) = (1 − d/ρ²) f(0) + d/(d + 1) Σ_{j=1}^{d+1} [f(−ρQvⱼ) + f(ρQvⱼ)]/(2ρ²),   (5)

where ρ ∼ χ(d + 2). We apply (5) to the approximation of (4) by averaging the samples of SR^{3,3}_{Q,ρ}:

I(f) = E_{Q,ρ}[SR^{3,3}_{Q,ρ}(f)] ≈ Î(f) = (1/n) Σ_{i=1}^{n} SR^{3,3}_{Qᵢ,ρᵢ}(f),   (6)

where n is the number of sampled SR rules. Speaking in terms of approximate feature maps, the new feature dimension D in the case of the quadrature-based approximation equals 2n(d + 1) + 1, as we sample n rules and evaluate each of them at 2(d + 1) random points and 1 zero point.

In this work we propose to modify the quadrature rule by generating ρⱼ ∼ χ(d + 2) for each vⱼ, i.e.

SR^{3,3}_{Q,ρ}(f) = (1 − Σ_{j=1}^{d+1} d/((d + 1)ρⱼ²)) f(0) + d/(d + 1) Σ_{j=1}^{d+1} [f(−ρⱼQvⱼ) + f(ρⱼQvⱼ)]/(2ρⱼ²).

It doesn't affect the quality of approximation, while it simplifies the analysis of the quadrature-based random features.

Explicit mapping. We finally arrive at the map ψ(x) = [ a₀φ(0) a₁φ(w₁⊤x) . . . a_Dφ(w_D⊤x) ], where a₀ = √(1 − Σ_{j=1}^{d+1} d/((d + 1)ρⱼ²))³ and aⱼ = (1/ρⱼ)√(d/(2(d + 1))), and wⱼ is the j-th row of the matrix

W = ρ ⊗ [ (QV)⊤  −(QV)⊤ ]⊤,   ρ = [ρ₁ . . . ρ_D]⊤.

To get D features one simply stacks n = D/(2(d + 1) + 1) such matrices W_k = ρ_k [ (Q_kV)⊤  −(Q_kV)⊤ ]⊤ so that W ∈ R^{D×d}, where only Q_k ∈ R^{d×d} and ρ_k are generated randomly (k = 1, . . . , n). For the Gaussian kernel, φ(·) = [cos(·) sin(·)]⊤. For the 0-order arc-cosine kernel, φ(·) = Θ(·), where Θ(·) is the Heaviside function. For the 1-order arc-cosine kernel, φ(·) = max(0, ·).

2.4 Generating uniformly random orthogonal matrices

The SR rules require a random orthogonal matrix Q. If Q follows the Haar distribution, the averaged samples of SR^{3,3}_{Q,ρ} rules provide an unbiased estimate for (4). Essentially, the Haar distribution means that all orthogonal matrices in the group are equiprobable, i.e. uniformly random. Methods for sampling such matrices vary in their complexity of generation and multiplication.

We test two algorithms for obtaining Q. The first uses a QR decomposition of a random matrix to obtain a product of a sequence of reflectors/rotators Q = H₁ . . . H_{n−1}D, where Hᵢ is a random Householder/Givens matrix and the diagonal matrix D has entries such that P(dᵢᵢ = ±1) = 1/2. It admits no fast matrix multiplication. We tested both such methods for random orthogonal matrix generation and, since their performance coincides, we leave this one out for cleaner figures in the Experiments section.

The other choice for Q are the so-called butterfly matrices [17].
For d = 4,

B(4) =
⎡c₁ −s₁  0   0⎤ ⎡c₂  0 −s₂   0⎤   ⎡c₁c₂ −s₁c₂ −c₁s₂  s₁s₂⎤
⎢s₁  c₁  0   0⎥ ⎢ 0 c₂   0 −s₂⎥ = ⎢s₁c₂  c₁c₂ −s₁s₂ −c₁s₂⎥
⎢ 0   0 c₃ −s₃⎥ ⎢s₂  0  c₂   0⎥   ⎢c₃s₂ −s₃s₂  c₃c₂ −s₃c₂⎥
⎣ 0   0 s₃  c₃⎦ ⎣ 0 s₂   0  c₂⎦   ⎣s₃s₂  c₃s₂  s₃c₂  c₃c₂⎦,

where sᵢ, cᵢ are the sine and cosine of some angle θᵢ, i = 1, . . . , d − 1. For the definition and discussion please see the Supplementary Materials. The factors of B(d) are structured and allow fast matrix multiplication. The method using butterfly matrices is denoted by B in the Experiments section.

3 Error bounds

Proposition 3.1. Let l be the diameter of the compact set X and let p(w) = N(0, σₚ²I) be the probability density corresponding to the kernel. Suppose that |φ(w⊤x)| ≤ κ and |φ′(w⊤x)| ≤ μ for all w ∈ Ω, x ∈ X, and that |(1 − f_xy(ρz))/ρ²| ≤ M for all ρ ∈ [0, ∞), where z⊤z = 1. Then for the Quadrature-based Features approximation k̂(x, y) of the kernel function k(x, y) and any ε > 0 it holds that

P( sup_{x,y∈X} |k̂(x, y) − k(x, y)| ≥ ε ) ≤ β_d (σₚlκμ/ε)^{2d/(d+1)} exp( −Dε²/(8M²(d + 1)) ),

where β_d = 2^{(6d+1)/(d+1)} d^{1/(d+1)}. Thus we can construct an approximation with error no more than ε with probability at least 1 − δ as long as

D ≥ (8M²(d + 1)/ε²) (1 + 1/d) [ (2d/(d + 1)) log(σₚlκμ/ε) + log(β_d/δ) ].

³To get a₀² ≥ 0, one needs to sample ρⱼ two times on average (see Supplementary Materials for details).

Table 1: Space and time complexity.
Method             Time         Space
Quadrature-based   O(d log d)   O(d)
ORF                O(Dd)        O(Dd)
QMC                O(Dd)        O(Dd)
ROM                O(d log d)   O(d)

Table 2: Experimental settings for the datasets.
Dataset      N       d      #samples   #runs
Powerplant   9568    4      550        500
LETTER       20000   16     550        500
USPS         9298    256    550        500
MNIST        70000   784    550        100
CIFAR100     60000   3072   50         50
LEUKEMIA     72      7129   10         10

The proof of this proposition closely follows [33]; details can be found in the Supplementary Materials.

The term β_d depends on the dimension d; its maximum is β₈₆ ≈ 64.7 < 65, and lim_{d→∞} β_d = 64, though it is lower for small d.
Let us compare this probability bound with the similar result for RFF in [33]. Under the same conditions, the required number of samples to achieve error no more than ε with probability at least 1 − δ for RFF is the following:

D ≥ (8(d + 1)/ε²) (1 + 1/d) [ (2d/(d + 1)) log(σₚl/ε) + log(β_d/δ) ].

For Quadrature-based Features for the RBF kernel M = 1/2, κ = μ = 1; therefore, we obtain

D ≥ (2(d + 1)/ε²) (1 + 1/d) [ (2d/(d + 1)) log(σₚl/ε) + log(β_d/δ) ].

The asymptotics are the same; however, the constants are smaller for our approach. See Section 4 for an empirical justification of the obtained result.

Proposition 3.2 ([33]). Given a training set {(xᵢ, yᵢ)}ⁿᵢ₌₁, with xᵢ ∈ R^d and yᵢ ∈ R, let h(x) denote the result of kernel ridge regression using the positive semi-definite training kernel matrix K, test kernel values kₓ and regularization parameter λ. Let ĥ(x) be the same using a PSD approximation to the training kernel matrix K̂ and test kernel values k̂ₓ. Further, assume that the training labels are centered, Σⁿᵢ₌₁ yᵢ = 0, and let σ_y² = (1/n) Σⁿᵢ₌₁ yᵢ². Also suppose ‖kₓ‖∞ ≤ κ. Then

|ĥ(x) − h(x)| ≤ (σ_y√n/λ) ‖k̂ₓ − kₓ‖₂ + (κσ_yn/λ²) ‖K̂ − K‖₂.

Suppose that sup |k(x, x′) − k̂(x, x′)| ≤ ε for all x, x′ ∈ R^d. Then ‖k̂ₓ − kₓ‖₂ ≤ √n ε and ‖K̂ − K‖₂ ≤ ‖K̂ − K‖_F ≤ nε. By denoting λ = nλ₀ we obtain |ĥ(x) − h(x)| ≤ ((λ₀ + 1)/λ₀²) σ_y ε. Therefore,

P( |ĥ(x) − h(x)| ≥ ε ) ≤ P( ‖k̂(x, x′) − k(x, x′)‖∞ ≥ λ₀²ε/(σ_y(λ₀ + 1)) ).

So, for the quadrature rules we can guarantee |ĥ(x) − h(x)| ≤ ε with probability at least 1 − δ as long as

D ≥ 8M²(d + 1) σ_y² ((λ₀ + 1)/(λ₀²ε))² (1 + 1/d) [ (2d/(d + 1)) log( σ_yσₚlκμ(λ₀ + 1)/(λ₀²ε) ) + log(β_d/δ) ].

4 Experiments

We extensively study the proposed method on several established benchmarking datasets: Powerplant, LETTER, USPS, MNIST, CIFAR100 [23], LEUKEMIA [20]. In Section 4.2 we show kernel approximation error across different kernels and numbers of features. We also report the quality of SVM models with approximate kernels on the same datasets in Section 4.3.

Figure 1: Kernel approximation error across three kernels and 6 datasets. Lower is better. The x-axis represents the factor to which we extend the original feature space, n = D/(2(d + 1) + 1), where d is the dimensionality of the original feature space and D is the dimensionality of the new feature space.

4.1 Methods

We present a comparison of our method (B) with estimators based on simple Monte Carlo, quasi-Monte Carlo [35] and Gaussian quadratures [11]. The Monte Carlo approach has a variety of ways to generate samples: unstructured Gaussian [29], structured Gaussian [14], random orthogonal matrices (ROM) [10].

Monte Carlo integration (G, Gort, ROM). The kernel is estimated as k̂(x, y) = (1/D) φ(Mx)⊤φ(My), where M ∈ R^{D×d} is a random weight matrix. For the unstructured Gaussian based approximation M = G, where Gᵢⱼ ∼ N(0, 1).
Structured Gaussian has M = Gort, where Gort = DQ, Q is obtained from the QR decomposition of G, and D is a diagonal matrix with diagonal elements sampled from the χ(d) distribution. In compliance with the previous work on ROM we use S-Rademacher with three blocks: M = √d Π_{i=1}^{3} SDᵢ, where S is a normalized Hadamard matrix and P(Dᵢᵢ = ±1) = 1/2.

Quasi-Monte Carlo integration (QMC). Quasi-Monte Carlo integration boasts an improved convergence rate of 1/D compared to 1/√D for Monte Carlo; however, as the empirical results illustrate, its performance is poorer than that of orthogonal random features [14]. It has a larger constant factor hidden under the O notation in computational complexity. For QMC the weight matrix M is generated as a transformation of quasi-random sequences. We run our experiments with Halton sequences in compliance with the previous work.

Gaussian quadratures (GQ). We included the subsampled dense grid method from [11] in our comparison, as it is the only data-independent approach from that paper that is shown to work well. We reimplemented the code for the paper to the best of our knowledge, as it is not open-sourced.

4.2 Kernel approximation

To measure kernel approximation quality we use the relative error in Frobenius norm ‖K − K̂‖_F/‖K‖_F, where K and K̂ denote the exact kernel matrix and its approximation. In line with previous work we run experiments for the kernel approximation on a random subset of a dataset. Table 2 displays the settings for the experiments across the datasets.

Approximation was constructed for different numbers of SR samples n = D/(2(d + 1) + 1), where d is the original feature space dimensionality and D is the new one.
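The relative Frobenius-norm error above is straightforward to reproduce for the unstructured Gaussian baseline G; a minimal sketch, assuming a Gaussian kernel with bandwidth σ² = d and toy random data in place of the benchmark datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, D = 100, 8, 2048
sigma2 = float(d)                      # assumed bandwidth; gamma = 1/(2*sigma2)
X = rng.normal(size=(n, d))

# Exact Gaussian kernel matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma2)).
sq = np.sum(X ** 2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma2))

# Unstructured Gaussian baseline (G): M has i.i.d. N(0, 1/sigma2) entries,
# phi(.) = [cos(.) sin(.)], and K_hat = Z Z^T for the feature matrix Z.
M = rng.normal(scale=1 / np.sqrt(sigma2), size=(D, d))
P = X @ M.T                            # projections, shape (n, D)
Z = np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(D)
K_hat = Z @ Z.T                        # diagonal is exactly 1, off-diagonal is noisy

rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(rel_err)                         # a few percent at this D; decays as O(1/sqrt(D))
```

Swapping M for a structured matrix (Gort, ROM, or the butterfly-based W) changes only the weight-generation line, which is what makes this metric a clean, like-for-like comparison across all the methods.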
For the Gaussian kernel we set the hyperparameter γ = 1/(2σ²) to the default value of 1/d for all the approximants, while the arc-cosine kernels (see the definition of the arc-cosine kernel in the Supplementary Materials) have no hyperparameters.

We run experiments for each [kernel, dataset, n] tuple and plot the 95% confidence interval around the mean value line. Figure 1 shows the results for kernel approximation error on the LETTER, MNIST, CIFAR100 and LEUKEMIA datasets.

The QMC method almost always coincides with RFF, except for the arc-cosine 0 kernel. It particularly enjoys the Powerplant dataset with d = 4, i.e. a small number of features. A possible explanation for such behaviour is the connection with QMC quadratures. The worst-case error for QMC quadratures scales as n⁻¹(log n)^d, where d is the dimensionality and n is the number of sample points [28]. It is worth mentioning that for large d it is also a problem to construct a proper QMC point set. Thus, in higher dimensions QMC may bring little practical advantage over MC.
While recent randomized QMC techniques indeed in some cases have no dependence on d, our approach is still computationally more efficient thanks to the structured matrices. The GQ method likewise matches the performance of RFF. We omit both QMC and GQ from the experiments on the datasets with large d = [3072, 7129] (CIFAR100, LEUKEMIA).

The empirical results in Figure 1 support our hypothesis about the advantages of SR quadratures applied to kernel approximation compared to the state-of-the-art methods. With the exception of a couple of cases, (Arc-cosine 0, Powerplant) and (Gaussian, USPS), our method clearly outperforms the alternatives.

4.3 Classification/regression with new features

Figure 2: Accuracy/R² score using embeddings with three kernels on 3 datasets. Higher is better. The x-axis represents the factor to which we extend the original feature space, n = D/(2(d + 1) + 1).

We report accuracy and R² scores for the classification/regression tasks on some of the datasets (Figure 2). We examine the performance with the same setting as in the experiments for kernel approximation error, except that now we map the whole dataset. We use Support Vector Machines to obtain predictions.

Kernel approximation error does not fully define the final prediction accuracy – the best performing kernel matrix approximant does not necessarily yield the best accuracy or R² score. However, the empirical results illustrate that our method delivers comparable and often superior quality on the downstream tasks.

4.4 Walltime experiment

We measure the time spent on explicit mapping of features by running each experiment 50 times and averaging the measurements.
Indeed, Figure 3 demonstrates that the method scales as theoretically predicted with larger dimensions, thanks to the structured nature of the mapping.

Figure 3: Time spent on explicit mapping. The x-axis represents the 5 datasets with increasing input number of features: LETTER, USPS, MNIST, CIFAR100 and LEUKEMIA.

5 Related work

The most popular methods for scaling up kernel methods are based on a low-rank approximation of the kernel using either data-dependent or data-independent basis functions. The first group includes the Nyström method [12], greedy basis selection techniques [31], and incomplete Cholesky decomposition [15]. The construction of basis functions in these techniques utilizes the given training set, making them more attractive for some problems compared to the Random Fourier Features approach. In general, data-dependent approaches perform better than data-independent approaches when there is a gap in the eigen-spectrum of the kernel matrix. A rigorous study of the generalization performance of both approaches can be found in [36].

In data-independent techniques, the kernel function is approximated directly. Most of the methods (including the proposed approach) that follow this idea are based on Random Fourier Features [29]. They require a so-called weight matrix that can be generated in a number of ways. [24] form the weight matrix as a product of structured matrices. It enables fast computation of matrix-vector products and speeds up the generation of random features.

Another work [14] orthogonalizes the features by means of an orthogonal weight matrix.
This leads to less correlated and more informative features, increasing the quality of approximation. They support this result both analytically and empirically. The authors also introduce matrices with a special structure for fast computations. [10] propose a generalization of the ideas from [24] and [14], delivering an analytical estimate for the mean squared error (MSE) of the approximation.

All these works use simple Monte Carlo sampling. However, the convergence can be improved by changing Monte Carlo sampling to quasi-Monte Carlo sampling. Following this idea, [35] apply quasi-Monte Carlo to Random Fourier Features. In [37] the authors attempt to improve the quality of the Random Fourier Features approximation by optimizing sequences conditioned on a given dataset.

Among the recent papers there are works that, similar to our approach, use numerical integration methods to approximate kernels. While [3] carefully inspects the connection between random features and quadratures, they did not provide any practically useful explicit mappings for kernels. Leveraging this connection, [11] propose several methods with Gaussian quadratures. Among them, three schemes are data-independent and one is data-dependent. The authors do not compare them with approaches for random feature generation other than Random Fourier Features. The data-dependent scheme optimizes the weights for the quadrature points to yield better performance. A closely related work [25] constructs features for kernel approximation by approximating the spherical-radial integral and designs QMC points to speed up the approximation and reduce memory.

6 Conclusion

We propose an approach to the random features methods for kernel approximation, revealing a new interpretation of RFF and ORF. The latter are special cases of the spherical-radial quadrature rules with degrees (1, 1) and (1, 3) respectively.
We take this further and develop a more accurate\ntechnique for the random features preserving the time and space complexity of the random orthogonal\nembeddings.\n\n8\n\n01000200030004000500060007000d,datasetinputdimension10\u2212410\u2212310\u2212210\u22121Time,sExplicitmappingtimeGGortGQB\fOur experimental study con\ufb01rms that for many kernels on the most datasets the proposed approach\ndelivers the best kernel approximation. Additionally, the results showed that the quality of the\ndownstream task (classi\ufb01cation/regression) is also superior or comparable to the state-of-the-art\nbaselines.\n\nAcknowledgments\n\nThis work was supported by the Ministry of Science and Education of Russian Federation as a part of\nMega Grant Research Project 14.756.31.0001.\n\nReferences\n[1] Theodore W Anderson, Ingram Olkin, and Les G Underhill. Generation of random orthogonal matrices.\n\nSIAM Journal on Scienti\ufb01c and Statistical Computing, 8(4):625\u2013629, 1987.\n\n[2] Haim Avron and Vikas Sindhwani. High-performance kernel machines with implicit distributed\n\noptimization and randomization. Technometrics, 58(3):341\u2013349, 2016.\n\n[3] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal\n\nof Machine Learning Research, 18(21):1\u201338, 2017. 8\n\n[4] John A Baker. Integration over spheres and the divergence theorem for balls. The American Mathematical\n\nMonthly, 104(1):36\u201347, 1997.\n\n[5] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter\noptimization in hundreds of dimensions for vision architectures. In International Conference on Machine\nLearning, pages 115\u2013123, 2013.\n\n[6] Salomon Bochner. Monotone funktionen, stieltjessche integrale und harmonische analyse. Mathematische\n\nAnnalen, 108(1):378\u2013410, 1933.\n\n[7] Xixian Chen, Haiqin Yang, Irwin King, and Michael R Lyu. Training-ef\ufb01cient feature map for shift-invariant\n\nkernels. 
In IJCAI, pages 3395–3401, 2015.

[8] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.

[9] Krzysztof Choromanski and Vikas Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. arXiv preprint arXiv:1605.09049, 2016.

[10] Krzysztof Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effectiveness of random orthogonal embeddings. arXiv preprint arXiv:1703.00864, 2017.

[11] Tri Dao, Christopher M De Sa, and Christopher Ré. Gaussian quadrature for kernel features. In Advances in Neural Information Processing Systems, pages 6109–6119, 2017.

[12] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.

[13] Kai-Tai Fang and Run-Ze Li. Some methods for generating both an NT-net and the uniform distribution on a Stiefel manifold and their applications. Computational Statistics & Data Analysis, 24(1):29–46, 1997.

[14] Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.

[15] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243–264, 2001.

[16] Alexander Forrester, Andy Keane, et al. Engineering design via surrogate modelling: a practical guide. John Wiley & Sons, 2008.

[17] Alan Genz. Methods for generating random orthogonal matrices. Monte Carlo and Quasi-Monte Carlo Methods, pages 199–213, 1998.

[18] Alan Genz and John Monahan. Stochastic integration rules for infinite regions.
SIAM Journal on Scientific Computing, 19(2):426–439, 1998.

[19] Alan Genz and John Monahan. A stochastic algorithm for high-dimensional integrals over unbounded regions with Gaussian weight. Journal of Computational and Applied Mathematics, 112(1):71–81, 1999.

[20] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine Huard, Michelle Gaasenbeek, Jill P Mesirov, Hilary Coller, Mignon L Loh, James R Downing, Mark A Caligiuri, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.

[21] Simon Haykin. Cognitive dynamic systems: perception-action cycle, radar and radio. Cambridge University Press, 2012.

[22] Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 205–209. IEEE, 2014.

[23] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[24] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, 2013.

[25] Yueming Lyu. Spherical structured feature maps for kernel approximation. In International Conference on Machine Learning, pages 2256–2264, 2017.

[26] Francesco Mezzadri. How to generate random matrices from the classical compact groups. arXiv preprint math-ph/0609050, 2006.

[27] John Monahan and Alan Genz. Spherical-radial integration rules for Bayesian computation. Journal of the American Statistical Association, 92(438):664–674, 1997.

[28] Art B Owen. Latin supercube sampling for very high-dimensional simulations. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):71–102, 1998.
[29] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[30] Walter Rudin. Fourier analysis on groups. Courier Dover Publications, 2017.

[31] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.

[32] G. W. Stewart. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis, 17(3):403–409, 1980. ISSN 00361429. URL http://www.jstor.org/stable/2156882.

[33] Dougal J Sutherland and Jeff Schneider. On the error of random Fourier features. arXiv preprint arXiv:1506.02785, 2015.

[34] Christopher KI Williams. Computing with infinite networks. In Advances in Neural Information Processing Systems, pages 295–301, 1997.

[35] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of The 31st International Conference on Machine Learning (ICML-14), pages 485–493, 2014.

[36] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pages 476–484, 2012.

[37] Felix X Yu, Sanjiv Kumar, Henry Rowley, and Shih-Fu Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.