{"title": "Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 9475, "page_last": 9486, "abstract": "We revisit the classic randomized sketch of a tensor product of $q$ vectors $x_i\\in\\mathbb{R}^n$. The $i$-th coordinate $(Sx)_i$ of the sketch is equal to\n$\\prod_{j = 1}^q \\langle u^{i, j}, x^j \\rangle / \\sqrt{m}$, where $u^{i,j}$ are independent random sign vectors. Kar and Karnick (JMLR, 2012) show that\nif the sketching dimension $m = \\Omega(\\epsilon^{-2} C_{\\Omega}^2 \\log (1/\\delta))$, where $C_{\\Omega}$ is a certain property of the point set $\\Omega$ one wants to sketch, then with probability $1-\\delta$, $\\|Sx\\|_2 = (1\\pm \\epsilon)\\|x\\|_2$ for all $x\\in\\Omega$. However, in their analysis $C_{\\Omega}^2$ can be as large as $\\Theta(n^{2q})$, even for a set $\\Omega$ of $O(1)$ vectors $x$.\n\nWe give a new analysis of this sketch, providing nearly optimal bounds.\nNamely, we show an upper bound of\n$m = \\Theta \\left (\\epsilon^{-2} \\log(n/\\delta) + \\epsilon^{-1} \\log^q(n/\\delta) \\right ),$\nwhich by composing with CountSketch, can be improved to\n$\\Theta(\\epsilon^{-2}\\log(1/(\\delta \\epsilon)) + \\epsilon^{-1} \\log^q (1/(\\delta \\epsilon))$. For the important case of $q = 2$ and $\\delta = 1/\\poly(n)$, this shows that $m = \\Theta(\\epsilon^{-2} \\log(n) + \\epsilon^{-1} \\log^2(n))$,\ndemonstrating that the $\\epsilon^{-2}$ and $\\log^2(n)$ terms do not multiply each other. We also show a nearly matching lower bound of\n$m = \\Omega(\\eps^{-2} \\log(1/(\\delta)) + \\eps^{-1} \\log^q(1/(\\delta)))$.\nIn a number of applications, one has $|\\Omega| = \\poly(n)$ and in this case our bounds are optimal up to a constant factor. 
This is the first high probability sketch for tensor products that has optimal sketch size and can be implemented in $m \\cdot \\sum_{i=1}^q \\textrm{nnz}(x_i)$ time, where $\\textrm{nnz}(x_i)$ is the number of non-zero entries of $x_i$.\n\nLastly, we empirically compare our sketch to other sketches for tensor products, and give a novel application to compressing neural networks.", "full_text": "Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels

Michela Meister*
Cornell University
Ithaca, NY 14850
meister.michela@gmail.com

Tamas Sarlos
Google Research
Mountain View, CA 94043
stamas@google.com

David P. Woodruff†
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
dwoodruf@cs.cmu.edu

Abstract

We revisit the classic randomized sketch of a tensor product of q vectors x^i ∈ R^n. The i-th coordinate (Sx)_i of the sketch is equal to ∏_{j=1}^q ⟨u^{i,j}, x^j⟩/√m, where the u^{i,j} are independent random sign vectors. Kar and Karnick (JMLR, 2012) show that if the sketching dimension m = Ω(ε^{-2} C_Ω^2 log(1/δ)), where C_Ω is a certain property of the point set Ω one wants to sketch, then with probability 1 − δ, ‖Sx‖_2 = (1 ± ε)‖x‖_2 for all x ∈ Ω. However, in their analysis C_Ω^2 can be as large as Θ(n^{2q}), even for a set Ω of O(1) vectors x.

We give a new analysis of this sketch, providing nearly optimal bounds. Namely, we show an upper bound of m = Θ(ε^{-2} log(n/δ) + ε^{-1} log^q(n/δ)), which by composing with CountSketch can be improved to Θ(ε^{-2} log(1/(δε)) + ε^{-1} log^q(1/(δε))).
For the important case of q = 2 and δ = 1/poly(n), this shows that m = Θ(ε^{-2} log(n) + ε^{-1} log^2(n)), demonstrating that the ε^{-2} and log^2(n) terms do not multiply each other. We also show a nearly matching lower bound of m = Ω(ε^{-2} log(1/δ) + ε^{-1} log^q(1/δ)). In a number of applications, one has |Ω| = poly(n) and in this case our bounds are optimal up to a constant factor. This is the first high probability sketch for tensor products that has optimal sketch size and can be implemented in m · ∑_{i=1}^q nnz(x_i) time, where nnz(x_i) is the number of non-zero entries of x_i.

Lastly, we empirically compare our sketch to other sketches for tensor products, and give a novel application to compressing neural networks.

1 Introduction

Dimensionality reduction, or sketching, is a way of embedding high-dimensional data into a low-dimensional space, while approximately preserving distances between data points. The embedded data is often easier to store and manipulate, and typically results in much faster algorithms. Therefore, it is often beneficial to sketch a dataset first and then run machine learning algorithms on the sketched data. This technique has been applied to numerical linear algebra problems [37], classification [9, 10], data stream algorithms [33], nearest neighbor search [22], sparse recovery [12, 20], and numerous other problems.

*Work done at Google Research.
†Work done at Google Research, and while visiting the Simons Institute for the Theory of Computing.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

While effective, in many modern machine learning problems the points one would like to embed are often only specified implicitly.
Kernel machines, such as support vector machines, are one example, for which one first non-linearly transforms the input points before running an algorithm. Such machines are much more powerful than their linear counterparts, as they can approximate any function or decision boundary arbitrarily well with enough training data. In kernel applications there is a feature map φ : R^n → R^{n′} which maps inputs in R^n to a typically much higher n′-dimensional space, with the important property that for x, y ∈ R^n, one can typically quickly compute ⟨φ(x), φ(y)⟩ given only ⟨x, y⟩. As many applications only depend on the geometry of the input points, or equivalently inner product information, this allows one to work in the potentially much higher and richer n′-dimensional space while running in time proportional to that of the smaller n-dimensional space. Here often one would like to sketch the n′-dimensional points φ(x) without explicitly computing φ(x) and then applying the sketch, as this would be too slow.

A specific example is the polynomial kernel of degree q, for which n′ = n^q and φ(x)_{i_1, i_2, ..., i_q} = x_{i_1} · x_{i_2} ··· x_{i_q}. The polynomial kernel is also often used for approximating more general functions via Taylor expansion [17, 30]. Note that the polynomial kernel φ(x) can be written as a special type of tensor product, φ(x) = x ⊗ x ⊗ ··· ⊗ x, where φ(x) is the tensor product of x with itself q times. In this work we explore the more general problem of sketching a tensor product of arbitrary vectors x^1, ..., x^q ∈ R^n, with the goal of embedding polynomial kernels. We will focus on the typical case when q is an absolute constant independent of n.
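The kernel identity above can be checked directly. The following sketch (our illustration, not the paper's code; all names are ours) forms the explicit degree-q feature map φ(x) = x ⊗ ··· ⊗ x and verifies that the n^q-dimensional inner product ⟨φ(x), φ(y)⟩ collapses to ⟨x, y⟩^q, computable in O(n) time:

```python
import itertools
from functools import reduce

def tensor_power(x, q):
    """Explicit degree-q feature map: one coordinate per index tuple (i1, ..., iq)."""
    return [reduce(lambda a, b: a * b, (x[i] for i in idx), 1.0)
            for idx in itertools.product(range(len(x)), repeat=q)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y, q = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 2
# Kernel trick: <phi(x), phi(y)> = <x, y>^q, so the high-dimensional
# inner product is available without ever forming phi explicitly.
lhs = dot(tensor_power(x, q), tensor_power(y, q))
rhs = dot(x, y) ** q
print(lhs, rhs)  # both equal 20.25
```

Here `tensor_power` materializes all n^q coordinates, which is exactly the cost the sketches in this paper are designed to avoid.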
In this problem we would like to quickly compute S · x, where x = x^1 ⊗ x^2 ⊗ ··· ⊗ x^q and S is a sketching matrix with a small number m of rows, which corresponds to the embedding dimension.

The most naïve solution would be to explicitly compute x and then apply an off-the-shelf Johnson-Lindenstrauss transform S [25, 18, 28, 16], which using the best known bounds gives an embedding dimension of m = Θ(ε^{-2} log(1/δ)), which is optimal [24, 27, 31]. However, the running time is prohibitive, since it is at least the number nnz(x) of non-zeros of x, which can be as large as n^q. A much more practical alternative is TensorSketch [34, 35], which gives a running time of ∑_{i=1}^q nnz(x_i), which is optimal, but the embedding dimension is a prohibitive Θ(ε^{-2}/δ). Note that for high probability applications, where one may want to set δ = 1/poly(n), this gives an embedding dimension as large as poly(n), which, since x has length n^q = poly(n), may defeat the purpose of dimensionality reduction.

Thus, we are at a crossroads; on the one hand we have a sketch with the optimal embedding dimension with a prohibitive running time, and on the other hand we have a sketch with the optimal running time but with a prohibitive embedding dimension. A natural question is if there is another sketch which achieves both a small embedding dimension and enjoys a fast running time.

1.1 Our Contributions

1.1.1 Near-Optimal Analysis of Tensorized Random Projection Sketch

Our first contribution shows that a previously analyzed sketch by Kar and Karnick for tensor products [30], referred to here as a Tensorized Random Projection, has an exponentially better embedding dimension than previously known. Given vectors x^1, ..., x^q ∈ R^n, in this sketch one computes the sketch S · x of the tensor product x = x^1 ⊗ x^2 ⊗ ··· ⊗ x^q, where the i-th coordinate (Sx)_i of the sketch is equal to (1/√m) ∏_{j=1}^q ⟨u^{i,j}, x^j⟩. Here the u^{i,j} ∈ {−1, 1}^n are independent random sign vectors, and q is typically a constant. The previous analysis of this sketch in [35] describes the sketch as having large variance and requires a sketching dimension that grows as n^{2q}, as detailed in the supplementary, in Appendix D.

We give a much improved analysis of this sketch in 2.1, showing that for any x, y ∈ R^{n^q} and δ < 1/n^q, there is an m = Θ(ε^{-2} log(n/δ) + ε^{-1} log^q(n/δ)) for which Pr[|⟨Sx, Sy⟩ − ⟨x, y⟩| > ε] ≤ δ. Notably, our dimension bound grows as log^q(n) rather than n^{2q}, providing an exponential improvement over previous analyses of this sketch. Another interesting aspect of our bound is that the second term only depends linearly on ε^{-1}, rather than quadratically. This can represent a substantial savings for small ε, e.g., if ε = .001. Thus, for example, if ε ≤ 1/log^{q−1}(n), our sketch size is Θ(ε^{-2} log(n)), which is optimal for any possibly adaptive and possibly non-linear sketch, in light of lower bounds for arbitrary Johnson-Lindenstrauss transforms [31]. Thus, at least for this natural setting of parameters, this sketch does not incur very large variance, contrary to the beliefs stated above. Moreover, q = 2 is one of the most common settings for the polynomial kernel in natural language processing [2], since larger degrees tend to overfit.
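To make the construction concrete, here is a minimal, unoptimized sketch of the Tensorized Random Projection applied to a rank-1 input x = x^1 ⊗ ··· ⊗ x^q (our illustration under the definitions above, not the authors' code). It never materializes the n^q-dimensional tensor:

```python
import math
import random

def tensorized_random_projection(xs, m, rng):
    """(Sx)_i = (1/sqrt(m)) * prod_j <u^{i,j}, x^j> for x = x^1 (x) ... (x) x^q."""
    n = len(xs[0])
    out = []
    for _ in range(m):
        coord = 1.0
        for x in xs:
            # fresh independent sign vector u^{i,j} for each (row, factor) pair
            u = [rng.choice((-1.0, 1.0)) for _ in range(n)]
            coord *= sum(ui * xi for ui, xi in zip(u, x))
        out.append(coord / math.sqrt(m))
    return out

x = [1.0, 0.0, 2.0, 0.0]
sketch = tensorized_random_projection([x, x], m=2000, rng=random.Random(0))
approx = sum(c * c for c in sketch)   # estimates ||x (x) x||_2^2
exact = sum(v * v for v in x) ** 2    # = ||x||_2^4 = 25
print(approx, exact)
```

With m = 2000 rows the squared norm estimate concentrates around the exact value 25, in line with the norm-preservation guarantee discussed above.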
In this case, our bound is m = Θ(ε^{-2} log(n) + ε^{-1} log^2(n)), and the separation of the ε^{-2} and log^2(n) terms in our sketching dimension is especially significant.

We next show in 2.2 that a simple composition of the Tensorized Random Projection with a CountSketch [14] slightly improves the embedding dimension to m = Θ(ε^{-2} log(1/(δε)) + ε^{-1} log^q(1/(δε))) and works for all δ < 1. Moreover, we can compute the entire sketch (including the composition with CountSketch) in time O(∑_{i=1}^q m · nnz(x_i)). This makes our sketch a "best of both worlds" in comparison to the Johnson-Lindenstrauss transform and TensorSketch: Tensorized Random Projection runs much faster than the Johnson-Lindenstrauss transform and it enjoys a smaller embedding dimension than TensorSketch. Additionally, we are able to show a nearly matching m = Ω(ε^{-2} log(1/δ) + ε^{-1} log^q(1/δ)) lower bound for this sketch, by exhibiting an input x for which ‖Sx‖_2 ∉ (1 ± ε)‖x‖_2 with probability more than δ.

It is also worthwhile to contrast our results with earlier work in the data streaming community [23, 11], which analyzed the variance only, for q = 2 and general q respectively, and then achieved high probability bounds by taking the median of multiple independent copies of S. The non-linear median operation makes the former constructions unsuitable for machine learning applications. In contrast, we show high probability bounds for the linear embedding S directly. Recent work [4], which was a merger of [5, 29], provides different sketches with different trade-offs. Their main focus is a sketching dimension with a (polynomial) dependence on q, making it more suitable for approximating high-degree polynomial kernels.
Our focus is instead on improving the analysis of an existing sketch, which is most useful for small values of q.

From a technical standpoint, our work builds off the recent proof of the Johnson-Lindenstrauss transform in [16]. We write the sketch S as σ^T Aσ, where in our setting σ corresponds to the concatenation of u^{1,1}, u^{2,1}, ..., u^{m,1}, while A is a random matrix which depends on all of u^{1,j}, u^{2,j}, ..., u^{m,j} for j = 2, 3, ..., q. Following the proof in [16], we then apply the Hanson-Wright inequality to upper bound the w-th moment E[|σ^T Aσ − E[σ^T Aσ]|^w], for integers w, in terms of the Frobenius norm ‖A‖_F and operator norm ‖A‖_2 of the matrix A. The main twist here is that in the tensor setting, when we try to apply this inequality, the matrix A is a random variable itself. Bounding ‖A‖_2 can be accomplished by essentially viewing A as a (q − 1)-th order tensor, flattening it q − 1 times, and applying Khintchine's inequality each time. The more complicated part of the argument is in bounding ‖A‖_F, which again involves an inductive argument to obtain tail bounds on the Frobenius norm of each of the blocks of A, which itself is a block-diagonal matrix with m blocks. The tail bounds are not as strong as those for sub-Gaussian or even sub-exponential random variables, which makes standard analyses based on moment generating functions inapplicable. We instead give a "level-set" argument by giving a novel adaptation of analyses of Tao, originally needed for showing concentration of p-norms for 0 < p < 1, to our tensor setting (see, e.g., Proposition 6 in [36]).

1.1.2 Approximating Polynomial Kernels

Replicating experiments from [35], we approximate polynomial kernels using Tensorized Random Projection, TensorSketch, and Random Maclaurin [30] features.
In Section 4.1 we demonstrate that TensorSketch always fails for certain sparse inputs, while Tensorized Random Projection succeeds with high probability. We show in 4.2 that Tensorized Random Projection has similar accuracy to TensorSketch, and both vastly outperform Random Maclaurin features.

1.1.3 Compressing Neural Networks

We also experiment with using Tensorized Random Projection to compress the layers of a neural network. In [8], Arora et al. propose a method for compressing the layers of a neural network via random projections and prove generalization bounds for such networks. To compress an individual layer, they choose a basis set of random Rademacher matrices and project the layer's weight matrix onto this random basis set. We refer to this method here as Random Projection. The simplest, order q = 2, Tensorized Random Projection can be viewed as a more efficient, rank-1 version of Random Projection: instead of using a basis set of fully-random Rademacher matrices, the basis set is made up of random rank-1 Rademacher matrices. We show in 4.3 that Tensorized Random Projection has similar test accuracy as Random Projection when compressing the top layer of a small neural network.

1.2 Preliminaries

For a survey of using sketching for algorithms in randomized numerical linear algebra, we refer the reader to [37].
We give a brief background here on several concepts related to our work.

There are many variants of the Johnson-Lindenstrauss Lemma, though for us the most useful is that for an m × n matrix S of independent entries drawn from {−1/√m, +1/√m}, if m = Ω(ε^{-2} log(1/δ)), then for any fixed vector x ∈ R^n, we have:

Pr_S[‖Sx‖_2^2 = (1 ± ε)‖x‖_2^2] ≥ 1 − δ.

This lemma is also known to hold for any matrix S with independent sub-Gaussian entries. The matrix S is dense, and the CountSketch transform is instead much sparser.

Definition 1.1 (CountSketch). A CountSketch transform is defined to be Π = ΦD ∈ R^{m×n}. Here, D is an n × n random diagonal matrix with each diagonal entry independently chosen to be +1 or −1 with equal probability, and Φ ∈ {0, 1}^{m×n} is an m × n binary matrix with Φ_{h(i),i} = 1 and all remaining entries 0, where h : [n] → [m] is a random map such that for each i ∈ [n], h(i) = j with probability 1/m for each j ∈ [m]. For a matrix A ∈ R^{n×d}, ΠA can be computed in O(nnz(A)) time, where nnz(A) denotes the number of non-zero entries of A.

We now define a tensor product and various sketches for tensors.

Definition 1.2 (⊗ product for vectors). Given q vectors u^1 ∈ R^{n_1}, u^2 ∈ R^{n_2}, ..., u^q ∈ R^{n_q}, we use u^1 ⊗ u^2 ⊗ ··· ⊗ u^q to denote an n_1 × n_2 × ··· × n_q tensor such that, for each (j_1, j_2, ..., j_q) ∈ [n_1] × [n_2] × ··· × [n_q],

(u^1 ⊗ u^2 ⊗ ··· ⊗ u^q)_{j_1, j_2, ..., j_q} = (u^1)_{j_1}(u^2)_{j_2} ··· (u^q)_{j_q},

where (u^i)_{j_i} denotes the j_i-th entry of vector u^i.

We now formally define TensorSketch:

Definition 1.3 (TensorSketch [34]).
Given q vectors v^1, v^2, ..., v^q, where for each i ∈ [q], v^i ∈ R^{n_i}, let m be the target dimension. The TensorSketch transform is specified using q 3-wise independent hash functions h_1, ..., h_q, where for each i ∈ [q], h_i : [n_i] → [m], as well as q 4-wise independent sign functions s_1, ..., s_q, where for each i ∈ [q], s_i : [n_i] → {−1, +1}.

TensorSketch applied to v^1, ..., v^q is then CountSketch applied to φ(v^1, ..., v^q) with hash function H : [∏_{i=1}^q n_i] → [m] and sign function S : [∏_{i=1}^q n_i] → {−1, +1} defined as follows:

H(i_1, ..., i_q) = h_1(i_1) + h_2(i_2) + ··· + h_q(i_q) (mod m),

and

S(i_1, ..., i_q) = s_1(i_1) · s_2(i_2) · ··· · s_q(i_q).

Using the Fast Fourier Transform, TensorSketch(v^1, ..., v^q) can be computed in O(∑_{i=1}^q (nnz(v^i) + m log m)) time.

The main sketch we study is the classic randomized sketch of a tensor product of q vectors x^i ∈ R^n. The i-th coordinate (Sx)_i of the sketch is equal to ∏_{j=1}^q ⟨u^{i,j}, x^j⟩/√m, where the u^{i,j} are independent random sign vectors. Kar and Karnick show [30] that if the sketching dimension m = Ω(ε^{-2} C_Ω^2 log(1/δ)), where C_Ω is a certain property of the point set Ω one wants to sketch, then with probability 1 − δ, ‖Sx‖_2 = (1 ± ε)‖x‖_2 for all x ∈ Ω. However, in their analysis C_Ω^2 can be as large as Θ(n^{2q}), even for a set Ω of O(1) vectors x.

2 Main Theorem and its Proof

Our main theorem combining sketches S and T described in Sections 2.1 and 2.2 is the following. We provide its proof in Section 2.3.

Theorem 2.1.
There is an oblivious sketch S · T : R^{n^q} → R^m for m = Θ(ε^{-2} log(1/(εδ)) + ε^{-1} log^q(1/(εδ))), such that for any fixed vector x ∈ R^{n^q} and constant q, Pr[‖STx‖_2^2 = (1 ± ε)‖x‖_2^2] ≥ 1 − δ, where 0 < ε, δ < 1. Further, if x has the form x = x^1 ⊗ x^2 ⊗ ··· ⊗ x^q for vectors x^i ∈ R^n for i = 1, 2, ..., q, then the time to compute STx is O(∑_{i=1}^q nnz(x^i)m).

2.1 Initial Bound on Our Sketch Size

We are ready to present the Tensorized Random Projection sketch S and the outermost layer of its analysis. We defer statements and proofs of some key technical lemmas to Appendix A in the supplementary. Note that both the sketching dimension m and the failure probability δ depend on n, which we later eliminate with the help of Section B.

Theorem 2.2. Define the oblivious sketch S : R^{n^q} → R^m for m = Θ(ε^{-2} log(n/δ) + ε^{-1} log^q(n/δ)) as follows. Choose m · q independent uniformly random vectors u^{i,j} ∈ {+1, −1}^n, where i = 1, ..., m and j = 1, ..., q. Let the ℓ-th row of S, for ℓ = 1, ..., m, be (1/√m) u^{ℓ,1} ⊗ u^{ℓ,2} ⊗ ··· ⊗ u^{ℓ,q}; that is, the (i_1, i_2, ..., i_q)-th entry of the ℓ-th row of S is (1/√m) ∏_{j=1}^q u^{ℓ,j}_{i_j}. Then for any fixed vector x ∈ R^{n^q} and failure probability δ < 1/n^q it holds that Pr[‖Sx‖_2^2 = (1 ± ε)‖x‖_2^2] ≥ 1 − δ.

Proof. It suffices to show for any unit vector x ∈ R^{n^q} that

Pr[|‖Sx‖_2^2 − 1| > ε] ≤ δ.   (1)

We define S_i ∈ R^{m×n^{q−1}} to have ℓ-th row equal to (1/√m) u^{ℓ,1}_i · v_ℓ, where v_ℓ = u^{ℓ,2} ⊗ u^{ℓ,3} ⊗ ··· ⊗ u^{ℓ,q}, and define x = (x_1, ..., x_n), with each x_i ∈ R^{n^{q−1}}, so that Sx = ∑_{i=1}^n S_i x_i. Then,

‖Sx‖_2^2 = ‖∑_{i=1}^n S_i x_i‖_2^2 = ∑_{i=1}^n ‖S_i x_i‖_2^2 + 2 ∑_{i≠i′} ⟨S_i x_i, S_{i′} x_{i′}⟩.

Lemma 2.3 below proves that ∑_{i=1}^n ‖S_i x_i‖_2^2 = (1 ± ε/3)‖x‖_2^2 holds with probability at least 1 − δ/10. We prove Lemma 2.3, and in effect Theorem 2.2, by induction on q, applying Theorem 2.2 for q′ = q − 1. To complete the proof, we need to show that

∑_{i≠i′} ⟨S_i x_i, S_{i′} x_{i′}⟩ ≤ ε/3   (2)

with probability at least 1 − 9δ/10. Note that the ℓ-th coordinate (S_i x_i)_ℓ of S_i x_i is (1/√m) u^{ℓ,1}_i ⟨v_ℓ, x_i⟩. So showing (2) is equivalent to showing (1/m) ∑_{i≠i′} ∑_{ℓ=1}^m u^{ℓ,1}_i u^{ℓ,1}_{i′} ⟨v_ℓ, x_i⟩⟨v_ℓ, x_{i′}⟩ ≤ ε/3. Rearranging the order of summation, we need to upper bound

Z := (1/m) ∑_{ℓ=1}^m ∑_{i≠i′} u^{ℓ,1}_i u^{ℓ,1}_{i′} ⟨v_ℓ, x_i⟩⟨v_ℓ, x_{i′}⟩ =: u^T Au,

where u ∈ R^{nm×1} and A ∈ R^{nm×nm} is a block-diagonal matrix with m blocks, each of size n × n. Let E be the event that ∑_{i=1}^n ‖S_i x_i‖_2^2 = (1 ± ε/3).
By Lemma 2.3, we have that Pr[E] ≥ 1 − δ/10. Furthermore, let F be the event that the bounds ‖A‖_2 = O(log^{q−1}(q n^q m/δ)/m) and ‖A‖_F = O(1/√m + log^{1/2}(1/δ) log^{(2q−3)/2}(m/δ) log log(m/δ)/m) hold for the operator and Frobenius norms of A. By a union bound over Lemmas A.4 and A.7, we have that Pr[F] ≥ 1 − δ/10. Lemma A.3 uses the Hanson-Wright Theorem to bound Z in terms of ‖A‖_2 and ‖A‖_F and proves that Pr[Z ≥ ε/3 | F] ≤ δ/2.

Putting this all together, we achieve our initial bound on ‖Sx‖_2^2. Taking the probability over all u^ℓ and v_ℓ, we have

Pr[|‖Sx‖_2^2 − 1| > ε] ≤ Pr[¬E] + Pr[|‖Sx‖_2^2 − 1| > ε | E]
≤ δ/10 + Pr[Z ≥ ε/3 | E]
≤ δ/10 + Pr[Z ≥ ε/3]/Pr[E]
≤ δ/10 + Pr[Z ≥ ε/3]/(1 − δ/10)
≤ δ/10 + (1 + δ/5) Pr[Z ≥ ε/3]
≤ δ/10 + (1 + δ/5)(Pr[Z ≥ ε/3 | F] + Pr[¬F])
≤ δ/10 + (1 + δ/5)(δ/2 + δ/10) = 3δ^2/25 + 7δ/10
≤ 3δ/25 + 7δ/10 ≤ δ.

From the δ ≤ 1 assumption it follows that δ^2 ≤ δ, which implies the second to last inequality and concludes the proof.

Lemma 2.3. For all q ≥ 2, any set of fixed vectors x_1, ..., x_n ∈ R^{n^{q−1}}, sketching dimension m = Θ(ε^{-2} log(n/δ) + ε^{-1} log^{q−1}(n/δ)), δ < 1/n^{q−1}, and matrices S_i ∈ R^{m×n^{q−1}} defined in the proof of Theorem 2.2, we have that Pr[∑_{i=1}^n ‖S_i x_i‖_2^2 = (1 ± ε/3)‖x‖_2^2] ≥ 1 − δ/10.

Proof.
Define the matrix S_0 ∈ R^{m×n^{q−1}} such that its ℓ-th row is v_ℓ/√m from the proof of Theorem 2.2. Additionally define m × m diagonal matrices D_i such that D_i_{ℓ,ℓ} := u^{ℓ,1}_i. Note that S_i = D_i S_0 and therefore ‖S_i x_i‖_2 = ‖D_i S_0 x_i‖_2 = ‖S_0 x_i‖_2 holds since D_i is a ±1 diagonal matrix. To prove the lemma, it is sufficient to show that

∀i ∈ [1, n] : Pr[‖S_0 x_i‖_2^2 = (1 ± ε/3)‖x_i‖_2^2] ≥ 1 − δ/(10n)   (3)

holds, since then we have that ∑_{i=1}^n ‖S_i x_i‖_2^2 = ∑_{i=1}^n ‖S_0 x_i‖_2^2 = (1 ± ε/3) ∑_{i=1}^n ‖x_i‖_2^2 = (1 ± ε/3)‖x‖_2^2 with probability at least 1 − δ/10 by a union bound.

We prove inequality (3) by induction on q. In the base case q = 2, the entries of the vectors v_ℓ = u^{ℓ,2} are i.i.d. ±1 random variables. Equivalently, the entries of S_0 are i.i.d. ±1 random variables. Applying the Johnson-Lindenstrauss lemma [31] to S_0 and each x_i with δ′ = δ/(10n) proves the base case. Now assume that Theorem 2.2 holds for q′ = q − 1. Observe that the structure of S_0 for q′ = q − 1 is exactly like that of S for q. Setting δ′ = δ/(10n) in Theorem 2.2, we have that inequality (3) holds for sketching dimension m′ = Θ(ε^{-2} log(n/δ′) + ε^{-1} log^{q−1}(n/δ′)). Since log(n/δ′) = Θ(log(n^2/δ)) = Θ(log(n/δ)), we can simplify m′ to Θ(ε^{-2} log(n/δ) + ε^{-1} log^{q−1}(n/δ)) as claimed.

2.2 Optimizing Our Sketch Size

We define the sketch T, which is a tensor product of CountSketch matrices.
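The two properties of a CountSketch factor used in this composition — it runs in O(nnz(x^i)) time and does not increase the number of non-zeros — are easy to see in code. A minimal illustration (ours, with hypothetical variable names), following Definition 1.1:

```python
import random

def countsketch_apply(x, h, s, t):
    """CountSketch Pi = Phi*D applied to x: bucket h[i] receives s[i] * x[i]."""
    out = [0.0] * t
    for i, v in enumerate(x):
        if v != 0.0:  # only non-zeros are touched: O(nnz(x)) work
            out[h[i]] += s[i] * v
    return out

rng = random.Random(0)
n, t, q = 6, 4, 2
xs = [[0.0, 1.0, 0.0, 2.0, 0.0, 0.0]] * q                             # sparse factors x^i
hs = [[rng.randrange(t) for _ in range(n)] for _ in range(q)]         # hash maps h
ss = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(q)]  # sign diagonals D
factors = [countsketch_apply(x, h, s, t) for x, h, s in zip(xs, hs, ss)]
# For rank-1 x = x^1 (x) x^2, T x factors as T^1 x^1 (x) T^2 x^2,
# so the n^q-dimensional tensor is never formed.
Tx = [a * b for a in factors[0] for b in factors[1]]
print(len(Tx))  # t^q entries
```

Each sketched factor here has at most as many non-zeros as its input, which is exactly the nnz(T^i x^i) ≤ nnz(x^i) property stated in Theorem 2.4 below.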
We compose our sketch S from Section 2.1 with T in order to remove the dependence on n. See Section B for the proof.

Theorem 2.4. Let T be a tensor product of q CountSketch matrices, T = T^1 ⊗ ··· ⊗ T^q, where each T^i maps R^n → R^t for t = Θ(q^3/(ε^2 δ)). Then for any unit vector x ∈ R^{n^q}, we have Pr[|‖Tx‖_2^2 − 1| > ε] ≤ δ. Furthermore, if x is of the form x^1 ⊗ x^2 ⊗ ··· ⊗ x^q, for x^i ∈ R^n for i = 1, 2, ..., q, then Tx = T^1 x^1 ⊗ ··· ⊗ T^q x^q, where nnz(T^i x^i) ≤ nnz(x^i) and where the time to compute T^i x^i is O(nnz(x^i)) for i = 1, 2, ..., q.

2.3 Proof of Theorem 2.1

Finally we prove our main claim by composing sketches S and T from Sections 2.1 and 2.2.

Proof. Our overall sketch is S · T, where S is the sketching matrix of Section 2.1, with sketching dimension m = Θ(ε^{-2} log(t/δ) + ε^{-1} log^q(t/δ)), and T is the sketching matrix of Section 2.2, with sketching dimension t = Θ(q^3/(ε^2 δ)). To satisfy the conditions of Theorem 2.2, set δ_S = 0.5/t^q. S is applied with approximation error ε/2 and failure probability δ_S, and T is applied with ε/2 and δ/2, respectively. Note that δ_S ≤ δ/2 and, for q constant, log(t/δ_S) = Θ(log(t^{q+1})) = Θ(log(t)) = Θ(log(1/(εδ))) holds. Thus, the sketching dimension m of ST is now Θ(ε^{-2} log(1/(εδ)) + ε^{-1} log^q(1/(εδ))), and has no dependence on n. By Theorems 2.2, 2.4, and a union bound, we have that for any unit vector x ∈ R^{n^q}, Pr[|‖S · Tx‖_2^2 − 1| > ε] ≤ δ. In Theorem 2.4 above we show that, if x is a vector of the form x^1 ⊗ x^2 ⊗ ··· ⊗ x^q, for x^i ∈ R^n for i = 1, 2, ..., q, then Tx = T^1 x^1 ⊗ ··· ⊗ T^q x^q, where each T^i x^i can be computed in O(nnz(x^i)) time and where nnz(T^i x^i) ≤ nnz(x^i). Thus, we can apply S to Tx in O(∑_{i=1}^q nnz(x^i)m) time.

3 Lower Bound on Our Sketch Size

We next show that our sketching dimension of m = Θ(ε^{-2} log(1/(δε)) + ε^{-1} log^q(1/(δε))) is nearly tight for our particular sketch S · T. We will assume that q is constant. Note that S · T is an oblivious sketch, and consequently by lower bounds for any oblivious sketch [24, 27, 31], one has that m = Ω(ε^{-2} log(1/δ)). More interestingly, we show a lower bound of m = Ω(ε^{-1} log^q(1/δ)), summarized in the following theorem; see Section C for the proof.

Theorem 3.1. For any constant integer q, there is an input x ∈ R^{n^q} for which, if the number m of rows of S satisfies m = o(ε^{-2} log(1/δ) + ε^{-1} log^q(1/δ)), then with probability at least δ, ‖STx‖_2^2 > (1 + ε)‖x‖_2^2.

Recall that the upper bound on our sketch size, for constant q, is m = O(ε^{-2} log(1/(εδ)) + ε^{-1} log^q(1/(εδ))), and thus our analysis is nearly tight whenever log(1/(εδ)) = Θ(log(1/δ)). This holds, for example, whenever δ < ε, which is a typical setting since δ = 1/poly(n) for high probability applications.

4 Experiments

We evaluate Tensorized Random Projections in three different applications. In Section 4.1 we show that Tensorized Random Projections always succeed with high probability while TensorSketch always fails on extremely sparse inputs. Then in Section 4.2 we observe that TensorSketch and Tensorized Random Projections approximate non-linear SVMs with polynomial kernels equally well.
Finally, in Section 4.3 we demonstrate that Random Projections and Tensorized Random Projections are equally effective in reducing the number of parameters in a neural network, while Tensorized Random Projections are faster to compute. To the best of our knowledge this comprises the first experimental evaluation of [8]'s compression technique in terms of accuracy. The code for the experiments is available at https://github.com/google-research/google-research/tree/master/poly_kernel_sketch.

4.1 Success Probability of TensorSketch vs Tensorized Random Projection

In this section we demonstrate that TensorSketch cannot approximate the polynomial kernel κ(x, y) = ⟨x, y⟩^q accurately for all pairs x, y ∈ V simultaneously if the vectors in the set V are not smooth, i.e., if ‖x‖_∞/‖x‖_2 = Ω(1) holds for all x in V. TensorSketch fails even if the sketching dimension m is much larger than |V|. On the contrary, Tensorized Random Projection works well.

Let a set S of data points be a standard basis in d dimensions. If k ≥ 2 coordinates of different vectors collide in the same TensorSketch hash bucket, then their common bucket is either zero or non-zero. If it is 0, then ⟨e_i, e_i⟩^q is incorrectly estimated as 0 instead of 1. If the common bucket's value is not 0, then the estimate of ⟨e_i, e_j⟩^q is non-zero, where i and j are any pair of two colliding coordinates. Thus if there is a collision, then TensorSketch cannot estimate all dot products exactly. Moreover, the estimate cannot be close to the true kernel value either, since if a dot product is incorrect, then it is off by at least 1. Now if n ≥ √(2m ln(1/(1 − p))), then by the birthday paradox [1] we have at least one collision with probability p.
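The birthday-paradox threshold just stated is easy to check empirically. The following simulation (our illustration, not the paper's experiment) hashes n coordinates into m buckets and estimates the collision probability at n ≈ √(2m ln(1/(1 − p))):

```python
import math
import random

def has_collision(n, m, rng):
    """Hash n distinct coordinates into m buckets; report whether any two collide."""
    seen = set()
    for _ in range(n):
        b = rng.randrange(m)
        if b in seen:
            return True
        seen.add(b)
    return False

m, p = 100, 0.5
n = math.ceil(math.sqrt(2 * m * math.log(1.0 / (1.0 - p))))  # birthday threshold
rng = random.Random(0)
trials = 2000
rate = sum(has_collision(n, m, rng) for _ in range(trials)) / trials
print(n, rate)  # at the threshold the collision rate is close to p
```

For m = 100 and p = 0.5 the threshold is n = 12 coordinates, far fewer than the sketch size m, which is why TensorSketch fails on sparse basis-vector inputs well before n reaches m.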
If the number of vectors (and dimension) $n$ is greater than the sketching dimension $m$, which is the interesting case for sketching, then there is always a collision by the pigeonhole principle. We remark that [26] provides a more detailed analysis of this sketching dimension vs. input vector smoothness tradeoff for CountSketch, which is a key building block of TensorSketch.

We illustrate the above phenomena in Figure 1(a) as follows. We fix the sketch size $m = 100$ and vary the input dimension (= number of vectors) $n$ along the x-axis. We measure the largest absolute error in approximating $\kappa(e_i, e_j) = \langle e_i, e_j \rangle^2 = \delta_{ij}$ among the first $n$ standard basis vectors, and repeat the experiment with 100 randomly drawn TensorSketch and Tensorized Random Projection instances. The y-axis shows the average of the maximum error in approximating the true kernel, where error bars correspond to one standard deviation. It is clear that TensorSketch's error quickly becomes the largest possible, 1, as the number $n$ of vectors passes the critical threshold $\sqrt{100}$, while Tensorized Random Projection's max error is much smaller, more concentrated, and grows at a much slower rate in the same setting.

Figure 1: Maximum Error. (a) Max error vs input dimension (n); (b) Max error vs sketch size (m).

Next, in Figure 1(b) we fix the input dimension (= number of vectors) to $n = 100$ and vary the sketch size $m$ along the x-axis instead. The y-axis remains unchanged. We again observe that TensorSketch's max error decreases very slowly, and it is still about 40% of the largest possible error (1) on average at sketch size $m = n^2 = 10^4 \ll d$.
Tensorized Random Projection's max error is almost an order of magnitude smaller at the same sketch size.

4.2 Comparison of Sketching Methods for SVMs with Polynomial Kernel

We replicate experiments from [35] to compare Tensorized Random Projections with the TensorSketch (TS) and Random Maclaurin (RM) sketches. We approximate the polynomial kernel $\langle x, y \rangle^2$ for the Adult [19] and MNIST [32] datasets by applying one of the above three sketches to the dataset. We then train a linear SVM on the sketched dataset using LIBLINEAR [21] and report the training accuracy, taken as the median accuracy over 5 trials. Our baseline is the training accuracy of a non-linear SVM trained with the exact kernel by LIBSVM [13]. We experiment with between 100 and 500 random features.

Both Figures 2(a) and 2(b) show that Tensorized Random Projection has accuracy similar to TensorSketch, and both have far better accuracy than Random Maclaurin. Recall that Random Maclaurin approximates the kernel function $\kappa$ with its Maclaurin series: for each sketch coordinate it randomly picks a degree $t$ with probability $2^{-t}$ and computes a degree-$t$ Tensorized Random Projection. This is rather inefficient for the polynomial kernel, which has exactly one non-zero coefficient in its Maclaurin expansion. Random Maclaurin's generality is not required for the polynomial kernel, and for general kernels one could obtain more accurate results by sampling the degree $t$ proportionally to its Maclaurin coefficient.

Figure 2: Accuracy vs Number of Random Features. (a) Adult dataset; (b) MNIST dataset.

4.3 Compressing Neural Networks

We begin with a standard 2-layer fully connected neural network trained on MNIST [32] with a baseline test accuracy of around 0.97. The first layer has dimension (784x512) and the top layer has dimension (512x10).
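As a toy illustration of sketching a weight matrix and reinserting an approximation of it (a simplified stand-in we wrote for exposition, not the exact pipeline used in the experiments of this section), one can flatten a layer's weights $w$, store only the $m$-dimensional sign-projection sketch $Sw$ plus the random seed, and reconstruct via $S^\top S w$; since $\mathbb{E}[S^\top S] = I$ for $S$ with i.i.d. $\pm 1/\sqrt{m}$ entries, the reconstruction is unbiased.

```python
import numpy as np

def sketch_and_reconstruct(W, m, seed=0):
    # Compress the flattened weights w with a random sign projection S
    # (m x d, entries +-1/sqrt(m)); only S @ w (m numbers) and the seed
    # need to be stored.  S.T @ (S @ w) is an unbiased but noisy
    # reconstruction, since E[S^T S] = I.
    rng = np.random.default_rng(seed)
    w = W.ravel()
    S = rng.choice([-1.0, 1.0], size=(m, w.size)) / np.sqrt(m)
    compressed = S @ w
    return (S.T @ compressed).reshape(W.shape)

W = np.arange(12.0).reshape(3, 4) / 12.0   # toy stand-in for a (512x10) layer
W_hat = sketch_and_reconstruct(W, m=8)     # same shape, 8 stored parameters
```

The per-entry noise scales like $\|w\|_2/\sqrt{m}$, which is the usual compression-vs-accuracy tradeoff one sees when varying the sketch size.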
Further specifics of the model can be found in the TensorFlow tutorials [3]. We sketch the weight matrix in the top layer using either Tensorized Random Projection or ordinary Random Projection. We then reinsert this sketched matrix into the original model and evaluate its accuracy on the MNIST test set. We compare both the test accuracy and the time needed to compute the sketch for both methods.

In Figure 3(a) we see that both Tensorized Random Projection and Random Projection reach similar test accuracy for the same number of parameters. Figure 3(b) illustrates that Tensorized Random Projection runs somewhat faster than ordinary Random Projection.

Figure 3: Sketching the Last Layer of the MNIST Neural Network. (a) Test Accuracy vs Sketch Size; (b) Time vs Sketch Size.

5 Conclusion

We presented a new analysis of Tensorized Random Projection, providing nearly optimal bounds, and demonstrated its versatility in multiple applications. An interesting question left for future work is whether its $m \cdot \sum_{i=1}^q \mathrm{nnz}(x_i)$ running time could be further improved for dense $x$. We conjecture that the i.i.d. random Rademacher vectors $u^{\ell}_i$ might be replaced with fast pseudo-random rotations, perhaps a product of one or more randomized Hadamard matrices similar to ideas in [7], which could possibly lead to an $O(m \log n)$ running time.

References

[1] https://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem.

[2] https://en.wikipedia.org/wiki/Polynomial_kernel, practical use section.

[3] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[4] Thomas D. Ahle, Michael Kapralov, Jakob B. T. Knudsen, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In SODA, 2020.

[5] Thomas D. Ahle and Jakob Bæk Tejs Knudsen. Almost optimal tensor sketch. CoRR, abs/1909.01821, 2019.

[6] Noga Alon, Yossi Matias, and Mario Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci., 58(1):137-147, 1999.

[7] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems, pages 1225-1233, 2015.

[8] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 254-263, 2018.

[9] Rosa I. Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA, pages 616-623, 1999.

[10] Avrim Blum. Random projection, margins, kernels, and feature-selection.
In Subspace, Latent Structure and Feature Selection, Statistical and Optimization Perspectives Workshop, SLSFS 2005, Bohinj, Slovenia, February 23-25, 2005, Revised Selected Papers, pages 52-68, 2005.

[11] Vladimir Braverman, Kai-Min Chung, Zhenming Liu, Michael Mitzenmacher, and Rafail Ostrovsky. AMS without 4-wise independence on product domains. In 27th International Symposium on Theoretical Aspects of Computer Science (STACS 2010), 2010.

[12] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Information Theory, 52(2):489-509, 2006.

[13] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[14] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693-703. Springer, 2002.

[15] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference, STOC '13, Palo Alto, CA, USA, June 1-4, 2013, pages 81-90, 2013.

[16] Michael B. Cohen, T. S. Jayram, and Jelani Nelson. Simple analyses of the sparse Johnson-Lindenstrauss transform. In 1st Symposium on Simplicity in Algorithms, SOSA 2018, January 7-10, 2018, New Orleans, LA, USA, pages 15:1-15:9, 2018.

[17] Andrew Cotter, Joseph Keshet, and Nathan Srebro. Explicit approximations of the Gaussian kernel. CoRR, 2011.

[18] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct. Algorithms, 22(1):60-65, 2003.

[19] Dua Dheeru and Efi Karra Taniskidou.
UCI machine learning repository, 2017.

[20] David L. Donoho. Compressed sensing. IEEE Trans. Information Theory, 52(4):1289-1306, 2006.

[21] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

[22] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321-350, 2012.

[23] Piotr Indyk and Andrew McGregor. Declaring independence via the sketching of sketches. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 737-745. Society for Industrial and Applied Mathematics, 2008.

[24] T. S. Jayram and David P. Woodruff. Optimal bounds for Johnson-Lindenstrauss transforms and streaming problems with subconstant error. ACM Trans. Algorithms, 9(3):26:1-26:17, 2013.

[25] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 1984.

[26] Lior Kamma, Casper B. Freksen, and Kasper Green Larsen. Fully understanding the hashing trick. In Advances in Neural Information Processing Systems, pages 5394-5404, 2018.

[27] Daniel M. Kane, Raghu Meka, and Jelani Nelson. Almost optimal explicit Johnson-Lindenstrauss families. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques - 14th International Workshop, APPROX 2011, and 15th International Workshop, RANDOM 2011, Princeton, NJ, USA, August 17-19, 2011. Proceedings, pages 628-639, 2011.

[28] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4:1-4:23, 2014.

[29] Michael Kapralov, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels.
CoRR, abs/1909.01410, 2019.

[30] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, Spain, April 21-23, 2012, pages 583-591, 2012. Later: Journal of Machine Learning Research (JMLR): WCP, 22:583-591.

[31] Kasper Green Larsen and Jelani Nelson. Optimality of the Johnson-Lindenstrauss lemma. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 633-638, 2017.

[32] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[33] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2), 2005.

[34] Rasmus Pagh. Compressed matrix multiplication. In Innovations in Theoretical Computer Science 2012, Cambridge, MA, USA, January 8-10, 2012, pages 442-451, 2012.

[35] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA, August 11-14, 2013, pages 239-247, 2013.

[36] Terence Tao. Math 254a: Notes 1: Concentration of measure, https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/, 2010.

[37] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1-157, 2014.