Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels

Part of Advances in Neural Information Processing Systems 32 (NeurIPS 2019)

Authors

Michela Meister, Tamas Sarlos, David Woodruff

Abstract

We revisit the classic randomized sketch of a tensor product of $q$ vectors $x^j\in\mathbb{R}^n$. The $i$-th coordinate $(Sx)_i$ of the sketch is equal to $\prod_{j = 1}^q \langle u^{i, j}, x^j \rangle / \sqrt{m}$, where the $u^{i,j}$ are independent random sign vectors. Kar and Karnick (JMLR, 2012) show that if the sketching dimension $m = \Omega(\epsilon^{-2} C_{\Omega}^2 \log (1/\delta))$, where $C_{\Omega}$ is a certain property of the point set $\Omega$ one wants to sketch, then with probability $1-\delta$, $\|Sx\|_2 = (1\pm \epsilon)\|x\|_2$ for all $x\in\Omega$. However, in their analysis $C_{\Omega}^2$ can be as large as $\Theta(n^{2q})$, even for a set $\Omega$ of $O(1)$ vectors $x$. We give a new analysis of this sketch, providing nearly optimal bounds. Namely, we show an upper bound of $m = \Theta\left(\epsilon^{-2} \log(n/\delta) + \epsilon^{-1} \log^q(n/\delta)\right)$, which, by composing with CountSketch, can be improved to $\Theta\left(\epsilon^{-2}\log(1/(\delta \epsilon)) + \epsilon^{-1} \log^q (1/(\delta \epsilon))\right)$. For the important case of $q = 2$ and $\delta = 1/\mathrm{poly}(n)$, this shows that $m = \Theta(\epsilon^{-2} \log(n) + \epsilon^{-1} \log^2(n))$, demonstrating that the $\epsilon^{-2}$ and $\log^2(n)$ terms do not multiply each other. We also show a nearly matching lower bound of $m = \Omega(\epsilon^{-2} \log(1/\delta) + \epsilon^{-1} \log^q(1/\delta))$. In a number of applications one has $|\Omega| = \mathrm{poly}(n)$, and in this case our bounds are optimal up to a constant factor. This is the first high-probability sketch for tensor products that has optimal sketch size and can be implemented in $m \cdot \sum_{j=1}^q \textrm{nnz}(x^j)$ time, where $\textrm{nnz}(x^j)$ is the number of non-zero entries of $x^j$. Lastly, we empirically compare our sketch to other sketches for tensor products, and give a novel application to compressing neural networks.
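To make the object of study concrete, below is a minimal NumPy illustration of the sketch defined in the abstract: each coordinate $(Sx)_i$ is the product of $q$ inner products with independent random sign vectors, scaled by $1/\sqrt{m}$. The function and variable names (`tensor_sketch`, `xs`, `m`) are illustrative choices, not the authors' code, and this dense version runs in $O(mqn)$ time rather than the $m \cdot \sum_j \textrm{nnz}(x^j)$ time claimed for the paper's implementation.

```python
import numpy as np

def tensor_sketch(xs, m, seed=None):
    """Sketch the tensor product of the q vectors in `xs` (each in R^n).

    The i-th coordinate of the sketch is prod_j <u^{i,j}, x^j> / sqrt(m),
    where each u^{i,j} is an independent random sign (+/-1) vector.
    """
    rng = np.random.default_rng(seed)
    q = len(xs)
    n = xs[0].shape[0]
    # u[i, j, :] is the sign vector u^{i,j}
    u = rng.choice([-1.0, 1.0], size=(m, q, n))
    # inner[i, j] = <u^{i,j}, x^j>, computed for all i, j at once
    inner = np.einsum("mjn,jn->mj", u, np.stack(xs))
    # multiply the q inner products per coordinate and rescale
    return inner.prod(axis=1) / np.sqrt(m)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs = [rng.standard_normal(50) for _ in range(2)]        # q = 2 vectors in R^50
    true_norm = np.prod([np.linalg.norm(x) for x in xs])    # ||x^1 tensor x^2||_2
    sketch = tensor_sketch(xs, m=4096, seed=1)
    print(true_norm, np.linalg.norm(sketch))                # should be close
```

In this toy check, the norm of the sketch concentrates around the norm of the tensor product as $m$ grows; the paper's contribution is a tight characterization of how large $m$ must be for this to hold with probability $1-\delta$ over a point set $\Omega$.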