NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 8733
Title: Understanding Sparse JL for Feature Hashing

Reviewer 1

Note that a denser version of feature hashing was already considered in [29], under the name "Multiple Hashing". This is an alternative to sparse JL, seen as a preprocessing step that expands each dimension to s dimensions (scaling by 1/sqrt(s)) before applying feature hashing. This defines essentially the same mapping (the only difference is whether a set or a multiset of s dimensions is selected). Note that this preprocessing decreases the L_infty/L_2 ratio by exactly a factor of sqrt(s), so combining it with [23] yields lower bounds f(m, eps, p) that are exactly sqrt(s) times the bounds in [23], and not far from your tight bounds. This consequence of previous work should be acknowledged in your paper.

Even though the L_infty/L_2 ratio has been used in the past to give bounds on performance, it does not *characterise* the performance of feature hashing. For example, feature hashing handles a single large entry in a vector well, even though that entry may send the L_infty/L_2 ratio through the roof. It would be nice to have a more fine-grained understanding with predictive power on any given vector.
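A minimal NumPy sketch of the preprocessing described above, assuming the reading of [29] given in this review; function and variable names are illustrative, not taken from the paper or [29]. It shows that the expansion step preserves the l_2 norm while shrinking the l_infty norm by exactly sqrt(s), after which plain feature hashing (s = 1 sparse JL) is applied:

```python
import numpy as np

def multiple_hashing_expand(x, s):
    """Expand each coordinate into s copies scaled by 1/sqrt(s).
    Preserves the l2 norm; shrinks the l_infty norm by sqrt(s)."""
    return np.repeat(x, s) / np.sqrt(s)

def feature_hash(x, m, rng):
    """Plain feature hashing: each coordinate is sent to one of m
    buckets with a uniformly random sign (the s = 1 sparse JL map)."""
    buckets = rng.integers(0, m, size=x.size)
    signs = rng.choice([-1.0, 1.0], size=x.size)
    y = np.zeros(m)
    np.add.at(y, buckets, signs * x)  # accumulate collisions
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
s = 4
xp = multiple_hashing_expand(x, s)

# l2 norm is preserved; l_infty drops by exactly a factor sqrt(s)
assert np.isclose(np.linalg.norm(xp), np.linalg.norm(x))
assert np.isclose(np.max(np.abs(xp)), np.max(np.abs(x)) / np.sqrt(s))

y = feature_hash(xp, m=256, rng=rng)  # close to s-sparse JL applied to x
```

The only difference from s-sparse JL is that the s hashed locations of a coordinate form a multiset rather than a set, which is why the two mappings are essentially the same.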

Reviewer 2

This paper follows a recent line of research on analyzing the sparse JL transform for vectors with bounded l_{\infty}-to-l_2 norm ratio. The current paper focuses on the sparse JL transform with a general sparsity parameter s and gives tight bounds on the tradeoff between the l_{\infty}-to-l_2 norm ratio, accuracy, and sparsity s. The given bounds clearly demonstrate the advantage of a large sparsity parameter s and give a complete picture of the sparse JL tradeoffs. I appreciate the amount of technical effort the authors put into analyzing sparse JL transforms. The paper seems technically sound and is clearly written. I recommend acceptance.

Reviewer 3

The interesting part of the paper is a new analysis of the same quantity of embedding error that has been repeatedly studied in previous works. The previous approaches fail to generalize to either s > 1 or vectors with a general ell_infty/ell_2 bound. Previously, the embedding error was usually treated as a quadratic form in Rademacher variables, and the Hanson-Wright inequality could be applied to obtain moment bounds. However, this does not give tight bounds for v(m, eps, delta, s). Instead, this paper uses a more specific inequality on Rademacher quadratic forms due to Latala, together with a bound on weighted sums of symmetric random variables (originally proved by Latala; the authors provide an alternate proof). The rest is laborious case analysis, and I was not able to verify everything. The authors clearly intended to connect this sparse-JL problem to machine learning and thus branded "feature vectors" in the main body of the paper, but the connection is not really explained. I would recommend the authors drop this superficial phrase and make it a cleaner math paper.

Minor comments:
- There is no need to write the subscript e for the natural log. Use ln instead, or just say that log stands for the natural logarithm, as this does not affect the asymptotic orders.
- Notation inconsistency: the distribution is \mathcal{A} in the main body but \mathscr{A} in the supplementary material.
- Last line of the statement of Lemma 3.2: change "... >= ..." to ">= ... >=".
- First line below Section 4: missing a right bracket ) in ||Z_r(...)||_q. There are other places with missing closing brackets; please check the paper thoroughly.
- Lemma 6.1: change "x_i <= \nu" to "|x_i| <= \nu".
- Page 12 of the supplementary material: in the bottom equations, what is the supremum taken over? The condition is not clear. Is it T >= max{...,...,...}? This error occurs throughout the proof.
- Lemma 8.1: isn't the first sentence redundant?
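For context, one standard formulation of the Hanson-Wright inequality for Rademacher quadratic forms referred to above is the following (a textbook statement, not a result of the paper under review):

```latex
% Hanson-Wright tail bound: \sigma \in \{-1,+1\}^n Rademacher, A symmetric,
% \|A\|_F the Frobenius norm, \|A\| the operator norm, c > 0 an absolute constant.
\Pr\Big[\,\big|\sigma^{\top} A \sigma - \mathbb{E}\,\sigma^{\top} A \sigma\big| > t\,\Big]
  \le 2\exp\Big(-c \min\Big\{\frac{t^{2}}{\|A\|_{F}^{2}},\ \frac{t}{\|A\|}\Big\}\Big).
```

The review's point is that these two matrix norms alone are too coarse for tight bounds on v(m, eps, delta, s), which is why the finer Latala-type inequalities are needed.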