Paper ID: 1226
Title: Sign Cauchy Projections and Chi-Square Kernel
Reviews

Submitted by Assigned_Reviewer_7

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper studies the collision probability of the following hashing scheme for points in R^D: the hash function picks a random vector of D i.i.d. samples from a symmetric alpha-stable distribution, and to hash a point x, we compute the dot product of x and the random vector and return its sign. The same scheme with the 2-stable (Gaussian) distribution has been studied before (known as sim-hash). The main result shows that the collision probability for alpha=1 on binary data can be approximated by a function of the chi-square similarity. The bound for general data is pointed out by the authors themselves to be far from the true collision probability, so it is not clear what it means, especially regarding the comparison between that bound and the chi-square similarity.
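For concreteness, a minimal numpy sketch of the scheme as described here (illustrative only, not code from the paper; the function names are mine, and only alpha = 1 (Cauchy) and alpha = 2 (Gaussian, i.e. sim-hash) are covered, since general alpha would need a dedicated stable sampler):

```python
import numpy as np

def sign_stable_projection(X, k, alpha=1.0, seed=0):
    """Hash each row of X to k sign bits via i.i.d. symmetric alpha-stable projections.

    Sketch only: alpha = 2 uses Gaussian entries (sim-hash), alpha = 1 uses
    Cauchy entries; other alpha would need a general stable sampler (e.g. CMS).
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    if alpha == 2.0:
        R = rng.standard_normal((D, k))
    elif alpha == 1.0:
        R = rng.standard_cauchy(size=(D, k))
    else:
        raise NotImplementedError("only alpha in {1, 2} sketched here")
    return np.sign(X @ R)

def match_fraction(hx, hy):
    """Fraction of hashes on which two sign vectors agree (empirical collision rate)."""
    return float(np.mean(hx == hy))
```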

The paper provides some experiments to show that the collision probability is approximately equal to some functions of the chi-square similarity, and thus one can approximate the chi-square similarity from the collision probability and these functions. However, in this process, I think we are no longer able to use linear SVM or efficient near neighbor search (advantages 2 and 3 in the introduction) and have to use kernel SVM and exhaustive search instead. When using linear SVM, we use the kernel implicitly defined by the LSH, and there is inadequate explanation of how useful it is (chi-square similarity is useful, but that does not mean any function of it is also useful).
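To make the distinction concrete: using the bits directly with a linear SVM means the classifier operates on the kernel implicitly defined by the collision probability, not on chi-square itself. A toy, hypothetical sketch (reusing sign_stable_projection from the sketch above; scikit-learn assumed, names and data are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.random((200, 50))                        # toy non-negative data
y = (X[:, :5].sum(axis=1) > 2.5).astype(int)     # toy labels
H = sign_stable_projection(X, k=256, alpha=1.0)  # +/-1 sign bits

# The linear kernel on the sign bits is (up to an affine transform) the
# empirical collision rate, i.e. the kernel implicitly defined by the LSH,
# not the chi-square kernel itself.
clf = LinearSVC(C=1.0).fit(H[:150], y[:150])
print("toy accuracy:", clf.score(H[150:], y[150:]))
```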

I have not verified all the proofs in the appendix but the ones I have read seem correct. The proof details are clear and easy to follow. The question of finding a hashing scheme approximating the chi square similarity is interesting, but the paper does not get close to resolving it.

============

Rebuttal comments: the streaming application is interesting but in all 3 applications (linear SVM, near neighbor search, streaming), the crucial factor is whether the kernel implicitly defined by the hashing scheme is useful, which is still inadequately supported. It would be helpful to report the accuracy and speed achieved using the hashing scheme directly with linear SVM (as described in the introduction) on, say, the MNIST data.
Q2: Please summarize your review in 1-2 sentences
The paper shows that the collision probability of a certain hashing scheme based on stable distributions is related to some function of the chi-square similarity. The paper does not show how useful this function is for learning tasks or how to get algorithms for chi-square similarity.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper explores a novel hashing scheme that relates the chi-squared similarity between two high-dimensional vectors x and y, and the Hamming distance between the sign vectors of their randomly projected versions, sign(xR) and sign(yR). Here R is chosen to be a projection matrix whose elements are drawn from an alpha-stable distribution. The authors theoretically demonstrate that the chi-squared similarity measure is lower-bounded by the cosine of the (normalized) Hamming distance between the sign vectors. Moreover, for the case where alpha=1 and the matrix is i.i.d. Cauchy, they empirically demonstrate that this bound is a good approximation to the actual similarity measures. The authors utilize this relation to achieve competitive performance with existing kernel-based methods for classification.

The ideas introduced in this paper are interesting. As the authors point out in the conclusions, the chi-squared similarity measure potentially lends itself to several important applications and therefore, a method that efficiently computes this measure can be very beneficial. I have a few concerns that the authors can hopefully address:

- The paper can benefit from some careful editing. Most of the paper concerns itself with carefully bounding the collision probabilities of the dimensionality reduced representations. However, it looks like the true focus instead (especially in practice) is the chi-squared similarity measure and this message is not really hammered home until Section 7. The experimental results are also scattered and this impedes the flow of the paper somewhat.

- A feature of most hashing schemes in this line of work (such as sim-hash, LSH, etc.) is that it is possible to provide concrete lower as well as upper bound-type relations between pairwise metrics measured in the original high-dimensional space and the Hamming space. Here, only one side of such a relation is established. Therefore, one cannot really describe this scheme as a rigorous similarity estimation technique in the vein of the previous approaches. It would be nice if the authors commented on the conceptual difficulties involved in proving the other half of the problem.

- While the approach presented in the paper is space-efficient, the authors do not explicitly address the time efficiency for the case alpha = 1, where the random projection matrix is Cauchy (for alpha = 2, it is known that there exist fast transforms that achieve the dimensionality reduction).

- Lastly, it would be good if the authors normalized the x-axis to the same limits in Figs 2 and 3. It is difficult to visualize the empirical tightness of the proposed bounds when the scales of visualization are very different.

--------

Post-rebuttal edit: I would leave my score unchanged. My opinion of the paper continues to be positive, since the ideas are interesting, novel, and might have considerable potential impact. However, I echo several concerns of the other reviewers: (i) The paper needs a careful and thorough re-organization. (ii) The paper needs to address the (fairly substantial) gaps in the narrative, particularly the relationship between the chi-squared similarity and the approximations proposed in Theorem 2 and Lemma 5. (iii) The experimental results need to be re-organized, explained clearly, and made more convincing to a machine learning practitioner. I hope that the authors can address some of these in future revisions.
Q2: Please summarize your review in 1-2 sentences
This is an interesting paper that proposes an efficient method to estimate the chi-squared similarity between high-dimensional vectors by measuring Hamming distances between the signs of their alpha-stable random projections. The paper falls short of establishing a fully rigorous metric equivalence between the original space and its low-dimensional embedding, but the (fairly extensive) empirical evaluations offset this drawback.

Submitted by Assigned_Reviewer_9

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
In this paper, the authors propose fast approaches to compute non-linear kernels between non-negative vectors. They focus in particular on chi_2 kernels.

The paper builds on the known fact that the cosine similarity of two vectors x and y can be recovered as cos(pi P), where P = P(sign(x_j) \ne sign(y_j)) is the collision (mismatch) probability, and x_j and y_j are the dot products of x and y with a common random vector a_j whose coordinates are drawn from a stable distribution with index alpha = 2 (i.e., Gaussian). The question the paper asks is what happens when alpha < 2 and whether one can recover interesting approximations of other kernels. After a motivating example in Section 2 and preliminary results in Section 3, the authors introduce their real aim in Section 4 and provide in Section 5 two formulas to approximate the collision probability as a function of the chi2 similarity.
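For reference, the standard identity behind this statement (for a Gaussian, alpha = 2, projection vector) is:

```latex
\[
\Pr\bigl[\operatorname{sign}(a^{\top}x) \neq \operatorname{sign}(a^{\top}y)\bigr]
  \;=\; \frac{1}{\pi}\cos^{-1}\!\left(\frac{\langle x, y\rangle}{\lVert x\rVert_2\,\lVert y\rVert_2}\right),
  \qquad a_i \stackrel{\text{i.i.d.}}{\sim} N(0,1),
\]
```

so the cosine similarity is recovered as cos(pi P) from the mismatch probability P; the question for alpha < 2 is what similarity, if any, plays the analogous role.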

I liked this paper, and I think it contains sufficient novel material to be a worthwhile addition to NIPS, but I have a few concerns which prevent me from being more positive about it. Basically, the paper is messy. It feels as if the authors themselves have not digested the content they propose and are still hesitating about which form the paper should take. Some expressions such as "(perhaps surprising)", and the simplicity of some of the results (upper bounds), suggest that this is still a work in progress. This makes the paper difficult to read. Here are a few examples:

= the authors motivate their interest in the chi2 similarity by the fact that it is popular for comparing histograms. In that case, why consider time series (UCI-PEMS) of non-normalized measurements as one of their main examples? This is confusing. Besides, that dataset itself is not that large (in the context of what the authors are trying to show) since it only contains 440 points. Finally, the reader will have a hard time trying to understand why the "acos-\chi_2" kernel is introduced in l.82 without more context. I'd suggest dropping Section 2 and focusing more cleanly on experimental results at the end of the paper. As a side note, the accuracies on MNIST seem very low... I wonder what a Hellinger kernel or a Gaussian kernel would do.

= the whole paper is constructed to show how the chi2 similarity can be used to approximate the collision probability. However, if I understand correctly, the natural flow in applications goes the other way round: empirical collision probabilities can be computed, and hence chi2 similarities can be approximated. In that sense, I wonder whether the second approximation of Equation 12 makes any sense in machine learning, unless there is an obvious inversion formula for the arctan integral.
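To illustrate that direction, taking the first approximation discussed in these reviews to be P(sign match) ~ 1 - (1/pi) acos(\rho_{\chi2}), the empirical match rate can be inverted in closed form (the second approximation would instead require a numerical root-finder). A hypothetical sketch, with illustrative helper names:

```python
import numpy as np

def chi2_similarity(u, v):
    """Standard chi-square similarity, sum_i 2 u_i v_i / (u_i + v_i),
    assuming u, v are non-negative and normalized to sum to 1."""
    num, den = 2.0 * u * v, u + v
    mask = den > 0
    return float(np.sum(num[mask] / den[mask]))

def chi2_from_match_rate(p_match):
    """Invert P(match) ~ 1 - (1/pi) * acos(rho_chi2) for an empirical match rate."""
    return float(np.cos(np.pi * (1.0 - p_match)))

# Usage with the earlier sketch:  rho_hat = chi2_from_match_rate(match_fraction(hx, hy))
```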

= the paper reads like a hack in progress. Maybe the explicit layout of an algorithm at the end of what the authors consider their most important contribution could help make the paper more understandable.

= Lemma 4 is provided without a proof, and I don't think one can proceed by analogy with the proof of Lemma 2, as suggested by the authors. In the case of Lemma 2, the correspondence between features and feature products (collisions) and the kernel 1 - (1/pi) acos(p_2) was exact when taking the expectation. There is no such exact formula in the case of p_{\chi_2}, as stated repeatedly by the authors. Therefore, one needs to prove that, because \rho_{\chi_2} is a psd kernel, 1 - (1/\pi) acos(\rho_{\chi_2}) is psd too. This is not "trivial" but must be proved (one concrete route is sketched in the note after this item). Using Corollary 1.14 of Chapter 3 (p. 70) of the Berg/Christensen/Ressel reference (Harmonic Analysis on Semigroups) should suffice, since we only need to show that acos is holomorphic with positive coefficients. The following reference might be a good starting point:
http://mizugadro.mydns.jp/t/index.php/ArcCos#Range_of_holomorphism
At this point, however, Lemma 4 is just a conjecture.
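For reference, one concrete route to the positive-definiteness question raised in this item, assuming the entries of \rho lie in [-1, 1]: since cos^{-1}(t) = pi/2 - sin^{-1}(t),

```latex
\begin{gather*}
1 - \tfrac{1}{\pi}\cos^{-1}(\rho) \;=\; \tfrac{1}{2} + \tfrac{1}{\pi}\sin^{-1}(\rho),
\qquad
\sin^{-1}(t) \;=\; \sum_{n \ge 0} \frac{\binom{2n}{n}}{4^{n}(2n+1)}\, t^{2n+1},
\end{gather*}
```

and since the arcsine series has non-negative coefficients, the Schur product theorem (entrywise powers of a psd matrix are psd), together with the psd constant matrix of 1/2, gives positive definiteness of 1 - (1/pi) acos(\rho) whenever \rho itself is a psd similarity.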

update after rebuttal:

The answer from the authors on the psdness of the acos chi2 did not convince me. I don't understand their proof. The feature representations for the chi2 kernel are, as far as I know, infinite dimensional. Then the authors propose to use random Gaussian approximations on infinite dimensional vectors? I am sorry, I can't follow this. This part needs to be proved more clearly.
Q2: Please summarize your review in 1-2 sentences
This nice paper reads like a mishmash of ideas that try to generalize simhash (alpha=2) to stable distributions with alpha < 2 and see what can be taken out of it. Some of these ideas are promising, but the presentation needs to be improved to be suitable for NIPS.

Submitted by Assigned_Reviewer_10

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper shows that the collision probability after alpha-stable random projections can be approximated by functions of the chi-square similarity. In particular, when alpha = 1, the approximation has a closed form. This result enables approximating chi-square-related kernels via linear kernels after sign Cauchy random projection.

However, the bounds can be loose, and the application part seems rushed and incomplete at the moment.

The paper can be (easily) improved by turning it into a more theoretical paper with only hints at applications. After all, the main contribution is the theoretical part. Despite the looseness of the bounds and the issues with the current application part, the paper is interesting, and may still have value in practice or at least motivate new techniques.

After reading the reviews, author response, and discussion, my score (7) remains unchanged. I prefer a novel though unpolished paper to a polished incremental paper. I hope the writing will be improved in the final version if accepted.

Q2: Please summarize your review in 1-2 sentences
The result is original. The new bounds are a useful addition to the literature and the community.

Submitted by Assigned_Reviewer_11

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes to use a signed Cauchy random projection to approximate the chi-square kernel with relatively low errors. The motivation comes from generalizing the Sim-hash algorithm and noting that chi-square is related to the collision probability.

Overall, the reviewer finds the result quite interesting and innovative, as no previous attempts seem to have linked chi-square with random Cauchy projections. The paper is important for very high-dimensional sparse approximation of the chi-square kernel, which could find use in many text and image applications.

One main concern is that the writing is very bad, almost atrociously so. From the beginning it is extremely unclear what the goal of the paper is. The first theorem is almost irrelevant to the rest, as it gives a bound that is then said not to be tight. Starting from Section 4, the focus turns to the alpha=1 case, where a link between the collision probability and the chi-square similarity is proposed, and later proved for the binary case. The logic seems completely reversed and free-flowing, with occasional theorems that don't lead anywhere (e.g. Theorem 1, and the approximation (12)). The claims in the conclusion look out-of-order and random (e.g. lines 424-427). In general, this looks more like a research memo than a rigorous, well-organized paper. Therefore I'm torn between a score of 5 and 6, since it looks like this paper should be completely rewritten before being presented to readers.

However, the finding itself is still quite interesting. Lemma 5 is a strong result that establishes a low error bound on the approximation in the binary case. The result that one can approximate cos^{-1}(chi^2) by an inner product is new. A less satisfying aspect is the way chi^2 itself is approximated: cos needs to be applied to transform sign(x)*sign(y), so the approximation is no longer an inner product and thus does not enjoy the computational savings in large-scale learning. It would be interesting to see whether the approximated cos^{-1}(chi^2) can be translated into a linear approximation to the more useful chi2 CDF function, or whether more theoretical and empirical results can be found for cos^{-1}(chi^2).
Other notes and relevant references:

Note that when alpha = 1, the bound in Theorem 1 corresponds to the geodesic distance on the simplex, as defined in

Lafferty and Lebanon. Diffusion Kernels on Statistical Manifolds. JMLR 2005.

Taking the square root of all the inputs u, v corresponds to the Hellinger distance, widely used in both NLP and vision but widely known to be inferior to chi-square.

Chi-square is very important, at least in computer vision; quite a few papers have worked on its approximation. There are some additional references that might be interesting for this topic:

Sreekanth, Vedaldi and Zisserman. Generalized RBF feature maps for efficient detection. BMVC 2010

established an approximation for exp-chi2, which has been empirically verified to be much better than the chi2 kernel in classification accuracy. Also note that exp-chi2 resembles the CDF of the chi2 distribution, and hence corresponds to the p-value of the chi2-test; therefore it is more appropriate than chi2 itself as a comparison metric. Also relevant is a recent arXiv paper that provides a linear approximation to chi2 with a geometric convergence rate:

Li, Lebanon and Sminchisescu. A Linear Approximation to the chi2 Kernel with Geometric Convergence.

http://arxiv.org/pdf/1206.4074v3.pdf

which presents a formula to approximate each dimension of the chi2 kernel with several dimensions, with a geometric convergence rate. One can in principle construct an approximation of chi2 on very high-dimensional sparse features by applying a standard Gaussian random projection to such an approximation. It would be interesting to compare these two approaches.

---------------------------------------------------------------------

Comments on Rebuttal:

The point the authors make about applications to streaming data is correct and interesting, and gives some merit to the otherwise unjustified acos(chi2) metric (since it is computable for streams, even if it is not a well-understood metric yet). Therefore I decided to change the score to 6. The main concerns remain: 1) lack of theoretical and/or practical insight into the acos(chi2) kernel; 2) the writing and organization of the paper.

Performance comparisons with the arXiv paper would be an interesting side note, but that was not a factor in the score I gave in the initial review. The score came from the above two concerns.
Q2: Please summarize your review in 1-2 sentences
The paper proposes an approach to approximate the cos^{-1}(chi2) kernel using signed random Cauchy projections. The approach is interesting, but it is unclear what theoretical/empirical meaning the approximated cos^{-1}(chi2) has. The writing is extremely messy, which undermines the score of this paper.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
Reviewer_9:

(1) Lemma 4 is indeed analogous to Lemma 2. Because \rho_{chi2} is PD, we can write it as \rho_{chi2} = sum_i z_i w_i, even though we don't know z, w explicitly. When the two original data vectors are identical, we have \rho_{chi2} = 1, which means sum z_i^2 = sum w_i^2 = 1 (i.e., naturally normalized). At this point, we can apply normal random projections and the rest follows from Lemma 2. There is another perspective: since \rho_{chi2} is PD, there exists a Gaussian process with covariance \rho_{chi2}. The variance turns out to be sum_i u_i = 1, and the rest follows easily. We of course appreciate your kind suggestion about holomorphic functions and we did learn something useful from reading the link. Also, we agree it always helps readers if we provide more details (if space allows).
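In symbols, the argument sketched above (a compressed rendering, not a verified proof):

```latex
\begin{gather*}
\rho_{\chi^2}\ \text{PD} \;\Longrightarrow\; \rho_{\chi^2}(u,v) = \langle z(u),\, w(v)\rangle
  \quad\text{with}\quad \lVert z(u)\rVert_2 = \lVert w(v)\rVert_2 = 1, \\
\Pr[\text{sign collision}] \;=\; 1 - \tfrac{1}{\pi}\cos^{-1}\bigl(\langle z(u), w(v)\rangle\bigr)
  \;=\; 1 - \tfrac{1}{\pi}\cos^{-1}\bigl(\rho_{\chi^2}(u,v)\bigr)
  \quad\text{(Lemma 2 applied to } z, w\text{)}.
\end{gather*}
```

The step Reviewer_9 questions below is whether applying sign Gaussian projections to these (possibly infinite-dimensional) feature maps can be made rigorous.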

(2) In practice, we usually don't explicitly need the numerical values of similarities. Often we just need a good smooth function which is monotone in the similarity. The second approximation is motivated by the analysis of the collision probability on binary data, which is the case we can fully analyze.

(3) For MNIST-small, we are actually impressed by the performance of the chi-square kernel compared to the linear kernel. Note that we only used 1/6 of the original training samples and the entire original testing set. Linear SVM achieved 90% while chi-square kernels achieved about 96%. Like the linear kernel, the chi-square kernel has no additional tuning parameter (like gamma in RBF). This is actually the example which makes us excited about chi-square kernels. And yes, MNIST is closer to vision data than PEMS; for this paper we used them mainly to verify the proposed approximations. We realize we were a bit too enthusiastic about linear SVM, and we agree with you that if we add additional tuning parameters (similar to RBF), the accuracy will be further improved. In industry practice (where the data are not as nice as MNIST), it appears that linear kernels are the most common. The recent blog article "Research Directions for Machine Learning and Algorithms" (May 16, 2011) in the Communications of the ACM (CACM) confirms this industry practice as well.

(4) In addition, we should mention another significant advantage of this method: it naturally works with streaming data. The process of histogram construction is a typical example of Turnstile data streams. Using this method, we don't need to store the original massive data.
Thanks for the careful reading and thoughtful constructive suggestions.

Reviewer_8:

(1) Speedup. Yes, one can improve the processing cost of random projections easily by using "very sparse stable random projections" (Li, KDD07). The projection matrix can be made extremely sparse, especially for \alpha = 1 (see the experiments in the KDD07 paper).

(2) x-axis. If we normalize the x-axis to the same scale, then we won't see anything for \alpha=0.2. As explained in the paper, the experiments cannot cover the entire range of \rho_\alpha. We will try a few things to see which provides the best compromise.
Thanks for pointing this out.

(3) If the collision probability is exact, then the inequality relations (the results people usually present in LSH-type papers) immediately hold. The key in LSH is that the collision probability should be a monotone function of the similarity. Both approximations are monotone functions of \rho_chi2.
Thank you and the other reviewers for pointing out the editing issues.


Reviewer_11:

Thanks for confirming the high importance of chi-square similarity in practice. Note that our method works with data streams (Turnstile model). Histograms are naturally data streams. The other methods you mentioned require storing the original data at some step. Denote the original data matrix by A; then the random projection A*R can be computed incrementally, since (A+dA)*R = A*R + dA*R, meaning that A is never stored (R is re-generated on demand from pseudo-random numbers). This is a significant advantage of linear projections. The methods you mentioned, e.g., arXiv:1206.4074, first apply a nonlinear operation on A and expand the dimension by a factor of 5 to 10 before random projections. Consequently, they are not applicable to data streams. This is a big difference. Also note that our method is very simple. Again, thanks for the references and other suggestions.
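A minimal sketch of the incremental computation described here (illustrative only; class and method names are mine, and in a real Turnstile implementation the needed rows of R would be regenerated on demand from a pseudo-random seed rather than stored):

```python
import numpy as np

class StreamingSignCauchySketch:
    """Maintain A @ R incrementally under Turnstile updates (i, j, delta),
    i.e. A[i, j] += delta, without ever storing A.  Sketch only: R is kept in
    memory here; in practice its rows would be regenerated on demand from a seed."""

    def __init__(self, n_rows, dim, k, seed=0):
        rng = np.random.default_rng(seed)
        self.R = rng.standard_cauchy(size=(dim, k))   # alpha = 1 projections
        self.AR = np.zeros((n_rows, k))

    def update(self, i, j, delta):
        # (A + dA) @ R = A @ R + dA @ R, and dA has a single nonzero entry.
        self.AR[i] += delta * self.R[j]

    def signs(self):
        return np.sign(self.AR)
```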

Reviewer_7:

(1) The bound (Theorem 1) is rigorous and sharp when \alpha > 1.5. In machine learning practice, \alpha can be viewed as a tuning parameter. Our contribution is not limited to \alpha=1.

(2) The use of acos(\rho_2) (sim-hash) is standard practice in search and learning when using sign normal random projections (which is very influential work). For near neighbor search, it is actually difficult to find an alternative method which works much better than sim-hash. This is why we are enthusiastic about the use of acos(\rho_chi2) in practice (which can be approximated by a very efficient linear kernel).

(3) For binary data, we analytically showed that the max error is <0.01919. Binary data are important in practice especially when data are high-dimensional.

(4) For non-binary data, we have extensive experiments to verify the approximations.

(5) Our method naturally works with data streams.
This is the first paper on sign stable random projections. We expect quite a few related papers (practical or theoretical) can be written in the future (by other researchers).

Reviewer_10: Thanks for the suggestions. It is fun reading so many constructive comments.