Paper ID: 382
Title: Compressive spectral embedding: sidestepping the SVD
Current Reviews

Submitted by Assigned_Reviewer_1

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
A new embedding method that does not require the SVD is proposed, using ideas from the compressed sensing literature. The manuscript makes a good case for the intuition and backs it up with analysis. Experiments on two datasets are provided.

This method gives access to the raw embedding, which many alternative methods do not. However, it is not clear whether the resulting embedding has the nice orthogonal structure of the SVD. This property is not crucial for applications that only require pairwise distances for inference, but it is crucial for applications with more involved analysis.

Comparing the quality of the results with those of the true SVD is a good experiment. However, independent experiments showing the running time of the proposed algorithm on large-scale problems would have helped the reviewer appreciate the method, as would comparisons with a few other methods that also side-step the SVD. The reviewer understands that other methods may not offer the same flexibility as the proposed method, but running time is one of the crucial factors here.

The manuscript is well written in terms of providing intuition and insight for the proposed algorithm. However, in its current form the manuscript has many typos and the notation is not consistent. In places it is very confusing whether A is a data matrix, i.e., rows are data instances and columns are the high-dimensional feature representation, or a matrix of pairwise distances between instances, as in eq. (1).

The SVD of large matrices is a bottleneck in various large-scale analyses. A new method that avoids the SVD but still provides access to the embedding, and not just to pairwise distances, is a significant contribution.

Q2: Please summarize your review in 1-2 sentences
The paper proposes a new method for spectral embedding and dimensionality reduction that avoids the expensive SVD. In particular, the proposed method develops the embedding algorithm using concepts from the compressed sensing literature -- finite-order polynomial expansions of matrices. The paper is well written and describes the algorithm at length, along with a performance analysis w.r.t. various parameters.

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors propose a novel algorithm for computing "compressive embeddings" of a matrix, a task most often done via approximate partial SVD. They provide detailed motivation and background for the problem, reviewing applications in graph processing, NLP, etc. The proposed algorithm "side-steps" the SVD by leveraging the JL lemma to compute approximate pairwise distances, and polynomial approximations to incorporate approximate functions of the singular values (e.g., thresholds). Some implementation details are highlighted.

Overall, I think this is a very good paper that presents a simple, fast, effective, and (to my knowledge) novel algorithm for computing compressive embeddings, which is likely to be superior to Randomized SVD and pure random projections for many applications. The presentation is also relatively clear, if dense, despite a few minor errors and formatting issues (outlined below).

One issue is that the authors do not flesh out a detailed story of how the performance changes with the embedding dimension and with the types of polynomials that might be approximated. This is difficult due to space constraints, but especially for the former it would have been helpful to have a plot relating accuracy to dimension (not just a comparison between two dimension sizes) that essentially provides an empirical view of Theorem 1. Moreover, since the top 500 eigenvalues were computed, \delta could be reasonably approximated, so an empirical vs. theoretical expectation plot is possible.

Related to the above, more concrete recommendations from the authors on how to choose the embedding dimension and the eigenvalue thresholds would be helpful. It is also somewhat misleading that the authors set the polynomial approximations given knowledge of the first 500 eigenvalues; a discussion of what to do when such prior computations are infeasible would be helpful.

Minor issues:

- \cite{foo}\cite{bar} -> \cite{foo,bar} everywhere
- line 73: point (a) is unclear
- line 113: iteration\cite{foo} -> iteration \cite{foo}
- line 237: "of good" -> "of a good"
- line 340: "(=" -> "="
- capitalize Euclidean, Hermitian, etc.

Q2: Please summarize your review in 1-2 sentences
A well-written, compelling paper. The proposed algorithm is insightful and likely to have an impact.

Submitted by Assigned_Reviewer_3

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors target various interesting machine learning tasks, and avoiding the expensive SVD is the main selling point of the paper. The authors claim that the computational complexity of the algorithm is O(min(m,n)(T+m+n)log(m+n)) and that parallel computation improves the cost to O((T+m+n)log(m+n)). I am curious how parallel SVD would do; a careful comparison with a sparse parallel SVD would help convince readers of the method's superiority. I am also curious about the computational cost associated with the Legendre polynomials when choosing the polynomial in the algorithm.

In the experiments section, it is surprising to me that changing d dramatically does not affect the results much. Is there an intuition for why this happens?

Some minor points: (1) There are a few typos in the paper, especially in the experiments section.

(2) The notation could be more precise, for example the definitions of X and f(A). Although I understand what the authors mean, it could be confusing for readers who are not familiar with these topics.

It would be nice if the authors could address these questions.
Q2: Please summarize your review in 1-2 sentences
The paper computes the embedding of a matrix for spectral methods without using the SVD. Theoretical bounds on the projection dimension and the polynomial approximation are provided. However, I am not sure about the originality of the paper, and improvements to the experiments section are needed to convince me that the algorithm is advantageous.

Author Feedback
Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 5000 characters. Note however, that reviewers and area chairs are busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
REVIEWER 1

- Orthogonality is exactly what we sacrifice in our compressive representation (e.g., we used d = 40 & 80 dimensional embeddings to capture up to K = 500 eigenvectors in our example).

- On comparisons with other algorithms: We have compared our algorithm with exact SVD and the recently proposed approximate Randomized SVD algorithm. Randomized SVD is an element-by-element approximation, and retains more of the noise spectrum (small eigenvalues), thus degrading inference quality. Except for Spielman and Teng's theoretical work on commute time embeddings (references [21]-[22]), we are not aware of any other work on spectral embeddings without computing the spectrum itself. However, the Spielman-Teng algorithm is restricted to normalized adjacency matrices with a specific weighting function, so that our algorithm represents a significant generalization.

- We realize reusing the symbol A to represent symmetric matrices (eigenvector embeddings) and feature vector matrices (typically not symmetric; SVD embeddings) can cause confusion. The construction of the input matrix depends on the modeler: it can be a symmetric matrix of pairwise similarities constructed from feature vectors (sparsified by truncating small values), or the feature vectors themselves stacked into a matrix. For the latter, the modification in Section 3.5 can be used to compute SVD embeddings. We will make sure we carefully distinguish these cases in the revised paper.

We will review the manuscript closely for typos.

REVIEWER 2

- To illustrate the dependence of performance on the embedding dimension d, we will replace Figure 1(a) with one plotting the maximum and average distortion of pairwise distances versus d. CLT-based intuition (see response to Reviewer 3) indicates that distortion should scale as 1/sqrt(d), but the JL-lemma-based estimates in Theorem 1 are conservative. A comprehensive study of such scaling falls beyond our present scope.

- Regarding the impact of polynomial approximations: they apply to any spectral weighting function, but the smoother the function, the smaller the order L needed for a target \delta. Complexity grows as O(L(T+m+n)log(m+n)), so smooth functions lead to considerable savings (see the sketch below).
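To make this trade-off concrete, here is a rough, illustrative sketch (not taken from the paper; the logistic "soft threshold" is an arbitrary choice): it fits Legendre series of increasing order L to a smooth weighting function and to a hard threshold on [-1, 1], and the smooth function reaches a small uniform error at a much lower L.

    # Illustrative only: Legendre-series order L needed for a smooth spectral
    # weighting versus a hard threshold on [-1, 1].
    import numpy as np
    from numpy.polynomial import legendre as leg

    x = np.linspace(-1, 1, 2001)
    smooth = 1.0 / (1.0 + np.exp(-20 * (x - 0.5)))    # smooth "soft threshold"
    hard = (x >= 0.5).astype(float)                   # discontinuous threshold

    for L in (5, 10, 20, 40):
        err_smooth = np.max(np.abs(leg.legval(x, leg.legfit(x, smooth, L)) - smooth))
        err_hard = np.max(np.abs(leg.legval(x, leg.legfit(x, hard, L)) - hard))
        print(f"L={L:2d}  max error: smooth {err_smooth:.3f}, hard {err_hard:.3f}")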

- Regarding knowledge of the top 500 eigenvalues used in our example: this is not required in practice. The example was intended to illustrate that we can accurately capture a desired number of eigenvalues, and we chose the top 500 because they could be computed within a reasonable time budget.

- In practice, we would not know how many eigenvectors are needed. We recommend choosing the embedding dimension and eigenvalue threshold based on problem structure (e.g., for PCA, we can set the threshold based on the noise variance), or based on downstream inference performance. Increasing the embedding dimension incrementally until such criteria are met is easy: since each dimension is independent, we can add dimensions simply by running the algorithm on additional random starting vectors (sketched below).
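As a hedged sketch of this incremental strategy (the helper apply_weighted_A below is hypothetical, standing in for the core routine that applies the polynomial of A to a vector; this is not our released code): since every coordinate comes from an independent random sign vector, new coordinates can simply be computed and appended.

    # Hedged sketch: growing the embedding dimension incrementally.
    # `apply_weighted_A` is a hypothetical callable returning p(A) x via sparse
    # matrix-vector products, where p approximates the spectral weighting f.
    import numpy as np

    def new_coordinates(apply_weighted_A, n, extra_dims, rng):
        """Compute extra_dims independent embedding coordinates."""
        cols = [apply_weighted_A(rng.choice([-1.0, 1.0], size=n))
                for _ in range(extra_dims)]
        return np.column_stack(cols)

    def grow_embedding(embedding, apply_weighted_A, n, extra_dims, rng):
        """Append new coordinates until downstream criteria are met.
        (Any 1/sqrt(d) normalization is reapplied when distances are formed.)"""
        extra = new_coordinates(apply_weighted_A, n, extra_dims, rng)
        return extra if embedding is None else np.hstack([embedding, extra])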

- If we do need an approximation based on the first k eigenvectors, we can set the corresponding eigenvalue threshold using the CDF of the eigenvalues, estimated using techniques similar to ours in [24, 25].

We will summarize the preceding discussion in the revised paper, and also fix the typos pointed out by the reviewer (thank you for your careful reading).

REVIEWER 3

- Actually, we claim that the algorithm complexity is O(L(T+m+n)log(m+n)), which can be reduced to O(L(T+m+n)) with parallelism, where L is the order of the polynomial approximation of f(.).

- Parallelism for speeding up matrix-vector products applies to our algorithm and to existing SVD techniques in equal measure (and we use it in the Python implementation of our algorithm that we plan to release). In terms of space, exact SVD methods (even those for sparse matrices) cannot use less than min(m,n) times the number of eigenvectors, since that is the size of the output they report. For the same reason, their complexity cannot be smaller than O(number of eigenvectors * min(m,n)), as we note in the paper. In our running-time comparison, we do use sparse parallel eigensolver routines (available via MATLAB).

- Our algorithm's complexity scales linearly with the order L of the polynomial approximation of the spectral weighting function. Smoother functions can be approximated by lower-order polynomials and therefore require less computation. The computations in the algorithm exploit the recursive structure of the Legendre polynomials (see the sketch below).
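For concreteness, the minimal sketch below (illustrative, not our released implementation) shows how the three-term recurrence (k+1) P_{k+1}(t) = (2k+1) t P_k(t) - k P_{k-1}(t) lets an order-L Legendre series of A act on a vector using only L matrix-vector products. It assumes the spectrum of A has already been mapped into [-1, 1] and that the series coefficients c[0..L] of the weighting function are given.

    # Minimal sketch: evaluate an order-L Legendre series of a (pre-scaled)
    # matrix A applied to a vector x using only matrix-vector products.
    # A may be a scipy.sparse matrix; only A @ v is needed. Assumes len(c) >= 2.
    import numpy as np

    def legendre_matvec(A, x, c):
        p_prev = np.asarray(x, dtype=float)   # P_0(A) x
        p_curr = A @ p_prev                   # P_1(A) x
        result = c[0] * p_prev + c[1] * p_curr
        for k in range(1, len(c) - 1):
            # (k+1) P_{k+1} = (2k+1) A P_k - k P_{k-1}
            p_next = ((2 * k + 1) * (A @ p_curr) - k * p_prev) / (k + 1)
            result = result + c[k + 1] * p_next
            p_prev, p_curr = p_curr, p_next
        return result                         # ~ f(A) x for the approximated f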

- Rough intuition on the effect of the embedding dimension d can be obtained from the Central Limit Theorem: the error in each squared distance is approximately Gaussian with variance proportional to 1/d (across multiple runs of this randomized algorithm), so the distortion scales as 1/sqrt(d). Since correlations are linearly related to squared distances, we expect their distortion to also scale as 1/sqrt(d) (i.e., a relatively slow decay with d); a quick simulation illustrating this is sketched below.
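A quick numerical check of this intuition (a generic random-sign projection of synthetic points, not the full compressive embedding pipeline) is given below; the mean relative distortion of squared distances tracks the 1/sqrt(d) rate.

    # Quick check of the 1/sqrt(d) intuition: distortion of pairwise squared
    # distances under a random +/- 1/sqrt(d) projection of generic points
    # (a plain JL sketch, not the full compressive embedding algorithm).
    import numpy as np

    rng = np.random.default_rng(0)
    n, D = 200, 1000
    X = rng.standard_normal((n, D))           # synthetic high-dimensional points

    def pairwise_sq_dists(Y):
        sq = np.sum(Y * Y, axis=1)
        return sq[:, None] + sq[None, :] - 2 * Y @ Y.T

    true = pairwise_sq_dists(X)
    mask = ~np.eye(n, dtype=bool)             # ignore the zero diagonal

    for d in (20, 40, 80, 160, 320):
        P = rng.choice([-1.0, 1.0], size=(D, d)) / np.sqrt(d)
        approx = pairwise_sq_dists(X @ P)
        rel_err = np.abs(approx - true)[mask] / true[mask]
        print(f"d={d:4d}  mean relative distortion {rel_err.mean():.3f}"
              f"  (compare 1/sqrt(d) = {1/np.sqrt(d):.3f})")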