NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID:2652
Title:Optimal Sketching for Kronecker Product Regression and Low Rank Approximation

Reviewer 1

***** Post Rebuttal ****** The authors response did not change my evaluation. I strongly suggest to provide better motivation for the problem, and also discuss alternative approaches (e.g., first-order method)s. ***************************** The paper studies efficient algorithms for solving Kronecker Product Regression problems. These are interesting problems, since explicitly forming the regression problem and solving it requires typically huge amount of memory and runtime, but on the otherhand, the regression matrix is highly-structured and hence there is hope of having efficient solvers which do not require to explicitly "write down" the problem. Concretely, given matrices A_1,...A_q, each of dimension n_ixd_i, we want to solve L_p regression (1<=p<=2) with the matrix being the Kronecker product of A_1,...A_q (this is a matrix of dimension n_1n_2...n_qxd_1d_2...d_q) and the vector of outcomes b is of the corresponding dimension (n_1n_2...n_q). A recent result (DSSW18]) showed how to solve the most interesting (to my taste) case of $p=2$ with runtime that depends only on the sum of non-zero entries in the matrices A_1,...A_q (which is naturally unavoidable) and the number of non-zeros in the outcome vector b (which might be avoidable). Since b will typically be dense, this runtime is not ideal because of the huge dimension of b. Thus, the main question answered in the paper is to obtain an improved result of runtime that only depends on the overall number of non-zero entries in the input matrices A_1,...A_q. The authors provide improved results also for the case p<2 (although the improvement is less dramatic - still depends on nnz(b)) and for some related problems such as the-all-pairs regression problem. On the technical side, the authros claim to develop and use novel techniques for sampling from Kronecker products to construct a compact sketch and then solve it via standard methods (SVD for p=2). The authors also conduct empirical tests showing their method improves in runtime over the previous work. While I am certainly not an expert on subject (e.g., sketching techniques used), I think that from an algorithmic point of view, the problem is indeed interesting and challenging, and it seems the authors advance the state-of-the-art with a solid and clear contribution. Some issues: 1. from the references provided, it is not clear how much such regression problems are indeed interesting in terms of applications (or are they just theoretically interesting?) 2. In principle we can solve such problems with first-order methods for convex optimization. Is it clear that something like SGD are computing gradients in a "smart way" cannot give something better than explicitly forming the problem? I think such a discussion makes sense.

Reviewer 2

This paper gives randomized algorithms for Kronecker product regression problems. The claimed run-time of the method is a slight improvement over the state of the art in randomized algorithms. They consider L2 and Lp regression for p in [1,2]. The method is based on obtaining leverage scores of the Kronecker factors. The key observation is that approximate leverage score distribution of the factors yield an approximate leverage score distribution of the product. The algorithm subsamples the matrix proportional to the leverage scores and solves the smaller linear system via classical methods. The algorithm utilizes input sparsity time leverage score approximation methods based on the work by Clarkson and Woodruff, 2009. They also extend the result to Lp regression using p-stable random variables. For L2 regression, the run-time of the proposed method does not depend on the number of nonzeros of the right-hand-side vector, which is an improvement over previous run-time bounds when the vector is dense. Although the contributions appear to be sufficiently novel. I have some concerns on the hidden constants and the practicality of the method. Please see below for detailed comments. (1) In Theorem 3.1 and it's proof, is m=O(1 / (\delta \epsilon^2)) or m=O(d / (\delta \epsilon^2)) ? It looks like a factor of d is missing. (2) The run-time guarantee of Theorem 3.1 involves an unknown constant in the O(1) term, in O( \sum nnz(A_i) + (dq/\delta\epsilon))^(O(1)). Is there a bound on this term ? (3) The authors need to motivate problems involving extremely rectangular, i.e., n>> d, Kronecker product matrices. Otherwise, the run-time involving d^O(1) is not smaller compared to direct methods. (4) The authors claim that O(nnz(b)) is larger than O(\sum_i nnz(A_i)) to establish a faster run-time. Is this realistic ? This also needs to motivated. (5) Conjugate Gradient algorithm on well-conditioned linear systems Ax=b run in O(nnz(A)) time. Is there a similar guarantee for well-conditioned Kronecker product systems ?

Reviewer 3

The improvement of the complexity results stems from tensor reshaping described in Lemma B.2 so that the sketching for tensor can be reduced to that for matrices. It follows closely the work by Diao et al. AISTATS 2018. However, it seems non-trivial to piece together different components with details for achieving the results. Also, extensions with analysis to other related problems are considered. It would be good if experiments on real data can be given. Missing reference which might be related Li, "very sparse stable random projections", SIGKDD 2007