NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 4511 Fast and Accurate Least-Mean-Squares Solvers

### Reviewer 1

To the best of my knowledge, the paper is the first to introduce an *exact* coreset/sketch construction for linear regression problems (i.e. there is no loss in quality by using the coreset/sketch instead of the original input data). In contrast, other work, e.g. [13] (see paper), focuses on coresets/sketches for which one can only obtain approximation guarantees. To achieve this result, the paper combines known results, but in a non-trivial non-obvious way. The paper gives proves for all theorems (in the appendix); these proofs are easy to follow. Also the overview on the contributions, related work, and the high-level overview on the approach are useful. In the experiments, the contribution of the paper helps to speedup known LMS solvers in the scikit-learn library by a large factors. Hence, the contribution of the paper adds a real measurable value for practical applications. However, sometimes the descriptons are a bit hard to follow and non-straightforward. Especially the "weighted set" vs. matrix notation is a bit hard to follow: - The definition "P={p_1,..,p_n} is an ordered finite set" is strange. The given set might very well be a multiset (and ordered), so a tuple notation like (p_1,...,p_n) might be clearer. - In Algorithm 3, "C in R^{...} a weighted subset of rows in A", matrices are interpreted as sets. How about "rows of C ... are linear combinations of rows in A" (to avoid the "ordered multi-set" problem alltogether)? - In Algorithm 2, "S in R^{...} whose union of rows is a weighted subset of A". It's not a weighted subset. Its rows coincide with a certain subset of rows of A that have been scaled. (For people that are used to coresets/sketches in the context of clustering this can be very confusing). Some more minor remarks: - [8] is mentioned Theorem 8.8 directly, while [27] is mentioned separately in the text twice - but it seems [27] explains Theorem 2.2 as well? - "Then (C,u) is a Caratheodory set of (P,w) that can be computed in time..." - The given algorithm has this running time, not some other algorithm that can compute (C,u) - wikipedia as a reference... - In l. 95, "linear in the input O(nd) for asymptotically large n" is a weird way of ignoring the dependency on d, better write "linear in n and polynomial in d". Moreover, "only log(n) calls to an LMS solver" is not so meaningful because it's unclear to what the LMS solver is applied. - I think it's nice that the paper does use O-notation correctly but the explicit remark in l. 141 seems unnecessary since it's only about one "\in" instead of "=" (and all the theorems the use abbreviations anyway ("comuted in time O(...)" = {...}=set) - The font size used in the figures in Figure 2 are way too tiny. The results shown in Figures 2(r) and 2(s) are nice (if you can zomm ;)) - but it's aweful how the sub-figures have been pasted together (and again, how tiny the fonts are) - l. 212 "time for .. becomes neglected" - l. 215 "is much significant" Some typos: - l. 43: "this errors increases" - l. 178 "from [the] previous section" - l. 68 "on the web T" - l. 199 "the common Python's solvers" - l. 130 "a sufficiently coreset"

### Reviewer 2

This is a groundbreaking result in the area of sketch and coreset. A large number of existing algorithms in sketch and coreset are randomized and approximate, which prevents those algorithms from being used in practice. The current paper presents a new deterministic coreset which receives a matrix A with n rows and d columns, and outputs a matrix A' with d^2 + 1 rows and d columns, such that rows of A' are scaled rows of A and A^TA equals A'^TA' *with no approximation*. Caratheodory's Theorem in fact implies the existence of such matrix A' and an inefficient algorithm, and the main contribution of this paper is a new practical algorithm which calculates the Caratheodory set with much faster running time. The new algorithm is elegant yet very interesting. Technically, the algorithm first partitions the given data points into k clusters and then calls the slow algorithm on k data points which correspond to the weighted mean of these k clusters. Using the output of the slow algorithm, by appropriately reweighting the data points, one can reduce the number of data points from n to O(n*k/d), at which point one can recursively call the algorithm. Thus, the algorithm terminates after log n / log (k / d) recursive calls. Another advantage of this paper is its strong experiment section. The discussion and plots presented in this section clearly demonstrate the practicality of the proposed coreset. In summary, this is a strong paper with a novel coreset construction for least square regression. The paper is also well-written. I will fight for accepting this paper. Minor Comments 1. It might be worth mentioning that a coreset of size Omega(d^2) is in fact the best one can hope for if the covariance matrix needs to be preserved exactly. This can be seen when A is the edge-vertex incidence matrix of a complete graph. 2. It might be better to add more details about Algorithm 8 and provide a full proof for readers who are not familiar with Caratheodory's Theorem. 3. Line 77: which lower bound in [13] are you referring to?

### Reviewer 3

Originality - The idea of evolving core-sets in order to improve running time of linear solvers is new to me. The paper introduce the idea of core-sets, which by the review has been studied before, but on relatively small scales. The work boost the performance of existing linear solvers using new methods Clarity - The submission is very well written and organized. there are some typo's (lines 47 , 68 , 84 , 185 ) some of them are important for the consistency of the math in the paper. The review and the motivation is clear and helpful. proofs are well written. The paper provides enough details to reproduce the results. Section 1.3 is exceptional. It suppose to give a good intuition for the framework but this part is not formal, and harder to understand. e.g. line 117 "a pair of coreset and sketch construction scheme" did not understand what type of input to expect , line 119 "faster", compare to what? it was unclear. Section 2 made everything clear and well defined. So maybe section 1.3 could be left out. Quality - Claims are supported. The paper misses mentioning the ratio needed between sample size n , a sample dimension d, and parameter k, so the reader could completely understand the running time of algorithm 2. Empirical evaluation demonstrate the practical potential of their method. Significant - This work has practical importance. practitioners are likely to easily implement suggested method to improve existing tools. Theoretical aspect could be motivated to other boosting method for improving running time for linear solvers. Or investigate the importance of Coresets for other use. === After reading authors feedback and other reviews indicating this work is first to offer an accurate solution to the problem and a, I tend to put even higher score on this paper. The paper has great significant. Clear accept.