{"title": "Fast and Accurate Least-Mean-Squares Solvers", "book": "Advances in Neural Information Processing Systems", "page_first": 8307, "page_last": 8318, "abstract": "Least-mean squares (LMS) solvers such as Linear / Ridge / Lasso-Regression, SVD and Elastic-Net not only solve fundamental machine learning problems, but are also the building blocks in a variety of other methods, such as decision trees and matrix factorizations.\r\n\r\nWe suggest an algorithm that gets a finite set of $n$ $d$-dimensional real vectors and returns a weighted subset of $d+1$ vectors whose sum is \emph{exactly} the same. The proof of Caratheodory's Theorem (1907) computes such a subset in $O(n^2d^2)$ time and is thus not used in practice. Our algorithm computes this subset in $O(nd)$ time, using $O(\log n)$ calls to Caratheodory's construction on small but "smart" subsets. This is based on a novel paradigm of fusion between different data summarization techniques, known as sketches and coresets.\r\n\r\nAs an example application, we show how it can be used to boost the performance of existing LMS solvers, such as those in the scikit-learn library, up to x100. 
Generalization for streaming and distributed (big) data is trivial.\r\nExtensive experimental results and complete open source code are also provided.", "full_text": "Fast and Accurate Least-Mean-Squares Solvers

Alaa Maalouf∗    Ibrahim Jubran∗    Dan Feldman
Alaamalouf12@gmail.com    ibrahim.jub@gmail.com    dannyf.post@gmail.com
The Robotics and Big Data Lab, Department of Computer Science, University of Haifa, Haifa, Israel

Abstract

Least-mean squares (LMS) solvers such as Linear / Ridge / Lasso-Regression, SVD and Elastic-Net not only solve fundamental machine learning problems, but are also the building blocks in a variety of other methods, such as decision trees and matrix factorizations.
We suggest an algorithm that gets a finite set of n d-dimensional real vectors and returns a weighted subset of d + 1 vectors whose sum is exactly the same. The proof of Caratheodory's Theorem (1907) computes such a subset in O(n²d²) time and is thus not used in practice. Our algorithm computes this subset in O(nd) time, using O(log n) calls to Caratheodory's construction on small but "smart" subsets. This is based on a novel paradigm of fusion between different data summarization techniques, known as sketches and coresets.
As an example application, we show how it can be used to boost the performance of existing LMS solvers, such as those in the scikit-learn library, by up to x100. Generalization for streaming and distributed (big) data is trivial. Extensive experimental results and complete open source code are also provided.

1 Introduction and Motivation

Least-Mean-Squares (LMS) solvers are the family of fundamental optimization problems in machine learning and statistics that include linear regression, Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Lasso and Ridge regression, Elastic net, and many more [17, 20, 19, 38, 43, 39, 37]. See the formal definition below. 
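The d + 1-point subset promised in the abstract can be checked numerically via the classic constructive proof (the slow O(n²d²) route that the paper later replaces). The following is a minimal numpy sketch; the function name and tolerances are ours, not the paper's implementation:

```python
import numpy as np

def caratheodory(P, u):
    """Reduce (P, u) to at most d + 1 weighted points with the same weighted
    mean, following the classic constructive proof of Caratheodory's Theorem.
    P: (n, d) array of points; u: (n,) nonnegative weights summing to 1."""
    P, u = np.asarray(P, float), np.asarray(u, float)
    while len(u) > P.shape[1] + 1:
        # Find alpha != 0 with sum(alpha) = 0 and alpha @ P = 0: a null-space
        # vector of the (d + 1) x n matrix [P^T; 1].
        M = np.vstack([P.T, np.ones(len(u))])
        alpha = np.linalg.svd(M)[2][-1]      # right-singular vector of the smallest singular value
        if alpha.max() < 1e-12:              # ensure some strictly positive entries
            alpha = -alpha
        pos = alpha > 1e-12
        lam = np.min(u[pos] / alpha[pos])
        u = u - lam * alpha                  # zeroes out at least one weight
        keep = u > 1e-12
        P, u = P[keep], u[keep]
        u = u / u.sum()                      # guard against floating-point drift
    return P, u
```

Each pass removes at least one point while preserving the weighted mean and the total weight, so O(n) passes of O(nd²) work reproduce the classic bound.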
First closed form solutions for problems such as linear regression were published by e.g. Pearson [33] around 1900 but were probably known before. Nevertheless, today they are still used extensively as building blocks in both academia and industry for normalization [27, 23, 3], spectral clustering [34], graph theory [42], prediction [11, 36], dimensionality reduction [26], feature selection [16] and many more; see more examples in [18].
A Least-Mean-Squares solver in this paper is an optimization problem that gets as input an n × d real matrix A, and another n-dimensional real vector b (possibly the zero vector). It aims to minimize the sum of squared distances from the rows (points) of A to some hyperplane that is represented by its normal, a vector of d coefficients x, which is constrained to be in a given set X ⊆ R^d:

    min_{x ∈ X} f(‖Ax − b‖₂) + g(x).    (1)

Here, g is called a regularization term. For example: in linear regression X = R^d, f(y) = y² for every y ∈ R and g(x) = 0 for every x ∈ X. In Lasso, f(y) = y² for every y ∈ R and g(x) = α·‖x‖₁ for every x ∈ R^d and α > 0. Such LMS solvers can be computed via the covariance matrix A^T A. For example, the solution to linear regression of minimizing ‖Ax − b‖₂ is (A^T A)⁻¹A^T b.

∗These authors contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Related work

While there are many LMS solvers and corresponding implementations, there is always a trade-off between their accuracy and running time; see the comparison table in [5] with references therein. The reason is related to the fact that computing the covariance matrix of A can be done in essentially one of two ways: (i) summing the d × d outer products a_i a_i^T of the rows a_i^T of A over every i, 1 ≤ i ≤ n (this works since A^T A = Σ_{i=1}^n a_i a_i^T), or (ii) a factorization of A, e.g. using SVD or the QR decomposition [17].
Numerical issues. Method (i) is easy to implement for streaming rows of A by maintaining only the d² entries of the covariance matrix for the n vectors seen so far, or maintaining its inverse (A^T A)⁻¹ as explained e.g. in [18]. This takes O(d²) time for each vector insertion and requires O(d²) memory, which is the same as the desired output covariance matrix. However, every such addition may introduce another numerical error which accumulates over time. This error increases significantly when running the algorithms using 32-bit floating point representation, which is common for GPU computations; see Fig. 2v for example. This solution is similar to maintaining the set of d rows of the matrix DV^T, where A = UDV^T is the SVD of A, which is not a subset of the original input matrix A but has the same covariance matrix A^T A = V D²V^T. A common problem is that to compute (A^T A)⁻¹, the matrix A^T A must be invertible. This may not be the case due to numerical issues. In algorithms such as Lasso, the input cannot be a covariance matrix, but only a corresponding matrix whose covariance matrix is A^T A; such a matrix can be computed via the Cholesky decomposition [6], which returns a left (lower) triangular matrix for the given covariance matrix A^T A. However, the Cholesky decomposition can be applied only to positive-definite matrices, which may fail to hold even for small numerical errors that are added to A^T A. See Section 4 for more details and empirical evidence.
Running-time issues. Method (ii) above utilizes factorizations such as the SVD, A = UDV^T, to compute the covariance matrix via A^T A = V D²V^T, or the QR decomposition A = QR to compute A^T A = R^T Q^T Q R = R^T R. This approach is known to be much more stable. 
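The float32 accumulation problem described under "Numerical issues" is easy to reproduce. The snippet below is our illustration (not the paper's code): it maintains A^T A as a running sum of outer products, method (i), in float32 as is common on GPUs, and compares against a float64 reference:

```python
import numpy as np

# Illustrative only: stream the rows of A and accumulate A^T A in float32.
rng = np.random.default_rng(1)
n, d = 20_000, 5
A = rng.random((n, d)) * 1e3          # large entries amplify rounding error

cov32 = np.zeros((d, d), dtype=np.float32)
for row in A.astype(np.float32):      # one O(d^2) update per streamed row
    cov32 += np.outer(row, row)       # each addition may round, and errors accumulate

cov64 = A.T @ A                        # float64 reference
rel_err = np.abs(cov32 - cov64).max() / np.abs(cov64).max()
print(f"max relative error of the float32 running sum: {rel_err:.1e}")
```

Method (ii)'s factorizations avoid this accumulation, at the price of the larger constants discussed next.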
However, it is much more time consuming: while in theory the running time is O(nd²) as in the first method, the constants that are hidden in the O(·) notation are significantly larger. Moreover, unlike Method (i), it is impossible to compute such factorizations exactly for streaming data [8].
Caratheodory's Theorem [7] states that every point contained in the convex hull of n points in R^d can be represented as a convex combination of a subset of at most d + 1 points, which we call the Caratheodory set; see Section 2 and Fig. 1. This implies that we can maintain a weighted (scaled) set of d² + 1 points (rows) whose covariance matrix is the same as that of A, since (1/n) Σ_i a_i a_i^T is the mean of n matrices and is thus in the convex hull of their corresponding points in R^(d²); see Algorithm 2. The fact that we can maintain such a small-sized subset of points, instead of updating linear combinations of all the n points seen so far, significantly reduces the numerical errors as shown in Fig. 2v. Unfortunately, computing this set via Caratheodory's Theorem takes O(n²d²) or O(nd³) time via O(n) calls to an LMS solver. This fact makes it impractical to use in LMS solvers, as we aim to do in this work, and may explain the lack of software or source code for this algorithm on the web.
Approximations via Coresets and Sketches. In recent decades numerous approximation and data summarization algorithms were suggested to approximate the problem in (1); see e.g. [13, 21, 9, 30] and references therein. One possible approach is to compute a small matrix S whose covariance S^T S approximates, in some sense, the covariance matrix A^T A of the input data A. The term coreset is usually used when S is a weighted (scaled) subset of rows from the n rows of the input matrix. The matrix S is sometimes called a sketch if each row in S is a linear combination of few or all rows in A, i.e. S = W A for some matrix W ∈ R^(s×n). However, those coresets and sketches usually yield only (1 + ε)-multiplicative approximations of ‖Ax‖₂² by ‖Sx‖₂², where the matrix S has (d/ε)^O(1) rows and x may be any vector, or the smallest/largest singular vector of S or A; see lower bounds in [14]. Moreover, a (1 + ε)-approximation of ‖Ax‖₂² by ‖Sx‖₂² does not guarantee an approximation to the actual entries or eigenvectors of A by S, which may be very different.
Accurately handling big data. The algorithms in this paper return accurate coresets (ε = 0), which is less common in the literature; see [22] for a brief summary. These algorithms can be used to compute the covariance matrix A^T A via a scaled subset of rows from the input matrix A. Such coresets support an unbounded stream of input rows using memory that is sub-linear in their size, and also support dynamic/distributed data in parallel. This is by the useful merge-and-reduce property of coresets that allows them to handle big data; see details e.g. in [4]. Unlike traditional coresets that pay additional logarithmic multiplicative factors due to the usage of merge-reduce trees and increasing error, the suggested weighted subsets in this paper do not introduce additional error to the resulting compression since they preserve the desired statistics accurately. The actual numerical errors are measured in the experimental results, with analysis that explains the differences.
A main advantage of a coreset over a sketch is that it preserves the sparsity of the input rows [15], which usually reduces the theoretical running time. 
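A sketch in the sense just defined, S = W A, can be built with a random Gaussian W (a hypothetical choice picked here for illustration; the paper pursues exact subsets instead). It preserves ‖Ax‖₂² only up to a multiplicative factor:

```python
import numpy as np

# S = W A for a random (s x n) matrix W: far fewer rows than A, but only an
# approximate covariance, unlike the exact (eps = 0) subsets of this paper.
rng = np.random.default_rng(2)
n, d, s = 10_000, 10, 400
A = rng.standard_normal((n, d))
W = rng.standard_normal((s, n)) / np.sqrt(s)   # scaled so E[S^T S] = A^T A
S = W @ A                                      # s rows instead of n

x = rng.standard_normal(d)
ratio = np.linalg.norm(S @ x) ** 2 / np.linalg.norm(A @ x) ** 2
print(ratio)   # close to, but in general not exactly, 1
```

Note also that each row of S mixes many rows of A, so sparsity of the input is destroyed, in contrast to a coreset.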
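The merge-and-reduce pattern mentioned above can be sketched in a few lines. In this illustration of ours, the per-chunk "summary" is a QR "R" factor (a d × d sketch with R^T R = M^T M); the paper's coresets plug into the same merge chain while remaining scaled subsets of the original rows:

```python
import numpy as np

def compress(M):
    # R factor of a QR decomposition: a d x d summary with R^T R = M^T M.
    return np.linalg.qr(M, mode="r")

def merge(R1, R2):
    # Merge two summaries by stacking and re-compressing; still d x d.
    return compress(np.vstack([R1, R2]))

rng = np.random.default_rng(3)
A = rng.standard_normal((8_000, 6))
summary = None
for chunk in np.array_split(A, 16):    # simulate a stream of chunks
    summary = compress(chunk) if summary is None else merge(summary, compress(chunk))

err = np.abs(summary.T @ summary - A.T @ A).max()
print(err)   # only floating-point rounding, no approximation error
```

Because each merge is exact, the chain (or a balanced tree, for distributed data) introduces no compounding approximation error, which is the property the paper's ε = 0 subsets also enjoy.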
Our experiments show, as expected from the analysis, that coresets can also be used to significantly improve the numerical stability of existing algorithms. Another advantage is that the same coreset can be used for parameter tuning over a large set of candidates. In addition to other reasons, this significantly reduces the running time of such algorithms in our experiments; see Section 4.

1.2 Our contribution

A natural question that follows from the previous section is: can we maintain the optimal solution for LMS problems both accurately and fast? We answer this question affirmatively by suggesting:

(i) the first algorithm that computes the Caratheodory set of n input points in time that is linear in the input, O(nd), for asymptotically large n. This is by using a novel approach of coreset/sketch fusion that is explained in the next section; see Algorithm 1 and Theorem 3.1.
(ii) an algorithm that maintains a ("coreset") matrix S ∈ R^((d²+1)×d) such that: (a) its set of rows is a scaled subset of rows from A ∈ R^(n×d) whose rows are the input points, and (b) the covariance matrices of S and A are the same, i.e., S^T S = A^T A; see Algorithm 2 and Theorem 3.2.
(iii) example applications for boosting the performance of existing solvers by running them on the matrix S above or its variants for Linear/Ridge/Lasso Regression and Elastic-Net.
(iv) extensive experimental results on synthetic and real-world data for common LMS solvers of the Scikit-learn library with either CPython or Intel's distribution. 
Either the running time or the numerical stability is improved by up to two orders of magnitude.
(v) open code [29] for our algorithms that we hope will be used for the many other LMS solvers and future research, as suggested in our Conclusion section; see Section 5.

1.3 Novel approach: Coresets meet Sketches

As explained in Section 1.1, the covariance matrix A^T A of A itself can be considered as a sketch which is relatively less numerically stable to maintain (especially its inverse, as desired by e.g. linear regression). The Caratheodory set, as in Definition 2.1, that corresponds to the set of outer products of the rows of A is a coreset whose weighted sum yields the covariance matrix A^T A. Moreover, it is more numerically stable, but takes much more time to compute; see Theorem 2.2.
To this end, we suggest a meta-algorithm that combines these two approaches: sketches and coresets. It may be generalized to other, not-necessarily accurate, ε-coresets and sketches (ε > 0); see Section 5.
The input to our meta-algorithm is 1) a set P of n items, 2) an integer k ∈ {1,··· , n}, where k = n yields the highest numerical accuracy but the longest running time, and 3) a pair of coreset and sketch construction schemes for the problem at hand. The output is a coreset for the problem whose construction time is faster than the construction time of the given coreset scheme; see Fig. 1.
Step I: Compute a balanced partition {P_1,··· , P_k} of the input set P into k clusters of roughly the same size. While the correctness holds for any such arbitrary partition (e.g. see Algorithm 3.1), to reduce numerical errors the best is a partition that minimizes the sum of loss with respect to the problem at hand.
Step II: Compute a sketch S_i for each cluster P_i, where i ∈ {1,··· , k}, using the input sketch scheme. 
This step does not return a subset of P as desired, and is usually numerically less stable.

Figure 1: Overview of Algorithm 1 and the steps in Section 1.3. Images left to right: Steps I and II (Partition and sketch steps): A partition of the input weighted set of n = 48 points (in blue) into k = 8 equal clusters (in circles) whose corresponding means are μ_1, . . . , μ_8 (in red). The mean of P (and of these means) is x (in green). Step III (Coreset step): a Caratheodory (sub)set of d + 1 = 3 points (bold red) with corresponding weights (in green) is computed only for these k = 8 ≪ n means. Step IV (Recover step): the Caratheodory set is replaced by its corresponding original points (dark blue). The remaining points in P (bright blue) are deleted. Step V (Recursive step): the previous steps are repeated until only d + 1 = 3 points remain. This procedure takes O(log n) iterations for k = 2d + 2.

Step III: Compute a coreset B for the union S = S_1 ∪ ··· ∪ S_k of the sketches from Step II, using the input coreset scheme. Note that B is not a subset (or coreset) of P.
Step IV: Compute the union C of the clusters in P_1,··· , P_k that correspond to the sketches selected in Step III, i.e. C = ∪_{S_i ∈ B} P_i. By definition, C is a coreset for the problem at hand.
Step V: Recursively compute a coreset for C until a sufficiently small coreset is obtained. This step is used to reduce the running time, without selecting a k that is too small.
We then run an existing solver on the coreset C to obtain a faster accurate solution for P. Algorithms 1 and 3.1 are special cases of this meta-algorithm, where the sketch is simply the sum of a set of points/matrices, and the coreset is the existing (slow) implementation of the Caratheodory set from Theorem 2.2.
Paper organization. In Section 2 we give our notation, definitions and the current state-of-the-art result. 
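Steps I-V above, instantiated for the weighted-mean problem (so the "sketch" of a cluster is its weighted mean and the "coreset" scheme is the classic slow Caratheodory step), can be condensed into the following sketch of ours; the function names are our own, the partition is a plain round-robin split rather than a tuned one, and the recursion of Step V is written as a loop:

```python
import numpy as np

def _caratheodory(P, u):
    """Slow Caratheodory step (the 'coreset' scheme), run only on k cluster
    means per round. Returns (kept indices, new weights)."""
    idx = np.arange(len(u))
    P, u = P.astype(float).copy(), u.astype(float).copy()
    while len(u) > P.shape[1] + 1:
        M = np.vstack([P.T, np.ones(len(u))])   # null vector of M sums to 0
        alpha = np.linalg.svd(M)[2][-1]         # and cancels the weighted mean
        if alpha.max() < 1e-12:
            alpha = -alpha
        pos = alpha > 1e-12
        u -= np.min(u[pos] / alpha[pos]) * alpha   # zeroes out >= 1 weight
        keep = u > 1e-12
        P, u, idx = P[keep], u[keep], idx[keep]
    return idx, u / u.sum()

def fast_caratheodory(P, u, k):
    d = P.shape[1]
    while len(u) > d + 1:
        parts = [c for c in np.array_split(np.arange(len(u)), k) if len(c)]  # Step I
        mus = np.array([u[c] @ P[c] / u[c].sum() for c in parts])            # Step II: sketches
        wts = np.array([u[c].sum() for c in parts])
        chosen, w_mu = _caratheodory(mus, wts)                               # Step III: coreset of sketches
        keep = np.concatenate([parts[i] for i in chosen])                    # Step IV: recover clusters
        u = np.concatenate([w_mu[j] * u[parts[i]] / u[parts[i]].sum()
                            for j, i in enumerate(chosen)])
        P = P[keep]                                                          # Step V: repeat
    return P, u
```

Each round keeps at most (d + 1) clusters out of k, so for k ≥ 2d + 2 the point count shrinks geometrically and O(log n) rounds suffice, while the expensive step only ever sees k points.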
Section 3 presents our main algorithms for efficient computation of the Caratheodory (core-)set and of a subset that preserves the input's covariance matrix, together with their theorems of correctness and proofs. Section 4 demonstrates the applications of those algorithms to common LMS solvers, with extensive experimental results on both real-world and synthetic data via the Scikit-learn library with either the CPython or Intel's Python distribution. We conclude the paper with open problems and future work in Section 5.

2 Notation and Preliminaries

For a pair of integers n, d ≥ 1, we denote by R^(n×d) the set of n × d real matrices, and [n] = {1,··· , n}. To avoid abuse of notation, we use the big O notation where O(·) is a set [12]. A weighted set is a pair (P, u) where P = {p_1,··· , p_n} is an ordered finite set in R^d, and u : P → [0, ∞) is a positive weights function. We sometimes use a matrix notation whose rows contain the elements of P instead of the ordered set notation.
Given a point q inside the convex hull of a set of points P, Caratheodory's Theorem proves that there is a subset of at most d + 1 points in P whose convex hull also contains q. This geometric definition can be formulated as follows.
Definition 2.1 (Caratheodory set). Let (P, u) be a weighted set of n points in R^d such that Σ_{p∈P} u(p) = 1. A weighted set (S, w) is called a Caratheodory set for (P, u) if: (i) its size is |S| ≤ d + 1, (ii) its weighted mean is the same, Σ_{p∈S} w(p)·p = Σ_{p∈P} u(p)·p, and (iii) its sum of weights is Σ_{p∈S} w(p) = 1.
Caratheodory's Theorem suggests a constructive proof for computing this set in O(n²d²) time [7, 10]; see Algorithm 8 along with an overview and full proof in Section A of the supplementary material [31]. However, as observed e.g. in [32], it can be computed only for the first m = d + 1 points, and then be updated point by point in O(md²) = O(d³) time per point, to obtain O(nd³) overall time. This still takes Θ(n) calls to a linear system solver that returns x ∈ R^d satisfying Ax = b for a given matrix A ∈ R^((d+1)×d) and vector b ∈ R^(d+1), in O(d³) time per call.
Theorem 2.2 ([7], [32]). A Caratheodory set (S, w) can be computed for any weighted set (P, u) where Σ_{p∈P} u(p) = 1 in t(n, d) ∈ O(1) · min{n²d², nd³} time.

3 Faster Caratheodory Set

In this section, we present our main algorithm, which reduces the running time for computing a Caratheodory set from O(min{n²d², nd³}) in Theorem 2.2 to O(nd) for sufficiently large n; see Theorem 3.1. A visual illustration of the corresponding Algorithm 1 is shown in Fig. 1. As an application, we present a second algorithm, called CARATHEODORY-MATRIX, which computes a small weighted subset of the given input that has the same covariance matrix as the input matrix; see Algorithm 2.
Theorem 3.1 (Caratheodory-Set Booster). Let (P, u) be a weighted set of n points in R^d such that Σ_{p∈P} u(p) = 1, and let k ≥ d + 2 be an integer. Let (C, w) be the output of a call to FAST-CARATHEODORY-SET(P, u, k); see Algorithm 1. Let t(k, d) be the time it takes to compute a Caratheodory set for k points in R^d, as in Theorem 2.2. Then (C, w) is a Caratheodory set of (P, u) that is computed in time O(nd + t(k, d) · log n / log(k/d)).

Proof. 
See the full proof of Theorem B.1 in the supplementary material [31].

Tuning Algorithm 1 for the fastest running time. To achieve the fastest running time in Algorithm 1, simple calculations show that when t(k, d) = kd³, i.e., when applying the algorithm from [32], k = ed is the value that achieves the fastest running time, and when t(k, d) = k²d², i.e., when applying the original Caratheodory algorithm (Algorithm 8 in the supplementary material [31]), k = √e · d is the value that achieves the fastest running time.

Algorithm 1 FAST-CARATHEODORY-SET(P, u, k); see Theorem 3.1
Input: A set P of n points in R^d, a (weight) function u : P → [0, ∞) such that Σ_{p∈P} u(p) = 1, and an integer (number of clusters) k ∈ {1,··· , n} for the numerical accuracy/speed trade-off.
Output: A Caratheodory set of (P, u); see Definition 2.1.
1  P := P \ {p ∈ P | u(p) = 0}    // Remove all points with zero weight.
2  if |P| ≤ d + 1 then
3      return (P, u)    // |P| is already small.
4  {P_1,··· , P_k} := a partition of P into k disjoint subsets (clusters), each containing at most ⌈n/k⌉ points.
5  for every i ∈ {1,··· , k} do
6      μ_i := (1 / Σ_{q∈P_i} u(q)) · Σ_{p∈P_i} u(p) · p    // the weighted mean of P_i
7      u'(μ_i) := Σ_{p∈P_i} u(p)    // the weight of the ith cluster
8  (μ̃, w̃) := CARATHEODORY({μ_1,··· , μ_k}, u')    // see Algorithm 8 in the supplementary material and Theorem 2.2.
9  C := ∪_{μ_i ∈ μ̃} P_i    // C is the union over all clusters P_i ⊆ P whose representative μ_i was chosen for μ̃.
10 for every μ_i ∈ μ̃ and p ∈ P_i do
11     w(p) := w̃(μ_i) · u(p) / Σ_{q∈P_i} u(q)    // assign a weight for each point in C
12 
(C, w) := FAST-CARATHEODORY-SET(C, w, k)    // recursive call
13 return (C, w)

Theorem 3.2. Let A ∈ R^(n×d) be a matrix, and let k ≥ d² + 2 be an integer. Let S ∈ R^((d²+1)×d) be the output of a call to CARATHEODORY-MATRIX(A, k); see Algorithm 2. Let t(k, d) be the computation time of CARATHEODORY (Algorithm 8) given k points in R^d. Then A^T A = S^T S. Furthermore, S is computed in O(nd² + t(k, d²) · log n / log(k/d²)) time.

Proof. See the full proof of Theorem B.2 in the supplementary material [31].

Algorithm 2 CARATHEODORY-MATRIX(A, k); see Theorem 3.2
Input: A matrix A = (a_1 | ··· | a_n)^T ∈ R^(n×d), and an integer k ∈ {1,··· , n} for the numerical accuracy/speed trade-off.
Output: A matrix S ∈ R^((d²+1)×d) whose rows are scaled rows from A, and A^T A = S^T S.
1 for every i ∈ {1,··· , n} do
2     Set p_i ∈ R^(d²) as the concatenation of the d² entries of a_i a_i^T ∈ R^(d×d).    // The order of entries may be arbitrary but the same for all points.
3     u(p_i) := 1/n
4 P := {p_i | i ∈ {1,··· , n}}    // P is a set of n vectors in R^(d²).
5 (C, w) := FAST-CARATHEODORY-SET(P, u, k)    // C ⊆ P and |C| = d² + 1 by Theorem 3.1
6 S := a (d² + 1) × d matrix whose ith row is √(n · w(p_i)) · a_i^T for every p_i ∈ C.
7 return S

Table 1: Four LMS solvers that were tested with Algorithm 3. Each procedure gets a matrix A ∈ R^(n×d) and a vector b ∈ R^n, and aims to compute x ∈ R^d that minimizes its objective function. Additional regularization parameters include α > 0 and ρ ∈ [0, 1]. The Python solvers use m-fold cross validation over every α in a given set A ⊆ [0, ∞).

Solver | Objective function | Python package | Example Python solver
Linear regression [6] | ‖Ax − b‖₂² | scipy.linalg | LinearRegression(A, b)
Ridge regression [19] | ‖Ax − b‖₂² + α‖x‖₂² | sklearn.linear_model | RidgeCV(A, b, A, m)
Lasso regression [39] | (1/2n)‖Ax − b‖₂² + α‖x‖₁ | sklearn.linear_model | LassoCV(A, b, A, m)
Elastic-Net regression [43] | (1/2n)‖Ax − b‖₂² + ρα‖x‖₁ + ((1 − ρ)/2)α‖x‖₂² | sklearn.linear_model | ElasticNetCV(A, b, A, ρ, m)

4 Experimental Results

In this section we apply our fast construction of the Caratheodory set S from the previous section to boost the running time of the common LMS solvers in Table 1 by a factor of tens to hundreds, or to improve their numerical accuracy by a similar factor to support, e.g., 32-bit floating point representation as in Fig. 2v. This is by running the given solver as a black box on the small matrix C that is returned by Algorithms 4-7, which is based on S. That is, our algorithm does not compete with existing solvers but relies on them, which is why we call it a "booster". Open code for our algorithms is provided [29].
m-fold cross validation (CV). We briefly discuss the CV technique, which is utilized in common LMS solvers. Given a parameter m and a set of real numbers A, to select the optimal value α ∈ A of the regularization term, the existing Python LMS solvers partition the rows of A into m folds (subsets) and run the solver m · |A| times; each run is done on a concatenation of m − 1 folds (subsets) and an α ∈ A, and its result is tested on the remaining "test fold". Finally, the cross validation returns the parameters that yield the optimal (minimal) value on the test fold; see [25] for details.
From Caratheodory Matrix to LMS solvers. 
As stated in Theorem 3.2, Algorithm 2 gets an input matrix A ∈ R^(n×d) and an integer k > d + 1, and returns a matrix S ∈ R^((d²+1)×d) with the same covariance, A^T A = S^T S, where k is a parameter for setting the desired numerical accuracy. To "learn" a given label vector b ∈ R^n, Algorithm 3 partitions the matrix A' = (A | b) into m partitions, computes a subset for each partition that preserves its covariance matrix, and returns the union of subsets as a pair (C, y), where C ∈ R^((m(d+1)²+m)×d) and y ∈ R^(m(d+1)²+m). For m = 1 and every x ∈ R^d,

    ‖Ax − b‖ = ‖A'(x | −1)^T‖ = ‖(C | y)(x | −1)^T‖ = ‖Cx − y‖,    (2)

where the second and third equalities follow from Theorem 3.2 and the construction of C, respectively. This enables us to replace the original pair (A, b) by the smaller pair (C, y) for the solvers in Table 1, as in Algorithms 4-7. A scaling factor β is also needed in Algorithms 6-7.
To support CV with m > 1 folds, Algorithm 3 computes a coreset for each of the m folds (subsets of the data) in Line 4 and concatenates the output coresets in Line 5. Thus, (2) holds similarly for each fold (subset) when m > 1.
The experiments. We applied our CARATHEODORY-MATRIX coreset from Algorithm 2 to the common Python SKlearn LMS solvers that are described in Table 1. Most of these experiments were repeated twice: using the default CPython distribution [40] and Intel's distribution [28] of Python. All the experiments were conducted on a standard Lenovo Z70 laptop with an Intel i7-5500U CPU @ 2.40GHZ and 16GB RAM. We used the 3 following real-world datasets:
(i) 3D Road Network (North Jutland, Denmark) [24]. It contains n = 434874 records. 
We used the d = 2 attributes "Longitude" [Double] and "Latitude" [Double] to predict the attribute "Height in meters" [Double].
(ii) Individual household electric power consumption [1]. It contains n = 2075259 records. We used the d = 2 attributes "global active power" [kilowatt - Double] and "global reactive power" [kilowatt - Double] to predict the attribute "voltage" [volt - Double].
(iii) House Sales in King County, USA [2]. It contains n = 21,600 records. We used the following d = 8 attributes: "bedrooms" [integer], "sqft living" [integer], "sqft lot" [integer], "floors" [integer], "waterfront" [boolean], "sqft above" [integer], "sqft basement" [integer] and "year built" [integer] to predict the "house price" [integer] attribute.
The synthetic data consists of an n × d matrix A and a vector b of length n, both with uniform random entries in [0, 1000]. As expected by the analysis, since our compression introduces no error to the computation accuracy, the actual values of the data had no effect on the results, unlike the size of the input, which affects the computation time. Table 2 summarizes the experimental results.

4.1 Competing methods

We now present other sketches for improving the practical running time of LMS solvers; see the discussion in Section 4.2.
SKETCH + CHOLESKY is a method which simply sums the rank-1 matrices of outer products of the rows in the input matrix A' = (A | b), which yields its covariance matrix B = A'^T A'. 
The Cholesky decomposition B = L^T L then returns a small matrix L ∈ R^(d×d) that can be plugged into the solvers, similarly to our coreset.
SKETCH + INVERSE is applied in the special case of linear regression, where one can avoid applying the Cholesky decomposition and can compute the solution (A^T A)⁻¹A^T b directly after maintaining A^T A and A^T b for the data seen so far.

4.2 Discussion

Running time. The number of rows in the reduced matrix C is O(d²), which is usually much smaller than the number n of rows in the original matrix A. This also explains why some coresets (dashed red line) failed for small values of n in Fig. 2b, 2c, 2h and 2i. The construction of C takes O(nd²) time. Solving linear regression takes the same time, with or without the coreset. However, the constants hidden in the O notation are much smaller, since the time for computing C becomes negligible for large values of n, as shown in Fig. 2u. We emphasize that, unlike common coresets, there is no accuracy loss due to the use of our coreset, ignoring ±10⁻¹⁵ additive errors/improvements. The improvement in running time due to our booster is of up to x10 compared to the algorithm's running time on the original data, as shown in Fig. 2m-2n. The contribution of the coreset is significant, already for smaller values of n, when it boosts other solvers that use cross validation for parameter tuning as explained above. In this case, the time complexity is reduced by a factor of m·|A|, since the coreset is computed only once for each of the m folds, regardless of the size |A|. In practice, the running time is improved by a factor of x10-x100, as shown for example in Fig. 2a-2c. As shown in the graphs, e.g., Fig. 
2u, the computations via Intel's Python distribution reduced the running times by 15-40% compared to the default CPython distribution, with or without the booster. This is probably due to its tailored implementation for our hardware.
Numerical stability. The SKETCH + CHOLESKY method is simple and accurate in theory, and there is no hope to improve its running time via our much more involved booster. However, it is numerically unstable in practice for the reasons that are explained in Section 1.1. In fact, on

Figure | Algorithm | x/y Axes labels | Python Distribution | Dataset | Input Parameters
2a,2b,2c | 5-7 | Size/Time for various d | CPython | Synthetic | m = 3, |A| = 100
2d,2e,2f | 5-7 | Size/Time for various |A| | CPython | Synthetic | m = 3, d = 7
2g,2h,2i | 5-7 | Size/Time for various d | Intel's | Synthetic | m = 3, |A| = 100
2j,2k,2l | 5-7 | Size/Time for various |A| | Intel's | Synthetic | m = 3, d = 7
2m,2n | 5-7 | |A|/Time | CPython | Datasets (i),(ii) | m = 3, |A| = 100
2o,2p | 5-7 | |A|/Time | Intel's | Datasets (i),(ii) | m = 3, |A| = 100
2q,2r | 5-7 | Time/maximal |A| that is feasible | CPython | Datasets (i),(ii) | m = 3
2s,2t | 5-7 | Time/maximal |A| that is feasible | Intel's | Datasets (i),(ii) | m = 3
2u | 4 | Size/Time for various distributions | CPython, Intel's | Synthetic | m = 64, d = 15
2v | 4 | Error/Count histogram + Size/Error | CPython | Datasets (i),(iii) | m = 1

Table 2: Summary of experimental results. The CPython [40] and Intel's [28] distributions were used. The input: A ∈ R^(n×d) and b ∈ R^n, where n is the "Data size". CV used m folds for evaluating each parameter in A. The chosen number of clusters in Algorithm 3 is k = 2(d + 1)² + 2 in order to have O(log n) iterations in Algorithm 1, and ρ = 0.5 for Algorithm 7. Computation time includes the computation of the reduced input (C, y); see Section 3. 
The histograms consist of bins along with the number of errors that fall in each bin.

most of our experiments we could not apply this technique at all using a 32-bit floating-point representation. This is because the resulting approximation of A'^T A' was not a positive definite matrix, as required by the Cholesky decomposition, so we could not compute the matrix L at all. When it did succeed, our booster was slower by at most a factor of 2, but even in these cases the numerical accuracy was improved by up to orders of magnitude; see Fig. 2v for a histogram of errors using such a 32-bit float representation, which is especially common in GPUs for saving memory, running time and power [41]. For the special case of linear regression, one can apply SKETCH + INVERSE, which still has large numerical issues compared to our coreset computation, as shown in Fig. 2v.

Algorithm 3 LMS-CORESET(A, b, m, k)
Input: A matrix A ∈ R^{n×d}, a vector b ∈ R^n, a number (integer) m of cross-validation folds, and an integer k ∈ {1, ..., n} that denotes the accuracy/speed trade-off.
Output: A matrix C ∈ R^{O(md^2)×d} whose rows are scaled rows from A, and a vector y ∈ R^d.
1  A' := (A | b)    // a matrix A' ∈ R^{n×(d+1)}
2  {A'_1, ..., A'_m} := a partition of the rows of A' into m matrices, each of size (n/m) × (d+1)
3  for every i ∈ {1, ..., m} do
4      S_i := CARATHEODORY-MATRIX(A'_i, k)    // see Algorithm 2
5  S := (S_1^T | ... | S_m^T)^T    // concatenation of the m matrices into a single matrix of m(d+1)^2 + m rows and d+1 columns
6  C := the first d columns of S
7  y := the last column of S
8  return (C, y)

Algorithm 4 LINREG-BOOST(A, b, m, k)
1  (C, y) := LMS-CORESET(A, b, m, k)
2  x* := LinearRegression(C, y)
3  return x*
x\u2217\n\n1 (C, y) := LMS-CORESET(A, b, m, k)\n2 (x, \u03b1) := RidgeCV(C, y, A, m)\n3 return (x, \u03b1)\n\nAlgorithm 6 LASSOCV-BOOST(A, b, A, m, k)\n\nAlgorithm 7 ELASTICCV-BOOST(A, b, m, A, \u03c1, k)\n\n1 (C, y) := LMS-CORESET(A, b, m, k)\n\n1 (C, y) := LMS-CORESET(A, b, m, k)\n\n(cid:113)(cid:0)m \u00b7(cid:0)d + 1)2 + m(cid:1)/n\n\n(cid:113)(cid:0)m \u00b7(cid:0)d + 1)2 + m(cid:1)/n\n\n2 \u03b2 :=\n3 (x, \u03b1) := LassoCV(\u03b2 \u00b7 C, \u03b2 \u00b7 y, A, m)\n4 return (x, \u03b1)\n\n2 \u03b2 :=\n3 (x, \u03b1) := ElasticNetCV(\u03b2 \u00b7 C, \u03b2 \u00b7 y, A, \u03c1, m)\n4 return (x, \u03b1)\n\n5 Conclusion and Future Work\nWe presented a novel framework that combines sketches and coresets. As an example application,\nwe proved that the set from the Caratheodory Theorem can be computed in O(nd) overall time for\nsuf\ufb01ciently large n instead of the O(n2d2) time as in the original theorem. We then generalized\nthe result for a matrix S whose rows are a weighted subset of the input matrix and their covariance\nmatrix is the same. Our experimental results section shows how to signi\ufb01cantly boost the numerical\n\n8\n\n\f(a)\n\n(e)\n\n(i)\n\n(m)\n\n(q)\n\n(b)\n\n(f)\n\n(j)\n\n(n)\n\n(r)\n\n(c)\n\n(g)\n\n(k)\n\n(o)\n\n(s)\n\n(d)\n\n(h)\n\n(l)\n\n(p)\n\n(t)\n\n(u)\n\n(v) Accuracy comparison. (left): Dataset (i), (right): Dataset (ii). x\u2217 = LinearRegression(A, b). x\nwas computed using the methods speci\ufb01ed in the legend; see Section 4.2.\n\nFigure 2: Experimental results; see Table 2.\n\nstability or running time of existing LMS solvers by applying them on S. Future work includes:\n(a) applications of our framework to combine other sketch-coreset pairs e.g. 
as listed in [35], (b) experiments on streaming/distributed/GPU data, and (c) experiments with higher-dimensional data: we may compute each of the O(d^2) entries of the covariance matrix by calling our algorithm with d = 2 on the corresponding pair of columns of the input matrix.

6 Acknowledgements

We thank Rafi Dalla-Torre and Benjamin Lastmann from Samsung Research Israel for the fruitful discussions and their useful review of our code.

References
[1] Individual household electric power consumption Data Set. https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption, 2012.
[2] House Sales in King County, USA. https://www.kaggle.com/harlfoxem/housesalesprediction, 2015.
[3] Homayun Afrabandpey, Tomi Peltola, and Samuel Kaski. Regression analysis in small-n-large-p using interactive prior elicitation of pairwise similarities. In FILM 2016, NIPS Workshop on Future of Interactive Learning Machines, 2016.
[4] Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. Approximating extent measures of points. Journal of the ACM (JACM), 51(4):606–635, 2004.
[5] Christian Bauckhage. Numpy/scipy recipes for data science: Ordinary least squares optimization. researchgate.net, March 2015.
[6] Ake Bjorck. Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT Numerical Mathematics, 7(1):1–21, 1967.
[7] Constantin Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen. Mathematische Annalen, 64(1):95–115, 1907.
[8] Kenneth L Clarkson and David P Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205–214. ACM, 2009.
[9] Kenneth L Clarkson and David P Woodruff. Low-rank approximation and regression in input sparsity time.
Journal of the ACM (JACM), 63(6):54, 2017.
[10] WD Cook and RJ Webster. Caratheodory's theorem. Canadian Mathematical Bulletin, 15(2):293–293, 1972.
[11] John B Copas. Regression, prediction and shrinkage. Journal of the Royal Statistical Society: Series B (Methodological), 45(3):311–335, 1983.
[12] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2009.
[13] Petros Drineas, Michael W Mahoney, and Shan Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, pages 1127–1136. Society for Industrial and Applied Mathematics, 2006.
[14] Dan Feldman, Morteza Monemizadeh, Christian Sohler, and David P Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms, pages 630–649. Society for Industrial and Applied Mathematics, 2010.
[15] Dan Feldman, Mikhail Volkov, and Daniela Rus. Dimensionality reduction of massive sparse datasets using coresets. In Advances in Neural Information Processing Systems (NIPS), 2016.
[16] Neil Gallagher, Kyle R Ulrich, Austin Talbot, Kafui Dzirasa, Lawrence Carin, and David E Carlson. Cross-spectral factor analysis. In Advances in Neural Information Processing Systems, pages 6842–6852, 2017.
[17] Gene H Golub and Christian Reinsch. Singular value decomposition and least squares solutions. In Linear Algebra, pages 134–151. Springer, 1971.
[18] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU Press, 2012.
[19] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[20] Ian Jolliffe. Principal component analysis.
Springer, 2011.
[21] Ibrahim Jubran, David Cohn, and Dan Feldman. Provable approximations for constrained lp regression. arXiv preprint arXiv:1902.10407, 2019.
[22] Ibrahim Jubran, Alaa Maalouf, and Dan Feldman. Introduction to coresets: Accurate coresets. arXiv preprint arXiv:1910.08707, 2019.
[23] Byung Kang, Woosang Lim, and Kyomin Jung. Scalable kernel k-means via centroid approximation. In Proc. NIPS, 2011.
[24] Manohar Kaul, Bin Yang, and Christian S Jensen. Building accurate 3d spatial networks to enable next generation intelligent transportation systems. In 2013 IEEE 14th International Conference on Mobile Data Management, volume 1, pages 137–146. IEEE, 2013.
[25] Ron Kohavi et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, volume 14, pages 1137–1145. Montreal, Canada, 1995.
[26] Valero Laparra, Jesús Malo, and Gustau Camps-Valls. Dimensionality reduction via regression in hyperspectral imagery. IEEE Journal of Selected Topics in Signal Processing, 9(6):1026–1036, 2015.
[27] Yingyu Liang, Maria-Florina Balcan, and Vandana Kanchanapally. Distributed PCA and k-means clustering. In The Big Learning Workshop at NIPS, volume 2013. Citeseer, 2013.
[28] Intel LTD. Accelerate Python* performance. https://software.intel.com/en-us/distribution-for-python, 2019.
[29] Alaa Maalouf, Ibrahim Jubran, and Dan Feldman. Open source code for all the algorithms presented in this paper, 2019. Link for open-source code.
[30] Alaa Maalouf, Adiel Statman, and Dan Feldman. Tight sensitivity bounds for smaller coresets. arXiv preprint arXiv:1907.01433, 2019.
[31] Alaa Maalouf, Ibrahim Jubran, and Dan Feldman. Supplementary material. https://papers.nips.cc/paper/9040-fast-and-accurate-least-mean-squares-solvers, 2019.
[32] Soliman Nasser, Ibrahim Jubran, and Dan Feldman.
Coresets for kinematic data: From theorems to real-time systems. arXiv preprint arXiv:1511.09120, 2015.
[33] Karl Pearson. X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175, 1900.
[34] Xi Peng, Zhang Yi, and Huajin Tang. Robust subspace clustering via thresholding ridge regression. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[35] Jeff M Phillips. Coresets and sketches. arXiv preprint arXiv:1601.00617, 2016.
[36] Aldo Porco, Andreas Kaltenbrunner, and Vicenç Gómez. Low-rank approximations for predicting voting behaviour. In Workshop on Networks in the Social and Information Sciences, NIPS, 2015.
[37] S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674, 1991.
[38] George AF Seber and Alan J Lee. Linear regression analysis, volume 329. John Wiley & Sons, 2012.
[39] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[40] Wikipedia contributors. CPython — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=CPython&oldid=896388498, 2019.
[41] Wikipedia contributors. List of Nvidia graphics processing units — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=List_of_Nvidia_graphics_processing_units&oldid=897973746, 2019.
[42] Yilin Zhang and Karl Rohe. Understanding regularized spectral clustering via graph conductance.
In Advances in Neural Information Processing Systems, pages 10631–10640, 2018.
[43] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.