{"title": "Sublinear Time Low-Rank Approximation of Distance Matrices", "book": "Advances in Neural Information Processing Systems", "page_first": 3782, "page_last": 3792, "abstract": "Let $\\PP=\\{ p_1, p_2, \\ldots p_n \\}$ and $\\QQ = \\{ q_1, q_2 \\ldots q_m \\}$ be two point sets in an arbitrary metric space. Let $\\AA$ represent the $m\\times n$ pairwise distance matrix with $\\AA_{i,j} = d(p_i, q_j)$. Such distance matrices are commonly computed in software packages and have applications to learning image manifolds, handwriting recognition, and multi-dimensional unfolding, among other things. In an attempt to reduce their description size, we study low rank approximation of such matrices. Our main result is to show that for any underlying distance metric $d$, it is possible to achieve an additive error low rank approximation in sublinear time. We note that it is provably impossible to achieve such a guarantee in sublinear time for arbitrary matrices $\\AA$, and our proof exploits special properties of distance matrices. We develop a recursive algorithm based on additive projection-cost preserving sampling. We then show that in general, relative error approximation in sublinear time is impossible for distance matrices, even if one allows for bicriteria solutions. Additionally, we show that if $\\PP = \\QQ$ and $d$ is the squared Euclidean distance, which is not a metric but rather the square of a metric, then a relative error bicriteria solution can be found in sublinear time. Finally, we empirically compare our algorithm with the SVD and input sparsity time algorithms. Our algorithm is several hundred times faster than the SVD, and about $8$-$20$ times faster than input sparsity methods on real-world and and synthetic datasets of size $10^8$. 
Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal) and input-sparsity time algorithms.", "full_text": "Sublinear Time Low-Rank Approximation of\n\nDistance Matrices\n\nAinesh Bakshi\n\nDepartment of Computer Science\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nabakshi@cs.cmu.edu\n\nDavid P. Woodruff\n\nDepartment of Computer Science\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\ndwoodruf@cs.cmu.edu\n\nAbstract\n\nLet P = {p1, p2, . . . pn} and Q = {q1, q2 . . . qm} be two point sets in an arbitrary\nmetric space. Let A represent the m \u00d7 n pairwise distance matrix with Ai,j =\nd(pi, qj). Such distance matrices are commonly computed in software packages\nand have applications to learning image manifolds, handwriting recognition, and\nmulti-dimensional unfolding, among other things. In an attempt to reduce their\ndescription size, we study low rank approximation of such matrices. Our main\nresult is to show that for any underlying distance metric d, it is possible to achieve an\nadditive error low rank approximation in sublinear time. We note that it is provably\nimpossible to achieve such a guarantee in sublinear time for arbitrary matrices\nA, and our proof exploits special properties of distance matrices. We develop a\nrecursive algorithm based on additive projection-cost preserving sampling. We then\nshow that in general, relative error approximation in sublinear time is impossible\nfor distance matrices, even if one allows for bicriteria solutions. Additionally,\nwe show that if P = Q and d is the squared Euclidean distance, which is not a\nmetric but rather the square of a metric, then a relative error bicriteria solution can\nbe found in sublinear time. Finally, we empirically compare our algorithm with\nthe singular value decomposition (SVD) and input sparsity time algorithms. 
Our algorithm is several hundred times faster than the SVD, and about 8-20 times faster than input sparsity methods on real-world and synthetic datasets of size 10^8. Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal) and input-sparsity time algorithms.

1 Introduction

We study low rank approximation of matrices A formed by the pairwise distances between two (possibly equal) sets of points or observations P = {p_1, ..., p_m} and Q = {q_1, ..., q_n} in an arbitrary underlying metric space. That is, A is an m × n matrix for which A_{i,j} = d(p_i, q_j). Such distance matrices are the outputs of routines in commonly used software packages such as the pairwise command in Julia, the pdist2 command in Matlab, or the crossdist command in R.

Distance matrices have found many applications in machine learning, where Weinberger and Saul use them to learn image manifolds [18], Tenenbaum, De Silva, and Langford use them for image understanding and handwriting recognition [17], Jain and Saul use them for speech and music [12], and Demaine et al. use them for music and musical rhythms [7]. For an excellent tutorial on Euclidean distance matrices, we refer the reader to [8], which lists applications to nuclear magnetic resonance (NMR), crystallography, visualizing protein structure, and multi-dimensional unfolding.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We consider the most general case, for which P and Q are not necessarily the same set of points. For example, one may have two large unordered sets of samples from some distribution, and may want to determine how similar (or dissimilar) the sample sets are to each other. Such problems arise in hierarchical clustering and phylogenetic analysis^1. Formally, let P = {p_1, p_2, ..., p_m} and Q = {q_1, q_2, ..., q_n} be two sets of points in an arbitrary metric space.
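As a concrete illustration (this sketch is ours, not from the paper; the data and thresholds are illustrative), the following numpy snippet builds the pairwise distance matrix A with A[i, j] = d(p_i, q_j) for the Euclidean metric, and checks the phenomenon discussed below: when the point sets cluster into a few groups, A is approximately low rank, with almost all of its Frobenius mass in the top few singular values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three well-separated cluster centers; P and Q both sample from them.
centers = 10.0 * rng.standard_normal((3, 2))
P = centers[rng.integers(0, 3, 150)] + 0.01 * rng.standard_normal((150, 2))
Q = centers[rng.integers(0, 3, 120)] + 0.01 * rng.standard_normal((120, 2))

# A[i, j] = d(p_i, q_j) for the Euclidean metric, via broadcasting.
A = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)
assert A.shape == (150, 120)

# A is a tiny perturbation of the rank-<=3 block-constant matrix of
# center-to-center distances, so almost all Frobenius mass sits in the
# top three singular values.
s = np.linalg.svd(A, compute_uv=False)
assert np.sum(s[3:] ** 2) < 1e-3 * np.sum(s ** 2)
```

The block-constant matrix of center-to-center distances has one distinct row pattern per cluster, hence rank at most 3, which is why the tail singular values carry only the (small) within-cluster perturbation.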
Let A represent the m × n pairwise distance matrix with A_{i,j} = d(p_i, q_j). Since the matrix A may be very large, it is often desirable to reduce the number of parameters needed to describe it. Two standard methods of doing this are via sparsity and low-rank approximation. In the distance matrix setting, if one first filters P and Q to contain only distinct points, then each row and column can contain at most a single zero entry, so typically such matrices A are dense. Low-rank approximation, on the other hand, can be highly beneficial, since if the point sets can be clustered into a small number of clusters, then each cluster can be used to define an approximately rank-1 component, and so A is an approximately low rank matrix.

To find a low rank factorization of A, one can compute its singular value decomposition (SVD), though in practice this takes O(min(mn^2, m^2 n)) time. One can do slightly better with theoretical algorithms for fast matrix multiplication, though not only are they impractical, but there exist much faster randomized approximation algorithms. Indeed, one can use Fast Johnson-Lindenstrauss transforms (FJLT) [16], or CountSketch matrices [4, 13, 15, 1, 5], which for dense matrices A run in O(mn) + (m + n)poly(k/\epsilon) time. At first glance the O(mn) time seems like it could be optimal. Indeed, for arbitrary m × n matrices A, outputting a rank-k matrix B for which

\|A - B\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2    (1.1)

can be shown to require \Omega(mn) time. Here A_k denotes the best rank-k approximation to A in Frobenius norm, and recall that for an m × n matrix C, \|C\|_F^2 = \sum_{i=1,...,m, j=1,...,n} C_{i,j}^2. The additive error guarantee above is common in the low-rank approximation literature and appears in [10]. To see this lower bound, note that if one does not read nearly all the entries of A, then with good probability one may miss an entry of A which is arbitrarily large, and therefore cannot achieve (1.1).

Perhaps surprisingly, [14] show that for positive semidefinite (PSD) n × n matrices A, one can achieve (1.1) in sublinear time, namely, in n · k · poly(1/\epsilon) time. Moreover, they achieve the stronger notion of relative error, that is, they output a rank-k matrix B for which

\|A - B\|_F^2 \le (1 + \epsilon) \|A - A_k\|_F^2.    (1.2)

The intuition behind their result is that the "large entries" causing the \Omega(mn) lower bound cannot hide in a PSD matrix, since they necessarily create large diagonal entries. A natural question is whether it is possible to obtain low-rank approximation algorithms for distance matrices in sublinear time as well. A driving intuition that it may be possible is that no matter which metric the underlying points reside in, they necessarily satisfy the triangle inequality. Therefore, if A_{i,j} = d(p_i, q_j) is large, then since d(p_i, q_j) \le d(p_i, q_1) + d(q_1, p_1) + d(p_1, q_j), at least one of d(p_i, q_1), d(q_1, p_1), d(p_1, q_j) is large, and further, all these distances can be found by reading the first row and column of A. Thus, large entries cannot hide in the matrix. Are there sublinear time algorithms achieving (1.1)? Are there sublinear time algorithms achieving (1.2)? These are the questions we put forth and study in this paper.

1.1 Our Results

Our main result is that we obtain sublinear time algorithms achieving an additive error guarantee similar to (1.1) for distance matrices, which is impossible for general matrices A. We show that for every metric d, this is indeed possible. Namely, for an arbitrarily small constant \gamma > 0, we give an algorithm running in \tilde{O}((m^{1+\gamma} + n^{1+\gamma}) poly(k\epsilon^{-1})) time and achieving the guarantee \|A - MN^T\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2, for any distance matrix with metric d. Note that our running time is significantly sublinear in the description size of A. Indeed, thinking of the shortest path metric on an unweighted bipartite graph in which P corresponds to the left set of vertices and Q corresponds to the right set of vertices, for each pair of points p_i ∈ P and q_j ∈ Q, one can choose d(p_i, q_j) = 1 or d(p_i, q_j) > 1 independently of all other distances by deciding whether to include the edge {p_i, q_j}. Consequently, there are at least 2^{\Omega(mn)} possible distance matrices A, and since our algorithm reads o(mn) entries of A, it cannot learn whether d(p_i, q_j) = 1 or d(p_i, q_j) > 1 for each i and j. Nevertheless, it still learns enough information to compute a low rank approximation to A.

We note that a near matching lower bound holds just to write down the output of a factorization of a rank-k matrix B into an m × k and a k × n matrix. Thus, up to an (m^\gamma + n^\gamma) poly(k\epsilon^{-1}) factor, our algorithm is also optimal among those achieving the additive error guarantee of (1.1).

A natural followup question is to consider achieving relative error (1.2) in sublinear time. Although large entries in a distance matrix A cannot hide, we show it is still impossible to achieve the relative error guarantee in less than mn time for distance matrices. That is, we show for the \ell_\infty distance metric that there are instances of distance matrices A with unequal P and Q for which any algorithm, even for k = 2 and any constant accuracy \epsilon, must read \Omega(mn) entries of A.

^1 See, e.g., https://en.wikipedia.org/wiki/Distance_matrix
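The "large entries cannot hide" intuition above is easy to check numerically. A small sketch (ours, illustrative only) verifying that the first row and column of a metric distance matrix upper-bound every entry via the triangle inequality:

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.standard_normal((50, 4))
Q = rng.standard_normal((60, 4))
A = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)

# d(p_i, q_j) <= d(p_i, q_1) + d(q_1, p_1) + d(p_1, q_j): all three terms
# on the right live in the first column and first row of A.
bound = A[:, [0]] + A[0, 0] + A[[0], :]
assert bound.shape == (50, 60)
assert np.all(A <= bound + 1e-9)
```

So an entry of A can only be large if some entry of the first row or column is large, which an algorithm can detect after reading just m + n entries.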
In fact, our lower bound holds even if the algorithm is allowed to output a rank-k' approximation for any 2 \le k' = o(min(m, n)) whose cost is at most that of the best rank-2 approximation to A. We call the latter a bicriteria algorithm, since its output rank k' may be larger than the desired rank k. Therefore, in some sense, obtaining additive error (1.1) is the best we can hope for.

We next consider the important class of Euclidean matrices for which the entries correspond to the square of the Euclidean distance, and for which P = Q. In this case, we are able to show that if we allow the low rank matrix B output to be of rank k + 4, then one can achieve the relative error guarantee of (1.2) with respect to the best rank-k approximation, namely, that

\|A - B\|_F^2 \le (1 + \epsilon) \|A - A_k\|_F^2.

Further, our algorithm runs in a sublinear n · k · poly(1/\epsilon) amount of time. Thus, our lower bound ruling out sublinear time algorithms achieving (1.2) for bicriteria algorithms cannot hold for this class of matrices.

Finally, we empirically compare our algorithm with the SVD and input sparsity time algorithms [4, 13, 15, 1, 5]. Our algorithm is several hundred times faster than the SVD, and about 8-20 times faster than input sparsity methods on real-world datasets such as MNIST, Gisette and Poker, and a synthetic clustering dataset of size 10^8. Accuracy-wise, our algorithm is only slightly worse than that of the SVD (optimal error) and input-sparsity time algorithms. Due to space constraints, we defer all of our proofs to the Supplementary Material^2.

2 Row and Column Norm Estimation

We observe that we can obtain a rough estimate for the row or column norms of a distance matrix by uniformly sampling a small number of elements of each row or column.
The only structural property we need to obtain such an estimate is approximate triangle inequality.

Definition 2.1 (Approximate Triangle Inequality). Let A be an m × n matrix. Then, matrix A satisfies approximate triangle inequality if, for any \epsilon ∈ [0, 1], for any p ∈ [m] and q, r ∈ [n],

\frac{1}{1+\epsilon} \Big| A_{p,r} - \max_{i \in [m]} |A_{i,q} - A_{i,r}| \Big| \le A_{p,q} \le (1+\epsilon) \Big( A_{p,r} + \max_{i \in [m]} |A_{i,q} - A_{i,r}| \Big)    (2.1)

\frac{1}{1+\epsilon} |A_{p,q} - A_{p,r}| \le \max_{i \in [m]} |A_{i,q} - A_{i,r}| \le (1+\epsilon) (A_{p,q} + A_{p,r})    (2.2)

Further, similar equations hold for A^T.

The above definition captures distance matrices if we set \epsilon = 0. In order to see this, recall that each entry in an m × n matrix A is associated with a distance between point sets P and Q, such that |P| = m and |Q| = n. Then, for points p ∈ P and q ∈ Q, A_{p,q} represents d(p, q), where d is an arbitrary distance metric. Further, for q, r ∈ Q, max_{i∈[m]} |A_{i,q} - A_{i,r}| = max_{i∈[m]} |d(p_i, q) - d(p_i, r)|. Intuitively, we would like to highlight that, for the case where A is a distance matrix, max_{i∈[m]} |A_{i,q} - A_{i,r}| represents a lower bound on the distance d(q, r). Since d is a metric, it satisfies the triangle inequality, and d(p, q) \le d(p, r) + d(q, r). Further, by the reverse triangle inequality, for all i ∈ [m], d(q, r) \ge |d(p_i, q) - d(p_i, r)|. Therefore, A_{p,q} \le A_{p,r} + max_{i∈[m]} |A_{i,q} - A_{i,r}|, and distance matrices satisfy equation 2.1. Next, max_{i∈[m]} |A_{i,q} - A_{i,r}| = max_{i∈[m]} |d(p_i, q) - d(p_i, r)| \le d(q, r), and d(q, r) \le d(p, r) + d(p, q) = A_{p,r} + A_{p,q}; therefore, equation 2.2 is satisfied.

^2 The full version of our paper is available at https://arxiv.org/pdf/1809.06986.pdf
We note that approximate triangle inequality is a relaxation of the traditional triangle inequality and is sufficient to obtain coarse estimates to row and column norms of A in sublinear time.

Algorithm 1: Row Norm Estimation.
Input: a distance matrix A ∈ R^{m×n}, sampling parameter b.
  1. Let x = argmin_{i∈[m]} A_{i,1}.
  2. Let d = max_{j∈[n]} A_{x,j}.
  3. For i ∈ [m], let T_i be a uniformly random sample of \Theta(b) indices in [n].
  4. \tilde{X}_i = d^2 + (n/b) \sum_{j∈T_i} A_{i,j}^2.
Output: the set {\tilde{X}_1, \tilde{X}_2, ..., \tilde{X}_m}.

Lemma 2.1 (Row Norm Estimation). Let A be an m × n matrix such that A satisfies approximate triangle inequality. For i ∈ [m], let A_{i,*} be the i-th row of A. Algorithm 1 uniformly samples \Theta(b) elements from A_{i,*} and, with probability at least 9/10, outputs an estimator which obtains an O(n/b)-approximation to \|A_{i,*}\|_2^2. Further, Algorithm 1 runs in O(bm + n) time.

To obtain an O(n/b) approximation for all the m rows simultaneously with high probability, we can compute O(log(m)) estimators for each row and take their median. We also observe that column and row norm estimation are symmetric operations, and a slight modification to Algorithm 1 yields a column norm estimation algorithm with the following guarantee:

Corollary 2.1 (Column Norm Estimation). There exists an algorithm that uniformly samples \Theta(b) elements from A_{*,j} and, with probability 9/10, outputs an estimator which is an O(m/b)-approximation to \|A_{*,j}\|_2^2. Further, this algorithm runs in O(bn + m) time.

3 Projection-Cost Preserving Sketches

We would like to reduce the dimensionality of the matrix in a way that approximately preserves low-rank structure.
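Algorithm 1 above translates directly into code. The transcription below is ours (the test data and the loose accuracy bounds are illustrative; the paper only guarantees an O(n/b)-approximation with constant probability per row):

```python
import numpy as np

def row_norm_estimation(A, b, rng):
    """Algorithm 1: crude estimates of the squared row norms of a
    distance matrix, reading only O(b*m + n) entries of A."""
    m, n = A.shape
    x = int(np.argmin(A[:, 0]))      # step 1: x = argmin_i A_{i,1}
    d = float(np.max(A[x, :]))       # step 2: d = max_j A_{x,j}
    est = np.empty(m)
    for i in range(m):               # step 3: Theta(b) uniform samples per row
        T = rng.integers(0, n, size=b)
        est[i] = d**2 + (n / b) * np.sum(A[i, T] ** 2)   # step 4
    return est

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
Y = rng.standard_normal((200, 5))
A = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

est = row_norm_estimation(A, b=20, rng=rng)
true = np.sum(A**2, axis=1)
# Deliberately loose sanity check of the coarse approximation.
assert np.all(est > 0.1 * true) and np.all(est < 10 * true)
```

The d^2 term anchors the estimate at the scale of the matrix, so rows whose sampled entries happen to all be small cannot be underestimated arbitrarily.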
The main insight is that if we can approximately preserve all rank-k subspaces in the column and row space of the matrix, then we can recursively sample rows and columns to obtain a much smaller matrix. To this end, we introduce a relaxation of projection-cost preserving sketches [6] that satisfies an additive error guarantee. We show that sampling columns of A according to approximate column norms yields a matrix that preserves projection cost for all rank-k projections up to additive error. We give a proof similar to that of the relative error guarantees in [6], but need to replace certain parts with our different distribution, which is based only on row and column norms rather than leverage scores, and consequently we obtain additive error in places instead. As our lower bound shows, this is necessary in our setting.

Theorem 3.1 (Column Projection-Cost Preservation). Let A be an m × n matrix such that A satisfies approximate triangle inequality. For j ∈ [n], let \tilde{X}_j be an O(m/b)-approximate estimate for the j-th column of A such that it satisfies the guarantee of Corollary A.1. Then, let q = {q_1, q_2, ..., q_n} be a probability distribution over the columns of A such that q_j = \tilde{X}_j / \sum_{j'} \tilde{X}_{j'}. Let t = c \cdot \frac{m k^2}{b \epsilon^2} \log(\frac{m}{\delta}) for some constant c. Then, construct C using t columns of A, setting each one to A_{*,j}/\sqrt{t q_j} with probability q_j. With probability at least 1 - \delta, for any rank-k orthogonal projection X, \|C - XC\|_F^2 = \|A - XA\|_F^2 \pm \epsilon \|A\|_F^2.

We observe that we can also estimate the row norms in sublinear time and immediately obtain a similar guarantee for row projection-cost preservation. Next, we describe how to apply projection-cost preserving sketching for low-rank approximation. Let C be a column pcp for A.
Then, an approximate solution for the best rank-k approximation to C is an approximate solution for the best rank-k approximation to A. Formally,

Lemma 3.1. Let C be a column pcp for A satisfying the guarantee of Theorem A.2. Let P*_C be the minimizing projection matrix for min_X \|C - XC\|_F^2 and P*_A be the projection matrix that minimizes min_X \|A - XA\|_F^2. Then, for any projection matrix P such that \|C - PC\|_F^2 \le \|C - P*_C C\|_F^2 + \epsilon \|C\|_F^2, with probability at least 98/100, \|A - PA\|_F^2 \le \|A - P*_A A\|_F^2 + \epsilon \|A\|_F^2. A similar guarantee holds if C is a row pcp of A.

4 A Sublinear Time Algorithm

Algorithm 2: First Sublinear Time Algorithm.
Input: a distance matrix A ∈ R^{m×n}, integer k, and \epsilon > 0.
  1. Set b_1 = \epsilon n^{0.34}/log(n) and b_2 = \epsilon m^{0.34}/log(m). Set s_1 = \Theta(\frac{m k^2 \log(m)}{b_1 \epsilon^2}) and s_2 = \Theta(\frac{n k^2 \log(n)}{b_2 \epsilon^2}).
  2. Let \tilde{X}_j be the estimate for \|A_{*,j}\|_2^2 returned by ColumnNormEstimation(A, b_1). Recall, \tilde{X}_j is an O(m/b_1)-approximation to \|A_{*,j}\|_2^2.
  3. Let q = {q_1, q_2, ..., q_n} denote a distribution over columns of A such that q_j = \tilde{X}_j / \sum_j \tilde{X}_j \ge (b_1/m) \|A_{*,j}\|_2^2 / \|A\|_F^2. Construct a column pcp for A by sampling s_1 columns of A such that each column is set to A_{*,j}/\sqrt{s_1 q_j} with probability q_j. Let AS be the resulting m × s_1 matrix that follows the guarantees of Theorem A.2.
  4. To account for the rescaling, consider O(\epsilon^{-1} log(n)) weight classes for the scaling parameters of the columns of AS. Let AS|_{W_g} be the columns of AS restricted to the weight class W_g (defined below).
  5. Run the RowNormEstimation(AS|_{W_g}, b_2) estimation algorithm with parameter b_2 for each weight class independently and sum up the estimates for a given row. Let \tilde{X}_i be the resulting O(n/b_2)-approximate estimator for \|AS_{i,*}\|_2^2.
  6. Let p = {p_1, p_2, ..., p_m} denote a distribution over rows of AS such that p_i = \tilde{X}_i / \sum_i \tilde{X}_i \ge (b_2/n) \|AS_{i,*}\|_2^2 / \|AS\|_F^2. Construct a row pcp for AS by sampling s_2 rows of AS such that each row is set to AS_{i,*}/\sqrt{s_2 p_i} with probability p_i. Let TAS be the resulting s_2 × s_1 matrix that follows the guarantees of Corollary A.2.
  7. Run the input-sparsity time low-rank approximation algorithm (corresponding to Theorem 4.2) on TAS with rank parameter k to obtain a rank-k approximation to TAS, output in factored form: L, D, W^T. Note, LD is an s_2 × k matrix and W^T is a k × s_1 matrix.
  8. Consider the regression problem min_X \|AS - XW^T\|_F^2. Sketch the problem using the leverage scores of W^T as shown in Theorem 4.3 to obtain a sampling matrix E with poly(k/\epsilon) columns. Compute X_{AS} = argmin_X \|AS E - X W^T E\|_F^2. Let X_{AS} W^T = P'N'^T be such that P' has orthonormal columns.
  9. Consider the regression problem min_X \|A - P'X\|_F^2. Sketch the problem using the leverage scores of P' following Theorem 4.3 to obtain a sampling matrix E' with poly(k/\epsilon) rows. Compute X_A = argmin_X \|E'A - E'P'X\|_F^2.
Output: M = P', N^T = X_A.

In this section, we give a sublinear time algorithm which relies on constructing column and row pcps, which in turn rely on our column and row norm estimators.
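Steps 3 and 6 of Algorithm 2 are norm-proportional sampling with the standard 1/\sqrt{s q_j} rescaling. A small sketch of that primitive (our code; it uses exact column norms where Algorithm 2 would use the coarse estimates) showing why the rescaling is the right one — with exact norms, every rescaled sampled column has squared norm exactly \|A\|_F^2 / s, so the Frobenius mass of A is preserved:

```python
import numpy as np

def sample_columns(A, norm_est, s, rng):
    """Sample s columns with probability proportional to (estimated)
    squared column norms, rescaling column j by 1/sqrt(s * q_j)."""
    q = norm_est / norm_est.sum()
    idx = rng.choice(A.shape[1], size=s, p=q)
    return A[:, idx] / np.sqrt(s * q[idx])

rng = np.random.default_rng(3)
A = rng.standard_normal((80, 400))
exact = np.sum(A**2, axis=0)        # Algorithm 2 would use coarse estimates

AS = sample_columns(A, exact, s=200, rng=rng)
assert AS.shape == (80, 200)
# Each rescaled column has squared norm ||A||_F^2 / s, so the sampled
# matrix carries exactly the Frobenius mass of A.
assert np.isclose(np.sum(AS**2), np.sum(A**2))
```

With the O(m/b_1)-approximate estimates of Algorithm 2, the same identity holds only up to the approximation factor, which is why the algorithm oversamples by \Theta(m/b_1).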
Intuitively, we begin by obtaining coarse estimates of the column norms. Next, we sample a subset of the columns of A with probability proportional to their column norm estimates to obtain a column pcp for A. We show that the rescaled matrix still has enough structure to get a coarse estimate of its row norms. Then, we compute the row norm estimates of the sampled, rescaled matrix, and subsample its rows to obtain a small matrix that is a row pcp. We run an input-sparsity time algorithm ([4]) on the small matrix to obtain a low-rank approximation. The main theorem we prove is as follows:

Theorem 4.1 (Sublinear Low-Rank Approximation). Let A ∈ R^{m×n} be a matrix that satisfies approximate triangle inequality. Then, for any \epsilon > 0 and integer k, Algorithm 2 runs in time O((m^{1.34} + n^{1.34}) poly(k/\epsilon)) and outputs matrices M ∈ R^{m×k}, N ∈ R^{n×k} such that, with probability at least 9/10,

\|A - MN^T\|_F^2 \le \|A - A_k\|_F^2 + \epsilon \|A\|_F^2.

Column Sampling. We observe that constructing column and row projection-cost preserving sketches requires sampling columns proportional to their relative norms and subsequently subsampling rows proportional to the relative row norms of the sampled, rescaled matrix. In the previous section, we obtained coarse approximations to these norms and thus use our estimates to serve as a proxy for the real distribution. For j ∈ [n], let \tilde{X}_j be our estimate for the column A_{*,j}. We define a probability distribution over the columns of A as q_j = \tilde{X}_j / \sum_{j' ∈ [n]} \tilde{X}_{j'}. Given that we can estimate column norms up to an O(m/b_1)-factor, q_j \ge (b_1/m) \|A_{*,j}\|_2^2 / \|A\|_F^2, where b_1 is a parameter to be set later. Therefore, we oversample columns of A by a \Theta(m/b_1)-factor to construct a column pcp for A. Let AS be a scaled sample of s_1 = \Theta(\frac{m k^2 \log(m)}{b_1 \epsilon^2}) columns of A such that each column is set to A_{*,j}/\sqrt{s_1 q_j} with probability q_j. Then, by Theorem A.2, for any \epsilon > 0, with probability at least 99/100, for all rank-k projection matrices X, \|AS - X AS\|_F^2 = \|A - XA\|_F^2 \pm \epsilon \|A\|_F^2.

Handling Rescaled Columns. We note that during the construction of the column pcp, the j-th column, if sampled, is rescaled by \frac{1}{\sqrt{s_1 q_j}}. Therefore, the resulting matrix AS may no longer be a distance matrix. To address this issue, we partition the columns of AS into weight classes such that the g-th weight class contains column index j if the corresponding scaling factor \frac{1}{\sqrt{q_j}} lies in the interval [(1+\epsilon)^g, (1+\epsilon)^{g+1}). Note, we can ignore the (\frac{1}{\sqrt{s_1}})-factor since every entry is rescaled by the same constant. Formally, W_g = { i ∈ [s_1] | \frac{1}{\sqrt{q_j}} ∈ [(1+\epsilon)^g, (1+\epsilon)^{g+1}) }. Next, with high probability, for all j ∈ [n], if column j is sampled, 1/q_j \le n^c for a large constant c. If instead q_j \le \frac{1}{n^{c'}}, for some c' > c, the probability that the j-th column is sampled would be at most 1/n^{c'}. Union bounding over such events for n columns, the number of weight classes is at most log_{1+\epsilon}(n^c) = O(\epsilon^{-1} log(n)). Let AS|_{W_g} denote the columns of AS restricted to the set of indices in W_g. Observe that all entries in AS|_{W_g} are scaled to within a (1+\epsilon)-factor of each other and therefore satisfy approximate triangle inequality (equation 2.1).

Therefore, row norms of AS|_{W_g} can be computed using Algorithm 1 and the estimator is an O(n/b_2)-approximation (for some parameter b_2), since the guarantee of Lemma 2.1 blows up by a factor of at most (1+\epsilon). Summing over the estimates from each partition above, with probability at least 99/100, we obtain an O(n/b_2)-approximate estimate to \|AS_{i,*}\|_2^2, simultaneously for all i ∈ [m]. However, we note that each iteration of Algorithm 1 reads b_2 m + n entries of A and there are at most O(\epsilon^{-1} log(n)) iterations.

Row Sampling. Next, we construct a row pcp for AS. For i ∈ [m], let \tilde{X}_i be an O(n/b_2)-approximate estimate for \|AS_{i,*}\|_2^2. Let p = {p_1, p_2, ..., p_m} be a distribution over the rows of AS such that p_i = \tilde{X}_i / \sum_i \tilde{X}_i \ge (b_2/n) \|AS_{i,*}\|_2^2 / \|AS\|_F^2. Therefore, we oversample rows by a \Theta(n/b_2) factor to obtain a row pcp for AS. Let TAS be a scaled sample of s_2 = \Theta(\frac{n k^2 \log(n)}{b_2 \epsilon^2}) rows of AS such that each row is set to AS_{i,*}/\sqrt{s_2 p_i} with probability p_i. By the row analogue of Theorem A.2, with probability at least 99/100, for all rank-k projection matrices X, \|TAS - TAS X\|_F^2 \le \|AS - AS X\|_F^2 + \epsilon \|AS\|_F^2.

Input-sparsity Time Low-Rank Approximation. Next, we compute a low-rank approximation for the smaller matrix, TAS, in input-sparsity time.
To this end we use the following theorem from\n[4]:\nTheorem 4.2. (Clarkson-Woodruff LRA.) For A \u2208 Rm\u00d7n, there is an algorithm that with failure\nprobability at most 1/10 \ufb01nds L \u2208 Rm\u00d7k, W \u2208 Rn\u00d7k and a diagonal matrix D \u2208 Rk\u00d7k, such that\n\nF and runs in time O(cid:0)nnz(A) + (n + m)poly( k\n\n(cid:13)(cid:13)A \u2212 LDWT(cid:13)(cid:13)2\n\n\u0001 )(cid:1), where\n\nF \u2264 (1 + \u0001)(cid:107)A \u2212 Ak(cid:107)2\n\nF \u2264 (cid:107)AS\u2212ASX(cid:107)2\n\nnnz(A) is the number of non-zero entries in A.\n\nRunning the input-sparsity time algorithm with the above guarantee on the matrix TAS, we obtain\na rank-k matrix LDWT , such that (cid:107)TAS \u2212 LDWT(cid:107)2\nF where\n(TAS)k is the best rank-k approximation to TAS under the Frobenius norm. Since TAS is a small\nmatrix, we can afford to read all of it by querying at most O\nentries of A\n\n(cid:16) nm log(n) log(m)\nF \u2264 (1 + \u0001)(cid:107)TAS \u2212 (TAS)k(cid:107)2\nand the algorithm runs in time O(cid:0)s1s2 + (s1 + s2)poly( k\n\u0001 )(cid:1).\n\npoly( k\n\u0001 )\n\n(cid:17)\n\nb1b2\n\n6\n\n\fConstructing a solution for A. Note, while LDWT is an approximate rank-k solution for TAS,\nit does not have the right dimensions as A. If we do not consider running time, we could construct a\nlow-rank approximation to A as follows: since projecting TAS onto WT is approximately optimal,\nit follows from Lemma A.1 that with probability 98/100,\n\n(cid:107)AS \u2212 ASWWT(cid:107)2\n\nF = (cid:107)AS \u2212 (AS)k(cid:107)2\n\nF \u00b1 \u0001(cid:107)AS(cid:107)2\n\nF\n\n(4.1)\n\nLet (AS)k = PNT be such that P has orthonormal columns. Then, (cid:107)AS \u2212 PPT AS(cid:107)2\nF =\n(cid:107)AS \u2212 (AS)k(cid:107)2\nF and by Lemma A.1 it follows that with probability 98/100, (cid:107)A \u2212 PPT A(cid:107)2\nF \u2264\nF + \u0001(cid:107)A(cid:107)F . 
However, even approximately computing a column space P for (AS)k using\n(cid:107)A\u2212 Ak(cid:107)2\nan input-sparsity time algorithm is no longer sublinear. To get around this issue, we observe that an\napproximate solution for TAS lies in the row space of WT and therefore, an approximately optimal\nsolution for AS lies in the row space of WT . We then set up the following regression problem:\nminX (cid:107)AS \u2212 XWT(cid:107)2\nF .\nNote, this regression problem is still too big to be solved in sublinear time. Therefore, we sketch it\nby sampling columns of AS according to the leverage scores of WT to set up a smaller regression\nproblem. Formally, we use a theorem of [4] (Theorem 38) to approximately solve this regression\nproblem (also see [9] for previous work.)\nTheorem 4.3. (Fast Regression.) Given a matrix A \u2208 Rm\u00d7n and a rank-k matrix B \u2208 Rm\u00d7k,\nsuch that B has orthonormal columns, the regression problem minX (cid:107)A \u2212 BX(cid:107)2\nF can be solved\n\nup to (1 + \u0001) relative error, with probability at least 2/3 in time O(cid:0)(m log(m) + n)poly( k\n\n\u0001 ) rows and solving minX (cid:107)EA \u2212 EBX(cid:107)2\n\nconstructing a sketch E with poly( k\nguarantee holds for solving minX (cid:107)A \u2212 XB(cid:107)2\nF .\nSince WT has orthonomal rows, the leverage scores are precomputed. With probability at least\n99/100, we can compute XAS = argminX(cid:107)ASE \u2212 XWT E(cid:107)2\nF , where E is a leverage score\n\n(cid:1) columns, as shown in Theorem 4.3.\n\nsketching matrix with poly(cid:0) k\n\n\u0001 )(cid:1) by\n\nF . 
Note, a similar\n\n\u0001\n\n(cid:107)AS \u2212 XASWT(cid:107)2\n\nF \u2264 (1 + \u0001) min\n\n(cid:107)AS \u2212 XWT(cid:107)2\n\nX\n\nF \u2264 (1 + \u0001)(cid:107)AS \u2212 ASWWT(cid:107)2\n= (cid:107)AS \u2212 (AS)k(cid:107)2\nF \u00b1 \u0001(cid:107)AS(cid:107)2\n\nF\n\n(4.2)\n\nthe running time is O(cid:0)(m + s1) log(m)poly(cid:0) k\n\nwhere the last two inequalities follow from equation 4.1. Recall, AS is an m \u00d7 s1 matrix and thus\northonormal columns. Then, the column space of P(cid:48) contains an approximately optimal solution\nfor A, since (cid:107)AS \u2212 P(cid:48)N(cid:48)T(cid:107)2\nF \u00b1 \u0001(cid:107)AS(cid:107)2\nF and AS is a column pcp for A. It\nfollows from Lemma A.1 that with probability at least 98/100,\n\n(cid:1)(cid:1). Let XASWT = P(cid:48)N(cid:48)T be such that P(cid:48) has\n\nF\n\n\u0001\n\nF \u2264 (cid:107)A \u2212 Ak(cid:107)F + \u0001(cid:107)A(cid:107)F\n\n(4.3)\nTherefore, there exists a good solution for A in the column space of P(cid:48). Since we cannot compute\nthis explicitly, we set up the following regression problem: minX (cid:107)A \u2212 P(cid:48)X(cid:107)2\nF . Again, we sketch\nthe regression problem above by sampling columns of A according to the leverage scores of P(cid:48). We\ncan then compute XA = argminX(cid:107)E(cid:48)A \u2212 E(cid:48)P(cid:48)X(cid:107)2\nF with probability at least 99/100, where E(cid:48) is\n\n(cid:1) rows. 
Then, using the properties of leverage score sampling from Theorem 4.3,

$\|A - P'X_A\|_F^2 \le (1+\epsilon)\min_X \|A - P'X\|_F^2 \le (1+\epsilon)\|A - P'P'^TA\|_F^2 \le \|A - A_k\|_F^2 + O(\epsilon)\|A\|_F^2$ \hfill (4.4)

where the second inequality follows since the minimizing $X$ can only improve on the particular choice $X = P'^TA$, and the last inequality follows from equation 4.3. Recall, $P'$ is an $m \times k$ matrix, and by Theorem 38 of [4], the time taken to solve the regression problem is $O((m\log(m) + n)\,\mathrm{poly}(k/\epsilon))$. Therefore, we observe that $P'X_A$ suffices, and we output it in factored form by setting $M = P'$ and $N = X_A^T$. Union bounding over the probabilistic events, and rescaling $\epsilon$, with probability at least $9/10$, Algorithm 2 outputs $M \in \mathbb{R}^{m \times k}$ and $N \in \mathbb{R}^{n \times k}$ such that the guarantees of Theorem 4.1 are satisfied.

Finally, we analyze the overall running time of Algorithm 2. Computing the estimates for the column norms and constructing the column pcp for $A$ has running time $O\left(\frac{m\log(m)}{b_1}\mathrm{poly}(k/\epsilon) + b_1 n + m\right)$. A similar guarantee holds for the rows.

Table 1: Running Time (in seconds) on the Clustering Dataset for Rank = 20

Metric    SVD      IS      Sublinear
ℓ2        398.76   8.94    1.69
ℓ1        410.60   8.15    1.81
ℓ∞        427.90   9.18    1.63
ℓc        452.16   8.49    1.76

Table 2: Running Time (in seconds) on the MNIST Dataset for Rank = 40

Metric    SVD      IS      Sublinear
ℓ2        398.50   34.32   4.16
ℓ1        560.90   39.51   3.72
ℓ∞        418.01   39.32   3.99
ℓc        390.07   38.33   3.91
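The column pcp underlying this analysis is built by sampling columns with probability proportional to their (estimated) squared norms, in the spirit of [10]. A minimal numpy sketch of that sampling step, simplified to use exact column norms rather than the estimates Algorithm 2 computes:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_columns_by_norm(A, s):
    """Return an m x s sketch AS whose columns are sampled with probability
    proportional to the squared column norms of A, then rescaled. With this
    rescaling ||AS||_F^2 = ||A||_F^2 exactly, and for s large enough AS
    approximately preserves rank-k projection costs up to additive error.
    (Simplified: exact column norms, not the paper's estimates.)"""
    sq = (A ** 2).sum(axis=0)                  # squared column norms
    p = sq / sq.sum()                          # sampling distribution
    idx = rng.choice(A.shape[1], size=s, p=p)
    return A[:, idx] / np.sqrt(s * p[idx])     # rescale each sampled column

# Toy rank-3 matrix: the sketch keeps the Frobenius mass and stays rank <= 3,
# since its columns are (scaled) columns of A.
m, n, k = 60, 500, 3
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
AS = sample_columns_by_norm(A, s=100)
```

Note that the rescaling makes each sampled column contribute exactly $\|A\|_F^2 / s$ of squared mass, so the Frobenius norm is preserved deterministically, not just in expectation.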
The input-sparsity time low-rank approximation algorithm runs in $O(s_1 s_2 + (s_1 + s_2)\,\mathrm{poly}(k/\epsilon))$, and constructing a solution for $A$ is dominated by $O((m\log(m) + n)\,\mathrm{poly}(k/\epsilon))$. Setting $b_1 = \epsilon n^{0.34}/\log(n)$ and $b_2 = \epsilon m^{0.34}/\log(m)$, we note that the overall running time is $\widetilde{O}((m^{1.34} + n^{1.34})\,\mathrm{poly}(k/\epsilon))$, where $\widetilde{O}$ hides $\log(m)$ and $\log(n)$ factors. This completes the proof of Theorem 4.1.

We extend the above result to obtain a better exponent in the running time. The critical observation here is that we can recursively sketch rows and columns of $A$ such that all rank-$k$ projections are preserved. Intuitively, it is important to preserve all rank-$k$ subspaces since we do not know which projection will be approximately optimal when we recurse back up. At a high level, the algorithm is then to recursively sub-sample columns and rows of $A$ such that we obtain projection-cost preserving sketches at each step. We defer the details to the Supplementary Material.

Theorem 4.4. Let $A \in \mathbb{R}^{m \times n}$ be a matrix that satisfies the approximate triangle inequality. Then, for any $\epsilon > 0$, integer $k$, and small constant $\gamma > 0$, there exists an algorithm that runs in time $\widetilde{O}((m^{1+\gamma} + n^{1+\gamma})\,\mathrm{poly}(k/\epsilon))$ to output matrices $M \in \mathbb{R}^{m \times k}$ and $N \in \mathbb{R}^{n \times k}$ such that, with probability at least $9/10$, $\|A - MN^T\|_F^2 \le \|A - A_k\|_F^2 + \epsilon\|A\|_F^2$.

5 Relative Error Guarantees

In this section, we consider the relative error guarantee 1.2 for distance matrices. We begin by showing a lower bound for any relative error approximation of distance matrices, which also precludes the possibility of a sublinear bicriteria algorithm outputting a rank-$\mathrm{poly}(k)$ matrix satisfying the rank-$k$ relative error guarantee.

Theorem 5.1.
Let $A$ be an $n \times n$ distance matrix, and let $B$ be a rank-$\mathrm{poly}(k)$ matrix such that $\|A - B\|_F^2 \le c\|A - A_k\|_F^2$ for any constant $c > 1$. Then, any algorithm that outputs such a $B$ requires $\Omega(\mathrm{nnz}(A))$ time.

Euclidean Distance Matrices. We show that in the special case of Euclidean distances, when the entries correspond to squared distances, there exists a bicriteria algorithm that outputs a rank-$(k+4)$ matrix satisfying the relative error rank-$k$ low-rank approximation guarantee. Note, here the point sets $P$ and $Q$ are identical. Let $A$ be such a matrix, i.e., $A_{i,j} = \|x_i - x_j\|_2^2 = \|x_i\|_2^2 + \|x_j\|_2^2 - 2\langle x_i, x_j\rangle$. Then, we can write $A$ as $A_1 + A_2 - 2B$, where each entry in the $i$-th row of $A_1$ is $\|x_i\|_2^2$, each entry in the $j$-th column of $A_2$ is $\|x_j\|_2^2$, and $B$ is a PSD matrix with $B_{i,j} = \langle x_i, x_j\rangle$. The main ingredient we use is the sublinear low-rank approximation of PSD matrices from [14]. We show that there exists an algorithm (see Supplementary Material) that outputs the description of a rank-$(k+4)$ matrix $AWW^T$ in sublinear time such that it satisfies the relative-error rank-$k$ low-rank approximation guarantee.

Theorem 5.2. Let $A$ be a Euclidean distance matrix. Then, for any $\epsilon > 0$ and integer $k$, there exists an algorithm that, with probability at least $9/10$, outputs a rank-$(k+4)$ matrix $WW^T$ such that $\|A - AWW^T\|_F \le (1+\epsilon)\|A - A_k\|_F$, where $A_k$ is the best rank-$k$ approximation to $A$, and runs in $O(n\,\mathrm{poly}(k/\epsilon))$ time.

6 Experiments

In this section, we benchmark the performance of our sublinear time algorithm against the conventional SVD algorithm (optimal error), iterative SVD methods, and the input-sparsity time algorithm from [4]. We use the built-in svd function in numpy's linear algebra package to compute the truncated SVD. We also consider the iterative SVD algorithm implemented by the svds function in scipy's sparse linear algebra package; however, the error it achieves is typically 3 orders of magnitude worse than that of the SVD, so we defer those results to the Supplementary Material. We implement the input-sparsity time low-rank approximation algorithm from [4] using a count-sketch matrix [3].

Figure 6.1: We plot error on a synthetic dataset with 20 clusters and the MNIST dataset. The distance matrix is created using the ℓ2, ℓ1, ℓ∞ and ℓc metrics. We compare the error achieved by the SVD (optimal), our Sublinear Algorithm and the Input Sparsity Algorithm from [4].

Finally, we implement Algorithm 2, as it has small recursive overhead. The experiments are run on a MacBook Pro 2017 with 16GB RAM and a 2.8GHz quad-core 7th-generation Intel Core i7 processor.

The first dataset is a synthetic clustering dataset generated in scikit-learn using the make_blobs function. We generate 10000 points with 200 features, split into 20 clusters. We note that, given the clustered structure, this dataset is expected to have a good rank-20 approximation, as observed in our experiments.
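The experiments form pairwise distance matrices from each dataset's points under several metrics. A pure-numpy stand-in for this construction (the actual experiments may instead use optimized library routines such as scipy.spatial.distance.cdist; the helper below is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)

def distance_matrix(X, metric):
    """Pairwise distances A[i, j] = d(x_i, x_j) for the metrics used in
    the experiments, via numpy broadcasting."""
    D = X[:, None, :] - X[None, :, :]             # n x n x d difference tensor
    if metric == "l1":                             # Manhattan
        return np.abs(D).sum(axis=2)
    if metric == "l2":                             # Euclidean
        return np.sqrt((D ** 2).sum(axis=2))
    if metric == "linf":                           # Chebyshev
        return np.abs(D).max(axis=2)
    if metric == "canberra":                       # sum_t |x_t - y_t| / (|x_t| + |y_t|)
        denom = np.abs(X[:, None, :]) + np.abs(X[None, :, :])
        ratio = np.divide(np.abs(D), denom, out=np.zeros_like(D), where=denom > 0)
        return ratio.sum(axis=2)
    raise ValueError(f"unknown metric {metric!r}")

X = rng.standard_normal((200, 20))                 # 200 points in R^20
A = distance_matrix(X, "l2")
```

The broadcasting approach keeps the code short but materializes an $n \times n \times d$ tensor; for the $10^4 \times 10^4$ matrices in the experiments, a blocked or library-based computation would be preferable.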
The second dataset we use is the popular MNIST dataset, a collection of 70000 handwritten characters, from which we sample 10000 points. In the Supplementary Material we also consider the Gisette and Poker datasets. Given $n$ points $\{p_1, p_2, \ldots, p_n\}$ in $\mathbb{R}^d$, we compute an $n \times n$ distance matrix $A$ such that $A_{i,j} = d(p_i, p_j)$, where $d$ is the Manhattan (ℓ1), Euclidean (ℓ2), Chebyshev (ℓ∞) or Canberra³ (ℓc) distance. We compare the Frobenius norm error of the algorithms in Figure 6.1 and their corresponding running times in Tables 1 and 2. We note that our sublinear time algorithm is only marginally worse in terms of absolute error, but runs 100-250 times faster than the SVD and 8-20 times faster than the input-sparsity time algorithm.

Acknowledgments: The authors thank the National Science Foundation for partial support under Grant No. CCF-1815840. Part of this work was done while the author was visiting the Simons Institute for the Theory of Computing.

³See https://en.wikipedia.org/wiki/Canberra_distance

References

[1] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, pages 499-508, 2015.

[2] Robert Cattral, Franz Oppacher, and Dwight Deugo. Evolutionary data mining with automatic rule generalization. Recent Advances in Computers, Computing and Communications, 1(1):296-300, 2002.

[3] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693-703. Springer, 2002.

[4] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time.
In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC 2013, pages 81-90. ACM, 2013.

[5] Michael B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, pages 278-287, 2016.

[6] Michael B. Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1758-1777, 2017.

[7] Erik D. Demaine, Francisco Gomez-Martin, Henk Meijer, David Rappaport, Perouz Taslakian, Godfried T. Toussaint, Terry Winograd, and David R. Wood. The distance geometry of music. Comput. Geom., 42(5):429-454, 2009.

[8] Ivan Dokmanic, Reza Parhizkar, Juri Ranieri, and Martin Vetterli. Euclidean distance matrices: essential theory, algorithms, and applications. IEEE Signal Processing Magazine, 32(6):12-30, 2015.

[9] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.

[10] Alan M. Frieze, Ravi Kannan, and Santosh Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025-1041, 2004.

[11] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552, 2005.

[12] Viren Jain and Lawrence K. Saul. Exploratory analysis and visualization of speech and music by locally linear embedding. Departmental Papers (CIS), pages iii-984, 2004.

[13] Xiangrui Meng and Michael W. Mahoney.
Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC 2013, pages 91-100, 2013.

[14] Cameron Musco and David P. Woodruff. Sublinear time low-rank approximation of positive semidefinite matrices. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 672-683, 2017.

[15] Jelani Nelson and Huy L. Nguyen. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, pages 117-126, 2013.

[16] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143-152, 2006.

[17] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

[18] Kilian Q. Weinberger and Lawrence K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), pages 988-995, 2004.