{"title": "Dimensionality Reduction of Massive Sparse Datasets Using Coresets", "book": "Advances in Neural Information Processing Systems", "page_first": 2766, "page_last": 2774, "abstract": "In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principal Component Analysis (PCA) of any $n\\times d$ matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the $n$ rows that approximates their sum of squared distances to \\emph{every} $k$-dimensional \\emph{affine} subspace. An open theoretical problem has been to compute such a coreset that is independent of both $n$ and $d$. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.", "full_text": "Dimensionality Reduction of Massive Sparse Datasets Using Coresets

Dan Feldman, University of Haifa, Haifa, Israel (dannyf.post@gmail.com)
Mikhail Volkov, CSAIL, MIT, Cambridge, MA, USA (mikhail@csail.mit.edu)
Daniela Rus, CSAIL, MIT, Cambridge, MA, USA (rus@csail.mit.edu)

Abstract

In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principal Component Analysis (PCA) of any n × d matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the n rows that approximates their sum of squared distances to every k-dimensional affine subspace.
An open theoretical problem has been to compute such a coreset that is independent of both n and d. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.

1 Introduction

Algorithms for dimensionality reduction usually aim to project an input set of d-dimensional vectors (database records) onto a k ≤ d − 1 dimensional affine subspace that minimizes the sum of squared distances to these vectors, under some constraints. Special cases include Principal Component Analysis (PCA), linear regression (k = d − 1), low-rank approximation (k-SVD), Latent Dirichlet Allocation (LDA) and non-negative matrix factorization (NNMF). Learning algorithms such as k-means clustering can then be applied to the low-dimensional data to obtain fast approximations with provable guarantees. To our knowledge, unlike for SVD, there are no algorithms or coreset constructions with performance guarantees for computing the PCA of sparse n × n matrices in the streaming model, i.e. using memory that is poly-logarithmic in n. Many of the large scale high-dimensional data sets available today (e.g. image streams, text streams, etc.) are sparse. For example, consider the text case of Wikipedia. We can associate a matrix with Wikipedia, where the English words define the columns (approximately 1.4 million) and the individual documents define the rows (approximately 4.4 million documents). This large scale matrix is sparse because most English words do not appear in most documents. The size of this matrix is huge and no existing dimensionality reduction algorithm can compute its eigenvectors.
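A back-of-the-envelope calculation in Python makes the size problem concrete. The matrix dimensions are from the text; the 8-byte entries, the average of 200 distinct words per document, and the 12 bytes per stored sparse entry are illustrative assumptions, not figures from the paper:

```python
# Memory estimate for the Wikipedia document-term matrix.
n_docs = 4_400_000       # rows: documents (from the text)
n_words = 1_400_000      # columns: distinct English words (from the text)

# Dense storage at 8 bytes per double-precision entry:
dense_bytes = n_docs * n_words * 8
print(f"dense: {dense_bytes / 1e12:.1f} TB")    # ~49.3 TB: far beyond RAM

# Sparse storage, assuming a hypothetical average of 200 distinct words per
# document and roughly 12 bytes per stored entry (value + column index):
nnz_per_doc = 200
sparse_bytes = n_docs * nnz_per_doc * 12
print(f"sparse: {sparse_bytes / 1e9:.1f} GB")   # ~10.6 GB: fits on one machine
```

This five-orders-of-magnitude gap is why any method that densifies rows (e.g. by random projection) exhausts memory almost immediately.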
Indeed, running the state-of-the-art SVD implementation from GenSim on the Wikipedia document-term matrix crashes the computer very quickly after applying its step of random projection to the first few thousand documents. This is because such dense vectors, each of length 1.4 million, use all of the computer's RAM capacity.

Support for this research has been provided by Hon Hai/Foxconn Technology Group and NSF SaTC-BSF CNC 1526815, and in part by the Singapore MIT Alliance on Research and Technology through the Future of Urban Mobility project and by Toyota Research Institute (TRI). TRI provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. We are grateful for this support.

Submitted to 30th Conference on Neural Information Processing Systems (NIPS 2016). Do not distribute.

Definition 1 ((k, ε)-coreset). Given an n × d matrix A whose rows a_1, ..., a_n are n points (vectors) in R^d, an error parameter ε ∈ (0, 1], and an integer k ∈ [1, d − 1] = {1, ..., d − 1} that represents the desired dimensionality reduction, a (k, ε)-coreset for A is a weighted subset C = {w_i a_i | w_i > 0 and i ∈ [n]} of the rows of A, where w = (w_1, ..., w_n) ∈ [0, ∞)^n is a non-negative weight vector, such that for every affine k-subspace S in R^d we have

|dist²(A, S) − dist²(C, S)| ≤ ε dist²(A, S),    (1)

where dist²(A, S) := ∑_{i=1}^n dist²(a_i, S) denotes the sum of squared distances from the rows of A to S.

In this paper we present a dimensionality reduction algorithm that can handle very large scale sparse data sets such as Wikipedia and returns provably correct results. A long-open research question has been whether we can have a coreset for PCA that is both small in size and a subset of the original data. In this paper we answer this question affirmatively and provide an efficient construction.
We also show that this algorithm provides a practical solution to a long-standing open practical problem: computing the PCA of large matrices such as those associated with Wikipedia.

2 Problem Formulation

Given a matrix A, a coreset C in this paper is defined as a weighted subset of rows of A such that the sum of squared distances from any given k-dimensional subspace to the rows of A is approximately the same as the sum of squared weighted distances to the rows in C. Formally, for a compact set S ⊆ R^d and a vector x in R^d, we denote the Euclidean distance between x and its closest point in S by

dist²(x, S) := min_{s ∈ S} ‖x − s‖₂².

For an n × d matrix A whose rows are a_1, ..., a_n, we define the sum of the squared distances from A to S by

dist²(A, S) := ∑_{i=1}^n dist²(a_i, S).

A (k, ε)-coreset C (Definition 1) then satisfies

|dist²(A, S) − dist²(C, S)| ≤ ε dist²(A, S).

That is, the sum of squared distances from the n points to S approximates the sum of squared weighted distances ∑_{i=1}^n w_i² dist(a_i, S)² to S, up to a multiplicative factor of 1 ± ε. By choosing w = (1, ..., 1) we obtain a trivial (k, 0)-coreset. However, in a more efficient coreset most of the weights will be zero and the corresponding rows in A can be discarded. The cardinality of the coreset is thus the sparsity of w, given by |C| = ‖w‖₀ := |{i ∈ [n] : w_i ≠ 0}|. If C is small, then the computation is efficient. Because C is a weighted subset of the rows of A, if A is sparse, then C is also sparse.
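The quantities above can be evaluated numerically. The following Python sketch (illustrative code with random data and hypothetical helper names, not part of the paper) computes the squared distances to a random affine k-subspace and verifies that the trivial weight vector w = (1, ..., 1) is a (k, 0)-coreset of cardinality ‖w‖₀ = n:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3
A = rng.standard_normal((n, d))

def dist2(P, B, t):
    """Squared distance of each row of P to the affine k-subspace {t + B y},
    where B is d x k with orthonormal columns and t is a point on it."""
    R = P - t                          # shift so the subspace passes the origin
    return np.sum((R - R @ B @ B.T) ** 2, axis=1)

def coreset_cost(A, w, B, t):
    """Weighted coreset cost sum_i w_i^2 * dist(a_i, S)^2, as in the text."""
    return float(w**2 @ dist2(A, B, t))

# A random affine k-subspace S = {t + B y}:
B, _ = np.linalg.qr(rng.standard_normal((d, k)))
t = rng.standard_normal(d)

# w = (1, ..., 1) is the trivial (k, 0)-coreset: zero error, no compression.
w = np.ones(n)
assert np.isclose(coreset_cost(A, w, B, t), dist2(A, B, t).sum())
print(int(np.count_nonzero(w)))        # cardinality |C| = ||w||_0 -> 200
```

A non-trivial coreset would make most entries of w zero while keeping the two costs within a 1 ± ε factor for every subspace.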
A long-open research question has been whether we can have such a coreset that is both of size independent of the input dimensions (n and d) and a subset of the original input rows.

2.1 Related Work

In [24] it was recently proved that a (k, ε)-coreset of size |C| = O(dk³/ε²) exists for every input matrix, and for distances to the power of z ≥ 1 where z is constant. The proof is based on a general framework for constructing different kinds of coresets known as sensitivity [10, 17]. This coreset is efficient for tall matrices, since its cardinality is independent of n. However, it is useless for "fat" or square matrices (such as the Wikipedia matrix above), where d is of the order of n, which is the main motivation for our paper. In [5], the Frank-Wolfe algorithm was used to construct different types of coresets than ours, and for different problems. Our approach is based on a solution that we give to an open problem in [5]; however, we can see how it can be used to compute the coresets in [5] and vice versa. For the special case z = 2 (sum of squared distances), a coreset of size O(k/ε²) was suggested in [7], with a randomized version in [8] for a stream of n points that, unlike the standard approach of using merge-and-reduce trees, returns a coreset of size independent of n with constant probability. These results minimize the ‖·‖₂ error, while our result minimizes the Frobenius norm, which is always higher, and may be higher by a factor of d. After appropriate weighting, we can apply uniform sampling of size O(k/ε²) to get a coreset with a small Frobenius error [14], as in our paper. However, in this case the probability of success is only constant. Since in the streaming case we compute roughly n coresets (formally, O(n/m) coresets, where m is the size of the coreset), the probability that all these coreset constructions will succeed is close to zero (roughly 1/n).
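The compounding of per-coreset failure can be illustrated with concrete numbers (purely illustrative; the success probability, stream length, and coreset size below are assumptions, not values from the paper):

```python
# If each randomized construction succeeds only with constant probability p,
# then all of the roughly n/m constructions in a stream must succeed at once.
p = 0.9            # assumed success probability of one construction
n = 1_000_000      # assumed stream length
m = 1_000          # assumed coreset size
overall = p ** (n / m)
print(f"{overall:.3e}")   # on the order of 1e-46: essentially zero
```

Hence constant-probability guarantees do not survive composition over a long stream, which motivates the deterministic construction in this paper.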
Since the probability of failure in [14] decreases only linearly with the size of the coreset, getting a constant probability of success in the streaming model for O(n) coresets would require taking coresets of size no smaller than the input itself.

There are many papers, especially in recent years, on data compression for computing the SVD of large matrices. None of these works addresses the fundamental problem of computing a sparse approximated PCA for a matrix that is large in both its rows and columns, such as Wikipedia. The reason is that current results use sketches, which do not preserve the sparsity of the data (e.g. because of using random projections). Hence, neither the sketch nor the PCA computed on the sketch is sparse. In contrast, we define a coreset as a small weighted subset of rows, which is thus sparse if the input is sparse. Moreover, the low rank approximation of a coreset is sparse, since each of its right singular vectors is a sum of a small set of sparse vectors. While there are coreset constructions as defined in this paper, all of them have cardinality of at least d points, which makes them impractical for large data matrices, where d ≥ n. In what follows we describe these recent results in detail.

The recent results in [7, 8] suggest coresets that are similar to our definition of coresets (i.e., weighted subsets), and do preserve sparsity. However, as mentioned above, they minimize the 2-norm error and not the larger Frobenius error, and, perhaps more importantly, they provide coresets for k-SVD (i.e., k-dimensional subspaces) and not for PCA (k-dimensional affine subspaces that might not intersect the origin). In addition, [8] works with constant probability, while our algorithm is deterministic (works with probability 1).

Software.
Popular software for computing SVD such as GenSim [21], redsvd [12] or the MATLAB sparse SVD function (svds) use sketches and crash for inputs of a few thousand documents and a dimensionality reduction (approximation rank) k < 100 on a regular laptop, as expected from the analysis of their algorithms. This is why existing implementations (including GenSim) extract topics from large matrices (e.g. Wikipedia) based on a low-rank approximation of only a small subset of a few thousand selected words (matrix columns), and not the complete Wikipedia matrix. Even for k = 3, running the implementation of sparse SVD in Hadoop [23] took several days [13]. Next we give a broad overview of the latest dimensionality reduction methods that such systems employ under the hood, such as the Lanczos algorithm [16] for large matrices.

Coresets. Following a decade of research, it was recently proved in [24] that a (k, ε)-coreset for low rank approximation of size |C| = O(dk³/ε²) exists for every input matrix, via the sensitivity framework [10, 17]. As noted above, this coreset is efficient for tall matrices, since its cardinality is independent of n, but useless for "fat" or square matrices such as the Wikipedia matrix, where d is of the order of n. In [5], the Frank-Wolfe algorithm was used to construct different types of coresets than ours, and for different problems; our approach is based on a solution that we give to an open problem in [5].

Sketches. A sketch in the context of matrices is a set of vectors u_1, ..., u_s in R^d such that the sum of squared distances ∑_{i=1}^n dist(a_i, S)² from the input n points to every k-dimensional subspace S in R^d can be approximated by ∑_{i=1}^s dist(u_i, S)² up to a multiplicative factor of 1 ± ε.
Note that even if the input vectors a_1, ..., a_n are sparse, the sketched vectors u_1, ..., u_s in general are not sparse, unlike the case of coresets. A sketch of cardinality d can be constructed with no approximation error (ε = 0) by defining u_1, ..., u_d to be the d rows of the matrix DV^T, where UDV^T = A is the SVD of A. It was proved in [11] that taking the first O(k/ε) rows of DV^T yields such a sketch, i.e. of size independent of n and d.

The first sketch for sparse matrices was suggested in [6], but like more recent results, it assumes that the complete matrix fits in memory. Other sketching methods that usually do not support streaming include random projections [2, 1, 9] and randomly combined rows [20, 25, 22, 18].

The Lanczos Algorithm. The Lanczos method [19] and its variant [15] multiply a large matrix by a vector for a few iterations to get its largest eigenvector v_1. The computation is then applied recursively after projecting the matrix onto the hyperplane orthogonal to v_1. However, v_1 is in general not sparse even if A is sparse. Hence, when we project A onto the subspace orthogonal to v_1, the resulting matrix is dense for the rest of the computations (k > 1). Indeed, our experimental results show that the MATLAB svds function, which uses this method, runs faster than the exact SVD, but crashes on large input, even for small k.

This paper builds on this extensive body of prior work in dimensionality reduction, and our approach uses coresets to solve the time and space challenges.

2.2 Key Contributions

Our main result is the first algorithm for computing a (k, ε)-coreset C of size independent of both n and d, for any given n × d input matrix. The algorithm takes as input a finite set of d-dimensional vectors, a desired approximation error ε, and an integer k ≥ 0.
It returns a weighted subset S (coreset) of O(k²/ε²) such vectors. This coreset S can be used to approximate the sum of squared distances from the matrix A ∈ R^{n×d}, whose rows are the n vectors seen so far, to any k-dimensional affine subspace in R^d, up to a factor of 1 ± ε. For a (possibly unbounded) stream of such input vectors the coreset can be maintained at the cost of an additional factor of log² n.

The polynomial dependency on d of the cardinality of previous coresets made them impractical for fat or square input matrices, such as Wikipedia, images in a sparse feature space representation, or the adjacency matrix of a graph. If each row of the input matrix A has O(nnz) non-zero entries, then the update time per insertion, the overall memory used by our algorithm, and the low rank approximation of the coreset S are all O(nnz · k²/ε²), i.e. independent of n and d.

We implemented our algorithm to obtain a low-rank approximation for the term-document matrix of Wikipedia with provable error bounds. Since our streaming algorithm is also "embarrassingly parallel", we ran it on the Amazon Cloud, and achieved significantly better running time and accuracy compared to existing heuristics (e.g. Hadoop/MapReduce) that yield non-sparse solutions.

The key contributions in this work are:

1. A new algorithm for dimensionality reduction of sparse data that uses a weighted subset of the data, and is independent of both the size and dimensionality of the data.

2. An efficient algorithm for computing such a reduction, with provable bounds on size and running time (cf. http://people.csail.mit.edu/mikhail/NIPS2016).
3. A system that implements this dimensionality reduction algorithm, and an application of the system to compute latent semantic analysis (LSA) of the entire English Wikipedia.

3 Technical Solution

Given an n × d matrix A, we propose a construction mechanism for a matrix C of size |C| = O(k²/ε²) and claim that it is a (k, ε)-coreset for A. We use the following corollary of Definition 1 of a coreset, based on simple linear algebra that follows from the geometric definitions (e.g. see [11]).

Property 1 (Coreset for sparse matrix). Let A ∈ R^{n×d}, let k ∈ [1, d − 1] be an integer, and let ε > 0 be an error parameter. For a diagonal matrix W ∈ R^{n×n}, the matrix C = W A is a (k, ε)-coreset for A if for every matrix X ∈ R^{d×(d−k)} such that X^T X = I we have

(i) |1 − ‖W AX‖ / ‖AX‖| ≤ ε,  and  (ii) ‖A − W A‖ < ε var(A),    (2)

where var(A) is the sum of squared distances from the rows of A to their mean.

The goal of this paper is to prove that such a coreset (Definition 1) exists for any matrix A (Property 1) and can be computed efficiently. Formally,

Theorem 1. For every input matrix A ∈ R^{n×d}, an error ε ∈ (0, 1] and an integer k ∈ [1, d − 1]:

(a) there is a (k, ε)-coreset C of size |C| = O(k²/ε²);
(b) such a coreset can be constructed in O(k²/ε²) time.

Theorem 1 is the formal statement of the main technical contribution of this paper. Sections 3–5 constitute a proof of Theorem 1. To establish Theorem 1(a), we first state our two main results (Theorems 2 and 3) axiomatically, and show how they combine such that Property 1 holds. Thereafter we prove these results in Sections 4 and 5, respectively.
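Condition (i) of Property 1 can be probed empirically. The sketch below (illustrative Python, not the authors' code) samples random orthonormal matrices X and reports the worst observed value of |1 − ‖WAX‖/‖AX‖|; note this is a sampled check, not a proof, since the property quantifies over all X. The trivial weighting W = I attains ε = 0, as expected:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 300, 8, 2
A = rng.standard_normal((n, d))

def check_property_i(A, W, k, trials=100):
    """Largest observed |1 - ||WAX|| / ||AX||| over random orthonormal
    X in R^{d x (d-k)}. Norms are Frobenius (numpy's matrix default)."""
    d = A.shape[1]
    worst = 0.0
    for _ in range(trials):
        X, _ = np.linalg.qr(rng.standard_normal((d, d - k)))
        ratio = np.linalg.norm(W @ A @ X) / np.linalg.norm(A @ X)
        worst = max(worst, abs(1.0 - ratio))
    return worst

W = np.eye(n)                      # trivial weighting: W A = A
print(check_property_i(A, W, k))   # 0.0: condition (i) holds with eps = 0
```

A genuine coreset replaces W = I by a diagonal W with only O(k²/ε²) non-zero entries while keeping this quantity below ε.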
To prove Theorem 1(b) (efficient construction) we present an algorithm for computing a matrix C, and analyze the running time to show that C can be constructed in O(k²/ε²) iterations.

[Algorithm 1 CORESET-SUMVECS(A, ε): (a) coreset for sum of vectors algorithm; (b) illustration showing the first 3 steps of the computation.]

Let A ∈ R^{n×d} be a matrix of rank d, and let UΣV^T = A denote its full SVD. Let W ∈ R^{n×n} be a diagonal matrix. Let k ∈ [1, d − 1] be an integer. For every i ∈ [n] let

v_i = ( U_{i,1}, ..., U_{i,k}, U_{i,k+1:d} Σ_{k+1:d,k+1:d} / ‖Σ_{k+1:d,k+1:d}‖, 1 ).    (3)

Then the following two results hold:

Theorem 2 (Coreset for sum of vectors). For every set of n vectors v_1, ..., v_n in R^d and every ε ∈ (0, 1), a weight vector w ∈ (0, ∞)^n of sparsity ‖w‖₀ ≤ 1/ε² can be computed deterministically in O(nd/ε) time such that

‖ ∑_{i=1}^n v_i − ∑_{i=1}^n w_i v_i ‖ ≤ ε ∑_{i=1}^n ‖v_i‖².    (4)

Section 4 establishes a proof of Theorem 2.

Theorem 3 (Coreset for low rank approximation). For every X ∈ R^{d×(d−k)} such that X^T X = I,

| 1 − ‖W AX‖² / ‖AX‖² | ≤ 5 ‖ ∑_{i=1}^n (v_i v_i^T − W_{i,i} v_i v_i^T) ‖.    (5)

Section 5 establishes a proof of Theorem 3.

3.1 Proof of Theorem 1

Proof of Theorem 1(a).
Replacing v_i with v_i v_i^T, ‖v_i‖² with ‖v_i v_i^T‖, and ε by ε/(5k) in Theorem 2 yields

‖ ∑_i (v_i v_i^T − W_{i,i} v_i v_i^T) ‖ ≤ (ε/(5k)) ∑_{i=1}^n ‖v_i v_i^T‖ ≤ ε/5.

Algorithm 1 CORESET-SUMVECS(A, ε)
1: Input: A: n input points a_1, ..., a_n in R^d
2: Input: ε ∈ (0, 1): the approximation error
3: Output: w ∈ [0, ∞)^n: non-negative weights
4: A ← A − mean(A)
5: A ← cA where c is a constant s.t. var(A) = 1
6: w ← (1, 0, ..., 0)
7: j ← 1, p ← A_j, J ← {j}
8: M_j = {y² | y = A · A_j^T}
9: for i = 1, ..., n do
10:   j ← argmin{w_J · M_J}
11:   G ← W₀ · A_J where (W₀)_{i,i} = √w_i
12:   ‖c‖ = ‖G^T G‖²_F
13:   c · p = ∑_{i=1}^{|J|} G p^T
14:   ‖c − p‖ = √(1 + ‖c‖² − 2 c · p)
15:   comp_p(v) = (1 − c · p) / ‖c − p‖
16:   ‖c − c′‖ = ‖c − p‖ − comp_p(v)
17:   α = ‖c − c′‖ / ‖c − p‖
18:   w ← w(1 − |α|)
19:   w_j ← w_j + α
20:   w ← w / ∑_{i=1}^n w_i
21:   M_j ← {y² | y = A · A_j^T}
22:   J ← J ∪ {j}
23:   if ‖c‖² ≤ ε then
24:     break
25:   end if
26: end for
27: return w

[Figure: illustration of the first three steps of the computation, showing input points a_1, ..., a_5 on the unit circle and the centers c_1 = a_1, c_2, c_3.]

Combining this inequality with (5) gives

| 1 − ‖W AX‖² / ‖AX‖² | ≤ 5 ‖ ∑_{i=1}^n (v_i v_i^T − W_{i,i} v_i v_i^T) ‖ ≤ ε.

Thus the left-most term is bounded by the right-most term, which proves (2). This also means that C = W A is a coreset for k-SVD, i.e., (non-affine) k-dimensional subspaces.
To support PCA (affine subspaces) the coreset C = W A needs to satisfy the expression in the last line of Property 1 regarding its mean. This holds using the last entry (one) in the definition of v_i in (3), which implies that the sum of the rows is preserved as in equation (4). Therefore Property 1 holds for C = W A, which proves Theorem 1(a).

Claim Theorem 1(b) follows from a simple analysis of Algorithm 2, which implements this construction.

4 Coreset for Sum of Vectors (k = 0)

In order to prove the general result of Theorem 1(a), that is, the existence of a (k, ε)-coreset for any k ∈ [1, d−1], we first establish the special case k = 0. In this section, we prove Theorem 2 by providing an algorithm for constructing a small weighted subset of points that constitutes a general approximation for the sum of vectors.

To this end, we first introduce an intermediate result which shows that given n points on the unit ball with weight distribution z, there exists a small subset of points whose weighted mean is approximately the same as the weighted mean of the original points.

Let D_n denote the union over every vector z ∈ [0, 1]^n that represents a distribution, i.e., ∑_i z_i = 1. Our first technical result is that for any finite set of unit vectors a_1, ..., a_n in R^d, any distribution z ∈ D_n, and every ε ∈ (0, 1], we can compute a sparse weight vector w ∈ D_n of sparsity (non-zero entries) ‖w‖₀ ≤ 1/ε².

Lemma 1. Let z ∈ D_n be a distribution over n unit vectors a_1, ..., a_n in R^d. For ε ∈ (0, 1), a sparse weight vector w ∈ D_n of sparsity s ≤ 1/ε² can be computed in O(nd/ε²) time such that

‖ ∑_{i=1}^n z_i a_i − ∑_{i=1}^n w_i a_i ‖² ≤ ε.    (6)

Proof of Lemma 1.
Please see Supplementary Material, Section A.

We prove Theorem 2 by providing a computation of such a sparse weight vector w. The intuition for this computation is as follows. Given n input points a_1, ..., a_n in R^d with weighted mean ∑_i z_i a_i = 0, we project all the points onto the unit sphere. Pick an arbitrary starting point a_1 = c_1. At each step, find the farthest point a_{j+1} from c_j, and compute c_{j+1} by projecting the origin onto the line segment [c_j, a_{j+1}]. Repeat this for j = 1, ..., N iterations, where N = 1/ε². We prove that ‖c_i‖² = 1/i; thus if we iterate 1/ε² times, this norm will be ‖c_{1/ε²}‖² = ε². The resulting points c_i are weighted linear combinations of a small subset of the input points. The output weight vector w ∈ D_n satisfies c_N = ∑_{i=1}^n w_i a_i, and this weighted subset forms the coreset.

Fig. 1a contains the pseudocode for Algorithm 1. Fig. 1b illustrates the first steps of the main computation (lines 9–26). Please see Supplementary Material, Section C for a complete line-by-line analysis of Algorithm 1.

Proof of Theorem 2. The proof of Theorem 2 follows by applying Lemma 1 after normalization of the input points and then post-processing the output.

5 Coreset for Low Rank Approximation (k > 0)

In Section 4 we presented a new coreset construction for approximating the sum of vectors, showing that given n points on the unit ball there exists a small weighted subset of points that is a coreset for those points. In this section we describe the reduction of Algorithm 1 for k = 0 to an efficient algorithm for low rank approximation with any k ∈ [1, d−1].

[Algorithm 2 CORESET-LOWRANK(A, k, ε): (a) initialization; (b) computation.]

Conceptually, we achieve this reduction in two steps.
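As a reference point for this reduction, the k = 0 iteration described above can be sketched in a few lines of Python. This is a simplified reimplementation of the stated intuition (center, normalize to the unit sphere, then repeatedly project the origin onto the segment toward the farthest point), not the authors' MATLAB code, and it uses the uniform distribution z = (1/n, ..., 1/n) as an assumption:

```python
import numpy as np

def coreset_sum_vecs(A, eps):
    """Sketch of the k = 0 construction: after N = ceil(1/eps^2) steps the
    center c is a sparse convex combination sum_i w_i p_i of the normalized
    points with ||c|| <= eps, matching the ||c_i||^2 ~ 1/i analysis."""
    n = A.shape[0]
    P = A - A.mean(axis=0)                            # center at the origin
    P = P / np.linalg.norm(P, axis=1, keepdims=True)  # project to unit sphere
    w = np.zeros(n)
    w[0] = 1.0
    c = P[0].copy()                                   # c_1 = first point
    for _ in range(int(np.ceil(1.0 / eps**2))):
        j = int(np.argmin(P @ c))                     # farthest point from c
        u = P[j] - c
        # Project the origin onto the segment [c, P[j]]:
        alpha = np.clip(-(c @ u) / (u @ u), 0.0, 1.0)
        w *= 1.0 - alpha                              # c stays sum_i w_i P_i
        w[j] += alpha
        c = c + alpha * u                             # c = (1-a) c + a P[j]
    return w, c

rng = np.random.default_rng(2)
A = rng.standard_normal((5000, 20))
w, c = coreset_sum_vecs(A, eps=0.1)
print(np.linalg.norm(c) <= 0.1)        # True: ||c_N|| <= eps
print(np.count_nonzero(w) <= 101)      # True: at most 1/eps^2 new indices
```

Only about 1/ε² of the 5000 weights are non-zero, which is the sparsity claimed by Lemma 1; the real algorithm additionally maps the weights back to the unnormalized input points.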
The first step is to show that Algorithm 1 can be reduced to an inefficient computation for low rank approximation of matrices. To this end, we first prove Theorem 3, thus completing the existence clause of Theorem 1(a).

Proof of Theorem 3. Let ε = ‖∑_{i=1}^n (1 − W²_{i,i}) v_i v_i^T‖. For every i ∈ [n] let t_i = 1 − W²_{i,i}. Set X ∈ R^{d×(d−k)} such that X^T X = I. Without loss of generality we assume V^T = I, i.e. A = UΣ; otherwise we replace X by V^T X. It thus suffices to prove that |∑_i t_i ‖A_{i,:} X‖²| ≤ 5ε ‖AX‖². Using the triangle inequality, we get

|∑_i t_i ‖A_{i,:} X‖²| ≤ |∑_i t_i ‖A_{i,:} X‖² − ∑_i t_i ‖(A_{i,1:k}, 0) X‖²|    (7)
    + |∑_i t_i ‖(A_{i,1:k}, 0) X‖²|.    (8)

We complete the proof by deriving bounds on (7) and (8), thus proving (5). For the complete proof, please see Supplementary Material, Section B.

Together, Theorems 2 and 3 show that the error of the coreset is a 1 ± ε approximation to the true weighted mean. By Theorem 3, we can now simply apply Algorithm 1 to the right hand side of (5) to compute the reduction. The intuition for this inefficient reduction is as follows. We first compute the outer product of each row vector x of the input matrix A ∈ R^{n×d}. Each such outer product x^T x is a matrix in R^{d×d}. Next, we expand every such matrix into a vector in R^{d²} by concatenating its entries. Finally, we collect each such vector as a row of the matrix P ∈ R^{n×d²}.
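The identity that makes this expansion useful can be checked directly: inner products of flattened outer products are squared inner products of the original vectors, and sparsity is preserved up to squaring (illustrative Python, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Each row of the expanded matrix P in R^{n x d^2} is a flattened outer product:
px = np.outer(x, x).ravel()
py = np.outer(y, y).ravel()

# <vec(x x^T), vec(y y^T)> = trace(x x^T y y^T) = (x^T y)^2, so computations
# on P can be carried out implicitly using only d-dimensional rows:
assert np.isclose(px @ py, (x @ y) ** 2)

# Sparsity is preserved up to squaring: nnz(vec(z z^T)) = nnz(z)^2.
z = np.array([1.5, -2.0, 0.0, 0.0, 0.5, 0.0])
assert np.count_nonzero(np.outer(z, z)) == np.count_nonzero(z) ** 2
```

This is exactly what allows the explicit d²-dimensional expansion to be avoided in the second step.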
At this point the reduction is complete; however, it is clear that this matrix expansion is inefficient. The second step of the reduction is to transform the slow computation of running Algorithm 1 on the expanded matrix P ∈ R^{n×d²} into an equivalent and provably fast computation on the original set of points in R^d. To this end we use the fact that each row of P is a sparse vector to implicitly run the computation in the original row space R^d. We present Algorithm 2 and prove that it returns the weight vector w = (w_1, ..., w_n) of a (k, ε)-coreset for low-rank approximation of the input point set P, and that this coreset is small, namely, only O(k²/ε²) of the weights (entries) in w are non-zero. Fig. 5 contains the pseudocode for Algorithm 2. Please see Supplementary Material, Section D for a complete line-by-line analysis of Algorithm 2.

6 Evaluation and Experimental Results

The coreset construction algorithm described in Section 5 was implemented in MATLAB. We make use of the redsvd package [12] to improve performance, but it is not required to run the system. We evaluate our system on two types of data: synthetic data generated with carefully controlled parameters, and real data from the English Wikipedia under the "bag of words" (BOW) model.
Synthetic data provides ground-truth to evaluate the quality, efficiency, and scalability of our system, while the Wikipedia data provides us with a grand challenge for latent semantic analysis computation.

Algorithm 2 CORESET-LOWRANK(A, k, ε)
1: Input: A: a sparse n × d matrix
2: Input: k ∈ Z_{>0}: the approximation rank
3: Input: ε ∈ (0, 1/2): the approximation error
4: Output: w ∈ [0, ∞)^n: non-negative weights
5: Compute UΣV^T = A, the SVD of A
6: R ← Σ_{k+1:d, k+1:d}
7: P ← matrix whose i-th row, for every i ∈ [n], is
8:   P_i = (U_{i,1:k}, U_{i,k+1:d} · R / ‖R‖_F)
9: X ← matrix whose i-th row, for every i ∈ [n], is
10:   X_i = P_i / ‖P_i‖_F
11: w ← (1, 0, ..., 0)
12: for i = 1, ..., ⌈k²/ε²⌉ do
13:   j ← argmin_{i=1,...,n} {w X X_i}
14:   a = ∑_{i=1}^n w_i (X_i^T X_j)²
15:   b = 1 − ‖P X_j‖²_F + ∑_{i=1}^n w_i ‖P X_i‖²_F / ‖P‖²_F
16:   c = ‖w X‖²_F
17:   α = (1 − a + b) / (1 + c − 2a)
18:   w ← (1 − α) I_j + α w
19: end for
20: return w

(a) Relative error (k = 10)  (b) Relative error (k = 20)  (c) Relative error (k = 50)  (d) Synthetic data errors  (e) Wikipedia
running time (x-axis log scale)  (f) Wikipedia log errors

Figure 1: Experimental results for synthetic data (Fig. 1a–1d) and Wikipedia (Fig. 1e–1f).

For our synthetic data experiments, we used a moderate size sparse input of size 5000 × 1000 to evaluate the relationship between the error ε and the number of iterations N of the algorithm. We then compare our coreset against uniform sampling and weighted random sampling, using the squared norms of the rows of U (A = UΣV^T) as the weights. Finally, we evaluate the efficiency of our algorithm by comparing its running time against the MATLAB svds function and against the most recent state-of-the-art dimensionality reduction algorithm [8]. Figures 1a–1d show the experimental results. Please see Supplementary Material, Section E for a complete description of the experiments.

6.1 Latent Semantic Analysis of Wikipedia

For our large-scale grand challenge experiment, we apply our algorithm to computing Latent Semantic Analysis (LSA) on the entire English Wikipedia. The size of the data is n = 3.69M (documents) with dimensionality d = 7.96M (words). We specify a nominal error of ε = 0.5, for which N = 2k/ε iterations is a theoretical upper bound, and show that the coreset error remains bounded. Figure 1f shows the log approximation error, i.e. the sum of squared distances of the coreset to the subspace, for increasing approximation rank k = 1, 10, 100. We see that the log error is proportional to k, and as the number of streamed points increases into the millions, the coreset error remains bounded by k. Figure 1e shows the running time of our algorithm compared against svds for increasing dimensionality d and a fixed input size of n = 3.69M documents.

Finally, we show that our coreset can be used to create a topic model of 100 topics for the entire English Wikipedia. We construct a coreset of size N = 1000 words.
Then to generate the topics, we compute a projection of the coreset onto a subspace of rank k = 100. Please see Supplementary Material, Section F for more details, including an example of the topics obtained in our experiments.

7 Conclusion

We present a new approach for dimensionality reduction using coresets. Our solution is general and can be used to project spaces of dimension d to subspaces of dimension k < d. The key feature of our algorithm is that it computes coresets that are small in size and subsets of the original data. We benchmark our algorithm for quality, efficiency, and scalability using synthetic data. We then apply our algorithm for computing LSA on the entire Wikipedia – a computation task hitherto not possible with state-of-the-art algorithms. We see this work as a theoretical foundation and practical toolbox for a range of dimensionality reduction problems, and we believe that our algorithms will be used to construct many other coresets in the future. Our project codebase is open-sourced and can be found here: http://people.csail.mit.edu/mikhail/NIPS2016.

References

[1] D. Achlioptas and F. McSherry. Fast computation of low-rank matrix approximations.
Journal of the ACM (JACM), 54(2):9, 2007.

[2] S. Arora, E. Hazan, and S. Kale. A fast random sampling algorithm for sparsifying matrices. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 272–279. Springer, 2006.

[3] J. Batson, D. A. Spielman, and N. Srivastava. Twice-Ramanujan sparsifiers. SIAM Journal on Computing, 41(6):1704–1721, 2012.

[4] C. Carathéodory. Über den Variabilitätsbereich der Fourierschen Konstanten von positiven harmonischen Funktionen. Rendiconti del Circolo Matematico di Palermo (1884-1940), 32(1):193–217, 1911.

[5] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.

[6] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 81–90. ACM, 2013.

[7] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 163–172. ACM, 2015.

[8] M. B. Cohen, C. Musco, and J. W. Pachocki. Online row sampling. CoRR, abs/1604.05448, 2016.

[9] P. Drineas and A. Zouzias. A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. Information Processing Letters, 111(8):385–389, 2011.

[10] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proc. 41st Ann. ACM Symp. on Theory of Computing (STOC), 2010. Manuscript available at arXiv.org.

[11] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.

[12] Google.
redsvd. https://code.google.com/archive/p/redsvd/, 2011.

[13] N. P. Halko. Randomized methods for computing low-rank approximations of matrices. PhD thesis, University of Colorado, 2012.

[14] M. Inaba, N. Katoh, and H. Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the tenth annual symposium on Computational geometry, pages 332–339. ACM, 1994.

[15] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11:517–553, 2010.

[16] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. United States Government Press Office, Los Angeles, CA, 1950.

[17] M. Langberg and L. J. Schulman. Universal ε-approximators for integrals. Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), 2010.

[18] E. Liberty, F. Woolfe, P.-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.

[19] C. C. Paige. Computational variants of the Lanczos method for the eigenproblem. IMA Journal of Applied Mathematics, 10(3):373–381, 1972.

[20] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent semantic indexing: A probabilistic analysis. In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 159–168. ACM, 1998.

[21] R. Řehůřek and P. Sojka. Gensim: statistical semantics in Python. 2011.

[22] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 143–152.
IEEE, 2006.