{"title": "Single Pass PCA of Matrix Products", "book": "Advances in Neural Information Processing Systems", "page_first": 2585, "page_last": 2593, "abstract": "In this paper we present a new algorithm for computing a low rank approximation of the product $A^TB$ by taking only a single pass of the two matrices $A$ and $B$. The straightforward way to do this is to (a) first sketch $A$ and $B$ individually, and then (b) find the top components using PCA on the sketch. Our algorithm in contrast retains additional summary information about $A,B$ (e.g. row and column norms etc.) and uses this additional information to obtain an improved approximation from the sketches. Our main analytical result establishes a comparable spectral norm guarantee to existing two-pass methods; in addition we also provide results from an Apache Spark implementation that shows better computational and statistical performance on real-world and synthetic evaluation datasets.", "full_text": "Single Pass PCA of Matrix Products\n\nShanshan Wu\n\nThe University of Texas at Austin\n\nshanshan@utexas.edu\n\nSrinadh Bhojanapalli\n\nToyota Technological Institute at Chicago\n\nsrinadh@ttic.edu\n\nSujay Sanghavi\n\nThe University of Texas at Austin\nsanghavi@mail.utexas.edu\n\nAlexandros G. Dimakis\n\nThe University of Texas at Austin\n\ndimakis@austin.utexas.edu\n\nAbstract\n\nIn this paper we present a new algorithm for computing a low rank approximation\nof the product AT B by taking only a single pass of the two matrices A and B. The\nstraightforward way to do this is to (a) \ufb01rst sketch A and B individually, and then\n(b) \ufb01nd the top components using PCA on the sketch. Our algorithm in contrast\nretains additional summary information about A, B (e.g. row and column norms\netc.) and uses this additional information to obtain an improved approximation from\nthe sketches. 
Our main analytical result establishes a spectral norm guarantee comparable to existing two-pass methods; in addition, we provide results from an Apache Spark implementation1 that shows better computational and statistical performance on real-world and synthetic evaluation datasets.

1 Introduction

Given two large matrices A and B, we study the problem of finding a low rank approximation of their product A^T B using only one pass over the matrix elements. This problem has many applications in machine learning and statistics. For example, if A = B, then this general problem reduces to Principal Component Analysis (PCA). Another example is a low rank approximation of a co-occurrence matrix computed from large logs: A may be a user-by-query matrix and B a user-by-ad matrix, so A^T B contains the joint counts for each query-ad pair. The matrices A and B can also be two large bag-of-words matrices; in that case, each entry of A^T B is the number of times a pair of words co-occurred. As a fourth example, A^T B can be a cross-covariance matrix between two sets of variables, e.g., A and B may be genotype and phenotype data collected on the same set of observations; a low rank approximation of the product matrix is useful for Canonical Correlation Analysis (CCA) [3]. In all these examples, A^T B captures pairwise variable interactions, and a low rank approximation is a way to efficiently represent the significant pairwise interactions in sub-quadratic space.

Let A and B be matrices of size d × n (d ≤ n) assumed too large to fit in main memory. To obtain a rank-r approximation of A^T B, a naive approach is to compute A^T B first and then perform a truncated singular value decomposition (SVD) of A^T B. This algorithm needs O(n²d) time and O(n²) memory to compute the product, followed by an SVD of the n × n matrix. An alternative option is to run the power method directly on A^T B without explicitly computing the product.
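As a minimal illustration (a numpy sketch for exposition only, not part of the paper's implementation; the function name is ours), each power-method step can be applied as A^T(Bv) using only thin matrix-vector products, so the n × n product is never materialized:

```python
import numpy as np

def implicit_power_method(A, B, T=50, seed=0):
    """Estimate the top singular triple of M = A.T @ B without forming M.

    Every iteration reads A and B once, which is exactly why a
    multi-pass power method becomes IO-bound when A and B live on disk.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(B.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(T):
        u = A.T @ (B @ v)          # M v, via two thin products
        u /= np.linalg.norm(u)
        v = B.T @ (A @ u)          # M^T u
        v /= np.linalg.norm(v)
    sigma = u @ (A.T @ (B @ v))    # estimate of the top singular value
    return sigma, u, v
```

Each iteration costs only O(nnz(A) + nnz(B)) flops, but requires re-reading both matrices from disk, which is the multi-pass bottleneck discussed next.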
Such an algorithm needs to access the data matrices A and B multiple times, and the disk IO overhead of loading the matrices into memory repeatedly becomes the major performance bottleneck.

For this reason, a number of recent papers introduce randomized algorithms that require only a few passes over the data, approximately linear memory, and also provide spectral norm guarantees. The key step in these algorithms is to compute a smaller representation of the data. This can be achieved by two different methods: (1) dimensionality reduction, i.e., matrix sketching [15, 5, 14, 6]; (2) random sampling [7, 1]. The recent results of Cohen et al. [6] provide the strongest spectral norm guarantee of the former: they show that a sketch size of O(r̃/ε²) suffices for the sketched matrices Ã^T B̃ to achieve a spectral error of ε, where r̃ is the maximum stable rank of A and B. Note that Ã^T B̃ is not the desired rank-r approximation of A^T B. On the other hand, [1] is a recent sampling method with very good performance guarantees. The authors consider entrywise sampling based on column norms, followed by a matrix completion step to compute low rank approximations. There is also a lot of interesting work on streaming PCA, but none of it can be directly applied to the general case when A is different from B (see Figure 4(c)). Please refer to Appendix D for more discussion of related work.

Despite the significant volume of prior work, there is no method that computes a rank-r approximation of A^T B in a single pass when the entries of A and B are streaming.2 Bhojanapalli et al. [1] consider a two-pass algorithm which computes column norms in the first pass and uses them to sample in a second pass over the matrix elements.

1The code can be found at https://github.com/wushanshan/MatrixProductPCA

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
In this paper, we combine ideas from the sketching and sampling literature to obtain the first algorithm that requires only a single pass over the data.

Contributions: We propose a one-pass algorithm SMP-PCA (which stands for Streaming Matrix Product PCA) that computes a rank-r approximation of A^T B in time O((nnz(A) + nnz(B)) ρ²r³r̃/η² + nr⁶ρ⁴r̃³/η⁴). Here nnz(·) is the number of non-zero entries, ρ is the condition number, r̃ is the maximum stable rank, and η measures the spectral norm error. Existing two-pass algorithms such as [1] typically have longer runtime than our algorithm (see Figure 3(a)). We also compare our algorithm with the simple idea that first sketches A and B separately and then performs SVD on the product of their sketches. We show that our algorithm always achieves better accuracy and can perform arbitrarily better if the column vectors of A and B come from a cone (see Figures 2, 4(b), 3(b)).

The central idea of our algorithm is a novel rescaled JL embedding that combines information from matrix sketches and vector norms. This allows us to get better estimates of dot products of high dimensional vectors than previous sketching approaches. We explain the benefit compared to a naive JL embedding in Figure 2 and the related discussion; we believe it may be of more general interest beyond low rank matrix approximations.

We prove that our algorithm recovers a low rank approximation of A^T B up to an error that depends on ‖A^T B − (A^T B)_r‖ and ‖A^T B‖, decaying with increasing sketch size and number of samples (Theorem 3.1). The first term is a consequence of low rank approximation and vanishes if A^T B is exactly rank-r. The second term results from matrix sketching and subsampling; the bounds have similar dependencies as in [6].

We implement SMP-PCA in Apache Spark and perform several distributed experiments on synthetic and real datasets.
Our distributed implementation uses several design innovations described in Section 4 and Appendix C.5, and it is the only Spark implementation we are aware of that can handle matrices that are large in both dimensions. Our experiments show that we improve running time by approximately a factor of 2× compared to the previous state of the art, and that we scale gracefully as the cluster size increases. The source code is available at [18].

In addition to better performance, our algorithm offers another advantage: it is possible to compute low-rank approximations of A^T B even when the entries of the two matrices arrive in arbitrary order (as would be the case in streaming logs). We can therefore discover significant correlations even when the original datasets cannot be stored, for example due to storage or privacy limitations.

2 Problem setting and algorithms

Consider the following problem: given two matrices A ∈ R^{d×n1} and B ∈ R^{d×n2} that are stored on disk, find a rank-r approximation of their product A^T B. In particular, we are interested in the setting where A, B, and A^T B are all too large to fit into memory. This is common in modern large scale machine learning applications. For this setting, we develop a single-pass algorithm SMP-PCA that computes the rank-r approximation without explicitly forming the entire matrix A^T B.

2One straightforward idea is to sketch each matrix individually and perform SVD on the product of the sketches. We compare against that scheme and show that we can perform arbitrarily better using our rescaled JL embedding.

Notations. Throughout the paper, we use A(i, j) or A_ij to denote the (i, j) entry of a matrix A. Let A_i and A^j be the i-th column vector and j-th row vector. We use ‖A‖_F for the Frobenius norm, and ‖A‖ for the spectral (or operator) norm. The optimal rank-r approximation of a matrix A is A_r, which can be found by SVD. 
For any positive integer n, let [n] denote the set {1, 2, · · · , n}. Given a set Ω ⊆ [n1] × [n2] and a matrix A ∈ R^{n1×n2}, we define P_Ω(A) ∈ R^{n1×n2} as the projection of A on Ω, i.e., P_Ω(A)(i, j) = A(i, j) if (i, j) ∈ Ω and 0 otherwise.

2.1 SMP-PCA

Our algorithm SMP-PCA (Streaming Matrix Product PCA) takes four parameters as input: the desired rank r, the number of samples m, the sketch size k, and the number of iterations T. A performance guarantee involving these parameters is provided in Theorem 3.1. As illustrated in Figure 1, our algorithm has three main steps: 1) compute sketches and side information in one pass over A and B; 2) given partial information of A and B, estimate important entries of A^T B; 3) compute a low rank approximation given estimates of a few entries of A^T B. We now explain each step in detail.

Figure 1: An overview of our algorithm. A single pass is performed over the data to produce the sketched matrices Ã, B̃ and the column norms ‖A_i‖, ‖B_j‖, for i ∈ [n1] and j ∈ [n2]. We then compute the sampled matrix P_Ω(M̃) through a biased sampling process, where P_Ω(M̃)(i, j) = M̃(i, j) if (i, j) ∈ Ω and zero otherwise. Here Ω represents the set of sampled entries. The (i, j)-th entry of M̃ is given in Eq. (2). Performing matrix completion on P_Ω(M̃) gives the desired rank-r approximation.

Algorithm 1 SMP-PCA: Streaming Matrix Product PCA
1: Input: A ∈ R^{d×n1}, B ∈ R^{d×n2}, desired rank: r, sketch size: k, number of samples: m, number of iterations: T
2: Construct a random matrix Π ∈ R^{k×d}, where Π(i, j) ∼ N(0, 1/k) for all (i, j) ∈ [k] × [d]. Perform a single pass over A and B to obtain: Ã = ΠA, B̃ = ΠB, and ‖A_i‖, ‖B_j‖, for i ∈ [n1] and j ∈ [n2].
3: Sample each entry (i, j) ∈ [n1] × [n2] independently with probability q̂_ij = min{1, q_ij}, where q_ij is defined in Eq. (1); maintain a set Ω ⊆ [n1] × [n2] which stores all the sampled pairs (i, j).
4: Define M̃ ∈ R^{n1×n2}, where M̃(i, j) is given in Eq. (2). Calculate P_Ω(M̃) ∈ R^{n1×n2}, where P_Ω(M̃)(i, j) = M̃(i, j) if (i, j) ∈ Ω and zero otherwise.
5: Run WAltMin(P_Ω(M̃), Ω, r, q̂, T); see Appendix A for more details.
6: Output: Û ∈ R^{n1×r} and V̂ ∈ R^{n2×r}.

Step 1: Compute sketches and side information in one pass over A and B. In this step we compute the sketches Ã := ΠA and B̃ := ΠB, where Π ∈ R^{k×d} is a random matrix with i.i.d. N(0, 1/k) entries. It is known that Π satisfies an "oblivious Johnson-Lindenstrauss (JL) guarantee" [15][17] and that it helps preserve the top row spaces of A and B [5]. Note that any sketching matrix Π that is an oblivious subspace embedding can be used here, e.g., the sparse JL transform or the randomized Hadamard transform (see [6] for more discussion).

Besides Ã and B̃, we also compute the L2 norms of all column vectors, i.e., ‖A_i‖ and ‖B_j‖, for i ∈ [n1] and j ∈ [n2]. We use this additional information to design better estimates of A^T B in the next step, and also to determine which entries of A^T B are important to sample. Note that this is the only step that needs a pass over the data.

Step 2: Estimate important entries of A^T B by rescaled JL embedding. In this step we use the partial information obtained from the previous step to compute a few important entries of A^T B. We first determine which entries of A^T B to sample, and then propose a novel rescaled JL embedding for estimating those entries.

We sample entry (i, j) of A^T B independently with probability q̂_ij = min{1, q_ij}, where

q_ij = m · ( ‖A_i‖² / (2 n2 ‖A‖_F²) + ‖B_j‖² / (2 n1 ‖B‖_F²) ).   (1)

Let Ω ⊆ [n1] × [n2] be the set of sampled entries (i, j). Since E(Σ_{i,j} q_ij) = m, the expected number of sampled entries is roughly m. The special form of q_ij ensures that we can draw m samples in O(n1 + m log(n2)) time; we show how to do this in Appendix C.5.

Note that q_ij intuitively captures the important entries of A^T B by giving higher weight to heavy rows and columns. We show in Section 3 that this sampling actually generates a good approximation to the matrix A^T B.

The biased sampling distribution of Eq. (1) was first proposed by Bhojanapalli et al. [1]. However, their algorithm [1] needs a second pass to compute the sampled entries, while we propose a novel way of estimating the dot products, using only information obtained in the first step.

Define M̃ ∈ R^{n1×n2} as

M̃(i, j) = ‖A_i‖ · ‖B_j‖ · (Ã_i^T B̃_j) / (‖Ã_i‖ · ‖B̃_j‖).   (2)

Note that we will not compute and store M̃; instead, we only calculate M̃(i, j) for (i, j) ∈ Ω. This matrix is denoted as P_Ω(M̃), where P_Ω(M̃)(i, j) = M̃(i, j) if (i, j) ∈ Ω and 0 otherwise.

Figure 2: (a) The rescaled JL embedding (red dots) captures the dot products with smaller variance than the JL embedding (blue triangles); the estimated dot product is plotted against the true dot product. Mean squared error: 0.053 versus 0.129. (b) The lower figure illustrates how to construct unit-norm vectors from a cone with angle θ: let x be a fixed unit-norm vector and let t be a random Gaussian vector with expected norm tan(θ/2); we set y as either x + t or −(x + t) with probability one half, and then normalize it. The upper figure plots the ratio of spectral norm errors ‖A^T B − Ã^T B̃‖ / ‖A^T B − M̃‖ when the column vectors of A and B are unit vectors drawn from a cone with angle θ. Clearly, M̃ has better accuracy than Ã^T B̃ for all possible values of θ, especially when θ is small.

We now explain the intuition behind Eq. (2), and why M̃ is a better estimator than Ã^T B̃. To estimate the (i, j) entry of A^T B, a straightforward way is to use Ã_i^T B̃_j = ‖Ã_i‖ · ‖B̃_j‖ · cos θ̃_ij, where θ̃_ij is the angle between the vectors Ã_i and B̃_j. Since we already know the actual column norms, a potentially better estimator is ‖A_i‖ · ‖B_j‖ · cos θ̃_ij. This removes the uncertainty that comes from the distorted column norms.3

Figure 2(a) compares the two estimators Ã_i^T B̃_j (JL embedding) and M̃(i, j) (rescaled JL embedding) for dot products. We plot simulation results for pairs of unit-norm vectors with different angles. The vectors have dimension 1,000 and the sketching matrix has dimension 10-by-1,000. Clearly, rescaling by the actual norms helps reduce the estimation uncertainty. This phenomenon is more prominent when the true dot products are close to ±1, which makes sense because cos θ has a small slope when cos θ approaches ±1, and hence the uncertainty from the angles may produce a smaller distortion than that from the norms. In the extreme case cos θ = ±1, the rescaled JL embedding perfectly recovers the true dot product.4

In the lower part of Figure 2(b) we illustrate how to construct unit-norm vectors from a cone with angle θ.
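To make the comparison concrete, here is a minimal numpy sketch of the two estimators on a single pair of synthetic unit-norm columns, using a dense Gaussian Π as in Algorithm 1 (illustrative only; the paper's Spark implementation uses the SRHT):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 1000, 10                  # ambient dimension and sketch size, as in Figure 2(a)

# Two unit-norm columns whose dot product we want to estimate.
a = rng.standard_normal(d); a /= np.linalg.norm(a)
b = rng.standard_normal(d); b /= np.linalg.norm(b)
true_dot = a @ b

# The single pass produces the sketches and the exact column norms.
Pi = rng.standard_normal((k, d)) / np.sqrt(k)    # Pi(i, j) ~ N(0, 1/k)
ta, tb = Pi @ a, Pi @ b

# Plain JL estimate: <Pi a, Pi b> (distorted norms AND distorted angle).
jl_est = ta @ tb

# Rescaled JL estimate, Eq. (2): keep only the sketched angle and
# rescale by the exact norms saved during the pass.
cos_sketched = (ta @ tb) / (np.linalg.norm(ta) * np.linalg.norm(tb))
rescaled_est = np.linalg.norm(a) * np.linalg.norm(b) * cos_sketched
```

Since |cos_sketched| ≤ 1, the rescaled estimate of a dot product of unit vectors always lies in [−1, 1], and when b = ±a it recovers the true value exactly; the plain JL estimate has neither property.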
Given a fixed unit-norm vector x and a random Gaussian vector t with expected norm tan(θ/2), we construct a new vector y by randomly picking one of the two choices x + t and −(x + t), and then renormalizing it. Supposing the columns of A and B are unit vectors randomly drawn from a cone with angle θ, we plot the ratio of spectral norm errors ‖A^T B − Ã^T B̃‖ / ‖A^T B − M̃‖ in Figure 2(b). We observe that M̃ always outperforms Ã^T B̃ and can be much better when θ approaches zero, which agrees with the trend indicated in Figure 2(a).

Step 3: Compute a low rank approximation given estimates of a few entries of A^T B. Finally we compute the low rank approximation of A^T B from the samples using weighted alternating least squares:

min_{U,V ∈ R^{n×r}} Σ_{(i,j)∈Ω} w_ij (e_i^T U V^T e_j − M̃(i, j))²,   (3)

where w_ij = 1/q̂_ij denotes the weights, and e_i, e_j are standard basis vectors. This is a popular technique for low rank recovery and matrix completion (see [1] and the references therein). After T iterations, we obtain a rank-r approximation of M̃ in the convenient factored form. This subroutine is quite standard, so we defer the details to Appendix A.

3 Analysis

We now present our main theoretical result. Theorem 3.1 characterizes the interaction between the sketch size k, the sampling complexity m, the number of iterations T, and the spectral error ‖(A^T B)_r − [A^T B]_r‖, where [A^T B]_r is the output of SMP-PCA and (A^T B)_r is the optimal rank-r approximation of A^T B. Note that the following theorem assumes that A and B have the same size. For the general case of n1 ≠ n2, Theorem 3.1 remains valid with n = max{n1, n2}.

Theorem 3.1. Given matrices A ∈ R^{d×n} and B ∈ R^{d×n}, let (A^T B)_r be the optimal rank-r approximation of A^T B. Define r̃ = max{‖A‖_F²/‖A‖², ‖B‖_F²/‖B‖²} as the maximum stable rank, and ρ = σ_1/σ_r as the condition number of (A^T B)_r, where σ_i is the i-th singular value of A^T B. Let [A^T B]_r be the output of Algorithm SMP-PCA. 
If the input parameters k, m, and T satisfy

k ≥ C1 ‖A‖² ‖B‖² ρ² r³ · (max{r̃, 2 log(n)} + log(3/δ)) / (η² ‖A^T B‖²),   (4)

m ≥ (C2 r̃² / δ) · ( (‖A‖_F² + ‖B‖_F²) / ‖A^T B‖_F )² · n r³ ρ² log(n) T² / η²,   (5)

T ≥ log( (‖A‖_F + ‖B‖_F) / ζ ),   (6)

where C1 and C2 are global constants independent of A and B, then with probability at least 1 − δ, we have

‖(A^T B)_r − [A^T B]_r‖ ≤ η ‖A^T B − (A^T B)_r‖_F + ζ + η σ_r.   (7)

3We also tried using the cosine rule for computing the dot product, and another sketching method specifically designed for preserving angles [2], but empirically those methods perform worse than our current estimator.

4See http://wushanshan.github.io/files/RescaledJL_project.pdf for more results.

Remark 1. Compared to the two-pass algorithm proposed by [1], we notice that Eq. (7) contains an additional error term ησ_r. This extra term captures the cost incurred when we approximate entries of A^T B by Eq. (2) instead of using the actual values. The exact tradeoff between η and k is given by Eq. (4). On one hand, we want a small k so that the sketched matrices can fit into memory. On the other hand, the parameter k controls how much information is lost during sketching, and a larger k gives a more accurate estimation of the inner products.

Remark 2. The dependence on (‖A‖_F² + ‖B‖_F²) / ‖A^T B‖_F captures one difficult situation for our algorithm. If ‖A^T B‖_F is much smaller than ‖A‖_F or ‖B‖_F, which could happen, e.g., when many column vectors of A are orthogonal to those of B, then SMP-PCA requires many samples to work. This is reasonable: imagine that A^T B is close to an identity matrix; then it may be hard to tell it apart from an all-zero matrix without enough samples. 
Nevertheless, removing this dependence is an interesting direction for future research.

Remark 3. For the special case A = B, SMP-PCA computes a rank-r approximation of the matrix A^T A in a single pass. Theorem 3.1 provides an error bound in spectral norm for the residual matrix (A^T A)_r − [A^T A]_r. Most results in the online PCA literature use the Frobenius norm as the performance measure. Recently, [10] provided an online PCA algorithm with a spectral norm guarantee. They achieve a spectral norm bound of εσ_1 + σ_{r+1}, which is stronger than ours. However, their algorithm requires a target dimension of O(r log n / ε²), i.e., the output is a matrix of size n-by-O(r log n / ε²), while the output of SMP-PCA is simply n-by-r.

Remark 4. We defer our proofs to Appendix C. The proof proceeds in three steps. In Appendix C.2, we show that the sampled matrix provides a good approximation of the actual matrix A^T B. In Appendix C.3, we show that the distance between the computed subspaces Û, V̂ and the optimal ones U*, V* decreases geometrically at each iteration of the WAltMin algorithm. The spectral norm bound in Theorem 3.1 is then proved by combining the results of these two steps.

Computation Complexity. We now analyze the computation complexity of SMP-PCA. In Step 1, we compute the sketched matrices of A and B, which requires O(nnz(A)k + nnz(B)k) flops, where nnz(·) denotes the number of non-zero entries. The main job of Step 2 is to sample a set Ω and calculate the corresponding inner products, which takes O(m log(n) + mk) flops, where for simplicity we define n as max{n1, n2}. According to Eq. (4), we have log(n) = O(k), so Step 2 takes O(mk) flops. In Step 3, we run alternating least squares on the sampled matrix, which can be completed in O((mr² + nr³)T) flops. Since Eq. (5) indicates nr = O(m), the computation complexity of Step 3 is O(mr²T). Therefore, SMP-PCA has a total computation complexity of O(nnz(A)k + nnz(B)k + mk + mr²T).

4 Numerical Experiments

Spark implementation. We implement SMP-PCA in Apache Spark 1.6.2 [19]. For comparison, we also implement the two-pass algorithm LELA [1] in Spark.5 The matrices A and B are stored as resilient distributed datasets (RDDs) on disk (by setting their StorageLevel to DISK_ONLY). We implement the two passes of LELA using the treeAggregate method. For SMP-PCA, we choose the subsampled randomized Hadamard transform (SRHT) [16] as the sketching matrix. The biased sampling procedure is performed using binary search (see Appendix C.5 for how to sample m elements in O(m log n) time). After obtaining the sampled matrix, we run ALS (alternating least squares) to get the desired low-rank matrices. More details can be found at [18].

Description of datasets. We test our algorithm on synthetic datasets and three real datasets: SIFT10K [9], NIPS-BW [11], and URL-reputation [12]. For synthetic data, we generate the matrices A and B as GD, where G has entries independently drawn from the standard Gaussian distribution, and D is a diagonal matrix with D_ii = 1/i. SIFT10K is a dataset of 10,000 images, each represented by 128 features. We set A as the image-by-feature matrix; the task here is to compute a low rank approximation of A^T A, which is a standard PCA task. The NIPS-BW dataset contains bag-of-words features extracted from 1,500 NIPS papers. We divide the papers into two subsets and let A and B be the corresponding word-by-paper matrices, so A^T B computes the counts of co-occurring words between the two sets of papers. 
The original URL-reputation dataset has 2.4 million URLs. Each URL is represented by 3.2 million features and is labeled as malicious or benign. This dataset has been used previously for CCA [13]. Here we extract two subsets of features and let A and B be the corresponding URL-by-feature matrices. The goal is to compute a low rank approximation of A^T B, the cross-covariance matrix between the two subsets of features.

5To the best of our knowledge, ours is the first distributed implementation of LELA.

Sample complexity. In Figure 4(a) we present simulation results on a small synthetic dataset with n = d = 5,000 and r = 5. We observe that a phase transition occurs at sample complexity m = Θ(nr log n). This agrees with the experimental results reported in previous papers, see, e.g., [4, 1]. For all remaining experiments, unless otherwise specified, we set r = 5, T = 10, and m = 4nr log n.

Figure 3: (a) Spark-1.6.2 running time (seconds) versus cluster size, for LELA and SMP-PCA on a 150GB dataset. All nodes are m3.2xlarge EC2 instances. See [18] for more details. (b) Spectral norm error versus sketch size k for SVD(Ã^T B̃), SMP-PCA, and LELA (with the optimal error as a baseline) on two datasets: SIFT10K (left) and NIPS-BW (right). SMP-PCA outperforms SVD(Ã^T B̃) by a factor of 1.8 for SIFT10K and 1.1 for NIPS-BW. The error of SMP-PCA keeps decreasing as the sketch size k grows.

Table 1: A comparison of spectral norm error over three datasets

Dataset (d, n) | Algorithm | Sketch size k | Error
Synthetic (d = 100,000, n = 100,000) | Optimal | - | 0.0271
Synthetic | LELA | - | 0.0274
Synthetic | SMP-PCA | 2,000 | 0.0280
URL-malicious (d = 792,145, n = 10,000) | Optimal | - | 0.0163
URL-malicious | LELA | - | 0.0182
URL-malicious | SMP-PCA | 2,000 | 0.0188
URL-benign (d = 1,603,985, n = 10,000) | Optimal | - | 0.0103
URL-benign | LELA | - | 0.0105
URL-benign | SMP-PCA | 2,000 | 0.0117

Comparison of SMP-PCA and LELA. We now compare the statistical performance of SMP-PCA and LELA [1] on three real datasets and one synthetic dataset. As shown in Figure 3(b) and Table 1, LELA always achieves a smaller spectral norm error than SMP-PCA, which makes sense because LELA takes two passes and hence has more chances to explore the data. Moreover, we observe that as the sketch size increases, the error of SMP-PCA keeps decreasing and gets closer to that of LELA.

In Figure 3(a) we compare the runtime of SMP-PCA and LELA using a 150GB synthetic dataset on m3.2xlarge Amazon EC2 instances.6 The matrices A and B have dimension n = d = 100,000. The sketch dimension is set to k = 2,000. We observe that the speedup achieved by SMP-PCA is more prominent for small clusters (e.g., 56 minutes versus 34 minutes on a cluster of size two). This is possibly due to increasing Spark overheads on larger clusters; see [8] for related discussion.

Comparison of SMP-PCA and SVD(Ã^T B̃). In Figure 4(b) we repeat the experiment of Section 2 by generating the column vectors of A and B from a cone with angle θ. 
Here SVD(Ã^T B̃) refers to computing SVD on the sketched matrices.7 We plot the ratio of the spectral norm error of SVD(Ã^T B̃) over that of SMP-PCA, as a function of θ. Note that this is different from Figure 2(b), as we now take the effect of random sampling and SVD into account. However, the trend in both figures is the same: SMP-PCA always outperforms SVD(Ã^T B̃) and can be arbitrarily better as θ goes to zero.

In Figure 3(b) we compare SMP-PCA and SVD(Ã^T B̃) on two real datasets, SIFT10K and NIPS-BW. The y-axis represents the spectral norm error, defined as ‖A^T B − [A^T B]_r‖ / ‖A^T B‖, where [A^T B]_r is the rank-r approximation found by a specific algorithm. We observe that SMP-PCA outperforms SVD(Ã^T B̃) by a factor of 1.8 for SIFT10K and 1.1 for NIPS-BW.

Now we explain why SMP-PCA produces a more accurate result than SVD(Ã^T B̃). The reasons are twofold. First, our rescaled JL embedding M̃ is a better estimator of A^T B than Ã^T B̃ (Figure 2). Second, the noise due to sampling is relatively small compared to the benefit obtained from M̃, and hence the final result computed from P_Ω(M̃) still outperforms SVD(Ã^T B̃).

Figure 4: (a) A phase transition occurs at sample complexity m = Θ(nr log n). (b) The ratio of the spectral norm error of SVD(Ã^T B̃) over that of SMP-PCA, for sketch sizes k = 400 and k = 800, when the columns of A and B are unit vectors drawn from a cone with angle θ; the ratio of errors grows without bound as the cone angle shrinks. (c) If the top r left singular vectors of A are orthogonal to those of B, the product A_r^T B_r is a very poor low rank approximation of A^T B.

Comparison of SMP-PCA and A_r^T B_r. Let A_r and B_r be the optimal rank-r approximations of A and B. We show that even if one could use existing methods (e.g., algorithms for streaming PCA) to estimate A_r and B_r, their product A_r^T B_r can be a very poor low rank approximation of A^T B. This is demonstrated in Figure 4(c), where we intentionally make the top r left singular vectors of A orthogonal to those of B.

5 Conclusion

We develop a novel one-pass algorithm SMP-PCA that directly computes a low rank approximation of a matrix product, using ideas from matrix sketching and entrywise sampling. As a subroutine of our algorithm, we propose the rescaled JL embedding for estimating entries of A^T B, which has smaller error than the standard estimator Ã^T B̃; we believe it can be extended to other applications. Moreover, SMP-PCA allows the non-zero entries of A and B to be presented in any arbitrary order, and hence can be used for streaming applications. We design a distributed implementation for SMP-PCA. Our experimental results show that SMP-PCA can perform arbitrarily better than SVD(Ã^T B̃), and is significantly faster than algorithms that require two or more passes over the data.

Acknowledgements. We thank the anonymous reviewers for their valuable comments. This research has been supported by NSF Grants CCF 1344179, 1344364, 1407278, 1422549, 1302435, 1564000, and ARO YIP W911NF-14-1-0258.

6Each machine has 8 cores, 30GB memory, and 2×80GB SSD.

7This can be done by a standard power-iteration based method, without explicitly forming the product matrix Ã^T B̃, whose size is too big to fit into memory under our assumption.

References

[1] S. Bhojanapalli, P. Jain, and S. Sanghavi. 
Tighter low-rank approximation via sampling the leveraged element. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 902–920. SIAM, 2015.

[2] P. T. Boufounos. Angle-preserving quantized phase embeddings. In SPIE Optical Engineering + Applications. International Society for Optics and Photonics, 2013.

[3] X. Chen, H. Liu, and J. G. Carbonell. Structured sparse canonical correlation analysis. In International Conference on Artificial Intelligence and Statistics, pages 199–207, 2012.

[4] Y. Chen, S. Bhojanapalli, S. Sanghavi, and R. Ward. Completing any low-rank matrix, provably. arXiv preprint arXiv:1306.2979, 2013.

[5] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, pages 81–90. ACM, 2013.

[6] M. B. Cohen, J. Nelson, and D. P. Woodruff. Optimal approximate matrix product in terms of stable rank. arXiv preprint arXiv:1507.02268, 2015.

[7] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.

[8] A. Gittens, A. Devarakonda, E. Racah, M. F. Ringenburg, L. Gerhardt, J. Kottalam, J. Liu, K. J. Maschhoff, S. Canon, J. Chhugani, P. Sharma, J. Yang, J. Demmel, J. Harrell, V. Krishnamurthy, M. W. Mahoney, and Prabhat. Matrix factorization at scale: a comparison of scientific data analytics in Spark and C+MPI using three case studies. arXiv preprint arXiv:1607.01335, 2016.

[9] H. Jegou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.

[10] Z. Karnin and E. Liberty. Online PCA with spectral bounds. In Proceedings of The 28th Conference on Learning Theory (COLT), volume 40, pages 1129–1140, 2015.

[11] M. Lichman. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2013.

[12] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying suspicious URLs: an application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 681–688. ACM, 2009.

[13] Z. Ma, Y. Lu, and D. Foster. Finding linear structure in large datasets with scalable canonical correlation analysis. arXiv preprint arXiv:1506.08170, 2015.

[14] A. Magen and A. Zouzias. Low rank matrix-valued Chernoff bounds and approximate matrix multiplication. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1422–1436. SIAM, 2011.

[15] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152. IEEE, 2006.

[16] J. A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, pages 115–126, 2011.

[17] D. P. Woodruff. Sketching as a tool for numerical linear algebra. arXiv preprint arXiv:1411.4357, 2014.

[18] S. Wu, S. Bhojanapalli, S. Sanghavi, and A. Dimakis. GitHub repository for \"Single-pass PCA of matrix products\". https://github.com/wushanshan/MatrixProductPCA, 2016.

[19] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
", "award": [], "sourceid": 1347, "authors": [{"given_name": "Shanshan", "family_name": "Wu", "institution": "UT Austin"}, {"given_name": "Srinadh", "family_name": "Bhojanapalli", "institution": "TTI Chicago"}, {"given_name": "Sujay", "family_name": "Sanghavi", "institution": "UT-Austin"}, {"given_name": "Alexandros", "family_name": "Dimakis", "institution": "University of Texas, Austin"}]}