{"title": "Fast and Memory Optimal Low-Rank Matrix Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 3177, "page_last": 3185, "abstract": "In this paper, we revisit the problem of constructing a near-optimal rank $k$ approximation of a matrix $M\\in [0,1]^{m\\times n}$ under the streaming data model where the columns of $M$ are revealed sequentially. We present SLA (Streaming Low-rank Approximation), an algorithm that is asymptotically accurate, when $k s_{k+1} (M) = o(\\sqrt{mn})$ where $s_{k+1}(M)$ is the $(k+1)$-th largest singular value of $M$. This means that its average mean-square error converges to 0 as $m$ and $n$ grow large (i.e., $\\|\\hat{M}^{(k)}-M^{(k)} \\|_F^2 = o(mn)$ with high probability, where $\\hat{M}^{(k)}$ and $M^{(k)}$ denote the output of SLA and the optimal rank $k$ approximation of $M$, respectively). Our algorithm makes one pass on the data if the columns of $M$ are revealed in a random order, and two passes if the columns of $M$ arrive in an arbitrary order. To reduce its memory footprint and complexity, SLA uses random sparsification, and samples each entry of $M$ with a small probability $\\delta$. In turn, SLA is memory optimal as its required memory space scales as $k(m+n)$, the dimension of its output. 
Furthermore, SLA is computationally efficient as it runs in $O(\\delta kmn)$ time (a constant number of operations is made for each observed entry of $M$), which can be as small as $O(k\\log(m)^4 n)$ for an appropriate choice of $\\delta$ and if $n\\ge m$.", "full_text": "Fast and Memory Optimal Low-Rank Matrix Approximation

Se-Young Yun
MSR, Cambridge
seyoung.yun@inria.fr

Marc Lelarge∗
Inria & ENS
marc.lelarge@ens.fr

Alexandre Proutiere†
KTH, EE School / ACL
alepro@kth.se

Abstract

In this paper, we revisit the problem of constructing a near-optimal rank k approximation of a matrix M ∈ [0, 1]^{m×n} under the streaming data model where the columns of M are revealed sequentially. We present SLA (Streaming Low-rank Approximation), an algorithm that is asymptotically accurate when k s_{k+1}(M) = o(√(mn)), where s_{k+1}(M) is the (k+1)-th largest singular value of M. This means that its average mean-square error converges to 0 as m and n grow large (i.e., ‖M̂^(k) − M^(k)‖_F^2 = o(mn) with high probability, where M̂^(k) and M^(k) denote the output of SLA and the optimal rank k approximation of M, respectively). Our algorithm makes one pass on the data if the columns of M are revealed in a random order, and two passes if the columns of M arrive in an arbitrary order. To reduce its memory footprint and complexity, SLA uses random sparsification, and samples each entry of M with a small probability δ. In turn, SLA is memory optimal as its required memory space scales as k(m+n), the dimension of its output.
Furthermore, SLA is computationally efficient as it runs in O(δkmn) time (a constant number of operations is made for each observed entry of M), which can be as small as O(k log^4(m) n) for an appropriate choice of δ and if n ≥ m.

1 Introduction

We investigate the problem of constructing, in a memory and computationally efficient manner, an accurate estimate of the optimal rank k approximation M^(k) of a large (m × n) matrix M ∈ [0, 1]^{m×n}. This problem is fundamental in machine learning, and has naturally found numerous applications in computer science. The optimal rank k approximation M^(k) minimizes, over all rank k matrices Z, the Frobenius norm ‖M − Z‖_F (and any norm that is invariant under rotation), and can be computed by Singular Value Decomposition (SVD) of M in O(nm^2) time (if we assume that m ≤ n). For massive matrices M (i.e., when m and n are very large), this becomes unacceptably slow. In addition, storing and manipulating M in memory may become difficult. In this paper, we design a memory and computationally efficient algorithm, referred to as Streaming Low-rank Approximation (SLA), that computes a near-optimal rank k approximation M̂^(k). Under mild assumptions on M, the SLA algorithm is asymptotically accurate in the sense that as m and n grow large, its average mean-square error converges to 0, i.e., ‖M̂^(k) − M^(k)‖_F^2 = o(mn) with high probability (we interpret M^(k) as the signal that we aim to recover from a noisy observation M).

To reduce its memory footprint and running time, the proposed algorithm combines random sparsification and the idea of the streaming data model. More precisely, each entry of M is revealed to the algorithm with probability δ, called the sampling rate. Moreover, SLA observes and treats the columns of M one after the other in a sequential manner. The sequence of observed columns may be chosen uniformly at random, in which case the algorithm requires one pass on M only, or can be arbitrary, in which case the algorithm needs two passes. SLA first stores ℓ = 1/(δ log(m)) randomly selected columns, and extracts via spectral decomposition an estimator of parts of the k top right singular vectors of M. It then completes the estimator of these vectors by receiving and treating the remaining columns sequentially. SLA finally builds, from the estimated top k right singular vectors, the linear projection onto the subspace generated by these vectors, and deduces an estimator of M^(k).

∗ Work performed as part of the MSR-INRIA joint research centre. M.L. acknowledges the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-11-JS02-005-01 (GAP project).
† A. Proutiere's research is supported by the ERC FSA grant, and the SSF ICT-Psi project.

The analysis of the performance of SLA is presented in Theorems 7 and 8. In summary: when m ≤ n and log^4(m)/m ≤ δ ≤ m^{−8/9}, with probability 1 − kδ, the output M̂^(k) of SLA satisfies:

‖M^(k) − M̂^(k)‖_F^2 / (mn) = O( k^2 ( s_{k+1}^2(M)/(mn) + log(m)/√(δm) ) ),   (1)

where s_{k+1}(M) is the (k+1)-th singular value of M. SLA requires O(kn) memory space, and if δ ≥ log^4(m)/m and k ≤ log^6(m), its running time is O(δkmn). To ensure the asymptotic accuracy of SLA, the upper bound in (1) needs to converge to 0, which is true as soon as k s_{k+1}(M) = o(√(mn)). In the case where M is seen as a noisy version of M^(k), this condition quantifies the maximum amount of noise allowed for our algorithm to be asymptotically accurate.

SLA is memory optimal, since any rank k approximation algorithm needs to at least store its output, i.e., k right and left singular vectors, and hence needs at least O(kn) memory space. Further observe that, among the class of algorithms sampling each entry of M at a given rate δ, SLA is computationally optimal, since it runs in O(δkmn) time (it does a constant number of operations per observed entry if k = O(1)). In turn, to the best of our knowledge, SLA is both faster and more memory efficient than existing algorithms. SLA is the first memory optimal and asymptotically accurate low-rank approximation algorithm.

The approach used to design SLA can be readily extended to devise memory and computationally efficient matrix completion algorithms. We present this extension in the supplementary material.

Notations. Throughout the paper, we use the following notations. For any m × n matrix A, we denote by A⊤ its transpose, and by A^{−1} its pseudo-inverse. We denote by s_1(A) ≥ ··· ≥ s_{n∧m}(A) ≥ 0 the singular values of A. When matrices A and B have the same number of rows, we use [A, B] to denote the matrix whose first columns are those of A followed by those of B. A⊥ denotes an orthonormal basis of the subspace perpendicular to the linear span of the columns of A. A_j, A^i, and A_{ij} denote the j-th column of A, the i-th row of A, and the entry of A on the i-th row and j-th column, respectively. For h ≤ l, A_{h:l} (resp. A^{h:l}) is the matrix obtained by extracting the columns (resp. rows) h, . . . , l of A. For any ordered set B = {b_1, . . .
, b_p} ⊂ {1, . . . , n}, A_(B) refers to the matrix composed by the ordered set B of columns of A. A^(B) is defined similarly (but for rows). For real numbers a ≤ b, we define |A|_a^b as the matrix with (i, j) entry equal to (|A|_a^b)_{ij} = min(b, max(a, A_{ij})). Finally, for any vector v, ‖v‖ denotes its Euclidean norm, whereas for any matrix A, ‖A‖_F denotes its Frobenius norm, ‖A‖_2 its operator norm, and ‖A‖_∞ its ℓ_∞-norm, i.e., ‖A‖_∞ = max_{i,j} |A_{ij}|.

2 Related Work

Low-rank approximation algorithms have received a lot of attention over the last decade. There are two types of error estimate for these algorithms: either the error is additive or relative. To translate our bound (1) into an additive error is easy:

‖M − M̂^(k)‖_F ≤ ‖M − M^(k)‖_F + O( k ( s_{k+1}(M)/√(mn) + log^{1/2}(m)/(δm)^{1/4} ) √(mn) ).   (2)

Sparsifying M to speed up the computation of a low-rank approximation has been proposed in the literature, and the best additive error bounds have been obtained in [AM07]. When the sampling rate δ satisfies δ ≥ log^4(m)/m, the authors show that, with probability 1 − exp(−log^4 m),

‖M − M̃^(k)‖_F ≤ ‖M − M^(k)‖_F + O( k^{1/2}n^{1/2}/δ^{1/2} + k^{1/4}n^{1/4}‖M^(k)‖_F^{1/2}/δ^{1/4} ).   (3)

This performance guarantee is derived from Lemma 1.1 and Theorem 1.4 in [AM07]. To compare (2) and (3), note that our assumption of bounded entries of M ensures that s_{k+1}^2(M)/(mn) ≤ 1/k and ‖M^(k)‖_F ≤ ‖M‖_F ≤ √(nm). In particular, we see that the worst case bound for (3) is ( k^{1/2}/√(δm) + k^{1/4}/(δm)^{1/4} )√(nm), which is always lower than the worst case bound for (2): ( k + log(m)/√(δm) )^{1/2}√(nm). When k = O(1), our bound is only larger by a logarithmic term in m compared to [AM07]. However, the algorithm proposed in [AM07] requires to store O(δmn) entries of M, whereas SLA needs O(n) memory space. Recall that log^4(m) ≤ δm ≤ m^{1/9}, so that our algorithm makes a significant improvement on the memory requirement at a low price in the error guarantee bounds. Although biased sampling algorithms can reduce the error, such algorithms have to compute leverage scores, which requires multiple passes over the data [BJS15]. In a recent work, [CW13] proposes a time-efficient algorithm to compute a low-rank approximation of a sparse matrix. Combined with [AM07], this yields an algorithm running in time O(δmn) + O(nk^2 + k^3), but with an increased additive error term.

We can also compare our result to papers providing an estimate M̃^(k) of the optimal low-rank approximation of M with a relative error ε, i.e., such that ‖M − M̃^(k)‖_F ≤ (1 + ε)‖M − M^(k)‖_F. To the best of our knowledge, [CW09] provides the best result in this setting. Theorem 4.4 in [CW09] shows that, provided the rank of M is at least 2(k+1), their algorithm outputs with probability 1 − η a rank-k matrix M̃^(k) with relative error ε using memory space O( (k/ε) log(1/η)(n + m) ) (note that in [CW09], the authors use as unit of memory a bit, whereas we use as unit of memory an entry of the matrix, so we removed a log(mn) factor in their expression to make fair comparisons).
To compare with our result, we can translate our bound (1) into a relative error, and we need to take:

ε = O( k ( s_{k+1}(M) + (log^{1/2}(m)/(δm)^{1/4}) √(mn) ) / ‖M − M^(k)‖_F ).

First note that, since M is assumed to be of rank at least 2(k+1), we have ‖M − M^(k)‖_F ≥ s_{k+1}(M) > 0, and ε is well-defined. Clearly, for our ε to tend to zero, we need ‖M − M^(k)‖_F to be not too small. For the scenario we have in mind, M is a noisy version of the signal M^(k), so that M − M^(k) is the noise matrix. When every entry of M − M^(k) is generated independently at random with a constant variance, ‖M − M^(k)‖_F = Θ(√(mn)) while s_{k+1}(M) = Θ(√(m+n)). In such a case, we have ε = o(1), and we improve the memory requirement of [CW09] by a factor ε^{−1} log(1/(kδ)). [CW09] also considers a model where the full columns of M are revealed one after the other in an arbitrary order, and proposes a one-pass algorithm to derive the rank-k approximation of M with the same memory requirement. In this general setting, our algorithm is required to make two passes on the data (and only one pass if the order of arrival of the columns is random instead of arbitrary). The running time of their algorithm scales as O(kmnε^{−1} log(1/(kδ))), needed to project M onto a kε^{−1} log(1/(kδ))-dimensional random space. Thus, SLA improves the running time again by a factor of ε^{−1} log(1/(kδ)).

We could also think of using sketching and streaming PCA algorithms to estimate M^(k). When the columns arrive sequentially, these algorithms identify the left singular vectors using one pass on the matrix, and then need a second pass on the data to estimate the right singular vectors.
For example, [Lib13] proposes a sketching algorithm that updates the p most frequent directions as columns are observed. [GP14] shows that with O(km/ε) memory space (for p = k/ε), this sketching algorithm finds an m × k matrix Û such that ‖M − P_Û M‖_F ≤ (1 + ε)‖M − M^(k)‖_F, where P_Û denotes the projection matrix onto the linear span of the columns of Û. The running time of the algorithm is roughly O(kmnε^{−1}), which is much greater than that of SLA. Note also that, to identify such a matrix Û in one pass on M, it is shown in [Woo14] that one has to use Ω(km/ε) memory space. This result does not contradict the performance analysis of SLA, since the latter needs two passes on M if the columns of M are observed in an arbitrary manner. Finally, note that the streaming PCA algorithm proposed in [MCJ13] does not apply to our problem, as that paper investigates a very specific setting: the spiked covariance model, where columns are randomly generated in an i.i.d. manner.

3 Streaming Low-rank Approximation Algorithm

Algorithm 1 Streaming Low-rank Approximation (SLA)
Input: M, k, δ, and ℓ = 1/(δ log(m))
1. A_(B1), A_(B2) ← independently sample entries of [M_1, . . . , M_ℓ] at rate δ
2. PCA for the first ℓ columns: Q ← SPCA(A_(B1), k)
3. Trimming the rows and columns of A_(B2):
   A_(B2) ← set the entries of the rows of A_(B2) having more than two non-zero entries to 0
   A_(B2) ← set the entries of the columns of A_(B2) having more than 10mδ non-zero entries to 0
4. W ← A_(B2)Q
5. V̂^(B1) ← (A_(B1))⊤W
6. Î ← A_(B1)V̂^(B1)
Remove A_(B1), A_(B2), and Q from the memory space
for t = ℓ+1 to n do
   7. A_t ← sample entries of M_t at rate δ
   8. V̂^t ← (A_t)⊤W
   9. Î ← Î + A_tV̂^t
   Remove A_t from the memory space
end for
10. R̂ ← find R̂ using the Gram-Schmidt process such that V̂R̂ is an orthonormal matrix
11. Û ← (1/δ) Î R̂ R̂⊤
Output: M̂^(k) = |ÛV̂⊤|_0^1

Algorithm 2 Spectral PCA (SPCA)
Input: C ∈ [0, 1]^{m×ℓ}, k
Ω ← (ℓ × k) Gaussian random matrix
Trimming: C̄ ← set the entries of the rows of C with more than 10 non-zero entries to 0
Φ ← C̄⊤C̄ − diag(C̄⊤C̄)
Power Iteration: QR ← QR decomposition of Φ^{⌈5 log(ℓ)⌉}Ω
Output: Q

In this section, we present the Streaming Low-rank Approximation (SLA) algorithm and analyze its performance. SLA makes one pass on the matrix M, and is provided with the columns of M one after the other in a streaming manner. The SVD of M is M = UΣV⊤, where U and V are (m × m) and (n × n) unitary matrices, and Σ is the (m × n) matrix diag(s_1(M), . . . , s_{n∧m}(M)). We assume (or impose by design of SLA) that the ℓ (specified below) first observed columns of M are chosen uniformly at random among all columns. An extension of SLA to scenarios where columns are observed in an arbitrary order is presented in §3.5, but this extension requires two passes on M. To be memory efficient, SLA uses sampling. Each observed entry of M is erased (i.e., set equal to 0) with probability 1 − δ, where δ > 0 is referred to as the sampling rate. The algorithm, whose pseudo-code is presented in Algorithm 1, proceeds in three steps:

1. In the first step, we observe ℓ = 1/(δ log(m)) columns of M chosen uniformly at random. These columns form the matrix M_(B) = UΣ(V^(B))⊤, where B denotes the ordered set of the indexes of the ℓ first observed columns.
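The entry-wise sampling used throughout the algorithm, and in particular the two independent δ-sampled copies of the first batch drawn in Line 1 of Algorithm 1, can be sketched as follows (a minimal numpy illustration, not the authors' code; the function name `sample_entries` and the dimensions are ours):

```python
import numpy as np

def sample_entries(M, delta, rng):
    """Keep each entry of M independently with probability delta; zero the rest."""
    return np.where(rng.random(M.shape) < delta, M, 0.0)

rng = np.random.default_rng(0)
M_B = rng.random((200, 50))            # first batch: ell = 50 columns, entries in [0, 1]
A_B1 = sample_entries(M_B, 0.1, rng)   # drives the spectral step (SPCA)
A_B2 = sample_entries(M_B, 0.1, rng)   # reserved for the later steps; independent of A_B1
```

On average only δmℓ entries of the batch survive, which is what makes the O(δkmn) overall running time and the small memory footprint possible; the independence of the two copies is used in the analysis of Step 2.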
M_(B) is sampled at rate δ. More precisely, we apply two independent sampling procedures, where in each of them every entry of M_(B) is sampled at rate δ. The two resulting independent random matrices A_(B1) and A_(B2) are stored in memory. A_(B1), referred to as A_(B) to simplify the notations, is used in this first step, whereas A_(B2) will be used in subsequent steps. Next, through a spectral decomposition of A_(B), we derive an (ℓ × k) orthonormal matrix Q such that the span of its column vectors approximates that of the column vectors of V^(B)_{1:k}. The first step corresponds to Lines 1 and 2 in the pseudo-code of SLA.

2. In the second step, we complete the construction of our estimator of the top k right singular vectors V_{1:k} of M. Denote by V̂ the (n × k) matrix formed by these estimated vectors. We first compute the components of these vectors corresponding to the set of indexes B as V̂^(B) = A_(B1)⊤W with W = A_(B2)Q. Then for t = ℓ+1, . . . , n, after receiving the t-th column M_t of M, we set V̂^t = A_t⊤W, where A_t is obtained by sampling entries of M_t at rate δ. Hence, after one pass on M, we get V̂ = Ã⊤W, where Ã = [A_(B1), A_{ℓ+1}, . . . , A_n]. As it turns out, multiplying W by Ã⊤ amplifies the useful signal contained in W, and yields an accurate approximation of the span of the top k right singular vectors V_{1:k} of M. The second step is presented in Lines 3, 4, 5, 7 and 8 in the SLA pseudo-code.

3. In the last step, we deduce from V̂ a set of column vectors gathered in a matrix Û such that ÛV̂⊤ provides an accurate approximation of M^(k).
First, using the Gram-Schmidt process, we find R̂ such that V̂R̂ is an orthonormal matrix, and compute Û = (1/δ)ÃV̂R̂R̂⊤ in a streaming manner as in Step 2. Then, ÛV̂⊤ = (1/δ)ÃV̂R̂(V̂R̂)⊤, where V̂R̂(V̂R̂)⊤ approximates the projection matrix onto the linear span of the top k right singular vectors of M. Thus, ÛV̂⊤ is close to M^(k). This last step is described in Lines 6, 9, 10 and 11 in the SLA pseudo-code.

In the next subsections, we present in more detail the rationale behind the three steps of SLA, and provide a performance analysis of the algorithm.

3.1 Step 1. Estimating right-singular vectors of the first batch of columns

The objective of the first step is to estimate V^(B)_{1:k}, those components of the top k right singular vectors of M whose indexes are in the set B (remember that B is the set of indexes of the ℓ first observed columns). This estimator, denoted by Q, is obtained by applying the power method to extract the top k right singular vectors of M_(B), as described in Algorithm 2. In the design of this algorithm and its performance analysis, we face two challenges: (i) we only have access to a sampled version A_(B) of M_(B); and (ii) UΣ(V^(B))⊤ is not the SVD of M_(B), since the column vectors of V^(B)_{1:k} are not orthonormal in general (we keep only the components of these vectors corresponding to the set of indexes B). Hence, the top k right singular vectors of M_(B) that we extract in Algorithm 2 do not necessarily correspond to V^(B)_{1:k}.

To address (i), in Algorithm 2, we do not directly extract the top k right singular vectors of A_(B). We first remove the rows of A_(B) with too many non-zero entries (i.e., too many observed entries from M_(B)), since these rows would perturb the SVD of A_(B).
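The first-step estimation carried out by the SPCA subroutine (Algorithm 2), namely trimming heavy rows, hollowing the covariance, and running a randomized power iteration, can be sketched as follows. This is an illustrative numpy version, not the authors' implementation: the function name is ours, and we re-orthonormalize with a QR factorization at every iteration (a standard numerical stabilization of the power method) rather than only at the end.

```python
import numpy as np

def spca(C, k, rng):
    """Sketch of Algorithm 2 (SPCA) applied to a delta-sampled batch C (m x ell)."""
    m, ell = C.shape
    Cbar = C.copy()
    heavy = (Cbar != 0).sum(axis=1) > 10   # rows with too many observed entries
    Cbar[heavy, :] = 0.0                   # trimming
    Phi = Cbar.T @ Cbar
    np.fill_diagonal(Phi, 0.0)             # the diagonal scales as delta, not delta^2
    X = rng.standard_normal((ell, k))      # Gaussian start Omega
    for _ in range(int(np.ceil(5 * np.log(ell)))):
        X, _ = np.linalg.qr(Phi @ X)       # power iteration, kept stable by QR
    return X                               # (ell x k), orthonormal columns

rng = np.random.default_rng(1)
L = (rng.random((120, 3)) @ rng.random((3, 40))) / 3   # a rank-3 "signal" in [0, 1]
C = np.where(rng.random((120, 40)) < 0.12, L, 0.0)     # delta-sampled batch
Q = spca(C, k=3, rng=rng)
```

The output Q has orthonormal columns whose span approximates that of the top-k right singular vectors of the batch, as quantified below.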
Let us denote by Ā the obtained trimmed matrix. We then form the covariance matrix Ā⊤Ā, and remove its diagonal entries to obtain the matrix Φ = Ā⊤Ā − diag(Ā⊤Ā). Removing the diagonal entries is needed because of the sampling procedure. Indeed, the diagonal entries of Ā⊤Ā scale as δ, whereas its off-diagonal entries scale as δ^2. Hence, when δ is small, the diagonal entries would clearly become dominant in the spectral decomposition. We finally apply the power method to Φ to obtain Q. In the analysis of the performance of Algorithm 2, the following lemma will be instrumental; it provides an upper bound on the gap between Φ and δ^2(M_(B))⊤M_(B) using the matrix Bernstein inequality (Theorem 6.1 in [Tro12]). All proofs are detailed in the Appendix.

Lemma 1 If δ ≤ m^{−8/9}, with probability 1 − 1/ℓ^2, ‖Φ − δ^2(M_(B))⊤M_(B)‖_2 ≤ c_1 δ √(mℓ log(ℓ)), for some constant c_1 > 1.

To address (ii), we first establish in Lemma 2 that, for an appropriate choice of ℓ, the column vectors of V^(B)_{1:k} are approximately orthonormal. This lemma is of independent interest, and relates the SVD of a truncated matrix, here M_(B), to that of the initial matrix M. More precisely:

Lemma 2 If δ ≤ m^{−8/9}, there exists an (ℓ × k) matrix V̄^(B) such that its column vectors are orthonormal, and with probability 1 − exp(−m^{1/7}), for all i ≤ k satisfying s_i^2(M) ≥ (n/(δℓ)) √(mℓ log(ℓ)): ‖√(n/ℓ) V^(B)_{1:i} − V̄^(B)_{1:i}‖_2 ≤ m^{−1/3}.

Note that, as suggested by the above lemma, it might be impossible to recover V^(B)_i when the corresponding singular value s_i(M) is small (more precisely, when s_i^2(M) ≤ (n/(δℓ)) √(mℓ log(ℓ))). However, the singular vectors corresponding to such small singular values generate very little error for low-rank approximation. Thus, we are only interested in singular vectors whose singular values are above the threshold ((n/(δℓ)) √(mℓ log(ℓ)))^{1/2}. Let k' = max{i : s_i^2(M) ≥ (n/(δℓ)) √(mℓ log(ℓ)), i ≤ k}.

Now, to analyze the performance of Algorithm 2 when applied to A_(B), we decompose Φ as Φ = δ^2 (ℓ/n) V̄^(B)_{1:k'} (Σ_{1:k'}^{1:k'})^2 (V̄^(B)_{1:k'})⊤ + Y, where Y = Φ − δ^2 (ℓ/n) V̄^(B)_{1:k'} (Σ_{1:k'}^{1:k'})^2 (V̄^(B)_{1:k'})⊤ is a noise matrix. The following lemma quantifies how noise may affect the performance of the power method, i.e., it provides an upper bound on the gap between Q and V̄^(B)_{1:k'} as a function of the operator norm of the noise matrix Y:

Lemma 3 With probability 1 − 1/ℓ^2, the output Q of SPCA when applied to A_(B) satisfies, for all i ≤ k': ‖(V̄^(B)_{1:i})⊤ Q⊥‖_2 ≤ 3‖Y‖_2 / (δ^2 (ℓ/n) s_i(M)^2).

In the proof, we analyze the power iteration algorithm using results from [HMT11]. To complete the performance analysis of Algorithm 2, it remains to upper bound ‖Y‖_2.
To this aim, we decompose Y into three terms:

Y = (Φ − δ^2 (M_(B))⊤M_(B)) + δ^2 (M_(B))⊤(I − U_{1:k'}U_{1:k'}⊤)M_(B) + δ^2 ( (M_(B))⊤U_{1:k'}U_{1:k'}⊤M_(B) − (ℓ/n) V̄^(B)_{1:k'} (Σ_{1:k'}^{1:k'})^2 (V̄^(B)_{1:k'})⊤ ).

The first term can be controlled using Lemma 1, and the last term is upper bounded using Lemma 2. Finally, the second term corresponds to the error made by ignoring the singular vectors which are not within the top k'. To estimate this term, we use the matrix Chernoff bound (Theorem 2.2 in [Tro11]), and prove that:

Lemma 4 With probability 1 − exp(−m^{1/4}), ‖(I − U_{1:k'}U_{1:k'}⊤)M_(B)‖_2^2 ≤ (2/δ) √(mℓ log(ℓ)) + (ℓ/n) s_{k+1}^2(M).

In summary, combining the four above lemmas, we can establish that Q accurately estimates V̄^(B)_{1:k}:

Theorem 5 If δ ≤ m^{−8/9}, with probability 1 − 3/ℓ^2, the output Q of Algorithm 2 when applied to A_(B) satisfies, for all i ≤ k: ‖(V̄^(B)_{1:i})⊤ Q⊥‖_2 ≤ ( 3δ^2 (s_{k+1}^2(M) + 2m^{2/3}n) + 3(2 + c_1) δ (n/ℓ) √(mℓ log(ℓ)) ) / ( δ^2 s_i^2(M) ), where c_1 is the constant from Lemma 1.

3.2 Step 2: Estimating the principal right singular vectors of M

In this step, we aim at estimating the top k right singular vectors V_{1:k}, or at least at producing k vectors whose linear span approximates that of V_{1:k}. Towards this objective, we start from Q derived in the previous step, and define the (m × k) matrix W = A_(B2)Q. W is stored and kept in memory for the remainder of the algorithm.

It is tempting to directly read from W the top k' left singular vectors U_{1:k'}. Indeed, we know that Q ≈ √(n/ℓ) V^(B)_{1:k}, and E[A_(B2)] = δUΣ(V^(B))⊤, and hence E[W] ≈ δ √(ℓ/n) U_{1:k}Σ_{1:k}^{1:k}. However, the level of the noise in W is too high to accurately extract U_{1:k'}. In turn, W can be written as δUΣ(V^(B))⊤Q + Z, where Z = (A_(B2) − δUΣ(V^(B))⊤)Q partly captures the noise in W. It is then easy to see that the level of the noise Z satisfies E[‖Z‖_2] ≥ E[‖Z‖_F/√k] = Ω(√(δm)). Indeed, first observe that Z is of rank k. Then E[‖Z‖_F^2] = Σ_{i=1}^m Σ_{j=1}^k E[Z_{ij}^2] ≈ mkδ: this is due to the facts that (i) Q and A_(B2) − δUΣ(V^(B))⊤ are independent (since A_(B1) and A_(B2) are independent), (ii) ‖Q_j‖_2^2 = 1 for all j ≤ k, and (iii) the entries of A_(B2) are independent with variance Θ(δ(1 − δ)). However, for all j ≤ k', the j-th singular value of δUΣ(V^(B))⊤Q scales as O(δ√(mℓ)) = O(√(δm/log(m))), since s_j(M) ≤ √(mn) and s_j(M_(B)) ≈ √(ℓ/n) s_j(M) when j ≤ k' from Lemma 2.

Instead, from W, A_(B1) and the subsequent sampled arriving columns A_t, t > ℓ, we produce an (n × k) matrix V̂ whose linear span approximates that of V_{1:k'}. More precisely, we first let V̂^(B) = A_(B1)⊤W. Then, for all t = ℓ+1, . . . , n, we define V̂^t = A_t⊤W, where A_t is obtained from the t-th observed column of M after sampling each of its entries at rate δ. Multiplying W by Ã⊤ = [A_(B1), A_{ℓ+1}, . . . , A_n]⊤ amplifies the useful signal in W, so that V̂ = Ã⊤W constitutes a good approximation of V_{1:k}. To understand why, we can rewrite V̂ as follows:

V̂ = δ^2 M⊤M_(B)Q + δ M⊤(A_(B2) − δM_(B))Q + (Ã − δM)⊤W.

In the above equation, the first term corresponds to the useful signal, and the two remaining terms constitute noise matrices. From Theorem 5, the linear span of the columns of Q approximates that of the columns of V̄^(B), and thus, for j ≤ k', s_j(δ^2 M⊤M_(B)Q) ≈ δ^2 s_j^2(M) √(ℓ/n) ≥ δ √(mn log(ℓ)). The spectral norms of the noise matrices are bounded using random matrix arguments, and the fact that (A_(B2) − δM_(B)) and (Ã − δM) are zero-mean random matrices with independent entries. We can show (see Lemma 14 in the supplementary material), using the independence of A_(B1) and A_(B2), that with high probability ‖δM⊤(A_(B2) − δM_(B))Q‖_2 = O(δ√(mn)). We may also establish that with high probability ‖(Ã − δM)⊤W‖_2 = O(δ√(m(m+n))). This is a consequence of a result derived in [AM07] (quoted in Lemma 13 in the supplementary material), stating that with high probability ‖Ã − δM‖_2 = O(√(δ(m+n))), and of the fact that, due to the trimming process presented in Line 3 of Algorithm 1, ‖W‖_2 = O(√(δm)). In summary, as soon as n scales at least as m, the noise level becomes negligible, and the span of V̂_{1:k'} provides an accurate approximation of that of V_{1:k'}.
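The streaming update of this second step, in which each arriving column is sampled and immediately projected on W before being discarded, can be sketched as follows (an illustrative numpy fragment, not the authors' code; the names are ours):

```python
import numpy as np

def stream_vhat(columns, W, delta, rng):
    """For each arriving column M_t, sample its entries at rate delta and
    produce the corresponding row of V-hat as A_t^T W (Lines 7-8 of SLA).
    Each sampled column can be discarded right after its row is computed."""
    rows = []
    for Mt in columns:
        At = np.where(rng.random(Mt.shape) < delta, Mt, 0.0)
        rows.append(At @ W)          # a length-k row of V-hat
    return np.vstack(rows)

rng = np.random.default_rng(2)
m, k = 60, 4
W = rng.standard_normal((m, k))      # kept in memory throughout
cols = [rng.random(m) for _ in range(10)]
V_rows = stream_vhat(cols, W, delta=1.0, rng=rng)   # delta=1 keeps every entry
```

With delta=1.0 every entry is kept, so each row equals M_t⊤W exactly; with δ < 1 each row is a random quantity whose expectation is δ M_t⊤W, which is the amplification effect analyzed above.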
The above arguments are made precise and rigorous in the supplementary material.\nThe following theorem summarizes the accuracy of our estimator of V1:k.\n\n\u221a\n\n\u221a\n\nTheorem 6 With log4(m)\nprobability 1 \u2212 k\u03b4, (cid:107)V (cid:62)\n\nm\n\n\u2264 \u03b4 \u2264 m\u2212 8\ni ( \u02c6V1:k)\u22a5(cid:107)2 \u2264 c2\n\n9 for all i \u2264 k, there exists a constant c2 such that with\n\n\u221a\n\n\u221a\n\ns2\nk+1(M )+n log(m)\ns2\ni (M )\n\nm/\u03b4+m\n\nn log(m)/\u03b4\n\n.\n\n3.3 Step 3: Estimating the principal left singular vectors of M\n\n\u03b4\n\n\u02dcAP \u02c6V .\n\n1:kV (cid:62)\n\n1:k = M PV1:k, where PV1:k = V1:kV (cid:62)\n\nIn the last step, we estimate the principal left singular vectors of M to \ufb01nally derive an estimator of\nM (k), the optimal rank-k approximation of M. The construction of this estimator is based on the\n1:k is an (n \u00d7 n) matrix\nobservation that M (k) = U1:k\u03a31:k\nrepresenting the projection onto the linear span of the top k right singular vectors V1:k of M. Hence\nto estimate M (k), we try to approximate the matrix PV1:k. To this aim, we construct a (k\u00d7 k) matrix\n\u02c6R so that the column vectors of \u02c6V \u02c6R form an orthonormal basis whose span corresponds to that\nof the column vectors of \u02c6V . This construction is achieved using Gram-Schmidt process. We then\napproximate PV1:k by P \u02c6V = \u02c6V \u02c6R \u02c6R(cid:62) \u02c6V (cid:62), and \ufb01nally our estimator \u02c6M (k) of M (k) is 1\nThe construction of \u02c6M (k) can be made in a memory ef\ufb01cient way accommodating for our streaming\nmodel where the columns of M arrive one after the other, as described in the pseudo-code of SLA.\n\u02c6V (B). Then, for t =\nFirst, after constructing \u02c6V (B) in Step 2, we build the matrix \u02c6I = A(B1)\n(cid:96) + 1, . . . 
, n$, after constructing the $t$-th line $\hat{V}^t$ of $\hat{V}$, we update $\hat{I}$ by adding to it the matrix $A_t\hat{V}^t$, so that after all columns of $M$ are observed, $\hat{I} = \tilde{A}\hat{V}$. Hence we can build an estimator $\hat{U}$ of the principal left singular vectors of $M$ as $\hat{U} = \frac{1}{\delta}\hat{I}\hat{R}\hat{R}^\top$, and finally obtain $\hat{M}^{(k)} = [\hat{U}\hat{V}^\top]_0^1$.
To quantify the estimation error of $\hat{M}^{(k)}$, we decompose $M^{(k)} - \hat{M}^{(k)}$ as:
$$M^{(k)} - \hat{M}^{(k)} = M^{(k)}(I - P_{\hat{V}}) + (M^{(k)} - M)P_{\hat{V}} + (M - \tfrac{1}{\delta}\tilde{A})P_{\hat{V}}.$$
The first term of the r.h.s. of the above equation can be bounded using Theorem 6: for $i \le k$, we have $s_i(M)^2\|V_i^\top\hat{V}_\perp\| \le z = c_2(s_{k+1}^2(M) + n\log(m)\sqrt{m/\delta} + m\sqrt{n\log(m)/\delta})$, and hence we can conclude that for all $i \le k$, $\|s_i(M)U_iV_i^\top(I - P_{\hat{V}})\|_F^2 \le z$. The second term can be easily bounded observing that the matrix $(M^{(k)} - M)P_{\hat{V}}$ is of rank $k$: $\|(M^{(k)} - M)P_{\hat{V}}\|_F^2 \le k\|(M^{(k)} - M)P_{\hat{V}}\|_2^2 \le k\|M^{(k)} - M\|_2^2 = k s_{k+1}(M)^2$. The last term in the r.h.s. can be controlled as in the performance analysis of Step 2, observing that $(\frac{1}{\delta}\tilde{A} - M)P_{\hat{V}}$ is of rank $k$: $\|(\frac{1}{\delta}\tilde{A} - M)P_{\hat{V}}\|_F^2 \le k\|\frac{1}{\delta}\tilde{A} - M\|_2^2 = O(k(m+n)/\delta)$.
It is then easy to remark that, for the range of the parameter $\delta$ we are interested in, the upper bound $z$ of the first term dominates the upper bounds of the two other terms.
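In code, the Step 3 construction described above can be sketched as follows. This is a toy stand-in, not the streaming implementation: $\hat{V}$ is built directly from an SVD of the sparsified matrix and then deliberately de-orthonormalized, NumPy's QR factorization plays the role of the Gram-Schmidt step, and the matrix sizes, seed, and variable names (`A_tilde`, `Vhat`, `Rhat`) are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k, delta = 500, 1000, 3, 0.3

# An exactly rank-k matrix with entries in [0, 1], so that M^(k) = M here.
M = (rng.random((m, k)) @ rng.random((k, n))) / k

# Sparsified observations, as in SLA: each entry kept with probability delta.
A_tilde = np.where(rng.random((m, n)) < delta, M, 0.0)

# Stand-in for Step 2's output: a (deliberately non-orthonormal) n x k matrix
# Vhat whose column span approximates the top-k right singular space of M.
Vhat = np.linalg.svd(A_tilde)[2][:k].T @ rng.standard_normal((k, k))

# Gram-Schmidt step: find the k x k matrix Rhat such that the columns of
# Vhat @ Rhat are orthonormal; here obtained via the QR factorization.
Q_, R_ = np.linalg.qr(Vhat)
Rhat = np.linalg.inv(R_)        # Vhat @ Rhat == Q_, an orthonormal basis

# Projector onto span(Vhat), and the final estimator truncated to [0, 1].
P_Vhat = Vhat @ Rhat @ Rhat.T @ Vhat.T
M_hat = np.clip(A_tilde @ P_Vhat / delta, 0.0, 1.0)

mse = np.linalg.norm(M_hat - M) ** 2 / (m * n)
print(mse < 0.05)  # True: small average mean-square error
```

In the actual algorithm $P_{\hat{V}}$ is never formed explicitly; $\hat{M}^{(k)}$ is assembled from $\hat{I}\hat{R}\hat{R}^\top$ and $\hat{V}^\top$ so that only $O(k(m+n))$ memory is ever used.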
Finally, we obtain the following result (see the supplementary material for a complete proof):

Theorem 7 When $\frac{\log^4(m)}{m} \le \delta \le m^{-8/9}$, with probability $1 - k\delta$, the output of the SLA algorithm satisfies, for some constant $c_3$:
$$\frac{\|M^{(k)} - [\hat{U}\hat{V}^\top]_0^1\|_F^2}{mn} \le c_3 k^2\left(\frac{s_{k+1}^2(M)}{mn} + \frac{\log(m)}{\sqrt{\delta m}} + \sqrt{\frac{\log(m)}{\delta n}}\right).$$

Note that if $\frac{\log^4(m)}{m} \le \delta \le m^{-8/9}$, then $\frac{\log(m)}{\sqrt{\delta m}} = o(1)$. Hence, if $n \ge m$, the SLA algorithm provides an asymptotically accurate estimate of $M^{(k)}$ as soon as $\frac{s_{k+1}(M)^2}{mn} = o(1)$.

3.4 Required Memory and Running Time

Required memory.
Lines 1-6 in SLA pseudo-code. $A^{(B_1)}$ and $A^{(B_2)}$ have $O(\delta m\ell)$ non-zero entries, and we need $O(\delta m\ell\log m)$ bits to store the ids of these entries. Similarly, the memory required to store $\Phi$ is $O(\delta^2 m\ell^2\log(\ell))$. Storing $Q$ further requires $O(\ell k)$ memory. Finally, $\hat{V}^{(B_1)}$ and $\hat{I}$ computed in Line 6 require $O(\ell k)$ and $O(km)$ memory space, respectively. Thus, when $\ell = \frac{1}{\delta}\log m$, this first part of the algorithm requires $O(k(m+n))$ memory.
Lines 7-9.
Before we treat the remaining columns, $A^{(B_1)}$, $A^{(B_2)}$, and $Q$ are removed from the memory. Using this released memory, when the $t$-th column arrives, we can store it, compute $\hat{V}^t$ and $\hat{I}$, and remove the column to save memory. Therefore, we do not need additional memory to treat the remaining columns.
Lines 10 and 11. From $\hat{I}$ and $\hat{V}$, we compute $\hat{U}$. To this aim, the memory required is $O(k(m+n))$.
Running time.
Lines 1 to 6. The SPCA algorithm requires $O(\ell k(\delta^2 m\ell + k)\log(\ell))$ floating-point operations to compute $Q$. $W$, $\hat{V}$, and $\hat{I}$ are inner products, and their computations require $O(\delta km\ell)$ operations. With $\ell = \frac{1}{\delta}\log(m)$, the number of operations to treat the first $\ell$ columns is $O(\ell k(\delta^2 m\ell + k)\log(\ell) + k\delta m\ell) = O(km) + O(\frac{k^2}{\delta})$.
Lines 7 to 9. To compute $\hat{V}^t$ and $\hat{I}$ when the $t$-th column arrives, we need $O(\delta km)$ operations. Since there are $n - \ell$ remaining columns, the total number of operations is $O(\delta kmn)$.
Lines 10 and 11. $\hat{R}$ is computed from $\hat{V}$ using the Gram-Schmidt process, which requires $O(k^2 m)$ operations. We then compute $\hat{I}\hat{R}\hat{R}^\top$ using $O(k^2 m)$ operations.
In summary, we have shown that:

Theorem 8 The memory required to run the SLA algorithm is $O(k(m+n))$. Its running time is $O(\delta kmn + \frac{k^2}{\delta} + k^2 m)$.

Observe that when $\delta \ge \max(\frac{(\log(m))^4}{m}, \frac{(\log(m))^2}{n})$ and $k \le (\log(m))^6$, we have $\delta kmn \ge k^2/\delta \ge k^2 m$, and therefore, the running time of SLA is $O(\delta kmn)$.

3.5 General Streaming Model

SLA is a one-pass low-rank approximation algorithm, but the set of the $\ell$ first observed columns of $M$ needs to be chosen uniformly at random. We can readily extend SLA to deal with scenarios where the columns of $M$ are observed in an arbitrary order.
This extension requires two passes on $M$, but otherwise performs exactly the same operations as SLA. In the first pass, we extract a set of $\ell$ columns chosen uniformly at random, and in the second pass, we deal with all other columns. To extract $\ell$ randomly selected columns in the first pass, we proceed as follows. Assume that when the $t$-th column of $M$ arrives, we have already extracted $l$ columns. Then the $t$-th column is extracted with probability $\frac{\ell - l}{n - t + 1}$. This two-pass version of SLA enjoys the same performance guarantees as those of SLA.

4 Conclusion

This paper revisited the low-rank approximation problem. We proposed a streaming algorithm that samples the data and produces a near-optimal solution with a vanishing mean-square error. The algorithm uses a memory space scaling linearly with the ambient dimension of the matrix, i.e., the memory required to store the output alone. Its running time scales as the number of sampled entries of the input matrix. The algorithm is relatively simple and, in particular, does not exploit the elaborate techniques (such as sparse embedding techniques) recently developed to reduce the memory requirement and complexity of algorithms addressing various problems in linear algebra.

References

[AM07] Dimitris Achlioptas and Frank McSherry. Fast computation of low-rank matrix approximations. Journal of the ACM (JACM), 54(2):9, 2007.

[BJS15] Srinadh Bhojanapalli, Prateek Jain, and Sujay Sanghavi. Tighter low-rank approximation via sampling the leveraged element. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 902-920. SIAM, 2015.

[CW09] Kenneth L. Clarkson and David P. Woodruff. Numerical linear algebra in the streaming model. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 205-214. ACM, 2009.

[CW13] Kenneth L. Clarkson and David P. Woodruff.
Low rank approximation and regression in input sparsity time. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 81-90. ACM, 2013.

[GP14] Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA, pages 707-717. SIAM, 2014.

[HMT11] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.

[Lib13] Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581-588. ACM, 2013.

[MCJ13] Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems, 2013.

[Tro11] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011.

[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.

[Woo14] David Woodruff. Low rank approximation lower bounds in row-update streams. In Advances in Neural Information Processing Systems, pages 1781-1789, 2014.