{"title": "A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound", "book": "Advances in Neural Information Processing Systems", "page_first": 647, "page_last": 655, "abstract": "The CUR matrix decomposition is an important extension of Nystr\u00f6m approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.", "full_text": "A Scalable CUR Matrix Decomposition Algorithm:\n\nLower Time Complexity and Tighter Bound\n\nShusen Wang and Zhihua Zhang\n\nCollege of Computer Science & Technology\n\nZhejiang University\n\nHangzhou, China 310027\n\n{wss,zhzhang}@zju.edu.cn\n\nAbstract\n\nThe CUR matrix decomposition is an important extension of Nystr\u00a8om approxima-\ntion to a general matrix. It approximates any data matrix in terms of a small num-\nber of its columns and rows. In this paper we propose a novel randomized CUR\nalgorithm with an expected relative-error bound. The proposed algorithm has the\nadvantages over the existing relative-error CUR algorithms that it possesses tighter\ntheoretical bound and lower time complexity, and that it can avoid maintaining the\nwhole data matrix in main memory. Finally, experiments on several real-world\ndatasets demonstrate signi\ufb01cant improvement over the existing relative-error al-\ngorithms.\n\n1 Introduction\n\nLarge-scale matrices emerging from stocks, genomes, web documents, web images and videos ev-\neryday bring new challenges in modern data analysis. 
Most efforts have been focused on manipulating, understanding and interpreting large-scale data matrices. In many cases, matrix factorization methods are employed to construct compressed and informative representations that facilitate computation and interpretation. A principled approach is the truncated singular value decomposition (SVD), which finds the best low-rank approximation of a data matrix. Applications of SVD such as eigenfaces [20, 21] and latent semantic analysis [4] have proven very successful.
However, the basis vectors resulting from SVD have little concrete meaning, which makes it very difficult to understand and interpret the data in question. An example in [10, 19] illustrates this point: the vector [(1/2)age − (1/√2)height + (1/2)income], a combination of significant uncorrelated features from a dataset of people's features, is not particularly informative. The authors of [17] have also remarked: “it would be interesting to try to find basis vectors for all experiment vectors, using actual experiment vectors and not artificial bases that offer little insight.” Therefore, it is of great interest to represent a data matrix in terms of a small number of actual columns and/or actual rows of the matrix.
The CUR matrix decomposition provides such techniques, and it has been shown to be very useful in high dimensional data analysis [19]. Given a matrix A, the CUR technique selects a subset of columns of A to construct a matrix C and a subset of rows of A to construct a matrix R, and computes a matrix U such that Ã = CUR best approximates A. The typical CUR algorithms [7, 8, 10] work in a two-stage manner. Stage 1 is a standard column selection procedure, and Stage 2 does row selection from A and C simultaneously.
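The decomposition just described can be illustrated numerically. Below is a minimal numpy sketch; it is illustrative only, picking columns and rows uniformly at random rather than with the carefully designed sampling probabilities discussed in this paper, and it uses the standard choice U = C†AR†, which minimizes ∥A − CUR∥_F for fixed C and R:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy low-rank matrix standing in for a large data matrix A.
m, n, true_rank = 60, 40, 10
A = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))

# Pick c columns and r rows uniformly at random (for illustration only).
c, r = 15, 20
C = A[:, rng.choice(n, size=c, replace=False)]   # m x c
R = A[rng.choice(m, size=r, replace=False), :]   # r x n

# U = pinv(C) @ A @ pinv(R) minimizes ||A - C U R||_F for the given C and R.
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
print(err)  # near zero here: C and R generically span the rank-10 spaces
```

Here the reconstruction is essentially exact because rank(A) = 10 and the randomly chosen columns and rows generically span the column and row spaces; on real, noisy data the error depends on how well C and R are chosen, which is the subject of this paper.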
Thus Stage 2 is more complicated than Stage 1.
The CUR matrix decomposition problem is widely studied in the literature [7, 8, 9, 10, 12, 13, 16, 18, 19, 22]. Perhaps the most widely known work on the CUR problem is [10], in which the authors devised a randomized CUR algorithm called the subspace sampling algorithm. In particular, the algorithm has a (1 + ϵ) relative-error ratio with high probability (w.h.p.).
Unfortunately, all the existing CUR algorithms require a large number of columns and rows to be chosen. For example, for an m × n matrix A and a target rank k ≤ min{m, n}, the state-of-the-art CUR algorithm, the subspace sampling algorithm in [10], requires exactly O(k^4 ϵ^{-6}) rows, or O(k ϵ^{-4} log^2 k) rows in expectation, to achieve a (1 + ϵ) relative-error ratio w.h.p. Moreover, the computational cost of this algorithm is at least the cost of the truncated SVD of A, that is, O(min{mn^2, nm^2}).¹ The algorithms are therefore impractical for large-scale matrices.
In this paper we develop a CUR algorithm which beats the state-of-the-art algorithm in both theory and experiments. In particular, we show in Theorem 5 a novel randomized CUR algorithm with lower time complexity and a tighter theoretical bound in comparison with the state-of-the-art CUR algorithm in [10].
The rest of this paper is organized as follows. Section 3 introduces several existing column selection algorithms and the state-of-the-art CUR algorithm. Section 4 describes and analyzes our novel CUR algorithm. Section 5 empirically compares our proposed algorithm with the state-of-the-art algorithm.

2 Notations

For a matrix A = [a_{ij}] ∈ R^{m×n}, let a_{(i)} be its i-th row and a_j be its j-th column. Let ∥A∥_1 = ∑_{i,j} |a_{ij}| be the ℓ1-norm, ∥A∥_F = (∑_{i,j} a_{ij}^2)^{1/2} be the Frobenius norm, and ∥A∥_2 be the spectral norm. Moreover, let I_m denote the m × m identity matrix, and 0_{mn} denote the m × n zero matrix. Let A = U_A Σ_A V_A^T = U_{A,k} Σ_{A,k} V_{A,k}^T + U_{A,k⊥} Σ_{A,k⊥} V_{A,k⊥}^T be the SVD of A, where ρ = rank(A) and U_{A,k}, Σ_{A,k}, and V_{A,k} correspond to the top k singular values. We denote A_k = U_{A,k} Σ_{A,k} V_{A,k}^T = ∑_{i=1}^{k} σ_{A,i} u_{A,i} v_{A,i}^T. Furthermore, let A† = V_{A,ρ} Σ_{A,ρ}^{-1} U_{A,ρ}^T be the Moore-Penrose inverse of A [1].

3 Related Work

Section 3.1 introduces several relative-error column selection algorithms related to this work. Section 3.2 describes the state-of-the-art CUR algorithm in [10]. Section 3.3 discusses the connection between the column selection problem and the CUR problem.

3.1 Relative-Error Column Selection Algorithms

Given a matrix A ∈ R^{m×n}, column selection is the problem of selecting c columns of A to construct C ∈ R^{m×c} so as to minimize ∥A − CC†A∥_F. Since there are (n choose c) possible ways of constructing C, selecting the best subset is a hard problem. In recent years, many polynomial-time approximate algorithms have been proposed, among which we are particularly interested in the algorithms with relative-error bounds; that is, with c ≥ k columns selected from A, there is a constant η such that

∥A − CC†A∥_F ≤ η ∥A − A_k∥_F.

We call η the relative-error ratio. We now present some recent results related to this work.
We first introduce a recently developed deterministic algorithm called the dual set sparsification, proposed in [2, 3]. We show its results in Lemma 1. This algorithm is a building block of some more powerful algorithms (e.g., Lemma 2), and our novel CUR algorithm also relies on this algorithm.
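As a concrete companion to this notation, here is a small numpy sketch (variable names are illustrative) showing the truncated SVD A_k and the Moore-Penrose inverse built from the SVD factors:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 20))
k = 5

# Truncated SVD: A_k = U_{A,k} Sigma_{A,k} V_{A,k}^T is the best rank-k
# approximation of A in Frobenius (and spectral) norm.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]

# ||A - A_k||_F^2 equals the sum of the squared discarded singular values.
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err**2, np.sum(s[k:] ** 2)))  # True

# Moore-Penrose inverse via the SVD: A^+ = V Sigma^{-1} U^T (A has full rank
# here, so all singular values are safely non-zero).
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))  # True
```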
We attach the algorithm in Appendix A.

Lemma 1 (Column Selection via Dual Set Sparsification Algorithm). Given a matrix A ∈ R^{m×n} of rank ρ and a target rank k (< ρ), there exists a deterministic algorithm to select c (> k) columns of A and form a matrix C ∈ R^{m×c} such that

∥A − CC†A∥_F ≤ (1 + (1 − √(k/c))^{-2})^{1/2} ∥A − A_k∥_F.

Moreover, the matrix C can be computed in T_{V_{A,k}} + O(mn + nck^2) time, where T_{V_{A,k}} is the time needed to compute the top k right singular vectors of A.

¹Although some partial SVD algorithms, such as Krylov subspace methods, require only O(mnk) time, they are all numerically unstable. See [15] for more discussion.

There is also a variety of randomized column selection algorithms achieving relative-error bounds in the literature [3, 5, 6, 10, 14]. A randomized algorithm in [2] selects only c = (2k/ϵ)(1 + o(1)) columns to achieve an expected relative-error ratio of (1 + ϵ). The algorithm is based on the approximate SVD via random projection [15], the dual set sparsification algorithm [2], and the adaptive sampling algorithm [6]. Here we present the main results of this algorithm in Lemma 2. Our proposed CUR algorithm is motivated by and relies on this algorithm.

Lemma 2 (Near-Optimal Column Selection Algorithm). Given a matrix A ∈ R^{m×n} of rank ρ, a target rank k (2 ≤ k < ρ), and 0 < ϵ < 1, there exists a randomized algorithm to select at most

c = (2k/ϵ)(1 + o(1))

columns of A to form a matrix C ∈ R^{m×c} such that

E^2 ∥A − CC†A∥_F ≤ E ∥A − CC†A∥_F^2 ≤ (1 + ϵ) ∥A − A_k∥_F^2,

where the expectations are taken w.r.t. C. Furthermore, the matrix C can be computed in O((mnk + nk^3) ϵ^{-2/3}) time.

3.2 The Subspace Sampling CUR Algorithm

Drineas et al. [10] proposed a two-stage randomized CUR algorithm which has a relative-error bound w.h.p. Given a matrix A ∈ R^{m×n} and a target rank k, in the first stage the algorithm chooses exactly c = O(k^2 ϵ^{-2} log δ^{-1}) columns (or c = O(k ϵ^{-2} log k log δ^{-1}) columns in expectation) of A to construct C ∈ R^{m×c}; in the second stage it chooses exactly r = O(c^2 ϵ^{-2} log c log δ^{-1}) rows (or r = O(c ϵ^{-2} log δ^{-1}) rows in expectation) of A and C simultaneously to construct R and U. With probability at least 1 − δ, the relative-error ratio is 1 + ϵ. The computational cost is dominated by the truncated SVD of A and C.
Though the algorithm is ϵ-optimal with high probability, it requires too many rows to be chosen: at least r = O(k ϵ^{-4} log^2 k) rows in expectation. In this paper we seek to devise an algorithm with milder requirements on the column and row numbers.

3.3 Connection between Column Selection and CUR Matrix Decomposition

The CUR problem has a close connection with the column selection problem. As aforementioned, the first stage of the existing CUR algorithms is simply a column selection procedure. However, the second stage is more complicated. If the second stage is naïvely solved by a column selection algorithm on A^T, then the error ratio will be at least (2 + ϵ).
For a relative-error CUR algorithm, the first stage seeks to bound the construction error ratio ∥A − CC†A∥_F / ∥A − A_k∥_F, while the second stage, given C, seeks to bound ∥A − CC†AR†R∥_F / ∥A − CC†A∥_F. Actually, the first stage is a special case of the second stage in which C = A_k.
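The relative-error criterion of Section 3.1 is easy to evaluate numerically. The sketch below uses plain column-norm sampling, a simple heuristic that is not the near-optimal algorithm of Lemma 2, merely to show how the ratio η is measured:

```python
import numpy as np

rng = np.random.default_rng(2)
# Low-rank signal plus small noise (a hypothetical test matrix).
A = (rng.standard_normal((50, 8)) @ rng.standard_normal((8, 40))
     + 0.01 * rng.standard_normal((50, 40)))
k, c = 8, 16

# Sample c distinct columns with probability proportional to squared column
# norms (a heuristic stand-in for the near-optimal algorithm of Lemma 2).
p = np.sum(A**2, axis=0) / np.sum(A**2)
idx = rng.choice(A.shape[1], size=c, replace=False, p=p)
C = A[:, idx]

# eta = ||A - C pinv(C) A||_F / ||A - A_k||_F
s = np.linalg.svd(A, compute_uv=False)
best_k = np.sqrt(np.sum(s[k:] ** 2))                  # ||A - A_k||_F
proj_err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, 'fro')
eta = proj_err / best_k
print(eta)
```

Note that because C here has c = 16 > k = 8 columns, η can even fall below 1: the projection onto 16 columns may beat the best rank-8 approximation.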
Given a matrix A, if an algorithm solving the second stage results in a bound ∥A − CC†AR†R∥_F / ∥A − CC†A∥_F ≤ η, then this algorithm also solves the column selection problem for A^T with an η relative-error ratio. Thus the second stage of CUR is a generalization of the column selection problem.

4 Main Results

In this section we introduce our proposed CUR algorithm. We call it the fast CUR algorithm because it has lower time complexity than the SVD. We describe it in Algorithm 1 and give a theoretical analysis in Theorem 5. Theorem 5 relies on Lemma 2 and Theorem 4, and Theorem 4 relies on Theorem 3. Theorem 3 is a generalization of [6, Theorem 2.1], and Theorem 4 is a generalization of [2, Theorem 5].

Algorithm 1 The Fast CUR Algorithm.
1: Input: a real matrix A ∈ R^{m×n}, target rank k, ϵ ∈ (0, 1], target column number c = (2k/ϵ)(1 + o(1)), target row number r = (2c/ϵ)(1 + o(1));
2: // Stage 1: select c columns of A to construct C ∈ R^{m×c}
3: Compute an approximate truncated SVD via random projection such that A_k ≈ ~U_k ~Σ_k ~V_k^T;
4: Construct U_1 ← columns of (A − ~U_k ~Σ_k ~V_k^T); V_1 ← columns of ~V_k^T;
5: Compute s_1 ← Dual Set Spectral-Frobenius Sparsification Algorithm (U_1, V_1, c − 2k/ϵ);
6: Construct C_1 ← A Diag(s_1), and then delete the all-zero columns;
7: Residual matrix D ← A − C_1 C_1† A;
8: Compute sampling probabilities p_i = ∥d_i∥_2^2 / ∥D∥_F^2 for i = 1, · · · , n;
9: Sample c_2 = 2k/ϵ columns of A with probabilities {p_1, · · · , p_n} to construct C_2;
10: // Stage 2: select r rows of A to construct R ∈ R^{r×n}
11: Construct U_2 ← columns of (A − ~U_k ~Σ_k ~V_k^T)^T; V_2 ← columns of ~U_k^T;
12: Compute s_2 ← Dual Set Spectral-Frobenius Sparsification Algorithm (U_2, V_2, r − 2c/ϵ);
13: Construct R_1 ← Diag(s_2) A, and then delete the all-zero rows;
14: Residual matrix B ← A − A R_1† R_1; compute q_j = ∥b_{(j)}∥_2^2 / ∥B∥_F^2 for j = 1, · · · , m;
15: Sample r_2 = 2c/ϵ rows of A with probabilities {q_1, · · · , q_m} to construct R_2;
16: return C = [C_1, C_2], R = [R_1^T, R_2^T]^T, and U = C†AR†.

4.1 Adaptive Sampling

The relative-error adaptive sampling algorithm is established in [6, Theorem 2.1]. The algorithm is based on the following idea: after selecting a proportion of columns from A to form C_1 by an arbitrary algorithm, it randomly samples c_2 additional columns according to the residual A − C_1 C_1† A. Boutsidis et al. [2] used the adaptive sampling algorithm to decrease the residual of the dual set sparsification algorithm and obtained a (1 + ϵ) relative-error bound. Here we prove a new bound for the adaptive sampling algorithm. Interestingly, this new bound is a generalization of the original one in [6, Theorem 2.1]. In other words, Theorem 2.1 of [6] is a direct corollary of the following theorem in which C = A_k is set.

Theorem 3 (The Adaptive Sampling Algorithm). Given a matrix A ∈ R^{m×n} and a matrix C ∈ R^{m×c} such that rank(C) = rank(CC†A) = ρ (ρ ≤ c ≤ n), let R_1 ∈ R^{r_1×n} consist of r_1 rows of A, and define the residual B = A − A R_1† R_1. Additionally, for i = 1, · · · , m, define p_i = ∥b_{(i)}∥_2^2 / ∥B∥_F^2. We further sample r_2 rows i.i.d. from A, in each trial choosing the i-th row with probability p_i. Let R_2 ∈ R^{r_2×n} contain the r_2 sampled rows and let R = [R_1^T, R_2^T]^T ∈ R^{(r_1+r_2)×n}. Then the following inequality holds:

E ∥A − CC†AR†R∥_F^2 ≤ ∥A − CC†A∥_F^2 + (ρ/r_2) ∥A − A R_1† R_1∥_F^2,

where the expectation is taken w.r.t. R_2.

4.2 The Fast CUR Algorithm

Based on the dual set sparsification algorithm of Lemma 1 and the adaptive sampling algorithm of Theorem 3, we develop a randomized algorithm to solve the second stage of the CUR problem. We present the results of the algorithm in Theorem 4. Theorem 5 of [2] is a special case of the following theorem where C = A_k.

Theorem 4 (The Fast Row Selection Algorithm). Given a matrix A ∈ R^{m×n}, a matrix C ∈ R^{m×c} such that rank(C) = rank(CC†A) = ρ (ρ ≤ c ≤ n), and a target rank k (≤ ρ), the proposed randomized algorithm selects r = (2ρ/ϵ)(1 + o(1)) rows of A to construct R ∈ R^{r×n} such that

E ∥A − CC†AR†R∥_F^2 ≤ ∥A − CC†A∥_F^2 + ϵ ∥A − A_k∥_F^2,

where the expectation is taken w.r.t. R. Furthermore, the matrix R can be computed in O((mnk + mk^3) ϵ^{-2/3}) time.

Based on Lemma 2 and Theorem 4, here we present the main theorem for the fast CUR algorithm.

Table 1: A summary of the datasets.

Dataset  Type           Size          Source
Redrock  natural image  18000 × 4000  http://www.agarwala.org/efficient gdc/
Arcene   biology        10000 × 900   http://archive.ics.uci.edu/ml/datasets/Arcene
Dexter   bag of words   20000 × 2600  http://archive.ics.uci.edu/ml/datasets/Dexter

Theorem 5 (The Fast CUR Algorithm).
Given a matrix A ∈ R^{m×n} and a positive integer k ≪ min{m, n}, the fast CUR algorithm (described in Algorithm 1) randomly selects c = (2k/ϵ)(1 + o(1)) columns of A to construct C ∈ R^{m×c} with the near-optimal column selection algorithm of Lemma 2, and then selects r = (2c/ϵ)(1 + o(1)) rows of A to construct R ∈ R^{r×n} with the fast row selection algorithm of Theorem 4. Then we have

E ∥A − CUR∥_F = E ∥A − C(C†AR†)R∥_F ≤ (1 + ϵ) ∥A − A_k∥_F.

Moreover, the algorithm runs in time O(mnk ϵ^{-2/3} + (m + n)k^3 ϵ^{-2/3} + mk^2 ϵ^{-2} + nk^2 ϵ^{-4}).

Since k, c, r ≪ min{m, n} by assumption, the time complexity of the fast CUR algorithm is lower than that of the SVD of A. This is the main reason why we call it the fast CUR algorithm. Another advantage of this algorithm is that it avoids loading the whole m × n data matrix A into main memory. None of the three steps (the randomized SVD, the dual set sparsification algorithm, and the adaptive sampling algorithm) requires loading the whole of A into memory. The most memory-expensive operation throughout the fast CUR algorithm is computing the Moore-Penrose inverses of C and R, which requires maintaining an m × c matrix or an r × n matrix in memory. In comparison, the subspace sampling algorithm requires loading the whole matrix into memory to compute its truncated SVD.

5 Empirical Comparisons

In this section we provide empirical comparisons among the relative-error CUR algorithms on several datasets. We report the relative-error ratio and the running time of each algorithm on each dataset. The relative-error ratio is defined by

Relative-error ratio = ∥A − CUR∥_F / ∥A − A_k∥_F,

where k is a specified target rank.
We conduct experiments on three datasets: a natural image, biology data, and a bag of words. Table 1 briefly summarizes some information of the datasets.
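The evaluation metric above can be wrapped in a small helper; in the sketch below the function name is illustrative, and the uniformly sampled C and R are stand-ins for the outputs of the compared algorithms' two stages:

```python
import numpy as np

def relative_error_ratio(A, C, U, R, k):
    # ||A - C U R||_F / ||A - A_k||_F for a specified target rank k.
    s = np.linalg.svd(A, compute_uv=False)
    best_k = np.sqrt(np.sum(s[k:] ** 2))   # ||A - A_k||_F
    return np.linalg.norm(A - C @ U @ R, 'fro') / best_k

rng = np.random.default_rng(3)
A = (rng.standard_normal((40, 8)) @ rng.standard_normal((8, 30))
     + 0.05 * rng.standard_normal((40, 30)))

# Uniformly sampled C and R stand in for the outputs of stages 1 and 2.
C = A[:, rng.choice(30, size=16, replace=False)]
R = A[rng.choice(40, size=20, replace=False), :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
print(relative_error_ratio(A, C, U, R, k=8))
```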
Redrock is a large natural image. Arcene and Dexter are both from the UCI repository [11]. Arcene is a biology dataset with 900 instances and 10000 attributes. Dexter is a bag-of-words dataset with a 20000-word vocabulary and 2600 documents. Each dataset is represented as a data matrix, upon which we apply the CUR algorithms.
We implement all the algorithms in MATLAB 7.10.0. We conduct experiments on a workstation with 12 Intel Xeon 3.47GHz CPUs, 12GB memory, and the Ubuntu 10.04 system. According to the analysis in [10] and in this paper, k, c, and r should be integers far smaller than m and n. For each dataset and each algorithm, we set k = 10, 20, or 50, and c = αk, r = αc, where α ranges over each set of experiments. We repeat each set of experiments 20 times and report the mean and the standard deviation of the error ratios. The results are depicted in Figures 1, 2, and 3.
The results show that the fast CUR algorithm has a much lower relative-error ratio than the subspace sampling algorithm. The experimental results match our theoretical analysis in Section 4 well. As for the running time, the fast CUR algorithm is more efficient when c and r are small. When c and r become large, the fast CUR algorithm becomes less efficient. This is because the time complexity of the fast CUR algorithm is linear in ϵ^{-4}, and large c and r imply small ϵ. However, the purpose of CUR is to select a small number of columns and rows from the data matrix, that is, c ≪ n and r ≪ m.
So we are not interested in the cases where c and r are large compared with n and m, say k = 20 and α = 10.

Figure 1: Empirical results on the Redrock data set. Panels: (a) k = 10, (b) k = 20, (c) k = 50, with c = αk and r = αc. Each panel plots the running time (s) and the relative-error ratio (Frobenius-norm construction error) against α for Subspace Sampling (Exactly), Subspace Sampling (Expected), and Fast CUR.

Figure 2: Empirical results on the Arcene data set. Panels (a), (b), (c) as in Figure 1.

Figure 3: Empirical results on the Dexter data set. Panels (a), (b), (c) as in Figure 1.

6 Conclusions

In this paper we have proposed a novel randomized algorithm for the CUR matrix decomposition problem. This algorithm is faster, more scalable, and more accurate than the state-of-the-art algorithm, i.e., the subspace sampling algorithm. Our algorithm requires only c = 2k ϵ^{-1}(1 + o(1)) columns and r = 2c ϵ^{-1}(1 + o(1)) rows to achieve a (1 + ϵ) relative-error ratio. To achieve the same relative-error bound, the subspace sampling algorithm requires c = O(k ϵ^{-2} log k) columns and r = O(c ϵ^{-2} log c) rows selected from the original matrix. Our algorithm also beats the subspace sampling algorithm in time complexity. Our algorithm costs O(mnk ϵ^{-2/3} + (m + n)k^3 ϵ^{-2/3} + mk^2 ϵ^{-2} + nk^2 ϵ^{-4}) time, which is lower than the O(min{mn^2, m^2n}) of the subspace sampling algorithm when k is small. Moreover, our algorithm enjoys the further advantage of avoiding loading the whole data matrix into main memory, which also makes it more scalable. Finally, the empirical comparisons have demonstrated the effectiveness and efficiency of our algorithm.

A The Dual Set Sparsification Algorithm

For the sake of completeness, we attach the dual set sparsification algorithm here and describe some implementation details. The dual set sparsification algorithms are deterministic algorithms established in [2].
The fast CUR algorithm calls the dual set spectral-Frobenius sparsification algorithm [2, Lemma 13] in both stages. We show this algorithm in Algorithm 2 and its bounds in Lemma 6.

Lemma 6 (Dual Set Spectral-Frobenius Sparsification). Let U = {x_1, · · · , x_n} ⊂ R^l (l < n) contain the columns of an arbitrary matrix X ∈ R^{l×n}, and let V = {v_1, · · · , v_n} ⊂ R^k (k < n) be a decomposition of the identity, i.e., ∑_{i=1}^n v_i v_i^T = I_k. Given an integer r with k < r < n, Algorithm 2 deterministically computes a set of weights s_i ≥ 0 (i = 1, · · · , n), at most r of which are non-zero, such that

λ_k(∑_{i=1}^n s_i v_i v_i^T) ≥ (1 − √(k/r))^2 and tr(∑_{i=1}^n s_i x_i x_i^T) ≤ ∥X∥_F^2.

The weights s_i can be computed deterministically in O(rnk^2 + nl) time.

Algorithm 2 Deterministic Dual Set Spectral-Frobenius Sparsification Algorithm.
1: Input: U = {x_i}_{i=1}^n ⊂ R^l (l < n); V = {v_i}_{i=1}^n ⊂ R^k with ∑_{i=1}^n v_i v_i^T = I_k (k < n); k < r < n;
2: Initialize: s_0 = 0_{n×1}, A_0 = 0_{k×k};
3: Compute ∥x_i∥_2^2 for i = 1, · · · , n, and then compute δ_U = (∑_{i=1}^n ∥x_i∥_2^2) / (1 − √(k/r));
4: for τ = 0 to r − 1 do
5: Compute the eigenvalue decomposition of A_τ;
6: Find an index j in {1, · · · , n} and compute a weight t > 0 such that

δ_U^{-1} ∥x_j∥_2^2 ≤ t^{-1} ≤ v_j^T (A_τ − (L_τ + 1)I_k)^{-2} v_j / (ϕ(L_τ + 1, A_τ) − ϕ(L_τ, A_τ)) − v_j^T (A_τ − (L_τ + 1)I_k)^{-1} v_j,

where L_τ = τ − √(rk) and ϕ(L, A) = ∑_{i=1}^k (λ_i(A) − L)^{-1};
7: Update the j-th component of s_τ and A_τ: s_{τ+1}[j] = s_τ[j] + t, A_{τ+1} = A_τ + t v_j v_j^T;
8: end for
9: return s = r^{-1}(1 − √(k/r)) s_r.

Here we would like to mention the implementation of Algorithm 2, which is not described in detail in [2]. In each iteration the algorithm performs one eigenvalue decomposition A_τ = WΛW^T (A_τ is guaranteed to be positive semi-definite in each iteration). Since

(A_τ − αI_k)^q = W Diag((λ_1 − α)^q, · · · , (λ_k − α)^q) W^T,

we can efficiently compute (A_τ − (L_τ + 1)I_k)^q from the eigenvalue decomposition of A_τ. With the eigenvalues at hand, ϕ(L, A_τ) can also be computed directly.

Acknowledgments

This work has been supported in part by the Natural Science Foundations of China (No.
61070239), the Google visiting faculty program, and the Scholarship Award for Excellent Doctoral Student granted by the Ministry of Education.

References

[1] Adi Ben-Israel and Thomas N.E. Greville. Generalized Inverses: Theory and Applications. Second Edition. Springer, 2003.
[2] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. CoRR, abs/1103.0995, 2011.
[3] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near optimal column-based matrix reconstruction. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS '11, pages 305–314, 2011.
[4] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[5] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, pages 329–338, 2010.
[6] Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2(2006):225–247, 2006.
[7] Petros Drineas. Pass-efficient algorithms for approximating large matrices. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 223–232, 2003.
[8] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184–206, 2006.
[9] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.
[10] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, September 2008.
[11] A. Frank and A. Asuncion. UCI machine learning repository, 2010.
[12] S. A. Goreinov, E. E. Tyrtyshnikov, and N. L. Zamarashkin. A theory of pseudoskeleton approximations. Linear Algebra and Its Applications, 261:1–21, 1997.
[13] S. A. Goreinov, N. L. Zamarashkin, and E. E. Tyrtyshnikov. Pseudo-skeleton approximations by matrices of maximal volume. Mathematical Notes, 62(4):619–623, 1997.
[14] Venkatesan Guruswami and Ali Kemal Sinop. Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '12, pages 1207–1214. SIAM, 2012.
[15] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[16] John Hopcroft and Ravi Kannan. Computer Science Theory for the Information Age. 2012.
[17] Finny G. Kuruvilla, Peter J. Park, and Stuart L. Schreiber. Vector algebra in the analysis of genome-wide expression data. Genome Biology, 3:research0011–research0011.1, 2002.
[18] Lester Mackey, Ameet Talwalkar, and Michael I. Jordan. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems 24, 2011.
[19] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[20] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3):519–524, March 1987.
[21] Matthew Turk and Alex Pentland. Eigenfaces for recognition.
Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[22] Eugene E. Tyrtyshnikov. Incomplete cross approximation in the mosaic-skeleton method. Computing, 64:367–380, 2000.
", "award": [], "sourceid": 301, "authors": [{"given_name": "Shusen", "family_name": "Wang", "institution": null}, {"given_name": "Zhihua", "family_name": "Zhang", "institution": null}]}