{"title": "CUR from a Sparse Optimization Viewpoint", "book": "Advances in Neural Information Processing Systems", "page_first": 217, "page_last": 225, "abstract": "The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. In particular, we show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.", "full_text": "CUR from a Sparse Optimization Viewpoint\n\nJacob Bien\u2217\n\nYa Xu\u2217\n\nDepartment of Statistics\n\nDepartment of Statistics\n\nStanford University\nStanford, CA 94305\n\nStanford University\nStanford, CA 94305\n\nMichael W. Mahoney\n\nDepartment of Mathematics\n\nStanford University\nStanford, CA 94305\n\njbien@stanford.edu\n\nyax.stanford@gmail.com\n\nmmahoney@cs.stanford.edu\n\nAbstract\n\nThe CUR decomposition provides an approximation of a matrix X that has low\nreconstruction error and that is sparse in the sense that the resulting approximation\nlies in the span of only a few columns of X. In this regard, it appears to be simi-\nlar to many sparse PCA methods. However, CUR takes a randomized algorithmic\napproach, whereas most sparse PCA methods are framed as convex optimization\nproblems. In this paper, we try to understand CUR from a sparse optimization\nviewpoint. 
We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.

1 Introduction

CUR decompositions are a recently-popular class of randomized algorithms that approximate a data matrix X ∈ R^{n×p} by using only a small number of actual columns of X [12, 4]. CUR decompositions are often described as SVD-like low-rank decompositions that have the additional advantage of being easily interpretable to domain scientists. The motivation to produce a more interpretable low-rank decomposition is also shared by sparse PCA (SPCA) methods, which are optimization-based procedures that have been of interest recently in statistics and machine learning.

Although CUR and SPCA methods start with similar motivations, they proceed very differently. For example, most CUR methods have been randomized, and they take a purely algorithmic approach. By contrast, most SPCA methods start with a combinatorial optimization problem, and they then solve a relaxation of this problem. Thus far, it has not been clear to researchers how the CUR and SPCA approaches are related. It is the purpose of this paper to understand CUR decompositions from a sparse optimization viewpoint, thereby elucidating the connection between CUR decompositions and the SPCA class of sparse optimization methods.

To do so, we begin by putting forth a combinatorial optimization problem (see (6) below) which CUR is implicitly approximately optimizing. This formulation will highlight two interesting features of CUR: first, CUR attains a distinctive pattern of sparsity, which has practical implications from the SPCA viewpoint; and second, CUR is implicitly optimizing a regression-type objective. 
These two observations then lead to the three main contributions of this paper: (a) first, we formulate a non-randomized optimization-based version of CUR (see Problem 1: GL-REG in Section 3) that is based on a convex relaxation of the CUR combinatorial optimization problem; (b) second, we show that, in contrast to the original PCA-based motivation for CUR, CUR's implicit objective cannot be directly expressed in terms of a PCA-type objective (see Theorem 3 in Section 4); and (c) third, we propose an SPCA approach (see Problem 2: GL-SPCA in Section 5) that achieves the sparsity structure of CUR within the PCA framework. We also provide a brief empirical evaluation of our two proposed objectives. While our proposed GL-REG and GL-SPCA methods are promising in and of themselves, our purpose in this paper is not to explore them as alternatives to CUR; instead, our goal is to use them to help clarify the connection between CUR and SPCA methods.

*Jacob Bien and Ya Xu contributed equally.

We conclude this introduction with some remarks on notation. Given a matrix A, we use A_(i) to denote its ith row (as a row-vector) and A^(i) its ith column. Similarly, given a set of indices I, A_I and A^I denote the submatrices of A containing only these I rows and columns, respectively. Finally, we let Lcol(A) denote the column space of A.

2 Background

In this section, we provide a brief background on CUR and SPCA methods, with a particular emphasis on topics to which we will return in subsequent sections. Before doing so, recall that, given an input matrix X, Principal Component Analysis (PCA) seeks the k-dimensional hyperplane with the lowest reconstruction error. That is, it computes a p × k orthogonal matrix W that minimizes

ERR(W) = ||X − X W W^T||_F .    (1)

Writing the SVD of X as U Σ V^T, the minimizer of (1) is given by V_k, the first k columns of V. 
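The reconstruction-error characterization in (1) and its SVD minimizer are easy to check numerically. The following is a small numpy sketch of our own (the test matrix and the competing subspace are arbitrary illustrations, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))   # n = 50, p = 8
k = 3

def err(X, W):
    # Reconstruction error (1) for an orthogonal p x k matrix W.
    return np.linalg.norm(X - X @ W @ W.T, "fro")

# Minimizer of (1): V_k, the first k columns of V in the SVD X = U S V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Vk = Vt[:k].T

# err(X, V_k) equals the norm of the trailing singular values...
assert np.isclose(err(X, Vk), np.sqrt(np.sum(s[k:] ** 2)))

# ...and no other orthogonal p x k matrix does better.
Q, _ = np.linalg.qr(rng.standard_normal((8, k)))
assert err(X, Vk) <= err(X, Q)
```
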
In the data analysis setting, each column of V provides a particular linear combination of the columns of X. These linear combinations are often thought of as latent factors. In many applications, interpreting such factors is made much easier if they are comprised of only a small number of actual columns of X, which is equivalent to V_k only having a small number of nonzero elements.

2.1 CUR matrix decompositions

CUR decompositions were proposed by Drineas and Mahoney [12, 4] to provide a low-rank approximation to a data matrix X by using only a small number of actual columns and/or rows of X. Fast randomized variants [3], deterministic variants [5], Nyström-based variants [1, 11], and heuristic variants [17] have also been considered. Observing that the best rank-k approximation to the SVD provides the best set of k linear combinations of all the columns, one can ask for the best set of k actual columns. Most formalizations of "best" lead to intractable combinatorial optimization problems [12], but one can take advantage of oversampling (choosing slightly more than k columns) and randomness as computational resources to obtain strong quality-of-approximation guarantees.

Theorem 1 (Relative-error CUR [12]). Given an arbitrary matrix X ∈ R^{n×p} and an integer k, there exists a randomized algorithm that chooses a random subset I ⊂ {1, . . .
, p} of size c = O(k log k log(1/δ)/ε²) such that X^I, the n × c submatrix containing those c columns of X, satisfies

||X − X^I X^{I+} X||_F = min_{B ∈ R^{c×p}} ||X − X^I B||_F ≤ (1 + ε) ||X − X_k||_F ,    (2)

with probability at least 1 − δ, where X_k is the best rank-k approximation to X.

The algorithm referred to by Theorem 1 is very simple:

1) Compute the normalized statistical leverage scores, defined below in (3).
2) Form I by randomly sampling c columns of X, using these normalized statistical leverage scores as an importance sampling distribution.
3) Return the n × c matrix X^I consisting of these selected columns.

The key issue here is the choice of the importance sampling distribution. Let the p × k matrix V_k be the top-k right singular vectors of X. Then the normalized statistical leverage scores are

π_i = (1/k) ||V_{k(i)}||_2^2 ,    (3)

for all i = 1, . . . , p, where V_{k(i)} denotes the i-th row of V_k. These scores, proportional to the Euclidean norms of the rows of the top-k right singular vectors, define the relevant nonuniformity structure to be used to identify good (in the sense of Theorem 1) columns. In addition, these scores are proportional to the diagonal elements of the projection matrix onto the top-k right singular subspace. Thus, they generalize the so-called hat matrix [8], and they have a natural interpretation as capturing the "statistical leverage" or "influence" of a given column on the best low-rank fit of the data matrix [8, 12].

2.2 Regularized sparse PCA methods

SPCA methods attempt to make PCA easier to interpret for domain experts by finding sparse approximations to the columns of V.¹ There are several variants of SPCA. For example, Jolliffe et al. 
[10] and Witten et al. [19] use the maximum variance interpretation of PCA and provide an optimization problem which explicitly encourages sparsity in V based on a Lasso constraint [18]. d'Aspremont et al. [2] take a similar approach, but instead formulate the problem as an SDP.

¹For SPCA, we only consider sparsity in the right singular vectors V and not in the left singular vectors U. This is similar to considering only the choice of columns and not of both columns and rows in CUR.

Zou et al. [21] use the minimum reconstruction error interpretation of PCA to suggest a different approach to the SPCA problem; this formulation will be most relevant to our present purpose. They begin by formulating PCA as the solution to a regression-type problem.

Theorem 2 (Zou et al. [21]). Given an arbitrary matrix X ∈ R^{n×p} and an integer k, let A and W be p × k matrices. Then, for any λ > 0, let

(A*, V*_k) = argmin_{A,W ∈ R^{p×k}} ||X − X W A^T||_F^2 + λ||W||_F^2   s.t. A^T A = I_k.    (4)

Then, the minimizing matrices A* and V*_k satisfy A*^(i) = s_i V^(i) and V*_k^(i) = s_i (Σ_ii^2 / (Σ_ii^2 + λ)) V^(i), where s_i = 1 or −1.

That is, up to signs, A* consists of the top-k right singular vectors of X, and V*_k consists of those same vectors "shrunk" by a factor depending on the corresponding singular value. Given this regression-type characterization of PCA, Zou et al. [21] then "sparsify" the formulation by adding an L1 penalty on W:

(A*, V*_k) = argmin_{A,W ∈ R^{p×k}} ||X − X W A^T||_F^2 + λ||W||_F^2 + λ1 ||W||_1   s.t. A^T A = I_k,    (5)

where ||W||_1 = Σ_ij |W_ij|. This regularization tends to sparsify W element-wise, so that the solution V*_k gives a sparse approximation of V_k.

3 Expressing CUR as an optimization problem

In this section, we present an optimization formulation of CUR. 
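The shrinkage characterization in Theorem 2 can be verified numerically: holding A fixed at V_k (its optimum up to signs), (4) becomes a ridge regression in W with a closed-form solution, and the columns of that solution come out as the singular vectors shrunk by Σ_ii^2/(Σ_ii^2 + λ). A small numpy sketch of our own (the random test matrix is an illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k, lam = 40, 6, 2, 0.5
X = rng.standard_normal((n, p))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt[:k].T                       # A* from Theorem 2 (taking all s_i = 1)

# With A fixed, minimizing ||X - X W A^T||_F^2 + lam ||W||_F^2 over W is a
# ridge regression with solution W = (X^T X + lam I)^{-1} X^T X A.
W = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ X @ A)

# Theorem 2: column i of W is the singular vector V^(i) shrunk by
# sigma_i^2 / (sigma_i^2 + lam).
shrinkage = s[:k] ** 2 / (s[:k] ** 2 + lam)
assert np.allclose(W, A * shrinkage)
```
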
Recall, from Section 2.1, that CUR takes a purely algorithmic approach to the problem of approximating a matrix in terms of a small number of its columns. That is, it achieves sparsity indirectly by randomly selecting c columns, and it does so in such a way that the reconstruction error is small with high probability (Theorem 1). By contrast, SPCA methods are generally formulated as the exact solution to an optimization problem.

From Theorem 1, it is clear that CUR seeks a subset I of size c for which min_{B ∈ R^{c×p}} ||X − X^I B||_F is small. In this sense, CUR can be viewed as a randomized algorithm for approximately solving the following combinatorial optimization problem:

min_{I ⊂ {1,...,p}} min_{B ∈ R^{c×p}} ||X − X^I B||_F   s.t. |I| ≤ c.    (6)

In words, this objective asks for the subset of c columns of X which best describes the entire matrix X. Notice that relaxing |I| = c to |I| ≤ c does not affect the optimum. This optimization problem is analogous to all-subsets multivariate regression [7], which is known to be NP-hard.

However, by using ideas from the optimization literature we can approximate this combinatorial problem as a regularized regression problem that is convex. First, notice that (6) is equivalent to

min_{B ∈ R^{p×p}} ||X − XB||_F   s.t. Σ_{i=1}^p 1{||B_(i)||_2 ≠ 0} ≤ c,    (7)

where we now optimize over a p × p matrix B. To see the equivalence between (6) and (7), note that the constraint in (7) is the same as finding some subset I with |I| ≤ c such that B_{I^c} = 0.

The formulation in (7) provides a natural entry point to proposing a convex optimization approach corresponding to CUR. First notice that (7) uses an L0 norm on the rows of B, which is not convex. However, we can approximate the L0 constraint by a group lasso penalty, which uses a well-known convex heuristic proposed by Yuan et al. 
[20] that encourages prespecified groups of parameters to be simultaneously sparse. Thus, the combinatorial problem in (6) can be approximated by the following convex (and thus tractable) problem:

Problem 1 (Group lasso regression: GL-REG). Given an arbitrary matrix X ∈ R^{n×p}, let B ∈ R^{p×p} and t > 0. The GL-REG problem is to solve

B* = argmin_B ||X − XB||_F   s.t. Σ_{i=1}^p ||B_(i)||_2 ≤ t,    (8)

where t is chosen to get c nonzero rows in B*.

Since the rows of B are grouped together in the penalty Σ_{i=1}^p ||B_(i)||_2, the row vector B_(i) will tend to be either dense or entirely zero. Note also that the algorithm to solve Problem 1 is a special case of Algorithm 1 (see below), which solves the GL-SPCA problem, to be introduced later. (Finally, as a side remark, note that our proposed GL-REG is strikingly similar to a recently proposed method for sparse inverse covariance estimation [6, 15].)

4 Distinguishing CUR from SPCA

Our original intention in casting CUR in the optimization framework was to understand better whether CUR could be seen as an SPCA-type method. So far, we have established CUR's connection to regression by showing that CUR can be thought of as an approximation algorithm for the sparse regression problem (7). 
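The randomized column selection that CUR uses to approximate (6) (steps 1-3 of the algorithm behind Theorem 1) is short enough to sketch directly. The following numpy illustration is our own; for simplicity it draws exactly c distinct columns, whereas the analyzed algorithm of [12] oversamples with replacement, and the planted "signal" columns are an arbitrary test case:

```python
import numpy as np

def cur_column_select(X, k, c, rng):
    """Sample c columns of X with probabilities given by the
    normalized statistical leverage scores (3)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    pi = np.sum(Vt[:k] ** 2, axis=0) / k      # pi_i = ||V_k(i)||_2^2 / k
    I = np.sort(rng.choice(X.shape[1], size=c, replace=False, p=pi))
    return I, pi

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
X[:, :3] *= 10.0                              # three high-leverage "signal" columns

I, pi = cur_column_select(X, k=3, c=3, rng=rng)
assert np.isclose(pi.sum(), 1.0)              # the scores form a distribution
assert pi[:3].sum() > 0.9                     # leverage concentrates on the signal

# Regression-type reconstruction error min_B ||X - X^I B||_F from (2).
XI = X[:, I]
err = np.linalg.norm(X - XI @ np.linalg.pinv(XI) @ X, "fro")
```
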
In this section, we discuss the relationship between regression and PCA, and we show that CUR cannot be directly cast as an SPCA method.

To do this, recall that regression, in particular "self" regression, finds a B ∈ R^{p×p} that minimizes

||X − XB||_F .    (9)

On the other hand, PCA-type methods find a set of directions W that minimize

ERR(W) := ||X − X W W^+||_F .    (10)

Here, unlike in (1), we do not assume that W is orthogonal, since the minimizer produced from SPCA methods is often not required to be orthogonal (recall Section 2.2).

Clearly, with no constraints on B or W, we can trivially achieve zero reconstruction error in both cases by taking B = I_p and W any p × p full-rank matrix. However, with additional constraints, these two problems can be very different. It is common to consider sparsity and/or rank constraints. We have seen in Section 3 that CUR effectively requires B to be row-sparse; in the standard PCA setting, W is taken to be rank k (with k < p), in which case (10) is minimized by V_k and obtains the optimal value ERR(V_k) = ||X − X_k||_F; finally, for SPCA, W is further required to be sparse.

To illustrate the difference between the reconstruction errors (9) and (10) when extra constraints are imposed, consider the 2-dimensional toy example in Figure 1. In this example, we compare regression with a row-sparsity constraint to PCA with both rank and sparsity constraints. With X ∈ R^{n×2}, we plot X^(2) against X^(1) as the solid points in both plots of Figure 1. Constraining B_(2) = 0 (giving row-sparsity, as with CUR methods), (9) becomes min_{B_12} ||X^(2) − X^(1) B_12||_2, which is a simple linear regression, represented by the black thick line and minimizing the sum of squared vertical errors as shown. The red line (left plot) shows the first principal component direction, which minimizes ERR(W) among all rank-one matrices W. 
Here, ERR(W) is the sum of squared projection distances (red dotted lines). Finally, if W is further required to be sparse in the X^(2) direction (as with SPCA methods), we get the rank-one, sparse projection represented by the green line in Figure 1 (right). The two sets of dotted lines in each plot clearly differ, indicating that their corresponding reconstruction errors are different as well. Since we have shown that CUR is minimizing a regression-based objective, this toy example suggests that CUR may not in fact be optimizing a PCA-type objective such as (10). Next, we will make this intuition more precise.

The first step to showing that CUR is an SPCA method would be to produce a matrix V_CUR for which X^I X^{I+} X = X V_CUR V_CUR^+, i.e. to express CUR's approximation in the form of an SPCA approximation. However, this equality implies Lcol(X V_CUR V_CUR^+) ⊆ Lcol(X^I), meaning that (V_CUR)_{I^c} = 0. If such a V_CUR existed, then clearly ERR(V_CUR) = ||X − X^I X^{I+} X||_F, and so CUR could be regarded as implicitly performing sparse PCA in the sense that (a) V_CUR is sparse; and (b) by Theorem 1 (with high probability), ERR(V_CUR) ≤ (1 + ε) ERR(V_k). Thus, the existence of such a V_CUR would cast CUR directly as a randomized approximation algorithm for SPCA. However, the following theorem states that unless an unrealistic constraint on X holds, there does not exist a matrix V_CUR for which ERR(V_CUR) = ||X − X^I X^{I+} X||_F. The larger implication of this theorem is that CUR cannot be directly viewed as an SPCA-type method.

Theorem 3. Let I ⊂ {1, . . . , p} be an index set and suppose W ∈ R^{p×p} satisfies W_{I^c} = 0. 
Then,

||X − X W W^+||_F > ||X − X^I X^{I+} X||_F ,

unless Lcol(X^I) ⊥ Lcol(X^{I^c}), in which case "≥" holds.

[Figure 1 appears here: two scatter plots of X^(2) against X^(1), each showing the regression error (9) and the PCA/SPCA error (10) as dotted lines.]

Figure 1: Example of the difference in reconstruction errors (9) and (10), when additional constraints are imposed. Left: regression with row-sparsity constraint (black) compared with PCA with low rank constraint (red). Right: regression with row-sparsity constraint (black) compared with PCA with low rank and sparsity constraint (green). In both plots, the corresponding errors are represented by the dotted lines.

Proof.

||X − X W W^+||_F^2 = ||X − X^I W_I W^+||_F^2 = ||X − X^I W_I (W_I^T W_I)^{-1} W^T||_F^2
= ||X^I − X^I W_I W_I^+||_F^2 + ||X^{I^c}||_F^2
≥ ||X^{I^c}||_F^2
= ||X^{I^c} − X^I X^{I+} X^{I^c}||_F^2 + ||X^I X^{I+} X^{I^c}||_F^2
= ||X − X^I X^{I+} X||_F^2 + ||X^I X^{I+} X^{I^c}||_F^2
≥ ||X − X^I X^{I+} X||_F^2 .

The last inequality is strict unless X^I X^{I+} X^{I^c} = 0.

5 CUR-type sparsity and the group lasso SPCA

Although CUR cannot be directly cast as an SPCA-type method, in this section we propose a sparse PCA approach (which we call the group lasso SPCA or GL-SPCA) that accomplishes something very close to CUR. Our proposal produces a V* that has rows that are entirely zero, and it is motivated by the following two observations about CUR. First, following from the definition of the leverage scores (3), CUR chooses columns of X based on the norm of their corresponding rows of V_k. Thus, it essentially "zeros-out" the rows of V_k with small norms (in a probabilistic sense). Second, as we have noted in Section 4, if CUR could be expressed as a PCA method, its principal directions matrix "V_CUR" would have p − c rows that are entirely zero, corresponding to removing those columns of X.

Recall that Zou et al. 
[21] obtain a sparse V* by including in (5) an additional L1 penalty from the optimization problem (4). Since the L1 penalty is on the entire matrix viewed as a vector, it encourages only unstructured sparsity. To achieve the CUR-type row sparsity, we propose the following modification of (4):

Problem 2 (Group lasso SPCA: GL-SPCA). Given an arbitrary matrix X ∈ R^{n×p} and an integer k, let A and W be p × k matrices, and let λ, λ1 > 0. The GL-SPCA problem is to solve

(A*, V*) = argmin_{A,W} ||X − X W A^T||_F^2 + λ||W||_F^2 + λ1 Σ_{i=1}^p ||W_(i)||_2   s.t. A^T A = I_k.    (11)

Thus, the lasso penalty λ1 ||W||_1 in (5) is replaced in (11) by a group lasso penalty λ1 Σ_{i=1}^p ||W_(i)||_2, where rows of W are grouped together so that each row of V* will tend to be either dense or entirely zero.

Importantly, the GL-SPCA problem is not convex in W and A together; it is, however, convex in W, and it is easy to solve in A. Thus, analogous to the treatment in Zou et al. [21], we propose an iterative alternate-minimization algorithm to solve GL-SPCA. This is described in Algorithm 1; and the justification of this algorithm is given in Section 7. 
Note that if we fix A to be I throughout, then Algorithm 1 can be used to solve the GL-REG problem discussed in Section 3.

Algorithm 1: Iterative algorithm for solving the GL-SPCA (and GL-REG) problems.
(For the GL-REG problem, fix A = I throughout this algorithm.)

Input: Data matrix X and initial estimates for A and W
Output: Final estimates for A and W
repeat
  1:  Compute the SVD of X^T X W as U D V^T and then A ← U V^T;
  2:  S ← {i : ||W_(i)||_2 ≠ 0};
      for i ∈ S do
        Compute b_i = Σ_{j≠i} (X^(j)T X^(i)) W_(j)^T;
  3:     if ||A^T X^T X^(i) − b_i||_2 ≤ λ1/2 then W_(i)^T ← 0;
  4:     else W_(i)^T ← (2 / (2||X^(i)||_2^2 + 2λ + λ1/||W_(i)||_2)) (A^T X^T X^(i) − b_i);
until convergence;

We remark that such row-sparsity in V* can have either advantages or disadvantages. Consider, for example, when there are a small number of informative columns in X and the rest are not important for the task at hand [12, 14]. In such a case, we would expect that enforcing entire rows to be zero would lead to better identification of the signal columns; and this has been empirically observed in the application of CUR to DNA SNP analysis [14]. The unstructured V*, by contrast, would not be able to "borrow strength" across all columns of V* to differentiate the signal columns from the noise columns. On the other hand, requiring such structured sparsity is more restrictive and may not be desirable. For example, in microarray analysis in which we have measured p genes on n patients, our goal may be to find several underlying factors. Biologists have identified "pathways" of interconnected genes [16], and it would be desirable if each sparse factor could be identified with a different pathway (that is, a different set of genes). 
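Algorithm 1 is straightforward to prototype. Below is a minimal numpy sketch of our own (an illustration, not the authors' code); it follows the alternating structure above, cycling over the currently nonzero rows of W with Gauss-Seidel-style updates:

```python
import numpy as np

def gl_spca(X, k, lam, lam1, n_iter=100, seed=0):
    """Minimal sketch of Algorithm 1 (alternating minimization for (11)).
    Fixing A = I instead of performing step 1 would give a GL-REG-style
    solver. lam, lam1 are the lambda, lambda_1 penalties of (11)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((p, k))        # dense start: all rows active
    A = np.linalg.qr(rng.standard_normal((p, k)))[0]
    G = X.T @ X                                  # Gram matrix X^T X, reused below
    for _ in range(n_iter):
        # Step 1: A <- U V^T from the SVD of X^T X W (a Procrustes update).
        U, _, Vt = np.linalg.svd(G @ W, full_matrices=False)
        A = U @ Vt
        # Steps 2-4: update the currently nonzero rows of W one at a time.
        S = [i for i in range(p) if np.linalg.norm(W[i]) != 0]
        for i in S:
            b_i = W.T @ G[:, i] - G[i, i] * W[i]   # sum_{j != i} (X^(j)T X^(i)) W_(j)^T
            r = A.T @ G[:, i] - b_i
            if np.linalg.norm(r) <= lam1 / 2:      # step 3 (Claim 1): zero out row i
                W[i] = 0.0
            else:                                  # step 4: fixed-point row update
                W[i] = 2 * r / (2 * G[i, i] + 2 * lam + lam1 / np.linalg.norm(W[i]))
    return A, W
```

For example, with a large λ1 every row of W is zeroed out, while a small λ1 leaves the factors dense; A stays orthonormal either way because step 1 always returns a product of orthonormal factors.
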
Requiring all factors of V* to exclude the same p − c genes does not allow a different sparse subset of genes to be active in each factor.

We finish this section by pointing out that while most SPCA methods only enforce unstructured zeros in V*, the idea of having a structured sparsity in the PCA context has very recently been explored [9]. Our GL-SPCA problem falls within the broad framework of this idea.

6 Empirical Comparisons

In this section, we evaluate the performance of the four methods discussed above on both synthetic and real data. In particular, we compare the randomized CUR algorithm of Mahoney and Drineas [12, 4] to our GL-REG (of Problem 1), and we compare the SPCA algorithm proposed by Zou et al. [21] to our GL-SPCA (of Problem 2). We have also compared against the SPCA algorithm of Witten et al. [19], and we found the results to be very similar to those of Zou et al.

6.1 Simulations

We first consider synthetic examples of the form X = X̂ + E, where X̂ is the underlying signal matrix and E is a matrix of noise. In all our simulations, E has i.i.d. N(0, 1) entries, while the signal X̂ has one of the following forms:

Case I) X̂ = [0_{n×(p−c)}; X̂*], where the n × c matrix X̂* is the nonzero part of X̂. In other words, X̂ has c nonzero columns and does not necessarily have a low-rank structure.

Case II) X̂ = UV^T, where U and V each consist of k < p orthogonal columns. In addition to being low-rank, V has entire rows equal to zero (i.e. it is row-sparse).

Case III) X̂ = UV^T, where U and V each consist of k < p orthogonal columns. Here V is low-rank and sparse, but the sparsity is not structured (i.e. it is scattered-sparse).

A successful method attains low reconstruction error of the true signal X̂ and has high precision in identifying correctly the zeros in the underlying model. 
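One way to instantiate the three cases is sketched below. This is our own illustration: the paper does not spell out the scaling of the signal or how the nonzero pattern of Case III is drawn (and masking an orthogonal V only approximately preserves its orthogonality), so those choices are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k, c = 100, 1000, 10, 200        # 20% sparsity level: c = 0.2 * p

def orth_cols(m, r):
    # m x r matrix with orthonormal columns
    return np.linalg.qr(rng.standard_normal((m, r)))[0]

# Case I: c nonzero columns, no low-rank structure imposed.
Xhat1 = np.hstack([np.zeros((n, p - c)), rng.standard_normal((n, c))])

# Case II: low rank with a row-sparse V (the CUR-like structured sparsity).
U = orth_cols(n, k)
V2 = np.zeros((p, k))
V2[:c] = orth_cols(c, k)               # only c rows of V are nonzero
Xhat2 = U @ V2.T

# Case III: low rank with scattered (unstructured) sparsity in V.
V3 = orth_cols(p, k) * (rng.random((p, k)) < 0.2)   # keep ~20% of entries
Xhat3 = U @ V3.T

# Observed data: signal plus i.i.d. N(0, 1) noise, e.g. for Case II:
X = Xhat2 + rng.standard_normal((n, p))
```
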
As previously discussed, the four methods optimize for different types of reconstruction error. Thus, in comparing CUR and GL-REG, we use the regression-type reconstruction error ERRreg(I) = ||X̂ − X^I X^{I+} X||_F, whereas for the comparison of SPCA and GL-SPCA, we use the PCA-type error ERR(V) = ||X̂ − X V V^+||_F.

Table 1 presents the simulation results from the three cases. All comparisons use n = 100 and p = 1000. In Cases II and III, the signal matrix has rank k = 10. The underlying sparsity level is 20%, i.e. 80% of the entries of X̂ (Case I) and V (Cases II & III) are zeros. Note that all methods except for GL-REG require the rank k as an input, and we always take it to be 10, even in Case I. For easy comparison, we have tuned each method to have the correct total number of zeros. The results are averaged over 5 trials.

Table 1: Simulation results: the reconstruction errors and the percentages of correctly identified zeros (in parentheses).

                          Case I            Case II           Case III
ERRreg(I)   CUR           316.29 (0.835)    315.28 (0.797)    315.64 (0.166)
            GL-REG        316.29 (0.989)    315.28 (0.750)    315.64 (0.107)
ERR(V)      SPCA          177.92 (0.809)    44.388 (0.799)    44.995 (0.792)
            GL-SPCA       141.85 (0.998)    37.310 (0.767)    45.500 (0.804)

We notice in Table 1 that the two regression-type methods CUR and GL-REG have very similar performance. As we would expect, since CUR only uses information in the top k singular vectors, it does slightly worse than GL-REG in terms of precision when the underlying signal is not low-rank (Case I). In addition, both methods perform poorly if the sparsity is not structured, as in Case III. The two PCA-type methods perform similarly as well. Again, the group lasso method seems to work better in Case I. 
We note that the precisions reported here are based on element-wise sparsity; if we were measuring row-sparsity, methods like SPCA would perform poorly since they do not encourage entire rows to be zero.

6.2 Microarray example

We next consider a microarray dataset of soft tissue tumors studied by Nielsen et al. [13]. Mahoney and Drineas [12] apply CUR to this dataset of n = 31 tissue samples and p = 5520 genes. As with the simulation results, we use two sets of comparisons: we compare CUR with GL-REG, and we compare SPCA with GL-SPCA. Since we do not observe the underlying truth X̂, we take ERRreg(I) = ||X − X^I X^{I+} X||_F and ERR(V) = ||X − X V V^+||_F. Also, since we do not observe the true sparsity, we cannot measure the precision as we do in Table 1. The left plot in Figure 2 shows ERRreg(I) as a function of |I|. We see that CUR and GL-REG perform similarly. (However, since CUR is a randomized algorithm, on every run it gives a different result. From a practical standpoint, this feature of CUR can be disconcerting to biologists wanting to report a single set of important genes. In this light, GL-REG may be thought of as an attractive non-randomized alternative to CUR.) The right plot of Figure 2 compares GL-SPCA to SPCA (specifically, Zou et al. [21]). Since SPCA does not explicitly enforce row-sparsity, for a gene to be not used in the model requires all of the (k = 4) columns of V* to exclude it. This likely explains the advantage of GL-SPCA over SPCA seen in the figure.

7 Justification of Algorithm 1

The algorithm alternates between minimizing with respect to A and B until convergence.

Solving for A given B: If B is fixed, then the regularization penalty in (11) can be ignored, in which case the optimization problem becomes min_A ||X − X B A^T||_F^2 subject to A^T A = I. This problem was considered by Zou et al. 
[21], who showed that the solution is obtained by computing the SVD of (X^T X)B as (X^T X)B = UDV^T and then setting Â = UV^T. This explains step 1 in Algorithm 1.

[Figure 2 appears here: for the microarray dataset, plots of ERRreg(I) (left, CUR over multiple runs vs. GL-REG) and ERR(V) (right, SPCA vs. GL-SPCA) against the number of genes used.]

Figure 2: Left: Comparison of CUR, multiple runs, with GL-REG; Right: Comparison of GL-SPCA with SPCA (specifically, Zou et al. [21]).

Solving for B given A: If A is fixed, then (11) becomes an unconstrained convex optimization problem in B. The subgradient equations (using that A^T A = I_k) are

2 B^T X^T X^(i) − 2 A^T X^T X^(i) + 2λ B_(i)^T + λ1 s_i = 0,   i = 1, . . . , p,    (12)

where the subgradient vectors s_i = B_(i)^T / ||B_(i)||_2 if B_(i) ≠ 0, or ||s_i||_2 ≤ 1 if B_(i) = 0. Let us define b_i = Σ_{j≠i} (X^(j)T X^(i)) B_(j)^T = B^T X^T X^(i) − ||X^(i)||_2^2 B_(i)^T, so that the subgradient equations can be written as

b_i + (||X^(i)||_2^2 + λ) B_(i)^T − A^T X^T X^(i) + (λ1/2) s_i = 0.    (13)

The following claim explains Step 3 in Algorithm 1.

Claim 1. B_(i) = 0 if and only if ||A^T X^T X^(i) − b_i||_2 ≤ λ1/2.

Proof. First, if B_(i) = 0, the subgradient equations (13) become b_i − A^T X^T X^(i) + (λ1/2) s_i = 0. Since ||s_i||_2 ≤ 1 if B_(i) = 0, we have ||A^T X^T X^(i) − b_i||_2 ≤ λ1/2. To prove the other direction, recall that B_(i) ≠ 0 implies s_i = B_(i)^T / ||B_(i)||_2. Substituting this expression into (13), rearranging terms, and taking the norm on both sides, we get 2||A^T X^T X^(i) − b_i||_2 = (2||X^(i)||_2^2 + 2λ + λ1/||B_(i)||_2) ||B_(i)||_2 > λ1.

By Claim 1, ||A^T X^T X^(i) − b_i||_2 > λ1/2 implies that B_(i) ≠ 0, which further implies s_i = B_(i)^T / ||B_(i)||_2. Substituting into (13) gives Step 4 in Algorithm 1.

8 Conclusion

In this paper, we have elucidated several connections between two recently-popular matrix decomposition methods that adopt very different perspectives on obtaining interpretable low-rank matrix decompositions. In doing so, we have suggested two optimization problems, GL-REG and GL-SPCA, that highlight similarities and differences between the two methods. In general, SPCA methods obtain interpretability by modifying an existing intractable objective with a convex regularization term that encourages sparsity, and then exactly optimizing that modified objective. On the other hand, CUR methods operate by using randomness and approximation as computational resources to optimize approximately an intractable objective, thereby implicitly incorporating a form of regularization into the steps of the approximation algorithm. Understanding this concept of implicit regularization via approximate computation is clearly of interest more generally, in particular for applications where the size scale of the data is expected to increase.

Acknowledgments

We would like to thank Art Owen and Robert Tibshirani for encouragement and helpful suggestions. Jacob Bien was supported by the Urbanek Family Stanford Graduate Fellowship, and Ya Xu was supported by the Melvin and Joan Lane Stanford Graduate Fellowship. In addition, support from the NSF and AFOSR is gratefully acknowledged.

References

[1] M.-A. Belabbas and P.J. Wolfe. 
Fast low-rank approximation for covariance matrices. In Second IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, pages 293–296, 2007.\n\n[2] A. d'Aspremont, L. El Ghaoui, M.I. Jordan, and G.R.G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448, 2007.\n\n[3] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36:184–206, 2006.\n\n[4] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844–881, 2008.\n\n[5] S.A. Goreinov and E.E. Tyrtyshnikov. The maximum-volume concept in approximation by low-rank matrices. Contemporary Mathematics, 280:47–51, 2001.\n\n[6] T. Hastie, R. Tibshirani, and J. Friedman. Applications of the lasso and grouped lasso to the estimation of sparse graphical models. Manuscript. Submitted, 2010.\n\n[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2003.\n\n[8] D.C. Hoaglin and R.E. Welsch. The hat matrix in regression and ANOVA. The American Statistician, 32(1):17–22, 1978.\n\n[9] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report. Preprint: arXiv:0909.1440, 2009.\n\n[10] I.T. Jolliffe, N.T. Trendafilov, and M. Uddin. A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.\n\n[11] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström method. In Annual Advances in Neural Information Processing Systems 22: Proceedings of the 2009 Conference, 2009.\n\n[12] M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA, 106:697–702, 2009.\n\n[13] T. Nielsen, R.B. West, S.C. Linn, O. Alter, M.A. Knowling, J. O'Connell, S. Zhu, M. Fero, G. Sherlock, J.R. Pollack, P.O. Brown, D. Botstein, and M. van de Rijn. Molecular characterisation of soft tissue tumours: a gene expression study. Lancet, 359(9314):1301–1307, 2002.\n\n[14] P. Paschou, E. Ziv, E.G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M.W. Mahoney, and P. Drineas. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3:1672–1686, 2007.\n\n[15] J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104:735–746, 2009.\n\n[16] A. Subramanian, P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, and J.P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102(43):15545–15550, 2005.\n\n[17] J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: Compact matrix decomposition for large sparse graphs. In Proceedings of the 7th SIAM International Conference on Data Mining, 2007.\n\n[18] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267–288, 1996.\n\n[19] D.M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.\n\n[20] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1):49–67, 2006.\n\n[21] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):262–286, 2006.", "award": [], "sourceid": 1191, "authors": [{"given_name": "Jacob", "family_name": "Bien", "institution": null}, {"given_name": "Ya", "family_name": "Xu", "institution": null}, {"given_name": "Michael", "family_name": "Mahoney", "institution": null}]}