{"title": "Sparse PCA with Oracle Property", "book": "Advances in Neural Information Processing Systems", "page_first": 1529, "page_last": 1537, "abstract": "In this paper, we study the estimation of the $k$-dimensional sparse principal subspace of covariance matrix $\\Sigma$ in the high-dimensional setting. We aim to recover the oracle principal subspace solution, i.e., the principal subspace estimator obtained assuming the true support is known a priori. To this end, we propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations. In particular, under a weak assumption on the magnitude of the population projection matrix, one estimator within this family exactly recovers the true support with high probability, has exact rank-$k$, and attains a $\\sqrt{s/n}$ statistical rate of convergence with $s$ being the subspace sparsity level and $n$ the sample size. Compared to existing support recovery results for sparse PCA, our approach does not hinge on the spiked covariance model or the limited correlation condition. As a complement to the first estimator that enjoys the oracle property, we prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA, even when the previous assumption on the magnitude of the projection matrix is violated.
We validate the theoretical results by numerical experiments on synthetic datasets.", "full_text": "Sparse PCA with Oracle Property

Quanquan Gu
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ 08544, USA
qgu@princeton.edu

Zhaoran Wang
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ 08544, USA
zhaoran@princeton.edu

Han Liu
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ 08544, USA
hanliu@princeton.edu

Abstract

In this paper, we study the estimation of the k-dimensional sparse principal subspace of a covariance matrix Σ in the high-dimensional setting. We aim to recover the oracle principal subspace solution, i.e., the principal subspace estimator obtained assuming the true support is known a priori. To this end, we propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations. In particular, under a weak assumption on the magnitude of the population projection matrix, one estimator within this family exactly recovers the true support with high probability, has exact rank k, and attains a √(s/n) statistical rate of convergence, where s is the subspace sparsity level and n the sample size. Compared to existing support recovery results for sparse PCA, our approach does not hinge on the spiked covariance model or the limited correlation condition. As a complement to the first estimator, which enjoys the oracle property, we prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA, even when the previous assumption on the magnitude of the projection matrix is violated.
We validate the theoretical results by numerical experiments on synthetic datasets.

1 Introduction

Principal Component Analysis (PCA) aims at recovering the top k leading eigenvectors u1, . . . , uk of the covariance matrix Σ from the sample covariance matrix Σ̂. In applications where the dimension p is much larger than the sample size n, classical PCA can be inconsistent [12]. To avoid this problem, one common assumption is that the leading eigenvector u1 of the population covariance matrix Σ is sparse, i.e., the number of its nonzero elements is less than the sample size, s = |supp(u1)| < n. This gives rise to Sparse Principal Component Analysis (sparse PCA). In the past decade, significant progress has been made toward the methodological development [13, 8, 30, 22, 7, 14, 28, 19, 27] as well as the theoretical understanding [12, 20, 1, 24, 21, 4, 6, 3, 2, 18, 15] of sparse PCA.

However, all the above studies focus on estimating the leading eigenvector u1. When the top k eigenvalues of Σ are not distinct, there exist multiple groups of leading eigenvectors that are equivalent up to rotation. To address this problem, it is reasonable to de-emphasize eigenvectors and to instead focus on their span U, i.e., the principal subspace of variation. [23, 25, 16, 27] proposed Subspace Sparsity, which defines sparsity on the projection matrix onto the subspace U, i.e., Π* = UU^⊤, as the number of nonzero entries on the diagonal of Π*, i.e., s = |supp(diag(Π*))|. They proposed to estimate the principal subspace, instead of the principal eigenvectors, of Σ, based on ℓ1,1-norm regularization over a convex set called the Fantope [9], which provides a tight relaxation of the simultaneous rank and orthogonality constraints on the positive semidefinite cone.
The convergence rate of their estimator is O(λ1/(λk − λk+1) · s√(log p/n)), where λk, k = 1, . . . , p, is the k-th largest eigenvalue of Σ. Moreover, their support recovery relies on the limited correlation condition (LCC) [16], which is similar to the irrepresentable condition in sparse linear regression. We note that [1] also analyzed the semidefinite relaxation of sparse PCA. However, they only considered the rank-1 principal subspace and the stringent spiked covariance model, where the population covariance matrix is block diagonal.

In this paper, we aim to recover the oracle principal subspace solution, i.e., the principal subspace estimator obtained assuming the true support is known a priori. Based on recent progress made on penalized M-estimators with nonconvex penalty functions [17, 26], we propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations. It estimates the k-dimensional principal subspace of a population matrix Σ based on its empirical version Σ̂. In particular, under a weak assumption on the magnitude of the projection matrix, i.e.,

    min_{(i,j)∈T} |Π*_ij| ≥ ν + C√k λ1/(λk − λk+1) · √(s/n),

where T is the support of Π*, ν is a parameter of the nonconvex penalty, and C is a universal constant, one estimator within this family exactly recovers the oracle solution with high probability, and is exactly of rank k. It is worth noting that, unlike the linear regression setting, where estimators that recover the oracle solution often have nonconvex formulations, our estimator here is obtained from a convex optimization problem¹ and has a unique global solution. Compared to existing support recovery results for sparse PCA, our approach does not hinge on the spiked covariance model [1] or the limited correlation condition [16].
Moreover, it attains the same convergence rate as standard PCA run as if the support of the true projection matrix were known a priori. More specifically, the Frobenius norm error of the estimator Π̂ is bounded with high probability as follows:

    ‖Π̂ − Π*‖_F ≤ Cλ1/(λk − λk+1) · √(ks/n),

where k is the dimension of the subspace.

As a complement to the first estimator, which enjoys the oracle property, we prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA, even when the previous assumption on the magnitude of the projection matrix is violated. This estimator is based on nonconvex optimization. With a suitable choice of the regularization parameter, we show that any local optimum of the optimization problem is a good estimator of the projection matrix of the true principal subspace. In particular, we show that the Frobenius norm error of the estimator Π̂ is bounded with high probability as

    ‖Π̂ − Π*‖_F ≤ Cλ1/(λk − λk+1) · ( √(s1·s/n) + √(m1·m2·log p/n) ),

where s1, m1, m2 are all no larger than s. Evidently, this is sharper than the convergence rate proved in [23]. Note that the above rate consists of two terms: the O(√(s1·s/n)) term corresponds to the entries of the projection matrix satisfying the previous magnitude assumption (i.e., with large magnitude), while the O(√(m1·m2·log p/n)) term corresponds to the entries violating it (i.e., with small magnitude).

Finally, we present numerical experiments on synthetic datasets, which support our theoretical analysis.

The rest of this paper is arranged as follows. Section 2 introduces two estimators for the principal subspace of a covariance matrix. Section 3 analyzes the statistical properties of the two estimators. We present an algorithm for solving the estimators in Section 4. Section 5 shows the experiments on synthetic datasets. Section 6 concludes this work with remarks.

¹ Even though we use a nonconvex penalty, the resulting problem as a whole is still a convex optimization problem, because we add another strongly convex term to the regularization part, i.e., τ/2‖Π‖²_F.

Notation. Let [p] be shorthand for {1, . . . , p}. For matrices A, B of compatible dimension, ⟨A, B⟩ := tr(A^⊤B) is the Frobenius inner product, and ‖A‖_F = √⟨A, A⟩ is the Frobenius norm. ‖x‖_q is the usual ℓq norm, with ‖x‖_0 defined as the number of nonzero entries of x. ‖A‖_{a,b} is the (a, b)-norm, defined as the ℓb norm of the vector of row-wise ℓa norms of A; e.g., ‖A‖_{1,∞} is the maximum absolute row sum. ‖A‖_2 is the spectral norm of A, and ‖A‖_* is the trace norm (nuclear norm) of A. For a symmetric matrix A, we define λ1(A) ≥ λ2(A) ≥ . . . ≥ λp(A) to be the eigenvalues of A with multiplicity. When the context is clear, we write λj = λj(A) as shorthand.

2 The Proposed Estimators

In this section, we present a family of estimators, based on the semidefinite relaxation of sparse PCA with novel regularizations, for the principal subspace of the population covariance matrix.
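As a quick sanity check of the (a, b)-norm notation, here is a small numpy sketch (the helper name `ab_norm` is ours, not from the paper):

```python
import numpy as np

def ab_norm(A, a, b):
    """(a, b)-norm: the l_b norm of the vector of row-wise l_a norms of A."""
    return np.linalg.norm(np.linalg.norm(A, ord=a, axis=1), ord=b)

A = np.array([[1.0, -2.0],
              [3.0,  4.0],
              [0.0,  0.0]])

# ||A||_{1,inf}: maximum absolute row sum (rows sum to 3, 7, 0).
assert ab_norm(A, 1, np.inf) == 7.0

# ||A||_{2,0}: number of rows with nonzero l_2 norm, the quantity bounded
# by subspace sparsity ||U||_{2,0} <= s.
assert ab_norm(A, 2, 0) == 2.0
```

Note that numpy's vector norm with `ord=0` counts nonzero entries, which matches the ‖·‖_0 convention above.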
Before going into the details of the proposed estimators, we first give the formal definition of principal subspace estimation.

2.1 Problem Definition

Let Σ ∈ R^{p×p} be an unknown covariance matrix, with eigen-decomposition

    Σ = Σ_{i=1}^p λi · ui ui^⊤,

where λ1 ≥ . . . ≥ λp are the eigenvalues (with multiplicity) and u1, . . . , up ∈ R^p are the associated eigenvectors. The k-dimensional principal subspace of Σ is the subspace spanned by u1, . . . , uk. The projection matrix onto the k-dimensional principal subspace is

    Π* = Σ_{i=1}^k ui ui^⊤ = UU^⊤,

where U = [u1, . . . , uk] is an orthonormal matrix. The reason the principal subspace is more appealing is that it avoids the unidentifiability of eigenvectors when the eigenvalues are not distinct. In fact, we only need to assume λk − λk+1 > 0 instead of λ1 > . . . > λk > λk+1. Then the principal subspace Π* is unique and identifiable. We also assume that k is fixed.

Next, we introduce the definition of Subspace Sparsity [25], which can be seen as an extension of the conventional eigenvector sparsity used in sparse PCA.

Definition 1 ([25], Subspace Sparsity). The projection Π* onto the subspace spanned by the eigenvectors of Σ corresponding to its k largest eigenvalues satisfies ‖U‖_{2,0} ≤ s, or equivalently ‖diag(Π*)‖_0 ≤ s.

In the extreme case k = 1, the support of the projection matrix onto the rank-1 principal subspace is the same as the support of the sparse leading eigenvector.

The problem of principal subspace estimation is then: given an i.i.d. sample {x1, x2, . . . , xn} ⊂ R^p drawn from an unknown distribution with zero mean and covariance matrix Σ, estimate Π* based on the sample covariance matrix Σ̂ ∈ R^{p×p}, given by Σ̂ = 1/n · Σ_{i=1}^n xi xi^⊤. We are particularly interested in the high-dimensional setting, where p → ∞ as n → ∞, in sharp contrast to the conventional setting, where p is fixed and n → ∞.

Now we are ready to design a family of estimators for Π*.

2.2 A Family of Sparse PCA Estimators

Given a sample covariance matrix Σ̂ ∈ R^{p×p}, we propose a family of sparse principal subspace estimators Π̂, defined as solutions of the semidefinite relaxation of sparse PCA

    Π̂_τ = argmin_Π −⟨Σ̂, Π⟩ + τ/2 · ‖Π‖²_F + P_λ(Π),  subject to Π ∈ F^k,   (1)

where τ ≥ 0, λ > 0 is a regularization parameter, and F^k is a convex body called the Fantope [9, 23], defined as

    F^k = { X : 0 ⪯ X ⪯ I and tr(X) = k },

and P_λ(Π) is a decomposable nonconvex penalty, i.e., P_λ(Π) = Σ_{i,j=1}^p p_λ(Π_ij). Typical nonconvex penalties include the smoothly clipped absolute deviation (SCAD) penalty [10] and the minimax concave penalty (MCP) [29], which eliminate the estimation bias and attain more refined statistical rates of convergence [17, 26].
For example, the MCP penalty is defined as

    p_λ(t) = λ ∫_0^{|t|} (1 − z/(bλ))_+ dz = (λ|t| − t²/(2b)) · 1(|t| ≤ bλ) + (bλ²/2) · 1(|t| > bλ),   (2)

where b > 0 is a fixed parameter.

An important property of the nonconvex penalties p_λ(t) is that they can be written as the sum of the ℓ1 penalty and a concave part q_λ(t): p_λ(t) = λ|t| + q_λ(t). For example, if p_λ(t) is the MCP penalty, then the corresponding q_λ(t) is

    q_λ(t) = −t²/(2b) · 1(|t| ≤ bλ) + (bλ²/2 − λ|t|) · 1(|t| > bλ).

We rely on the following regularity conditions on p_λ(t) and its concave component q_λ(t):

(a) p_λ(t) satisfies p′_λ(t) = 0 for |t| ≥ ν > 0.

(b) q′_λ(t) is monotone and Lipschitz continuous, i.e., for t′ ≥ t, there exists a constant ζ− ≥ 0 such that −ζ− ≤ (q′_λ(t′) − q′_λ(t)) / (t′ − t).

(c) q_λ(t) and q′_λ(t) pass through the origin, i.e., q_λ(0) = q′_λ(0) = 0.

(d) q′_λ(t) is bounded, i.e., |q′_λ(t)| ≤ λ for any t.

The above conditions hold for a variety of nonconvex penalty functions. For example, for the MCP in (2), we have ν = bλ and ζ− = 1/b.

It is easy to show that when τ > ζ−, the problem in (1) is strongly convex, and therefore its solution is unique. We note that [16] also introduced the same regularization term τ/2‖Π‖²_F in their estimator. However, our motivation is quite different from theirs.
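To make the decomposition p_λ(t) = λ|t| + q_λ(t) concrete, here is a small numpy sketch of the MCP in (2) and its concave part (the function names are ours):

```python
import numpy as np

def p_mcp(t, lam, b):
    """MCP penalty from (2): quadratic up to |t| = b*lam, then constant."""
    a = np.abs(t)
    return np.where(a <= b * lam, lam * a - a**2 / (2 * b), b * lam**2 / 2)

def q_mcp(t, lam, b):
    """Concave part, so that p_mcp(t) = lam*|t| + q_mcp(t)."""
    a = np.abs(t)
    return np.where(a <= b * lam, -a**2 / (2 * b), b * lam**2 / 2 - lam * a)

lam, b = 0.5, 3.0
t = np.linspace(-3, 3, 101)

# the l1 + concave decomposition holds everywhere
assert np.allclose(p_mcp(t, lam, b), lam * np.abs(t) + q_mcp(t, lam, b))

# condition (a): the penalty is flat (p' = 0) for |t| >= nu = b*lam
assert p_mcp(2.0, lam, b) == p_mcp(3.0, lam, b)
```

The flat region beyond ν = bλ is what removes the estimation bias that the ℓ1 penalty would otherwise introduce on large entries.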
We introduce this term because it is essential for the estimator in (1) to achieve the oracle property, provided that the magnitude of all the entries in the population projection matrix is sufficiently large. We call (1) the Convex Sparse PCA Estimator.

Note that the constraint Π ∈ F^k only guarantees that the rank of Π̂ is ≥ k. However, we can prove that our estimator is exactly of rank k. This is in contrast to [23], where a post-projection step is needed to ensure that their estimator is of rank k.

2.3 Nonconvex Sparse PCA Estimator

In the case that the magnitude of the entries in the population projection matrix Π* violates the previous assumption, (1) with τ > ζ− no longer enjoys the desired oracle property. To this end, we consider another estimator from the family in (1), with τ = 0:

    Π̂_{τ=0} = argmin_Π −⟨Σ̂, Π⟩ + P_λ(Π),  subject to Π ∈ F^k.   (3)

Since −⟨Σ̂, Π⟩ is an affine function and P_λ(Π) is nonconvex, the problem in (3) is nonconvex. We simply refer to its solution as the Nonconvex Sparse PCA Estimator. We will prove that it achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA [23], even when the previous assumption on the magnitude of the projection matrix is violated.

It is worth noting that although our estimators in (1) and (3) are for the projection matrix Π of the principal subspace, we can also provide an estimator of U. By definition, the true subspace satisfies Π* = UU^⊤. Thus, the estimator Û can be computed from Π̂ by eigenvalue decomposition: we set the columns of Û to be the top k leading eigenvectors of Π̂. In case the top k eigenvalues of Π̂ are the same, we can follow the standard PCA convention and rotate the eigenvectors by a rotation matrix R such that (ÛR)^⊤ Σ̂ (ÛR) is diagonal. Then ÛR is an orthonormal basis for the estimated principal subspace, and can be used for visualization and dimension reduction.

3 Statistical Properties of the Proposed Estimators

In this section, we present the statistical properties of two estimators in the family (1): one with τ > ζ−, the other with τ = 0. The proofs are all included in the longer version of this paper.

3.1 Oracle Property and Convergence Rate of Convex Sparse PCA

To evaluate the statistical performance of principal subspace estimators, we need to define the estimation error between the estimated and the true projection matrices. In our study, we use the Frobenius norm error ‖Π̂ − Π*‖_F.

We first analyze the estimator in (1) when τ > ζ−. We prove that the estimator Π̂ in (1) recovers the support of Π* under suitable conditions on its magnitude. Before presenting this theorem, we introduce the definition of the oracle estimator, denoted by Π̂_O. Recall that S = supp(diag(Π*)). The oracle estimator Π̂_O is defined as

    Π̂_O = argmin_{supp(diag(Π)) ⊂ S, Π ∈ F^k} L(Π),   (4)

where L(Π) = −⟨Σ̂, Π⟩ + τ/2‖Π‖²_F. Note that the oracle estimator is not a practical estimator, because the true support S is unknown in practice.

The following theorem shows that, under suitable conditions, Π̂ in (1) coincides with the oracle estimator Π̂_O with high probability, and therefore exactly recovers the support of Π*.

Theorem 1.
(Support Recovery) Suppose the nonconvex penalty P_λ(Π) = Σ_{i,j=1}^p p_λ(Π_ij) satisfies conditions (a) and (b), and that Π* satisfies min_{(i,j)∈T} |Π*_ij| ≥ ν + C√k λ1/(λk − λk+1) · √(s/n). For the estimator in (1) with regularization parameter λ = Cλ1√(log p/n) and τ > ζ−, we have, with probability at least 1 − 1/n², that Π̂ = Π̂_O, which further implies supp(diag(Π̂)) = supp(diag(Π̂_O)) = supp(diag(Π*)) and rank(Π̂) = rank(Π̂_O) = k.

For example, if we use the MCP penalty, the magnitude assumption becomes min_{(i,j)∈T} |Π*_ij| ≥ Cbλ1√(log p/n) + C√k λ1/(λk − λk+1) · √(s/n).

Note that our proposed estimator in (1) does not rely on any oracle knowledge of the true support. Our theory in Theorem 1 shows that, with high probability, the estimator is identical to the oracle estimator, and thus exactly recovers the true support.

Compared to existing support recovery results for sparse PCA [1, 16], our condition on the magnitude is weaker. The limited correlation condition [16] and the even stronger spiked covariance condition [1] impose constraints not only on the principal subspace corresponding to λ1, . . . , λk, but also on the "non-signal" part, i.e., the subspace corresponding to λk+1, . . . , λp. Unlike these conditions, we only impose conditions on the "signal" part, i.e., the magnitude of the projection matrix Π* corresponding to λ1, . . . , λk.
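For intuition, the oracle estimator in (4) can be computed by standard PCA on the submatrix of Σ̂ indexed by the true support S, embedded back into p×p (this is also how the oracle baseline is computed in the experiments of Section 5). A minimal numpy sketch, with a function name of our choosing:

```python
import numpy as np

def oracle_estimator(Sigma_hat, S, k):
    """Rank-k projection onto the top-k eigenspace of the submatrix of
    Sigma_hat indexed by the (oracle) support S, padded back to p x p."""
    sub = Sigma_hat[np.ix_(S, S)]
    vals, vecs = np.linalg.eigh(sub)       # eigh returns ascending eigenvalues
    U_sub = vecs[:, -k:]                   # top-k eigenvectors of the submatrix
    p = Sigma_hat.shape[0]
    U = np.zeros((p, k))
    U[S, :] = U_sub
    return U @ U.T                         # projection matrix Pi_O

# toy check: Pi_O is a rank-k projection supported on S
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
Sigma_hat = X.T @ X / 50
Pi = oracle_estimator(Sigma_hat, [0, 1, 2], k=2)
assert np.allclose(Pi, Pi @ Pi) and np.isclose(np.trace(Pi), 2.0)
```

Since the columns of U are orthonormal, Π_O = UU^⊤ is idempotent with trace k, i.e., an exact rank-k projection matrix, as Theorem 1 asserts for the practical estimator as well.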
We attribute the oracle property of our estimator to the novel regularization (τ/2‖Π‖²_F plus a nonconvex penalty).

The oracle property immediately implies that, under the above magnitude condition, the estimator in (1) achieves the convergence rate of standard PCA as if we knew the true support S a priori. This is summarized in the following theorem.

Theorem 2. Under the same conditions as Theorem 1, we have, with probability at least 1 − 1/n², that

    ‖Π̂ − Π*‖_F ≤ C√k λ1/(λk − λk+1) · √(s/n),

for some universal constant C.

Evidently, the estimator attains a much sharper statistical rate of convergence than the state-of-the-art result proved in [23].

3.2 Convergence Rate of Nonconvex Sparse PCA

We now analyze the estimator in (3), which is the special case of (1) with τ = 0. We show that any local optimum of the nonconvex optimization problem in (3) is a good estimator. In other words, our theory applies to any projection matrix Π̂_{τ=0} ∈ R^{p×p} that satisfies the first-order necessary condition (variational inequality) for a local minimum of (3):

    ⟨Π̂_{τ=0} − Π′, −Σ̂ + ∇P_λ(Π̂_{τ=0})⟩ ≤ 0  for all Π′ ∈ F^k.

Recall that S = supp(diag(Π*)) with |S| = s, T = S × S with |T| = s², and T^c = [p] × [p] \ T. For (i, j) ∈ T1 ⊂ T with |T1| = t1, we assume |Π*_ij| ≥ ν, while for (i, j) ∈ T2 ⊂ T with |T2| = t2, we assume |Π*_ij| < ν. Clearly, s² = t1 + t2. There exists a minimal submatrix A ∈ R^{n1×n2} of Π* that contains all the elements indexed by T1, with s1 = min{n1, n2}.
There also exists a minimal submatrix B ∈ R^{m1×m2} of Π* that contains all the elements indexed by T2. Note that in general s1 ≤ s, m1 ≤ s, and m2 ≤ s; in the worst case, s1 = m1 = m2 = s.

Theorem 3. Suppose the nonconvex penalty P_λ(Π) = Σ_{i,j=1}^p p_λ(Π_ij) satisfies conditions (b), (c), and (d). For the estimator in (3) with regularization parameter λ = Cλ1√(log p/n) and ζ− ≤ (λk − λk+1)/4, with probability at least 1 − 4/p², any local optimal solution Π̂_{τ=0} satisfies

    ‖Π̂_{τ=0} − Π*‖_F ≤ 4Cλ1√s1/(λk − λk+1) · √(s/n) + 12Cλ1√(m1·m2)/(λk − λk+1) · √(log p/n),

where the first term corresponds to T1 (entries with |Π*_ij| ≥ ν) and the second to T2 (entries with |Π*_ij| < ν).

Note that the upper bound decomposes into two parts according to the magnitude of the entries |Π*_ij|, 1 ≤ i, j ≤ p, of the true projection matrix. We have the following comments.

On the one hand, for the strong "signals", i.e., |Π*_ij| ≥ ν, we achieve the convergence rate O(λ1√s1/(λk − λk+1) · √(s/n)). Since s1 is at most s, the worst-case rate is O(λ1/(λk − λk+1) · s/√n), which is sharper than the rate proved in [23], i.e., O(λ1/(λk − λk+1) · s√(log p/n)). In the case that s1 < s, the convergence rate can be even sharper.

On the other hand, for the weak "signals", i.e., |Π*_ij| < ν, we achieve the convergence rate O(λ1√(m1·m2)/(λk − λk+1) · √(log p/n)). Since both m1 and m2 are at most s, the worst-case rate is O(λ1/(λk − λk+1) · s√(log p/n)), which matches the rate proved in [23]. In the case that √(m1·m2) < s, the convergence rate is sharper than that in [23].

The above discussion clearly demonstrates the advantage of our estimator, which essentially benefits from the nonconvex penalty.

4 Optimization Algorithm

In this section, we present an optimization algorithm for solving (1) and (3). Since (3) is the special case of (1) with τ = 0, it suffices to develop an algorithm for (1).

Since (1) has both a nonsmooth regularization term and the nontrivial constraint set F^k, it is difficult to directly apply gradient descent and its variants. Following [23], we present an alternating direction method of multipliers (ADMM) algorithm. The proposed ADMM algorithm can efficiently compute the global optimum of (1), and can also find a local optimum of (3). It is worth noting that other algorithms, such as the Peaceman-Rachford splitting method [11], can also be used to solve (1).

We introduce an auxiliary variable Φ ∈ R^{p×p} and consider the following equivalent form of (1):

    argmin_{Π,Φ} −⟨Σ̂, Π⟩ + τ/2‖Π‖²_F + P_λ(Φ),  subject to Π = Φ, Π ∈ F^k.   (5)

The augmented Lagrangian function corresponding to (5) is

    L(Π, Φ, Θ) = ∞·1_{F^k}(Π) − ⟨Σ̂, Π⟩ + τ/2‖Π‖²_F + P_λ(Φ) + ⟨Θ, Π − Φ⟩ + ρ/2‖Π − Φ‖²_F,   (6)

where Θ ∈ R^{p×p} is the Lagrange multiplier associated with the equality constraint Π = Φ in (5), and ρ > 0 is a penalty parameter that enforces the equality constraint. The detailed update scheme is described in Algorithm 1.
In detail, the first subproblem (Line 5 of Algorithm 1) is solved by projecting ρ/(ρ+τ)·Φ(t) − 1/(ρ+τ)·Θ(t) + 1/(ρ+τ)·Σ̂ onto the Fantope F^k. This projection has a simple-form solution, as shown in [23, 16]. The second subproblem (Line 6 of Algorithm 1) is solved by the generalized soft-thresholding operator, as shown in [5, 17].

Algorithm 1: Solving the convex relaxation (5) using ADMM.
1: Input: covariance matrix estimator Σ̂
2: Parameters: regularization parameters λ > 0, τ ≥ 0; penalty parameter ρ > 0 of the augmented Lagrangian; maximum number of iterations T
3: Π(0) ← 0, Φ(0) ← 0, Θ(0) ← 0
4: For t = 0, . . . , T − 1:
5:   Π(t+1) ← argmin_{Π∈F^k} 1/2‖Π − (ρ/(ρ+τ)·Φ(t) − 1/(ρ+τ)·Θ(t) + 1/(ρ+τ)·Σ̂)‖²_F
6:   Φ(t+1) ← argmin_Φ 1/2‖Φ − (Π(t+1) + 1/ρ·Θ(t))‖²_F + P_{λ/ρ}(Φ)
7:   Θ(t+1) ← Θ(t) + ρ(Π(t+1) − Φ(t+1))
8: End For
9: Output: Π(T)

5 Experiments

In this section, we conduct simulations on synthetic datasets to validate the effectiveness of the proposed estimators in Section 2. We generate two synthetic datasets by designing two covariance matrices, each constructed through its eigenvalue decomposition.

For synthetic dataset I, we set s = 5 and k = 1. The leading eigenvalue of the covariance matrix Σ is λ1 = 100, and its eigenvector is sparse: only the first s = 5 entries are nonzero, each set to 1/√5. The other eigenvalues are λ2 = . . . = λp = 1, and their eigenvectors are chosen arbitrarily.

For synthetic dataset II, we set s = 10 and k = 5. The top 5 eigenvalues are λ1 = . . . = λ4 = 100 and λ5 = 10. We generate the corresponding eigenvectors by sampling their nonzero entries from a standard Gaussian distribution and then orthonormalizing them while keeping only the first s = 10 rows nonzero. The other eigenvalues are λ6 = . . . = λp = 1, and the associated eigenvectors are chosen arbitrarily.

From the covariance matrix, the ground-truth rank-k projection matrix Π* can be computed immediately. Note that synthetic dataset II is more challenging than synthetic dataset I: the smallest magnitude of Π* in dataset I is 0.2, while that in dataset II is much smaller (about 10⁻³). We sample n = 80 i.i.d. observations from a normal distribution N(0, Σ) with p = 128, and then calculate the sample covariance matrix Σ̂.

Since the focus of this paper is principal subspace estimation rather than principal eigenvector estimation, it suffices to compare our proposed estimators (Convex SPCA in (1) and Nonconvex SPCA in (3)) with the estimator proposed in [23], which we refer to as Fantope SPCA. Note that Fantope SPCA is the pioneering and state-of-the-art estimator for principal subspace estimation in sparse PCA. However, since Fantope SPCA uses the convex penalty ‖Π‖_{1,1} on the projection matrix Π, the estimator is biased [29]. We also compare our proposed estimators with the oracle estimator in (4), which is not a practical estimator but provides the best results we could hope to achieve. In our experiments, we need to compare the estimator attained by the algorithmic procedure with the oracle estimator.
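For reproducibility, synthetic dataset I can be generated along the following lines (a sketch: completing the sparse leading eigenvector to an orthonormal basis via QR is our choice for the "arbitrarily chosen" remaining eigenvectors):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, s = 128, 80, 5

# leading sparse eigenvector: first s entries equal to 1/sqrt(s)
u1 = np.zeros(p)
u1[:s] = 1.0 / np.sqrt(s)

# complete u1 to an orthonormal eigenbasis (remaining eigenvectors arbitrary)
M = np.column_stack([u1, rng.standard_normal((p, p - 1))])
Q, _ = np.linalg.qr(M)
evals = np.r_[100.0, np.ones(p - 1)]      # lambda_1 = 100, the rest = 1
Sigma = (Q * evals) @ Q.T

# sample n observations and form the sample covariance matrix
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Sigma_hat = X.T @ X / n

Pi_star = np.outer(u1, u1)                # ground-truth projection (k = 1)
```

The smallest nonzero magnitude of Π* here is 1/s = 0.2, matching the description above.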
To obtain the oracle estimator, we apply standard PCA to the submatrix (supported on the true support) of the sample covariance Σ̂; the true support is known because we use synthetic datasets.

To evaluate the performance of the above estimators, we report the Frobenius norm error ‖Π̂ − Π*‖_F. We also use the True Positive Rate (TPR) and False Positive Rate (FPR) to evaluate support recovery: the larger the TPR and the smaller the FPR, the better the support recovery.

Both of our estimators use the MCP penalty, though other nonconvex penalties such as SCAD could be used as well. In particular, we set b = 3. For Convex SPCA, we set τ = 2/b. The regularization parameter λ for our estimators, as well as for Fantope SPCA, is tuned by 5-fold cross validation on a held-out dataset. The experiments are repeated 20 times, and means as well as standard errors are reported. The empirical results on synthetic datasets I and II are displayed in Table 1.

Table 1: Empirical results for subspace estimation on synthetic datasets I and II.

Synthetic I (n = 80, p = 128, s = 5, k = 1)
                  ‖Π̂ − Π*‖_F       TPR              FPR
Oracle            0.0289±0.0134     1                0
Fantope SPCA      0.0317±0.0149     1.0000±0.0000    0.0146±0.0218
Convex SPCA       0.0290±0.0132     1.0000±0.0000    0.0000±0.0000
Nonconvex SPCA    0.0290±0.0133     1.0000±0.0000    0.0000±0.0000

Synthetic II (n = 80, p = 128, s = 10, k = 5)
                  ‖Π̂ − Π*‖_F       TPR              FPR
Oracle            0.1487±0.0208     1                0
Fantope SPCA      0.2788±0.0437     1.0000±0.0000    0.8695±0.1634
Convex SPCA       0.2031±0.0331     1.0000±0.0000    0.5814±0.0674
Nonconvex SPCA    0.2041±0.0326     1.0000±0.0000    0.6000±0.0829

It can be observed that both Convex SPCA and
Nonconvex SPCA greatly outperform the Fantope SPCA estimator [23] on both datasets. In detail, on synthetic dataset I, where the magnitude of Π∗ is relatively large, our Convex SPCA estimator achieves the same estimation error and the same perfect support recovery as the oracle estimator. This is consistent with our theoretical results in Theorems 1 and 2. In addition, our Nonconvex SPCA estimator achieves results very similar to those of Convex SPCA. This is not surprising, because provided that the magnitudes of all the entries in Π∗ are large, Nonconvex SPCA attains a rate that is only a factor of √s slower than that of Convex SPCA. Fantope SPCA cannot recover the support perfectly because it detects several false positives in the support. This implies that the LCC condition is stronger than our large-magnitude assumption, and does not hold on this dataset.

On synthetic dataset II, our Convex SPCA estimator does not perform as well as the oracle estimator. This is because the magnitude of Π∗ is small (about 10⁻³): given the sample size n = 80, the conditions of Theorem 1 are violated. Note, however, that Convex SPCA is still slightly better than Nonconvex SPCA, and both are much better than Fantope SPCA. This again illustrates the superiority of our estimators over the best existing approach, i.e., Fantope SPCA [23].

6 Conclusion

In this paper, we study the estimation of the k-dimensional principal subspace of a population covariance matrix Σ based on the sample covariance matrix Σ̂. We propose a family of estimators based on novel regularizations. The first estimator is based on convex optimization, and is suitable for projection matrices with large-magnitude entries. It enjoys the oracle property and the same convergence rate as standard PCA.
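The oracle estimator and the support-recovery metrics used in the experiments above can be sketched as follows. This is a minimal illustration under our own conventions (function names are ours; support is read off the rows of the projection matrix, and a thresholding tolerance is our assumption), not the paper's implementation.

```python
import numpy as np

def oracle_estimator(Sigma_hat, support, k):
    """Oracle principal subspace estimator: standard PCA on the submatrix
    of the sample covariance supported on the true support."""
    p = Sigma_hat.shape[0]
    sub = Sigma_hat[np.ix_(support, support)]
    _, vecs = np.linalg.eigh(sub)          # eigenvalues in ascending order
    V_sub = vecs[:, -k:]                   # top-k eigenvectors of the submatrix
    V = np.zeros((p, k))
    V[support, :] = V_sub                  # embed back into R^p
    return V @ V.T                         # rank-k projection matrix estimate

def frobenius_error(Pi_hat, Pi_star):
    """Frobenius norm error between estimated and true projection matrices."""
    return np.linalg.norm(Pi_hat - Pi_star, "fro")

def support_metrics(Pi_hat, Pi_star, tol=1e-8):
    """TPR and FPR of the recovered row support of the projection matrix."""
    est = np.abs(Pi_hat).sum(axis=1) > tol    # estimated support
    true = np.abs(Pi_star).sum(axis=1) > tol  # true support
    tpr = (est & true).sum() / true.sum()
    fpr = (est & ~true).sum() / max((~true).sum(), 1)
    return tpr, fpr
```

On the true support the oracle estimator reduces to an eigendecomposition of an s x s matrix, which is why it serves as the benchmark that the regularized estimators aim to match.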
The second estimator is based on nonconvex optimization, and it attains a faster rate than existing principal subspace estimators, even when the large-magnitude assumption is violated. Numerical experiments on synthetic datasets support our theoretical results.

Acknowledgement

We would like to thank the anonymous reviewers for their helpful comments. This research is partially supported by the grants NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.

References

[1] A. Amini and M. Wainwright. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, 37(5B):2877–2921, 2009.

[2] Q. Berthet and P. Rigollet. Computational lower bounds for sparse PCA. arXiv preprint arXiv:1304.0828, 2013.

[3] Q. Berthet and P. Rigollet. Optimal detection of sparse principal components in high dimension. The Annals of Statistics, 41(4):1780–1815, 2013.

[4] A. Birnbaum, I. M. Johnstone, B. Nadler, and D. Paul. Minimax bounds for sparse PCA with noisy high-dimensional data. The Annals of Statistics, 41(3):1055–1084, 2013.

[5] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253, 2011.

[6] T. T. Cai, Z. Ma, and Y. Wu. Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics, 41(6):3074–3110, 2013.

[7] A. d'Aspremont, F. Bach, and L. Ghaoui. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9:1269–1294, 2008.

[8] A. d'Aspremont, L. E. Ghaoui, M. I. Jordan, and G. R. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, pages 434–448, 2007.

[9] J. Dattorro. Convex Optimization & Euclidean Distance Geometry.
Meboo Publishing USA, 2011.

[10] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[11] B. He, H. Liu, Z. Wang, and X. Yuan. A strictly contractive Peaceman-Rachford splitting method for convex programming. SIAM Journal on Optimization, 24(3):1011–1040, 2014.

[12] I. Johnstone and A. Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693, 2009.

[13] I. Jolliffe, N. Trendafilov, and M. Uddin. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531–547, 2003.

[14] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre. Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11:517–553, 2010.

[15] R. Krauthgamer, B. Nadler, and D. Vilenchik. Do semidefinite relaxations really solve sparse PCA? arXiv preprint arXiv:1306.3690, 2013.

[16] J. Lei and V. Q. Vu. Sparsistency and agnostic inference in sparse PCA. arXiv preprint arXiv:1401.6978, 2014.

[17] P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. arXiv preprint arXiv:1305.2436, 2013.

[18] K. Lounici. Sparse principal component analysis with missing observations. In High Dimensional Probability VI, pages 327–356. Springer, 2013.

[19] Z. Ma. Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801, 2013.

[20] D. Paul and I. M. Johnstone. Augmented sparse principal component analysis for high dimensional data. arXiv preprint arXiv:1202.1242, 2012.

[21] D. Shen, H. Shen, and J. Marron.
Consistency of sparse PCA in high dimension, low sample size contexts. Journal of Multivariate Analysis, 115:317–333, 2013.

[22] H. Shen and J. Huang. Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis, 99(6):1015–1034, 2008.

[23] V. Q. Vu, J. Cho, J. Lei, and K. Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In NIPS, pages 2670–2678, 2013.

[24] V. Q. Vu and J. Lei. Minimax rates of estimation for sparse PCA in high dimensions. In International Conference on Artificial Intelligence and Statistics, pages 1278–1286, 2012.

[25] V. Q. Vu and J. Lei. Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947, 2013.

[26] Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics, 42(6):2164–2201, 2014.

[27] Z. Wang, H. Lu, and H. Liu. Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352, 2014.

[28] X.-T. Yuan and T. Zhang. Truncated power method for sparse eigenvalue problems. The Journal of Machine Learning Research, 14(1):899–925, 2013.

[29] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[30] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006.