{"title": "Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time", "book": "Advances in Neural Information Processing Systems", "page_first": 3383, "page_last": 3391, "abstract": "We provide statistical and computational analysis of sparse Principal Component Analysis (PCA) in high dimensions. The sparse PCA problem is highly nonconvex in nature. Consequently, though its global solution attains the optimal statistical rate of convergence, such a solution is computationally intractable to obtain. Meanwhile, although its convex relaxations are tractable to compute, they yield estimators with suboptimal statistical rates of convergence. On the other hand, existing nonconvex optimization procedures, such as greedy methods, lack statistical guarantees. In this paper, we propose a two-stage sparse PCA procedure that attains the optimal principal subspace estimator in polynomial time. The main stage employs a novel algorithm named sparse orthogonal iteration pursuit, which iteratively solves the underlying nonconvex problem. However, our analysis shows that this algorithm only has the desired computational and statistical guarantees within a restricted region, namely the basin of attraction. To obtain the desired initial estimator that falls into this region, we solve a convex formulation of sparse PCA with early stopping. Under an integrated analytic framework, we simultaneously characterize the computational and statistical performance of this two-stage procedure. Computationally, our procedure converges at the rate of $1/\\sqrt{t}$ within the initialization stage, and at a geometric rate within the main stage. Statistically, the final principal subspace estimator achieves the minimax-optimal statistical rate of convergence with respect to the sparsity level $s^*$, dimension $d$ and sample size $n$. 
Our procedure motivates a general paradigm of tackling nonconvex statistical learning problems with provable statistical guarantees.", "full_text": "Tighten after Relax: Minimax-Optimal Sparse PCA in Polynomial Time

Zhaoran Wang    Huanran Lu    Han Liu

Department of Operations Research and Financial Engineering
Princeton University
Princeton, NJ 08540
{zhaoran,huanranl,hanliu}@princeton.edu

Abstract

We provide statistical and computational analysis of sparse Principal Component Analysis (PCA) in high dimensions. The sparse PCA problem is highly nonconvex in nature. Consequently, though its global solution attains the optimal statistical rate of convergence, such a solution is computationally intractable to obtain. Meanwhile, although its convex relaxations are tractable to compute, they yield estimators with suboptimal statistical rates of convergence. On the other hand, existing nonconvex optimization procedures, such as greedy methods, lack statistical guarantees. In this paper, we propose a two-stage sparse PCA procedure that attains the optimal principal subspace estimator in polynomial time. The main stage employs a novel algorithm named sparse orthogonal iteration pursuit, which iteratively solves the underlying nonconvex problem. However, our analysis shows that this algorithm only has the desired computational and statistical guarantees within a restricted region, namely the basin of attraction. To obtain the desired initial estimator that falls into this region, we solve a convex formulation of sparse PCA with early stopping. Under an integrated analytic framework, we simultaneously characterize the computational and statistical performance of this two-stage procedure. Computationally, our procedure converges at the rate of $1/\sqrt{t}$ within the initialization stage, and at a geometric rate within the main stage. 
Statistically, the final principal subspace estimator achieves the minimax-optimal statistical rate of convergence with respect to the sparsity level $s^*$, dimension $d$ and sample size $n$. Our procedure motivates a general paradigm of tackling nonconvex statistical learning problems with provable statistical guarantees.

1 Introduction

We denote by $x_1, \ldots, x_n$ the $n$ realizations of a random vector $X \in \mathbb{R}^d$ with population covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$. The goal of Principal Component Analysis (PCA) is to recover the top $k$ leading eigenvectors $u_1^*, \ldots, u_k^*$ of $\Sigma$. In high dimensional settings with $d \gg n$, [1-3] showed that classical PCA can be inconsistent. Additional assumptions are needed to avoid such a curse of dimensionality. For example, when the first leading eigenvector is of primary interest, one common assumption is that $u_1^*$ is sparse — the number of nonzero entries of $u_1^*$, denoted by $s^*$, is smaller than $n$. Under such an assumption of sparsity, significant progress has been made on the methodological development [4-13] as well as theoretical understanding [1, 3, 14-21] of sparse PCA.

However, there remains a significant gap between the computational and statistical aspects of sparse PCA: no tractable algorithm is known to provably attain the statistically optimal sparse PCA estimator without relying on the spiked covariance assumption. This gap arises from the nonconvexity of sparse PCA. In detail, the sparse PCA estimator for the first leading eigenvector $u_1^*$ is

$\hat{u}_1 = \mathrm{argmin}_{\|v\|_2 = 1} \; -v^T \hat{\Sigma} v$, subject to $\|v\|_0 = s^*$,    (1)

where $\hat{\Sigma}$ is the sample covariance estimator, $\|\cdot\|_2$ is the Euclidean norm, $\|\cdot\|_0$ gives the number of nonzero coordinates, and $s^*$ is the sparsity level of $u_1^*$. Although this estimator has been proven to attain the optimal statistical rate of convergence [15, 17], its computation is intractable because it requires minimizing a concave function over cardinality constraints [22]. Estimating the top $k$ leading eigenvectors is even more challenging because of the extra orthogonality constraint on $\hat{u}_1, \ldots, \hat{u}_k$.

To address this computational issue, [5] proposed a convex relaxation approach, named DSPCA, for estimating the first leading eigenvector. [13] generalized DSPCA to estimate the principal subspace spanned by the top $k$ leading eigenvectors. Nevertheless, [13] proved the obtained estimator only attains the suboptimal $s^*\sqrt{\log d/n}$ statistical rate. Meanwhile, several methods have been proposed to directly address the underlying nonconvex problem (1), e.g., variants of power methods or iterative thresholding methods [10-12], a greedy method [8], as well as regression-type methods [4, 6, 7, 18]. However, most of these methods lack statistical guarantees. There are several exceptions: (1) [11] proposed the truncated power method, which attains the optimal $\sqrt{s^* \log d/n}$ rate for estimating $u_1^*$. However, it hinges on the assumption that the initial estimator $u^{(0)}$ satisfies $|\sin \angle(u^{(0)}, u^*)| \leq 1 - C$, where $C \in (0, 1)$ is a constant. If $u^{(0)}$ is chosen uniformly at random on the $\ell_2$ sphere, this assumption holds with probability decreasing to zero as $d \to \infty$ [23]. (2) [12] proposed an iterative thresholding method, which attains a near-optimal statistical rate when estimating several individual leading eigenvectors. [18] proposed a regression-type method, which attains the optimal principal subspace estimator. 
However, these two methods hinge on the spiked covariance assumption, and require the data to be exactly Gaussian (sub-Gaussian data are not covered). For them, the spiked covariance assumption is crucial, because they use the diagonal thresholding method [1] to obtain the initialization, which fails when the spiked covariance assumption doesn't hold, or when each coordinate of $X$ has the same variance. Besides, except for [12] and [18], all these computational procedures only recover the first leading eigenvector, and leverage the deflation method [24] to recover the rest, which leads to identifiability and orthogonality issues when the top $k$ eigenvalues of $\Sigma$ are not distinct.

To close the gap between the computational and statistical aspects of sparse PCA, we propose a two-stage procedure for estimating the $k$-dimensional principal subspace $\mathcal{U}^*$ spanned by the top $k$ leading eigenvectors $u_1^*, \ldots, u_k^*$. The details of the two stages are as follows: (1) For the main stage, we propose a novel algorithm, named sparse orthogonal iteration pursuit, to directly estimate the principal subspace of $\Sigma$. Our analysis shows that, when its initialization falls into a restricted region, namely the basin of attraction, this algorithm enjoys a fast optimization rate of convergence, and attains the optimal principal subspace estimator. (2) To obtain the desired initialization, we compute a convex relaxation of sparse PCA. Unlike [5, 13], which calculate the exact minimizers, we early stop the corresponding optimization algorithm as soon as the iterative sequence enters the basin of attraction for the main stage. The rationale is that this convex optimization algorithm converges at a slow sublinear rate towards a suboptimal estimator, and incurs relatively high computational overhead within each iteration.

Under a unified analytic framework, we provide simultaneous statistical and computational guarantees for this two-stage procedure. 
Given that the sample size $n$ is sufficiently large, and that the eigengap between the $k$-th and $(k+1)$-th eigenvalues of the population covariance matrix $\Sigma$ is nonzero, we prove: (1) The final subspace estimator $\hat{\mathcal{U}}$ attained by our two-stage procedure achieves the minimax-optimal $\sqrt{s^* \log d/n}$ statistical rate of convergence. (2) Within the initialization stage, the iterative sequence of subspace estimators $\{\mathcal{U}^{(t)}\}_{t=0}^{T}$ (at the $T$-th iteration we early stop the initialization stage) satisfies

$D(\mathcal{U}^*, \mathcal{U}^{(t)}) \leq \underbrace{\delta_1(\Sigma) \cdot s^*\sqrt{\log d/n}}_{\text{Statistical Error}} + \underbrace{\delta_2(k, s^*, d, n) \cdot 1/\sqrt{t}}_{\text{Optimization Error}}$    (2)

with high probability. Here $D(\cdot,\cdot)$ is the subspace distance, while $s^*$ is the sparsity level of $\mathcal{U}^*$, both of which will be defined in \u00a72. Here $\delta_1(\Sigma)$ is a quantity which depends on the population covariance matrix $\Sigma$, while $\delta_2(k, s^*, d, n)$ depends on $k$, $s^*$, $d$ and $n$ (see \u00a74 for details). 
(3) Within the main stage, the iterative sequence $\{\mathcal{U}^{(t)}\}_{t=T+1}^{T+\tilde{T}}$ (where $\tilde{T}$ denotes the total number of iterations of sparse orthogonal iteration pursuit) satisfies

$D(\mathcal{U}^*, \mathcal{U}^{(t)}) \leq \underbrace{\delta_3(\Sigma, k) \cdot \overbrace{\sqrt{s^* \log d/n}}^{\text{Optimal Rate}}}_{\text{Statistical Error}} + \underbrace{\gamma(\Sigma)^{(t-T-1)/4} \cdot D(\mathcal{U}^*, \mathcal{U}^{(T+1)})}_{\text{Optimization Error}}$    (3)

with high probability, where $\delta_3(\Sigma, k)$ is a quantity that only depends on $\Sigma$ and $k$, and

$\gamma(\Sigma) = [3\lambda_{k+1}(\Sigma) + \lambda_k(\Sigma)]/[\lambda_{k+1}(\Sigma) + 3\lambda_k(\Sigma)] < 1$.    (4)

Here $\lambda_k(\Sigma)$ and $\lambda_{k+1}(\Sigma)$ are the $k$-th and $(k+1)$-th eigenvalues of $\Sigma$. See \u00a74 for more details. Unlike previous works, our theory and method don't depend on the spiked covariance assumption, or require the data distribution to be Gaussian.

Figure 1: An illustration of our two-stage procedure.

Our analysis shows that, at the initialization stage, the optimization error decays to zero at the rate of $1/\sqrt{t}$. However, the upper bound of $D(\mathcal{U}^*, \mathcal{U}^{(t)})$ in (2) can't be smaller than the suboptimal $s^*\sqrt{\log d/n}$ rate of convergence, even with an infinite number of iterations. This phenomenon, which is illustrated in Figure 1, reveals the limit of the convex relaxation approaches for sparse PCA. Within the main stage, as the optimization error term in (3) decreases to zero geometrically, the upper bound of $D(\mathcal{U}^*, \mathcal{U}^{(t)})$ decreases towards the $\sqrt{s^* \log d/n}$ statistical rate of convergence, which is minimax-optimal with respect to the sparsity level $s^*$, dimension $d$ and sample size $n$ [17]. 
Moreover, in Theorem 2 we will show that the basin of attraction for the proposed sparse orthogonal iteration pursuit algorithm can be characterized as

$\{\mathcal{U} : D(\mathcal{U}^*, \mathcal{U}) \leq R\}$, where $R = \min\{\sqrt{k\gamma(\Sigma)}\,[1 - \gamma(\Sigma)^{1/2}]/2,\ \sqrt{2\gamma(\Sigma)}/4\}$.    (5)

Here $\gamma(\Sigma)$ is defined in (4) and $R$ denotes the radius of the basin of attraction.

The contribution of this paper is three-fold: (1) We propose the first tractable procedure that provably attains the subspace estimator with the minimax-optimal statistical rate of convergence with respect to the sparsity level $s^*$, dimension $d$ and sample size $n$, without relying on the restrictive spiked covariance assumption or the Gaussian assumption. (2) We propose a novel algorithm named sparse orthogonal iteration pursuit, which converges to the optimal estimator at a fast geometric rate. The computation within each iteration is highly efficient compared with convex relaxation approaches. (3) We build a joint analytic framework that simultaneously captures the computational and statistical properties of sparse PCA. Under this framework, we characterize the phenomenon of the basin of attraction for the proposed sparse orthogonal iteration pursuit algorithm. In comparison with our previous work on nonconvex M-estimators [25], our analysis provides a more general paradigm of solving nonconvex learning problems with provable guarantees. One byproduct of our analysis is novel techniques for analyzing the statistical properties of the intermediate solutions of the Alternating Direction Method of Multipliers [26].

Notation: Let $A = [A_{i,j}] \in \mathbb{R}^{d \times d}$ and $v = (v_1, \ldots, v_d)^T \in \mathbb{R}^d$. The $\ell_q$ norm ($q \geq 1$) of $v$ is $\|v\|_q$. Specifically, $\|v\|_0$ gives the number of nonzero entries of $v$. For a matrix $A$, the $i$-th largest eigenvalue and singular value are $\lambda_i(A)$ and $\sigma_i(A)$. 
For $q \geq 1$, $\|A\|_q$ is the matrix operator $q$-norm, e.g., we have $\|A\|_2 = \sigma_1(A)$. The Frobenius norm is denoted as $\|A\|_F$. For $A_1$ and $A_2$, their inner product is $\langle A_1, A_2 \rangle = \mathrm{tr}(A_1^T A_2)$. For a set $S$, $|S|$ denotes its cardinality. The $d \times d$ identity matrix is $I_d$. For index sets $\mathcal{I}, \mathcal{J} \subseteq \{1, \ldots, d\}$, we define $A_{\mathcal{I},\mathcal{J}} \in \mathbb{R}^{d \times d}$ to be the matrix whose $(i,j)$-th entry is $A_{i,j}$ if $i \in \mathcal{I}$ and $j \in \mathcal{J}$, and zero otherwise. When $\mathcal{I} = \mathcal{J}$, we abbreviate it as $A_{\mathcal{I}}$. If $\mathcal{I}$ or $\mathcal{J}$ is $\{1, \ldots, d\}$, we replace it with a dot, e.g., $A_{\mathcal{I},\cdot}$. We denote by $A_{i,*} \in \mathbb{R}^d$ the $i$-th row vector of $A$. A matrix is orthonormal if its columns are unit-length orthogonal vectors. The $(p,q)$-norm of a matrix, denoted as $\|A\|_{p,q}$, is obtained by first taking the $\ell_p$ norm of each row, and then taking the $\ell_q$ norm of these row norms. We denote $\mathrm{diag}(A)$ to be the vector consisting of the diagonal entries of $A$. With a little abuse of notation, we denote by $\mathrm{diag}(v)$ the diagonal matrix with $v_1, \ldots, v_d$ on its diagonal. Hereafter, we use generic numerical constants $C, C', C'', \ldots$, whose values change from line to line.

2 Background

In the following, we introduce the distance between subspaces and the notion of sparsity for a subspace.

Subspace Distance: Let $\mathcal{U}$ and $\mathcal{U}'$ be two $k$-dimensional subspaces of $\mathbb{R}^d$. We denote the projection matrices onto them by $\Pi$ and $\Pi'$ respectively. 
One definition of the distance between $\mathcal{U}$ and $\mathcal{U}'$ is

$D(\mathcal{U}, \mathcal{U}') = \|\Pi - \Pi'\|_F$.    (6)

This definition is invariant to rotations of the orthonormal basis.

Subspace Sparsity: For the $k$-dimensional principal subspace $\mathcal{U}^*$ of $\Sigma$, the definition of its sparsity should be invariant to the choice of basis, because $\Sigma$'s top $k$ eigenvalues might not be distinct. Here we define the sparsity level $s^*$ of $\mathcal{U}^*$ to be the number of nonzero coefficients on the diagonal of its projection matrix $\Pi^*$. One can verify that (see [17] for details)

$s^* = |\mathrm{supp}[\mathrm{diag}(\Pi^*)]| = \|U^*\|_{2,0}$,    (7)

where $\|\cdot\|_{2,0}$ gives the row-sparsity level, i.e., the number of nonzero rows. Here the columns of $U^*$ can be any orthonormal basis of $\mathcal{U}^*$. This definition reduces to the sparsity of $u_1^*$ when $k = 1$.

Subspace Estimation: For the $k$-dimensional $s^*$-sparse principal subspace $\mathcal{U}^*$ of $\Sigma$, [17] considered the following estimator for the orthonormal matrix $U^*$ consisting of the basis of $\mathcal{U}^*$,

$\hat{U} = \mathrm{argmin}_{U \in \mathbb{R}^{d \times k}} \; -\langle \hat{\Sigma}, UU^T \rangle$, subject to $U$ orthonormal, and $\|U\|_{2,0} \leq s^*$,    (8)

where $\hat{\Sigma}$ is an estimator of $\Sigma$. Let $\hat{\mathcal{U}}$ be the column space of $\hat{U}$. [17] proved that, assuming $\hat{\Sigma}$ is the sample covariance estimator, and the data are independent sub-Gaussian, $\hat{\mathcal{U}}$ attains the optimal statistical rate. However, direct computation of this estimator is NP-hard even for $k = 1$ [22].

3 A Two-stage Procedure for Sparse PCA

In the following, we present the two-stage procedure for sparse PCA. 
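Before detailing the two stages, the subspace distance (6) and the subspace sparsity (7) of \u00a72 can be illustrated numerically. The sketch below is for illustration only and assumes numpy; the helper names (`proj`, `subspace_distance`, `subspace_sparsity`) and the example subspace are made up here, not part of the paper:

```python
import numpy as np

def proj(U):
    """Projection matrix onto the column span of an orthonormal basis U."""
    return U @ U.T

def subspace_distance(U1, U2):
    """D(U, U') = ||Pi - Pi'||_F from (6); invariant to rotations of the basis."""
    return np.linalg.norm(proj(U1) - proj(U2), "fro")

def subspace_sparsity(U):
    """s* = ||U||_{2,0} from (7): the number of nonzero rows of an orthonormal basis."""
    return int(np.sum(np.linalg.norm(U, axis=1) > 1e-12))

# Example: a 2-dimensional subspace of R^5 supported on rows {0, 2, 3}.
U = np.zeros((5, 2))
U[[0, 2], 0] = [3 / 5, 4 / 5]
U[3, 1] = 1.0
# Rotating the basis changes U but not the projection matrix, hence not D or s*.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
assert subspace_distance(U, U @ R) < 1e-10
assert subspace_sparsity(U) == 3
```

The rotation-invariance checked at the end is exactly why both notions are stated through the projection matrix rather than through any particular basis.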
We will first introduce sparse orthogonal iteration pursuit for the main stage, and then present the convex relaxation for initialization.

Algorithm 1 Main stage: Sparse orthogonal iteration pursuit. Here $T$ denotes the total number of iterations of the initialization stage. To unify the later analysis, let $t$ start from $T + 1$.
1: Function: $\hat{U} \leftarrow$ SparseOrthogonalIterationPursuit$(\hat{\Sigma}, U_{\mathrm{init}})$
2: Input: Covariance Matrix Estimator $\hat{\Sigma}$, Initialization $U_{\mathrm{init}}$
3: Parameter: Sparsity Parameter $\hat{s}$, Maximum Number of Iterations $\tilde{T}$
4: Initialization: $\tilde{U}^{(T+1)} \leftarrow$ Truncate$(U_{\mathrm{init}}, \hat{s})$, $\;U^{(T+1)}, R_2^{(T+1)} \leftarrow$ Thin_QR$(\tilde{U}^{(T+1)})$
5: For $t = T+1, \ldots, T+\tilde{T}-1$
6:   $\tilde{V}^{(t+1)} \leftarrow \hat{\Sigma} \cdot U^{(t)}$, $\;V^{(t+1)}, R_1^{(t+1)} \leftarrow$ Thin_QR$(\tilde{V}^{(t+1)})$
7:   $\tilde{U}^{(t+1)} \leftarrow$ Truncate$(V^{(t+1)}, \hat{s})$, $\;U^{(t+1)}, R_2^{(t+1)} \leftarrow$ Thin_QR$(\tilde{U}^{(t+1)})$
8: End For
9: Output: $\hat{U} \leftarrow U^{(T+\tilde{T})}$

Sparse Orthogonal Iteration Pursuit: For the main stage, we propose sparse orthogonal iteration pursuit (Algorithm 1) to solve (8). In Algorithm 1, Truncate$(\cdot,\cdot)$ (Line 7) is defined in Algorithm 2. In Lines 6 and 7, Thin_QR$(\cdot)$ denotes the thin QR decomposition (see [27] for details). In detail, $V^{(t+1)} \in \mathbb{R}^{d \times k}$ and $U^{(t+1)} \in \mathbb{R}^{d \times k}$ are orthonormal matrices, and they satisfy $V^{(t+1)} \cdot R_1^{(t+1)} = \tilde{V}^{(t+1)}$ and $U^{(t+1)} \cdot R_2^{(t+1)} = \tilde{U}^{(t+1)}$, where $R_1^{(t+1)}, R_2^{(t+1)} \in \mathbb{R}^{k \times k}$. This decomposition can be accomplished with $O(k^2 d)$ operations using the Householder algorithm [27]. 
Here recall that $k$ is the rank of the principal subspace of interest, which is much smaller than the dimension $d$.

Algorithm 1 consists of two steps: (1) Line 6 performs a matrix multiplication and a renormalization using the QR decomposition. This step is named orthogonal iteration in numerical analysis [27]. When the first leading eigenvector ($k = 1$) is of interest, it reduces to the well-known power iteration. The intuition behind this step can be understood as follows. We consider the minimization problem in (8) without the row-sparsity constraint. Note that the gradient of the objective function is $-2\hat{\Sigma} \cdot U^{(t)}$. Hence, the gradient descent update scheme for this problem is

$\tilde{V}^{(t+1)} \leftarrow P_{\mathrm{orth}}(U^{(t)} + \eta \cdot 2\hat{\Sigma} \cdot U^{(t)})$,    (9)

where $\eta$ is the step size, and $P_{\mathrm{orth}}(\cdot)$ denotes the renormalization step. [28] showed that the optimal step size $\eta$ is infinity. Thus we have $P_{\mathrm{orth}}(U^{(t)} + \eta \cdot 2\hat{\Sigma} \cdot U^{(t)}) = P_{\mathrm{orth}}(\eta \cdot 2\hat{\Sigma} \cdot U^{(t)}) = P_{\mathrm{orth}}(\hat{\Sigma} \cdot U^{(t)})$, which implies that (9) is equivalent to Line 6. (2) In Line 7, we take a truncation step to enforce the row-sparsity constraint in (8). In detail, we greedily select the $\hat{s}$ most important rows. To enforce the orthonormality constraint in (8), we perform another renormalization step after the truncation. Note that the QR decomposition in Line 7 gives a both orthonormal and row-sparse $U^{(t+1)}$, because $\tilde{U}^{(t+1)}$ is row-sparse by truncation, and the QR decomposition preserves its row-sparsity. By iteratively performing these two steps, we are approximately solving the nonconvex problem in (8). 
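The two steps above (multiply-and-renormalize, then truncate-and-renormalize) can be sketched compactly. This is a simplified illustration assuming numpy, not the authors' reference implementation; the function names (`truncate`, `soip_step`) and the toy diagonal covariance with its deterministic starting basis are assumptions made for the example:

```python
import numpy as np

def truncate(V, s_hat):
    """Keep the s_hat rows of V with the largest l2 norms, zero out the rest."""
    keep = np.argsort(-np.linalg.norm(V, axis=1))[:s_hat]
    U_tilde = np.zeros_like(V)
    U_tilde[keep] = V[keep]
    return U_tilde

def soip_step(Sigma_hat, U, s_hat):
    """One pass of Lines 6-7 of Algorithm 1: multiply + thin QR, truncate + thin QR."""
    V, _ = np.linalg.qr(Sigma_hat @ U)            # Line 6: orthogonal iteration
    U_next, _ = np.linalg.qr(truncate(V, s_hat))  # Line 7: truncation + renormalization
    return U_next

# Toy run: a diagonal covariance whose top-2 eigenvectors are e_0 and e_1.
d, k, s_hat = 20, 2, 2
Sigma_hat = np.diag([10.0, 8.0] + [1.0] * (d - 2))
M = np.ones((d, k))
M[:, 1] = np.arange(d)        # a deterministic, dense starting basis
U, _ = np.linalg.qr(M)
for _ in range(30):
    U = soip_step(Sigma_hat, U, s_hat)
# The iterate ends up supported on the two leading coordinates, and its
# projection matrix matches the true principal subspace.
assert np.allclose(U @ U.T, np.diag([1.0, 1.0] + [0.0] * (d - 2)))
```

The truncation keeps $U^{(t+1)}$ row-sparse, and the QR factor only mixes columns, so the row support survives the renormalization, mirroring the argument in the text.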
Although it is not clear whether this procedure achieves the global minimum of (8), we will prove that the obtained estimator enjoys the same optimal statistical rate of convergence as the global minimum.

Algorithm 2 Main stage: The Truncate$(\cdot,\cdot)$ function used in Line 7 of Algorithm 1.
1: Function: $\tilde{U}^{(t+1)} \leftarrow$ Truncate$(V^{(t+1)}, \hat{s})$
2: Row Sorting: $\mathcal{I}_{\hat{s}} \leftarrow$ the set of row indices $i$ with the top $\hat{s}$ largest $\|V_{i,*}^{(t+1)}\|_2$'s
3: Truncation: $\tilde{U}_{i,*}^{(t+1)} \leftarrow \mathbb{1}(i \in \mathcal{I}_{\hat{s}}) \cdot V_{i,*}^{(t+1)}$, for all $i \in \{1, \ldots, d\}$
4: Output: $\tilde{U}^{(t+1)}$

Algorithm 3 Initialization stage: Solving the convex relaxation (10) using ADMM. In Lines 6 and 7, we need to solve two subproblems. The first one is equivalent to projecting $\Phi^{(t)} - \Theta^{(t)} + \hat{\Sigma}/\rho$ onto $\mathcal{A}$. This projection can be computed using Algorithm 4 in [29]. The second one can be solved by the entry-wise soft-thresholding shown in Algorithm 5 in [29]. We defer these two algorithms and their derivations to the extended version [29] of this paper.
1: Function: $U_{\mathrm{init}} \leftarrow$ ADMM$(\hat{\Sigma})$
2: Input: Covariance Matrix Estimator $\hat{\Sigma}$
3: Parameter: Regularization Parameter $\rho > 0$ in (10), Penalty Parameter $\beta > 0$ of the Augmented Lagrangian, Maximum Number of Iterations $T$
4: $\Pi^{(0)} \leftarrow 0$, $\Phi^{(0)} \leftarrow 0$, $\Theta^{(0)} \leftarrow 0$
5: For $t = 0, \ldots, T-1$
6:   $\Pi^{(t+1)} \leftarrow \mathrm{argmin}\{L(\Pi, \Phi^{(t)}, \Theta^{(t)}) + \beta/2 \cdot \|\Pi - \Phi^{(t)}\|_F^2 \mid \Pi \in \mathcal{A}\}$
7:   $\Phi^{(t+1)} \leftarrow \mathrm{argmin}\{L(\Pi^{(t+1)}, \Phi, \Theta^{(t)}) + \beta/2 \cdot \|\Pi^{(t+1)} - \Phi\|_F^2 \mid \Phi \in \mathcal{B}\}$
8:   $\Theta^{(t+1)} \leftarrow \Theta^{(t)} - \beta(\Pi^{(t+1)} - \Phi^{(t+1)})$
9: End For
10: $\bar{\Pi}^{(T)} \leftarrow 1/T \cdot \sum_{t=1}^{T} \Pi^{(t)}$; let the columns of $U_{\mathrm{init}}$ be the top $k$ leading eigenvectors of $\bar{\Pi}^{(T)}$
11: Output: $U_{\mathrm{init}} \in \mathbb{R}^{d \times k}$

Convex Relaxation for Initialization: To obtain a good initialization for sparse orthogonal iteration pursuit, we consider the following convex minimization problem proposed by [5, 13],

minimize $\{-\langle \hat{\Sigma}, \Pi \rangle + \rho\|\Pi\|_{1,1} \mid \mathrm{tr}(\Pi) = k,\ 0 \preceq \Pi \preceq I_d\}$,    (10)

which relaxes the combinatorial optimization problem in (8). The intuition behind this relaxation can be understood as follows: (1) $\Pi$ is a reparametrization of $UU^T$ in (8), which is a projection matrix with $k$ nonzero eigenvalues of 1. In (10), this constraint is relaxed to $\mathrm{tr}(\Pi) = k$ and $0 \preceq \Pi \preceq I_d$, which indicates that the eigenvalues of $\Pi$ should be in $[0, 1]$ while their sum is $k$. (2) For the row-sparsity constraint in (8), [13] proved that $\|\Pi^*\|_{0,0} \leq |\mathrm{supp}[\mathrm{diag}(\Pi^*)]|^2 = \|U^*\|_{2,0}^2 = (s^*)^2$. Correspondingly, the row-sparsity constraint in (8) translates to $\|\Pi\|_{0,0} \leq (s^*)^2$, which is relaxed to the regularization term $\|\Pi\|_{1,1}$ in (10). For notational simplicity, we define

$\mathcal{A} = \{\Pi : \Pi \in \mathbb{R}^{d \times d},\ \mathrm{tr}(\Pi) = k,\ 0 \preceq \Pi \preceq I_d\}$.    (11)

Note that (10) has both a nonsmooth regularization term and a nontrivial constraint set $\mathcal{A}$. We use the Alternating Direction Method of Multipliers (ADMM, Algorithm 3). 
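The two subproblems of Lines 6 and 7 have closed forms: the $\Pi$-update is a Euclidean projection onto $\mathcal{A}$ (eigenvalue clipping with a bisection on the shift), and the $\Phi$-update is entry-wise soft-thresholding. The sketch below is a hedged illustration assuming numpy; the update arguments are derived directly from the augmented Lagrangian used in Lines 6-7, so their exact scalings may differ from the variants in [29], and the function names and the tiny diagonal test case are made up here:

```python
import numpy as np

def project_fantope(W, k, tol=1e-10):
    """Projection of symmetric W onto A = {Pi : tr(Pi) = k, 0 <= Pi <= I_d}.

    Only the spectrum moves: eigenvalues are shifted by theta and clipped to
    [0, 1], with theta found by bisection so the clipped eigenvalues sum to k.
    """
    lam, Q = np.linalg.eigh((W + W.T) / 2)
    lo, hi = lam.min() - 1.0, lam.max()
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.clip(lam - theta, 0.0, 1.0).sum() > k:
            lo = theta  # clipped sum decreases in theta; shift further
        else:
            hi = theta
    lam_proj = np.clip(lam - 0.5 * (lo + hi), 0.0, 1.0)
    return (Q * lam_proj) @ Q.T

def soft_threshold(W, tau):
    """Entry-wise soft-thresholding, the proximal map of tau * ||.||_{1,1}."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def admm_init(Sigma_hat, k, rho, beta, T):
    """Early-stopped ADMM for the relaxation; returns the top-k eigenvectors
    of the averaged iterate, as in Line 10 of Algorithm 3."""
    d = Sigma_hat.shape[0]
    Phi = np.zeros((d, d))
    Theta = np.zeros((d, d))
    Pi_bar = np.zeros((d, d))
    for _ in range(T):
        Pi = project_fantope(Phi + (Theta + Sigma_hat) / beta, k)  # Line 6
        Phi = soft_threshold(Pi - Theta / beta, rho / beta)        # Line 7
        Theta = Theta - beta * (Pi - Phi)                          # Line 8
        Pi_bar += Pi / T
    _, Q = np.linalg.eigh(Pi_bar)
    return Q[:, -k:]  # leading eigenvectors (eigh sorts eigenvalues ascending)

# Tiny check on a diagonal covariance: the initializer recovers span{e_0, e_1}.
U_init = admm_init(np.diag([5.0, 4.0, 1.0, 1.0, 1.0, 1.0]), k=2, rho=0.01, beta=1.0, T=50)
P = U_init @ U_init.T
assert np.allclose(P, np.diag([1.0, 1.0, 0.0, 0.0, 0.0, 0.0]), atol=1e-6)
```

Each iteration is dominated by one eigendecomposition of a $d \times d$ matrix, which is the per-iteration overhead that motivates early stopping once the basin of attraction is reached.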
It considers the equivalent form of (10),

minimize $\{-\langle \hat{\Sigma}, \Pi \rangle + \rho\|\Phi\|_{1,1} \mid \Pi = \Phi,\ \Pi \in \mathcal{A},\ \Phi \in \mathcal{B}\}$, where $\mathcal{B} = \mathbb{R}^{d \times d}$,    (12)

and iteratively minimizes the augmented Lagrangian $L(\Pi, \Phi, \Theta) + \beta/2 \cdot \|\Pi - \Phi\|_F^2$, where

$L(\Pi, \Phi, \Theta) = -\langle \hat{\Sigma}, \Pi \rangle + \rho\|\Phi\|_{1,1} - \langle \Theta, \Pi - \Phi \rangle$, $\quad \Pi \in \mathcal{A}$, $\Phi \in \mathcal{B}$, $\Theta \in \mathbb{R}^{d \times d}$,    (13)

is the Lagrangian corresponding to (12), $\Theta \in \mathbb{R}^{d \times d}$ is the Lagrange multiplier associated with the equality constraint $\Pi = \Phi$, and $\beta > 0$ is a penalty parameter that enforces this equality constraint. Note that other variants of ADMM, e.g., the Peaceman-Rachford Splitting Method [30], are also applicable, and would yield similar theoretical guarantees along with improved practical performance.

4 Theoretical Results

To describe our results, we define the model class $\mathcal{M}_d(\Sigma, k, s^*)$ as follows:

$\mathcal{M}_d(\Sigma, k, s^*)$: $X = \Sigma^{1/2} Z$, where $Z \in \mathbb{R}^d$ is sub-Gaussian with mean zero, variance proxy less than 1, and covariance matrix $I_d$; the $k$-dimensional principal subspace $\mathcal{U}^*$ of $\Sigma$ is $s^*$-sparse; and $\lambda_k(\Sigma) - \lambda_{k+1}(\Sigma) > 0$.

Here $\Sigma^{1/2}$ satisfies $\Sigma^{1/2} \cdot \Sigma^{1/2} = \Sigma$. Recall that the sparsity of $\mathcal{U}^*$ is defined in (7) and $\lambda_j(\Sigma)$ is the $j$-th eigenvalue of $\Sigma$. For notational simplicity, hereafter we abbreviate $\lambda_j(\Sigma)$ as $\lambda_j$. This model class doesn't restrict $\Sigma$ to spiked covariance matrices, in which the $(k+1)$-th, ..., $d$-th eigenvalues of $\Sigma$ can only be identical. 
Moreover, we don't require $X$ to be exactly Gaussian, which is a crucial requirement in several previous works, e.g., [12, 18].

We first introduce some notation. Recall that $D(\cdot,\cdot)$ is the subspace distance defined in (6), and that $\gamma(\Sigma) < 1$ is defined in (4) and will be abbreviated as $\gamma$ hereafter. We define

$n_{\min} = C \cdot (s^*)^2 \log d \cdot \lambda_1^2 / [(\lambda_k - \lambda_{k+1})^2 \cdot \min\{\sqrt{k\gamma}(1 - \gamma^{1/2})/2,\ \sqrt{2\gamma}/4\}^2]$,    (14)

which denotes the required sample complexity. We also define

$\zeta_1 = [C\lambda_1/(\lambda_k - \lambda_{k+1})] \cdot s^*\sqrt{\log d/n}$, $\quad \zeta_2 = [4C'/\sqrt{\lambda_k - \lambda_{k+1}}] \cdot (k \cdot s^* \cdot d^2 \log d/n)^{1/4}$,    (15)

which will be used in the analysis of the first stage, and

$\xi_1 = C\sqrt{k} \cdot [\lambda_k/(\lambda_k - \lambda_{k+1})]^2 \cdot [\sqrt{\lambda_1 \lambda_{k+1}}/(\lambda_k - \lambda_{k+1})] \cdot \sqrt{s^* (k + \log d)/n}$,    (16)

which will be used in the analysis of the main stage. Meanwhile, recall that the radius $R$ of the basin of attraction for sparse orthogonal iteration pursuit is defined in (5). We define

$T_{\min} = \lceil \zeta_2^2/(R - \zeta_1)^2 \rceil$, $\quad \tilde{T}_{\min} = 4\lceil \log(R/\xi_1)/\log(1/\gamma) \rceil$    (17)

as the required minimum numbers of iterations of the two stages respectively. The following results are proved in the extended version [29] of this paper.

Main Result: Recall that $\mathcal{U}^{(t)}$ denotes the subspace spanned by the columns of $U^{(t)}$ in Algorithm 1.

Theorem 1. Let $x_1, \ldots$
, $x_n$ be independent realizations of $X \in \mathcal{M}_d(\Sigma, k, s^*)$ with $n \geq n_{\min}$, and let $\hat{\Sigma}$ be the sample covariance matrix. Suppose the regularization parameter is $\rho = C\lambda_1\sqrt{\log d/n}$ for a sufficiently large $C > 0$ in (10), and the penalty parameter $\beta$ of ADMM (Line 3 of Algorithm 3) is $\beta = d \cdot \rho/\sqrt{k}$. Also, suppose the sparsity parameter $\hat{s}$ in Algorithm 1 (Line 3) is chosen such that $\hat{s} = C \max\{\lceil 4k/(\gamma^{-1/2} - 1)^2 \rceil, 1\} \cdot s^*$, where $C \geq 1$ is an integer constant. After $T \geq T_{\min}$ iterations of Algorithm 3 and then $\tilde{T} \geq \tilde{T}_{\min}$ iterations of Algorithm 1, we obtain $\hat{\mathcal{U}} = \mathcal{U}^{(T+\tilde{T})}$ and

$D(\mathcal{U}^*, \hat{\mathcal{U}}) \leq C\xi_1 = C'\sqrt{k} \cdot [\lambda_k/(\lambda_k - \lambda_{k+1})]^2 \cdot [\sqrt{\lambda_1 \lambda_{k+1}}/(\lambda_k - \lambda_{k+1})] \cdot \sqrt{s^* (k + \log d)/n}$

with high probability. Here the equality follows from the definition of $\xi_1$ in (16).

Minimax-Optimality: To establish the optimality of Theorem 1, we consider a smaller model class $\tilde{\mathcal{M}}_d(\Sigma, k, s^*, \kappa)$, which is the same as $\mathcal{M}_d(\Sigma, k, s^*)$ except that the eigengap of $\Sigma$ satisfies $\lambda_k - \lambda_{k+1} > \kappa\lambda_k$ for some constant $\kappa > 0$. This condition is mild compared to previous works; e.g., [12] assumes $\lambda_k - \lambda_{k+1} \geq \kappa\lambda_1$, which is more restrictive because $\lambda_1 \geq \lambda_k$. Within $\tilde{\mathcal{M}}_d$, we assume that the rank $k$ of the principal subspace is fixed. 
This assumption is reasonable; e.g., in applications like population genetics [31], the rank $k$ of the principal subspace represents the number of population groups, which doesn't increase when the sparsity level $s^*$, dimension $d$ and sample size $n$ are growing.

Theorem 3.1 of [17] implies the following minimax lower bound:

$\inf_{\tilde{U}} \sup_{X \in \tilde{\mathcal{M}}_d(\Sigma, k, s^*)} \mathbb{E}\, D(\tilde{\mathcal{U}}, \mathcal{U}^*)^2 \geq C \lambda_1\lambda_{k+1}/(\lambda_k - \lambda_{k+1})^2 \cdot (s^* - k) \cdot \{k + \log[(d-k)/(s^*-k)]\}/n$,

where $\tilde{\mathcal{U}}$ denotes any principal subspace estimator. Suppose $s^*$ and $d$ are sufficiently large (to avoid trivial cases); then the right-hand side is lower bounded by $C'\lambda_1\lambda_{k+1}/(\lambda_k - \lambda_{k+1})^2 \cdot s^*(k + 1/4 \cdot \log d)/n$. By Lemma 2.1 in [29], we have $D(\mathcal{U}^*, \hat{\mathcal{U}}) \leq \sqrt{2k}$. For $n$, $d$ and $s^*$ sufficiently large, it is easy to derive the same upper bound in expectation from Theorem 1. It attains the minimax lower bound above within $\tilde{\mathcal{M}}_d(\Sigma, k, s^*, \kappa)$, up to the $1/4$ constant in front of $\log d$ and a total constant of $k \cdot \kappa^{-4}$.

Analysis of the Main Stage: Recall that $\mathcal{U}^{(t)}$ is the subspace spanned by the columns of $U^{(t)}$ in Algorithm 1, and the initialization is $U_{\mathrm{init}}$ while its column space is $\mathcal{U}_{\mathrm{init}}$.

Theorem 2. Under the same conditions as in Theorem 1, and provided that $D(\mathcal{U}^*, \mathcal{U}_{\mathrm{init}}) \leq R$, the iterative sequence $\mathcal{U}^{(T+1)}, \mathcal{U}^{(T+2)}, \ldots, \mathcal{U}^{(t)}, \ldots$ 
satisfies
\[
D\bigl(\mathcal{U}^*, \mathcal{U}^{(t)}\bigr) \le \underbrace{C\xi_1}_{\text{Statistical Error}} + \underbrace{\gamma^{(t-T-1)/4} \cdot \gamma^{-1/2} R}_{\text{Optimization Error}} \qquad (18)
\]
with high probability, where $\xi_1$ is defined in (16), $R$ is defined in (5), and $\gamma$ is defined in (4).

Theorem 2 shows that, as long as $\mathcal{U}^{\mathrm{init}}$ falls into its basin of attraction, the optimization error of sparse orthogonal iteration pursuit decays at a geometric rate, since $\gamma < 1$. According to the definition of $\gamma$ in (4), when $\lambda_k$ is close to $\lambda_{k+1}$, $\gamma$ is close to 1, and the optimization error term decays at a slower rate. Note that the optimization error does not increase with the dimension $d$, which makes this algorithm suitable for ultra-high-dimensional problems. In (18), when $t$ is sufficiently large that $\gamma^{(t-T-1)/4} \cdot \gamma^{-1/2} R \le \xi_1$, $D(\mathcal{U}^*, \mathcal{U}^{(t)})$ is upper bounded by $2C\xi_1$, which gives the optimal statistical rate. Solving this inequality for $t$, we obtain $t = \widetilde{T} \ge \widetilde{T}_{\min}$, which is defined in (17).

Analysis of the Initialization Stage: Let $\bar{\Pi}^{(t)} = 1/t \cdot \sum_{i=1}^t \Pi^{(i)}$, where $\Pi^{(i)}$ is defined in Algorithm 3, and let $\mathcal{U}^{(t)}$ be the $k$-dimensional subspace spanned by the top $k$ leading eigenvectors of $\bar{\Pi}^{(t)}$.

Theorem 3. Under the same condition as in Theorem 1, the iterative sequence of $k$-dimensional subspaces $\mathcal{U}^{(0)}, \mathcal{U}^{(1)}, \ldots, \mathcal{U}^{(t)}, \ldots$ satisfies
\[
D\bigl(\mathcal{U}^*, \mathcal{U}^{(t)}\bigr) \le \underbrace{\zeta_1}_{\text{Statistical Error}} + \underbrace{\zeta_2 \cdot 1/\sqrt{t}}_{\text{Optimization Error}} \qquad (19)
\]
with high probability.
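To make the "solving this inequality for $t$" step in the discussion of (18) concrete, the following sketch computes the smallest admissible main-stage iteration count both in closed form and by direct scanning. The numerical values of $\gamma$, $R$, $\xi_1$ and $T$ below are made-up placeholders for illustration only, not the quantities actually defined in (4), (5), (16) and (17).

```python
import math

# Placeholder values -- purely illustrative, NOT the paper's constants.
gamma = 0.5   # contraction factor, cf. (4); must satisfy 0 < gamma < 1
R = 1.0       # radius of the basin of attraction, cf. (5)
xi1 = 0.01    # statistical error level, cf. (16)
T = 10        # number of initialization-stage (ADMM) iterations

# Closed form: gamma^((t-T-1)/4) * gamma^(-1/2) * R <= xi1, with log(gamma) < 0,
# gives t >= T + 1 + 4 * log(R / (sqrt(gamma) * xi1)) / log(1 / gamma).
t_min = T + 1 + math.ceil(
    4 * math.log(R / (math.sqrt(gamma) * xi1)) / math.log(1 / gamma)
)

# Brute-force check: scan for the first t whose optimization error term
# gamma^((t-T-1)/4) * gamma^(-1/2) * R drops below xi1.
t = T + 1
while gamma ** ((t - T - 1) / 4) * gamma ** (-0.5) * R > xi1:
    t += 1

assert t_min == t  # both ways of solving the inequality agree
```

Beyond this iteration count, the bound (18) is dominated by the statistical term, i.e., $D(\mathcal{U}^*, \mathcal{U}^{(t)}) \le 2C\xi_1$.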
The quantities $\zeta_1$ and $\zeta_2$ in (19) are defined in (15).

In Theorem 3 the optimization error term decays to zero at the rate of $1/\sqrt{t}$. Note that $\zeta_2$ increases with $d$ at the rate of $\sqrt{d} \cdot (\log d)^{1/4}$. That is to say, the convex relaxation is computationally less efficient than sparse orthogonal iteration pursuit, which justifies the early stopping of ADMM. To ensure that $\mathcal{U}^{(T)}$ enters the basin of attraction, we need $\zeta_1 + \zeta_2/\sqrt{T} \le R$. Solving this inequality for $T$ gives $T \ge T_{\min}$, where $T_{\min}$ is defined in (17). The proof of Theorem 3 is a nontrivial combination of optimization and statistical analysis under the variational inequality framework; it is provided in detail in the extended version [29] of this paper.

[Figure 2: An illustration of main results, panels (a)-(e). See §5 for detailed experiment settings and the interpretation.]

Table 1: A comparison of subspace estimation error with existing sparse PCA procedures. The error is measured by $D(\mathcal{U}^*, \widehat{\mathcal{U}})$ defined in (6). Standard deviations are provided in parentheses.

Procedure                              | $D(\mathcal{U}^*, \widehat{\mathcal{U}})$ for Setting (i) | $D(\mathcal{U}^*, \widehat{\mathcal{U}})$ for Setting (ii)
Our Procedure                          | 0.32 (0.0067)  | 0.064 (0.00016)
Convex Relaxation [13]                 | 1.62 (0.0398)  | 0.57 (0.021)
TPower [11] + Deflation Method [24]    | 1.15 (0.1336)  | 0.01 (0.00042)
GPower [10] + Deflation Method [24]    | 1.84 (0.0226)  | 1.75 (0.029)
PathSPCA [8] + Deflation Method [24]   | 2.12 (0.0226)  | 2.10 (0.018)

(i): $d = 200$, $s^* = 10$, $k = 5$, $n = 50$, $\Sigma$'s eigenvalues are $\{100, 100, 100, 100, 4, 1, \ldots, 1\}$;
(ii): The same as (i) except $n = 100$ and $\Sigma$'s eigenvalues are $\{300, 240, 180, 120, 60, 1, \ldots, 1\}$.

5 Numerical Results

Figure 2 illustrates the main theoretical results. For (a)-(c), we set $d = 200$, $s^* = 10$, $k = 5$, $n = 100$, and $\Sigma$'s eigenvalues are $\{100, 100, 100, 100, 10, 1, \ldots, 1\}$.
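As a companion to this experimental setup, the following sketch builds a spiked covariance $\Sigma$ with the eigenvalues above, draws $n$ samples, and evaluates the subspace error of plain (non-sparse) PCA. Here $D$ is instantiated as the Frobenius distance between projection matrices, one common choice consistent with the bound $D \le \sqrt{2k}$ quoted earlier; the support choice (first $s^*$ coordinates) and the use of vanilla PCA are illustrative assumptions, not the paper's Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, k, n = 200, 10, 5, 100
spikes = np.array([100.0, 100.0, 100.0, 100.0, 10.0])

# s*-sparse orthonormal leading eigenvectors, supported on the first s
# coordinates (an illustrative support choice).
Q, _ = np.linalg.qr(rng.standard_normal((s, k)))
U_star = np.zeros((d, k))
U_star[:s] = Q

# Sigma has eigenvalues {100, 100, 100, 100, 10, 1, ..., 1} as in Section 5:
# spikes on span(U_star), eigenvalue 1 on its orthogonal complement.
Sigma = U_star @ np.diag(spikes - 1.0) @ U_star.T + np.eye(d)

# Draw n samples and form the sample covariance matrix.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
S_hat = X.T @ X / n

# Vanilla PCA estimate: top-k eigenvectors of the sample covariance.
_, V = np.linalg.eigh(S_hat)
U_hat = V[:, -k:]

# Subspace error as projection-matrix Frobenius distance (one instantiation of D).
D_err = np.linalg.norm(U_star @ U_star.T - U_hat @ U_hat.T, ord="fro")
assert 0.0 < D_err <= np.sqrt(2 * k)  # any two rank-k projectors satisfy this
```

With $n = 100 < d = 200$, plain PCA leaves a noticeable error; the two-stage procedure exploits the $s^*$-sparsity of $U^*$ to attain the $\sqrt{s^*(k+\log d)/n}$ rate instead.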
In detail, (a) illustrates the $1/\sqrt{t}$ decay of the optimization error at the initialization stage; (b) illustrates the decay of the total estimation error (in log-scale) at the main stage; (c) illustrates the basin of attraction phenomenon, as well as the geometric decay of the optimization error (in log-scale) of sparse orthogonal iteration pursuit, as characterized in §4. For (d) and (e), the eigenstructure is the same, while $d$, $n$ and $s^*$ take multiple values. They show that the theoretical $\sqrt{s^* \log d/n}$ statistical rate of our estimator is tight in practice.

In Table 1, we compare the subspace error of our procedure with existing methods, where all except our procedure and the convex relaxation [13] leverage the deflation method [24] for subspace estimation with $k > 1$. We consider two settings. Setting (i) is more challenging than setting (ii), since the top $k$ eigenvalues of $\Sigma$ are not distinct, the eigengap is small, and the sample size is smaller. Our procedure significantly outperforms the other existing methods on subspace recovery in both settings.

Acknowledgement: This research is partially supported by the grants NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.

References

[1] I. Johnstone, A. Lu. On consistency and sparsity for principal components analysis in high dimensions, Journal of the American Statistical Association 2009;104:682–693.

[2] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model, Statistica Sinica 2007;17:1617.

[3] B. Nadler.
Finite sample approximation results for principal component analysis: A matrix perturbation approach, The Annals of Statistics 2008:2791–2817.

[4] I. Jolliffe, N. Trendafilov, M. Uddin. A modified principal component technique based on the Lasso, Journal of Computational and Graphical Statistics 2003;12:531–547.

[5] A. d'Aspremont, L. E. Ghaoui, M. I. Jordan, G. R. Lanckriet. A direct formulation for sparse PCA using semidefinite programming, SIAM Review 2007:434–448.

[6] H. Zou, T. Hastie, R. Tibshirani. Sparse principal component analysis, Journal of Computational and Graphical Statistics 2006;15:265–286.

[7] H. Shen, J. Huang. Sparse principal component analysis via regularized low rank matrix approximation, Journal of Multivariate Analysis 2008;99:1015–1034.

[8] A. d'Aspremont, F. Bach, L. Ghaoui. Optimal solutions for sparse principal component analysis, The Journal of Machine Learning Research 2008;9:1269–1294.

[9] D. Witten, R. Tibshirani, T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis, Biostatistics 2009;10:515–534.

[10] M. Journée, Y. Nesterov, P. Richtárik, R. Sepulchre. Generalized power method for sparse principal component analysis, The Journal of Machine Learning Research 2010;11:517–553.

[11] X.-T. Yuan, T. Zhang. Truncated power method for sparse eigenvalue problems, The Journal of Machine Learning Research 2013;14:899–925.

[12] Z. Ma. Sparse principal component analysis and iterative thresholding, The Annals of Statistics 2013;41.

[13] V. Q. Vu, J. Cho, J. Lei, K. Rohe. Fantope projection and selection: A near-optimal convex relaxation of sparse PCA, in Advances in Neural Information Processing Systems:2670–2678 2013.

[14] A. Amini, M. Wainwright.
High-dimensional analysis of semidefinite relaxations for sparse principal components, The Annals of Statistics 2009;37:2877–2921.

[15] V. Q. Vu, J. Lei. Minimax rates of estimation for sparse PCA in high dimensions, in International Conference on Artificial Intelligence and Statistics:1278–1286 2012.

[16] A. Birnbaum, I. M. Johnstone, B. Nadler, D. Paul, et al. Minimax bounds for sparse PCA with noisy high-dimensional data, The Annals of Statistics 2013;41:1055–1084.

[17] V. Q. Vu, J. Lei. Minimax sparse principal subspace estimation in high dimensions, The Annals of Statistics 2013;41:2905–2947.

[18] T. T. Cai, Z. Ma, Y. Wu, et al. Sparse PCA: Optimal rates and adaptive estimation, The Annals of Statistics 2013;41:3074–3110.

[19] Q. Berthet, P. Rigollet. Optimal detection of sparse principal components in high dimension, The Annals of Statistics 2013;41:1780–1815.

[20] Q. Berthet, P. Rigollet. Complexity theoretic lower bounds for sparse principal component detection, in COLT:1046–1066 2013.

[21] J. Lei, V. Q. Vu. Sparsistency and agnostic inference in sparse PCA, arXiv:1401.6978 2014.

[22] B. Moghaddam, Y. Weiss, S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms, Advances in Neural Information Processing Systems 2006;18:915.

[23] K. Ball. An elementary introduction to modern convex geometry, Flavors of Geometry 1997;31:1–58.

[24] L. Mackey. Deflation methods for sparse PCA, Advances in Neural Information Processing Systems 2009;21:1017–1024.

[25] Z. Wang, H. Liu, T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems, The Annals of Statistics 2014;42:2164–2201.

[26] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein.
Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning 2011;3:1–122.

[27] G. H. Golub, C. F. Van Loan. Matrix Computations. Johns Hopkins University Press 2012.

[28] R. Arora, A. Cotter, K. Livescu, N. Srebro. Stochastic optimization for PCA and PLS, in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on:861–868, IEEE 2012.

[29] Z. Wang, H. Lu, H. Liu. Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time, arXiv:1408.5352 2014.

[30] B. He, H. Liu, Z. Wang, X. Yuan. A strictly contractive Peaceman–Rachford splitting method for convex programming, SIAM Journal on Optimization 2014;24:1011–1040.

[31] B. E. Engelhardt, M. Stephens. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis, PLoS Genetics 2010;6:e1001117.