{"title": "Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1941, "page_last": 1949, "abstract": "In this paper we focus on the principal component regression and its application to high dimension non-Gaussian data. The major contributions are in two folds. First, in low dimensions and under a double asymptotic framework where both the dimension $d$ and sample size $n$ can increase, by borrowing the strength from recent development in minimax optimal principal component estimation, we first time sharply characterize the potential advantage of classical principal component regression over least square estimation under the Gaussian model. Secondly, we propose and analyze a new robust sparse principal component regression on high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as multivariate Gaussian, rank-deficient Gaussian, $t$, Cauchy, and logistic. It allows the random vector to be heavy tailed and have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients in the optimal parametric rate and therefore is a good alternative to the Gaussian based methods. 
Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method.", "full_text": "Robust Sparse Principal Component Regression under the High Dimensional Elliptical Model\n\nFang Han\nDepartment of Biostatistics\nJohns Hopkins University\nBaltimore, MD 21210\nfhan@jhsph.edu\n\nHan Liu\nDepartment of Operations Research and Financial Engineering\nPrinceton University\nPrinceton, NJ 08544\nhanliu@princeton.edu\n\nAbstract\n\nIn this paper we focus on principal component regression and its application to high dimensional non-Gaussian data. The major contributions are twofold. First, in low dimensions and under the Gaussian model, by borrowing strength from recent developments in minimax optimal principal component estimation, we sharply characterize, for the first time, the potential advantage of classical principal component regression over least square estimation. Secondly, we propose and analyze a new robust sparse principal component regression for high dimensional elliptically distributed data. The elliptical distribution is a semiparametric generalization of the Gaussian, including many well known distributions such as the multivariate Gaussian, rank-deficient Gaussian, t, Cauchy, and logistic. It allows the random vector to be heavy tailed and to have tail dependence. These extra flexibilities make it very suitable for modeling finance and biomedical imaging data. Under the elliptical model, we prove that our method can estimate the regression coefficients at the optimal parametric rate and is therefore a good alternative to Gaussian-based methods. Experiments on synthetic and real world data are conducted to illustrate the empirical usefulness of the proposed method.\n\n1 Introduction\n\nPrincipal component regression (PCR) has been widely used in statistics for years (Kendall, 1968). Take the classical linear regression with random design for example. 
Let x1, . . . , xn ∈ Rd be n independent realizations of a random vector X ∈ Rd with mean 0 and covariance matrix Σ. The classical linear regression model and the simple principal component regression model can be written as follows:\n\n(Classical linear regression model) Y = Xβ + ε;\n(Principal component regression model) Y = αXu1 + ε, (1.1)\n\nwhere X = (x1, . . . , xn)T ∈ Rn×d, Y ∈ Rn, u1 is the leading eigenvector of Σ, ε ~ Nn(0, σ2 Id) is independent of X, β ∈ Rd, and α ∈ R. Here Id ∈ Rd×d is the identity matrix. Principal component regression can then be conducted in two steps: first, we obtain an estimator û1 of u1; secondly, we project the data in the direction of û1 and solve a simple linear regression to estimate α.\n\nBy checking Equation (1.1), it is easy to observe that the principal component regression model is a subset of the general linear regression (LR) model with the constraint that the regression coefficient β is proportional to u1. There has been much discussion of the advantage of principal component regression over classical linear regression. In low dimensional settings, Massy (1965) pointed out that principal component regression can be much more efficient in handling collinearity among predictors than linear regression. More recently, Cook (2007) and Artemiou and Li (2009) argued that principal component regression has the potential to play an even more important role. In particular, letting ûj be the j-th leading eigenvector of the sample covariance matrix Σ̂ of x1, . . . , xn, Artemiou and Li (2009) show that, under mild conditions, with high probability the correlation between the response Y and Xûi is higher than or equal to the correlation between Y and Xûj when i < j. 
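As a concrete illustration, the two-step procedure above can be sketched as follows. This is a minimal numpy sketch, not the authors' code; the spiked diagonal toy covariance and all variable names are our own choices:

```python
import numpy as np

def pcr_fit(X, Y):
    """Two-step principal component regression sketch.

    Step 1: estimate u1 by the leading eigenvector of the sample covariance.
    Step 2: one-dimensional least squares of Y on the projection X @ u1_hat.
    """
    n = X.shape[0]
    S = X.T @ X / n                    # sample covariance (mean-zero design assumed)
    eigvals, eigvecs = np.linalg.eigh(S)
    u1_hat = eigvecs[:, -1]            # eigh returns eigenvalues in ascending order
    Z = X @ u1_hat                     # projected predictor
    alpha_hat = (Z @ Y) / (Z @ Z)      # least squares in one dimension
    return alpha_hat * u1_hat          # sign ambiguity of u1_hat cancels in the product

# Toy check under model (1.1): Y = alpha * X u1 + eps, spiked diagonal Sigma.
rng = np.random.default_rng(0)
n, d, alpha = 500, 10, 1.0
lam = np.array([10.0] + [1.0] * (d - 1))      # lambda_1 = 10, so u1 = e_1
X = rng.standard_normal((n, d)) * np.sqrt(lam)
Y = alpha * X[:, 0] + rng.standard_normal(n)
beta_pcr = pcr_fit(X, Y)
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]   # classical least squares
```

With a well-separated leading eigenvalue, both estimators are accurate on this toy example; the contrast appears under collinearity or a growing dimension, as discussed below.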
This indicates, although not rigorously, that principal component regression may be able to borrow strength from the low rank structure of Σ, which motivates our work.\n\nEven though the statistical performance of principal component regression in low dimensions is not fully understood, there is even less analysis of principal component regression in high dimensions, where the dimension d can be exponentially larger than the sample size n. This is partially due to the fact that estimating the leading eigenvectors of Σ is itself difficult. For example, Johnstone and Lu (2009) show that, even under the Gaussian model, when d/n → γ for some γ > 0, there exist multiple settings under which û1 can be an inconsistent estimator of u1. To attack this "curse of dimensionality", one solution is to add a sparsity assumption on u1, leading to various versions of sparse PCA. See Zou et al. (2006); d'Aspremont et al. (2007); Moghaddam et al. (2006), among others. Under (sub)Gaussian settings, minimax optimal rates have been established for estimating u1, . . . , um (Vu and Lei, 2012; Ma, 2013; Cai et al., 2013). Very recently, Han and Liu (2013b) relaxed the Gaussian assumption by conducting a scale invariant version of sparse PCA (i.e., estimating the leading eigenvector of the correlation instead of the covariance matrix). However, that approach cannot easily be applied to estimate u1, and the rate of convergence they proved is not the parametric rate.\n\nThis paper improves upon the aforementioned results in two directions. 
First, with regard to classical principal component regression, under a double asymptotic framework in which d is allowed to increase with n, by borrowing very recent developments in principal component analysis (Vershynin, 2010; Lounici, 2012; Bunea and Xiao, 2012), we for the first time explicitly show the advantage of principal component regression over classical linear regression. We explicitly confirm the following two advantages of principal component regression: (i) principal component regression is insensitive to collinearity, while linear regression is very sensitive to it; (ii) principal component regression can utilize the low rank structure of the covariance matrix Σ, while linear regression cannot.\n\nSecondly, in high dimensions where d can increase much faster, even exponentially faster, than n, we propose a robust method for conducting (sparse) principal component regression under a non-Gaussian elliptical model. The elliptical distribution is a semiparametric generalization of the Gaussian, relaxing the light tail and zero tail dependence constraints, but preserving the symmetry property. We refer to Klüppelberg et al. (2007) for more details. This distribution family includes many well known distributions such as the multivariate Gaussian, rank deficient Gaussian, t, logistic, and many others. Under the elliptical model, we exploit a result of Han and Liu (2013a), who showed that by utilizing a robust covariance matrix estimator, the multivariate Kendall's tau, we can obtain an estimator ũ1 which recovers u1 at the optimal parametric rate shown in Vu and Lei (2012). We then exploit ũ1 in conducting principal component regression and show that the obtained estimator β̌ can estimate β at the optimal √(s log d/n) rate. 
The optimal rates in estimating u1 and β, combined with the discussion of classical principal component regression, indicate that the proposed method has the potential to handle high dimensional complex data and has advantages over high dimensional linear regression methods such as ridge regression and the lasso. These theoretical results are also backed up by numerical experiments on both synthetic and real world equity data.\n\n2 Classical Principal Component Regression\n\nThis section is devoted to the discussion of the advantage of classical principal component regression over classical linear regression. We start with a brief introduction of notation. Let M = [Mij] ∈ Rd×d and v = (v1, ..., vd)T ∈ Rd. We denote by vI the subvector of v whose entries are indexed by a set I. We also denote by MI,J the submatrix of M whose rows are indexed by I and whose columns are indexed by J. Let MI∗ and M∗J be the submatrices of M with rows indexed by I and with columns indexed by J, respectively. Let supp(v) := {j : vj ≠ 0}. For 0 < q < ∞, we define the ℓ0, ℓq, and ℓ∞ vector norms as\n\n‖v‖0 := card(supp(v)), ‖v‖q := (∑_{i=1}^d |vi|^q)^{1/q} and ‖v‖∞ := max_{1≤i≤d} |vi|.\n\nLet Tr(M) be the trace of M. Let λj(M) be the j-th largest eigenvalue of M and Θj(M) the corresponding leading eigenvector. In particular, we let λmax(M) := λ1(M) and λmin(M) := λd(M). We define Sd−1 := {v ∈ Rd : ‖v‖2 = 1} to be the d-dimensional unit sphere. We define the matrix ℓmax norm and ℓ2 norm as ‖M‖max := max{|Mij|} and ‖M‖2 := sup_{v∈Sd−1} ‖Mv‖2. We define diag(M) to be the diagonal matrix with [diag(M)]jj = Mjj for j = 1, . . . , d. We denote vec(M) := (MT∗1, . . . 
, MT∗d)T . For any two sequences {an} and {bn}, we denote an ≍(c,C) bn if there exist two fixed constants c, C such that c ≤ an/bn ≤ C.\n\nLet x1, . . . , xn ∈ Rd be n independent observations of a d-dimensional random vector X ~ Nd(0, Σ), let u1 := Θ1(Σ), and let ε1, . . . , εn ~ N1(0, σ2) be independent of each other and of {xi}_{i=1}^n. We suppose that the following principal component regression model holds:\n\nY = αXu1 + ε, (2.1)\n\nwhere Y = (Y1, . . . , Yn)T ∈ Rn, X = [x1, . . . , xn]T ∈ Rn×d, and ε = (ε1, . . . , εn)T ∈ Rn. We are interested in estimating the regression coefficient β := αu1.\n\nLet β̂ represent the classical least square estimator, which does not take into account the information that β is proportional to u1. β̂ can be expressed as follows:\n\nβ̂ := (XT X)−1 XT Y . (2.2)\n\nWe then have the following proposition, which shows that the mean square error of β̂ − β is highly related to the scale of λmin(Σ).\n\nProposition 2.1. Under the principal component regression model shown in (2.1), we have\n\nE‖β̂ − β‖2^2 = (σ2/(n − d − 1)) · (1/λ1(Σ) + ··· + 1/λd(Σ)).\n\nProposition 2.1 reflects the vulnerability of the least square estimator to collinearity. More specifically, when λd(Σ) is extremely small, going to zero at the scale of O(1/n), β̂ can be an inconsistent estimator even when d is fixed. On the other hand, using the Markov inequality, when λd(Σ) is lower bounded by a fixed constant and d = o(n), the rate of convergence of β̂ is well known to be OP(√(d/n)).\n\nMotivated by Equation (2.1), the classical principal component regression estimator can be elaborated as follows. (1) We first estimate u1 by the leading eigenvector û1 of the sample covariance Σ̂ := (1/n) ∑_i xi xiT . (2) We then estimate α ∈ R in Equation (2.1) by standard least square estimation on the projected data Ẑ := Xû1 ∈ Rn:\n\nα̃ := (ẐT Ẑ)−1 ẐT Y .\n\nThe final principal component regression estimator β̃ is then obtained as β̃ = α̃û1. We then have the following important theorem, which provides a rate of convergence for β̃ in approximating β.\n\nTheorem 2.2. Let r∗(Σ) := Tr(Σ)/λmax(Σ) represent the effective rank of Σ (Vershynin, 2010). Suppose that\n\n√(r∗(Σ) log d / n) = o(1).\n\nUnder the model (2.1), when λmax(Σ) > c1 and λ2(Σ)/λ1(Σ) < C1 < 1 for some fixed constants C1 and c1, we have\n\n‖β̃ − β‖2 = OP( (α + 1/√λmax(Σ)) · √‖Σ‖2 · { √(1/n) + √(r∗(Σ) log d / n) } ). (2.3)\n\nTheorem 2.2, compared to Proposition 2.1, provides several important messages on the performance of principal component regression. 
First, compared to the least square estimator β̂, β̃ is insensitive to collinearity in the sense that λmin(Σ) plays no role in the rate of convergence of β̃. Secondly, when λmin(Σ) is lower bounded by a fixed constant and α is upper bounded by a fixed constant, the rate of convergence of β̂ is OP(√(d/n)) while that of β̃ is OP(√(r∗(Σ) log d / n)), where r∗(Σ) := Tr(Σ)/λmax(Σ) ≤ d and is of order o(d) when Σ has a low rank structure. These two observations, combined together, illustrate the advantages of classical principal component regression over least square estimation and justify its use. One more thing should be noted: the performance of β̃, unlike that of β̂, depends on α. When α is small, β̃ can estimate β more accurately.\n\nThese three observations are verified in Figure 1. Here the data are generated according to Equation (2.1) and we set n = 100, d = 10, Σ to be a diagonal matrix with descending diagonal values Σii = λi, and σ2 = 1. In Figure 1(A), we set α = 1, λ1 = 10, λj = 1 for j = 2, . . . , d − 1, and change λd from 1 to 1/100; in Figure 1(B), we set α = 1, λj = 1 for j = 2, . . . , d and change λ1 from 1 to 100; in Figure 1(C), we set λ1 = 10, λj = 1 for j = 2, . . . , d, and change α from 0.1 to 10. In the three panels, the empirical mean square error is plotted against 1/λd, λ1, and α. It can be observed that the results, one by one, match the theory.\n\nFigure 1: Justification of Proposition 2.1 and Theorem 2.2. The empirical mean square errors are plotted against 1/λd, λ1, and α separately in (A), (B), and (C). 
Here the results of classical linear regression and principal component regression are marked by a black solid line and a red dotted line, respectively.\n\n3 Robust Sparse Principal Component Regression under the Elliptical Model\n\nIn this section, we propose a new principal component regression method. We generalize the settings of the classical principal component regression discussed in the last section in two directions: (i) we consider the high dimensional setting where the dimension d can be much larger than the sample size n; (ii) in modeling the predictors x1, . . . , xn, we consider the more general elliptical distribution family instead of the Gaussian. The elliptical family can capture characteristics such as heavy tails and tail dependence, making it more suitable for analyzing complex datasets in finance, genomics, and biomedical imaging.\n\n3.1 Elliptical Distribution\n\nIn this section we define the elliptical distribution and introduce its basic properties. We write X =d Y if the random vectors X and Y have the same distribution. Here we only consider continuous random vectors with existing densities. To our knowledge, there are essentially four ways to define a continuous elliptical distribution with density. The most intuitive is the following: a random vector X ∈ Rd is said to follow an elliptical distribution ECd(µ, Σ, ξ) if and only if there exist a random variable ξ > 0 (a.s.) and a Gaussian vector Z ~ Nd(0, Σ) such that\n\nX =d µ + ξZ. (3.1)\n\nNote that here ξ is not necessarily independent of Z. Accordingly, the elliptical distribution can be regarded as a semiparametric generalization of the Gaussian distribution, with nonparametric part ξ. Because ξ can be very heavy tailed, X can also be very heavy tailed. Moreover, when Eξ2 exists, we have\n\nCov(X) = Eξ2 Σ and Θj(Cov(X)) = Θj(Σ) for j = 1, . . . 
, d.\n\nThis implies that, when Eξ2 exists, to recover u1 := Θ1(Cov(X)) we only need to recover Θ1(Σ). Here Σ is conventionally called the scatter matrix.\n\nWe would like to point out that the elliptical family is significantly larger than the Gaussian. In fact, the Gaussian is fully parameterized by finite dimensional parameters (mean and variance). In contrast, the elliptical is a semiparametric family (the elliptical density can be represented as g((x − µ)T Σ−1 (x − µ)), where the function g(·) is completely unspecified). If we consider the "volumes" of the elliptical family and the Gaussian family with respect to the Lebesgue reference measure, the volume of the Gaussian family is zero (like a line in a 3-dimensional space), while the volume of the elliptical family is positive (like a ball in a 3-dimensional space).\n\n3.2 Multivariate Kendall's tau\n\nAs an important step in conducting principal component regression, we need to estimate u1 = Θ1(Cov(X)) = Θ1(Σ) as accurately as possible. Since the random variable ξ in Equation (3.1) can be very heavy tailed, the corresponding elliptically distributed random vector can be heavy tailed. Therefore, as has been pointed out by various authors (Tyler, 1987; Croux et al., 2002; Han and Liu, 2013b), the leading eigenvector of the sample covariance matrix Σ̂ can be a bad estimator of u1 = Θ1(Σ) under the elliptical distribution. This motivates developing robust estimators. In particular, in this paper we consider the multivariate Kendall's tau (Choi and Marden, 1998), recently studied in depth by Han and Liu (2013a). In the following we give a brief description of this estimator. Let X ~ ECd(µ, Σ, ξ) and let X̃ be an independent copy of X. The population multivariate Kendall's tau matrix, denoted by K ∈ Rd×d, is defined as:\n\nK := E( (X − X̃)(X − X̃)T / ‖X − X̃‖2^2 ). (3.2)\n\nLet x1, . . . , xn be n independent observations of X. The sample version of the multivariate Kendall's tau is accordingly defined as\n\nK̂ = (1/(n(n − 1))) ∑_{i≠j} (xi − xj)(xi − xj)T / ‖xi − xj‖2^2, (3.3)\n\nand we have E(K̂) = K. K̂ is a matrix version of a U-statistic, and it is easy to see that maxjk |Kjk| ≤ 1 and maxjk |K̂jk| ≤ 1. Therefore, K̂ is a bounded matrix and hence can be a nicer statistic than the sample covariance matrix. Moreover, we have the following important proposition, coming from Oja (2010), showing that K has the same eigenspace as Σ and Cov(X).\n\nProposition 3.1 (Oja (2010)). Let X ~ ECd(µ, Σ, ξ) be a continuous distribution and K be the population multivariate Kendall's tau statistic. Then if λj(Σ) ≠ λk(Σ) for any k ≠ j, we have\n\nΘj(Σ) = Θj(K) and λj(K) = E( λj(Σ)Uj^2 / (λ1(Σ)U1^2 + . . . + λd(Σ)Ud^2) ), (3.4)\n\nwhere U := (U1, . . . , Ud)T follows the uniform distribution on Sd−1. 
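For illustration, the sample version (3.3) can be sketched as follows. This is a minimal numpy sketch under a toy spiked scatter of our own choosing, not the authors' implementation:

```python
import numpy as np

def kendall_tau_matrix(X):
    """Sample multivariate Kendall's tau:
    K_hat = 1/(n(n-1)) * sum_{i != j} (x_i - x_j)(x_i - x_j)^T / ||x_i - x_j||_2^2.
    Each rank-one term has unit trace, so trace(K_hat) = 1 exactly."""
    n, d = X.shape
    K = np.zeros((d, d))
    for i in range(n):
        for j in range(i + 1, n):
            diff = X[i] - X[j]
            K += 2.0 * np.outer(diff, diff) / (diff @ diff)  # counts (i,j) and (j,i)
    return K / (n * (n - 1))

# Toy check: spiked diagonal scatter; the leading eigenvector of K_hat
# should align with e_1 even though all entries of K_hat are bounded by 1.
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d)) * np.sqrt(np.array([10.0, 1, 1, 1, 1]))
K_hat = kendall_tau_matrix(X)
v1 = np.linalg.eigh(K_hat)[1][:, -1]   # leading eigenvector
```

The self-normalization of each pairwise difference is what makes the statistic insensitive to the heavy-tailed multiplier ξ.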
In particular, when Eξ2 exists, Θj(Cov(X)) = Θj(K).\n\n3.3 Model and Method\n\nIn this section we present our model and the proposed method for conducting high dimensional (sparse) principal component regression on non-Gaussian data. As in Section 2, we consider the classical simple principal component regression model:\n\nY = αXu1 + ε = α[x1, . . . , xn]T u1 + ε.\n\nTo relax the Gaussian assumption, we assume that both x1, . . . , xn ∈ Rd and ε1, . . . , εn ∈ R are elliptically distributed, with xi ~ ECd(0, Σ, ξ). To allow the dimension d to increase much faster than n, we impose a sparsity structure on u1 = Θ1(Σ). Moreover, to make u1 identifiable, we assume that λ1(Σ) ≠ λ2(Σ). Thus, the formal model of the robust sparse principal component regression considered in this paper is as follows:\n\nMd(Y , ε; Σ, ξ, s) : { Y = αXu1 + ε; x1, . . . , xn ~ ECd(0, Σ, ξ), ‖Θ1(Σ)‖0 ≤ s, λ1(Σ) ≠ λ2(Σ). (3.5)\n\nThe robust sparse principal component regression can then be elaborated as a two step procedure:\n\n(i) Inspired by the model Md(Y , ε; Σ, ξ, s) and Proposition 3.1, we consider the following optimization problem to estimate u1 := Θ1(Σ):\n\nũ1 = arg max_{v∈Rd} vT K̂ v, subject to v ∈ Sd−1 ∩ B0(s), (3.6)\n\nwhere B0(s) := {v ∈ Rd : ‖v‖0 ≤ s} and K̂ is the estimated multivariate Kendall's tau matrix. The corresponding global optimum is denoted by ũ1. Using Proposition 3.1, ũ1 is also an estimator of Θ1(Cov(X)) whenever the covariance matrix exists.\n\n(ii) We then estimate α ∈ R in Equation (3.5) by standard least square estimation on the projected data Z̃ := Xũ1 ∈ Rn:\n\nα̌ := (Z̃T Z̃)−1 Z̃T Y .\n\nThe final principal component regression estimator β̌ is then obtained as β̌ = α̌ũ1.\n\n3.4 Theoretical Properties\n\nTheorem 2.2 shows that how accurately we estimate u1 plays an important role in principal component regression. Following this discussion and the very recent results in Han and Liu (2013a), the following "easiest" and "hardest" conditions are considered. Here κL, κU are two constants larger than 1.\n\nCondition 1 ("Easiest"): λ1(Σ) ≍(1,κU) d·λj(Σ) for any j ∈ {2, . . . , d} and λ2(Σ) ≍(1,κU) λj(Σ) for any j ∈ {3, . . . , d};\nCondition 2 ("Hardest"): λ1(Σ) ≍(κL,κU) λj(Σ) for any j ∈ {2, . . . , d}.\n\nIn the sequel, we say that the model Md(Y , ε; Σ, ξ, s) holds if the data (Y , X) are generated from the model Md(Y , ε; Σ, ξ, s). Under Conditions 1 and 2, we then have the following theorem, which shows that under certain conditions ‖β̌ − β‖2 = OP(√(s log d/n)), the optimal parametric rate in estimating the regression coefficient (Ravikumar et al., 2008).\n\nTheorem 3.2. Let the model Md(Y , ε; Σ, ξ, s) hold, let |α| in Equation (3.5) be upper bounded by a constant, and let ‖Σ‖2 be lower bounded by a constant. 
Then under Condition 1 or Condition 2, and for any random vector X such that\n\nmax_{v∈Sd−1, ‖v‖0≤2s} |vT (Σ̂ − Σ) v| = oP(1),\n\nthe robust principal component regression estimator β̌ satisfies\n\n‖β̌ − β‖2 = OP( √(s log d / n) ).\n\nFigure 2: Curves of averaged estimation errors between the estimates and the true parameters for different distributions (normal, multivariate-t, EC1, and EC2, from left to right) using the truncated power method. Here n = 100, d = 200, and we are interested in estimating the regression coefficient β. The horizontal axis represents the cardinality of the estimate's support set and the vertical axis represents the empirical mean square error. From left to right, the minimum mean square errors for the lasso are 0.53, 0.55, 1, and 1.\n\n4 Experiments\n\nIn this section we conduct studies on both synthetic and real-world data to investigate the empirical performance of the robust sparse principal component regression proposed in this paper. We use the truncated power algorithm proposed in Yuan and Zhang (2013) to approximate the global optimum ũ1 of (3.6). Here the cardinalities of the support sets of the leading eigenvectors are treated as tuning parameters. 
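A minimal sketch of this pipeline (truncated power iteration on the Kendall's tau matrix, then least squares on the projection) is given below; the dense-eigenvector initialization and all toy parameters are our own choices, not the authors' implementation:

```python
import numpy as np

def kendall_tau_matrix(X):
    """Sample multivariate Kendall's tau, vectorized over pairs."""
    n = X.shape[0]
    D = X[:, None, :] - X[None, :, :]            # pairwise differences
    sq = np.einsum('ijk,ijk->ij', D, D)
    np.fill_diagonal(sq, 1.0)                    # diagonal terms are zero anyway
    W = D / np.sqrt(sq)[:, :, None]
    return np.einsum('ija,ijb->ab', W, W) / (n * (n - 1))

def truncated_power(A, s, n_iter=200):
    """Approximate argmax { v^T A v : ||v||_2 = 1, ||v||_0 <= s } by power
    iterations with hard truncation to the s largest-magnitude coordinates."""
    v = np.linalg.eigh(A)[1][:, -1]              # dense leading eigenvector as init
    for _ in range(n_iter):
        w = A @ v
        keep = np.argsort(np.abs(w))[-s:]
        v = np.zeros_like(w)
        v[keep] = w[keep]
        v /= np.linalg.norm(v)
    return v

def rpcr_fit(X, Y, s):
    """Robust sparse PCR sketch: sparse leading eigenvector of K_hat,
    then one-dimensional least squares on the projected data."""
    u1_t = truncated_power(kendall_tau_matrix(X), s)
    Z = X @ u1_t
    return ((Z @ Y) / (Z @ Z)) * u1_t

# Toy check with heavy-tailed (multivariate-t, 3 dof) predictors and a
# 5-sparse leading eigenvector supported on the first five coordinates.
rng = np.random.default_rng(2)
n, d, s = 200, 40, 5
u1 = np.zeros(d)
u1[:5] = 1 / np.sqrt(5)
A = np.linalg.cholesky(np.eye(d) + 24.0 * np.outer(u1, u1))
Z0 = rng.standard_normal((n, d)) @ A.T
X = Z0 / np.sqrt(rng.chisquare(3, size=(n, 1)) / 3)   # t_3 rows
Y = X @ u1 + rng.standard_normal(n)
beta_check = rpcr_fit(X, Y, s)
```

The hard-truncation step is the simplest member of the truncated power family; in practice the support size s is tuned, as described above.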
The following three methods are considered:\n\nlasso: the classical L1 penalized regression;\nPCR: the sparse principal component regression using the sample covariance matrix as the sufficient statistic and exploiting the truncated power algorithm to estimate u1;\nRPCR: the robust sparse principal component regression proposed in this paper, using the multivariate Kendall's tau as the sufficient statistic and exploiting the truncated power algorithm to estimate u1.\n\n4.1 Simulation Study\n\nIn this section, we conduct a simulation study to back up the theoretical results and further investigate the empirical performance of the proposed robust sparse principal component regression method. We first consider generating the data matrix X. To generate X, we need to specify Σ and ξ. In detail, let ω1 > ω2 > ω3 = . . . = ωd be the eigenvalues and u1, . . . , ud the eigenvectors of Σ, with uj := (uj1, . . . , ujd)T . The top 2 leading eigenvectors u1, u2 of Σ are specified to be sparse with sj := ‖uj‖0 and ujk = 1/√sj for k ∈ [1 + ∑_{i=1}^{j−1} si, ∑_{i=1}^{j} si] and zero otherwise. Σ is generated as Σ = ∑_{j=1}^{2} (ωj − ωd) uj ujT + ωd Id. Across all settings, we let s1 = s2 = 10, ω1 = 5.5, ω2 = 2.5, and ωj = 0.5 for all j = 3, . . . , d. With Σ, we then consider the following four different elliptical distributions:\n\n(Normal) X ~ ECd(0, Σ, ζ1) with ζ1 =d χd, where χd is the chi distribution with d degrees of freedom: for Y1, . . . , Yd i.i.d. N(0, 1), √(Y1^2 + . . . + Yd^2) =d χd. In this setting, X follows the Gaussian distribution (Fang et al., 1990).\n\n(Multivariate-t) X ~ ECd(0, Σ, ζ2) with ζ2 =d √κ · ξ1∗/ξ2∗, where ξ1∗ =d χd and ξ2∗ =d χκ with κ ∈ Z+. In this setting, X follows a multivariate-t distribution with κ degrees of freedom (Fang et al., 1990). Here we consider κ = 3.\n\n(EC1) X ~ ECd(0, Σ, ζ3) with ζ3 ~ F(d, 1), an F distribution.\n\n(EC2) X ~ ECd(0, Σ, ζ4) with ζ4 ~ Exp(1), an exponential distribution.\n\nWe then simulate x1, . . . , xn from X, forming a data matrix X. Secondly, we let Y = Xu1 + ε, where ε ~ Nn(0, In). This produces the data (Y , X). We repeatedly generate the data according to the four distributions discussed above 1,000 times. To show the estimation accuracy, Figure 2 plots the empirical mean square error between the estimate and the true regression coefficient β against the number of estimated nonzero entries, for PCR and RPCR, under different schemes of (n, d), Σ, and different distributions. Here we consider n = 100 and d = 200. Note that we do not plot the results of the lasso in Figure 2. As discussed in Section 2, and especially as shown in Figure 1, linear regression and principal component regression have their own advantages in different settings. More specifically, we do not plot the results of the lasso here simply because it performs poorly under our simulation settings. For example, under the Gaussian setting with n = 100 and d = 200, the lowest mean square error for the lasso is 0.53 and the errors are on average above 1.5, while for RPCR the lowest is 0.13 and the errors are on average below 1.\n\nFigure 2 shows that when the data are non-Gaussian but follow an elliptical distribution, RPCR consistently outperforms PCR in terms of estimation accuracy. 
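The Normal and Multivariate-t designs above can be generated through the radius-times-direction representation of an elliptical vector. The sketch below is our reading of the construction (function names and sample sizes are ours):

```python
import numpy as np

def elliptical_sample(n, Sigma, radius_fn, seed=0):
    """Draw n samples X = (xi * U) A^T, with U uniform on the unit sphere,
    xi a positive 'radius' variable, and A A^T = Sigma."""
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    A = np.linalg.cholesky(Sigma)
    G = rng.standard_normal((n, d))
    U = G / np.linalg.norm(G, axis=1, keepdims=True)   # uniform directions
    xi = radius_fn(rng, n)                             # radius law sets the tails
    return (xi[:, None] * U) @ A.T

d = 20
Sigma = np.eye(d)
# (Normal) xi = chi_d: chi_d * U is exactly a standard Gaussian vector.
normal = elliptical_sample(2000, Sigma, lambda rng, n: np.sqrt(rng.chisquare(d, n)))
# (Multivariate-t, kappa = 3 dof) xi = sqrt(kappa) * chi_d / chi_kappa.
kappa = 3
mvt = elliptical_sample(
    2000, Sigma,
    lambda rng, n: np.sqrt(kappa * rng.chisquare(d, n) / rng.chisquare(kappa, n)))
```

The EC1 and EC2 designs replace the radius law by F(d, 1) or Exp(1) draws in the same way.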
Moreover, when the data are indeed normally distributed, there is no obvious difference between RPCR and PCR, indicating that RPCR is a safe alternative to the classical sparse principal component regression.\n\nFigure 3: (A) Quantile vs. quantile plot of the log-return values for the stock "Goldman Sachs". (B) Prediction error against the number of features selected. The scale of the prediction errors is enlarged by 100 times for better visualization.\n\n4.2 Application to Equity Data\n\nIn this section we apply the proposed robust sparse principal component regression and the other two methods to stock price data from Yahoo! Finance (finance.yahoo.com). We collect the daily closing prices of 452 stocks that were consistently in the S&P 500 index between January 1, 2003 and January 1, 2008. This gives us altogether T = 1,257 data points; each data point corresponds to the vector of closing prices on a trading day. Let S = [St,j] denote the closing price of stock j on day t. We are interested in the log return data X = [Xtj] with Xtj = log(St,j/St−1,j).\n\nWe first show that this data set is non-Gaussian and heavy tailed. This is done first by conducting marginal normality tests (Kolmogorov-Smirnov, Shapiro-Wilk, and Lilliefors) on the data. We find that at most 24 out of the 452 stocks pass any of the three normality tests. With Bonferroni correction, over half of the stocks still fail to pass any normality test. Moreover, to illustrate the heavy tail issue, we plot the quantile vs. quantile plot for one stock, "Goldman Sachs", in Figure 3(A). It can be observed that the log return values of this stock are heavy tailed compared to the Gaussian.\n\nTo illustrate the power of the proposed method, we first pick a subset of the data. 
The stocks can be summarized into 10 Global Industry Classification Standard (GICS) sectors, and we focus on the subcategory “Financial”. This leaves us with 74 stocks, and we denote the resulting data by F ∈ R^{1257×74}. We are interested in predicting the log-return value on day t for each stock indexed by k (i.e., treating Ft,k as the response) using the log-return values of all the stocks from day t−7 to day t−1 (i.e., treating vec(Ft−7≤t′≤t−1,·) as the predictor). The dimension of the regressor is accordingly 7 × 74 = 518. For each stock indexed by k, to learn the regression coefficient βk, we use Ft′∈{1,...,1256},· as the training data and apply the three different methods to this dataset. For each method, after obtaining an estimator β̂k, we use vec(Ft′∈{1250,...,1256},·)β̂k to estimate F1257,k. This procedure is repeated for each k, and the averaged prediction errors are plotted against the number of features selected (i.e., ‖β̂k‖0) in Figure 3(B). To visualize the difference more clearly, in the figures we enlarge the scale of the prediction errors by 100 times. It can be observed that RPCR has the universally lowest prediction error across different numbers of selected features.

Acknowledgement

Han's research is supported by a Google fellowship.
Liu is supported by NSF Grants III-1116730 and NSF III-1332109, an NIH sub-award and an FDA sub-award from Johns Hopkins University.

References

Artemiou, A. and Li, B. (2009). On principal components and regression: a statistical explanation of a natural phenomenon. Statistica Sinica, 19(4):1557.
Bunea, F. and Xiao, L. (2012). On the sample covariance matrix estimator of reduced effective rank population matrices, with applications to fPCA. arXiv preprint arXiv:1212.5321.
Cai, T. T., Ma, Z., and Wu, Y. (2013).
Sparse PCA: Optimal rates and adaptive estimation. The Annals of Statistics (to appear).
Choi, K. and Marden, J. (1998). A multivariate version of Kendall's τ. Journal of Nonparametric Statistics, 9(3):261–293.
Cook, R. D. (2007). Fisher lecture: Dimension reduction in regression. Statistical Science, 22(1):1–26.
Croux, C., Ollila, E., and Oja, H. (2002). Sign and rank covariance matrices: statistical properties and application to principal components analysis. In Statistical Data Analysis Based on the L1-Norm and Related Methods, pages 257–269. Springer.
d'Aspremont, A., El Ghaoui, L., Jordan, M. I., and Lanckriet, G. R. (2007). A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–448.
Fang, K., Kotz, S., and Ng, K. (1990). Symmetric Multivariate and Related Distributions. Chapman & Hall, London.
Han, F. and Liu, H. (2013a). Optimal sparse principal component analysis in high dimensional elliptical model. arXiv preprint arXiv:1310.3561.
Han, F. and Liu, H. (2013b). Scale-invariant sparse PCA on high dimensional meta-elliptical data. Journal of the American Statistical Association (in press).
Johnstone, I. M. and Lu, A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486).
Kendall, M. G. (1968). A Course in Multivariate Analysis.
Klüppelberg, C., Kuhn, G., and Peng, L. (2007). Estimating the tail dependence function of an elliptical distribution. Bernoulli, 13(1):229–251.
Lounici, K. (2012). Sparse principal component analysis with missing observations. arXiv preprint arXiv:1205.7060.
Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics (to appear).
Massy, W. F. (1965). Principal components regression in exploratory statistical research.
Journal of the American Statistical Association, 60(309):234–256.
Moghaddam, B., Weiss, Y., and Avidan, S. (2006). Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in Neural Information Processing Systems, 18:915.
Oja, H. (2010). Multivariate Nonparametric Methods with R: An Approach Based on Spatial Signs and Ranks, volume 199. Springer.
Ravikumar, P., Raskutti, G., Wainwright, M., and Yu, B. (2008). Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. Advances in Neural Information Processing Systems (NIPS), 21.
Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. The Annals of Statistics, 15(1):234–251.
Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
Vu, V. Q. and Lei, J. (2012). Minimax rates of estimation for sparse PCA in high dimensions. Journal of Machine Learning Research (AISTATS Track).
Yuan, X. and Zhang, T. (2013). Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research, 14:899–925.
Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286.