{"title": "On the consistency theory of high dimensional variable screening", "book": "Advances in Neural Information Processing Systems", "page_first": 2431, "page_last": 2439, "abstract": "Variable screening is a fast dimension reduction technique for assisting high dimensional feature selection. As a preselection method, it selects a moderate size subset of candidate variables for further refining via feature selection to produce the final model. The performance of variable screening depends on both computational efficiency and the ability to dramatically reduce the number of variables without discarding the important ones. When the data dimension $p$ is substantially larger than the sample size $n$, variable screening becomes crucial as 1) Faster feature selection algorithms are needed; 2) Conditions guaranteeing selection consistency might fail to hold. This article studies a class of linear screening methods and establishes consistency theory for this special class. In particular, we prove the restricted diagonally dominant (RDD) condition is a necessary and sufficient condition for strong screening consistency. As concrete examples, we show two screening methods $SIS$ and $HOLP$ are both strong screening consistent (subject to additional constraints) with large probability if $n > O((\\rho s + \\sigma/\\tau)^2\\log p)$ under random designs. In addition, we relate the RDD condition to the irrepresentable condition, and highlight limitations of $SIS$.", "full_text": "On the consistency theory of high dimensional variable screening

Xiangyu Wang, Dept. of Statistical Science, Duke University, USA (xw56@stat.duke.edu)
Chenlei Leng, Dept. of Statistics, University of Warwick, UK (C.Leng@warwick.ac.uk)
David B. Dunson, Dept. of Statistical Science, Duke University, USA (dunson@stat.duke.edu)

Abstract

Variable screening is a fast dimension reduction technique for assisting high dimensional feature selection.
As a preselection method, it selects a moderate size subset of candidate variables for further refining via feature selection to produce the final model. The performance of variable screening depends on both computational efficiency and the ability to dramatically reduce the number of variables without discarding the important ones. When the data dimension p is substantially larger than the sample size n, variable screening becomes crucial as 1) faster feature selection algorithms are needed; 2) conditions guaranteeing selection consistency might fail to hold. This article studies a class of linear screening methods and establishes consistency theory for this special class. In particular, we prove that the restricted diagonally dominant (RDD) condition is a necessary and sufficient condition for strong screening consistency. As concrete examples, we show two screening methods, SIS and HOLP, are both strong screening consistent (subject to additional constraints) with large probability if n > O((ρs + σ/τ)² log p) under random designs. In addition, we relate the RDD condition to the irrepresentable condition, and highlight limitations of SIS.

1 Introduction

The rapidly growing data dimension has brought new challenges to statistical variable selection, a crucial technique for identifying important variables to facilitate interpretation and improve prediction accuracy. Recent decades have witnessed an explosion of research in variable selection and related fields such as compressed sensing [1, 2], with a core focus on regularized methods [3-7]. Regularized methods can consistently recover the support of coefficients, i.e., the non-zero signals, via optimizing regularized loss functions under certain conditions [8-10]. However, in the big data era, when p far exceeds n, such regularized methods might fail for two reasons.
First, the conditions that guarantee variable selection consistency for convex regularized methods such as the lasso might fail to hold when p >> n; second, the computational expense of both convex and non-convex regularized methods increases dramatically with large p.

Bearing these concerns in mind, [11] propose the concept of "variable screening", a fast technique that reduces the data dimensionality from p to a size comparable to n while preserving all predictors with non-zero coefficients. They propose a marginal correlation based fast screening technique, "Sure Independence Screening" (SIS), that can preserve signals with large probability. However, this method relies on the strong assumption that the marginal correlations between the response and the important predictors are high [11], which is easily violated in practice. [12] extends the marginal correlation to Spearman's rank correlation, which is shown to gain certain robustness but is still limited by the same strong assumption. [13] and [14] take a different approach to attack the screening problem. They both adopt variants of a forward selection type algorithm that includes one variable at a time to construct a candidate variable set for further refining. These methods eliminate the strong marginal assumption in [11] and have been shown to achieve better empirical performance. However, such improvement is limited by the extra computational burden caused by their iterative framework, which is reported to be high when p is large [15]. To ameliorate concerns about both screening performance and computational efficiency, [15] develop a new type of screening method termed "High-dimensional ordinary least-square projection" (HOLP).
This new screener relaxes the strong marginal assumption required by SIS and can be computed efficiently (its complexity is O(n²p)), and is thus scalable to ultra-high dimensionality.

This article focuses on linear models for tractability. As computation is one vital concern in designing a good screening method, we primarily focus on a class of linear screeners that can be computed efficiently, and study their theoretical properties. The main contributions of this article lie in three aspects.

1. We define the notion of strong screening consistency to provide a unified framework for analyzing screening methods. In particular, we show that a necessary and sufficient condition for a screening method to be strong screening consistent is that the screening matrix is restricted diagonally dominant (RDD). This condition gives insights into the design of screening matrices, while providing a framework to assess the effectiveness of screening methods.

2. We relate RDD to other existing conditions. The irrepresentable condition (IC) [8] is necessary and sufficient for the sign consistency of the lasso [3]. In contrast to IC, which is specific to the design matrix, RDD involves another ancillary matrix that can be chosen arbitrarily. Such flexibility allows RDD to hold even when IC fails, if the ancillary matrix is carefully chosen (as in HOLP). When the ancillary matrix is chosen as the design matrix, a certain equivalence is shown between RDD and IC, revealing the difficulty for SIS to achieve screening consistency. We also comment on the relationship between RDD and the restricted eigenvalue condition (REC) [6], which is commonly seen in the high dimensional literature. We illustrate via a simple example that RDD is not necessarily stronger than REC.

3. We study the behavior of SIS and HOLP under random designs, and prove that a sample size of n = O((ρs + σ/τ)² log p) is sufficient for SIS and HOLP to be screening consistent, where s is the sparsity, ρ measures the diversity of the signals, and τ/σ evaluates the signal-to-noise ratio. This is to be compared with the sign consistency results in [9], where the design matrix is fixed and assumed to follow the IC.

The article is organized as follows. In Section 2, we set up the basic problem and describe the framework of variable screening. In Section 3, we provide a deterministic necessary and sufficient condition for consistent screening. Its relationship with the irrepresentable condition is discussed in Section 4. In Section 5, we prove the consistency of SIS and HOLP under random designs by showing that the RDD condition is satisfied with large probability, although the requirement on SIS is much more restrictive.

2 Linear screening

Consider the usual linear regression

Y = Xβ + ε,

where Y is the n × 1 response vector, X is the n × p design matrix, and ε is the noise. The regression task is to learn the coefficient vector β. In the high dimensional setting where p >> n, a sparsity assumption is often imposed on β so that only a small portion of the coordinates are non-zero. Such an assumption splits the task of learning β into two phases. The first is to recover the support of β, i.e., the location of the non-zero coefficients; the second is to estimate the values of these non-zero signals. This article mainly focuses on the first phase.

As pointed out in the introduction, when the dimensionality is too high, using regularization methods raises concerns both computationally and theoretically.
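As a running illustration (our own sketch, not part of the paper), the sparse p >> n linear model above can be simulated in a few lines; the dimensions and constants below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 50, 1000, 5              # p >> n with a sparse signal

# Rows of X drawn i.i.d. N(0, I_p); beta has s non-zero coefficients.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
beta[support] = rng.choice([-1.0, 1.0], size=s) * rng.uniform(1.0, 2.0, size=s)

sigma = 0.5
Y = X @ beta + sigma * rng.standard_normal(n)   # Y = X beta + noise
```

The two phases then amount to recovering `support` and estimating `beta[support]`.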
To reduce the dimensionality, [11] suggest a variable screening framework based on finding a submodel

M_d = {i : |β̂_i| is among the largest d coordinates of |β̂|}   or   M_γ = {i : |β̂_i| > γ}.

Let Q = {1, 2, ..., p} and define S as the true model, with s = |S| being its cardinality. The hope is that the submodel size |M_d| or |M_γ| will be smaller than or comparable to n, while S ⊆ M_d or S ⊆ M_γ. To achieve this goal, two steps are usually involved in the screening analysis. The first is to show that there exists some γ such that min_{i∈S} |β̂_i| > γ; the second is to bound |M_γ| such that |M_γ| = O(n). To unify these steps into a more comprehensive theoretical framework, we put forward a slightly stronger definition of screening consistency in this article.

Definition 2.1. (Strong screening consistency) An estimator β̂ (of β) is strong screening consistent if it satisfies

min_{i∈S} |β̂_i| > max_{i∉S} |β̂_i|    (1)

and

sign(β̂_i) = sign(β_i),  ∀ i ∈ S.    (2)

Remark 2.1. This definition does not differ much from the usual screening property studied in the literature, which requires min_{i∈S} |β̂_i| > max^{(n−s)}_{i∉S} |β̂_i|, where max^{(k)} denotes the k-th largest item. The key to strong screening consistency is property (1), which requires the estimator to preserve a consistent ordering of the zero and non-zero coefficients. It is weaker than the variable selection consistency in [8]. The requirement in (2) can be seen as a relaxation of the sign consistency defined in [8], as no requirement on β̂_i, i ∉ S, is needed.
As shown later, such relaxation tremendously reduces the restriction on the design matrix, and allows screening methods to work for a broader choice of X.

The focus of this article is to study the theoretical properties of a special class of screeners that take the linear form

β̂ = AY

for some p × n ancillary matrix A. Examples include sure independence screening (SIS), where A = Xᵀ/n, and high-dimensional ordinary least-square projection (HOLP), where A = Xᵀ(XXᵀ)⁻¹. We choose to study the class of linear estimators because linear screening is computationally efficient and theoretically tractable. We note that the usual ordinary least-squares estimator is also a special case of linear estimators, although it is not well defined for p > n.

3 Deterministic guarantees

In this section, we derive the necessary and sufficient condition that guarantees β̂ = AY to be strong screening consistent. The design matrix X and the error ε are treated as fixed in this section; random designs are investigated later. We consider the set of sparse coefficient vectors defined by

B(s, ρ) = { β ∈ R^p : |supp(β)| ≤ s,  max_{i∈supp(β)} |β_i| / min_{i∈supp(β)} |β_i| ≤ ρ }.

The set B(s, ρ) contains vectors having at most s non-zero coordinates, with the ratio of the largest and smallest coordinates bounded by ρ. Before proceeding to the main result of this section, we introduce some terminology that helps to establish the theory.

Definition 3.1.
(restricted diagonally dominant matrix) A p × p symmetric matrix Φ is restricted diagonally dominant with sparsity s if for any I ⊆ Q with |I| ≤ s − 1 and any i ∈ Q \ I,

Φ_ii > C0 max{ Σ_{j∈I} |Φ_ij + Φ_kj|,  Σ_{j∈I} |Φ_ij − Φ_kj| } + |Φ_ik|,  ∀ k ≠ i, k ∈ Q \ I,

where C0 ≥ 1 is a constant.

Notice this definition implies that for i ∈ Q \ I,

Φ_ii ≥ C0 ( Σ_{j∈I} |Φ_ij + Φ_kj| + Σ_{j∈I} |Φ_ij − Φ_kj| ) / 2 ≥ C0 Σ_{j∈I} |Φ_ij|,    (3)

which relates it to the usual diagonally dominant matrix. Restricted diagonal dominance provides a necessary and sufficient condition for any linear estimator β̂ = AY to be strong screening consistent. More precisely, we have the following result.

Theorem 1. For the noiseless case where ε = 0, a linear estimator β̂ = AY is strong screening consistent for every β ∈ B(s, ρ) if and only if the screening matrix Φ = AX is restricted diagonally dominant with sparsity s and C0 ≥ ρ.

Proof. Assume Φ is restricted diagonally dominant with sparsity s and C0 ≥ ρ. Recall β̂ = Φβ. Suppose S is the index set of non-zero predictors.
For any i ∈ S and k ∉ S, let I = S \ {i}. The RDD condition (with |β_j/β_i| ≤ ρ ≤ C0) guarantees Φ_ii + Σ_{j∈I} (β_j/β_i)Φ_ij > 0, so

|β̂_i| = |β_i| ( Φ_ii + Σ_{j∈I} (β_j/β_i) Φ_ij )
      = |β_i| ( Φ_ii + Σ_{j∈I} (β_j/β_i)(Φ_ij + Φ_kj) + Φ_ki − Σ_{j∈I} (β_j/β_i)Φ_kj − Φ_ki )
      > −|β_i| ( Σ_{j∈I} (β_j/β_i) Φ_kj + Φ_ki )
      = −sign(β_i) ( Σ_{j∈I} β_j Φ_kj + β_i Φ_ki ) = −sign(β_i) · β̂_k,

and

|β̂_i| = |β_i| ( Φ_ii + Σ_{j∈I} (β_j/β_i)(Φ_ij − Φ_kj) − Φ_ki + Σ_{j∈I} (β_j/β_i)Φ_kj + Φ_ki )
      > |β_i| ( Σ_{j∈I} (β_j/β_i) Φ_kj + Φ_ki )
      = sign(β_i) ( Σ_{j∈I} β_j Φ_kj + β_i Φ_ki ) = sign(β_i) · β̂_k.

Therefore, whatever the value of sign(β_i), it always holds that |β̂_i| > |β̂_k|, and thus min_{i∈S} |β̂_i| > max_{k∉S} |β̂_k|.

To prove the sign consistency for the non-zero coefficients, notice that for i ∈ S,

β̂_i β_i = Φ_ii β_i² + Σ_{j∈I} Φ_ij β_j β_i = β_i² ( Φ_ii + Σ_{j∈I} (β_j/β_i) Φ_ij ) > 0.

The proof of necessity is left to the supplementary materials.

The noiseless case is a good starting point to analyze β̂.
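As a numerical sanity check of the sufficiency direction of Theorem 1 (our own sketch, using an ad hoc near-identity Φ that is comfortably restricted diagonally dominant at this sparsity level):

```python
import numpy as np

rng = np.random.default_rng(1)
p, s = 30, 4

# A symmetric screening matrix close to the identity: its diagonal (~1)
# easily dominates the tiny off-diagonal sums, so it is RDD for small s.
Phi = np.eye(p) + 0.01 * rng.uniform(-1, 1, (p, p))
Phi = (Phi + Phi.T) / 2

beta = np.zeros(p)
S = np.arange(s)
beta[S] = [1.5, -1.2, 2.0, -1.0]       # rho = 2.0 / 1.0 = 2

beta_hat = Phi @ beta                   # noiseless case: beta_hat = Phi beta
in_S = np.abs(beta_hat[S])
out_S = np.abs(np.delete(beta_hat, S))

assert in_S.min() > out_S.max()                                 # ordering (1)
assert np.array_equal(np.sign(beta_hat[S]), np.sign(beta[S]))   # signs (2)
```

Here the off-diagonal perturbations are bounded by 0.01, so both properties hold for any realization of the random perturbation.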
Intuitively, in order to preserve the correct order of the coefficients in β̂ = AXβ, one needs AX to be close to a diagonally dominant matrix, so that β̂_i, i ∈ S, can take advantage of the large diagonal terms of AX to dominate β̂_i, i ∉ S, which are just linear combinations of off-diagonal terms.

When noise is considered, the condition in Theorem 1 needs to be changed slightly to accommodate extra discrepancies. In addition, the smallest non-zero coefficient has to be lower bounded to ensure a certain level of signal-to-noise ratio. Thus, we augment our previous definition of B(s, ρ) with a signal strength control

B_τ(s, ρ) = { β ∈ B(s, ρ) : min_{i∈supp(β)} |β_i| ≥ τ }.

Then we obtain the following modified theorem.

Theorem 2. With noise, the linear estimator β̂ = AY is strong screening consistent for every β ∈ B_τ(s, ρ) if Φ = AX − 2τ⁻¹‖Aε‖∞ I_p is restricted diagonally dominant with sparsity s and C0 ≥ ρ.

The proof of Theorem 2 is essentially the same as that of Theorem 1 and is thus left to the supplementary materials. The condition in Theorem 2 can be further tailored to a necessary and sufficient version with extra manipulation of the noise term. Nevertheless, this might not be useful in practice due to the randomness in the noise. In addition, the current version of Theorem 2 is already tight, in the sense that there exists some noise vector ε for which the condition in Theorem 2 is also necessary for strong screening consistency.

Theorems 1 and 2 establish ground rules for verifying the consistency of a given screener and provide practical guidance for screening design.
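For small matrices, Definition 3.1 can be checked directly by brute force. The following checker (our own illustration, exponential in p and meant only for toy examples) enumerates all index sets I with |I| ≤ s − 1:

```python
import numpy as np
from itertools import combinations

def is_rdd(Phi, s, C0=1.0):
    """Brute-force check of the RDD condition (Definition 3.1): for every
    I with |I| <= s-1 and every i, k in Q \\ I with k != i, require
    Phi_ii > C0 * max(sum_j |Phi_ij + Phi_kj|, sum_j |Phi_ij - Phi_kj|) + |Phi_ik|.
    """
    p = Phi.shape[0]
    for size in range(s):                        # |I| = 0, ..., s-1
        for I in combinations(range(p), size):
            rest = [i for i in range(p) if i not in I]
            for i in rest:
                for k in rest:
                    if k == i:
                        continue
                    plus = sum(abs(Phi[i, j] + Phi[k, j]) for j in I)
                    minus = sum(abs(Phi[i, j] - Phi[k, j]) for j in I)
                    if Phi[i, i] <= C0 * max(plus, minus) + abs(Phi[i, k]):
                        return False
    return True

# The identity is trivially RDD; a matrix of all ones is not.
assert is_rdd(np.eye(5), s=2)
assert not is_rdd(np.ones((5, 5)), s=2)
```

The sufficiency direction of Theorem 1 can then be explored empirically by feeding candidate screening matrices Φ = AX to this checker.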
In Section 5, we consider some concrete examples of the ancillary matrix A and prove that the conditions in Theorems 1 and 2 are satisfied by the corresponding screeners with large probability under random designs.

4 Relationship with other conditions

For some special cases such as sure independence screening (SIS), the restricted diagonally dominant (RDD) condition is related to the strong irrepresentable condition (IC) proposed in [8]. Assume each column of X is standardized to have mean zero. Letting C = XᵀX/n and β be a given coefficient vector, the IC is expressed as

‖C_{S^c,S} C_{S,S}⁻¹ · sign(β_S)‖∞ ≤ 1 − θ    (4)

for some θ > 0, where C_{A,B} represents the sub-matrix of C with row indices in A and column indices in B. The authors enumerate several scenarios of C under which the IC is satisfied. We verify some of these scenarios for the screening matrix Φ.

Corollary 1. If Φ_ii = 1 for all i and |Φ_ij| < c/(2s) for all i ≠ j, for some 0 ≤ c < 1 as defined in Corollaries 1 and 2 in [8], then Φ is a restricted diagonally dominant matrix with sparsity s and C0 ≥ 1/c. If |Φ_ij| < r^{|i−j|} for all i, j, for some 0 < r < 1 as defined in Corollary 3 in [8], then Φ is a restricted diagonally dominant matrix with sparsity s and C0 ≥ (1 − r)²/(4r).

A more explicit but nontrivial relationship between IC and RDD is illustrated below for |S| = 2.

Theorem 3. Assume Φ_ii = 1 for all i and |Φ_ij| < r for all i ≠ j. If Φ is restricted diagonally dominant with sparsity 2 and C0 ≥ ρ, then Φ satisfies

‖Φ_{S^c,S} Φ_{S,S}⁻¹ · sign(β_S)‖∞ ≤ ρ⁻¹/(1 − r)

for all β ∈ B(2, ρ).
On the other hand, if Φ satisfies the IC for all β ∈ B(2, ρ) for some θ, then Φ is a restricted diagonally dominant matrix with sparsity 2 and

C0 ≥ (1/(1 − θ)) · (1 − r)/(1 + r).

Theorem 3 demonstrates a certain equivalence between IC and RDD. However, this does not mean that RDD is an equally strong requirement. Notice that IC is imposed directly on the covariance matrix XᵀX/n. This makes IC a strong assumption that is easily violated, for example, when the predictors are highly correlated. In contrast to IC, RDD is imposed on the matrix AX, where there is flexibility in choosing A. Only when A is chosen to be Xᵀ/n is RDD as strong as IC, as shown in the next theorem. For other choices of A, such as HOLP defined in the next section, the estimator satisfies RDD even when the predictors are highly correlated. Therefore, RDD can be considered a weak requirement.

For SIS, the screening matrix Φ = XᵀX/n coincides with the covariance matrix, making RDD and IC effectively equivalent. The following theorem formalizes this.

Theorem 4. Let A = Xᵀ/n and standardize the columns of X to have sample variance one. Assume X satisfies the sparse Riesz condition [16], i.e.,

min_{π⊆Q, |π|≤s} λ_min(X_πᵀ X_π / n) ≥ μ

for some μ > 0. If AX is restricted diagonally dominant with sparsity s + 1 and C0 ≥ ρ with ρ > √s/μ, then X satisfies the IC for any β ∈ B(s, ρ). In other words, under the condition ρ > √s/μ, the strong screening consistency of SIS for B(s + 1, ρ) implies the model selection consistency of the lasso for B(s, ρ).

Theorem 4 illustrates the difficulty faced by SIS. The necessary condition that guarantees good screening performance of SIS also guarantees the model selection consistency of the lasso.
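To see how the IC in (4) fails in a simple case (a toy construction of ours, not taken from the paper), let one inactive predictor be nearly a linear combination of the active ones:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 200, 3

# Active predictors X1..X3 are independent; inactive X4 is close to a
# linear combination of them -- a standard way to violate the IC.
X_act = rng.standard_normal((n, s))
x4 = X_act @ np.full(s, 0.6) + 0.1 * rng.standard_normal(n)
X = np.column_stack([X_act, x4])

C = X.T @ X / n
S, Sc = np.arange(s), np.array([s])
ic = np.abs(C[np.ix_(Sc, S)] @ np.linalg.solve(C[np.ix_(S, S)], np.ones(s))).max()

# The population value of this norm is 0.6 * 3 = 1.8 > 1, so (4) fails.
print(ic)
```

Since the SIS screening matrix coincides with C here, the contrapositive of Theorem 4 suggests the same structure also obstructs the strong screening consistency of SIS.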
However, such a strong necessary condition does not mean that SIS should be avoided in practice, given its substantial advantages in terms of simplicity and computational efficiency. The strong screening consistency defined in this article is stronger than the conditions commonly used to justify screening procedures, as in [11].

Another common assumption in the high dimensional literature is the restricted eigenvalue condition (REC). Compared to REC, RDD is not necessarily stronger, owing to its flexibility in choosing the ancillary matrix A. [17, 18] prove that the REC is satisfied when the design matrix is sub-Gaussian. However, REC might not be guaranteed when the rows of X follow a heavy-tailed distribution. In contrast, as the example shown in the next section and in [15] demonstrates, by choosing A = Xᵀ(XXᵀ)⁻¹, the resulting estimator satisfies RDD even when the rows of X follow heavy-tailed distributions.

5 Screening under random designs

In this section, we consider linear screening under random designs where X and ε are Gaussian. The theory developed in this section can be easily extended to a broader family of distributions, for example, where ε follows a sub-Gaussian distribution [19] and X follows an elliptical distribution [11, 15]. We focus on the Gaussian case for conciseness. Let ε ~ N(0, σ²) and let the rows of X be drawn from N(0, Σ). We prove the screening consistency of SIS and HOLP by verifying the condition in Theorem 2. Recall that the ancillary matrices for SIS and HOLP are defined respectively as

A_SIS = Xᵀ/n,   A_HOLP = Xᵀ(XXᵀ)⁻¹.

For simplicity, we assume Σ_ii = 1 for i = 1, 2, ..., p. To verify the RDD condition, it is essential to quantify the magnitude of the entries of AX and Aε.

Lemma 1.
Let Φ = A_SIS X. Then for any t > 0 and any i ≠ j ∈ Q, we have

P( |Φ_ii − Σ_ii| ≥ t ) ≤ 2 exp{ −min( t²n/(8e²K), tn/(2eK) ) }

and

P( |Φ_ij − Σ_ij| ≥ t ) ≤ 6 exp{ −min( t²n/(72e²K), tn/(6eK) ) },

where K = ‖X²(1) − 1‖_{ψ1} is a constant, X²(1) is a chi-square random variable with one degree of freedom, and the norm ‖·‖_{ψ1} is defined in [19].

Lemma 1 states that the screening matrix Φ = A_SIS X for SIS eventually converges to the covariance matrix Σ in ℓ∞ as n tends to infinity with log p = o(n). Thus, the screening performance of SIS relies strongly on the structure of Σ. In particular, the (asymptotically) necessary and sufficient condition for SIS to be strong screening consistent is that Σ satisfies the RDD condition. For the noise term, we have the following lemma.

Lemma 2. Let η = A_SIS ε. For any t > 0 and i ∈ Q, we have

P( |η_i| ≥ σt ) ≤ 6 exp{ −min( t²n/(72e²K), tn/(6eK) ) },

where K is defined as in Lemma 1.

The proof of Lemma 2 is essentially the same as the proof for the off-diagonal terms in Lemma 1 and is thus omitted. As indicated before, the necessary and sufficient condition for SIS to be strong screening consistent is that Σ satisfies RDD. As RDD is usually hard to verify, we consider a stronger sufficient condition inspired by Corollary 1.

Theorem 5. Let r = max_{i≠j} |Σ_ij|.
If r < 1/(2ρs), then for any δ > 0, if the sample size satisfies

n > 144K ( (1 + 2ρs + 2σ/τ) / (1 − 2ρsr) )² log(3p/δ),    (5)

where K is defined in Lemma 1, then with probability at least 1 − δ, Φ = A_SIS X − 2τ⁻¹‖A_SIS ε‖∞ I_p is restricted diagonally dominant with sparsity s and C0 ≥ ρ. In other words, SIS is screening consistent for any β ∈ B_τ(s, ρ).

Proof. Taking a union bound over the results of Lemmas 1 and 2, we have for any t > 0 and p > 2,

P( min_{i∈Q} Φ_ii ≤ 1 − t  or  max_{i≠j} |Φ_ij| ≥ r + t  or  ‖η‖∞ ≥ σt ) ≤ 7p² exp{ −(n/K) min( t²/(72e²), t/(6e) ) }.

In other words, for any δ > 0, when n ≥ K log(7p²/δ), with probability at least 1 − δ we have

min_{i∈Q} Φ_ii ≥ 1 − 6√2 e √(K log(7p²/δ)/n),   max_{i≠j} |Φ_ij| ≤ r + 6√2 e √(K log(7p²/δ)/n),

max_{i∈Q} |η_i| ≤ 6√2 e σ √(K log(7p²/δ)/n).

A sufficient condition for Φ to be restricted diagonally dominant is

min_i Φ_ii > 2ρs max_{i≠j} |Φ_ij| + 2τ⁻¹ max_i |η_i|.

Plugging in the values above, we need

1 − 6√2 e √(K log(7p²/δ)/n) > 2ρs ( r + 6√2 e √(K log(7p²/δ)/n) ) + 12√2 e τ⁻¹σ √(K log(7p²/δ)/n).

Solving this inequality (noticing that 7p²/δ < 9p²/δ² and ρ > 1) completes the proof.

The requirement that max_{i≠j} |Σ_ij| < 1/(2ρs), or the necessary and sufficient condition that Σ be RDD, strictly constrains the
correlation structure of X, which causes the difficulty for SIS to be strong screening consistent. For HOLP we instead have the following result.

Lemma 3. Let Φ = A_HOLP X, and let κ denote the condition number of Σ. Assume p > c0 n for some c0 > 1. Then for any C > 0 there exist constants 0 < c1 < 1 < c2 and c3 > 0 such that for any t > 0 and any i ∈ Q, j ≠ i, we have

P( |Φ_ii| < c1 κ⁻¹ n/p ) ≤ 2e^{−Cn},   P( |Φ_ii| > c2 κ n/p ) ≤ 2e^{−Cn},

and

P( |Φ_ij| > c4 κ t √n/p ) ≤ 5e^{−Cn} + 2e^{−t²/2},

where c4 = √(c2)(c0 − c1)/(√(c3)(c0 − 1)).

Proof. The proof of Lemma 3 relies heavily on results for the Stiefel manifold provided in the supplementary materials. We only sketch the basic idea here and leave the complete proof to the supplementary materials. Define H = Xᵀ(XXᵀ)^{−1/2}; then Φ = HHᵀ, and H follows the matrix angular central Gaussian distribution MACG(Σ). The diagonal terms of HHᵀ can be bounded via the Johnson-Lindenstrauss lemma, using the fact that HHᵀ = Σ^{1/2}U(UᵀΣU)⁻¹UᵀΣ^{1/2}, where U is a p × n random projection matrix. For the off-diagonal terms, we decompose the Stiefel manifold as H = (G(H2)H1, H2), where H1 is a (p − n + 1) × 1 vector, H2 is a p × (n − 1) matrix, and G(H2) is chosen so that (G(H2), H2) ∈ O(p); conditional on H2, H1 then follows the angular central Gaussian distribution ACG(G(H2)ᵀΣG(H2)). It can be shown that e2ᵀHHᵀe1 (d)= e2ᵀG(H2)H1 | e1ᵀH2 = 0. Letting t1² = e1ᵀHHᵀe1, the event e1ᵀH2 = 0 is equivalent to e1ᵀG(H2)H1 = t1, and we obtain the desired coupling distribution as e2ᵀHHᵀe1 (d)= e2ᵀG(H2)H1 | e1ᵀG(H2)H1 = t1. Using the normal representation of ACG(Σ), i.e., if x = (x1, ..., xp) ~ N(0, Σ) then x/‖x‖ ~ ACG(Σ), we can write G(H2)H1 in terms of normal variables and then bound all terms using concentration inequalities.

Lemma 3 quantifies the entries of the screening matrix for HOLP. As the lemma illustrates, regardless of the covariance Σ, the diagonal terms of Φ are always O(n/p) and the off-diagonal terms are O(√n/p). Thus, with n ≥ O(s²), Φ is likely to satisfy the RDD condition with large probability. For the noise vector we have the following result.

Lemma 4. Let η = A_HOLP ε. Assume p > c0 n for some c0 > 1. Then for any C > 0 there exist the same c1, c2, c3 as in Lemma 3 such that for any t > 0 and i ∈ Q,

P( |η_i| ≥ 2σ√(c2) κ t √n / ((1 − c0⁻¹) p) ) < 4e^{−Cn} + 2e^{−t²/2},

if n ≥ 8C/(c0 − 1)².

The proof is almost identical to that of Lemma 2 and is provided in the supplementary materials. The following theorem results from combining Lemmas 3 and 4.

Theorem 6. Assume p > c0 n for some c0 > 1. For any δ > 0, if the sample size satisfies

n > max{ 2C′κ⁴(ρs + σ/τ)² log(3p/δ),  8C/(c0 − 1)² },    (6)

where C′ = max{ 4c4²/c1², 4c2/(c1²(1 − c0⁻¹)²) } and c1, c2, c3, c4, C are the constants defined in Lemma 3, then with probability at least 1 − δ, Φ = A_HOLP X − 2τ⁻¹‖A_HOLP ε‖∞ I_p is restricted diagonally dominant with sparsity s and C0 ≥ ρ. This implies that HOLP is screening consistent for any β ∈ B_τ(s, ρ).

Proof.
Notice that if

min_i |Φ_ii| > 2ρs max_{i≠j} |Φ_ij| + 2τ⁻¹‖Xᵀ(XXᵀ)⁻¹ε‖∞,    (7)

then the proof is complete, because Φ = A_HOLP X − 2τ⁻¹‖Xᵀ(XXᵀ)⁻¹ε‖∞ I_p is then a restricted diagonally dominant matrix. Let t = √(Cn)/ν. By Lemmas 3 and 4, inequality (7) is implied by

( c1κ⁻¹ − 2c4√C κρs/ν − 2√(c2 C) κσ/((1 − c0⁻¹)τν) ) (n/p) > 0,

which in turn requires

ν > (2c4√C/c1) κ²ρs + (2√(c2 C)/(c1(1 − c0⁻¹))) κ²τ⁻¹σ = C1κ²ρs + C2κ²τ⁻¹σ > 1,

where C1 = (2c4/c1)√C and C2 = 2√(c2 C)/(c1(1 − c0⁻¹)). Therefore, taking union bounds over all matrix entries, we have

P( (7) does not hold ) < (p + 5p²)e^{−Cn} + 2p²e^{−Cn/ν²} < (7 + 1/n) p² e^{−Cn/ν²},

where the second inequality is due to the fact that p > n and ν > 1. Now for any δ > 0, (7) holds with probability at least 1 − δ if

n ≥ (ν²/C)( log(7 + 1/n) + 2 log p − log δ ),

which is satisfied provided (noticing that √8 < 3) n ≥ (2ν²/C) log(3p/δ). Pushing ν to its limit gives (6), the precise condition we need.

There are several interesting observations on (5) and (6). First, (ρs + σ/τ)² appears in both expressions. We note that ρs evaluates the sparsity and the diversity of the signal β, while σ/τ is closely related to the signal-to-noise ratio. Furthermore, HOLP replaces the correlation constraint r < 1/(2ρs), or the covariance constraint that Σ be RDD, with a constraint on the condition number.
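The contrast can be seen in a small simulation (our own toy example; all constants are arbitrary). Under an equicorrelated design with mixed signal signs, the marginal SIS estimates of some true signals fall below those of noise predictors, while HOLP typically retains the full support in the size-n submodel:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, s = 100, 200, 5

# Equicorrelated rows: Sigma_ij = r for i != j, so r >> 1/(2*rho*s) and
# Theorem 5's sufficient condition for SIS is violated.
r = 0.6
X = np.sqrt(1 - r) * rng.standard_normal((n, p)) \
    + np.sqrt(r) * rng.standard_normal((n, 1))

beta = np.zeros(p)
beta[:s] = [5.0, 5.0, 5.0, -5.0, -5.0]   # mixed signs hurt marginal screening
Y = X @ beta + 0.1 * rng.standard_normal(n)

beta_sis = X.T @ Y / n                          # A_SIS  = X^T / n
beta_holp = X.T @ np.linalg.solve(X @ X.T, Y)   # A_HOLP = X^T (X X^T)^{-1}

def keeps_support(b, d=n):
    """Does the size-d submodel M_d contain the true support {0,...,s-1}?"""
    top_d = np.argsort(np.abs(b))[::-1][:d]
    return set(range(s)) <= set(top_d)

print("SIS keeps support: ", keeps_support(beta_sis))   # usually False here
print("HOLP keeps support:", keeps_support(beta_holp))  # usually True
```

Here the population marginal covariance of a negative signal is −5 + 0.6·10 = 1, below the value 0.6·5 = 3 shared by every noise predictor, which is exactly the marginal-correlation failure mode of SIS discussed in the introduction.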
Thus for any Σ, as long as the sample size is large enough, strong screening consistency is assured. Finally, HOLP provides an example satisfying the RDD condition, answering the question raised in Section 4.

6 Concluding remarks

This article studies and establishes a necessary and sufficient condition, in the form of restricted diagonally dominant screening matrices, for strong screening consistency of a linear screener. We verify the condition for both SIS and HOLP under random designs. In addition, we show a close relationship between RDD and the IC, highlighting the difficulty of using SIS in screening for arbitrarily correlated predictors. For future work, it is of interest to see how linear screening can be adapted to compressed sensing [20] and how techniques such as preconditioning [21] can improve the performance of marginal screening and variable selection.

Acknowledgments This research was partly supported by grant NIH R01-ES017436 from the National Institute of Environmental Health Sciences.

References

[1] David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[2] Richard Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4), 2007.

[3] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267–288, 1996.

[4] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[5] Emmanuel Candes and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

[6] Peter J Bickel, Ya'acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.

[7] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[8] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.

[9] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity using l1-constrained quadratic programming. IEEE Transactions on Information Theory, 2009.

[10] Jason D Lee, Yuekai Sun, and Jonathan E Taylor. On model selection consistency of m-estimators with geometrically decomposable penalties. Advances in Neural Information Processing Systems, 2013.

[11] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, 2008.

[12] Gaorong Li, Heng Peng, Jun Zhang, Lixing Zhu, et al. Robust rank correlation based screening. The Annals of Statistics, 40(3):1846–1877, 2012.

[13] Hansheng Wang. Forward regression for ultra-high dimensional variable screening. Journal of the American Statistical Association, 104(488):1512–1524, 2009.

[14] Haeran Cho and Piotr Fryzlewicz. High dimensional variable selection via tilting. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):593–622, 2012.

[15] Xiangyu Wang and Chenlei Leng. High-dimensional ordinary least-squares projection for screening variables.
https://stat.duke.edu/~xw56/holp-paper.pdf, 2015.

[16] Cun-Hui Zhang and Jian Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594, 2008.

[17] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. The Journal of Machine Learning Research, 11:2241–2259, 2010.

[18] Shuheng Zhou. Restricted eigenvalue conditions on subgaussian random matrices. arXiv preprint arXiv:0912.4045, 2009.

[19] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[20] Lingzhou Xue and Hui Zou. Sure independence screening and compressed random sensing. Biometrika, 98(2):371–380, 2011.

[21] Jinzhu Jia and Karl Rohe. Preconditioning to comply with the irrepresentable condition. arXiv preprint arXiv:1208.5584, 2012.