{"title": "Exact Post Model Selection Inference for Marginal Screening", "book": "Advances in Neural Information Processing Systems", "page_first": 136, "page_last": 144, "abstract": "We develop a framework for post model selection inference, via marginal screening, in linear regression. At the core of this framework is a result that characterizes the exact distribution of linear functions of the response $y$, conditional on the model being selected (``condition on selection framework). This allows us to construct valid confidence intervals and hypothesis tests for regression coefficients that account for the selection procedure. In contrast to recent work in high-dimensional statistics, our results are exact (non-asymptotic) and require no eigenvalue-like assumptions on the design matrix $X$. Furthermore, the computational cost of marginal regression, constructing confidence intervals and hypothesis testing is negligible compared to the cost of linear regression, thus making our methods particularly suitable for extremely large datasets. Although we focus on marginal screening to illustrate the applicability of the condition on selection framework, this framework is much more broadly applicable. We show how to apply the proposed framework to several other selection procedures including orthogonal matching pursuit and marginal screening+Lasso.\"", "full_text": "Exact Post Model Selection Inference for Marginal\n\nScreening\n\nJason D. Lee\n\nComputational and Mathematical Engineering\n\nStanford University\nStanford, CA 94305\n\njdl17@stanford.edu\n\nJonathan E. Taylor\n\nDepartment of Statistics\n\nStanford University\nStanford, CA 94305\n\njonathan.taylor@stanford.edu\n\nAbstract\n\nWe develop a framework for post model selection inference, via marginal screen-\ning, in linear regression. 
At the core of this framework is a result that characterizes the exact distribution of linear functions of the response y, conditional on the model being selected (\"condition on selection\" framework). This allows us to construct valid confidence intervals and hypothesis tests for regression coefficients that account for the selection procedure. In contrast to recent work in high-dimensional statistics, our results are exact (non-asymptotic) and require no eigenvalue-like assumptions on the design matrix X. Furthermore, the computational cost of marginal regression, constructing confidence intervals and hypothesis testing is negligible compared to the cost of linear regression, thus making our methods particularly suitable for extremely large datasets. Although we focus on marginal screening to illustrate the applicability of the condition on selection framework, this framework is much more broadly applicable. We show how to apply the proposed framework to several other selection procedures including orthogonal matching pursuit and marginal screening+Lasso.\n\n1 Introduction\n\nConsider the model\n\ny_i = µ(x_i) + ε_i,  ε ∼ N(0, σ²I),   (1)\n\nwhere µ(x) is an arbitrary function, and x_i ∈ R^p. 
Our goal is to perform inference on (X^T X)^{-1} X^T µ, which is the best linear predictor of µ. In the classical setting of n > p, the least squares estimator β̂ = (X^T X)^{-1} X^T y is a commonly used estimator for (X^T X)^{-1} X^T µ. Under the linear model assumption µ = Xβ⁰, the exact distribution of β̂ is\n\nβ̂ ∼ N(β⁰, σ²(X^T X)^{-1}).   (2)\n\nUsing the normal distribution, we can test the hypothesis H_0: β⁰_j = 0 using the z-test and form confidence intervals for β⁰_j.\n\nHowever, in the high-dimensional p > n setting, the least squares problem is underdetermined, and the predominant approach is to perform variable selection or model selection [4]. There are many approaches to variable selection including AIC/BIC, greedy algorithms such as forward stepwise regression and orthogonal matching pursuit, and regularization methods such as the Lasso. The focus of this paper will be on the model selection procedure known as marginal screening, which selects the k features x_j most correlated with the response y.\n\nMarginal screening is the simplest and most commonly used of the variable selection procedures [9, 21, 16]. Marginal screening requires only O(np) computation and is several orders of magnitude faster than regularization methods such as the Lasso; it is particularly suitable for extremely large datasets where the Lasso may be computationally intractable to apply. 
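For concreteness, the O(np) cost can be seen in a minimal numpy sketch (our own illustration, not the authors' code; variable names are ours): the entire screening step reduces to one matrix-vector product followed by a top-k lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 5000, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# One pass over the design matrix: O(np) work.
scores = np.abs(X.T @ y)
# Indices of the k largest |x_j^T y|: the selected model S-hat.
S_hat = np.sort(np.argsort(scores)[-k:])
```

No iterative optimization is involved, which is the source of the speed advantage over the Lasso.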
Furthermore, the selection properties are comparable to the Lasso [8].\n\nSince marginal screening utilizes the response variable y, the confidence intervals and statistical tests based on the distribution in (2) are not valid; confidence intervals with nominal 1 − α coverage may no longer cover at the advertised level:\n\nPr(β⁰_j ∈ C_{1−α}(x)) < 1 − α.\n\nSeveral authors have previously noted this problem, including recent work in [13, 14, 15, 2]. A major line of work [13, 14, 15] has described the difficulty of inference post model selection: the distribution of post model selection estimates is complicated and cannot be approximated in a uniform sense by their asymptotic counterparts.\n\nIn this paper, we describe how to form exact confidence intervals for linear regression coefficients post model selection. We assume the model (1), and operate under the fixed design matrix X setting. A linear regression coefficient constrained to a subset of variables S is linear in µ, e_j^T (X_S^T X_S)^{-1} X_S^T µ = η^T µ for some η. We derive the conditional distribution of η^T y for any vector η, so we are able to form confidence intervals for regression coefficients.\n\nIn Section 2 we discuss related work on high-dimensional statistical inference, and Section 3 introduces the marginal screening algorithm and shows how z intervals may fail to have the correct coverage properties. Sections 4 and 5 show how to represent the marginal screening selection event as constraints on y, and construct pivotal quantities for the truncated Gaussian. 
Section 6 uses these tools to develop valid confidence intervals, and Section 7 evaluates the methodology on two real datasets.\n\nAlthough the focus of this paper is on marginal screening, the \"condition on selection\" framework, first proposed for the Lasso in [12], is much more general; we use marginal screening as a simple and clean illustration of the applicability of this framework. In Section 8, we discuss several extensions, including how to apply the framework to other variable/model selection procedures and to nonlinear regression problems. Section 8 covers marginal screening+Lasso, a screen-and-clean procedure that first screens with marginal screening and then cleans with the Lasso, as well as orthogonal matching pursuit (OMP).\n\n2 Related Work\n\nMost of the theoretical work on high-dimensional linear models focuses on consistency. Such results establish that, under restrictive assumptions on X, the Lasso estimate β̂ is close to the unknown β⁰ [19] and selects the correct model [26, 23, 11]. We refer the reader to [4] for a comprehensive discussion of the theoretical properties of the Lasso.\n\nThere is also recent work on obtaining confidence intervals and significance tests for penalized M-estimators such as the Lasso. One class of methods uses sample splitting or subsampling to obtain confidence intervals and p-values [24, 18]. In the post model selection literature, the recent work of [2] proposed the POSI approach, a correction to the usual t-test confidence intervals that controls the familywise error rate for all parameters in any possible submodel. The POSI methodology is extremely computationally intensive and currently only applicable for p ≤ 30.\n\nA separate line of work establishes the asymptotic normality of a corrected estimator obtained by \"inverting\" the KKT conditions [22, 25, 10]. 
The corrected estimator b̂ has the form b̂ = β̂ + λΘ̂ẑ, where ẑ is a subgradient of the penalty at β̂ and Θ̂ is an approximate inverse to the Gram matrix X^T X. The main drawbacks to this approach are 1) the confidence intervals are valid only when the M-estimator is consistent, and thus require restricted eigenvalue conditions on X, 2) obtaining Θ̂ is usually much more expensive than obtaining β̂, and 3) the method is specific to regularized estimators, and does not extend to marginal screening, forward stepwise, and other variable selection methods.\n\nMost closely related to our work is the \"condition on selection\" framework laid out in [12] for the Lasso. Our work extends this methodology to other variable selection methods such as marginal screening, marginal screening followed by the Lasso (marginal screening+Lasso) and orthogonal matching pursuit. The primary contribution of this work is the observation that many model selection methods, including marginal screening and the Lasso, lead to \"selection events\" that can be represented as a set of constraints on the response variable y. By conditioning on the selection event, we can characterize the exact distribution of η^T y. This paper focuses on marginal screening, since it is the simplest of variable selection methods, and thus the applicability of the \"condition on selection event\" framework is most transparent. However, this framework is not limited to marginal screening and can be applied to a wide class of model selection procedures including greedy algorithms such as orthogonal matching pursuit. 
We discuss some of these possible extensions in Section 8, but leave a thorough investigation to future work.\n\nA remarkable aspect of our work is that we only assume X is in general position, and the test is exact, meaning the distributional results hold even in finite samples. In particular, we make no assumptions on n and p, which is unusual in high-dimensional statistics [4]. Furthermore, the computational requirements of our test are negligible compared to computing the linear regression coefficients.\n\n3 Marginal Screening\n\nLet X ∈ R^{n×p} be the design matrix, y ∈ R^n the response variable, and assume the model y_i = µ(x_i) + ε_i, ε ∼ N(0, σ²I). We will assume that X is in general position and has unit norm columns. The marginal screening algorithm chooses the k variables with highest absolute dot product with y, and then fits a linear model over those k variables; the estimate β̂ is computed via Algorithm 1. We will assume k ≤ min(n, p).\n\nAlgorithm 1 Marginal screening algorithm\n1: Input: Design matrix X, response y, and model size k.\n2: Compute |X^T y|.\n3: Let Ŝ be the index set of the k largest entries of |X^T y|.\n4: Compute β̂_Ŝ = (X_Ŝ^T X_Ŝ)^{-1} X_Ŝ^T y.\n\nFor any fixed subset of variables S, the distribution of β̂_S = (X_S^T X_S)^{-1} X_S^T y is\n\nβ̂_S ∼ N((X_S^T X_S)^{-1} X_S^T µ, σ²(X_S^T X_S)^{-1}).   (3)\n
We will use the notation β*_{j∈S} := (β*_S)_j, where j is indexing a variable in the set S. The z-test intervals for a regression coefficient are\n\nC(α, j, S) := [β̂_{j∈S} − σ z_{1−α/2} √((X_S^T X_S)^{-1})_{jj},  β̂_{j∈S} + σ z_{1−α/2} √((X_S^T X_S)^{-1})_{jj}],   (4)\n\nand each interval has 1 − α coverage, meaning Pr(β*_{j∈S} ∈ C(α, j, S)) = 1 − α. However, if Ŝ is chosen using a model selection procedure that depends on y, the distributional result (3) no longer holds and the z-test intervals will not cover at the 1 − α level: Pr(β*_{j∈Ŝ} ∈ C(α, j, Ŝ)) < 1 − α.\n\n3.1 Failure of z-test confidence intervals\n\nWe will illustrate empirically that the z-test intervals do not cover at 1 − α when Ŝ is chosen by marginal screening in Algorithm 1. For this experiment we generated X from a standard normal with n = 20 and p = 200. The signal vector is 2-sparse with β⁰_1 = β⁰_2 = SNR, y = Xβ⁰ + ε, and ε ∼ N(0, I). The confidence intervals were constructed for the k = 2 variables selected by the marginal screening algorithm. The z-test intervals were constructed via (4) with α = .1, and the adjusted intervals were constructed using Algorithm 2. The results are described in Figure 1.\n\n4 Representing the selection event\n\nSince Equation (3) does not hold for a selected Ŝ when the selection procedure depends on y, the z-test intervals are not valid. Our strategy will be to understand the conditional distribution of y\n\nFigure 1: Plots of the coverage proportion across a range of SNR (log-scale). 
We see that the coverage proportion of the z intervals can be far below the nominal level of 1 − α = .9, even at SNR = 5. The adjusted intervals always have coverage proportion .9. Each point represents 500 independent trials.\n\nand contrasts (linear functions of y) η^T y, then construct inference conditional on the selection event Ê. We will use Ê(y) to represent a random variable, and E to represent an element of the range of Ê(y). In the case of marginal screening, the selection event Ê(y) corresponds to the set of selected variables Ŝ and signs s:\n\nÊ(y) = {y : sign(x_i^T y) x_i^T y > ±x_j^T y for all i ∈ Ŝ and j ∈ Ŝ^c}\n= {y : ŝ_i x_i^T y > ±x_j^T y and ŝ_i x_i^T y ≥ 0 for all i ∈ Ŝ and j ∈ Ŝ^c}\n= {y : A(Ŝ, ŝ)y ≤ b(Ŝ, ŝ)}   (5)\n\nfor some matrix A(Ŝ, ŝ) and vector b(Ŝ, ŝ)¹. We will use the selection event Ê and the selected variables/signs pair (Ŝ, ŝ) interchangeably since they are in bijection.\n\nThe space R^n is partitioned by the selection events, R^n = ⊔_{(S,s)} {y : A(S, s)y ≤ b(S, s)}². The vector y can be decomposed with respect to the partition as follows: y = Σ_{S,s} y · 1(A(S, s)y ≤ b(S, s)).\n\nTheorem 4.1. The distribution of y conditional on the selection event is a constrained Gaussian,\n\ny | {Ê(y) = E} =_d z | {A(S, s)z ≤ b},  z ∼ N(µ, σ²I).\n\nProof. The event E is in bijection with a pair (S, s), and y is unconditionally Gaussian. 
Thus the conditional y | {A(S, s)y ≤ b(S, s)} is a Gaussian constrained to the set {A(S, s)y ≤ b(S, s)}.\n\n5 Truncated Gaussian test\n\nThis section summarizes the recent tools developed in [12] for testing contrasts³ η^T y of a constrained Gaussian y. The results are stated without proof; the proofs can be found in [12]. The primary result is a one-dimensional pivotal quantity for η^T µ. This pivot relies on characterizing the distribution of η^T y as a truncated normal. The key step in deriving this pivot is the following lemma:\n\nLemma 5.1. The conditioning set can be rewritten in terms of η^T y as follows:\n\n{Ay ≤ b} = {V^-(y) ≤ η^T y ≤ V^+(y), V^0(y) ≥ 0}\n\nwhere\n\nα = AΣη / (η^T Ση)   (6)\nV^- = V^-(y) = max_{j: α_j < 0} (b_j − (Ay)_j + α_j η^T y) / α_j   (7)\nV^+ = V^+(y) = min_{j: α_j > 0} (b_j − (Ay)_j + α_j η^T y) / α_j   (8)\nV^0 = V^0(y) = min_{j: α_j = 0} (b_j − (Ay)_j).   (9)\n\nMoreover, (V^+, V^-, V^0) are independent of η^T y.\n\n¹b can be taken to be 0 for marginal screening, but this extra generality is needed for other model selection methods.\n²It is also possible to use a coarser partition, where each element of the partition only corresponds to a subset of variables S. See [12] for details.\n³A contrast of y is a linear function of the form η^T y.\n\nTheorem 5.2. 
Let Φ(x) denote the CDF of a N(0, 1) random variable, and let F^{[a,b]}_{µ,σ²} denote the CDF of TN(µ, σ, a, b), i.e.:\n\nF^{[a,b]}_{µ,σ²}(x) = (Φ((x − µ)/σ) − Φ((a − µ)/σ)) / (Φ((b − µ)/σ) − Φ((a − µ)/σ)).   (10)\n\nThen F^{[V^-,V^+]}_{η^T µ, η^T Ση}(η^T y) is a pivotal quantity, conditional on {Ay ≤ b}:\n\nF^{[V^-,V^+]}_{η^T µ, η^T Ση}(η^T y) | {Ay ≤ b} ∼ Unif(0, 1),   (11)\n\nwhere V^- and V^+ are defined in (7) and (8).\n\nFigure 2: Histogram and qq plot of F^{[V^-,V^+]}_{η^T µ, η^T Ση}(η^T y) where y is a constrained Gaussian. The distribution is very close to Unif(0, 1), which is in agreement with Theorem 5.2.\n\n6 Inference for marginal screening\n\nIn this section, we apply the theory summarized in Sections 4 and 5 to marginal screening. In particular, we will construct confidence intervals for the selected variables.\n\nTo summarize the developments so far, recall that our model (1) says that y ∼ N(µ, σ²I). The distribution of interest is y | {Ê(y) = E}, and by Theorem 4.1, this is equivalent to z | {A(S, s)z ≤ b(S, s)}, where z ∼ N(µ, σ²I). By applying Theorem 5.2, we obtain the pivotal quantity\n\nF^{[V^-,V^+]}_{η^T µ, σ²‖η‖²}(η^T y) | {Ê(y) = E} ∼ Unif(0, 1)   (12)\n\nfor any η, where V^- and V^+ are defined in (7) and (8).\n\nIn this section, we describe how to form confidence intervals for the components of β*_Ŝ = (X_Ŝ^T X_Ŝ)^{-1} X_Ŝ^T µ. 
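As a minimal end-to-end sketch of Sections 4 through 6 (our own code, not the authors'; Φ is evaluated via math.erf, and µ = 0 so the true parameter η^T µ is 0), one can build the constraints of (5) for a marginal-screening run, compute V^- and V^+ from (7) and (8), and evaluate the pivot (12):

```python
import math
import numpy as np

def Phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

rng = np.random.default_rng(0)
n, p, k, sigma = 25, 50, 3, 1.0
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)        # unit-norm columns, as assumed
y = rng.standard_normal(n)            # mu = 0, so eta^T mu = 0

# Algorithm 1, then the selection event (5) as rows a with a^T y <= 0.
S = np.sort(np.argsort(np.abs(X.T @ y))[-k:])
s = np.sign(X[:, S].T @ y)
rows = []
for si, i in zip(s, S):
    v = si * X[:, i]
    for j in np.setdiff1d(np.arange(p), S):
        rows.extend([X[:, j] - v, -X[:, j] - v])   # +/- x_j^T y <= s_i x_i^T y
    rows.append(-v)                                # s_i x_i^T y >= 0
A, b = np.array(rows), np.zeros(len(rows))

# Contrast for the first selected coefficient, as in (13).
XS = X[:, S]
eta = (np.linalg.inv(XS.T @ XS) @ XS.T)[0]

# (6)-(8) with Sigma = sigma^2 I (sigma^2 cancels in alpha).
alpha_vec = A @ eta / (eta @ eta)
with np.errstate(divide="ignore", invalid="ignore"):
    ratios = (b - A @ y + alpha_vec * (eta @ y)) / alpha_vec
Vm = ratios[alpha_vec < 0].max() if np.any(alpha_vec < 0) else -np.inf
Vp = ratios[alpha_vec > 0].min() if np.any(alpha_vec > 0) else np.inf

# Pivot (12): F^{[Vm,Vp]}_{0, sigma^2 ||eta||^2}(eta^T y), Unif(0,1) under mu = 0.
sd = sigma * np.linalg.norm(eta)
u = (Phi((eta @ y) / sd) - Phi(Vm / sd)) / (Phi(Vp / sd) - Phi(Vm / sd))
```

Repeating this over many draws of y and histogramming u reproduces the uniform behavior shown in Figure 2.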
The best linear predictor of µ that uses only the selected variables is β*_Ŝ = (X_Ŝ^T X_Ŝ)^{-1} X_Ŝ^T µ, and β̂_Ŝ = (X_Ŝ^T X_Ŝ)^{-1} X_Ŝ^T y is an unbiased estimate of β*_Ŝ. If we choose\n\nη_j = ((X_Ŝ^T X_Ŝ)^{-1} X_Ŝ^T)^T e_j,   (13)\n\nthen η_j^T µ = β*_{j∈Ŝ}, so the above framework provides a method for inference about the jth variable in the model Ŝ.\n\n6.1 Confidence intervals for selected variables\n\nNext, we discuss how to obtain confidence intervals for β*_{j∈Ŝ}. The standard way to obtain an interval is to invert a pivotal quantity [5]. In other words, since\n\nPr(α/2 ≤ F^{[V^-,V^+]}_{β*_{j∈Ŝ}, σ²‖η_j‖²}(η_j^T y) ≤ 1 − α/2 | {Ê = E}) = 1 − α,\n\none can define a (1 − α) (conditional) confidence interval for β*_{j∈Ŝ} as\n\n{x : α/2 ≤ F^{[V^-,V^+]}_{x, σ²‖η_j‖²}(η_j^T y) ≤ 1 − α/2}.   (14)\n\nIn fact, F is monotone decreasing in x, so to find its endpoints, one need only solve for the root of a smooth one-dimensional function. The monotonicity is a consequence of the fact that the truncated Gaussian distribution is a natural exponential family and hence has monotone likelihood ratio in µ [17].\n\nWe now formalize the above observations in the following result, an immediate consequence of Theorem 5.2.\n\nCorollary 6.1. 
Let η_j be defined as in (13), and let L_α = L_α(η_j, (Ŝ, ŝ)) and U_α = U_α(η_j, (Ŝ, ŝ)) be the (unique) values satisfying\n\nF^{[V^-,V^+]}_{L_α, σ²‖η_j‖²}(η_j^T y) = 1 − α/2,  F^{[V^-,V^+]}_{U_α, σ²‖η_j‖²}(η_j^T y) = α/2.   (15)\n\nThen [L_α, U_α] is a (1 − α) confidence interval for β*_{j∈Ŝ}, conditional on Ê:\n\nPr(β*_{j∈Ŝ} ∈ [L_α, U_α] | {Ê = E}) = 1 − α.   (16)\n\nProof. The confidence region of β*_{j∈Ŝ} is the set of β_j such that the test of H_0: β*_{j∈Ŝ} = β_j accepts at the 1 − α level. The function F^{[V^-,V^+]}_{x, σ²‖η_j‖²}(η_j^T y) is monotone in x, so solving for L_α and U_α identifies the most extreme values where H_0 is still accepted. This gives a 1 − α confidence interval.\n\nNext, we establish the unconditional coverage of the constructed confidence intervals and the false coverage rate (FCR) control [1].\n\nCorollary 6.2. For each j ∈ Ŝ,\n\nPr(β*_{j∈Ŝ} ∈ [L^j_α, U^j_α]) = 1 − α.   (17)\n\nFurthermore, the FCR of the intervals {[L^j_α, U^j_α]}_{j∈Ŝ} is α.\n\nProof. By (16), the conditional coverage of the confidence intervals is 1 − α. The coverage holds for every element of the partition {Ê(y) = E}, so\n\nPr(β*_{j∈Ŝ} ∈ [L_α, U_α]) = Σ_E Pr(β*_{j∈Ŝ} ∈ [L_α, U_α] | {Ê = E}) Pr(Ê = E) = Σ_E (1 − α) Pr(Ê = E) = 1 − α.\n\nRemark 6.3. We would like to emphasize that the previous corollary shows that the constructed confidence intervals are unconditionally valid. The conditioning on the selection event Ê(y) = E was only for mathematical convenience to work out the exact pivot. Unlike standard z-test intervals, the coverage target, β*_{j∈Ŝ}, and the interval [L_α, U_α] are random. In a typical confidence interval only the interval is random; however, in the post-selection inference setting, the selected model is random, so both the interval and the target are necessarily random [2].\n\nWe summarize the algorithm for selecting variables and constructing confidence intervals below.\n\nAlgorithm 2 Confidence intervals for selected variables\n1: Input: Design matrix X, response y, model size k.\n2: Use Algorithm 1 to select a subset of variables Ŝ and signs ŝ = sign(X_Ŝ^T y).\n3: Let A = A(Ŝ, ŝ) and b = b(Ŝ, ŝ) using (5). Let η_j = (X_Ŝ^T)^† e_j.\n4: Solve for L^j_α and U^j_α using Equation (15), where V^- and V^+ are computed via (7) and (8) using the A, b, and η_j previously defined.\n5: Output: Return the intervals [L^j_α, U^j_α] for j ∈ Ŝ.\n\n7 Experiments\n\nIn Figure 1, we have already seen that the confidence intervals constructed using Algorithm 2 have exactly 1 − α coverage proportion. 
In this section, we perform two experiments on real data where the linear model does not hold, the noise is not Gaussian, and the noise variance is unknown.\n\n7.1 Diabetes dataset\n\nThe diabetes dataset contains n = 442 diabetes patients measured on p = 10 baseline variables [6]. The baseline variables are age, sex, body mass index, average blood pressure, and six blood serum measurements, and the response y is a quantitative measure of disease progression measured one year after the baseline. Since the noise variance σ² is unknown, we estimate it by σ̂² = ‖y − ŷ‖²/(n − p), where ŷ = Xβ̂ and β̂ = (X^T X)^{-1} X^T y.\n\nFigure 3: Plot of 1 − α vs the coverage proportion for the diabetes dataset. The nominal curve is the line y = x. The coverage proportion of the adjusted intervals agrees with the nominal coverage level, but the z-test coverage proportion is strictly below the nominal level. The adjusted intervals perform well, despite the noise being non-Gaussian and σ² unknown.\n\nFor each trial we generated new responses ỹ = Xβ̂ + ε̃, where ε̃ is bootstrapped from the residuals r_i = y_i − ŷ_i. We used marginal screening to select k = 2 variables, and then fit linear regression on the selected variables. The adjusted confidence intervals were constructed using Algorithm 2 with the estimated σ̂². The nominal coverage level was varied across 1 − α ∈ {.5, .6, .7, .8, .9, .95, .99}. From Figure 3, we observe that the adjusted intervals always cover at the nominal level, whereas the z-test intervals are always below. The experiment was repeated 2000 times.\n\n7.2 Riboflavin dataset\n\nOur second data example is a high-throughput genomic dataset about riboflavin (vitamin B2) production rate [3]. 
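The residual bootstrap used in the diabetes experiment can be sketched as follows (our own minimal illustration; each trial would then be followed by marginal screening and Algorithm 2 on the resampled response):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 442, 10
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * rng.standard_normal(n) ** 3    # non-Gaussian noise

# Full least squares fit and variance estimate sigma^2 = ||y - y_hat||^2 / (n - p).
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
resid = y - y_hat
sigma2_hat = np.sum(resid ** 2) / (n - p)

# One bootstrap trial: resample residuals with replacement, rebuild the response.
eps_tilde = rng.choice(resid, size=n, replace=True)
y_tilde = y_hat + eps_tilde
```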
There are p = 4088 variables which measure the log expression level of different genes, a single real-valued response y which measures the logarithm of the riboflavin production rate, and n = 71 samples. We first estimate σ² using cross-validation [20], and apply marginal screening with k = 30, as chosen in [3]. We then use Algorithm 2 to identify genes significant at α = 10%. The genes identified as significant were YCKE_at, YOAB_at, and YURQ_at. After using Bonferroni to control for FWER, we found YOAB_at remained significant.\n\n8 Extensions\n\nThe purpose of this section is to illustrate the broad applicability of the condition on selection framework. For expository purposes, we focused the paper on marginal screening, where the framework is particularly easy to understand. In the rest of this section, we show how to apply the framework to marginal screening+Lasso and orthogonal matching pursuit. This is a non-exhaustive list of selection procedures where the condition on selection framework is applicable, but we hope this incomplete list emphasizes the ease of constructing tests and confidence intervals post model selection via conditioning.\n\n8.1 Marginal screening + Lasso\n\nThe marginal screening+Lasso procedure was introduced in [7] as a variable selection method for the ultra-high-dimensional setting of p = O(e^{n^k}). Fan et al. [7] recommend applying the marginal screening algorithm with k = n − 1, followed by the Lasso on the selected variables. This is a two-stage procedure, so to properly account for the selection we must encode the selection event of marginal screening followed by the Lasso. This can be done by representing the two-stage selection as a single event. 
Let (Ŝ_m, ŝ_m) be the variables and signs selected by marginal screening, and (Ŝ_L, ẑ_L) be the variables and signs selected by the Lasso [12]. In Proposition 2.2 of [12], it is shown how to encode the Lasso selection event (Ŝ_L, ẑ_L) as a set of constraints {A_L y ≤ b_L}⁴, and in Section 4 we showed how to encode the marginal screening selection event (Ŝ_m, ŝ_m) as a set of constraints {A_m y ≤ b_m}. Thus the selection event of marginal screening+Lasso can be encoded as {A_L y ≤ b_L, A_m y ≤ b_m}. Using these constraints, the hypothesis tests and confidence intervals described in Algorithm 2 are valid for marginal screening+Lasso.\n\n8.2 Orthogonal Matching Pursuit\n\nOrthogonal matching pursuit (OMP) is a commonly used variable selection method. At each iteration, OMP selects the variable most correlated with the residual r, and then recomputes the residual as the residual of a least squares fit on the selected variables. Similar to Section 4, we can represent the OMP selection event as a set of linear constraints on y:\n\nÊ(y) = {y : sign(x_{p_i}^T r_i) x_{p_i}^T r_i > ±x_j^T r_i, for all j ≠ p_i and all i ∈ [k]}\n= {y : ŝ_i x_{p_i}^T (I − X_{Ŝ_{i−1}} X_{Ŝ_{i−1}}^†) y > ±x_j^T (I − X_{Ŝ_{i−1}} X_{Ŝ_{i−1}}^†) y and ŝ_i x_{p_i}^T (I − X_{Ŝ_{i−1}} X_{Ŝ_{i−1}}^†) y > 0, for all j ≠ p_i and all i ∈ [k]}.\n\nThe selection event encodes that OMP selected a certain variable and the sign of the correlation of that variable with the residual, at steps 1 to k. 
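The OMP selection event above can be checked numerically in the same way as (5) (a sketch under our own naming; R plays the role of the residual projector I − X_{Ŝ_{i−1}} X_{Ŝ_{i−1}}^†):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 30, 15, 4
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
y = rng.standard_normal(n)

selected, rows = [], []
for i in range(k):
    # Residual projector for the currently selected set (identity at step 0).
    if selected:
        XS = X[:, selected]
        R = np.eye(n) - XS @ np.linalg.pinv(XS)
    else:
        R = np.eye(n)
    corr = X.T @ (R @ y)
    p_i = int(np.argmax(np.abs(corr)))
    s_i = np.sign(corr[p_i])
    v = s_i * (R @ X[:, p_i])      # R is symmetric, so v^T y = s_i x_{p_i}^T R y
    for j in range(p):
        if j != p_i:
            w = R @ X[:, j]
            rows.extend([w - v, -w - v])   # +/- x_j^T R y <= s_i x_{p_i}^T R y
    rows.append(-v)                        # s_i x_{p_i}^T R y >= 0
    selected.append(p_i)

A = np.array(rows)
```

The rows of A record the whole OMP path, which is why the constraint set also encodes the order of selection.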
The primary difference between the OMP selection event and the marginal screening selection event is that the OMP event also describes the order in which the variables were chosen.\n\n9 Conclusion\n\nDue to the increasing size of datasets, marginal screening has become an important method for fast variable selection. However, the standard hypothesis tests and confidence intervals used in linear regression are invalid after using marginal screening to select important variables. We have described a method to form confidence intervals after marginal screening. The condition on selection framework is not restricted to marginal screening, and also applies to OMP and marginal screening+Lasso. The supplementary material also discusses the framework applied to non-negative least squares.\n\n⁴The Lasso selection event is with respect to the Lasso optimization problem after marginal screening.\n\nReferences\n\n[1] Yoav Benjamini and Daniel Yekutieli. False discovery rate-adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71–81, 2005.\n\n[2] Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. Annals of Statistics, 41(2):802–837, 2013.\n\n[3] Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1, 2014.\n\n[4] Peter Bühlmann and Sara van de Geer. Statistics for High-Dimensional Data. Springer, 2011.\n\n[5] George Casella and Roger L. Berger. Statistical Inference, volume 70. Duxbury Press, Belmont, CA, 1990.\n\n[6] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.\n\n[7] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, 2008.\n\n[8] Christopher R. Genovese, Jiashun Jin, Larry Wasserman, and Zhigang Yao. A comparison of the lasso and marginal regression. Journal of Machine Learning Research, 13:2107–2143, 2012.\n\n[9] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.\n\n[10] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. arXiv preprint arXiv:1306.3171, 2013.\n\n[11] Jason Lee, Yuekai Sun, and Jonathan E. Taylor. On model selection consistency of penalized M-estimators: a geometric theory. In Advances in Neural Information Processing Systems, pages 342–350, 2013.\n\n[12] Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact inference after model selection via the lasso. arXiv preprint arXiv:1311.6238, 2013.\n\n[13] Hannes Leeb and Benedikt M. Pötscher. The finite-sample distribution of post-model-selection estimators and uniform versus nonuniform approximations. Econometric Theory, 19(1):100–142, 2003.\n\n[14] Hannes Leeb and Benedikt M. Pötscher. Model selection and inference: Facts and fiction. Econometric Theory, 21(1):21–59, 2005.\n\n[15] Hannes Leeb and Benedikt M. Pötscher. Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, pages 2554–2591, 2006.\n\n[16] Jeff Leek. Prediction: the lasso vs just using the top 10 predictors. http://simplystatistics.tumblr.com/post/18132467723/prediction-the-lasso-vs-just-using-the-top-10.\n\n[17] Erich L. Lehmann and Joseph P. Romano. Testing Statistical Hypotheses. Springer, 3rd edition, 2005.\n\n[18] Nicolai Meinshausen, Lukas Meier, and Peter Bühlmann. 
P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488), 2009.\n\n[19] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.\n\n[20] Stephen Reid, Robert Tibshirani, and Jerome Friedman. A study of error variance estimation in lasso regression. arXiv preprint arXiv:1311.5274, 2013.\n\n[21] Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98(9):5116–5121, 2001.\n\n[22] Sara van de Geer, Peter Bühlmann, and Ya'acov Ritov. On asymptotically optimal confidence regions and tests for high-dimensional models. arXiv preprint arXiv:1303.0518, 2013.\n\n[23] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.\n\n[24] Larry Wasserman and Kathryn Roeder. High dimensional variable selection. Annals of Statistics, 37(5A):2178, 2009.\n\n[25] Cun-Hui Zhang and S. Zhang. Confidence intervals for low-dimensional parameters with high-dimensional data. arXiv preprint arXiv:1110.2563, 2011.\n\n[26] Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.", "award": [], "sourceid": 122, "authors": [{"given_name": "Jason", "family_name": "Lee", "institution": "Stanford University"}, {"given_name": "Jonathan", "family_name": "Taylor", "institution": "Stanford University"}]}