{"title": "Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1187, "page_last": 1195, "abstract": "Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the `uncertainty' associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a `de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, proofs are remarkably simple. We test our method on a diabetes prediction problem.", "full_text": "Con\ufb01dence Intervals and Hypothesis Testing for\n\nHigh-Dimensional Statistical Models\n\nAdel Javanmard\nStanford University\nStanford, CA 94305\n\nadelj@stanford.edu\n\nAndrea Montanari\nStanford University\nStanford, CA 94305\n\nmontanar@stanford.edu\n\nAbstract\n\nFitting high-dimensional statistical models often requires the use of non-linear\nparameter estimation procedures. As a consequence, it is generally impossible to\nobtain an exact characterization of the probability distribution of the parameter\nestimates. 
This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance such as confidence intervals or p-values.

We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power.

Our approach is based on constructing a 'de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, proofs are remarkably simple. We test our method on a diabetes prediction problem.

1 Introduction

It is widely recognized that modern statistical problems are increasingly high-dimensional, i.e. they require the estimation of more parameters than the number of observations/examples. Examples abound, from signal processing [16], to genomics [21], collaborative filtering [12], and so on. A number of successful estimation techniques have been developed over the last ten years to tackle these problems. A widely applicable approach consists in optimizing a suitably regularized likelihood function. Such estimators are, by necessity, non-linear and non-explicit (they are solutions of certain optimization problems).

The use of non-linear parameter estimators comes at a price. In general, it is impossible to characterize the distribution of the estimator. This situation is very different from the one of classical statistics, in which either exact characterizations are available, or asymptotically exact ones can be derived from large sample theory [26].
This has an important and very concrete consequence. In classical statistics, generic and well-accepted procedures are available for characterizing the uncertainty associated to a certain parameter estimate in terms of confidence intervals or p-values [28, 14]. However, no analogous procedures exist in high-dimensional statistics.

In this paper we develop a computationally efficient procedure for constructing confidence intervals and p-values for a broad class of high-dimensional regression problems. The salient features of our procedure are: (i) Our approach guarantees nearly optimal confidence interval sizes and testing power. (ii) It is the first one that achieves this goal under essentially no assumptions on the population covariance matrix of the parameters, beyond the standard conditions for high-dimensional consistency. (iii) It allows for a streamlined analysis with respect to earlier work in the same area. (iv) Our method has a natural generalization to non-linear regression models (e.g. logistic regression, see Section 4).

Table 1: Unbiased estimator for θ0 in high-dimensional linear regression models
Input: measurement vector Y, design matrix X, parameter γ.
Output: unbiased estimator θ̂^u.
1: Set λ = σγ, and let θ̂^n be the Lasso estimator as per Eq. (3).
2: Set Σ̂ ≡ (XᵀX)/n.
3: for i = 1, 2, ..., p do
4:   Let m_i be a solution of the convex program:
       minimize m ᵀ Σ̂ m   subject to ‖Σ̂ m − e_i‖_∞ ≤ γ .   (4)
5: Set M = (m_1, ..., m_p)ᵀ. If any of the above problems is not feasible, then set M = I_{p×p}.
6: Define the estimator θ̂^u as follows:
       θ̂^u = θ̂^n(λ) + (1/n) M Xᵀ(Y − X θ̂^n(λ)) .   (5)
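The de-biasing step of Table 1 can be sketched in a few lines. A minimal sketch, with one assumption flagged up front: the paper constructs each row of M via the convex program (4), whereas here we use M = Σ̂⁻¹ as an illustrative stand-in (only valid when n > p so that Σ̂ is invertible). The function name `debias_lasso` is ours, not the paper's. With this choice of M, a single de-biasing step maps any pilot estimate onto the ordinary least squares solution, which makes the bias-removal effect easy to verify.

```python
import numpy as np

def debias_lasso(X, y, theta_pilot, M=None):
    """De-biasing step of Table 1, Eq. (5):
    theta_u = theta_pilot + (1/n) M X^T (y - X theta_pilot).

    The paper chooses the rows of M by the convex program (4); as an
    illustrative stand-in we use M = (X^T X / n)^{-1}, which requires n > p.
    """
    n = X.shape[0]
    if M is None:
        Sigma_hat = X.T @ X / n          # empirical covariance (step 2 of Table 1)
        M = np.linalg.inv(Sigma_hat)     # stand-in for program (4); assumption
    return theta_pilot + M @ X.T @ (y - X @ theta_pilot) / n

# Demo: with M = Sigma_hat^{-1}, one de-biasing step of ANY pilot estimate
# lands exactly on the OLS solution, illustrating how the correction term
# removes the bias of the pilot.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
theta0 = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ theta0 + 0.1 * rng.standard_normal(n)
pilot = np.zeros(p)                      # deliberately crude, biased pilot
theta_u = debias_lasso(X, y, pilot)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(theta_u, ols)
```

In the p > n regime of the paper, Σ̂ is singular and this stand-in is unavailable; that is exactly why program (4) replaces the inverse there.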
We provide heuristic and numerical evidence supporting this generalization, deferring a rigorous study to future work.

For the sake of clarity, we will focus our presentation on the case of linear regression, deferring the generalization to Section 4. In the random design model, we are given n i.i.d. pairs (Y_1, X_1), (Y_2, X_2), ..., (Y_n, X_n), with vectors X_i ∈ ℝ^p and response variables Y_i given by

  Y_i = ⟨θ0, X_i⟩ + W_i ,   W_i ∼ N(0, σ²) .   (1)

Here ⟨·, ·⟩ is the standard scalar product in ℝ^p. In matrix form, letting Y = (Y_1, ..., Y_n)ᵀ and denoting by X the design matrix with rows X_1ᵀ, ..., X_nᵀ, we have

  Y = X θ0 + W ,   W ∼ N(0, σ² I_{n×n}) .   (2)

The goal is to estimate the unknown (but fixed) vector of parameters θ0 ∈ ℝ^p.

In the classic setting, n ≫ p, and the estimation method of choice is ordinary least squares, yielding θ̂^OLS = (XᵀX)⁻¹XᵀY. In particular, θ̂^OLS is Gaussian with mean θ0 and covariance σ²(XᵀX)⁻¹. This directly allows one to construct confidence intervals¹.

In the high-dimensional setting where p > n, the matrix (XᵀX) is rank deficient and one has to resort to biased estimators. A particularly successful approach is the Lasso [24, 7], which promotes sparse reconstructions through an ℓ1 penalty:

  θ̂^n(Y, X; λ) ≡ arg min_{θ∈ℝ^p} { (1/2n) ‖Y − Xθ‖₂² + λ‖θ‖₁ } .   (3)

In case the right-hand side has more than one minimizer, one of them can be selected arbitrarily for our purposes.
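For intuition about what the ℓ1 penalty in Eq. (3) does, it helps to look at the one special case where the Lasso has a closed form: when the design satisfies XᵀX/n = I, the objective separates across coordinates and the minimizer is coordinatewise soft-thresholding of the correlations z = Xᵀy/n. This is a sketch for intuition only (the function name is ours; general designs require an iterative solver):

```python
import numpy as np

def lasso_orthonormal(X, y, lam):
    """Lasso of Eq. (3) in the special case X^T X / n = I:
    the minimizer is soft-thresholding of z = X^T y / n at level lam.
    Sketch for intuition only; general X needs an iterative solver."""
    n = X.shape[0]
    z = X.T @ y / n
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(1)
n, p = 100, 4
# Build a design with X^T X / n = I by rescaling an orthonormal basis.
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
theta0 = np.array([1.5, 0.0, -0.8, 0.0])
y = X @ theta0 + 0.05 * rng.standard_normal(n)
theta_hat = lasso_orthonormal(X, y, lam=0.2)
# Soft-thresholding shrinks every coordinate toward zero (the bias the
# de-biasing step later removes) and zeroes out the small coordinates.
assert np.all(np.abs(theta_hat) <= np.abs(X.T @ y / n) + 1e-12)
assert np.allclose(theta_hat[[1, 3]], 0.0)
```

The shrinkage by λ visible here is exactly the bias that the correction term in Eq. (5) compensates for.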
We will often omit the arguments Y, X, as they are clear from the context. We denote by S ≡ supp(θ0) ⊆ [p] the support of θ0, and let s0 ≡ |S|. A copious theoretical literature [6, 2, 4] shows that, under suitable assumptions on X, the Lasso is nearly as accurate as if the support S were known a priori. Namely, for n = Ω(s0 log p), we have ‖θ̂^n − θ0‖₂² = O(s0 σ² (log p)/n). These remarkable properties come at a price: deriving an exact characterization for the distribution of θ̂^n is not tractable in general, and hence there is no simple procedure to construct confidence intervals and p-values. In order to overcome this challenge, we construct a de-biased estimator from the Lasso solution. The de-biased estimator is given by the simple formula θ̂^u = θ̂^n + (1/n) M Xᵀ(Y − X θ̂^n), as in Eq. (5). The basic intuition is that Xᵀ(Y − X θ̂^n)/(nλ) is a subgradient of the ℓ1 norm at the Lasso solution θ̂^n. By adding a term proportional to this subgradient, our procedure compensates for the bias introduced by the ℓ1 penalty in the Lasso.

¹For instance, letting Q ≡ (XᵀX/n)⁻¹, [θ̂^OLS_i − 1.96σ√(Q_ii/n), θ̂^OLS_i + 1.96σ√(Q_ii/n)] is a 95% confidence interval [28].
We will prove in Section 2 that θ̂^u is approximately Gaussian, with mean θ0 and covariance σ²(M Σ̂ M)/n, where Σ̂ = (XᵀX)/n is the empirical covariance of the feature vectors. This result allows us to construct confidence intervals and p-values in complete analogy with classical statistics procedures. For instance, letting Q ≡ M Σ̂ M, [θ̂^u_i − 1.96σ√(Q_ii/n), θ̂^u_i + 1.96σ√(Q_ii/n)] is a 95% confidence interval. The size of this interval is of order σ/√n, which is the optimal (minimum) one, i.e. the same that would have been obtained by knowing a priori the support of θ0. In practice the noise standard deviation is not known, but σ can be replaced by any consistent estimator σ̂.

A key role is played by the matrix M ∈ ℝ^{p×p}, whose function is to 'decorrelate' the columns of X. We propose here to construct M by solving a convex program that aims at optimizing two objectives. On one hand, we try to control |M Σ̂ − I|_∞ (here and below |·|_∞ denotes the entrywise ℓ∞ norm), which –as shown in Theorem 2.1– controls the non-Gaussianity and bias of θ̂^u. On the other, we minimize [M Σ̂ M]_{i,i} for each i ∈ [p], which controls the variance of θ̂^u_i.

The idea of constructing a de-biased estimator of the form θ̂^u = θ̂^n + (1/n) M Xᵀ(Y − X θ̂^n) was used by Javanmard and Montanari in [10], who suggested the choice M = cΣ⁻¹, with Σ = E{X_1 X_1ᵀ} the population covariance matrix and c a positive constant. A simple estimator for Σ was proposed for sparse covariances, but asymptotic validity and optimality were proven only for uncorrelated Gaussian designs (i.e. Gaussian X with Σ = I). Van de Geer, Bühlmann and Ritov [25] used the same construction with M an estimate of Σ⁻¹, which is appropriate for sparse inverse covariances. These authors prove semi-parametric optimality in a non-asymptotic setting, provided the sample size is at least n = Ω(s0² log p). In this paper, we do not assume any sparsity constraint on Σ⁻¹, but still require the sample size scaling n = Ω(s0² log p).
We refer to a forthcoming publication wherein the condition on the sample size scaling is relaxed [11].

From a technical point of view, our proof starts from a simple decomposition of the de-biased estimator θ̂^u into a Gaussian part and an error term, already used in [25]. However –departing radically from earlier work– we realize that M need not be a good estimator of Σ⁻¹ in order for the de-biasing procedure to work. We instead set M so as to minimize the error term and the variance of the Gaussian term. As a consequence of this choice, our approach applies to general covariance structures Σ. By contrast, earlier approaches applied only to sparse Σ, as in [10], or sparse Σ⁻¹, as in [25]. The only assumptions we make on Σ are the standard compatibility conditions required for high-dimensional consistency [4]. We refer the reader to the long version of the paper [9] for the proofs of our main results and the technical steps.

1.1 Further related work

The theoretical literature on high-dimensional statistical models is vast and rapidly growing. Restricting ourselves to linear regression, earlier work investigated prediction error [8], model selection properties [17, 31, 27, 5], and ℓ2 consistency [6, 2]. Of necessity, we do not provide a complete set of references, and instead refer the reader to [4] for an in-depth introduction to this area.

The problem of quantifying statistical significance in high-dimensional parameter estimation is, by comparison, far less understood. Zhang and Zhang [30], and Bühlmann [3] proposed hypothesis testing procedures under restricted eigenvalue or compatibility conditions [4]. These methods are however effective only for detecting very large coefficients. Namely, they both require |θ_{0,i}| ≥ c max{σ s0 log p/n, σ/√n}, which is √s0 larger than the ideal detection level [10].
In other words, in order for the coefficient θ_{0,i} to be detectable with appreciable probability, it needs to be larger than the overall ℓ2 error, rather than the ℓ2 error per coordinate.

Lockhart et al. [15] develop a test for the hypothesis that a newly added coefficient along the Lasso regularization path is irrelevant. This however does not allow one to test arbitrary coefficients at a given value of λ, which is instead the problem addressed in this paper. It further assumes that the current Lasso support contains the actual support supp(θ0) and that the latter has bounded size. Finally, resampling methods for hypothesis testing were studied in [29, 18, 19].

1.2 Preliminaries and notations

We let Σ̂ ≡ XᵀX/n be the sample covariance matrix. For p > n, Σ̂ is always singular. However, we may require Σ̂ to be nonsingular for a restricted set of directions.

Definition 1.1. For a matrix Σ̂ and a set S of size s0, the compatibility condition is met if, for some φ0 > 0 and all θ satisfying ‖θ_{Sᶜ}‖₁ ≤ 3‖θ_S‖₁, it holds that

  ‖θ_S‖₁² ≤ (s0/φ0²) θᵀ Σ̂ θ .

Definition 1.2.
The sub-gaussian norm of a random variable X, denoted by ‖X‖_{ψ2}, is defined as

  ‖X‖_{ψ2} = sup_{p≥1} p^{−1/2} (E|X|^p)^{1/p} .

The sub-gaussian norm of a random vector X ∈ ℝ^n is defined as ‖X‖_{ψ2} = sup_{x∈S^{n−1}} ‖⟨X, x⟩‖_{ψ2}. Further, for a random variable X, its sub-exponential norm, denoted by ‖X‖_{ψ1}, is defined as

  ‖X‖_{ψ1} = sup_{p≥1} p^{−1} (E|X|^p)^{1/p} .

For a matrix A and sets of indices I, J, we let A_{I,J} denote the submatrix formed by the rows in I and columns in J. Also, A_{I,·} (resp. A_{·,I}) denotes the submatrix containing just the rows (resp. columns) in I. Likewise, for a vector v, v_I is the restriction of v to indices in I. We use the shorthand A⁻¹_{I,J} = (A⁻¹)_{I,J}. In particular, A⁻¹_{i,i} = (A⁻¹)_{i,i}. The maximum and minimum singular values of A are respectively denoted by σ_max(A) and σ_min(A). We write ‖v‖_p for the standard ℓp norm of a vector v, and ‖v‖₀ for the number of nonzero entries of v. For a matrix A, ‖A‖_p is the ℓp operator norm, and |A|_p is the elementwise ℓp norm, i.e., |A|_p = (Σ_{i,j} |A_{ij}|^p)^{1/p}. For an integer p ≥ 1, we let [p] ≡ {1, ..., p}. For a vector v, supp(v) represents the positions of nonzero entries of v. Throughout, "with high probability" (w.h.p.) means with probability converging to one as n → ∞, and Φ(x) ≡ ∫_{−∞}^x e^{−t²/2} dt/√(2π) denotes the CDF of the standard normal distribution.

2 A de-biased estimator for θ0

Theorem 2.1. Consider the linear model (1) and let θ̂^u be defined as per Eq. (5).
Then,

  √n (θ̂^u − θ0) = Z + Δ ,   Z|X ∼ N(0, σ² M Σ̂ Mᵀ) ,   Δ = √n (M Σ̂ − I)(θ0 − θ̂^n) .

Further, suppose that σ_min(Σ) = Ω(1) and σ_max(Σ) = O(1). In addition, assume the rows of the whitened matrix XΣ^{−1/2} are sub-gaussian, i.e., ‖Σ^{−1/2}X_1‖_{ψ2} = O(1). Let E be the event that the compatibility condition holds for Σ̂, and max_{i∈[p]} Σ̂_{i,i} = O(1). Then, using γ = O(√((log p)/n)) (see inputs in Table 1), the following holds true: on the event E, w.h.p., ‖Δ‖_∞ = O(s0 log p/√n).

Note that the compatibility condition (and hence the event E) holds w.h.p. for random design matrices of a general nature. In fact, [22] shows that under some general assumptions, the compatibility condition on Σ implies a similar condition on Σ̂, w.h.p., when n is sufficiently large. Bounds on the variances [M Σ̂ Mᵀ]_{ii} will be given in Section 3.2. Finally, the claim of Theorem 2.1 does not rely on the specific choice of the objective function in optimization problem (4) and only uses the optimization constraints.

Remark 2.2. Theorem 2.1 does not make any assumption about the parameter vector θ0. If we further assume that the support size s0 satisfies s0 = o(√n/log p), then we have ‖Δ‖_∞ = o(1), w.h.p. Hence, θ̂^u is an asymptotically unbiased estimator for θ0.

3 Statistical inference

A direct application of Theorem 2.1 is to derive confidence intervals and statistical hypothesis tests for high-dimensional models.
Throughout, we make the sparsity assumption s0 = o(√n/log p).

3.1 Confidence intervals

We first show that the variances of the variables Z_j|X are Ω(1).

Lemma 3.1. Let M = (m_1, ..., m_p)ᵀ be the matrix with rows m_iᵀ obtained by solving the convex program (4). Then for all i ∈ [p], [M Σ̂ Mᵀ]_{i,i} ≥ (1 − γ)²/Σ̂_{i,i}.

By Remark 2.2 and Lemma 3.1, we have

  P{ √n (θ̂^u_i − θ_{0,i}) / (σ [M Σ̂ Mᵀ]_{i,i}^{1/2}) ≤ x | X } = Φ(x) + o(1) ,   ∀x ∈ ℝ .   (6)

Since the limiting probability is independent of X, Eq. (6) also holds unconditionally for random design X.

For constructing confidence intervals, a consistent estimate of σ is needed. To this end, we use the scaled Lasso [23] given by

  {θ̂^n(λ), σ̂} ≡ arg min_{θ∈ℝ^p, σ>0} { (1/2σn) ‖Y − Xθ‖₂² + σ/2 + λ‖θ‖₁ } .

This is a joint convex minimization which provides an estimate of the noise level in addition to an estimate of θ0. We use λ = c1√((log p)/n), which yields a consistent estimate σ̂ under the assumptions of Theorem 2.1 (cf. [23]). We hence obtain the following.

Corollary 3.2.
Let

  δ(α, n) = Φ⁻¹(1 − α/2) σ̂ n^{−1/2} [M Σ̂ Mᵀ]_{i,i}^{1/2} .   (7)

Then I_i = [θ̂^u_i − δ(α, n), θ̂^u_i + δ(α, n)] is an asymptotic two-sided confidence interval for θ_{0,i} with significance α.

Notice that the same corollary applies to any other consistent estimator σ̂ of the noise standard deviation.

3.2 Hypothesis testing

An important advantage of sparse linear regression models is that they provide parsimonious explanations of the data in terms of a small number of covariates. The easiest way to select the 'active' covariates is to choose the indexes i for which θ̂^n_i ≠ 0. This approach however does not provide a measure of statistical significance for the finding that the coefficient is non-zero.

More precisely, we are interested in testing an individual null hypothesis H_{0,i}: θ_{0,i} = 0 versus the alternative H_{A,i}: θ_{0,i} ≠ 0, and assigning p-values for these tests. We construct a p-value P_i for the test H_{0,i} as follows:

  P_i = 2 ( 1 − Φ( √n |θ̂^u_i| / (σ̂ [M Σ̂ Mᵀ]_{i,i}^{1/2}) ) ) .   (8)

The decision rule is then based on the p-value P_i:

  T_{i,X}(y) = 1 if P_i ≤ α (reject H_{0,i}), and 0 otherwise (accept H_{0,i}).   (9)

We measure the quality of the test T_{i,X}(y) in terms of its significance level α_i and statistical power 1 − β_i. Here α_i is the probability of type I error (i.e. of a false positive at i) and β_i is the probability of type II error (i.e. of a false negative at i).

Note that it is important to consider the tradeoff between statistical significance and power.
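The p-value of Eq. (8) and the decision rule (9) are straightforward to compute once θ̂^u, σ̂, and the diagonal of M Σ̂ Mᵀ are available. A minimal sketch (function names are ours; `v_ii` stands for the diagonal entry [M Σ̂ Mᵀ]_{i,i}):

```python
import math

def p_value(theta_u_i, sigma_hat, v_ii, n):
    """Two-sided p-value of Eq. (8):
    P_i = 2 * (1 - Phi(sqrt(n) * |theta_u_i| / (sigma_hat * sqrt(v_ii)))),
    with v_ii = [M Sigma_hat M^T]_{i,i}."""
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    z = math.sqrt(n) * abs(theta_u_i) / (sigma_hat * math.sqrt(v_ii))
    return 2.0 * (1.0 - Phi(z))

def reject(p_i, alpha=0.05):
    """Decision rule (9): reject H_{0,i} iff P_i <= alpha."""
    return p_i <= alpha

# A de-biased estimate of exactly 0 gives p = 1 (never rejected), and
# larger standardized estimates give smaller p-values.
assert p_value(0.0, 1.0, 1.0, 100) == 1.0
assert p_value(0.5, 1.0, 1.0, 100) < p_value(0.1, 1.0, 1.0, 100)
assert reject(p_value(0.5, 1.0, 1.0, 100))
```

Note that the standardization √n θ̂^u_i / (σ̂ √v_ii) is exactly the quantity whose conditional distribution Eq. (6) shows to be asymptotically standard normal under the null.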
Indeed, any significance level α can be achieved by randomly rejecting H_{0,i} with probability α. This test achieves power 1 − β = α. Further note that, without further assumptions, no nontrivial power can be achieved: choosing θ_{0,i} ≠ 0 arbitrarily close to zero, H_{0,i} becomes indistinguishable from its alternative. We will therefore assume that, whenever θ_{0,i} ≠ 0, we have |θ_{0,i}| > μ as well. We take a minimax perspective and require the test to behave uniformly well over s0-sparse vectors. Formally, for μ > 0 and i ∈ [p], define

  α_i(n) ≡ sup{ P_{θ0}(T_{i,X}(y) = 1) : θ0 ∈ ℝ^p, ‖θ0‖₀ ≤ s0(n), θ_{0,i} = 0 } ,
  β_i(n; μ) ≡ sup{ P_{θ0}(T_{i,X}(y) = 0) : θ0 ∈ ℝ^p, ‖θ0‖₀ ≤ s0(n), |θ_{0,i}| ≥ μ } .

Here, we made the dependence on n explicit. Also, P_θ(·) is the probability induced by the random design X and the noise realization w, given the fixed parameter vector θ. Our next theorem establishes bounds on α_i(n) and β_i(n; μ).

Theorem 3.3. Consider a random design model that satisfies the conditions of Theorem 2.1.
Under the sparsity assumption s0 = o(√n/log p), the following holds true for any fixed sequence of integers i = i(n):

  lim_{n→∞} α_i(n) ≤ α ,   (10)

  lim_{n→∞} (1 − β_i(μ; n)) / (1 − β*_i(μ; n)) ≥ 1 ,   1 − β*_i(μ; n) ≡ G( α, √n μ / (σ [Σ⁻¹_{i,i}]^{1/2}) ) ,   (11)

where, for α ∈ [0, 1] and u ∈ ℝ₊, the function G(α, u) is defined as follows:

  G(α, u) = 2 − Φ(Φ⁻¹(1 − α/2) + u) − Φ(Φ⁻¹(1 − α/2) − u) .

It is easy to see that, for any α > 0, u ↦ G(α, u) is continuous and monotone increasing. Moreover, G(α, 0) = α, which is the trivial power obtained by randomly rejecting H_{0,i} with probability α. As μ deviates from zero, we obtain nontrivial power. Notice that in order to achieve a specific power β > α, our scheme requires μ = O(σ/√n), since Σ⁻¹_{i,i} ≤ σ_max(Σ⁻¹) ≤ (σ_min(Σ))⁻¹ = O(1).

3.2.1 Minimax optimality

The authors of [10] prove an upper bound for the minimax power of tests with a given significance level α, under Gaussian random design models (see Theorem 2.6 therein). This bound is obtained by considering an oracle test that knows all the active parameters except i, i.e., S\{i}.
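The properties of G(α, u) claimed above are easy to check numerically. A minimal sketch using only the standard library (the bisection inverse `Phi_inv` is our own helper, not part of the paper):

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(q, lo=-10.0, hi=10.0):
    """Inverse standard normal CDF by bisection (simple helper, assumption)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def G(alpha, u):
    """Power curve of Theorem 3.3:
    G(alpha, u) = 2 - Phi(Phi^{-1}(1 - alpha/2) + u) - Phi(Phi^{-1}(1 - alpha/2) - u)."""
    c = Phi_inv(1.0 - alpha / 2.0)
    return 2.0 - Phi(c + u) - Phi(c - u)

# G(alpha, 0) = alpha (the trivial power), G is increasing in u,
# and G approaches 1 as the standardized signal strength u grows.
assert abs(G(0.05, 0.0) - 0.05) < 1e-6
assert G(0.05, 1.0) < G(0.05, 3.0)
assert G(0.05, 8.0) > 0.999
```

The argument u plays the role of √n μ / (σ [Σ⁻¹_{i,i}]^{1/2}) in Eq. (11), so the monotonicity in u is exactly the statement that power grows with the standardized effect size.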
To state the bound formally, for a set S ⊆ [p] and i ∈ Sᶜ, define Σ_{i|S} ≡ Σ_{i,i} − Σ_{i,S}(Σ_{S,S})⁻¹Σ_{S,i}, and let

  η_{Σ,s0} ≡ min{ Σ_{i|S} : i ∈ [p], S ⊆ [p]\{i}, |S| < s0 } .

In the asymptotic regime and under our sparsity assumption s0 = o(√n/log p), the bound of [10] simplifies to

  lim_{n→∞} (1 − β^opt_i(α; μ)) / G(α, μ/σ_eff) ≤ 1 ,   σ_eff = σ / √(n η_{Σ,s0}) .   (12)

Using the bound (12) and specializing the result of Theorem 3.3 to Gaussian design X, we obtain that our scheme achieves a nearly optimal minimax power for a broad class of covariance matrices. We can compare our test to the optimal test by computing how much μ must be increased in order to achieve the minimax optimal power. It follows from the above that μ must be increased to μ̃, with the two differing by a factor

  μ̃/μ = √(Σ⁻¹_{i,i} η_{Σ,s0}) ≤ √(Σ⁻¹_{i,i} Σ_{i,i}) ≤ √(σ_max(Σ)/σ_min(Σ)) ,

since Σ⁻¹_{i,i} ≤ (σ_min(Σ))⁻¹, and Σ_{i|S} ≤ Σ_{i,i} ≤ σ_max(Σ) due to Σ_{S,S} ≻ 0.

4 General regularized maximum likelihood

In this section, we generalize our results beyond the linear regression model to general regularized maximum likelihood. Here, we only describe the de-biasing method.
Formal guarantees can be obtained under suitable restricted strong convexity assumptions [20] and will be the object of a forthcoming publication.

For univariate Y and vector X ∈ ℝ^p, we let {f_θ(Y|X)}_{θ∈ℝ^p} be a family of conditional probability densities parameterized by θ, absolutely continuous with respect to a common measure ω(dy), and suppose that the gradient ∇_θ f_θ(Y|X) exists and is square integrable. As for linear regression, we assume that the data are given by n i.i.d. pairs (X_1, Y_1), ..., (X_n, Y_n), where, conditional on X_i, the response variable Y_i is distributed as

  Y_i ∼ f_{θ0}(·|X_i) ,

for some parameter vector θ0 ∈ ℝ^p. Let L_i(θ) = −log f_θ(Y_i|X_i) be the normalized negative log-likelihood corresponding to the observed pair (Y_i, X_i), and define L(θ) = (1/n) Σ_{i=1}^n L_i(θ). We consider the following regularized estimator:

  θ̂ ≡ arg min_{θ∈ℝ^p} { L(θ) + λ R(θ) } ,   (13)

where λ is a regularization parameter and R: ℝ^p → ℝ₊ is a norm.

We next generalize the definition of Σ̂. Let I_i(θ) be the Fisher information of f_θ(Y|X_i), defined as

  I_i(θ) ≡ E[ (∇_θ log f_θ(Y|X_i))(∇_θ log f_θ(Y|X_i))ᵀ | X_i ] = −E[ ∇²_θ log f_θ(Y|X_i) | X_i ] ,

where the second identity holds under suitable regularity conditions [13], and ∇²_θ denotes the Hessian operator.
We assume E[I_i(θ)] ≻ 0 and define Σ̂ ∈ ℝ^{p×p} as follows:

  Σ̂ ≡ (1/n) Σ_{i=1}^n I_i(θ̂) .   (14)

Note that (in general) Σ̂ depends on θ̂. Finally, the de-biased estimator θ̂^u is defined by θ̂^u ≡ θ̂ − M ∇_θ L(θ̂), with M given again by the solution of the convex program (4), with the definition of Σ̂ provided here. Notice that this construction is analogous to the one in [25] (although the present setting is somewhat more general), with the crucial difference of the construction of M.

A simple heuristic derivation of this method is the following. By Taylor expansion of L(θ̂) around θ0 we get θ̂^u ≈ θ̂ − M∇_θL(θ0) − M∇²_θL(θ0)(θ̂ − θ0). Approximating ∇²_θL(θ0) ≈ Σ̂ (which amounts to taking expectation with respect to the response variables y_i), we get θ̂^u − θ0 ≈ −M∇_θL(θ0) − [M Σ̂ − I](θ̂ − θ0). Conditionally on {X_i}_{1≤i≤n}, the first term has zero expectation and covariance M Σ̂ Mᵀ/n. Further, by the central limit theorem, its low-dimensional marginals are approximately Gaussian.
The bias term −[M Σ̂ − I](θ̂ − θ0) can be bounded as in the linear regression case, building on the fact that M is chosen such that |M Σ̂ − I|_∞ ≤ γ.

Similar to the linear case, an asymptotic two-sided confidence interval for θ_{0,i} (with significance α) is given by I_i = [θ̂^u_i − δ(α, n), θ̂^u_i + δ(α, n)], where

  δ(α, n) = Φ⁻¹(1 − α/2) n^{−1/2} [M Σ̂ Mᵀ]_{i,i}^{1/2} .

Moreover, an asymptotically valid p-value P_i for testing the null hypothesis H_{0,i} is constructed as:

  P_i = 2 ( 1 − Φ( √n |θ̂^u_i| / [M Σ̂ Mᵀ]_{i,i}^{1/2} ) ) .

In the next section, we apply the general approach presented here to ℓ1-regularized logistic regression. In this case, the binary response Y_i ∈ {0, 1} is distributed as Y_i ∼ f_{θ0}(·|X_i), where f_{θ0}(1|x) = (1 + e^{−⟨x,θ0⟩})⁻¹ and f_{θ0}(0|x) = (1 + e^{⟨x,θ0⟩})⁻¹. It is easy to see that in this case I_i(θ̂) = q̂_i(1 − q̂_i) X_i X_iᵀ, with q̂_i = (1 + e^{−⟨θ̂,X_i⟩})⁻¹, and thus

  Σ̂ = (1/n) Σ_{i=1}^n q̂_i(1 − q̂_i) X_i X_iᵀ .

5 Diabetes data example

We consider the problem of estimating relevant attributes in predicting type-2 diabetes. We evaluate the performance of our hypothesis testing procedure on the Practice Fusion Diabetes dataset [1]. This dataset contains de-identified medical records of 10000 patients, including information on diagnoses, medications, lab results, allergies, immunizations, and vital signs.
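The logistic-regression instance of Eq. (14) is a weighted Gram matrix and takes only a few lines to compute. A minimal sketch (the function name is ours):

```python
import numpy as np

def logistic_fisher_sigma(X, theta_hat):
    """Empirical Fisher information for l1-regularized logistic regression,
    as in Eq. (14): Sigma_hat = (1/n) sum_i q_i (1 - q_i) X_i X_i^T,
    with q_i = (1 + exp(-<theta_hat, X_i>))^{-1}."""
    q = 1.0 / (1.0 + np.exp(-X @ theta_hat))
    w = q * (1.0 - q)                        # per-sample weights q_i (1 - q_i)
    return (X * w[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.standard_normal((n, p))
theta_hat = np.array([0.5, -1.0, 0.0])
Sigma_hat = logistic_fisher_sigma(X, theta_hat)
# Sigma_hat is symmetric positive semidefinite: the weights lie in (0, 1/4].
assert np.allclose(Sigma_hat, Sigma_hat.T)
assert np.all(np.linalg.eigvalsh(Sigma_hat) >= -1e-10)
```

Since the weights q̂_i(1 − q̂_i) are at most 1/4, the logistic Σ̂ is a downweighted version of the linear-regression Gram matrix XᵀX/n, with the heaviest weight on samples near the decision boundary.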
From this dataset, we extract p numerical attributes, resulting in a sparse design matrix X_tot ∈ ℝ^{n_tot×p}, with n_tot = 10000 and p = 805 (only 5.9% of the entries of X_tot are non-zero). Next, we standardize the columns of X to have mean 0 and variance 1. The attributes consist of: (i) Transcript records: year of birth, gender and BMI; (ii) Diagnosis information: 80 binary attributes corresponding to different ICD-9 codes; (iii) Medications: 80 binary attributes indicating the use of different medications; (iv) Lab results: for 70 lab test observations, we include attributes indicating patients tested, abnormality flags, and the observed values. We also bin the observed values into 10 quantiles and make 10 binary attributes indicating the bin of the corresponding observed value.

[Figure 1: (a) Q-Q plot of Z. (b) Normalized histograms of Z̃_S (in red) and Z̃_{Sᶜ} (in blue) for one realization. No fitting of the Gaussian mean and variance was done in panel (b).]

We consider the logistic model described in the previous section, with a binary response identifying the patients diagnosed with type-2 diabetes. For the sake of performance evaluation, we need to know the true significant attributes. Letting L(θ) be the logistic loss corresponding to the design X_tot and response vector Y ∈ ℝ^{n_tot}, we take θ0 as the minimizer of L(θ). Notice that here we are in the low-dimensional regime (n_tot > p) and no regularization is needed. Next, we take random subsamples of size n = 500 from the patients, and examine the performance of our testing procedure. The experiment is done using the glmnet package in R, which fits the entire path of the regularized logistic estimator.
We then choose the value of $\lambda$ that yields maximum AUC (area under the ROC curve), approximated by 5-fold cross-validation.

Results: Type I errors and powers of our decision rule (9) are computed by comparing to $\theta_0$. The average error and power (over 20 random subsamples), at significance level $\alpha = 0.05$, are 0.0319 and 0.818, respectively. Let $Z = (z_i)_{i=1}^p$ denote the vector with $z_i \equiv \sqrt{n}(\widehat{\theta}^u_i - \theta_{0,i})/[M\widehat{\Sigma}M^{\mathsf T}]^{1/2}_{i,i}$. In Fig. 1(a), sample quantiles of $Z$ are depicted versus the quantiles of a standard normal distribution. The plot clearly corroborates our theoretical result regarding the limiting distribution of $Z$. In order to build further intuition about the proposed p-values, let $\tilde{Z} = (\tilde{z}_i)_{i=1}^p$ be the vector with $\tilde{z}_i \equiv \sqrt{n}\,\widehat{\theta}^u_i/[M\widehat{\Sigma}M^{\mathsf T}]^{1/2}_{i,i}$. In Fig. 1(b), we plot the normalized histograms of $\tilde{Z}_S$ (in red) and $\tilde{Z}_{S^c}$ (in blue). As the plot showcases, $\tilde{Z}_{S^c}$ has a roughly standard normal distribution, while the entries of $\tilde{Z}_S$ appear as distinguishable spikes. The entries of $\tilde{Z}_S$ with larger magnitudes are easier to mark off from the tail of the normal distribution.

References

[1] Practice Fusion Diabetes Classification. http://www.kaggle.com/c/pf2012-diabetes, 2012. Kaggle competition dataset.

[2] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37:1705–1732, 2009.

[3] P. Bühlmann. Statistical significance in high-dimensional linear models. arXiv:1202.1377, 2012.

[4] P. Bühlmann and S. van de Geer. Statistics for high-dimensional data. Springer-Verlag, 2011.

[5] E. Candès and Y. Plan. Near-ideal model selection by ℓ1 minimization.
The Annals of Statistics, 37(5A):2145–2177, 2009.

[6] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. on Inform. Theory, 51:4203–4215, 2005.

[7] S. Chen and D. Donoho. Examples of basis pursuit. In Proceedings of Wavelet Applications in Signal and Image Processing III, San Diego, CA, 1995.

[8] E. Greenshtein and Y. Ritov. Persistence in high-dimensional predictor selection and the virtue of over-parametrization. Bernoulli, 10:971–988, 2004.

[9] A. Javanmard and A. Montanari. Confidence Intervals and Hypothesis Testing for High-Dimensional Regression. arXiv:1306.3171, 2013.

[10] A. Javanmard and A. Montanari. Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. arXiv:1301.4240, 2013.

[11] A. Javanmard and A. Montanari. Nearly Optimal Sample Size in Hypothesis Testing for High-Dimensional Regression. arXiv:1311.0274, 2013.

[12] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, August 2009.

[13] E. Lehmann and G. Casella. Theory of point estimation. Springer, 2nd edition, 1998.

[14] E. Lehmann and J. Romano. Testing statistical hypotheses. Springer, 2005.

[15] R. Lockhart, J. Taylor, R. Tibshirani, and R. Tibshirani. A significance test for the lasso. arXiv:1301.7161, 2013.

[16] M. Lustig, D. Donoho, J. Santos, and J. Pauly. Compressed sensing MRI. IEEE Signal Processing Magazine, 25:72–82, 2008.

[17] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Ann. Statist., 34:1436–1462, 2006.

[18] N. Meinshausen and P. Bühlmann. Stability selection. J. R. Statist. Soc. B, 72:417–473, 2010.

[19] J. Minnier, L. Tian, and T. Cai.
A perturbation method for inference on regularized regression estimates. Journal of the American Statistical Association, 106(496), 2011.

[20] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

[21] J. Peng, J. Zhu, A. Bergamaschi, W. Han, D.-Y. Noh, J. R. Pollack, and P. Wang. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. The Annals of Applied Statistics, 4(1):53–77, 2010.

[22] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434–3447, 2013.

[23] T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.

[24] R. Tibshirani. Regression shrinkage and selection with the Lasso. J. Royal. Statist. Soc. B, 58:267–288, 1996.

[25] S. van de Geer, P. Bühlmann, and Y. Ritov. On asymptotically optimal confidence regions and tests for high-dimensional models. arXiv:1303.0518, 2013.

[26] A. W. van der Vaart. Asymptotic statistics, volume 3. Cambridge University Press, 2000.

[27] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming. IEEE Trans. on Inform. Theory, 55:2183–2202, 2009.

[28] L. Wasserman. All of statistics: a concise course in statistical inference. Springer Verlag, 2004.

[29] L. Wasserman and K. Roeder. High dimensional variable selection. Annals of Statistics, 37(5A):2178, 2009.

[30] C.-H. Zhang and S. Zhang. Confidence Intervals for Low-Dimensional Parameters in High-Dimensional Linear Models. arXiv:1110.2563, 2011.

[31] P. Zhao and B. Yu. On model selection consistency of Lasso.
The Journal of Machine Learning Research, 7:2541–2563, 2006.