{"title": "First order expansion of convex regularized estimators", "book": "Advances in Neural Information Processing Systems", "page_first": 3462, "page_last": 3473, "abstract": "We consider first order expansions of convex penalized estimators in\nhigh-dimensional regression problems with random designs. Our setting includes\nlinear regression and logistic regression as special cases.  For a given\npenalty function $h$ and the corresponding penalized estimator $\\hbeta$, we\nconstruct a quantity $\\eta$, the first order expansion of $\\hbeta$, such that\nthe distance between $\\hbeta$ and $\\eta$ is an order of magnitude smaller than\nthe estimation error $\\|\\hat{\\beta} - \\beta^*\\|$.  In this sense, the first\norder expansion $\\eta$ can be thought of as a generalization of influence\nfunctions from the mathematical statistics literature to regularized estimators\nin high-dimensions.  Such first order expansion implies that the risk of\n$\\hat{\\beta}$ is asymptotically the same as the risk of $\\eta$ which leads to a\nprecise characterization of the MSE of $\\hbeta$; this characterization takes a\nparticularly simple form for isotropic design.  Such first order expansion also\nleads to inference results based on $\\hat{\\beta}$.  We provide sufficient\nconditions for the existence of such first order expansion for three\nregularizers: the Lasso in its constrained form, the lasso in its penalized\nform, and the Group-Lasso.  The results apply to general loss functions under\nsome conditions and those conditions are satisfied for the squared loss in\nlinear regression and for the logistic loss in the logistic model.", "full_text": "First order expansion of convex regularized\n\nestimators\n\nPierre C Bellec,\n\nDepartment of Statistics,\n\nRutgers University,\n\n501 Hill Center,\n\nPiscataway, NJ 08854, USA.\n\npierre.bellec@rutgers.edu\n\nArun K Kuchibhotla,\nDepartment of Statistics,\n\nThe Wharton School,\n\nUniversity of Pennsylvania,\nPhiladelphia, PA 19104, USA.\n\narunku@upenn.edu\n\nAbstract\n\nWe consider \ufb01rst order expansions of convex penalized estimators in high-\ndimensional regression problems with random designs. Our setting includes linear\nregression and logistic regression as special cases. For a given penalty function h\nand the corresponding penalized estimator \u02c6\u03b2, we construct a quantity \u03b7, the \ufb01rst or-\nder expansion of \u02c6\u03b2, such that the distance between \u02c6\u03b2 and \u03b7 is an order of magnitude\nsmaller than the estimation error (cid:107) \u02c6\u03b2\u2212 \u03b2\u2217(cid:107). In this sense, the \ufb01rst order expansion \u03b7\ncan be thought of as a generalization of in\ufb02uence functions from the mathematical\nstatistics literature to regularized estimators in high-dimensions. Such \ufb01rst order\nexpansion implies that the risk of \u02c6\u03b2 is asymptotically the same as the risk of \u03b7\nwhich leads to a precise characterization of the MSE of \u02c6\u03b2; this characterization\ntakes a particularly simple form for isotropic design. Such \ufb01rst order expansion\nalso leads to inference results based on \u02c6\u03b2. We provide suf\ufb01cient conditions for\nthe existence of such \ufb01rst order expansion for three regularizers: the Lasso in its\nconstrained form, the lasso in its penalized form, and the Group-Lasso. 
The results apply to general loss functions under some conditions, and those conditions are satisfied for the squared loss in linear regression and for the logistic loss in the logistic model.

Introduction. We consider learning problems where one observes observations $(X_1, Y_1), ..., (X_n, Y_n)$ with responses $Y_i$ and feature vectors $X_i \in \mathbb{R}^p$. The literature of the past two decades has demonstrated the great success of regularized estimators that are commonly defined as solutions to regularized optimization problems of the form

$\hat\beta = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; n^{-1}\sum_{i=1}^n \ell(Y_i, X_i^T\beta) + h(\beta)$,   (1)

where $\ell(\cdot,\cdot)$ is referred to as the loss (e.g. squared loss, logistic loss) and $h : \mathbb{R}^p \to \mathbb{R}$ is a regularization penalty (e.g. the $\ell_1$-norm for the Lasso, the $\ell_{2,1}$-norm for the Group-Lasso). All tuning parameters are included in $h(\cdot)$. The performance of such regularized estimators is measured in terms of prediction error or in terms of estimation error $\|\hat\beta - \beta^*\|$ if the data comes from a model such as $Y = X\beta^* + \varepsilon$ for some noise random variable $\varepsilon$ in linear regression, or

$P(Y = 1|X = x) = 1/(1 + \exp(-x^T\beta^*)) = 1 - P(Y = 0|X = x)$

in logistic regression, where $\beta^*$ is the unknown coefficient vector. For instance, if $s = \|\beta^*\|_0$ is the sparsity of $\beta^*$ in the above model, and $(X_i, Y_i)_{i=1,...,n}$ are iid observations with the same distribution as $(X, Y)$, both the Lasso in linear regression and the logistic Lasso in logistic regression enjoy rate optimality: $\|\hat\beta - \beta^*\|^2 \lesssim s\log(ep/s)/n$; see [35, 1] or the proof of Proposition 3.4 in Appendix F for self-contained proofs. The latter estimation bound is optimal in a minimax sense and cannot be improved, and the minimax rate $s\log(ep/s)/n$ represents the scale below which uncertainty is unavoidable by information theoretic arguments; see for instance [36, Section 5].

We are interested in providing first order expansions of $\hat\beta$ at scales negligible compared to the minimax estimation rate, e.g. at scales negligible compared to $s\log(ep/s)/n$ in the aforementioned sparsity contexts. To be more precise, the results below will construct a random first order expansion $\eta$ such that $\eta$ is measurable w.r.t. a much smaller sigma algebra than that generated by $(X_i, Y_i)_{i=1,...,n}$, and

$\|\eta - \hat\beta\|_K^2 = o_p(1)\|\hat\beta - \beta^*\|_K^2$ for some norm $\|\cdot\|_K$ related to the problem at hand,   (2)

where $o_p(1)$ is a quantity that converges to 0 in probability. In other words, we provide a first-order expansion of $\hat\beta$ similar to an influence function expansion, cf. Section 1. This allows for understanding the bias and standard deviation of $\hat\beta$ at a finer scale than simply showing that $\hat\beta - \beta^*$ converges to zero at the minimax rate.
The present paper intends to answer the two questions below regarding such first order expansions.

(Q1) How to construct $\eta$ such that (2) holds for a given convex regularized estimator such as (1)?
(Q2) How are such first order expansions useful in high-dimensional learning problems where convex regularized estimators (1) are commonly used?

An expansion $\eta$ satisfying (2) is interesting in and of itself because it describes phenomena at a finer scale than most of the literature on high-dimensional problems, which focuses on minimax prediction and estimation bounds. More importantly, we will see in Section 4 that such first-order expansions lead to exact identities for the loss of estimators, and in Section 5 that such first-order expansions can be used for inference (i.e., uncertainty quantification) about the unknown coefficient vector $\beta^*$.

Notation. Throughout the paper, $C_1, C_2, C_3, ...$ denote positive absolute constants and we write $a \lesssim b$ if $a \le Cb$ for some absolute constant $C > 0$. The Euclidean norm in $\mathbb{R}^p$ or in $\mathbb{R}^n$ is denoted by $\|\cdot\|$. For any positive definite matrix $A$ with matrix square root $A^{1/2}$, we write $\|u\|_A = \|A^{1/2}u\|$. For matrices, $\|\cdot\|_{op}$ and $\|\cdot\|_F$ denote the operator norm and the Frobenius norm. For any real $a$, $a_+ = \max(0, a)$. If $S \subset \{1, ..., p\}$, $v \in \mathbb{R}^p$ and $M \in \mathbb{R}^{p \times p}$, then $v_S$ is the restriction $(v_j, j \in S)$ and $M_{S,S}$ is the square submatrix of $M$ made of entries indexed in $S \times S$.

1 Influence functions and Construction of $\eta$

To answer (Q1), we start with a recap of unregularized estimators that correspond to $h(\cdot) \equiv 0$, when $p$ is fixed as $n \to +\infty$. In this case, it is well known that certain smoothness assumptions on the loss such as twice differentiability [25, 19] or stochastic equicontinuity [42, 41] imply (for any norm, since all norms are equivalent in $\mathbb{R}^p$ for fixed $p$):

$\Big\|\hat\beta - \beta^* - \frac{1}{n}\sum_{i=1}^n \psi(X_i, Y_i)\Big\| = o_p(1)\|\hat\beta - \beta^*\| \iff \sqrt{n}(\hat\beta - \beta^*) = (1 + o_p(1))\frac{1}{\sqrt{n}}\sum_{i=1}^n \psi(X_i, Y_i)$,   (3)

for some target $\beta^*$ and a mean zero function $\psi(\cdot,\cdot)$ sometimes referred to as the influence function. See [25, Theorem 3.1], [42, Page 52], [41, Theorem 6.17], [19, Lemma 5.4] for details. In this case we can take $\eta = \beta^* + \sum_{i=1}^n \psi(X_i, Y_i)/n$ in (2). This representation allows us to claim asymptotic unbiasedness and fluctuations of order $n^{-1/2}$ for $\hat\beta$ around $\beta^*$. It also shows that the estimator $\hat\beta$ behaves like an average, and hence allows the transfer of results for averages (e.g., central limit theorems) to the study of $\hat\beta$ in terms of variance estimation, confidence intervals, hypothesis testing and the bootstrap.

A general study of such representations for regularized problems is lacking in the literature.
[23] is the first work that analyzed the linear regression Lasso when the number of covariates $p$ is fixed and does not change with the sample size $n$. In the more challenging regime where $p \ge n$, Theorem 5.1 of [22] provides a first order expansion allowing $p$ to diverge (almost exponentially) with $n$. In the present work, we simplify and present a unified derivation of such first order expansion results, generalizing [22, Theorem 5.1] beyond the squared loss, beyond the $\ell_1$ penalty and beyond certain assumptions of [22] on $E[X_i X_i^T]$. The derivation of (3) can be motivated by defining

$\tilde\eta := \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \Big[\ell(Y_i, X_i^\top\beta^*) + \ell'(Y_i, X_i^\top\beta^*)X_i^\top(\beta - \beta^*) + \frac{1}{2}\ell''(Y_i, X_i^\top\beta^*)\{X_i^\top(\beta - \beta^*)\}^2\Big] + h(\beta)$,   (4)

with $h(\cdot) \equiv 0$. Here and throughout, $\ell'(y, u)$ and $\ell''(y, u)$ represent the first and second partial derivatives of $\ell$ with respect to $u$. The right hand side of (4) (with $h(\cdot) \equiv 0$) is the quadratic approximation of $\sum_{i=1}^n \ell(Y_i, X_i^\top\beta)/n$ around $\beta = \beta^*$ (without the term independent of $\beta$). The final first order expansion $\eta$ is obtained by replacing the quadratic part of the approximation by its expectation as in the next display. Following the intuitive construction of $\tilde\eta$ for the unregularized problem, we construct a first order expansion for the regularized problem as

$\eta := \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; n^{-1}\sum_{i=1}^n \ell'(Y_i, X_i^\top\beta^*)X_i^\top(\beta - \beta^*) + \frac{1}{2}(\beta - \beta^*)^\top K(\beta - \beta^*) + h(\beta) = \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{2}\Big\|K^{1/2}\Big(\beta - \beta^* - n^{-1}\sum_{i=1}^n K^{-1}X_i\ell'(Y_i, X_i^\top\beta^*)\Big)\Big\|^2 + h(\beta)$,   (5)

where $K := n^{-1}\sum_{i=1}^n E\big[\ell''(Y_i, X_i^\top\beta^*)X_i X_i^\top\big]$. From this definition, we can write $\eta = \eta_K\big(\beta^* + n^{-1}\sum_{i=1}^n K^{-1}X_i\ell'(Y_i, X_i^\top\beta^*)\big)$ for a function $\eta_K(\cdot)$ (depending on $h(\cdot)$ and $K$). Our main results prove, under some mild assumptions, that

$\big\|\hat\beta - \eta_K\big(\beta^* + \tfrac{1}{n}\sum_{i=1}^n K^{-1}X_i\ell'(Y_i, X_i^\top\beta^*)\big)\big\|_K = o_p(1)\|\hat\beta - \beta^*\|_K$.

Comparing this with (3), we note that for the unregularized problem, $\eta_K(\beta) = \beta$ is the identity.
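Display (5) shows that $\eta$ depends on the data only through the point $z := \beta^* + n^{-1}\sum_{i=1}^n K^{-1}X_i\ell'(Y_i, X_i^\top\beta^*)$ and the matrix $K$, so it can be computed by any solver for penalized quadratics. Below is a minimal sketch (our own illustration, not the paper's code) for the $\ell_1$ penalty $h(\beta) = \lambda\|\beta\|_1$, using proximal gradient descent; the function names and the choice of solver are ours.

```python
# Sketch: compute eta of display (5) for h(b) = lam * ||b||_1 by ISTA on
# the penalized quadratic 0.5 * (b - z)^T K (b - z) + lam * ||b||_1,
# where z and K are as in (5) (assumed given here).
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def eta_l1(K, z, lam, n_iter=2000):
    step = 1.0 / np.linalg.eigvalsh(K).max()   # 1 / Lipschitz constant of the gradient
    b = z.copy()
    for _ in range(n_iter):
        b = soft_threshold(b - step * (K @ (b - z)), step * lam)
    return b
```

When $K = I_p$ the loop is unnecessary: $\eta = \mathrm{soft\_threshold}(z, \lambda)$ in closed form, which is the proximal-operator viewpoint developed in Section 4.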
2 Main Results: Approximation Theorem

We introduce the notion of Gaussian complexity for the following results. For any set $T \subset \mathbb{R}^p$ and a covariance matrix $\Sigma$, the Gaussian complexity of $T$ is given by

$\gamma(T, \Sigma) := E\Big[\sup_{u \in T} \frac{|g^\top\Sigma^{1/2}u|}{\|\Sigma^{1/2}u\|}\Big] = E\Big[\sup_{u \in T : \|\Sigma^{1/2}u\| = 1} |g^\top\Sigma^{1/2}u|\Big]$,   (6)

where the expectation is with respect to the standard normal vector $g \sim N(0, I_p)$. We also need the notion of $L$-subGaussianity. A random vector $X$ is said to be $L$-subGaussian with respect to a (positive definite) matrix $\Sigma$ if

$\forall u \in \mathbb{R}^p, \quad E[\exp(u^T X)] \le \exp(L^2\|u\|_\Sigma^2/2)$, where $\|u\|_\Sigma = \|\Sigma^{1/2}u\|$.   (7)

This implies $\sup_{u \in \mathbb{R}^p} P(|u^\top X| \ge t\|u\|_\Sigma) \le 2\exp(-t^2/(2L^2))$. Recall that the scaled norm $\|\cdot\|_K$ is defined by $\|u\|_K^2 = n^{-1}\sum_{i=1}^n E[\ell''(Y_i, X_i^\top\beta^*)(X_i^\top u)^2]$. Consider the following assumptions:

(A1) There exist constants $0 \le B, B_2, B_3 < \infty$ such that the loss satisfies, for all $u_1, u_2 \in \mathbb{R}$ and all $y$,

$\frac{|\ell''(y, u_1) - \ell''(y, u_2)|}{|u_1 - u_2|} \le B, \qquad |\ell''(y, u_1)| \le B_2, \qquad \sup_{u \in \mathbb{R}^p} \frac{\|\Sigma^{1/2}u\|^2}{\|K^{1/2}u\|^2} \le B_3$.   (8)

(A2) The observations $(X_1, Y_1), \ldots, (X_n, Y_n)$ are iid. Further, $X_1, \ldots, X_n$ are mean zero and $L$-subGaussian with respect to their covariance $\Sigma$, i.e., (7) holds.

Note that $L$ in (A2) is necessarily no smaller than one, i.e., $L \ge 1$. Define the error

$\mathcal{E} := \|\hat\beta - \beta^*\|_K + \|\eta - \beta^*\|_K$, where $\|\cdot\|_K$ is the norm $\|u\|_K = \|K^{1/2}u\|$.   (9)

The quantity $\mathcal{E}$ quantifies the error made by $\hat\beta$ and $\eta$ in estimating $\beta^*$ with respect to the norm $\|\cdot\|_K$. Bounds on $\|\hat\beta - \beta^*\|_K$ and $\|\eta - \beta^*\|_K$ follow from the existing literature; see [35] or Proposition 3.4 and its proof in Appendix F.

Theorem 2.1. Let $r_n := n^{-1/2}\gamma(T, \Sigma)$ and assume that $r_n \le 1$. Further assume that (A1) and (A2) hold. Then with probability at least $1 - 2e^{-C_4 n r_n^2} - 2e^{-C_5\log n}$ we have the following:

1. If $\{\hat\beta - \beta^*, \eta - \beta^*\} \subseteq T$ then $\|\hat\beta - \eta\|_K \lesssim LB_2B_3 r_n^{1/2}\mathcal{E} + B^{1/2}(B_3L)^{3/2}(1 + r_n^3\sqrt{n})\mathcal{E}^{3/2}$.
2. If $\{\hat\beta - \eta, \hat\beta - \beta^*, \eta - \beta^*\} \subseteq T$ then $\|\hat\beta - \eta\|_K \lesssim B_2B_3L^2 r_n\mathcal{E} + BB_3^{3/2}L^3(1 + r_n^3\sqrt{n})\mathcal{E}^2$.

Sets $T$ as in Theorem 2.1 are available in the literature for many convex penalties. In the following, we will exhibit such sets for the constrained Lasso, the penalized Lasso, and the Group-Lasso (with non-overlapping groups) under sharp conditions. We refer to [7] for the Slope penalty, and to Negahban et al. [35, Lemma 1] and van de Geer [39, Def. 4.4 and Theorem 4.1], where a set $T$ is presented for a general class of penalty functions including the nuclear norm and the Group-Lasso with overlapping groups.

Proofs of Theorem 2.1 and all following results are given in the supplement. An outline of the proof of Theorem 2.1 is given in Section 6. Although Theorem 2.1 is stated under assumption (A2), we present a deterministic version of the result (in Section 6) that replaces $r_n$ by suprema of different stochastic processes.
Squared loss in the linear model. Consider $\ell(y, u) = (y - u)^2/2$ and $n$ iid observations satisfying

$Y_i = X_i^T\beta^* + \varepsilon_i$, where $X_i$ is independent of $\varepsilon_i$ for $i = 1, \ldots, n$.   (10)

Then $K = \Sigma = E[X_1X_1^T]$ and the second derivative $\ell''$ is constant. Hence condition (8) is satisfied with $B = 0$ and $B_2 = B_3 = 1$. The conclusions of Theorem 2.1 can be rewritten as

$\{\hat\beta - \beta^*, \eta - \beta^*\} \subseteq T \;\Rightarrow\; \|\hat\beta - \eta\|_K \lesssim L r_n^{1/2}\mathcal{E}$,   (11)
$\{\hat\beta - \eta, \hat\beta - \beta^*, \eta - \beta^*\} \subseteq T \;\Rightarrow\; \|\hat\beta - \eta\|_K \lesssim L^2 r_n\mathcal{E}$,   (12)

where $\mathcal{E} = \|\Sigma^{1/2}(\hat\beta - \beta^*)\| + \|\Sigma^{1/2}(\eta - \beta^*)\|$. Since $r_n \le 1$ (and typically $r_n \to 0$ while $L$ stays bounded, as we will see in the examples below), the inequality in (12) is stronger than the inequality in (11). In the linear model, we thus refer to inequality (11) as the "slow rate" inequality, and to (12) as the "fast rate" one. The set $T$ encodes the low-dimensional structure and characterizes the rate $r_n$ through the Gaussian complexity $\gamma(T, \Sigma)$. The fast rate inequality is granted provided that $T$ contains the difference $\eta - \hat\beta$ in addition to the error vectors $\{\hat\beta - \beta^*, \eta - \beta^*\}$. Conditions that ensure the fast rate inequality will be made explicit in Section 3.2 for the Lasso.

Logistic loss in the logistic model. The following proposition shows that (8) is again satisfied.

Proposition 2.2. Consider the logistic loss $\ell(y, u) = \log(1 + e^u) - yu$ for $y \in \{0, 1\}$, $u \in \mathbb{R}$. Assume that $(X_i, Y_i)_{i=1,...,n}$ are iid satisfying the logistic regression model

$P(Y_i = 1|X_i) = 1 - P(Y_i = 0|X_i) = 1/(1 + \exp(-X_i^\top\beta^*))$,

for some $\beta^* \in \mathbb{R}^p$ with $\|\Sigma^{1/2}\beta^*\| \le 1$. (The constant 1 can be replaced by another absolute constant; this will only change $B_3$ to a different constant.) Assume (A2) holds. Then (8) holds with $B = 1/(6\sqrt{3})$, $B_2 = 1$ and an absolute constant $B_3 > 0$.

In this logistic model, the conclusions of Theorem 2.1 present an extra term compared to the linear model with squared loss, because the Lipschitz constant $B$ in (8) is non-zero: Theorem 2.1 reads that with high probability,

$\{\hat\beta - \beta^*, \eta - \beta^*\} \subset T \;\Rightarrow\; \|\hat\beta - \eta\|_K \lesssim L r_n^{1/2}\mathcal{E} + B^{1/2}L^{3/2}(1 + r_n^3\sqrt{n})\mathcal{E}^{3/2}$,   (13)
$\{\hat\beta - \eta, \hat\beta - \beta^*, \eta - \beta^*\} \subset T \;\Rightarrow\; \|\hat\beta - \eta\|_K \lesssim L^2 r_n\mathcal{E} + BL^3(1 + r_n^3\sqrt{n})\mathcal{E}^2$.   (14)

Similar to the case of squared loss, inequality (14) is stronger than inequality (13) when $\hat\beta - \eta$ belongs to $T$ in addition to $\{\hat\beta - \beta^*, \eta - \beta^*\} \subset T$.
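As a quick numerical illustration of (11)-(12) (a sketch we add for intuition, not from the paper), one can simulate the linear model with isotropic Gaussian design, where $\eta$ is soft-thresholding of $\beta^* + X^\top\varepsilon/n$ (cf. Section 4), and compare $\|\hat\beta - \eta\|$ with $\|\hat\beta - \beta^*\|$. All numeric constants below (dimensions, the factor 2.0 in $\lambda$) are illustrative choices of ours.

```python
# Sketch: in the linear model with Sigma = I_p, the expansion eta of (5) is
# soft-thresholding of z = beta_star + X^T eps / n, and ||beta_hat - eta||
# should be an order of magnitude smaller than ||beta_hat - beta_star||.
import numpy as np
from sklearn.linear_model import Lasso  # minimizes ||y - Xb||^2/(2n) + alpha*||b||_1

rng = np.random.default_rng(0)
n, p, s = 1000, 2000, 10
beta_star = np.zeros(p); beta_star[:s] = 1.0
X = rng.standard_normal((n, p))                    # iid rows ~ N(0, I_p)
eps = rng.standard_normal(n)
y = X @ beta_star + eps

lam = 2.0 * np.sqrt(2 * np.log(p / s) / n)         # order of the tuning in Lemma 3.3
beta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(X, y).coef_

z = beta_star + X.T @ eps / n
eta = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # prox of lam*||.||_1 at z

print("estimation error:", np.linalg.norm(beta_hat - beta_star))
print("expansion error: ", np.linalg.norm(beta_hat - eta))
```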
Further consider\n\n(N1) The features are normalized such that \u03a3jj \u2264 1 for all 1 \u2264 j \u2264 p.\n\n3.1 Constrained Lasso\n\nLet R > 0 be a \ufb01xed parameter. Our \ufb01rst example studies the constrained lasso penalty [38]\n\nh(\u03b2) = +\u221e if (cid:107)\u03b2(cid:107)1 > R\n\n(15)\ni.e., h is the convex indicator function of the (cid:96)1-ball of radius R > 0. Applying the above result\nrequires two ingredients: proving that the error vectors { \u02c6\u03b2 \u2212 \u03b2\u2217, \u03b7 \u2212 \u03b2\u2217} belong to some set T with\nhigh probability, and proving that rn = n\u22121/2\u03b3(T, \u03a3) is small. De\ufb01ne for any real k \u2265 1,\n\nh(\u03b2) = 0\n\nand\n\n(16)\nThe parameter k above will typically be a constant times s = (cid:107)\u03b2\u2217(cid:107)0, the sparsity of \u03b2\u2217. If R = (cid:107)\u03b2\u2217(cid:107)1,\nthen the triangle inequality reveals that the error vectors of \u02c6\u03b2 and \u03b7 satisfy\n\nTlasso(k) := {u \u2208 Rp : (cid:107)u(cid:107)1 \u2264\n\nk(cid:107)u(cid:107)}.\n\n\u221a\n\nif (cid:107)\u03b2(cid:107)1 \u2264 R,\n\nwhere S = {j = 1, ..., p : \u03b2\u2217\nBy the Cauchy-Schwarz inequality (cid:107)uS(cid:107)1 \u2264 \u221a\n\n{ \u02c6\u03b2 \u2212 \u03b2\u2217, \u03b7 \u2212 \u03b2\u2217} \u2286 T := {u \u2208 Rp : (cid:107)uSc(cid:107)1 \u2264 (cid:107)uS(cid:107)1},\n\n(17)\nj (cid:54)= 0} is the support of the true \u03b2\u2217 and uS is the restriction of u to S.\n\ns(cid:107)uS(cid:107)2, thus T in (17) satis\ufb01es T \u2282 Tlasso(4s).\n\n1The constant 1 can be replaced by another absolute constant; this will only change B3 to a different constant.\n\n4\n\n\fLemma 3.1. If (N1) holds and k \u2265 1, then we have \u03b3(T, \u03a3) (cid:46) \u03c6(T )\u22121(cid:112)k log(2p/k) for any cone\nHence under (N1) and by setting k = 4s and rn = \u03c6(T )\u22121(cid:112)s log(ep/s)/n we have in the linear\n\nT \u2282 Tlasso(k) where Tlasso(k) is de\ufb01ned in (16).\n\nmodel with squared loss that, with high probability,\n\n(cid:107)\u03a31/2(\u03b7 \u2212 \u02c6\u03b2)(cid:107) (cid:46) L\u03c6(T )\u22121/2(s log(ep/s)/n)1/4((cid:107)\u03a31/2( \u02c6\u03b2 \u2212 \u03b2\u2217)(cid:107) + (cid:107)\u03a31/2(\u03b7 \u2212 \u03b2\u2217)(cid:107))\n\n(18)\nand we have established that \u03b7 is a \ufb01rst order expansion of \u02c6\u03b2 with respect to the norm (cid:107) \u00b7 (cid:107)\u03a3 if\ns log(ep/s)/n \u2192 0. It is informative to study the order of magnitude of the right hand side in (18).\nFor that purpose, the following Lemma gives explicit bounds on (cid:107)\u03a31/2( \u02c6\u03b2\u2212\u03b2\u2217)(cid:107) and (cid:107)\u03a31/2(\u03b7\u2212\u03b2\u2217)(cid:107).\nLemma 3.2. Consider the linear model with squared loss (10) and assume (A2). Let \u02c6\u03b2, \u03b7 in (1) and\n(5) with penalty (15). Then if R = (cid:107)\u03b2\u2217(cid:107)1, we have with probability at least 1 \u2212 2e\u2212nr2\nn,\n(cid:107)\u03a31/2( \u02c6\u03b2 \u2212 \u03b2\u2217)(cid:107) (cid:46) L\u03c3\u2217rn(1 \u2212 C6L2rn)\u22121,\n\n(cid:107)\u03a31/2(\u03b7 \u2212 \u03b2\u2217)(cid:107) (cid:46) L\u03c3\u2217rn,\n\nwhere rn = \u03c6(T )\u22121(cid:112)s log(ep/s)/n and (\u03c3\u2217)2 = (\u03b52\n\n1 + ... + \u03b52\n\nn)/n.\n\n(19)\n\nand\n\nThe above lemma provides a slight improvement in the rate compared to [17, Theorem 11.1(a)].\nCombined with inequality (18), we have established that (cid:107)\u03a31/2( \u02c6\u03b2 \u2212 \u03b7)(cid:107) (cid:46) L2\u03c3\u2217r3/2\nn . 
Lemma 3.1. If (N1) holds and $k \ge 1$, then $\gamma(T, \Sigma) \lesssim \phi(T)^{-1}\sqrt{k\log(2p/k)}$ for any cone $T \subset T_{lasso}(k)$, where $T_{lasso}(k)$ is defined in (16).

Hence under (N1), setting $k = 4s$ and $r_n = \phi(T)^{-1}\sqrt{s\log(ep/s)/n}$, we have in the linear model with squared loss that, with high probability,

$\|\Sigma^{1/2}(\eta - \hat\beta)\| \lesssim L\phi(T)^{-1/2}(s\log(ep/s)/n)^{1/4}\big(\|\Sigma^{1/2}(\hat\beta - \beta^*)\| + \|\Sigma^{1/2}(\eta - \beta^*)\|\big)$,   (18)

and we have established that $\eta$ is a first order expansion of $\hat\beta$ with respect to the norm $\|\cdot\|_\Sigma$ if $s\log(ep/s)/n \to 0$. It is informative to study the order of magnitude of the right hand side in (18). For that purpose, the following lemma gives explicit bounds on $\|\Sigma^{1/2}(\hat\beta - \beta^*)\|$ and $\|\Sigma^{1/2}(\eta - \beta^*)\|$.

Lemma 3.2. Consider the linear model with squared loss (10) and assume (A2). Let $\hat\beta$ and $\eta$ be as in (1) and (5) with penalty (15). Then if $R = \|\beta^*\|_1$, we have with probability at least $1 - 2e^{-nr_n^2}$,

$\|\Sigma^{1/2}(\hat\beta - \beta^*)\| \lesssim L\sigma^* r_n(1 - C_6L^2r_n)^{-1}$ and $\|\Sigma^{1/2}(\eta - \beta^*)\| \lesssim L\sigma^* r_n$,   (19)

where $r_n = \phi(T)^{-1}\sqrt{s\log(ep/s)/n}$ and $(\sigma^*)^2 = (\varepsilon_1^2 + ... + \varepsilon_n^2)/n$.

The above lemma provides a slight improvement in the rate compared to [17, Theorem 11.1(a)]. Combined with inequality (18), we have established that $\|\Sigma^{1/2}(\hat\beta - \eta)\| \lesssim L^2\sigma^* r_n^{3/2}$. If $r_n \to 0$ (e.g., if $s\log(ep/s)/n \to 0$ while $\phi(T)$ stays bounded away from 0), this means that the distance $\|\Sigma^{1/2}(\hat\beta - \eta)\|$ between $\hat\beta$ and $\eta$ is an order of magnitude smaller than the risk bounds in (19).

Inclusion (17) is granted regardless of the loss $\ell$, as soon as $\beta^*$ lies on the boundary of the $\ell_1$-ball, i.e., $\|\beta^*\|_1 = R$. In logistic regression, i.e., the setting of Proposition 2.2 with the constrained Lasso penalty (15), inequality (13) yields that with high probability, $\|\eta - \hat\beta\|_K \lesssim L[r_n^{1/2} + L^{1/2}(1 + r_n^3\sqrt{n})\mathcal{E}^{1/2}]\mathcal{E}$. An extra term appears compared to the squared loss. Obtaining a first-order expansion as in (2) requires $r_n \to 0$ as well as $(1 + r_n^3\sqrt{n})\mathcal{E}^{1/2} \to 0$. These conditions can be verified if risk bounds such as (19) are available; see [35, 1] or Proposition 3.4 and its proof in Appendix F for applicable general techniques. A more detailed discussion of the logistic Lasso is given in the next subsection.

3.2 Penalized Lasso

We now consider the $\ell_1$-norm penalty

$h(\beta) = \lambda\|\beta\|_1$ for some $\lambda \ge 0$.   (20)

Here, the fact that $\hat\beta - \beta^*, \eta - \beta^* \in T$ for some low-dimensional cone $T$ is not granted almost surely; in that regard the situation differs from the constrained Lasso case in (17). We may find such a low-dimensional cone $T$ simultaneously for $\hat\beta$ and $\eta$, for both the squared loss and the logistic loss, as follows, using ideas from [35, 11]. Let $f_n$ be the convex function so that the objective in (1) equals $f_n(\beta) + h(\beta)$, and let $g_n$ be the convex function so that the objective in (5) equals $g_n(\beta) + h(\beta)$. Since $\hat\beta$ and $\eta$ are solutions of the corresponding optimization problems (1) and (5),

$h(\hat\beta) - h(\beta^*) \le f_n(\beta^*) - f_n(\hat\beta) \le \nabla f_n(\beta^*)^T(\beta^* - \hat\beta)$,
$h(\eta) - h(\beta^*) \le g_n(\beta^*) - g_n(\eta) \le \nabla g_n(\beta^*)^T(\beta^* - \eta)$.   (21)

Since $\nabla g_n(\beta^*) = \nabla f_n(\beta^*)$, both $\eta$ and $\hat\beta$ belong to the set $\hat T = \{b \in \mathbb{R}^p : h(b) - h(\beta^*) \le \nabla f_n(\beta^*)^T(b - \beta^*)\}$. Next, for both the squared loss and the logistic loss, $\nabla f_n(\beta^*)$ has subGaussian coordinates under (A2). Combining these remarks, we obtain the following, proved in the supplement.

Lemma 3.3. Let $h$ be as in (20). Consider the linear model (10) and assume (A2), (N1). Let $\xi > 0$ be a constant and let $\lambda = L\sigma^*(1 + 3\xi)\sqrt{2\log(p/s)/n}$, where $(\sigma^*)^2 = (\varepsilon_1^2 + \ldots + \varepsilon_n^2)/n$ and $\|\beta^*\|_0 = s$. Then

$P\big[\{\hat\beta - \beta^*, \eta - \beta^*\} \subset T\big] \ge 1 - \frac{2}{\xi^2\log(p/s)(p/s)^\xi}$, where $T = T_{lasso}\big(s(6 + 2\xi^{-1})^2\big)$.   (22)

If instead the logistic regression model and the assumptions of Proposition 2.2 are fulfilled and $\lambda = (L/2)(1 + 3\xi)\sqrt{2\log(p/s)/n}$, then the previous display (22) also holds.
The set $T$ above is the set $T_{lasso}(k)$ in (16) with $k = s(6 + 2\xi^{-1})^2$. Eq. (22) defines a low-dimensional cone $T$ that contains both error vectors $\hat\beta - \beta^*, \eta - \beta^*$ for the squared loss and the logistic loss. The Gaussian width of the set $T$ in (17) is already bounded in Lemma 3.1. Hence the Gaussian width of $T$ in the previous lemma is bounded from above as in the previous section, i.e., $\gamma(T, \Sigma) \lesssim \phi(T)^{-1}(6 + 2\xi^{-1})\sqrt{s\log(2p/s)}$ by Lemma 3.1, and the "slow rate" inequality (18) again holds with high probability, where $\phi(T)$ denotes the restricted eigenvalue of the set $T$ of the previous lemma. Risk bounds similar to (19) are given below. We emphasize that the fact that the error vectors of the Lasso belong to the cone (22) with high probability is not new: this is a powerful technique used throughout the literature on high-dimensional statistics, starting from [11, 35]. The novelty of our results lies in inequalities such as (18), which show that the distance $\|\Sigma^{1/2}(\hat\beta - \eta)\|$ is an order of magnitude smaller than the minimax risk $\sqrt{s\log(ep/s)/n}$. We will now state a result similar to Lemma 3.2 for the linear and logistic Lasso.

Proposition 3.4. Consider the penalized Lasso estimator $\hat\beta$ given by

$\hat\beta := \mathrm{argmin}_{\beta \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \ell(Y_i, X_i^\top\beta) + \lambda\|\beta\|_1$,

where $\ell$ is either the squared or the logistic loss and $\lambda$ is chosen as in Lemma 3.3 for some $\xi > 0$. Assume (A1), (A2). With $T$ defined in (22), assume that there exists $\theta > 0$ such that for all $u \in T$ with $\|u\|_K \le 1$,

$\theta^2\|u\|_K^2 \le \frac{1}{n}\sum_{i=1}^n \big\{\ell(Y_i, X_i^\top\beta^* + X_i^\top u) - \ell(Y_i, X_i^\top\beta^*) - u^\top X_i\ell'(Y_i, X_i^\top\beta^*)\big\}$,   (23)

as well as

$L(2 + 5\xi)\sqrt{2s\log(p/s)/n} \le B_3^{1/2}\phi(T)\theta^2 \times \begin{cases} 1/\sigma^*, & \text{for } \ell \text{ the squared loss,} \\ 2, & \text{for } \ell \text{ the logistic loss.} \end{cases}$   (24)

Then with probability at least $1 - 2/(\xi^2\log(p/s)(p/s)^\xi)$,

$\|\hat\beta - \beta^*\|_K \le \frac{L(2 + 5\xi)}{B_3^{1/2}\phi(T)\theta^2}\sqrt{\frac{2s\log(p/s)}{n}} \times \begin{cases} \sigma^*, & \text{for } \ell \text{ the squared loss,} \\ 0.5, & \text{for } \ell \text{ the logistic loss.} \end{cases}$   (25)

The proof is given in Appendix F. Assumption (23) is the classical restricted strong convexity condition, and we verify it for the squared and logistic losses in Proposition F.1. Results similar to Proposition 3.4 are known in the literature [35], but the main novelty of our result is that the tuning parameter $\lambda$ is of order $\sqrt{\log(p/s)/n}$ and not $\sqrt{\log(p)/n}$, which yields the minimax optimal rate.

Faster rates for the penalized Lasso. Fast rates for the Lasso can be obtained using the second inequality of Theorem 2.1, which when specialized to the squared loss gives (12). To verify the main additional assumption that $\hat\beta - \eta \in T$, we prove sparsity of $\eta$ and $\hat\beta$.
Since $\hat\beta$ and $\eta$ are defined through penalized quadratic problems, we can leverage existing results in the literature that imply that $\eta$ and $\hat\beta$ satisfy $\|\eta\|_0 \vee \|\hat\beta\|_0 \le \tilde{C}s$ under suitable conditions on the design and as long as $s\log(ep/s)/n$ is small enough, for some constant $\tilde{C}$ that depends on the restricted singular values of $\Sigma$; cf., e.g., [44, Lemma 1], [9, Theorem 3], [22, Lemma 3.5], [4, Lemma 6.1]. We prove such a result for the Group-Lasso in Proposition 3.7 below. Now we define the cones $T_0$ and $T$ as the sets

$T_0 := \{u \in \mathbb{R}^p : \|u\|_0 \le (2\tilde{C} + 1)s\} \subset T = \{u \in \mathbb{R}^p : \|u\|_1 \le (2\tilde{C} + 1)^{1/2}\sqrt{s}\|u\|\}$,   (26)

where the inclusion is obtained thanks to the Cauchy-Schwarz inequality. Then $\{\eta - \hat\beta, \hat\beta - \beta^*, \eta - \beta^*\} \subset T$ with high probability, the Gaussian width $\gamma(T, \Sigma)$ is bounded by Lemma 3.1, and the second inequality of Theorem 2.1 yields

$\|\Sigma^{1/2}(\eta - \hat\beta)\| \lesssim L^2 r_n\mathcal{E}$, where $r_n = \phi(T)^{-1}(s\log(ep/s)/n)^{1/2}$.

Since $\mathcal{E} \lesssim r_n$ with high probability by known prediction bounds for the Lasso (see Proposition 3.4 and its proof in Appendix F for rates with squared and logistic loss), we obtain that with high probability,

$\|\Sigma^{1/2}(\eta - \hat\beta)\| \lesssim L^3\phi(T)^{-2}s\log(ep/s)/n = L^3 r_n^2$,   (27)

a rate that is the square of the minimax rate $r_n$, hence much smaller. For the squared loss, this rate is also faster than the rate obtained in (18), which is of order $r_n^{3/2}$. This faster rate is obtained thanks to the inclusion $\{\hat\beta - \eta, \hat\beta - \beta^*, \eta - \beta^*\} \subset T$, whereas in the setting of (18) we only had $\{\hat\beta - \beta^*, \eta - \beta^*\} \subset T$ but not $\hat\beta - \eta \in T$. To our knowledge, the only result in the literature similar to the above bounds is given by [22, Theorem 5.1]. This result from [22] shows that (27) holds for the squared loss, provided that the covariance $\Sigma$ satisfies (a) the minimal singular value of $\Sigma$ is at least $c_3 > 0$, (b) the maximal singular value of $\Sigma$ is at most $c_4$, and (c) the covariance matrix $\Sigma$ satisfies

$\max_{A \subset [p] : |A| \le c_5 s} \; \max_{i \in A} \sum_{j \in A^c} |\Sigma_{ij}| \le c_6$.   (28)

Our results show that a first order expansion for the Lasso can be obtained using the slow rate bound (11) without the requirement that the spectral norm of $\Sigma$ be bounded, and for the fast rate without the stringent assumption (28) on the correlations of $\Sigma$. Not only do our results generalize Theorem 5.1 of [22] to more general $\Sigma$; Theorem 2.1 also shows how to obtain the first-order expansion $\eta$ beyond the squared loss (e.g. logistic loss) and beyond the $\ell_1$-penalty of the Lasso: the previous subsection tackles the constrained Lasso penalty (15) and the next subsection tackles the Group-Lasso penalty. Sparsity of $\eta$ for any general loss function is proved in Proposition 3.7. This alone does not imply inclusion of $\eta - \hat\beta$ in a low-dimensional set without sparsity of $\hat\beta$.
Sparsity of $\hat\beta$ for general loss functions is not well studied, but for the logistic loss, Section D.4 of the supplement of [10] proves a sparsity bound of the form $\|\hat\beta\|_0 \le \tilde{C}s$, similar to the squared loss. Unfortunately the proof there requires $\lambda \gtrsim \sqrt{\log p/n}$ instead of the condition $\lambda \gtrsim \sqrt{\log(p/s)/n}$ used in Lemma 3.3 above and in [26, 37, 7, 4, 2].

3.3 Group-Lasso

Consider now a partition of $\{1, ..., p\}$ into $M$ groups $G_1, ..., G_M$. For simplicity, we assume that the groups have the same size $d = p/M$, which is typically the case in multitask learning with $d$ tasks and $M$ shared features. The Group-Lasso penalty studied in this subsection is

$h(\beta) = \lambda\sum_{k=1}^M \|\beta_{G_k}\|$, where $\beta_{G_k} \in \mathbb{R}^{|G_k|}$ is the restriction $(\beta_j, j \in G_k)$.   (29)

In both the linear model with squared loss and in logistic regression with the logistic loss, we now show that $\hat\beta - \beta^*$ and $\eta - \beta^*$ belong to a low-dimensional cone (Lemma 3.5), and that the Gaussian width of this cone is bounded from above by $\sqrt{s}(\sqrt{d} + \sqrt{2\log(M/s)})$, where $s$ is the number of groups with $\beta^*_{G_k} \ne 0$ (Lemma 3.6).

Lemma 3.5. Consider the linear model (10) and assume that $\max_{k=1,...,M}\|\Sigma_{G_k,G_k}\|_{op} \le 1$ and that each group has the same size $|G_k| = d = p/M$. Let $\xi > 0$ and set $\lambda = L\sigma^*(1 + \xi)[\sqrt{d} + (1 + \sqrt{2}\xi)\sqrt{2\log(M/s)}]$, where $(\sigma^*)^2 = (\sum_{i=1}^n \varepsilon_i^2)/n$ and $s$ is the number of groups with $\beta^*_{G_k} \ne 0$. Then

$P\big(\{\hat\beta - \beta^*, \eta - \beta^*\} \subset T\big) \ge 1 - 2/\big(2\xi^2\log(M/s)(M/s)^\xi\big)$,   (30)

for $T = \{\delta \in \mathbb{R}^p : \sum_{k=1}^M \|\delta_{G_k}\| \le \sqrt{s}(2 + 3\xi^{-1})\|\delta\|_2\}$. If instead the logistic regression model and the assumptions of Proposition 2.2 are fulfilled and $\lambda$ is as above with $\sigma^* = 1/2$, then (30) also holds.

The fact that the Group-Lasso error belongs with high probability to a low-dimensional cone has been used before to prove risk bounds, e.g., [31, 5]. However, the tuning parameter in the above lemma is smaller than that used in these works, and using such cones to prove first order expansions as in the present paper is, to our knowledge, novel.

Lemma 3.6. Assume that $\max_{k=1,...,M}\|\Sigma_{G_k,G_k}\|_{op} \le 1$ and that each group has the same size $|G_k| = d = p/M$. The set $T$ defined in the previous lemma satisfies $\gamma(T, \Sigma) \lesssim C(\xi)\phi(T)^{-1}\sqrt{sd + s\log(M/s)}$ for some constant $C(\xi)$ that depends only on $\xi$.

Hence if the number of groups $M$, the group-sparsity $s$ (number of groups such that $\beta^*_{G_k} \ne 0$) and the group size $d = p/M$ satisfy $(sd + s\log(M/s))/n \to 0$ while $\phi(T)$ is bounded away from 0, the above lemmas combined with Theorem 2.1 imply that $\eta$ is a first-order expansion of $\hat\beta$ for both the squared loss in linear regression and the logistic loss in the logistic model. We leverage this result to obtain an exact risk identity for the Group-Lasso in the next section.
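When $K = I_p$, the expansion $\eta$ for the Group-Lasso penalty (29) is the proximal operator of $h$ at the point $z$ of display (5), which acts blockwise as soft-thresholding of each group norm (the Block James-Stein estimator mentioned in Section 4). A minimal sketch, in our own notation and assuming equal group sizes $d$:

```python
# Blockwise soft-thresholding: prox of h(b) = lam * sum_k ||b_{G_k}||
# at z, i.e. the Group-Lasso expansion eta of (5) in the isotropic case K = I_p.
import numpy as np

def prox_group_lasso(z, lam, d):
    blocks = z.reshape(-1, d)                    # groups G_1, ..., G_M of equal size d
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return (scale * blocks).ravel()              # (1 - lam/||z_Gk||)_+ z_Gk on each block
```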
Proposition 3.7. Assume (A1), (A2) and let the setting of Lemma 3.6 be fulfilled. Fix $\lambda$ as in Lemma 3.5, for both the squared and logistic loss, for some $\xi > 0$, and let $T$ be the cone defined in Lemma 3.5. If $\|K\|_{op} \le C_{max} < \infty$ and the assumptions of Proposition 3.4 hold, then

$P\big(|\{k \in [M] : \eta_{G_k} \ne 0\}| \le s\tilde{C}\big) \ge 1 - 2/(\xi^2\log(M/s)(M/s)^\xi)$,

where $\tilde{C} := 1 + C_{max}\{2(3 + \xi)(1 + \xi^{-1})\}^2 B_3^2\phi(T)^{-2}$. For the squared loss, the same holds for $\hat\beta$ with $\tilde{C}$ replaced by $(1 + o(1))\tilde{C}$, provided $\phi(T)^{-1}\sqrt{sd + s\log(M/s)}/\sqrt{n} \to 0$.

The proof is given in Appendix G. For the Lasso, the assumption $\|K\|_{op} \le C_{max}$ can be relaxed to a bound on the sparse maximal eigenvalue of $K$ using devices from [44, Lemma 1], [47, Corollary 2], [8, Lemma 3] or [4, Proposition 7.4]. See also [31, Theorem 3.1] and [29, Lemma 6] for similar results for the Group-Lasso, although with a larger tuning parameter than in Proposition 3.7.

For the squared loss, if the condition number of $\Sigma$ stays bounded then $C_{max}\phi(T)^{-2}$ is also bounded. Then if $r_n = \sqrt{sd + s\log(M/s)}/\sqrt{n} \to 0$, Proposition 3.7 yields that $\hat\beta - \eta$ belongs to the cone $\{\delta \in \mathbb{R}^p : \sum_{k=1}^M \|\delta_{G_k}\| \le (1 + o(1))(2\tilde{C}s)^{1/2}\|\delta\|_2\}$, which yields the "fast rate" bound (12).

4 Application to exact risk identities

In the linear model with the squared loss and identity covariance ($\Sigma = I_p$), the expansion $\eta$ in (5) is particularly simple: $\eta$ becomes the proximal operator of the penalty $h$ at the point $z = \beta^* + n^{-1}\sum_{i=1}^n \varepsilon_i X_i$, i.e., $\eta = \mathrm{prox}_h(z)$ where $\mathrm{prox}_h(x) = \mathrm{argmin}_{b \in \mathbb{R}^p}\|x - b\|^2/2 + h(b)$. Hence the loss $\|\eta - \beta^*\|$ of $\eta$ has a simple form, and if a first-order expansion (2) is available, for instance for the Lasso or Group-Lasso as a consequence of the lemmas of the previous section, then the loss $\|\hat\beta - \beta^*\|$ is exactly the loss of $\mathrm{prox}_h(z)$ up to a smaller order term. Let us emphasize that the next result and the following discussion provide exact risk identities for the loss $\|\hat\beta - \beta^*\|$ (as in (32) below), and not only upper bounds up to multiplicative constants.

Theorem 4.1 (Exact Risk Identity). Consider the linear model (10) and the regularized problem (1) with an arbitrary proper convex function $h(\cdot)$. Assume that $X_1, \ldots, X_n$ are iid $N(0, I_p)$, independent of $\varepsilon_1, ..., \varepsilon_n$, and set $\sigma^* = (\frac{1}{n}\sum_{i=1}^n \varepsilon_i^2)^{1/2}$. Then with probability at least $1 - 2\exp(-t^2/2)$,

$\Big|\|\hat\beta - \beta^*\| - E_Z\big[\|\beta^* - \mathrm{prox}_h(\beta^* + n^{-1/2}\sigma^* Z)\|^2\big]^{1/2}\Big| \le \frac{\sigma^*(t + 1)}{n^{1/2}} + \|\hat\beta - \eta\|$,   (31)

where $Z = \frac{1}{n^{1/2}\sigma^*}\sum_{i=1}^n \varepsilon_i X_i \sim N(0, I_p)$ and $E_Z$ denotes the expectation with respect to $Z$.

Theorem 4.1 is a generalization of Corollary 5.2 of [22], where the result is stated for $h(\beta) = \lambda\|\beta\|_1$ with $\lambda \gtrsim \sigma^*\sqrt{2\log(p)/n}$.
For the case of the Lasso, either in its constrained form with tuning parameter chosen as in Lemma 3.2 or in its penalized form with tuning parameter chosen as in Lemma 3.3, inequality (18) holds thanks to (17) and Lemma 3.1 for the constrained Lasso, and thanks to Lemmas 3.1 and 3.3 for the penalized Lasso. Hence for both the constrained and penalized Lasso, if $\Sigma = I_p$ with Gaussian design, the second part of the right hand side of (31) is $O_p(\sigma^*/\sqrt{n}) + O_p\big((s\log(ep/s)/n)^{1/4}\big)\big(\|\eta - \beta^*\| + \|\hat\beta - \beta^*\|\big)$. Hence if $s, n, p \to +\infty$ with $s\log(ep/s)/n \to 0$ and $s/p \to 0$, then (31) implies

$\|\hat\beta - \beta^*\| = (1 + o_p(1))E_Z\big[\|\beta^* - \mathrm{prox}_h(\beta^* + n^{-1/2}\sigma^* Z)\|^2\big]^{1/2}$.   (32)

For the penalized Lasso, since $\eta$ represents a soft-thresholding operator which can be written in closed form, Theorem 4.1 allows a refined study of the risk of $\hat\beta$; see [14, Theorem 5.1]. Similarly for the Group-Lasso, we have from Lemmas 3.5 and 3.6 that $\|\eta - \hat\beta\| = O_p\big((sd + s\log(M/s))^{1/4}/n^{1/4}\big)\|\hat\beta - \beta^*\|$ (slow rate), which is again negligible relative to $\|\hat\beta - \beta^*\|$ if $(sd + s\log(M/s))/n \to 0$ and $s/M \to 0$. Thus (32) again holds true. For the Group-Lasso, $\eta = \mathrm{prox}_h(\beta^* + n^{-1/2}\sigma^* Z)$ represents the Block James-Stein estimator in the sequence model; see [13, Section 2.1].

Extending Corollary 5.2 of [22] to more general loss/penalty functions, the above device lets us characterize the risk $\|\hat\beta - \beta^*\|$: up to a multiplicative factor of order $1 + o_p(1)$, the risk is the same as the risk of the proximal operator of $h$ in the Gaussian sequence model where one observes $N(\beta^*, (\sigma^*)^2/n)$.
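As an aside we add for illustration (not in the paper), the deterministic right-hand side of (32) is straightforward to evaluate by Monte Carlo over $Z \sim N(0, I_p)$; for $h = \lambda\|\cdot\|_1$, $\mathrm{prox}_h$ is soft-thresholding. All numeric values below are illustrative.

```python
# Monte Carlo evaluation of E_Z[ ||beta* - prox_h(beta* + sigma*/sqrt(n) * Z)||^2 ]^{1/2},
# the deterministic risk that ||beta_hat - beta*|| matches up to 1 + o_p(1) by (32).
import numpy as np

rng = np.random.default_rng(1)
n, p, s, sigma_star = 1000, 2000, 10, 1.0
lam = 2.0 * np.sqrt(2 * np.log(p / s) / n)       # illustrative tuning
beta_star = np.zeros(p); beta_star[:s] = 1.0

def prox_l1(x, t):                                # soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

sq_losses = [
    np.sum((prox_l1(beta_star + sigma_star * rng.standard_normal(p) / np.sqrt(n), lam)
            - beta_star) ** 2)
    for _ in range(500)                           # average over draws of Z ~ N(0, I_p)
]
print("predicted risk of the Lasso:", np.sqrt(np.mean(sq_losses)))
```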
5 Application to inference

The second application we wish to mention is related to confidence intervals in the linear model when the squared loss is used and $X_1, ..., X_n$ are iid Gaussian $N(0, \Sigma)$. Assume that one is interested in constructing a confidence interval for a specific linear combination $a^T\beta^*$ for some $a \in \mathbb{R}^p$. Further assume, for simplicity, that $\Sigma$ is known and that $a$ is normalized with $\|\Sigma^{-1/2}a\| = 1$. Then previous works on de-biasing [45, 46, 20, 21, 40, 22, 4] suggest, given an estimator $\hat\beta$ that may be biased, to consider the bias-corrected estimator $\hat\theta$ defined by $\hat\theta = a^T\hat\beta + \|z_a\|^{-2}z_a^T(y - X\hat\beta)$, where $y = (Y_1, ..., Y_n)$ is the response vector, $X$ is the design matrix with rows $X_1, ..., X_n$, and $z_a = X\Sigma^{-1}a \sim N(0, I_n)$ is sometimes referred to as a score vector for the estimation of $a^T\beta^*$.

Proposition 5.1. Assume that $X_1, ..., X_n$ are iid $N(0, \Sigma)$ and are independent of $\varepsilon = (\varepsilon_1, ..., \varepsilon_n) \sim N(0, I_n)$. Assume that for some cone $T$ and $r_n = \gamma(T, \Sigma)/\sqrt{n}$ we have

$P\big(\|\Sigma^{1/2}(\hat\beta - \beta^*)\| + \|\Sigma^{1/2}(\eta - \beta^*)\| \le C_7 r_n, \; \{\eta - \hat\beta, \eta - \beta^*, \hat\beta - \beta^*\} \subset T\big) \ge 1 - \alpha$.   (33)

Then for some $T_n$ with the t-distribution with $n$ degrees of freedom, with probability at least $1 - \alpha - 4e^{-nr_n^2/2}$,

$\sqrt{n}(\hat\theta - a^T\beta^*) - T_n = O_p(1 + r_n)\|\Sigma^{1/2}(\eta - \beta^*)\| + O_p(\sqrt{n}r_n)\|\Sigma^{1/2}(\eta - \hat\beta)\|$   (34)
$= O_p(r_n(1 + r_n)) + O_p(\sqrt{n}r_n^3)$.   (35)

Because $T_n$ has the t-distribution with $n$ degrees of freedom, asymptotically $P(|T_n| \le 1.96) \to 0.95$, and hence from (35) we get that $P(n^{1/2}|\hat\theta - a^\top\beta^*| \le 1.96) \to 0.95$ if $r_n \to 0$ and $\sqrt{n}r_n^3 \to 0$. Therefore $[\hat\theta - 1.96/n^{1/2}, \hat\theta + 1.96/n^{1/2}]$ represents a 95% confidence interval for $a^\top\beta^*$. Conclusion (34) is a consequence of Theorem 2.1.

Lasso. Eq. (33) is satisfied for the penalized Lasso with $r_n = \sqrt{s\log(ep/s)/n}$ and the cone $T$ in (26), in situations where $\|\hat\beta\|_0 \le \tilde{C}s$ with high probability, as explained in the discussion surrounding (26). In order to construct confidence intervals based on (34), the right hand side of (35) needs to converge to 0. This is the case if $r_n \to 0$ and $\sqrt{n}r_n^3 \to 0$. For the Lasso with $r_n^2 = s\log(ep/s)/n$, this translates to the sparsity condition $s^3\log(ep/s)^3/n^2 \to 0$, i.e., $s = o(n^{2/3})$ up to logarithmic factors. Hence the first order expansion results of the present paper let us derive de-biasing results for the Lasso beyond the condition $s \lesssim \sqrt{n}$ required in the early results [46, 20, 40] on de-biasing (other recent approaches [22, 4] also allow one to prove such results beyond $s \lesssim \sqrt{n}$). Moreover, the above proposition is general and applies to any regularized estimator such that (33) holds, with suitable bounds on the Gaussian complexity $\gamma(T, \Sigma)$. For $s \gg n^{2/3}$, the estimator $\hat\theta$ requires an adjustment for asymptotic normality in the form of a degrees-of-freedom adjustment [4].

Group-Lasso. If $s$ is the number of non-zero groups, $r_n = \sqrt{sd + s\log(M/s)}/\sqrt{n}$ and the condition number of $\Sigma$ is bounded, then (33) holds thanks to Proposition 3.7, the last paragraph of Section 3.3 and the risk bound (86). Here, (35) is $o(1)$ if and only if $(sd + s\log(M/s))/n^{2/3} \to 0$. This improves the sample size requirement of [34], although $\Sigma$ is assumed known in Proposition 5.1.
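A minimal sketch of the bias correction and the resulting 95% interval of Section 5 (our code, not the paper's; it assumes $\Sigma$ known, $\varepsilon \sim N(0, I_n)$ and $\|\Sigma^{-1/2}a\| = 1$ as in Proposition 5.1):

```python
# Bias-corrected estimator theta_hat = a^T beta_hat + z_a^T (y - X beta_hat) / ||z_a||^2
# with score vector z_a = X Sigma^{-1} a, and the CI [theta_hat +/- 1.96/sqrt(n)].
import numpy as np

def debiased_ci(X, y, beta_hat, a, Sigma_inv):
    n = X.shape[0]
    z_a = X @ (Sigma_inv @ a)                 # ~ N(0, I_n) when ||Sigma^{-1/2} a|| = 1
    theta = a @ beta_hat + z_a @ (y - X @ beta_hat) / (z_a @ z_a)
    half = 1.96 / np.sqrt(n)                  # width from the discussion after (35)
    return theta, (theta - half, theta + half)
```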
6 Proof sketch of Theorem 2.1

(Detailed proofs are given in Appendix E.)

Theorem 6.1. Define $\hat K := n^{-1}\sum_{i=1}^n \ell''(Y_i, X_i^\top\beta^*)X_iX_i^\top$. Under assumption (A1), we have

(i) If $\{\hat\beta - \beta^*, \eta - \beta^*\} \subseteq T$ then $\|\hat\beta - \eta\|_K \lesssim Q_{n,1}^{1/2}\mathcal{E} + B^{1/2}Z_n^{1/2}\mathcal{E}^{3/2}$.
(ii) If $\{\hat\beta - \eta, \hat\beta - \beta^*, \eta - \beta^*\} \subseteq T$ then $\|\hat\beta - \eta\|_K \lesssim Q_{n,2}\mathcal{E} + BZ_n\mathcal{E}^2$,

where

$Q_{n,1} = \sup_{u \in T}\Big|\frac{u^\top\hat K u}{\|u\|_K^2} - 1\Big|, \qquad Q_{n,2} = \sup_{u,v \in T}\frac{|u^\top(\hat K - K)v|}{\|u\|_K\|v\|_K}, \qquad Z_n = \sup_{u \in T}\frac{1}{n}\sum_{i=1}^n \frac{|X_i^\top u|^3}{\|u\|_K^3}$.

Theorem 6.1 follows from the strong convexity of the objective function of $\eta$ with respect to the norm $\|\cdot\|_K$ (cf. for instance Lemma 1 of [6]) combined with Taylor expansions of the loss $\ell$. Next, to prove Theorem 2.1, it remains to bound $Q_{n,1}(T)$, $Q_{n,2}(T)$ and $Z_n(T)$. The quadratic processes $Q_{n,1}(T)$, $Q_{n,2}(T)$ and the cubic process $Z_n(T)$ can be bounded in terms of $\gamma(T, \Sigma)$ using generic chaining results, Theorem 1.13 of Mendelson [33] and Eq. (3.9) of [32], as follows.

Proposition 6.2 (Control of $Q_{n,1}$, $Q_{n,2}$ and $Z_n$). Under assumptions (A1) and (A2), we have:

(i) With probability $1 - 2\exp(-C_8 t^2\gamma^2(T, \Sigma))$,

$\max\{Q_{n,1}(T), Q_{n,2}(T)\} \le C_9 B_2 B_3 L^2\big(t n^{-1/2}\gamma(T, \Sigma) + t^2 n^{-1}\gamma^2(T, \Sigma)\big)$.

(ii) With probability $1 - 2\exp(-C_{10} t\log n)$, $Z_n(T) \le C_{11} B_3^{3/2} L^3\big(1 + n^{-1}\gamma^3(T, \Sigma)\big)t^3$.

References

[1] Pierre Alquier, Vincent Cottet, and Guillaume Lecué. Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. The Annals of Statistics, 47(4):2117–2144, 2019.

[2] Pierre C. Bellec. The noise barrier and the large signal bias of the lasso and other convex estimators. arXiv:1804.01230, 2018. URL https://arxiv.org/pdf/1804.01230.pdf.

[3] Pierre C. Bellec. Sharp oracle inequalities for least squares estimators in shape restricted regression. The Annals of Statistics, 46(2):745–780, 2018.

[4] Pierre C. Bellec and Cun-Hui Zhang. De-biasing the lasso with degrees-of-freedom adjustment. arXiv:1902.08885, 2019. URL https://arxiv.org/pdf/1902.08885.pdf.

[5] Pierre C. Bellec, Guillaume Lecué, and Alexandre B. Tsybakov. Towards the study of least squares estimators with convex penalty. In Séminaires et Congrès, to appear, number 39. Société Mathématique de France, 2017. URL https://arxiv.org/pdf/1701.09120.pdf.

[6] Pierre C. Bellec, Arnak S. Dalalyan, Edwin Grappin, and Quentin Paris. On the prediction loss of the lasso in the partially labeled setting. Electronic Journal of Statistics, 12(2):3443–3472, 2018.

[7] Pierre C. Bellec, Guillaume Lecué, and Alexandre B. Tsybakov. Slope meets lasso: Improved oracle bounds and optimality. The Annals of Statistics, 46(6B):3603–3642, 2018. doi: 10.1214/17-AOS1670. URL https://arxiv.org/pdf/1605.08651.pdf.

[8] Alexandre Belloni and Victor Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 2013.

[9] Alexandre Belloni, Victor Chernozhukov, and Lie Wang.
Pivotal estimation via square-root lasso in nonparametric regression. The Annals of Statistics, 42(2):757–788, 2014. URL http://dx.doi.org/10.1214/14-AOS1204.

[10] Alexandre Belloni, Victor Chernozhukov, and Ying Wei. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics, 34(4):606–619, 2016.

[11] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009. doi: 10.1214/08-AOS620. URL http://dx.doi.org/10.1214/08-AOS620.

[12] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[13] T. Tony Cai and Harrison H. Zhou. A data-driven block thresholding approach to wavelet estimation. The Annals of Statistics, 37(2):569–595, 2009.

[14] Emmanuel J. Candès. Modern statistical estimation via oracle inequalities. Acta Numerica, 15:257–325, 2006.

[15] Antoine Dedieu. Error bounds for sparse classifiers in high-dimensions. arXiv preprint arXiv:1810.03081, 2018.

[16] Sjoerd Dirksen. Tail bounds via generic chaining. Electronic Journal of Probability, 20, 2015.

[17] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, 2015.

[18] Daniel Hsu, Sham Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17(52):1–6, 2012. doi: 10.1214/ECP.v17-2079. URL http://ecp.ejpecp.org/article/view/2079.

[19] Hidehiko Ichimura. Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1-2):71–120, 1993.

[20] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909, 2014.

[21] Adel Javanmard and Andrea Montanari. Hypothesis testing in high-dimensional regression under the gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554, 2014.

[22] Adel Javanmard and Andrea Montanari. De-biasing the lasso: Optimal sample size for gaussian designs. The Annals of Statistics, to appear, 2015.

[23] Keith Knight and Wenjiang Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, pages 1356–1378, 2000.

[24] Vladimir Koltchinskii. Sparsity in penalized empirical risk minimization. In Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, volume 45, pages 7–57. Institut Henri Poincaré, 2009.

[25] Arun K. Kuchibhotla. Deterministic inequalities for smooth M-estimators. arXiv preprint arXiv:1809.05172, September 2018.

[26] Guillaume Lecué and Shahar Mendelson. Regularization and the small-ball method I: Sparse recovery. The Annals of Statistics, 46(2):611–641, 2018. doi: 10.1214/17-AOS1562. URL https://doi.org/10.1214/17-AOS1562.

[27] Jason D. Lee, Yuekai Sun, and Michael A. Saunders. Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443, 2014.

[28] Christopher Liaw, Abbas Mehrabian, Yaniv Plan, and Roman Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets.
In Geometric Aspects of Functional Analysis, pages 277–299. Springer, 2017.

[29] Han Liu and Jian Zhang. Estimation consistency of the group lasso and its applications. In Artificial Intelligence and Statistics, pages 376–383, 2009.

[30] Po-Ling Loh. Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45(2):866–896, 2017.

[31] Karim Lounici, Massimiliano Pontil, Sara van de Geer, and Alexandre B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics, 39(4):2164–2204, 2011. doi: 10.1214/11-AOS896. URL http://dx.doi.org/10.1214/11-AOS896.

[32] Shahar Mendelson. Empirical processes with a bounded ψ1 diameter. Geometric and Functional Analysis, 20(4):988–1027, 2010.

[33] Shahar Mendelson. Upper bounds on product and multiplier empirical processes. Stochastic Processes and their Applications, 126(12):3652–3680, 2016. doi: 10.1016/j.spa.2016.04.028. URL https://ideas.repec.org/a/eee/spapps/v126y2016i12p3652-3680.html.

[34] Ritwik Mitra and Cun-Hui Zhang. The benefit of group sparsity in group inference with de-biased scaled group lasso. Electronic Journal of Statistics, 10(2):1829–1873, 2016.

[35] Sahand Negahban, Bin Yu, Martin J. Wainwright, and Pradeep K. Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.

[36] Philippe Rigollet and Alexandre Tsybakov. Exponential screening and optimal rates of sparse estimation. The Annals of Statistics, 39(2):731–771, 2011.

[37] Tingni Sun and Cun-Hui Zhang. Sparse matrix inversion with scaled lasso. Journal of Machine Learning Research, 14(1):3385–3418, 2013.

[38] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

[39] Sara van de Geer. Weakly decomposable regularization penalties and structured sparsity. Scandinavian Journal of Statistics, 41(1):72–86, 2014.

[40] Sara van de Geer, Peter Bühlmann, Ya'acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202, 2014.

[41] Aad van der Vaart. Part III: Semiparametric statistics. Lectures on Probability Theory and Statistics, pages 331–457, 2002.

[42] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

[43] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

[44] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, pages 894–942, 2010.

[45] Cun-Hui Zhang. Statistical inference for high-dimensional data. Mathematisches Forschungsinstitut Oberwolfach: Very High Dimensional Semiparametric Models, Report, (48):28–31, 2011.

[46] Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242, 2014.

[47] Cun-Hui Zhang and Tong Zhang.
A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012. doi: 10.1214/12-STS399. URL https://doi.org/10.1214/12-STS399.
", "award": [], "sourceid": 1909, "authors": [{"given_name": "Pierre", "family_name": "Bellec", "institution": "Rutgers"}, {"given_name": "Arun", "family_name": "Kuchibhotla", "institution": "Wharton Statistics"}]}