{"title": "Estimation with Norm Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1556, "page_last": 1564, "abstract": "Analysis of estimation error and associated structured statistical recovery based on norm regularized regression, e.g., Lasso, needs to consider four aspects: the norm, the loss function, the design matrix, and the noise vector. This paper presents generalizations of such estimation error analysis on all four aspects, compared to the existing literature. We characterize the restricted error set, establish relations between error sets for the constrained and regularized problems, and present an estimation error bound applicable to {\\em any} norm. Precise characterizations of the bound are presented for a variety of noise vectors, design matrices, including sub-Gaussian, anisotropic, and dependent samples, and loss functions, including least squares and generalized linear models. Gaussian widths, as a measure of size of suitable sets, and associated tools play a key role in our generalized analysis.", "full_text": "Estimation with Norm Regularization\n\nArindam Banerjee\n\nSheng Chen\n\nFarideh Fazayeli\n\nVidyashankar Sivakumar\n\nDepartment of Computer Science & Engineering\n\n{banerjee,shengc,farideh,sivakuma}@cs.umn.edu\n\nUniversity of Minnesota, Twin Cities\n\nAbstract\n\nAnalysis of non-asymptotic estimation error and structured statistical recovery\nbased on norm regularized regression, such as Lasso, needs to consider four as-\npects: the norm, the loss function, the design matrix, and the noise model. This\npaper presents generalizations of such estimation error analysis on all four aspects.\nWe characterize the restricted error set, where the estimation error vector lies, es-\ntablish relations between error sets for the constrained and regularized problems,\nand present an estimation error bound applicable to any norm. 
Precise characterizations of the bound are presented for a variety of design matrices, including sub-Gaussian, anisotropic, and dependent samples, noise models, including both Gaussian and sub-Gaussian noise, and loss functions, including least squares and generalized linear models. Gaussian width, a geometric measure of size of sets, and associated tools play a key role in our generalized analysis.

1 Introduction

Over the past decade, progress has been made in developing non-asymptotic bounds on the estimation error of structured parameters based on norm regularized regression. Such estimators are usually of the form [16, 9, 3]:

    θ̂λn = argmin_{θ ∈ R^p}  L(θ; Z^n) + λn R(θ) ,   (1)

where R(θ) is a suitable norm, L(·) is a suitable loss function, Z^n = {(yi, Xi)}_{i=1}^n with yi ∈ R, Xi ∈ R^p is the training set, and λn > 0 is a regularization parameter. The optimal parameter θ* is often assumed to be 'structured', usually characterized as a small value according to some norm R(·). Since θ̂λn is an estimate of the optimal structure θ*, the focus has been on bounding a suitable measure of the error vector Δ̂n = (θ̂λn − θ*), e.g., the L2 norm ∥Δ̂n∥2.

To understand the state-of-the-art on non-asymptotic bounds on the estimation error for norm-regularized regression, four aspects of (1) need to be considered: (i) the norm R(θ), (ii) properties of the design matrix X ∈ R^{n×p}, (iii) the loss function L(·), and (iv) the noise model, typically in terms of ω = y − E[y|x]. Most of the literature has focused on a linear model, y = Xθ + ω, and a squared-loss function:

    L(θ; Z^n) = (1/n) ∥y − Xθ∥2² = (1/n) Σ_{i=1}^n (yi − ⟨θ, Xi⟩)² .

Early work on such estimators focused on the L1 norm [21, 20, 8], and led to sufficient conditions on the design matrix X, including the restricted isometry property (RIP) and restricted eigenvalue (RE) conditions [2, 9, 13, 3]. While much of the development has focused on isotropic Gaussian design matrices, recent work has extended the analysis for the L1 norm to correlated Gaussian designs [13] as well as anisotropic sub-Gaussian design matrices [14].

Building on such development, [9] presents a unified framework for the case of decomposable norms and also considers generalized linear models (GLMs) for certain norms such as L1. Two key insights are offered in [9]: first, for suitably large λn, the error vector Δ̂n lies in a restricted set, a cone or a star; and second, on the restricted error set, the loss function needs to satisfy restricted strong convexity (RSC), a generalization of the RE condition, for the analysis to work out.

For isotropic Gaussian design matrices, additional progress has been made. [4] considers a constrained estimation formulation for all atomic norms, where the gain condition, equivalent to the RE condition, uses Gordon's inequality [5, 7] and is succinctly represented in terms of the Gaussian width of the intersection of the cone of the error set and a unit ball/sphere. [11] considers three related formulations for generalized Lasso problems, establishing recovery guarantees based on Gordon's inequality and quantities related to the Gaussian width. Sharper analysis for recovery has been considered in [1], yielding a precise characterization of phase transition behavior using quantities related to the Gaussian width. [12] considers a linear programming estimator in a 1-bit compressed sensing setting and, interestingly, the concept of Gaussian width shows up in the analysis.
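As a concrete instance of the estimator in (1), the following minimal sketch solves the Lasso (squared loss with the L1 norm) by proximal gradient descent (ISTA). The problem sizes, noise level, and regularization value are illustrative choices, not ones prescribed by the paper.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Solve min_theta (1/(2n)) ||y - X theta||_2^2 + lam ||theta||_1 by ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n       # Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n    # gradient of the squared loss
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(0)
n, p, s = 100, 50, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:s] = 1.0    # s-sparse true parameter
y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_hat = lasso_ista(X, y, lam=0.1)
err = np.linalg.norm(theta_hat - theta_star)      # ||Delta_hat||_2
```

The soft-thresholding step produces exact zeros, so the estimate is itself sparse, which is the structural recovery the error bounds quantify.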
In spite\nof the advances, most of these results are restricted to isotropic Gaussian design matrices.\nIn this paper, we consider structured estimation problems with norm regularization, which substan-\ntially generalize existing results on all four pertinent aspects: the norm, the design matrix, the loss,\nand the noise model. The analysis we present applies to all norms. We characterize the structure of\nthe restricted set for all norms, develop precise relationships between the error sets of the regular-\nized and constrained versions [2], and establish an estimation error bound in Section 2. The bound\ndepends on the regularization parameter \u03bbn and a certain RSC condition constant \u03ba. In Section 3,\nfor both Gaussian and sub-Gaussian noise \u03c9, we develop suitable characterizations for \u03bbn in terms\nof the Gaussian width of the unit norm ball \u2126R = {u|R(u) \u2264 1}. In Section 4, we characterize\nthe RSC condition for any norm, considering two families of design matrices X \u2208 Rn\u00d7p: Gaussian\nand sub-Gaussian, and three settings for each family: independent isotropic designs, independent\nanisotropic designs where the rows are correlated as \u03a3p\u00d7p, and dependent isotropic designs where\nthe rows are isotropic but columns are correlated as \u0393n\u00d7n, implying dependent samples. In Sec-\ntion 5, we show how to extend the analysis to generalized linear models (GLMs) with sub-Gaussian\ndesign matrices and any norm.\nOur analysis techniques are simple and largely uniform across different types of noise and design\nmatrices. Parts of our analysis are geometric, where Gaussian widths, as a geometric measure of size\nof suitable sets, and associated tools play a key role [4, 7]. 
We also use standard covering arguments, use the Sudakov-Dudley inequality to switch from covering numbers to Gaussian widths [7], and use generic chaining to upper bound 'sub-Gaussian widths' with Gaussian widths [15].

2 Restricted Error Set and Recovery Guarantees

In this section, we give a characterization of the restricted error set Er in which the error vector Δ̂n lives, establish clear relationships between the error sets for the regularized and constrained problems, and finally establish upper bounds on the estimation error. The error bound is deterministic, but has quantities which involve θ*, X, ω, for which we develop high probability bounds in Sections 3, 4, and 5.

2.1 The Restricted Error Set and the Error Cone

We start with a characterization of the restricted error set Er where Δ̂n will belong.

Lemma 1 For any β > 1, assuming

    λn ≥ β R*(∇L(θ*; Z^n)) ,   (2)

where R*(·) is the dual norm of R(·), the error vector Δ̂n = θ̂λn − θ* belongs to the set

    Er = Er(θ*, β) = { Δ ∈ R^p | R(θ* + Δ) ≤ R(θ*) + (1/β) R(Δ) } .   (3)

The restricted error set Er need not be convex for general norms. Interestingly, for β = 1, the inequality in (3) is just the triangle inequality, and is satisfied by all Δ. Note that β > 1 restricts the set of Δ which satisfy the inequality, yielding the restricted error set. In particular, Δ cannot go in the direction of θ*, i.e., Δ ≠ αθ* for any α > 0.
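The set Er of Lemma 1 is easy to test numerically for any norm. The sketch below uses the L1 norm and an illustrative 2-sparse θ*; it confirms that shrinking θ* keeps the error in Er, while a move in the direction of θ* (Δ = αθ* with α > 0) violates the inequality in (3), as noted above.

```python
import numpy as np

def in_restricted_error_set(delta, theta_star, beta,
                            norm=lambda v: np.abs(v).sum()):
    # Membership test for Er(theta*, beta):
    #   R(theta* + delta) <= R(theta*) + R(delta) / beta
    return norm(theta_star + delta) <= norm(theta_star) + norm(delta) / beta

theta_star = np.array([1.0, 1.0, 0.0, 0.0])   # illustrative 2-sparse parameter
beta = 2.0

# Shrinking theta* (delta = -0.5 * theta*) satisfies the inequality.
in_set = in_restricted_error_set(-0.5 * theta_star, theta_star, beta)
# Moving along theta* (delta = +0.5 * theta*) does not.
alpha_dir = in_restricted_error_set(0.5 * theta_star, theta_star, beta)
```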
Further, note that the condition in (2) is similar to that in [9] for β = 2, but the above characterization holds for any norm, not just decomposable norms [9].

While Er need not be a convex set, we establish a relationship between Er and Cc, the cone for the constrained problem [4], where

    Cc = Cc(θ*) = cone{ Δ ∈ R^p | R(θ* + Δ) ≤ R(θ*) } .   (4)

Theorem 1 Let Ar = Er ∩ ρB2^p and Ac = Cc ∩ ρB2^p, where B2^p = {u | ∥u∥2 ≤ 1} is the unit ball of the L2 norm and ρ > 0 is any suitable radius. Then, for any β > 1 we have

    w(Ar) ≤ ( 1 + (2/(β − 1)) · (∥θ*∥2 / ρ) ) w(Ac) ,   (5)

where w(A) denotes the Gaussian width of any set A given by: w(A) = E_g[sup_{a ∈ A} ⟨a, g⟩], where g is an isotropic Gaussian random vector, i.e., g ∼ N(0, I_{p×p}).

Thus, the Gaussian widths of the error sets of the regularized and constrained problems are closely related. In particular, for ∥θ*∥2 = 1, with ρ = 1, β = 2, we have w(Ar) ≤ 3w(Ac). Related observations have been made for the special case of the L1 norm [2], although past work did not provide an explicit characterization in terms of Gaussian widths.
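The Gaussian width can be estimated by Monte Carlo directly from its definition w(A) = E[sup_{a∈A} ⟨a, g⟩]. For the L1 unit ball the supremum is ∥g∥∞, so the width is roughly √(2 log p); the dimension and sample size below are illustrative choices.

```python
import numpy as np

def gaussian_width_l1_ball(p, n_samples=2000, seed=0):
    # w(A) = E[sup_{a in A} <a, g>]; for A = unit L1 ball the sup is ||g||_inf.
    g = np.random.default_rng(seed).standard_normal((n_samples, p))
    return np.abs(g).max(axis=1).mean()

p = 1000
w_est = gaussian_width_l1_ball(p)
w_theory = np.sqrt(2 * np.log(p))   # classical sqrt(2 log p) scale
```

The same estimator works for any set on which the supremum can be computed, e.g., sparse unit vectors or a norm-ball intersected with the sphere.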
The result also suggests that it is\npossible to move between the error analysis of the regularized and the constrained versions of the\nestimation problem.\n\n2.2 Recovery Guarantees\n\nIn order to establish recovery guarantees, we start by assuming that restricted strong convexity (RSC)\nis satis\ufb01ed by the loss function in Cr = cone(Er), i.e., for any \u2206 \u2208 Cr, there exists a suitable\nconstant \u03ba so that\n\n\u03b4L(\u2206, \u03b8\u2217) (cid:44) L(\u03b8\u2217 + \u2206) \u2212 L(\u03b8\u2217) \u2212 (cid:104)\u2207L(\u03b8\u2217), \u2206(cid:105) \u2265 \u03ba(cid:107)\u2206(cid:107)2\n2 .\n\n(6)\nIn Sections 4 and 5, we establish precise forms of the RSC condition for a wide variety of design\nmatrices and loss functions. In order to establish recovery guarantees, we focus on the quantity\n\nF(\u2206) = L(\u03b8\u2217 + \u2206) \u2212 L(\u03b8\u2217) + \u03bbn(R(\u03b8\u2217 + \u2206) \u2212 R(\u03b8\u2217)) .\n\n(7)\nSince \u02c6\u03b8\u03bbn = \u03b8\u2217 + \u02c6\u2206n is the estimated parameter, i.e., \u02c6\u03b8\u03bbn is the minimum of the objective, we\nclearly have F( \u02c6\u2206n) \u2264 0, which implies a bound on (cid:107) \u02c6\u2206n(cid:107)2. Unlike previous analysis, the bound\ncan be established without making any additional assumptions on the norm R(\u03b8). We start with the\nfollowing result, which expresses the upper bound on (cid:107) \u02c6\u2206n(cid:107)2 in terms of the gradient of the objective\nat \u03b8\u2217.\nLemma 2 Assume that the RSC condition is satis\ufb01ed in Cr by the loss L(\u00b7) with parameter \u03ba. 
With Δ̂n = θ̂λn − θ*, for any norm R(·), we have

    ∥Δ̂n∥2 ≤ (1/κ) ∥∇L(θ*) + λn ∇R(θ*)∥2 ,   (8)

where ∇R(·) is any sub-gradient of the norm R(·).

Note that the right hand side is simply the L2 norm of the gradient of the objective evaluated at θ*. For the special case when θ̂λn = θ*, the gradient of the objective is zero, implying correctly that ∥Δ̂n∥2 = 0. While the above result provides useful insights about the bound on ∥Δ̂n∥2, the quantities on the right hand side depend on θ*, which is unknown. We present another form of the result in terms of quantities such as λn, κ, and the norm compatibility constant Ψ(Cr) = sup_{u ∈ Cr} R(u)/∥u∥2, which are often easier to compute or bound.

Theorem 2 Assume that the RSC condition is satisfied in Cr by the loss L(·) with parameter κ. With Δ̂n = θ̂λn − θ*, for any norm R(·), we have

    ∥Δ̂n∥2 ≤ ((1 + β)/β) (λn/κ) Ψ(Cr) .   (9)

The above result is deterministic, but contains λn and κ. In Section 3, we give precise characterizations of λn, which needs to satisfy (2). In Sections 4 and 5, we characterize the RSC condition constant κ for different losses and a variety of design matrices.

3 Bounds on the Regularization Parameter

Recall that the parameter λn needs to satisfy the inequality

    λn ≥ β R*(∇L(θ*; Z^n)) .   (10)

The right hand side of the inequality has two issues: the expression depends on θ*, the optimal parameter which is unknown, and the quantity is a random variable, since it depends on Z^n.
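For the Lasso, R*(∇L(θ*; Z^n)) = ∥X^T ω / n∥∞, since the dual of the L1 norm is the L∞ norm. A short simulation (dimensions, noise level, and trial count are illustrative) shows this random quantity concentrating at the √(2 log p / n) scale, i.e., at the w(ΩR)/√n scale, which is how λn is typically set in practice.

```python
import numpy as np

# Dual-norm quantity for the Lasso: R = L1, so R*(grad) = ||X^T omega / n||_inf.
rng = np.random.default_rng(1)
n, p, sigma, trials = 200, 500, 1.0, 100
vals = []
for _ in range(trials):
    X = rng.standard_normal((n, p))
    omega = sigma * rng.standard_normal(n)       # Gaussian noise
    vals.append(np.abs(X.T @ omega / n).max())   # R*(grad L(theta*; Z^n))

rate = np.sqrt(2 * np.log(p) / n)                # ~ w(Omega_R) / sqrt(n) scale
avg = np.mean(vals)
```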
In\nthis section, we characterize E[R\u2217(\u2207L(\u03b8\u2217; Z n))] in terms of the Gaussian width of the unit norm\nball \u2126R = {u : R(u) \u2264 1}, and also discuss the upper bounds of R\u2217(\u2207L(\u03b8\u2217; Z n)). For ease of\nexposition, we present results for the case of squared loss, i.e., L(\u03b8\u2217; Z n) = 1\n2n(cid:107)y \u2212 X\u03b8\u2217(cid:107)2 with\nthe linear model y = X\u03b8 + \u03c9, where \u03c9 can be Gaussian or sub-Gaussian noise. For this setting,\n\u2207L(\u03b8\u2217; Z n) = 1\nn X T \u03c9. The analysis can be extended to GLMs, using analysis\ntechniques discussed in Section 5.\nGaussian Designs: First, we consider Gaussian designs X, where xij \u223c N (0, 1) are independent,\nand \u03c9 is elementwise independent Gaussian or sub-Gaussian noise.\nTheorem 3 Let \u2126R = {u : R(u) \u2264 1}. Then, for Gaussian design X and Gaussian or sub-\nGaussian noise \u03c9, for a suitable constant \u03b70 > 0, we have\nE[R\u2217(\u2207L(\u03b8\u2217; Z n))] \u2264 \u03b70\u221a\nn\n\nn X T (y \u2212 X\u03b8\u2217) = 1\n\nw(\u2126R) .\nFurther, for any \u03c4 > 0, with probability at least 1 \u2212 3 exp(\u2212 min( \u03c4 2\n(w(\u2126R) + \u03c4 ) ,\n\n2\u03a62 , cn))\n\n(12)\n\n(11)\n\nR\u2217(\u2207L(\u03b8\u2217; Z n)) \u2264 \u03b71\u221a\nn\n\nwhere c is an absolute constant, \u03b71 is a constant depending on sub-Gaussian norm of \u03c9, and \u03a6 is a\nconstant depending on the norm R(\u00b7).\nFor anisotropic Gaussian design or correlated isotropic design, the result continues to hold with\ndifferent \u03b70 and \u03b71, which depend on the covariance structure of X.\nSub-Gaussian Designs: Recall that for a sub-Gaussian variable x, the sub-Gaussian norm |||x|||\u03c82\nsupp\u22651\nxij are i.i.d., and \u03c9 is elementwise independent Gaussian or sub-Gaussian noise.\nTheorem 4 Let \u2126R = {u : R(u) \u2264 1}. 
Then, for sub-Gaussian design X and Gaussian or sub-\nGaussian noise \u03c9, for a suitable constant \u03b72 > 0, we have\nE[R\u2217(\u2207L(\u03b8\u2217; Z n))] \u2264 \u03b72\u221a\nn\n\np (E[|x|p])1/p [18]. Now, we consider sub-Gaussian design X, where |||xij|||\u03c82\n1\u221a\n\n=\n\u2264 k and\n\nw(\u2126R) .\n\n(13)\n\nInterestingly, the analysis for the result above involves \u2018sub-Gaussian width\u2019 which can be upper\nbounded by a constant times the Gaussian width, using generic chaining [15]. Further, one can\nget Gaussian-like exponential concentration around the expectation for important classes of sub-\nGaussian random variables, including bounded random variables [6], and when Xu = (cid:104)h, u(cid:105), where\nu is any unit vector, are such that their Malliavin derivatives have almost surely bounded norm in\n\nL2[0, 1], i.e.,(cid:82) 1\n\n0 |DrXu|2dr \u2264 \u03b7 [19].\n\nNext, we provide a mechanism for bounding the Gaussian width w(\u2126R) of the unit norm ball in\nterms of the Gaussian width of a suitable cone, obtained by shifting or translating the norm ball. In\nparticular, the result involves taking any point on the boundary of the unit norm ball, considering\nthat as the origin, and constructing a cone using the norm ball. Since such a construction can be done\nwith any point on the boundary, the tightest bound is obtained by taking the in\ufb01mum over all points\non the boundary. The motivation behind getting an upper bound of the Gaussian width w(\u2126R) of\nthe unit norm ball in terms of the Gaussian width of such a cone is because considerable advances\nhave been made in recent years in upper bounding Gaussian widths of such cones.\nLemma 3 Let \u2126R = {u : R(u) \u2264 1} be the unit norm ball and \u0398R = {u : R(u) = 1} be the\nboundary. 
For any \u02dc\u03b8 \u2208 \u0398R, \u03c1(\u02dc\u03b8) = sup\u03b8:R(\u03b8)\u22641 (cid:107)\u03b8 \u2212 \u02dc\u03b8(cid:107)2 is the diameter of \u2126R measured with\nrespect to \u02dc\u03b8. Let G(\u02dc\u03b8) = cone(\u2126R \u2212 \u02dc\u03b8) \u2229 \u03c1(\u02dc\u03b8)Bp\n2, i.e., the cone of (\u2126R \u2212 \u02dc\u03b8) intersecting the ball of\nradius \u03c1(\u02dc\u03b8). Then\nw(\u2126R) \u2264 inf\n\u02dc\u03b8\u2208\u0398R\n\nw(G(\u02dc\u03b8)) .\n\n(14)\n\n4\n\n\f\u2265 \u221a\n\nn(cid:107)X\u2206(cid:107)2\n\n4 Least Squares Models: Restricted Eigenvalue Conditions\nWhen the loss function is squared loss, i.e., L(\u03b8; Z n) = 1\n2n(cid:107)y \u2212 X\u03b8(cid:107)2, the RSC condition (6)\n2 \u2265 \u03ba(cid:107)\u2206(cid:107)2\n2,\nbecomes equivalent to the Restricted Eigenvalue (RE) condition [2, 9], i.e., 1\nor equivalently, (cid:107)X\u2206(cid:107)2\n\u03ban for any \u2206 in the error cone Cr. Since the absolute magnitude of\n(cid:107)\u2206(cid:107)2\n(cid:107)\u2206(cid:107)2 does not play a role in the RE condition, without loss of generality we work with unit vectors\nu \u2208 A = Cr \u2229 Sp\u22121, where Sp\u22121 is the unit sphere.\nIn this section, we establish RE conditions for a variety of Gaussian and sub-Gaussian design ma-\ntrices, with isotropic, anisotropic, or dependent rows, i.e., when samples (rows of X) are correlated.\nResults for certain types of design matrices for certain types of norms, especially the L1 norm, have\nappeared in the literature [2, 13, 14]. Our analysis considers a wider variety of design matrices and\nestablishes RSC conditions for any A \u2286 Sp\u22121, thus corresponding to any norm. Interestingly, the\nGaussian width w(A) of A shows up in all bounds, as a geometric measure of the size of the set A,\neven for sub-Gaussian design matrices. In fact, all existing RE results do implicitly have the width\nterm, but in a form speci\ufb01c to the chosen norm [13, 14]. 
The analysis on atomic norm in [4] has the\nw(A) term explicitly, but the analysis relies on Gordon\u2019s inequality [5, 7], which is applicable only\nfor isotropic Gaussian design matrices.\nThe proof technique we use is simple, a standard covering argument, and is largely the same across\nall the cases considered. A unique aspect of our analysis, used in all the proofs, is a way to go\nfrom covering numbers of A to the Gaussian width of A using the Sudakov-Dudley inequality [7].\nOur general techniques are in contrast to much of the existing literature on RE conditions, which\ncommonly use specialized tools such as Gaussian comparison principles [13, 9], and/or specialized\nanalysis geared to a particular norm such as L1 [14].\n\nij] = \u03c32.\n\nj ] = \u0393 \u2208 Rn\u00d7n. For convenience, we assume E[x2\n\n4.1 Restricted Eigenvalue Conditions: Gaussian Designs\nIn this section, we focus on the case of Gaussian design matrices X \u2208 Rn\u00d7p, and consider three\nsettings: (i) independent-isotropic, where the entries are elementwise independent, (ii) independent-\ni ] = \u03a3 \u2208 Rp\u00d7p,\nanisotropic, where rows Xi are independent but each row has a covariance E[XiX T\nand (iii) dependent-isotropic, where the rows are isotropic but the columns Xj are correlated with\nij] = 1, noting that the analysis easily\nE[XjX T\nextends to the general case of E[x2\nIndependent Isotropic Gaussian (IIG) Designs: The IIG setting has been extensively studied\nin the literature [3, 9]. As discussed in the recent work on atomic norms [4], one can use Gordon\u2019s\ninequality [5, 7] to get RE conditions for the IIG setting. 
Our goal in this section is two-fold: first, we present the RE condition obtained using our simple proof technique, and show that it is equivalent, up to constants, to the RE condition obtained using Gordon's inequality, a technique only applicable to the IIG setting; and second, we go over some facets of how we present the results, which will apply to all subsequent RE-style results as well as give a way to plug in κ in the estimation error bound in (9).

Theorem 5 Let the design matrix X ∈ R^{n×p} be elementwise independent and normal, i.e., xij ∼ N(0, 1). Then, for any A ⊆ S^{p−1}, any n ≥ 2, and any τ > 0, with probability at least (1 − η1 exp(−η2 τ²)), we have

    inf_{u ∈ A} ∥Xu∥2 ≥ (1/2)√n − η0 w(A) − τ ,   (15)

where η0, η1, η2 > 0 are absolute constants.

We consider the equivalent result one could obtain by directly using Gordon's inequality [5, 7]:

Theorem 6 Let the design matrix X be elementwise independent and normal, i.e., xij ∼ N(0, 1). Then, for any A ⊆ S^{p−1} and any τ > 0, with probability at least (1 − 2 exp(−τ²/2)), we have

    inf_{u ∈ A} ∥Xu∥2 ≥ γn − w(A) − τ ,   (16)

where γn = E[∥h∥2] > n/√(n+1) is the expected length of a Gaussian random vector in R^n.

Interestingly, the results are equivalent, up to constants. However, unlike Gordon's inequality, our proof technique generalizes to all the other design matrices considered in the sequel.

We emphasize three additional aspects in the context of the above analysis, which will continue to hold for all the subsequent results but will not be discussed explicitly.
First, to get a form of the\nresult which can be used as \u03ba and plugged in to the estimation error bound (9), one can simply\nchoose \u03c4 = 1\n\nn \u2212 \u03b70w(A)) so as to get\n\n\u221a\n\n2 ( 1\n\n2\n\nw(A) ,\n\n(17)\n\n(cid:107)Xu(cid:107)2 \u2265 1\n4\n\ninf\nu\u2208A\n\n\u221a\n\nn \u2212 \u03b70\n2\n\n2\n\nwith high probability. Table 1 shows a summary of recovery bounds on Independent Isotropic\nGaussian design matrices with Gaussian noise. Second, the result does not depend on the fact that\nu \u2208 A \u2286 Cr \u2229 Sp\u22121 so that (cid:107)u(cid:107)2 = 1. For example, one can consider the cone Cr to be intersecting\nwith a sphere \u03c1Sp\u22121 of a different radius \u03c1, to give A\u03c1 = Cr \u2229 \u03c1Sp\u22121 so that u \u2208 A\u03c1 has (cid:107)u(cid:107)2 = \u03c1.\n\u221a\nFor simplicity, let A = A1, i.e., corresponding to \u03c1 = 1. Then, a straightforward extension yields\ninf u\u2208A\u03c1 (cid:107)Xu(cid:107)2 \u2265 ( 1\nn\u2212 \u03b70w(A)\u2212 \u03c4 )(cid:107)u(cid:107)2, with probability at least (1\u2212 \u03b71 exp(\u2212\u03b72\u03c4 2)), since\n(cid:107)2(cid:107)u(cid:107)2 and w(A(cid:107)u(cid:107)2) = (cid:107)u(cid:107)2w(A) [4]. Such a scale independence is in fact\n(cid:107)Xu(cid:107)2 = (cid:107)X u(cid:107)u(cid:107)2\nnecessary for the error bound analysis in Section 2. Finally, note that the leading constant 1\n2 was\na consequence of our choice of \u0001 = 1\n4 for the \u0001-net covering of A in the proof. One can get other\nconstants, less than 1, with different choices of \u0001, and the constants \u03b70, \u03b71, \u03b72 will change based on\nthis choice.\nIndependent Anisotropic Gaussian (IAG) Designs: We consider a setting where the rows Xi of\nthe design matrix are independent, but each row is sampled from an anisotropic Gaussian distribu-\ntion, i.e., Xi \u223c N (0, \u03a3p\u00d7p) where Xi \u2208 Rp. 
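Such an anisotropic design can be simulated by coloring an isotropic one: with X = G Σ^{1/2}, the rows satisfy Xi ∼ N(0, Σ). The sketch below (the AR(1) covariance and sizes are illustrative) verifies that (1/n) E∥Xu∥2² = u^T Σ u = ∥Σ^{1/2} u∥2², the quantity whose restricted infimum drives the anisotropic RE bound.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, rho = 2000, 20, 0.5
# Illustrative covariance: Sigma_ij = rho^|i-j| (AR(1) correlation).
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Sigma_half = np.linalg.cholesky(Sigma)     # lower-triangular L with L L^T = Sigma

G = rng.standard_normal((n, p))
X = G @ Sigma_half.T                       # rows X_i ~ N(0, Sigma)

u = rng.standard_normal(p)
u /= np.linalg.norm(u)
empirical = np.linalg.norm(X @ u) ** 2 / n   # (1/n) ||Xu||_2^2 -> u^T Sigma u
expected = u @ Sigma @ u                     # = ||Sigma^{1/2} u||_2^2
```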
The setting has been considered in the literature [13]\nfor the special case of L1 norms, and sharp results have been established using Gaussian comparison\ntechniques [7]. We show that equivalent results can be obtained by our simple technique, which does\nnot rely on Gaussian comparisons [7, 9].\nTheorem 7 Let the design matrix X be row wise independent and each row Xi \u223c N (0, \u03a3p\u00d7p).\nThen, for any A \u2286 Sp\u22121 and any \u03c4 > 0, with probability at least 1 \u2212 \u03b71 exp(\u2212\u03b72\u03c4 2), we have\n\n(18)\n\u03bd = inf u\u2208A (cid:107)\u03a31/2u(cid:107)2, \u039bmax(\u03a3) denotes the largest eigenvalue of \u03a3 and \u03b70, \u03b71, \u03b72 > 0 are\n\ninf\nu\u2208A\n\n\u03bd\n\n\u03bd\n\nn \u2212 \u03b70\u039bmax(\u03a3)\n\n\u221a\n\nw(A) \u2212 \u03c4 ,\n\n(cid:107)Xu(cid:107)2 \u2265 1\n2\n\n\u221a\n\n\u221a\n\n\u221a\nwhere\nconstants.\n\n(cid:107) \u02dcXu(cid:107)2 \u2265 3\n4\nwhere \u03b70, \u03b71, \u03b72 > 0 are constants.\n\ninf\nu\u2208A\n\nNote that with the assumption that E[x2\nij] = 1, \u0393 will be a correlation matrix implying Tr(\u0393) = n,\nand making the sample size dependence explicit. Intuitively, due to sample correlations, n samples\nare effectively equivalent to Tr(\u0393)\n\n\u039bmax(\u0393) =\n\nn\n\n\u039bmax(\u0393) samples.\n\n6\n\n\u221a\n\n\u03bd appears in [13] as\nA comparison with the results of [13] is instructive. The leading term\nwell\u2014we have simply considered inf u\u2208A on both sides, and the result in [13] is for any u with the\n(cid:107)\u03a31/2u(cid:107)2 term. The second term in [13] depends on the largest entry in the diagonal of \u03a3,\nlog p,\nand (cid:107)u(cid:107)1. These terms are a consequence of the special case analysis for L1 norm. 
In contrast, we\nconsider the general case and simply get the scaled Gaussian width term \u039bmax(\u03a3)\u221a\nDependent Isotropic Gaussian (DIG) Designs: We now consider a setting where the rows of the\nj ] = \u0393 \u2208\ndesign matrix \u02dcX are isotropic Gaussians, but the columns \u02dcXj are correlated with E[ \u02dcXj \u02dcX T\nRn\u00d7n. Interestingly, correlation structure over the columns make the samples dependent, a scenario\nwhich has not yet been widely studied in the literature [22, 10]. We show that our simple technique\ncontinues to work in this scenario and gives a rather intuitive result.\nTheorem 8 Let \u02dcX \u2208 Rn\u00d7p be a matrix whose rows \u02dcXi are isotropic Gaussian random vectors in\nj ] = \u0393. Then, for any set A \u2286 Sp\u22121 and any\nRp and the columns \u02dcXj are correlated with E[ \u02dcXj \u02dcX T\n\u03c4 > 0, with probability at least (1 \u2212 \u03b71 exp(\u2212\u03b72\u03c4 2), we have\n\n\u03bd w(A).\n\n(cid:112)Tr(\u0393) \u2212(cid:112)\u039bmax(\u0393)\n\n(cid:18)\n\n\u03b70w(A) +\n\n5\n2\n\n\u2212 \u03c4\n\n(19)\n\n\u221a\n\n(cid:19)\n\n\fj ] = \u0393n\u00d7n. For convenience, we assume E[x2\n\n4.2 Restricted Eigenvalue Conditions: Sub-Gaussian Designs\nIn this section, we focus on the case of sub-Gaussian design matrices X \u2208 Rn\u00d7p, and consider three\nsettings: (i) independent-isotropic, where the rows are independent and isotropic, (ii) independent-\nanisotropic, where the rows Xi are independent but each row has a covariance E[XiX T\ni ] = \u03a3p\u00d7p,\nand (iii) dependent-isotropic, where the rows are isotropic and the columns Xj are correlated\nwith E[XjX T\nij] = 1 and the sub-Gaussian norm\n\u2264 k [18]. 
In recent work, [17] also considers generalizations of RE conditions to sub-\n|||xij|||\u03c82\nGaussian designs, although our proof techniques are different.\nIndependent Isotropic Sub-Gaussian Designs: We start with the setting where the sub-Gaussian\ndesign matrix X \u2208 Rn\u00d7p has independent rows Xi and each row is isotropic.\nTheorem 9 Let X \u2208 Rn\u00d7p be a design matrix whose rows Xi are independent isotropic sub-\nGaussian random vectors in Rp. Then, for any set A \u2286 Sp\u22121 and any \u03c4 > 0, with probability at\nleast (1 \u2212 2 exp(\u2212\u03b71\u03c4 2)), we have\ninf\nu\u2208A\n\nn \u2212 \u03b70w(A) \u2212 \u03c4 ,\n\n(cid:107)Xu(cid:107)2 \u2265 \u221a\n\n(20)\n\n= k and E[XiX T\n\n(cid:107)Xu(cid:107)2 \u2265 \u221a\n\nwhere \u03b70, \u03b71 > 0 are constants which depend only on the sub-Gaussian norm |||xij|||\u03c82\nIndependent Anisotropic Sub-Gaussian Designs: We consider a setting where the rows Xi of the\ndesign matrix are independent, but each row is sampled from an anisotropic sub-Gaussian distribu-\ntion, i.e., |||xij|||\u03c82\nTheorem 10 Let the sub-Gaussian design matrix X be row wise independent, and each row has\ni ] = \u03a3 \u2208 Rp\u00d7p. Then, for any A \u2286 Sp\u22121 and any \u03c4 > 0, with probability at least\nE[XiX T\n(1 \u2212 2 exp(\u2212\u03b71\u03c4 2)), we have\ninf\nu\u2208A\n\n(21)\n\u03bd = inf u\u2208A (cid:107)\u03a31/2u(cid:107)2, \u039bmax(\u03a3) denotes the largest eigenvalue of \u03a3, and \u03b70, \u03b71 > 0 are\n\nwhere\nconstants which depend on the sub-Gaussian norm |||xij|||\u03c82\nNote that [14] establish RE conditions for anisotropic sub-Gaussian designs for the special case of\nL1 norm. 
In contrast, our results are general and in terms of the Gaussian width w(A).\nDependent Isotropic Sub-Gaussian Designs: We consider the setting where the sub-Gaussian de-\nsign matrix \u02dcX has isotropic sub-Gaussian rows, but the columns \u02dcXj are correlated with E[ \u02dcXj \u02dcX T\nj ] =\n\u0393, implying dependent samples.\nTheorem 11 Let \u02dcX \u2208 Rn\u00d7p be a sub-Gaussian design matrix with isotropic rows and correlated\nj ] = \u0393 \u2208 Rn\u00d7n. Then, for any A \u2286 Sp\u22121 and any \u03c4 > 0, with probability at\ncolumns with E[ \u02dcXj \u02dcX T\nleast (1 \u2212 2 exp(\u2212\u03b71\u03c4 2)), we have\n\nn \u2212 \u03b70 \u039bmax(\u03a3) w(A) \u2212 \u03c4 ,\n\n(cid:107) \u02dcXu(cid:107)2 \u2265(cid:112)Tr(\u0393) \u2212 \u03b70 \u039bmax(\u0393)w(A) \u2212 \u03c4 ,\n\ni ] = \u03a3p\u00d7p.\n\n\u221a\n\n\u03bd\n\n= k.\n\n= k.\n\n\u221a\n\nn(cid:88)\n\ni=1\n\n7\n\nwhere \u03b70, \u03b71 are constants which depend on the sub-Gaussian norm |||xij|||\u03c82\n\ninf\nu\u2208A\n\n(22)\n\n= k.\n\n5 Generalized Linear Models: Restricted Strong Convexity\nIn this section, we consider the setting where the conditional probabilistic distribution of y|x follows\n(cid:80)n\nan exponential family distribution: p(y|x; \u03b8) = exp{y(cid:104)\u03b8, x(cid:105) \u2212 \u03c8((cid:104)\u03b8, x(cid:105))}, where \u03c8(\u00b7) is the log-\npartition function. Generalized linear models consider the negative likelihood of such conditional\ni=1(\u03c8((cid:104)\u03b8, Xi(cid:105)) \u2212 (cid:104)\u03b8, yiXi(cid:105)). Least squares\ndistributions as the loss function: L(\u03b8; Z n) = 1\nregression and logistic regression are popular special cases of GLMs. Since \u2207\u03c8((cid:104)\u03b8, x(cid:105)) = E[y|x],\nwe have \u2207L(\u03b8\u2217; Z n) = 1\nn X T \u03c9, where \u03c9i = \u2207\u03c8((cid:104)\u03b8, Xi(cid:105)) \u2212 yi = E[y|Xi] \u2212 yi plays the role of\nnoise. 
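For a concrete GLM, the sketch below implements the logistic case, ψ(t) = log(1 + e^t), and verifies numerically that ∇L(θ*; Z^n) = X^T ω / n with ωi = ∇ψ(⟨θ*, Xi⟩) − yi; note that ωi is bounded in [−1, 1] and hence sub-Gaussian. All sizes are illustrative.

```python
import numpy as np

def glm_loss_grad(theta, X, y):
    """Logistic GLM: psi(t) = log(1 + e^t);
    loss = (1/n) sum_i [psi(<theta, X_i>) - y_i <theta, X_i>]."""
    n = X.shape[0]
    t = X @ theta
    loss = np.mean(np.logaddexp(0.0, t) - y * t)
    omega = 1.0 / (1.0 + np.exp(-t)) - y      # grad psi - y plays the role of noise
    grad = X.T @ omega / n                    # = (1/n) X^T omega
    return loss, grad, omega

rng = np.random.default_rng(4)
n, p = 500, 10
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p) / np.sqrt(p)
prob = 1.0 / (1.0 + np.exp(-(X @ theta_star)))
y = (rng.random(n) < prob).astype(float)
loss, grad, omega = glm_loss_grad(theta_star, X, y)

# Finite-difference check of the gradient along a random direction.
d = rng.standard_normal(p)
eps = 1e-6
loss_plus, _, _ = glm_loss_grad(theta_star + eps * d, X, y)
fd = (loss_plus - loss) / eps
```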
Table 1: A summary of various values for $L_1$ and $L_2$ norms with all values correct up to constants.

  $R(u)$        $\lambda_n := c_1 \frac{w(\Omega_R)}{\sqrt{n}}$   $\kappa := \big[\max\big\{1 - c_2 \frac{w(A)}{\sqrt{n}},\, 0\big\}\big]^2$   $\Psi(C_r)$   $\|\hat{\Delta}_n\|_2 := c_3 \frac{\Psi(C_r)\lambda_n}{\kappa}$
  $L_1$ norm    $O(\sqrt{\log p / n})$                            $O(1)$                                                                       $\sqrt{s}$    $O(\sqrt{s \log p / n})$
  $L_2$ norm    $O(\sqrt{p/n})$                                   $O(1)$                                                                       $1$           $O(\sqrt{p/n})$

Hence, the analysis in Section 3 can be applied assuming $\omega$ is Gaussian or sub-Gaussian. To obtain RSC conditions for GLMs, first note that
$$\delta\mathcal{L}(\theta^*, \Delta; Z^n) = \frac{1}{n} \sum_{i=1}^n \nabla^2\psi\big(\langle \theta^*, X_i\rangle + \gamma_i \langle \Delta, X_i\rangle\big) \langle \Delta, X_i\rangle^2, \qquad (23)$$
where $\gamma_i \in [0, 1]$, by the mean value theorem. Since $\psi$ is of Legendre type, the second derivative $\nabla^2\psi(\cdot)$ is always positive. Since the RSC condition relies on a non-trivial lower bound for the above quantity, the analysis considers a suitable compact set where $\ell = \ell_\psi(T) = \min_{|a| \leq 2T} \nabla^2\psi(a)$ is bounded away from zero. Outside this compact set, we will only use $\nabla^2\psi(\cdot) > 0$. Then,
$$\delta\mathcal{L}(\theta^*, \Delta; Z^n) \geq \frac{\ell}{n} \sum_{i=1}^n \langle X_i, \Delta\rangle^2\, I[|\langle X_i, \theta^*\rangle| < T]\, I[|\langle X_i, \Delta\rangle| < T]. \qquad (24)$$
We give a characterization of the RSC condition for independent isotropic sub-Gaussian design matrices $X \in \mathbb{R}^{n \times p}$. The analysis can be suitably generalized to the other design matrices considered in Section 4 by using the same techniques. As before, we denote $\Delta$ by $u$, and consider $u \in A \subseteq S^{p-1}$ so that $\|u\|_2 = 1$.
Further, we assume $\|\theta^*\|_2 \leq c_1$ for some constant $c_1$. Assuming $X$ has sub-Gaussian entries with $|||x_{ij}|||_{\psi_2} \leq k$, $\langle X_i, \theta^*\rangle$ and $\langle X_i, u\rangle$ are sub-Gaussian random variables with sub-Gaussian norm at most $Ck$. Let $\varphi_1 = \varphi_1(T; u) = P\{|\langle X_i, u\rangle| > T\} \leq e \cdot \exp(-c_2 T^2 / C^2 k^2)$, and $\varphi_2 = \varphi_2(T; \theta^*) = P\{|\langle X_i, \theta^*\rangle| > T\} \leq e \cdot \exp(-c_2 T^2 / C^2 k^2)$. The result we present is in terms of the constants $\ell = \ell_\psi(T)$, $\varphi_1 = \varphi_1(T; u)$, and $\varphi_2 = \varphi_2(T; \theta^*)$ for any suitably chosen $T$.

Theorem 12 Let $X \in \mathbb{R}^{n \times p}$ be a design matrix with independent isotropic sub-Gaussian rows. Then, for any set $A \subseteq S^{p-1}$, any $\alpha \in (0,1)$, any $\tau > 0$, and any $n \geq \frac{K^4}{c_2\, \alpha^2 (1-\varphi_1-\varphi_2)^5}\big(c\, w^2(A) + (1-\alpha)\tau^2\big)$ for suitable constants $c$ and $c_2$, with probability at least $1 - 3\exp(-\eta_1 \tau^2)$, we have
$$\inf_{u \in A} \sqrt{n\, \delta\mathcal{L}(\theta^*, u; Z^n)} \geq \frac{\ell}{\sqrt{2}}\, \pi\big(\sqrt{n} - \eta_0\, w(A) - \tau\big), \qquad (25)$$
where $\pi = (1-\alpha)(1-\varphi_1-\varphi_2)$, $\ell = \ell_\psi(T) = \min_{|a| \leq 2T} \nabla^2\psi(a)$, and the constants $(\eta_0, \eta_1)$ depend on the sub-Gaussian norm $|||x_{ij}|||_{\psi_2} = k$.

The form of the result is closely related to the corresponding result for the RE condition on $\inf_{u \in A} \|Xu\|_2$ in Section 4.2. Note that RSC analysis for GLMs was considered in [9] for specific norms, especially $L_1$, whereas our analysis applies to any set $A \subseteq S^{p-1}$, and hence to any norm.
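The truncation step (24) behind Theorem 12 can also be checked deterministically. For logistic regression, $\nabla^2\psi(a) = \sigma(a)(1-\sigma(a))$ is even and decreasing in $|a|$, so $\ell_\psi(T) = \sigma(2T)(1-\sigma(2T))$, and whenever both indicators in (24) hold, the mean-value point in (23) lies in $[-2T, 2T]$. The sketch below is a sanity check (not from the paper) with arbitrary simulated $X$, $\theta^*$, and $\Delta$; it computes $\delta\mathcal{L}$ in the mean-value form $\langle \nabla\mathcal{L}(\theta^* + \Delta) - \nabla\mathcal{L}(\theta^*), \Delta\rangle$, which matches the summation in (23):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, T = 200, 8, 1.5

X = rng.standard_normal((n, p))
theta = rng.standard_normal(p) / np.sqrt(p)   # stands in for theta*
delta = rng.standard_normal(p) / np.sqrt(p)   # error direction Delta

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))           # grad psi for logistic

a, b = X @ theta, X @ delta

# Mean-value form of (23): <grad L(theta*+Delta) - grad L(theta*), Delta>
# = (1/n) sum_i psi''(a_i + gamma_i b_i) b_i^2 for some gamma_i in [0, 1].
dL = np.mean((sigmoid(a + b) - sigmoid(a)) * b)

# ell_psi(T) = min_{|a| <= 2T} psi''(a); psi'' = s(1-s) is smallest at |a| = 2T.
s = sigmoid(2 * T)
ell = s * (1 - s)

# RHS of (24): quadratic term truncated by the two indicator functions.
rhs = ell * np.mean(b**2 * (np.abs(a) < T) * (np.abs(b) < T))

print(f"delta L = {dL:.5f} >= truncated bound = {rhs:.5f}")
```

The inequality holds sample by sample: terms failing the indicators still contribute non-negatively on the left and nothing on the right, which is exactly the argument leading to (24).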
Further, following an argument structure similar to that of Section 4.2, the analysis for GLMs can be extended to anisotropic and dependent design matrices.

6 Conclusions

The paper presents a general set of results and tools for characterizing non-asymptotic estimation error in norm regularized regression problems. The analysis holds for any norm, and includes much of the existing literature focused on structured sparsity and related themes as special cases. The work can be viewed as a direct generalization of the results in [9], which presented related results for decomposable norms. Our analysis illustrates the important role Gaussian widths, as a geometric measure of the size of suitable sets, play in such results. Further, the error sets of the regularized and constrained versions of such problems are shown to be closely related [2]. Going forward, it will be interesting to explore similar generalizations for the semi-parametric and non-parametric settings.

Acknowledgements: We thank the anonymous reviewers for helpful comments and suggestions on related work. We thank Sergey Bobkov, Snigdhansu Chatterjee, and Pradeep Ravikumar for discussions related to the paper. The research was supported by NSF grants IIS-1447566, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, and by NASA grant NNX12AQ39A.

References

[1] D. Amelunxen, M. Lotz, M. B. McCoy, and J. A. Tropp. Living on the edge: A geometric theory of phase transitions in convex optimization. Inform. Inference, 3(3):224-294, 2013.

[2] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705-1732, 2009.

[3] P. Buhlmann and S. van de Geer. Statistics for High Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011.

[4] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems.
Foundations of Computational Mathematics, 12(6):805-849, 2012.

[5] Y. Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric Aspects of Functional Analysis, volume 1317 of Lecture Notes in Mathematics, pages 84-106. Springer, 1988.

[6] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001.

[7] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2013.

[8] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246-270, 2009.

[9] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4):538-557, December 2012.

[10] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics, 39(2):1069-1097, 2011.

[11] S. Oymak, C. Thrampoulidis, and B. Hassibi. The squared-error of generalized Lasso: A precise analysis. arXiv:1311.0830v2, 2013.

[12] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Transactions on Information Theory, 59(1):482-494, 2013.

[13] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241-2259, 2010.

[14] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434-3447, 2013.

[15] M. Talagrand. The Generic Chaining. Springer, 2005.

[16] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

[17] J. A. Tropp.
Convex recovery of a structured signal from independent random linear measurements. In Sampling Theory, a Renaissance. (To appear), 2014.

[18] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, chapter 5, pages 210-268. Cambridge University Press, 2012.

[19] A. B. Vizcarra and F. G. Viens. Some applications of the Malliavin calculus to sub-Gaussian and non-sub-Gaussian random fields. In Seminar on Stochastic Analysis, Random Fields and Applications, volume 59 of Progress in Probability, pages 363-396. Birkhäuser, 2008.

[20] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.

[21] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2567, November 2006.

[22] S. Zhou. Gemini: Graph estimation with matrix variate normal instances. The Annals of Statistics, 42(2):532-562, 2014.