{"title": "A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers", "book": "Advances in Neural Information Processing Systems", "page_first": 1348, "page_last": 1356, "abstract": "The estimation of high-dimensional parametric models requires imposing some structure on the models, for instance that they be sparse, or that matrix structured parameters have low rank. A general approach for such structured parametric model estimation is to use regularized M-estimation procedures, which regularize a loss function that measures goodness of fit of the parameters to the data with some regularization function that encourages the assumed structure. In this paper, we aim to provide a unified analysis of such regularized M-estimation procedures. In particular, we report the convergence rates of such estimators in any metric norm. Using just our main theorem, we are able to rederive some of the many existing results, but also obtain a wide range of novel convergence rates results. Our analysis also identifies key properties of loss and regularization functions such as restricted strong convexity, and decomposability, that ensure the corresponding regularized M-estimators have good convergence rates.", "full_text": "A uni\ufb01ed framework for high-dimensional analysis of\n\nM-estimators with decomposable regularizers\n\nSahand Negahban\nDepartment of EECS\n\nUC Berkeley\n\nsahand n@eecs.berkeley.edu\n\nPradeep Ravikumar\n\nDepartment of Computer Sciences\npradeepr@cs.utexas.edu\n\nUT Austin\n\nMartin J. Wainwright\nDepartment of Statistics\nDepartment of EECS\n\nUC Berkeley\n\nwainwrig@eecs.berkeley.edu\n\nBin Yu\n\nDepartment of Statistics\nDepartment of EECS\n\nUC Berkeley\n\nbinyu@stat.berkeley.edu\n\nAbstract\n\nHigh-dimensional statistical inference deals with models in which the the num-\nber of parameters p is comparable to or larger than the sample size n. 
Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates.\n\n1 Introduction\nIn many fields of science and engineering (among them genomics, financial engineering, natural language processing, remote sensing, and social network analysis), one encounters statistical inference problems in which the number of predictors p is comparable to or even larger than the number of observations n. 
Under this type of high-dimensional scaling, it is usually impossible to obtain statistically consistent estimators unless one restricts to subclasses of models with particular structure. For instance, the data might be sparse in a suitably chosen basis, could lie on some manifold, or the dependencies among the variables might have Markov structure specified by a graphical model.\nIn such settings, a common approach to estimating model parameters is through the use of a regularized M-estimator, in which some loss function (e.g., the negative log-likelihood of the data) is regularized by a function appropriate to the assumed structure. Such estimators may also be interpreted from a Bayesian perspective as maximum a posteriori estimates, with the regularizer reflecting prior information. In this paper, we study such regularized M-estimation procedures, and attempt to provide a unifying framework that both recovers some existing results and provides new results on consistency and convergence rates under high-dimensional scaling. We illustrate some applications of this general framework via three running examples of constrained parametric structures. The first class is that of sparse vector models; we consider both the case of "hard-sparse" models, which involve an explicit constraint on the number of non-zero model parameters, and also a class of "weak-sparse" models in which the ordered coefficients decay at a certain rate. Second, we consider block-sparse models, in which the parameters are matrix-structured, and entire rows are either zero or not. Our third class is that of low-rank matrices, which arise in system identification, collaborative filtering, and other types of matrix completion problems.\nTo motivate the need for a unified analysis, let us provide a brief (and hence necessarily incomplete) overview of the broad range of past and on-going work on high-dimensional inference. 
For the case of sparse regression, a popular regularizer is the ℓ1 norm of the parameter vector, which is the sum of the absolute values of the parameters. A number of researchers have studied the Lasso [15, 3] as well as the closely related Dantzig selector [2] and provided conditions on various aspects of its behavior, including ℓ2-error bounds [7, 1, 21, 2] and model selection consistency [22, 19, 6, 16]. For generalized linear models (GLMs) and exponential family models, estimators based on ℓ1-regularized maximum likelihood have also been studied, including results on risk consistency [18] and model selection consistency [11]. A body of work has focused on the case of estimating Gaussian graphical models, including convergence rates in Frobenius and operator norm [14], and results on operator norm and model selection consistency [12]. Motivated by inference problems involving block-sparse matrices, other researchers have proposed block-structured regularizers [17, 23], and more recently, high-dimensional consistency results have been obtained for model selection and parameter consistency [4, 8].\nIn this paper, we derive a single main theorem, and show how we are able to rederive a wide range of known results on high-dimensional consistency, as well as some novel ones, including estimation error rates for low-rank matrices, sparse matrices, and "weakly"-sparse vectors. Due to space constraints, many of the technical details are deferred to the full-length version of this conference paper.\n2 Problem formulation and some key properties\nIn this section, we begin with a precise formulation of the problem, and then develop some key properties of the regularizer and loss function. 
In particular, we define a notion of decomposability for regularizing functions r, and then prove that when it is satisfied, the error Δ̂ = θ̂ − θ* of the regularized M-estimator must satisfy certain constraints. We use these constraints to define a notion of restricted strong convexity that the loss function must satisfy.\n2.1 Problem set-up\nConsider a random variable Z with distribution P taking values in a set Z. Let Z_1^n := {Z1, . . . , Zn} denote n observations drawn in an i.i.d. manner from P, and suppose θ* ∈ R^p is some parameter of this distribution. We consider the problem of estimating θ* from the data Z_1^n, and in order to do so, we consider the following class of regularized M-estimators. Let L : R^p × Z^n → R be some loss function that assigns a cost to any parameter θ ∈ R^p, for a given set of observations Z_1^n. Let r : R^p → R denote a regularization function. We then consider the regularized M-estimator given by\nθ̂ ∈ arg min_{θ ∈ R^p} { L(θ; Z_1^n) + λ_n r(θ) },   (1)\nwhere λ_n > 0 is a user-defined regularization penalty. For ease of notation, in the sequel, we adopt the shorthand L(θ) for L(θ; Z_1^n). Throughout the paper, we assume that the loss function L is convex and differentiable, and that the regularizer r is a norm.\nOur goal is to provide general techniques for deriving bounds on the error θ̂ − θ* in some error metric d. A common example is the ℓ2-norm d(θ̂ − θ*) := ‖θ̂ − θ*‖2. As discussed earlier, high-dimensional parameter estimation is made possible by structural constraints on θ* such as sparsity, and we will see that the behavior of the error is determined by how well these constraints are captured by the regularization function r(·). We now turn to the properties of the regularizer r and the loss function L that underlie our analysis.\n2.2 Decomposability\nOur first condition requires that the regularization function r be decomposable, in a sense to be defined precisely, with respect to a family of subspaces. This notion is a formalization of the manner in which the regularization function imposes constraints on possible parameter vectors θ* ∈ R^p. We begin with some abstract definitions, which we then illustrate with a number of concrete examples. Take some arbitrary inner product space H, and let ‖·‖2 denote the norm induced by the inner product. Consider a pair (A, B) of subspaces of H such that A ⊆ B⊥. For a given subspace A and vector u ∈ H, we let π_A(u) := argmin_{v ∈ A} ‖u − v‖2 denote the orthogonal projection of u onto A. We let V = {(A, B) | A ⊆ B⊥} be a collection of subspace pairs. For a given statistical model, our goal is to construct subspace collections V such that for any given θ* from our model class, there exists a pair (A, B) ∈ V with ‖π_A(θ*)‖2 ≈ ‖θ*‖2, and ‖π_B(θ*)‖2 ≈ 0. Of most interest to us are subspace pairs (A, B) in which this property holds but the subspace A is relatively small and B is relatively large. Note that A represents the constraints underlying our model class, and imposed by our regularizer. 
For the bulk of the paper, we assume that H = R^p and use the standard Euclidean inner product (which should be assumed unless otherwise specified).\nAs a first concrete (but toy) example, consider the model class of all vectors θ* ∈ R^p, and the subspace collection T that consists of the single subspace pair (A, B) = (R^p, 0). We refer to this choice (V = T) as the trivial subspace collection. In this case, for any θ* ∈ R^p, we have π_A(θ*) = θ* and π_B(θ*) = 0. Although this collection satisfies our desired property, it is not so useful since A = R^p is a very large subspace. As a second example, consider the class of s-sparse parameter vectors θ* ∈ R^p, meaning that θ*_i ≠ 0 only if i ∈ S, where S is some s-sized subset of {1, 2, . . . , p}. For any given subset S and its complement S^c, let us define the subspaces\nA(S) = {θ ∈ R^p | θ_{S^c} = 0}, and B(S) = {θ ∈ R^p | θ_S = 0},\nand the s-sparse subspace collection S = {(A(S), B(S)) | S ⊂ {1, . . . , p}, |S| = s}. With this set-up, for any s-sparse parameter vector θ*, we are guaranteed that there exists some (A, B) ∈ S such that π_A(θ*) = θ* and π_B(θ*) = 0. In this case, the property is more interesting, since the subspaces A(S) are relatively small as long as |S| = s ≪ p.\nWith this set-up, we say that the regularizer r is decomposable with respect to a given subspace pair (A, B) if\nr(u + z) = r(u) + r(z) for all u ∈ A and z ∈ B.   (2)\nIn our subsequent analysis, we impose the following condition on the regularizer:\nDefinition 1. 
The regularizer r is decomposable with respect to a given subspace collection V, meaning that it is decomposable for each subspace pair (A, B) ∈ V.\nNote that any regularizer is decomposable with respect to the trivial subspace collection T = {(R^p, 0)}. It will be of more interest to us when the regularizer decomposes with respect to a larger collection V that includes subspace pairs (A, B) in which A is relatively small and B is relatively large. Let us illustrate with some examples.\n• Sparse vectors and ℓ1 norm regularization. Consider a model involving s-sparse regression vectors θ* ∈ R^p, and recall the definition of the s-sparse subspace collection S discussed above. We claim that the ℓ1-norm regularizer r(u) = ‖u‖1 is decomposable with respect to S. Indeed, for any s-sized subset S and vectors u ∈ A(S) and v ∈ B(S), we have ‖u + v‖1 = ‖u‖1 + ‖v‖1, as required.\n• Group-structured sparse matrices and ℓ1,q matrix norms. Various statistical problems involve matrix-valued parameters Θ ∈ R^{k×m}; examples include multivariate regression problems or (inverse) covariance matrix estimation. We can define an inner product on such matrices via ⟨⟨Θ, Σ⟩⟩ = trace(Θ^T Σ) and the induced (Frobenius) norm [Σ_{i=1}^k Σ_{j=1}^m Θ_{ij}²]^{1/2}. Let us suppose that Θ satisfies a group sparsity condition, meaning that the ith row, denoted Θ_i, is non-zero only if i ∈ S ⊆ {1, . . . , k} and the cardinality of S is controlled. For a given subset S, we can define the subspace pair\nB(S) = {Θ ∈ R^{k×m} | Θ_i = 0 for all i ∈ S^c}, and A(S) = (B(S))⊥.\nFor some fixed s ≤ k, we then consider the collection\nV = {(A(S), B(S)) | S ⊂ {1, . . . , k}, |S| = s},\nwhich is a group-structured analog of the s-sparse set S for vectors. For any q ∈ [1, ∞], now suppose that the regularizer is the ℓ1/ℓq matrix norm, given by r(Θ) = Σ_{i=1}^k [Σ_{j=1}^m |Θ_{ij}|^q]^{1/q}, corresponding to applying the ℓq norm to each row and then taking the ℓ1-norm of the result. It can be seen that the regularizer r(Θ) = |||Θ|||_{1,q} is decomposable with respect to the collection V.\n• Low-rank matrices and nuclear norm. The estimation of low-rank matrices arises in various contexts, including principal component analysis, spectral clustering, collaborative filtering, and matrix completion. In particular, consider the class of matrices Θ ∈ R^{k×m} that have rank r ≤ min{k, m}. For any given matrix Θ, we let row(Θ) ⊆ R^m and col(Θ) ⊆ R^k denote its row space and column space respectively. For a given pair of r-dimensional subspaces U ⊆ R^k and V ⊆ R^m, we define a pair of subspaces A(U, V) and B(U, V) of R^{k×m} as follows:\nA(U, V) := {Θ ∈ R^{k×m} | row(Θ) ⊆ V, col(Θ) ⊆ U},   (3a)\nB(U, V) := {Θ ∈ R^{k×m} | row(Θ) ⊆ V⊥, col(Θ) ⊆ U⊥}.   (3b)\nNote that A(U, V) ⊆ B⊥(U, V), as is required by our construction. We then consider the collection V = {(A(U, V), B(U, V)) | U ⊆ R^k, V ⊆ R^m}, where (U, V) range over all pairs of r-dimensional subspaces. Now suppose that we regularize with the nuclear norm r(Θ) = |||Θ|||_1, corresponding to the sum of the singular values of the matrix Θ. It can be shown that the nuclear norm is decomposable with respect to V. 
Indeed, since any pair of matrices M ∈ A(U, V) and M′ ∈ B(U, V) have orthogonal row and column spaces, we have |||M + M′|||_1 = |||M|||_1 + |||M′|||_1 (e.g., see the paper [13]).\nThus, we have demonstrated various models and regularizers in which decomposability is satisfied with interesting subspace collections V. We now show that decomposability has important consequences for the error Δ̂ = θ̂ − θ*, where θ̂ ∈ R^p is any optimal solution of the regularized M-estimation procedure (1). In order to state a lemma that captures this fact, we need to define the dual norm of the regularizer, given by r*(v) := sup_{u ∈ R^p, u ≠ 0} ⟨u, v⟩ / r(u). For the regularizers of interest, the dual norm can be obtained via some easy calculations. For instance, given a vector θ ∈ R^p and r(θ) = ‖θ‖1, we have r*(θ) = ‖θ‖∞. Similarly, given a matrix Θ ∈ R^{k×m} and the nuclear norm regularizer r(Θ) = |||Θ|||_1, we have r*(Θ) = |||Θ|||_2, corresponding to the operator norm (or maximal singular value).\nLemma 1. Suppose θ̂ is an optimal solution of the regularized M-estimation procedure (1), with associated error Δ̂ = θ̂ − θ*. Furthermore, suppose that the regularization penalty is strictly positive with λ_n ≥ 2 r*(∇L(θ*)). Then for any (A, B) ∈ V,\nr(π_B(Δ̂)) ≤ 3 r(π_{B⊥}(Δ̂)) + 4 r(π_{A⊥}(θ*)).\nThis property plays an essential role in our definition of restricted strong convexity and subsequent analysis.\n2.3 Restricted Strong Convexity\nNext we state our assumption on the loss function L. In general, guaranteeing that L(θ̂) − L(θ*) is small is not sufficient to show that θ̂ and θ* are close. (As a trivial example, consider a loss function that is identically zero.) 
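Both the nuclear-norm decomposability just discussed and the dual-norm pairs used in Lemma 1 are easy to sanity-check numerically. The sketch below (NumPy; the dimensions are arbitrary choices, not from the paper) builds M ∈ A(U, V) and M′ ∈ B(U, V) with orthogonal row and column spaces and checks |||M + M′|||₁ = |||M|||₁ + |||M′|||₁, then confirms that the supremum defining each dual norm is attained where the text says it is:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, r = 6, 5, 2

# Random r-dimensional column space U ⊆ R^k and row space V ⊆ R^m.
U, _ = np.linalg.qr(rng.standard_normal((k, r)))
V, _ = np.linalg.qr(rng.standard_normal((m, r)))
Uperp = np.linalg.svd(U, full_matrices=True)[0][:, r:]      # orthobasis of U-perp
Vperp = np.linalg.svd(V, full_matrices=True)[0][:, r:]      # orthobasis of V-perp

M = U @ rng.standard_normal((r, r)) @ V.T                   # M in A(U, V)
Mp = Uperp @ rng.standard_normal((k - r, m - r)) @ Vperp.T  # M' in B(U, V)

nuc = lambda X: np.linalg.svd(X, compute_uv=False).sum()    # nuclear norm
print(nuc(M + Mp) - (nuc(M) + nuc(Mp)))   # ~ 0: the nuclear norm decomposes

# Dual norms r*(v) = sup_{r(u) <= 1} <u, v>.
# l1 dual = l-infinity, attained at a signed coordinate vector.
v = rng.standard_normal(8)
i = int(np.argmax(np.abs(v)))
u_star = np.sign(v[i]) * np.eye(8)[i]                       # ||u_star||_1 = 1
print(u_star @ v - np.max(np.abs(v)))                       # 0

# Nuclear dual = operator norm, attained at the top singular pair u1 v1^T.
B = rng.standard_normal((5, 4))
W, s, Zt = np.linalg.svd(B)
A_star = np.outer(W[:, 0], Zt[0])                           # nuclear norm 1
print(np.trace(A_star.T @ B) - s[0])                        # ~ 0 up to rounding
```

The first check works because matrices with orthogonal row and column spaces have disjoint singular value decompositions, so the singular values of the sum are the union of the two sets.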
The standard way to ensure that a function is "not too flat" is via the notion of strong convexity: in particular, by requiring that there exist some constant γ > 0 such that L(θ* + Δ) − L(θ*) − ⟨∇L(θ*), Δ⟩ ≥ γ d²(Δ) for all Δ ∈ R^p. In the high-dimensional setting, where the number of parameters p may be much larger than the sample size, the strong convexity assumption need not be satisfied. As a simple example, consider the usual linear regression model y = Xθ* + w, where y ∈ R^n is the response vector, θ* ∈ R^p is the unknown parameter vector, X ∈ R^{n×p} is the design matrix, and w ∈ R^n is a noise vector with i.i.d. zero-mean elements. The least-squares loss is given by L(θ) = (1/2n)‖y − Xθ‖²₂, and has the Hessian H(θ) = X^T X / n. It is easy to check that the p × p matrix H(θ) will be rank-deficient whenever p > n, showing that the least-squares loss cannot be strongly convex (with respect to d(·) = ‖·‖2) when p > n.\nHerein lies the utility of Lemma 1: it guarantees that the error Δ̂ must lie within a restricted set, so that we only need the loss function to be strongly convex for a limited set of directions. More precisely, we have:\nDefinition 2. 
Given some subset C ⊆ R^p and error norm d(·), we say that the loss function L satisfies restricted strong convexity (RSC) with respect to d(·), with parameter γ(L) > 0 over C, if\nL(θ* + Δ) − L(θ*) − ⟨∇L(θ*), Δ⟩ ≥ γ(L) d²(Δ) for all Δ ∈ C.   (4)\nIn the statement of our results, we will be interested in loss functions that satisfy RSC over sets C(A, B, ε) that are indexed by a subspace pair (A, B) and a tolerance ε ≥ 0 as follows:\nC(A, B, ε) := {Δ ∈ R^p | r(π_B(Δ)) ≤ 3 r(π_{B⊥}(Δ)) + 4 r(π_{A⊥}(θ*)), d(Δ) ≥ ε}.   (5)\nIn the special case of least-squares regression with hard sparsity constraints, the RSC condition corresponds to a lower bound on the sparse eigenvalues of the Hessian matrix X^T X, and is essentially equivalent to a restricted eigenvalue condition introduced by Bickel et al. [1].\n3 Convergence rates\nWe are now ready to state a general result that provides bounds, and hence convergence rates, for the error d(θ̂ − θ*). Although it may appear somewhat abstract at first sight, we illustrate that this result has a number of concrete consequences for specific models. 
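The rank-deficiency observation above (with p > n, the least-squares Hessian X^T X/n is singular, so no global strong convexity is possible) and the way a restricted set of directions rescues it can be seen directly in a small simulation; the sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 40, 100, 3
X = rng.standard_normal((n, p))
H = X.T @ X / n                     # Hessian of the least-squares loss

eigs = np.linalg.eigvalsh(H)
print(np.linalg.matrix_rank(H))     # at most n = 40 < p = 100
print(eigs[0])                      # ~ 0: curvature vanishes in some directions

# Yet along one fixed sparse direction, curvature is positive w.h.p.
delta = np.zeros(p)
delta[:s] = rng.standard_normal(s)
delta /= np.linalg.norm(delta)
print(delta @ H @ delta)            # ~ 1 for this isotropic Gaussian design
```

The quadratic form delta @ H @ delta equals ‖Xδ‖²₂/n, which for a unit-norm sparse δ and i.i.d. standard Gaussian design concentrates around 1; this is the kind of restricted curvature that the RSC condition formalizes.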
In particular, we recover the best known results about estimation in s-sparse models with general designs [1, 7], as well as a number of new results, including convergence rates for estimation under ℓq-sparsity constraints, estimation in sparse generalized linear models, estimation of block-structured sparse matrices and estimation of low-rank matrices.\nIn addition to the regularization parameter λ_n and RSC constant γ(L) of the loss function, our general result involves a quantity that relates the error metric d to the regularizer r; in particular, for any set A ⊆ R^p, we define\nΨ(A) := sup_{u ∈ A, d(u) = 1} r(u),   (6)\nso that r(u) ≤ Ψ(A) d(u) for u ∈ A.\nTheorem 1 (Bounds for general models). For a given subspace collection V, suppose that the regularizer r is decomposable, and consider the regularized M-estimator (1) with λ_n ≥ 2 r*(∇L(θ*)). Then, for any pair of subspaces (A, B) ∈ V and tolerance ε ≥ 0 such that the loss function L satisfies restricted strong convexity over C(A, B, ε), we have\nd(θ̂ − θ*) ≤ max{ ε, (1/γ(L)) [ 2 Ψ(B⊥) λ_n + √(2 λ_n γ(L) r(π_{A⊥}(θ*))) ] }.   (7)\nThe proof is motivated by arguments used in past work on high-dimensional estimation (e.g., [9, 14]); we provide the details in the full-length version. The remainder of this paper is devoted to illustrations of the consequences of Theorem 1 for specific models. In all of these uses of Theorem 1, we choose the regularization parameter as small as possible, namely λ_n = 2 r*(∇L(θ*)). 
Although Theorem 1 allows for more general choices, in this conference version we focus exclusively on the case where d(·) is the ℓ2-norm. In addition, we choose the tolerance parameter ε = 0 for all of the results except for the weak-sparse models treated in Section 3.1.2.\n3.1 Bounds for linear regression\nConsider the standard linear regression model y = Xθ* + w, where θ* ∈ R^p is the regression vector, X ∈ R^{n×p} is the design matrix, and w ∈ R^n is a noise vector. Given the observations (y, X), our goal is to estimate the regression vector θ*. Without any structural constraints on θ*, we can apply Theorem 1 with the trivial subspace collection T = {(R^p, 0)} to establish a rate ‖θ̂ − θ*‖2 = O(σ √(p/n)) for ridge regression, which holds as long as X is full-rank (and hence requires n > p). Here we consider the sharper bounds that can be obtained when it is assumed that θ* is an s-sparse vector.\n3.1.1 Lasso estimates of hard sparse models\nMore precisely, let us consider estimating an s-sparse regression vector θ* by solving the Lasso program θ̂ ∈ arg min_{θ ∈ R^p} { (1/2n)‖y − Xθ‖²₂ + λ_n ‖θ‖1 }. The Lasso is a special case of our M-estimator (1) with r(θ) = ‖θ‖1 and L(θ) = (1/2n)‖y − Xθ‖²₂. Recall the definition of the s-sparse subspace collection S from Section 2.2. For this problem, let us set ε = 0, so that the restricted strong convexity set (5) reduces to C(A, B, 0) = {Δ ∈ R^p | ‖Δ_{S^c}‖1 ≤ 3 ‖Δ_S‖1}. 
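As a concrete instance, the Lasso program above can be solved by proximal gradient descent (ISTA), whose proximal step is coordinate-wise soft-thresholding, and one can then check empirically that the error Δ̂ = θ̂ − θ* falls in the cone ‖Δ_{S^c}‖₁ ≤ 3‖Δ_S‖₁ guaranteed by Lemma 1. A minimal sketch; the problem sizes, step size, iteration count, and the constant in front of λ are ad hoc choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, s, sigma = 200, 500, 5, 0.5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)

# Regularization of the order sigma * sqrt(log p / n), as in Corollary 1.
lam = 3.0 * sigma * np.sqrt(2.0 * np.log(p) / n)

def soft(z, t):
    """Soft-thresholding: the prox operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

step = 1.0 / np.linalg.eigvalsh(X.T @ X / n)[-1]   # 1 / Lipschitz const. of grad
theta = np.zeros(p)
for _ in range(1000):
    grad = X.T @ (X @ theta - y) / n               # grad of (1/2n)||y - X theta||^2
    theta = soft(theta - step * grad, step * lam)

delta = theta - theta_star
in_cone = np.abs(delta[s:]).sum() <= 3 * np.abs(delta[:s]).sum() + 1e-6
print(np.linalg.norm(delta), in_cone)              # small error; cone membership
```

With this choice of λ the condition λ_n ≥ 2‖X^T w/n‖∞ holds with high probability, which is exactly what Lemma 1 needs for the cone constraint on the error.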
Establishing restricted strong convexity for the least-squares loss is equivalent to ensuring the following bound on the design matrix:\n‖Xθ‖²₂ / n ≥ γ(L) ‖θ‖²₂ for all θ ∈ R^p such that ‖θ_{S^c}‖1 ≤ 3 ‖θ_S‖1.   (8)\nAs mentioned previously, this condition is essentially the same as the restricted eigenvalue condition developed by Bickel et al. [1]. In very recent work, Raskutti et al. [10] show that condition (8) holds with high probability for various random ensembles of Gaussian matrices with non-i.i.d. elements. In addition to the RSC condition, we assume that X has bounded column norms (specifically, ‖X_i‖2 ≤ 2√n for all i = 1, . . . , p), and that the noise vector w ∈ R^n has i.i.d. elements with zero mean and sub-Gaussian tails (i.e., there exists some constant σ > 0 such that P[|w_i| > t] ≤ exp(−t²/2σ²) for all t > 0). Under these conditions, we recover as a corollary of Theorem 1 the following known result [1, 7].\nCorollary 1. Suppose that the true vector θ* ∈ R^p is exactly s-sparse with support S, and that the design matrix X satisfies condition (8). If we solve the Lasso with λ²_n = 16σ² log p / n, then with probability at least 1 − c1 exp(−c2 n λ²_n), the solution satisfies\n‖θ̂ − θ*‖2 ≤ (8σ / γ(L)) √(s log p / n).   (9)\nProof. As noted previously, the ℓ1-regularizer is decomposable for the sparse subspace collection S, while condition (8) ensures that RSC holds for all sets C(A, B, 0) with (A, B) ∈ S. We must verify that the given choice of regularization satisfies λ_n ≥ 2 r*(∇L(θ*)). Note that r*(·) = ‖·‖∞, and moreover that ∇L(θ*) = X^T w/n. 
Under the column normalization condition on the design matrix X and the sub-Gaussian nature of the noise, it follows that ‖X^T w/n‖∞ ≤ √(4σ² log p / n) with high probability. The bound in Theorem 1 is thus applicable, and it remains to compute the form that its different terms take in this special case. For the ℓ1-regularizer and the ℓ2 error metric, we have Ψ(A_S) = √|S|. Given the hard sparsity assumption, r(θ*_{S^c}) = 0, so that Theorem 1 implies that ‖θ̂ − θ*‖2 ≤ (2/γ(L)) √s λ_n = (8σ/γ(L)) √(s log p / n), as claimed.\n3.1.2 Lasso estimates of weak sparse models\nWe now consider models that satisfy a weak sparsity assumption. More concretely, suppose that θ* lies in the ℓq-"ball" of radius R_q, namely the set B_q(R_q) := {θ ∈ R^p | Σ_{i=1}^p |θ_i|^q ≤ R_q} for some q ∈ (0, 1]. Our analysis exploits the fact that any θ* ∈ B_q(R_q) can be well approximated by an s-sparse vector (for an appropriately chosen sparsity index s). It is natural to approximate θ* by a vector supported on the set S = {i | |θ*_i| ≥ τ}. For any choice of threshold τ > 0, it can be shown that |S| ≤ R_q τ^{−q}, and it is optimal to choose τ equal to the same regularization parameter λ_n from Corollary 1 (see the full-length version for details). Accordingly, we consider the s-sparse subspace collection S with subsets of size s = R_q λ_n^{−q}. We assume that the noise vector w ∈ R^n is as defined above and that the columns are normalized as in the previous section. We also assume that the matrix X satisfies the condition\n‖Xv‖2 / √n ≥ κ1 ‖v‖2 − κ2 √(log p / n) ‖v‖1 for constants κ1, κ2 > 0.   (10)\nRaskutti et al. [10] show that this property holds with high probability for suitable Gaussian random matrices. 
Under this condition, it can be verified that RSC holds with γ(L) = κ1/2 over the set C(A(S), B(S), ε_n), where ε_n = (4/κ1 + √(4/κ1)) R_q^{1/2} (16σ² log p / n)^{1/2 − q/4}. The following result, which we obtain by applying Theorem 1 in this setting, is new to the best of our knowledge:\nCorollary 2. Suppose that the true vector θ* ∈ B_q(R_q), and the design matrix X satisfies condition (10). If we solve the Lasso with λ²_n = 16σ² log p / n, then with probability 1 − c1 exp(−c2 n λ²_n), the solution satisfies\n‖θ̂ − θ*‖2 ≤ R_q^{1/2} (16σ² log p / n)^{1/2 − q/4} [ 2/γ(L) + √2/√γ(L) ].   (11)\nWe note that both of the rates, for hard sparsity in Corollary 1 and weak sparsity in Corollary 2, are known to be optimal¹ in a minimax sense [10].\n3.2 Bounds for generalized linear models\nOur next example is a generalized linear model with canonical link function, where the distribution of the response y ∈ Y based on a predictor x ∈ R^p is given by p(y | x; θ*) = exp( y ⟨θ*, x⟩ − a(⟨θ*, x⟩) + d(y) ), for some fixed functions a : R → R and d : Y → R, where ‖x‖∞ ≤ A and |y| ≤ B. We consider estimating θ* from observations {(x_i, y_i)}_{i=1}^n by the ℓ1-regularized maximum likelihood estimator θ̂ ∈ arg min_{θ ∈ R^p} { −⟨θ, (1/n) Σ_{i=1}^n y_i x_i⟩ + (1/n) Σ_{i=1}^n a(⟨θ, x_i⟩) + λ_n ‖θ‖1 }. This is a special case of our M-estimator (1) with L(θ) = −⟨θ, (1/n) Σ_{i=1}^n y_i x_i⟩ + (1/n) Σ_{i=1}^n a(⟨θ, x_i⟩), and r(θ) = ‖θ‖1. Let X ∈ R^{n×p} denote the matrix with ith row x_i. For the analysis, we again use the s-sparse subspace collection S and set ε = 0. 
With these choices, it can be verified that an appropriate version of RSC will hold if the second derivative a′′ is bounded away from zero, and the design matrix X satisfies a version of the condition (8).\nCorollary 3. Suppose that the true vector θ* ∈ R^p is exactly s-sparse with support S, and the model (a, X) satisfies an RSC condition. Suppose that we compute the ℓ1-regularized MLE with λ²_n = 32A²B² log p / n. Then with probability 1 − c1 exp(−c2 n λ²_n), the solution satisfies\n‖θ̂ − θ*‖2 ≤ (16AB / γ(L)) √(s log p / n).   (12)\nWe defer the proof to the full-length version due to space constraints.\n3.3 Bounds for sparse matrices\nIn this section, we consider some extensions of our results to estimation of regression matrices. Various authors have proposed extensions of the Lasso based on regularizers that have more structure than the ℓ1 norm (e.g., [17, 20, 23, 5]). Such regularizers allow one to impose various types of block-sparsity constraints, in which groups of parameters are assumed to be active (or inactive) simultaneously. We assume that the observation model takes the form Y = XΘ* + W, where Θ* ∈ R^{k×m} is the unknown fixed set of parameters, X ∈ R^{n×k} is the design matrix, and W ∈ R^{n×m} is the noise matrix. As a loss function, we use the rescaled squared Frobenius norm L(Θ) = (1/n) |||Y − XΘ|||²_F, and as a regularizer, we use the ℓ1,q-matrix norm for some q ≥ 1, which takes the form |||Θ|||_{1,q} = Σ_{i=1}^k ‖(Θ_{i1}, . . . , Θ_{im})‖_q. We refer to the resulting estimator as the q-group Lasso. We define the quantity η(m; q) = 1 if q ∈ (1, 2] and η(m; q) = m^{1/2 − 1/q} if q > 2. We then set the regularization parameter as follows:\nλ_n = (4σ/√n) [ η(m; q) √(log k) + C_q m^{1 − 1/q} ] if q > 1, and λ_n = 4σ √(log(km) / n) for q = 1.\nCorollary 4. 
Suppose that the true parameter matrix Θ* has non-zero rows only for indices i ∈ S ⊆ {1, . . . , k}, where |S| = s, and that the design matrix X ∈ R^{n×k} satisfies condition (8). Then with probability at least 1 − c1 exp(−c2 n λ²_n), the q-group Lasso solution satisfies\n|||Θ̂ − Θ*|||_F ≤ (2/γ(L)) Ψ(S) λ_n.   (13)\n¹Raskutti et al. [10] show that the rate (11) is achievable by solving the computationally intractable problem of minimizing L(θ) over the ℓq-ball.\nThe proof is provided in the full-length version; here we consider three special cases of the above result. A simple argument shows that Ψ(S) = √s if q ≥ 2, and Ψ(S) = m^{1/q − 1/2} √s if q ∈ [1, 2]. For q = 1, solving the group Lasso is identical to solving a Lasso problem with sparsity sm and ambient dimension km, and the resulting upper bound (8σ/γ(L)) √(s m log(km) / n) reflects this fact (compare to Corollary 1). For the case q = 2, Corollary 4 yields the upper bound (8σ/γ(L)) ( √(s log k / n) + √(sm/n) ), which also has a natural interpretation: the term s log k / n captures the difficulty of finding the s non-zero rows out of the total k, whereas the term sm/n captures the difficulty of estimating the sm free parameters in the matrix (once the non-zero rows have been determined). We note that recent work by Lounici et al. [4] established the bound O( (σ/γ(L)) √( sm/n + c√m s log k / n ) ), which is equivalent apart from a term √m. Finally, for q = ∞, we obtain the upper bound (8σ/γ(L)) ( √(s log k / n) + m √(s/n) ).\n3.4 Bounds for estimating low rank matrices\nFinally, we consider the implications of our main result for the problem of estimating low-rank matrices. 
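Returning briefly to the q = 2 case of the block regularizer in Section 3.3: the proximal operator of t·|||·|||_{1,2} acts row-wise, shrinking each row's ℓ2 norm by t and zeroing rows that fall below the threshold exactly, which is precisely how entire rows become inactive. A small sketch (threshold and matrix sizes are arbitrary):

```python
import numpy as np

def row_soft(Theta, t):
    """Prox of t * |||.|||_{1,2}: shrink each row's l2 norm by t; small rows -> 0."""
    norms = np.linalg.norm(Theta, axis=1, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return scale * Theta

rng = np.random.default_rng(4)
Theta = rng.standard_normal((6, 4))
Theta[3:] *= 0.05                       # rows 3-5 have small norm
out = row_soft(Theta, t=0.5)
print(np.linalg.norm(out, axis=1))      # rows below the threshold are exactly zero
```

This row-wise operator plugs into the same proximal-gradient loop used for the Lasso, replacing coordinate-wise soft-thresholding.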
This structural assumption is a natural variant of sparsity, and has been studied by various\nauthors (see the paper [13] and references therein). To illustrate our main theorem in this con-\ntext, let us consider the following instance of low-rank matrix learning. Given a low-rank matrix\n\u0398\u2217 \u2208 Rk\u00d7m, suppose that we are given n noisy observations of the form Yi = ,,Xi, \u0398\u2217-- + Wi,\nwhere Wi \u223c N(0, 1) and ,,A, B-- := trace(AT B). Such an observation model arises in sys-\ntem identi\ufb01cation settings in control theory [13]. The following regularized M-estimator can be\nconsidered in order to estimate the desired low-rank matrix \u0398\u2217:\n\nn(.\nn + m\u2019 s\n\n\u03b3(L)&+ s log k\n\nn\n\n1\n2n\n\nn5i=1\n\nF\n\nmin\n\n\u0398\u2208Rm\u00d7p\n\n1\nn\n\nn5i=1\n\n|,,Xi, \u2206--|2 \u2265 \u03b3(L)|||\u2206|||2\n\n|Yi \u2212 ,,Xi, \u0398)--|2 + |||\u0398|||1,\n\nfor all \u2206 such that |||\u03c0B(\u2206)|||1 \u2264 3|||\u03c0B\u22a5(\u2206)|||1,\n\n(14)\nwhere the regularizer, |||\u0398|||1, is the nuclear norm, or the sum of the singular values of \u0398. Recall\nthe rank-r collection V de\ufb01ned for low-rank matrices in Section 2.2. Let \u0398\u2217 = U\u03a3W T be the\nsingular value decomposition (SVD) of \u0398\u2217, so that U \u2208 Rk\u00d7r and W \u2208 Rm\u00d7r are orthogonal, and\n\u03a3 \u2208 Rr\u00d7r is a diagonal matrix. If we let A = A(U, W ) and B = B(U, W ), then, \u03c0B(\u0398\u2217) = 0, so\nthat by Lemma 1 we have that |||\u03c0B(\u2206)|||1 \u2264 3|||\u03c0B\u22a5(\u2206)|||1. Thus, for restricted strong convexity to\nhold it can be shown that the design matrices Xi must satisfy\n(15)\nand satisfy the appropriate analog of the column-normalization condition. As with analogous con-\nditions for sparse linear regression, these conditions hold w.h.p. for various non-i.i.d. Gaussian\nrandom matrices.2\nCorollary 5. 
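As a concrete illustration of the estimator (14), one standard computational approach (our own sketch, not the authors' implementation) is proximal gradient descent, where the proximal operator of the nuclear norm is singular-value soft-thresholding. The function name, step-size rule, and iteration count below are illustrative assumptions.

```python
import numpy as np

def nuclear_norm_estimator(Xs, y, lam, n_iter=300):
    """Proximal-gradient sketch of the estimator (14):
    minimize (1/(2n)) sum_i (y_i - <<X_i, Theta>>)^2 + lam * |||Theta|||_1,
    where |||.|||_1 is the nuclear norm and Xs has shape (n, k, m)."""
    n, k, m = Xs.shape
    Theta = np.zeros((k, m))
    # Step size = 1 / Lipschitz constant of the gradient of the smooth part,
    # computed from the operator norm of the vectorized design.
    L = np.linalg.norm(Xs.reshape(n, -1), 2) ** 2 / n
    step = 1.0 / L
    for _ in range(n_iter):
        # Residuals r_i = <<X_i, Theta>> - y_i, and gradient (1/n) sum_i r_i X_i.
        resid = np.einsum('ikm,km->i', Xs, Theta) - y
        grad = np.einsum('i,ikm->km', resid, Xs) / n
        # Singular-value soft-thresholding: the prox of the nuclear norm.
        U, s, Vt = np.linalg.svd(Theta - step * grad, full_matrices=False)
        Theta = (U * np.maximum(s - step * lam, 0.0)) @ Vt
    return Theta
```

Soft-thresholding the singular values drives the smallest ones to zero, so the iterates are themselves low-rank, mirroring the structural assumption of Corollary 5.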
Corollary 5. Suppose that the true matrix Θ* has rank r ≪ min(k, m), and that the design matrices {X_i} satisfy condition (15). If we solve the regularized M-estimator (14) with λ_n = 16(√k + √m)/√n, then with probability at least 1 − c1 exp(−c2(k + m)), we have

    |||Θ̂ − Θ*|||_F ≤ (16/γ(L)) [√(rk/n) + √(rm/n)].    (16)

Proof. Note that if rank(Θ*) = r, then |||Θ*|||₁ ≤ √r |||Θ*|||_F, so that Ψ(B⊥) = √(2r), since the subspace B(U, W)⊥ consists of matrices with rank at most 2r. All that remains is to show that λ_n ≥ 2R*(∇L(Θ*)). Standard analysis gives that the dual norm of ||| · |||₁ is the operator norm ||| · |||₂. Applying this observation, we may construct a bound on the operator norm of ∇L(Θ*) = (1/n) Σ_{i=1}^n X_i W_i. Given unit vectors u ∈ R^k and v ∈ R^m, we have (1/n) Σ_{i=1}^n |⟨⟨X_i, vuᵀ⟩⟩|² ≤ |||vuᵀ|||²_F = 1. Therefore, (1/n) Σ_{i=1}^n (uᵀ X_i v) W_i ∼ N(0, 1/n). A standard argument shows that the supremum over all unit vectors u and v is bounded above by 8(√k + √m)/√n with probability at least 1 − c1 exp(−c2(k + m)), verifying that λ_n ≥ 2R*(∇L(Θ*)) with high probability.

²This claim involves some use of concentration of measure and Gaussian comparison inequalities analogous to arguments in Raskutti et al. [10]; see the full-length version for details.

References
[1] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Submitted to Annals of Statistics, 2008.
[2] E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.
[3] S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Computing, 20(1):33–61, 1998.
[4] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. Arxiv, 2009.
[5] L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70:53–71, 2008.
[6] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34:1436–1462, 2006.
[7] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2009.
[8] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Union support recovery in high-dimensional multivariate regression. Technical report, Department of Statistics, UC Berkeley, August 2008.
[9] S. Portnoy. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large: I. Consistency. Annals of Statistics, 12(4):1296–1309, 1984.
[10] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. Technical Report arXiv:0910.2042, Department of Statistics, UC Berkeley, 2009.
[11] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2008. To appear.
[12] P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Technical Report 767, Department of Statistics, UC Berkeley, September 2008.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. Allerton Conference, 2007.
[14] A. J. Rothman, P. J. Bickel, E. Levina, and J. Zhu. Sparse permutation invariant covariance estimation. Electron. J. Statist., 2:494–515, 2008.
[15] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[16] J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Info. Theory, 52(3):1030–1051, March 2006.
[17] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
[18] S. van de Geer. High-dimensional generalized linear models and the lasso. Annals of Statistics, 36(2):614–645, 2008.
[19] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans. Information Theory, 55:2183–2202, May 2009.
[20] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
[21] C. Zhang and J. Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567–1594, 2008.
[22] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2567, 2006.
[23] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.