{"title": "Robust Lasso with missing and grossly corrupted observations", "book": "Advances in Neural Information Processing Systems", "page_first": 1881, "page_last": 1889, "abstract": "This paper studies the problem of accurately recovering a sparse vector $\\beta^{\\star}$ from highly corrupted linear measurements $y = X \\beta^{\\star} + e^{\\star} + w$ where $e^{\\star}$ is a sparse error vector whose nonzero entries may be unbounded and $w$ is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both $\\beta^{\\star}$ and $e^{\\star}$. Our first result shows that the extended Lasso can faithfully recover both the regression and the corruption vectors. Our analysis is relied on a notion of extended restricted eigenvalue for the design matrix $X$. Our second set of results applies to a general class of Gaussian design matrix $X$ with i.i.d rows $\\oper N(0, \\Sigma)$, for which we provide a surprising phenomenon: the extended Lasso can recover exact signed supports of both $\\beta^{\\star}$ and $e^{\\star}$ from only $\\Omega(k \\log p \\log n)$ observations, even the fraction of corruption is arbitrarily close to one. Our analysis also shows that this amount of observations required to achieve exact signed support is optimal.", "full_text": "Robust Lasso with missing and grossly corrupted\n\nobservations\n\nNam H. Nguyen\n\nJohns Hopkins University\n\nnam@jhu.edu\n\nNasser M. Nasrabadi\nU.S. Army Research Lab\n\nnasser.m.nasrabadi.civ@mail.mil\n\nTrac D. 
Tran\n\nJohns Hopkins University\n\ntrac@jhu.edu\n\nAbstract\n\nThis paper studies the problem of accurately recovering a sparse vector \u03b2(cid:63) from\nhighly corrupted linear measurements y = X\u03b2(cid:63) + e(cid:63) + w where e(cid:63) is a sparse\nerror vector whose nonzero entries may be unbounded and w is a bounded noise.\nWe propose a so-called extended Lasso optimization which takes into consider-\nation sparse prior information of both \u03b2(cid:63) and e(cid:63). Our \ufb01rst result shows that the\nextended Lasso can faithfully recover both the regression and the corruption vec-\ntors. Our analysis is relied on a notion of extended restricted eigenvalue for the\ndesign matrix X. Our second set of results applies to a general class of Gaus-\nsian design matrix X with i.i.d rows N (0, \u03a3), for which we provide a surprising\nphenomenon: the extended Lasso can recover exact signed supports of both \u03b2(cid:63)\nand e(cid:63) from only \u2126(k log p log n) observations, even the fraction of corruption is\narbitrarily close to one. Our analysis also shows that this amount of observations\nrequired to achieve exact signed support is optimal.\n\n1\n\nIntroduction\n\nOne of the central problems in statistics is the linear regression in which the goal is to accurately\nestimate a regression vector \u03b2(cid:63) \u2208 Rp from the noisy observations\n\ny = X\u03b2(cid:63) + w,\n\n(1)\nwhere X \u2208 Rn\u00d7p is the measurement or design matrix, and w \u2208 Rn is the stochastic observation\nvector noise. A particular situation recently attracted much attention from research community\nconcerns with the model in which the number of regression variables p is larger than the number\nof observations n (p \u2265 n). In such circumstances, without imposing some additional assumptions\nfor this model, it is well known that the problem is ill-posed, and thus the linear regression is not\nconsistent. 
Accordingly, there have been various lines of work on high-dimensional inference based on imposing different types of structural constraints, such as sparsity and group sparsity [15] [5] [21]. Among them, the most popular model focuses on a sparsity assumption on the regression vector. To estimate β, a standard method, namely the Lasso [15], uses the ℓ1-penalty as a surrogate function to enforce the sparsity constraint:\n\nmin_β (1/2)‖y − Xβ‖2² + λ‖β‖1, (2)\n\nwhere λ is a positive regularization parameter and the ℓ1-norm ‖β‖1 is defined by ‖β‖1 = ∑_{i=1}^{p} |βi|. During the past few years, there have been numerous studies of ℓ1-regularization for sparse regression models [23] [11] [10] [17] [4] [2] [22]. These works are mainly characterized by the type of loss function considered. For instance, some authors [4] seek a regression estimate β̂ that delivers small prediction error, while other authors [2] [11] [22] seek to produce a regressor with minimal parameter estimation error, measured by the ℓ2-norm of (β̂ − β⋆). Another line of work [23] [17] considers variable selection, in which the goal is to obtain an estimate that correctly identifies the support of the true regression vector. To achieve low prediction or parameter estimation loss, it is now well known that it is both sufficient and necessary to impose certain lower bounds on the smallest singular values of the design matrix [10] [2], while a notion of small mutual incoherence of the design matrix [4] [23] [17] is required to achieve accurate variable selection.\nWe notice that all the previous work relies on the assumption that the observation noise has bounded energy. 
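As a concrete aside (a minimal sketch of ours, not part of the original paper), the Lasso program (2) can be solved by proximal gradient descent, also known as ISTA: each iteration takes a gradient step on the quadratic loss and then applies entrywise soft-thresholding, the proximal operator of the scaled ℓ1-norm. The step size and iteration count below are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Entrywise soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=3000):
    # Minimize 0.5 * ||y - X b||_2^2 + lam * ||b||_1 by proximal gradient (ISTA).
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)         # gradient of the quadratic loss
        b = soft_threshold(b - grad / L, lam / L)
    return b
```

For a well-conditioned design with roughly unit-norm columns and λ on the order of σ√(log p), the iterates settle on a sparse estimate close to β⋆.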
Without this assumption, it is very likely that the estimated regressor is either unreliable or unable to identify the correct support. With this observation in mind, in this paper we extend the linear model (1) by considering noise with unbounded energy. It is clear that if all the entries of y are corrupted by large errors, then it is impossible to faithfully recover the regression vector β⋆. However, in many practical applications such as face and acoustic recognition, only a portion of the observation vector is contaminated by gross error. Formally, we have the mathematical model\n\ny = Xβ⋆ + e⋆ + w, (3)\n\nwhere e⋆ ∈ Rn is a sparse error whose nonzero entries have unknown locations and arbitrarily large magnitudes, and w is another noise vector with bounded entries. In this paper, we assume that w has a multivariate Gaussian N(0, σ2In×n) distribution. This model also includes as a particular case the missing data problem, in which y is not fully observed: some of its entries are missing. This problem is particularly important in computer vision and biology applications. If some entries of y are missing, the nonzero entries of e⋆ whose locations are associated with the missing entries of the observation vector y have the same values as the corresponding entries of y but with opposite signs.\nThe problem of recovering data under gross error has gained increasing attention recently, with many interesting practical applications [18] [6] [7] as well as theoretical investigations [9] [13] [8]. A related line of research on recovering data from grossly corrupted measurements has also been studied in the context of robust principal component analysis (RPCA) [3] [20] [1]. Let us consider some examples to illustrate:\n\n• Face recognition. The model (3) was originally proposed by Wright et al. [19] in the context of face recognition. 
In this problem, a face test sample y is assumed to be\nrepresented as a linear combination of training faces in the dictionary X, y = X\u03b2 where \u03b2\nis the coef\ufb01cient vector used for classi\ufb01cation. However, it is often the case that the face is\noccluded by unwanted objects such as glasses, hats etc. These occlusions, which occupy a\nportion of the test face, can be considered as the sparse error e(cid:63) in the model (3).\n\n\u2022 Subspace clustering. One of the important problem on high dimensional analysis is to\ncluster the data points into multiple subspaces. A recent work of Elhamifar and Vidal [6]\nshowed that this problem can be solved by expressing each data point as a sparse linear\ncombination of all other data points. Coef\ufb01cient vectors recovered from solving the Lasso\nproblems are then employed for clustering. If the data points are represented as a matrix X,\nthen we wish to \ufb01nd a sparse coef\ufb01cient matrix B such that X = XB and diag(B) = 0.\nWhen the data is missing or contaminated with outliers, [6] formulates the problem as\nX = XB + E and minimize a sum of two (cid:96)1-norms with respect to both B and E.\n\u2022 Sensor network. In this model, sensors collect measurements of a signal \u03b2(cid:63) independently\nby simply projecting \u03b2(cid:63) onto row vectors of a sensing matrix X, yi = (cid:104)Xi, \u03b2(cid:63)(cid:105). The\nmeasurements yi are then sent to the center hub for analysis. However, it is highly likely\nthat some sensors might fail to send the measurements correctly and sometimes report\ntotally irrelevant measurements. Therefore, it is more accurate to employ the observation\nmodel (3) than model (1).\n\nIt is worth noticing that in the aforementioned applications, e(cid:63) plays the role as the sparse (unde-\nsired) error. However, in many other applications, e(cid:63) can contain meaningful information, and thus\nnecessary to be recovered. 
An example of this kind is signal separation, in which β⋆ and e⋆ are two distinct signal components (video or audio). Furthermore, in applications such as classification and clustering, the assumption that the test sample y is a linear combination of a few training samples in the dictionary (design matrix) X might be violated. The sparse component e⋆ can thus be seen as compensation for linear regression model mismatch.\nGiven the observation model (3) and the sparsity assumptions on both the regression vector β⋆ and the error e⋆, we propose the following convex minimization to estimate the unknown parameter β⋆ as well as the error e⋆:\n\nmin_{β,e} (1/2)‖y − Xβ − e‖2² + λβ‖β‖1 + λe‖e‖1, (4)\n\nwhere λβ and λe are positive regularization parameters. This optimization, which we call the extended Lasso, can be seen as a generalization of the Lasso program. Indeed, forcing e = 0 (by taking λe sufficiently large) reduces (4) to the standard Lasso. The additional regularization associated with e encourages sparsity of the error, where the parameter λe controls the sparsity level. In this paper, we focus on the following questions: what are necessary and sufficient conditions on the ambient dimension p, the number of observations n, the sparsity index k of the regression vector β⋆ and the fraction of corruption such that (i) the extended Lasso is able (or unable) to recover the exact support sets of both β⋆ and e⋆? (ii) the extended Lasso is able to recover β⋆ and e⋆ with small prediction error and parameter error? We are particularly interested in understanding the asymptotic situation where the fraction of error is arbitrarily close to 100%.\nPrevious work. 
The problem of recovering the regression vector β⋆ and the error e⋆ was originally proposed and analyzed by Wright and Ma [18]. In the absence of the stochastic noise w in the observation model (3), the authors proposed to estimate (β⋆, e⋆) by solving the linear program\n\nmin_{β,e} ‖β‖1 + ‖e‖1 s.t. y = Xβ + e. (5)\n\nThe result of [18] is asymptotic in nature. They showed that for a class of Gaussian design matrices with i.i.d. entries, the optimization (5) can recover (β⋆, e⋆) precisely with high probability even when the fraction of corruption is arbitrarily close to one. However, the result holds under rather stringent conditions. In particular, they require that the number of observations n grow proportionally with the ambient dimension p, and that the sparsity index k be a very small fraction of n. These conditions are of course far from the optimal bounds in the compressed sensing (CS) and statistics literature (recall that k ≤ O(n/ log p) is sufficient in conventional analysis [17]).\nAnother line of work has also focused on the optimization (5). In the papers of Laska et al. [7] and Li et al. [9], the authors establish that for a Gaussian design matrix X, if n ≥ C(k + s) log p, where s is the sparsity level of e⋆, then the recovery is exact. This follows from the fact that the combined matrix [X, I] obeys the restricted isometry property, a well-known property used to guarantee exact recovery of sparse vectors via ℓ1-minimization. These results, however, do not allow the fraction of corruption to be close to one.\nAmong the previous work, the results most closely related to the current paper are recent ones by Li [8] and Nguyen et al. [13], in which a positive regularization parameter λ is employed to control the sparsity of e⋆. 
Using different methods, both sets of authors show that if λ is deterministically selected to be 1/√(log p) and X is a sub-orthogonal matrix, then the solution of the following optimization is exact even when a constant fraction of the observations is corrupted. Moreover, [8] establishes a similar result for a Gaussian design matrix in which the number of observations is only of order k log p - an amount that is known to be optimal in CS and statistics.\n\nmin_{β,e} ‖β‖1 + λ‖e‖1 s.t. y = Xβ + e. (6)\n\nOur contribution. This paper considers a general setting in which the observations are contaminated by both sparse and dense errors. We allow the corruptions to grow linearly with the number of observations and to have arbitrarily large magnitudes. We establish a general scaling of the quadruplet (n, p, k, s) such that the extended Lasso stably recovers both the regression and corruption vectors. Of particular interest to us are the following questions:\n\n(a) First, under what scalings of (n, p, k, s) does the extended Lasso obtain the unique solution with small estimation error?\n\n(b) Second, under what scalings of (n, p, k) does the extended Lasso obtain exact signed support recovery even when almost all the observations are corrupted?\n\n(c) Third, under what scalings of (n, p, k, s) does no solution of the extended Lasso specify the correct signed support?\n\nTo answer the first question, we introduce a notion of extended restricted eigenvalue for the matrix [X, I], where I is an identity matrix. We show that this property is satisfied by a general class of random Gaussian design matrices. The answers to the last two questions require stricter conditions on the design matrix. 
In particular, for a random Gaussian design matrix with i.i.d. rows N(0, Σ), we rely on two standard assumptions: invertibility and mutual incoherence.\nIf we denote Z = [X, I], where I is an identity matrix, and β = [β⋆T, e⋆T]T, then the observation vector y is reformulated as y = Zβ + w, which is the same as the standard Lasso model. However, previous results [2] [17] for random Gaussian design matrices do not apply to this setting, since Z no longer behaves like a Gaussian matrix. To establish our theoretical analysis, we need to study further the interaction between the Gaussian and identity matrices. By exploiting the fact that the matrix Z consists of two components, one of which has special structure, our analysis reveals an interesting phenomenon: the extended Lasso can accurately recover both the regressor β⋆ and the corruption e⋆ even when the fraction of corruption is up to 100%. We measure the recoverability of these variables under two criteria: parameter accuracy and feature selection accuracy. Moreover, our analysis extends to the situation in which the identity matrix is replaced by a tight frame D, as well as to other models such as the group Lasso or the matrix Lasso with sparse error.\nNotation. We summarize here some standard notation used throughout the paper. We reserve T and S for the sparse supports of β⋆ and e⋆, respectively. Given a design matrix X ∈ Rn×p and subsets S and T, we use XST to denote the |S| × |T| submatrix obtained by extracting the rows indexed by S and the columns indexed by T. We use the notation C1, C2, c1, c2, etc., to refer to positive constants whose value may change from line to line. 
Given two functions f and g, the notation f(n) = O(g(n)) means that there exists a constant c < +∞ such that f(n) ≤ cg(n); the notation f(n) = Ω(g(n)) means that f(n) ≥ cg(n); and the notation f(n) = Θ(g(n)) means that f(n) = O(g(n)) and f(n) = Ω(g(n)). The symbol f(n) = o(g(n)) means that f(n)/g(n) → 0.\n\n2 Main results\n\nIn this section, we provide precise statements of the main results of this paper. In the first subsection, we address parameter estimation and provide a deterministic result based on the notion of extended restricted eigenvalue. We further show that random Gaussian design matrices satisfy this property with high probability. The next subsection considers feature estimation. We establish conditions on the design matrix such that the solution of the extended Lasso has the exact signed supports.\n\n2.1 Parameter estimation\n\nAs in the conventional Lasso, to obtain a low parameter estimation bound it is necessary to impose conditions on the design matrix X. In this paper, we introduce the notion of an extended restricted eigenvalue (extended RE) condition. Let C be a restricted set; we say that the matrix X satisfies the extended RE assumption over the set C if there exists some κl > 0 such that\n\n‖Xh + f‖2 ≥ κl(‖h‖2 + ‖f‖2) for all (h, f) ∈ C, (7)\n\nwhere the restricted set C of interest is defined, with λn := λe/λβ, as follows:\n\nC := {(h, f) ∈ Rp × Rn | ‖hT c‖1 + λn‖fSc‖1 ≤ 3‖hT‖1 + 3λn‖fS‖1}. (8)\n\nThis assumption is a natural extension of the restricted eigenvalue condition and restricted strong convexity considered in [2], [14] and [12]. 
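Since (4) is the Lasso objective in the stacked variable (β, e) with design Z = [X, I] and a block-weighted ℓ1-penalty, the same proximal gradient machinery as the standard Lasso applies, with a separate threshold for each block. The sketch below is our own illustration, not the authors' implementation; the step size and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Entrywise soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def extended_lasso(X, y, lam_beta, lam_e, n_iter=3000):
    # Minimize 0.5*||y - X b - e||_2^2 + lam_beta*||b||_1 + lam_e*||e||_1
    # by proximal gradient on the stacked variable (b, e), i.e. Z = [X, I].
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 + 1.0  # ||Z||_2^2 = ||X||_2^2 + 1
    b, e = np.zeros(p), np.zeros(n)
    for _ in range(n_iter):
        r = X @ b + e - y                # shared residual
        b = soft_threshold(b - (X.T @ r) / L, lam_beta / L)
        e = soft_threshold(e - r / L, lam_e / L)
    return b, e
```

With λβ and λe above the respective noise thresholds, gross corruptions are absorbed by e while b retains the sparse regression structure.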
In the absent of a vector f in the equation (7) and in the\nset C, this condition returns to the restricted eigenvalue de\ufb01ned in [2]. As explained at more length\nin [2] and [16], restricted eigenvalue is among the weakest assumption on the design matrix such\nthat the solution of the Lasso is consistent.\nWith this assumption at hand, we now state the \ufb01rst theorem\n\n4\n\n\fTheorem 1. Consider the optimal solution ((cid:98)\u03b2,(cid:98)e) to the optimization problem (4) with regularization\n\nparameters chosen as\n\n\u03bb\u03b2 \u2265 2\n\u03b3\n\n(9)\nwhere \u03b3 \u2208 (0, 1]. Assuming that the design matrix X obeys the extended RE, then the error set\n\n= \u03b3\n\n,\n\n(h, f ) = ((cid:98)\u03b2 \u2212 \u03b2(cid:63),(cid:98)e \u2212 e(cid:63)) is bounded by\n\n\u03bbe\n\u03bb\u03b2\n\n(cid:107)X\u2217w(cid:107)\u221e and \u03bbn :=\n(cid:16)\n\n(cid:107)h(cid:107)2 + (cid:107)f(cid:107)2 \u2264 3\u03ba\u22122\n\n\u03bb\u03b2\n\nl\n\n\u221a\n\nk + \u03bbe\n\ns\n\n.\n\n(cid:107)w(cid:107)\u221e\n(cid:107)X\u2217w(cid:107)\u221e\n(cid:17)\n\n\u221a\n\n(10)\n\n\u03b3\n\n(cid:112)\u03c32 log p and \u03bbe \u2265 4(cid:112)\u03c32 log n.\n\nthat with high probability, (cid:107)X\u2217w(cid:107)\u221e \u2264 2(cid:112)\u03c32 log p and (cid:107)w(cid:107)\u221e \u2264 2(cid:112)\u03c32 log n. Thus, it is suf\ufb01cient\n\nThere are several interesting observations from this theorem\n1) The error bound naturally split into two components related to the sparsity indices of \u03b2(cid:63) and e(cid:63).\nIn addition, the error bound contains three quantity: the sparsity indices, regularization parameters\nand the extended RE constant. 
If the terms related to the corruption e(cid:63) are omitted, then we obtain\nsimilar parameter estimation bound as the standard Lasso [2] [12].\n2) The choice of regularization parameters \u03bb\u03b2 and \u03bbe can make explicitly: assuming w is a Gaussian\nrandom vector whose entries are N (0, \u03c32) and the design matrix has unit-normed columns, it is clear\nto select \u03bb\u03b2 \u2265 4\n3) At the \ufb01rst glance, the parameter \u03b3 does not seem to have any meaningful interpretation and the\n\u03b3 = 1 seems to be the best selection due to the smallest estimation error it can produce. However,\nthis parameter actually control the sparsity level of the regression vector with respect to the fraction\nof corruption. This relation is made via the restricted set C.\nIn the following lemma, we show that the extended RE condition actually exists for a large class of\nrandom Gaussian design matrix whose rows are i.i.d zero mean with covariance \u03a3. Before stating the\nlemma, let us de\ufb01ne some quantities operating on the covariance matrix \u03a3: Cmin := \u03bbmin(\u03a3) is the\nsmallest eigenvalue of \u03a3, Cmax := \u03bbmax(\u03a3) is the biggest eigenvalue of \u03a3 and \u03be(\u03a3) := maxi \u03a3ii\nis the maximal entry on the diagonal of the matrix \u03a3.\nLemma 1. Consider the random Gaussian design matrix whose rows are i.i.d N (0, \u03a3) and assume\nn2Cmax\u03be(\u03a3) = \u0398(1). 
Select\n\n(cid:115)\n\n\u03b3(cid:112)\u03be(\u03a3)n\n\nlog n\nlog p\n\n,\n\n\u03bbn :=\n\n(11)\nthen with probability greater than 1 \u2212 c1 exp(\u2212c2n), the matrix X satis\ufb01es the extended RE with\n\u221a\nparameter \u03bal = 1\nfor some\n4\nsmall constants C1, C2.\n\n, provided that n \u2265 C \u03be(\u03a3)\n\nk log p and s \u2264 min\n\n\u03b32 log n , C2n\n\n(cid:110)\n\n(cid:111)\n\nCmin\n\nC1\n\nn\n\n2\n\ndent with the Gaussian stochastic noise w, we can easily show that (cid:107)X\u2217w(cid:107)\u221e \u2264 2(cid:112)\u03be(\u03a3)n\u03b42 log p\n\nWe would like to make some remarks:\n1) The choice of parameter \u03bbn is nothing special here. When design matrix is Gaussian and indepen-\nwith probability at least 1\u2212 2 exp(\u2212 log p). Therefore, the selection of \u03bbn follows from Theorem 1.\n2) The proof of this lemma, shown in the Appendix, boils down to control two terms\n\n\u2022 Restricted eigenvalue with X.\n2 + (cid:107)f(cid:107)2\n\n(cid:107)Xh(cid:107)2\n\n2 \u2265 \u03bar((cid:107)h(cid:107)2\n\n2 + (cid:107)f(cid:107)2\n2)\n\nfor all\n\n(h, f ) \u2208 C.\n\n\u2022 Mutual incoherence. Column space of the matrix X is incoherent with the column space\n\nof the identity matrix. That is, there exists some \u03bam > 0 such that\n\n|(cid:104)Xh, f(cid:105)| \u2264 \u03bam((cid:107)h(cid:107)2 + (cid:107)f(cid:107)2)2\n\nfor all\n\n(h, f ) \u2208 C.\n\nIf the incoherence between these two column spaces is suf\ufb01ciently small such that 4\u03bam < \u03bar, then\nwe can conclude that (cid:107)Xh + f(cid:107)2\n2 \u2265 (\u03bar \u2212 2\u03bam)((cid:107)h(cid:107)2 + (cid:107)f(cid:107)2)2. The small mutual incoherence\n\n5\n\n\fproperty is especially important since it provides how the regression separates away from the sparse\nerror.\n3) To simplify our result, we consider a special case of the uniform Gaussian design, in which\nn Ip\u00d7p. In this situation, Cmin = Cmax = \u03be(\u03a3) = 1/n. 
We have the following result which is\n\u03a3 = 1\na corollary of Theorem 1 and Lemma 1\nCorollary 1 (Standard Gaussian design). Let X be a standard Gaussian design matrix. Consider\n\nthe optimal solution ((cid:98)\u03b2,(cid:98)e) to the optimization problem (4) with regularization parameters chosen as\n\n\u03bb\u03b2 \u2265 4\n\u03b3\n\n\u03c32 log p and \u03bbe \u2265 4\n\n(12)\n\u03b32 log n , C2n} for some small\nwhere \u03b3 \u2208 (0, 1]. Also assuming that n \u2265 Ck log p and s \u2264 min{C1\nconstants C1, C2. Then with probability greater than 1 \u2212 c1 exp(\u2212c2n), the error set (h, f ) =\n(cid:19)\n\n((cid:98)\u03b2 \u2212 \u03b2(cid:63),(cid:98)e \u2212 e(cid:63)) is bounded by\n\n\u03c32 log n,\n\nn\n\n(cid:112)\n\n(cid:112)\n\n(cid:112)\n\n(cid:107)h(cid:107)2 + (cid:107)f(cid:107)2 \u2264 384\n\n\u03c32k log p +\n\n\u03c32s log n\n\n,\n\n(13)\n\n(cid:18) 1\n\n(cid:112)\n\n\u03b3\n\n\u221a\n\nCorollary 1 reveals an interesting phenomenon: by setting \u03b3 = 1/\nlog n, even when the fraction\nof corruption is linearly proportional with the number of samples n, the extended Lasso (4) is still\ncapable to recover both coef\ufb01cient vector \u03b2(cid:63) and corruption (missing) vector e(cid:63) within a bounded\nerror (13). Without the dense noise w in the observation model (3) (\u03c3 = 0), the extended Lasso\nrecovers the exact solution. This result is impossible to achieve with standard Lasso. Furthermore, if\nwe know in prior that the number of corrupted observations is an order of O(n/ log p), then selecting\n\u03b3 = 1 instead of 1/ log n will minimize the estimation error (see equation (13)) of Theorem 1.\n\n2.2 Feature selection with random Gaussian design\n\nIn many applications, the feature selection criteria is more preferred [17] [23]. 
Feature selection\nrefers to the property that the recovered parameter has the same signed support as the true regressor.\nIn general, good feature selection implies good parameter estimation but the reverse direction does\nnot usually hold. In this part, we investigate conditions for the design matrix and the scaling of\n(n, p, k, s) such as both regression and sparse error vectors obtain this criteria.\nConsider the linear model (3) where X is the Gaussian random design matrix whose rows are i.i.d\nzero mean with covariance matrix \u03a3. It has been well known in the Lasso that in order to obtain\nfeature selection accuracy, the covariance matrix \u03a3 must obey two properties: invertibility and small\nmutual coherence restricted on the set T . The \ufb01rst property guarantees that (4) is strictly convex,\nleading to the unique solution of the convex program, while the second property requires the sepa-\nration between two components of \u03a3, one related to the set T and the other to the set T c must be\nsuf\ufb01ciently small.\n\n1. Invertibility. To guarantee uniqueness, we require \u03a3T T to be invertible. Particularly, let\nCmin = \u03bbmin(\u03a3T T ), we require Cmin > 0.\n2. Mutual incoherence. For some \u03b3 \u2208 (0, 1),\n\n(cid:13)(cid:13)\u03a3\u2217\nT cT (\u03a3T T )\u22121(cid:13)(cid:13)\u221e \u2264 1\n\n(1 \u2212 \u03b3)\n\n(14)\nwhere (cid:107)\u00b7(cid:107)\u221e refers to (cid:96)\u221e/(cid:96)\u221e operator norm. It is worth noting that in the standard Lasso\nthe factor 1\n2 is omitted. Our condition is tighter than condition used to establish feature\nestimation in the Lasso by a constant factor. In fact, the quantity 1/2 is nothing special\nhere and we can set any value close to one with a compensation that the number of samples\nn will increase. 
Thus, we put 1/2 for the simplicity of the proof.\n\n2\n\nToward the end, we will also elaborate three other quantities operating on the restricted co-\nvariance matrix \u03a3T T : Cmax, which is de\ufb01ned as the maximum eigenvalue of \u03a3T T : Cmax :=\n\u03bbmax(\u03a3T T ); D\u2212\nT T and \u03a3T T :\nD\u2212\n\nmax, which are denoted as (cid:96)\u221e-norm of matrices \u03a3\u22121\n\nmax := (cid:107)\u03a3T T(cid:107)\u221e.\n\nmax :=(cid:13)(cid:13)(\u03a3T T )\u22121(cid:13)(cid:13)\u221e and D+\n\nmax and D+\n\n6\n\n\f\u03a3T c|T := \u03a3T cT c \u2212 \u03a3T cT \u03a3\u22121\n\nOur result also involves in two other quantities operating on the conditional covariance matrix of\n(XT c|XT ) de\ufb01ned as\n\nT T \u03a3T T c.\n\n(15)\n2 mini(cid:54)=j[(\u03a3T c|T )ii + (\u03a3T c|T )jj \u2212\n\nWe then de\ufb01ne \u03c1u(\u03a3T c|T ) = maxi(\u03a3T c|T )ii and \u03c1l(\u03a3T c|T ) = 1\n2(\u03a3T c|T )ij]. Toward the end, we denote a shorthand \u03c1u and \u03c1l.\nWe establish the following result for Gaussian random design whose covariance matrix \u03a3 obeys the\ntwo assumptions.\nTheorem 2. 
(Achievability) Given the linear model (3) with random Gaussian design and the co-\nvariance matrix \u03a3 satisfy invertibility and incoherence properties for any \u03b3 \u2208 (0, 1), suppose we\nsolve the extended Lasso (4) with regularization parameters obeying\n\nmax{\u03c1u, D+\n\nmax}n\u03c32 log p\n\nand\n\n\u03bbe = 8\n\n\u03c32 log n.\n\n(16)\n\n(cid:112)\n\n(cid:113)\n\n\u03bb\u03b2 =\n\n4\n\u03b3\n\n1\n\n32\u03b32 log n , the sequence (n, p, k, s) and regularization parameters \u03bb\u03b2, \u03bbe satisfying\n\nAlso, let \u03b7 =\ns \u2264 \u03b7n\n\nn \u2265 max\n\n(cid:26)\n\n(cid:27)\n\nk log(p \u2212 k) log n\n\n,\n(17)\ni | > f\u03b2(\u03bb\u03b2) and\n\nand\n\n(18)\n\nC1\n\n1\n\n(1 \u2212 \u03b7)\n\n\u03c1u\nCmin\n\nk log(p \u2212 k), C2\n\n\u03b7\n\n(1 \u2212 \u03b7)2\n\nmax{\u03c1u, D+\n\nmax}\n\nCmin\n\nwhere C1 and C2 are numerical constants. In addition, suppose that mini\u2208T |\u03b2(cid:63)\nmini\u2208S |e(cid:63)\n\n(cid:114)\ni | > fe(\u03bb\u03b2, \u03bbe) where\n\nf\u03b2 := c1\n\n\u03bb\u03b2\nn \u2212 s\n\n(cid:13)(cid:13)(cid:13)\u03a3\n\n\u22121/2\nT T\n\n(cid:13)(cid:13)(cid:13)2\n(cid:114)\n\nk log(p \u2212 k)\n\nn\n\u221a\ns + s\n\n\u221a\n\n(cid:115)\n\n\u221e + 20\nk log(p \u2212 k)\n\n\u03c32 log k\n\nCmin(n \u2212 s)\n\n(cid:13)(cid:13)(cid:13)\u03a3\n\n\u22121/2\nT T\n\n(cid:13)(cid:13)(cid:13)2\n\nfe := c2(Cmax(k\n\n(19)\nThen the following properties holds with probability greater than 1\u2212c exp(\u2212c(cid:48) max{log n, log pk})\n\n1. The solution pair ((cid:98)\u03b2,(cid:98)e) of the extended Lasso (4) is unique and has exact signed support.\n\n\u221e + c3\u03bbe.\n\nn\n\nk))1/2 \u03bb\u03b2\nn \u2212 s\n\n(cid:13)(cid:13)(cid:13)(cid:98)\u03b2 \u2212 \u03b2(cid:63)(cid:13)(cid:13)(cid:13)\u221e\n\n\u2264 f\u03b2(\u03bb\u03b2) and (cid:107)(cid:98)e \u2212 e(cid:63)(cid:107)\u221e \u2264 fe(\u03bb\u03b2).\n\n2. 
(cid:96)\u221e-norm bounds:\n\nThere are several interesting observations from the theorem\n1) The \ufb01rst and important observation is that extended Lasso is robust to arbitrarily large and sparse\nerror observation. In that sense, the extended Lasso can be viewed as a generalization of the Lasso.\nUnder the same invertibility and mutual incoherence assumptions on the covariance matrix \u03a3 as\nthe standard Lasso, the extended Lasso program can recover both the regression vector and error\nwith exact signed supports even when almost all the observations are contaminated by arbitrarily\nlarge error with unknown support. What we sacri\ufb01ce for the corruption robustness is an additional\nlog factor to the number of samples. We notice that when the error fraction is O(n/ log n), only\nO(k log(p \u2212 k)) samples are suf\ufb01cient to recover the exact signed supports of both regression and\nsparse error vectors.\n2) We consider the special case with Gaussian random design in which the covariance matrix \u03a3 =\nIn this case, entries of X is i.i.d N (0, 1/n) and we have quantities Cmin = Cmax =\nn Ip\u00d7p.\n1\nmax = D\u2212\nmax = \u03c1u = \u03c1l = 1. In addition, the invertibility and mutual incoherence properties\nD+\nare automatically satis\ufb01ed. The theorem implies that when the number of errors s is close to n,\nlog n = \u2126(k log(p \u2212\nthe number of samples n needed to recover exact signed supports satis\ufb01es\nk)). 
Furthermore, Theorem 2 guarantees consistency in element-wise (cid:96)\u221e-norm of the estimated\nregression at the rate\n\n(cid:18)(cid:112)\u03c32 log p\n\n(cid:113) k log(p\u2212k)\n\n(cid:13)(cid:13)(cid:13)(cid:98)\u03b2 \u2212 \u03b2(cid:63)(cid:13)(cid:13)(cid:13)\u221e = O\n\n(cid:19)\n\n\u03b32n\n\nn\n\n.\n\n\u221a\n\nAs \u03b3 is chosen to be 1/\nof O(\u03c3\n\n\u221a\n\nlog p), which is known to be the same as that of the standard Lasso.\n\n32 log n (equivalent to establish s close to n), the (cid:96)\u221e error rate is an order\n\n7\n\n\f3) Corollary 1, though interesting, is not able to guarantee stable recovery when the fraction of\ncorruption converges to one. We show in Theorem 2 that this fraction can come arbitrarily close\nto one by sacri\ufb01cing a factor of log n for the number of samples. Theorem 2 also implies that\nthere is a signi\ufb01cant difference between recovery to obtain small parameter estimation error versus\nrecovery to obtain correct variable selection. When the amount of corrupted observations is linearly\nproportional with n, recovering the exact signed supports require an increase from \u2126(k log p) (in\nCorollary 1) to \u2126(k log p log n) samples (in Theorem 2). This behavior is captured similarly by the\nstandard Lasso, as pointed out in [17], Corollary 2.\nOur next theorem show that the number of samples needed to recover accurate signed support is\noptimal. That is, whenever the rescaled sample size satis\ufb01es (20), then for whatever regularization\nparameters \u03bb\u03b2 and \u03bbe are selected, no solution of the extended Lasso correctly identi\ufb01es the signed\nsupports with high probability.\nTheorem 3. (Inachievability) Given the linear model (3) with random Gaussian design and the\ncovariance matrix \u03a3 satisfy invertibility and incoherence properties for any \u03b3 \u2208 (0, 1). 
Let $\eta = \frac{1}{32 \gamma^2 \log(n-s)}$ and suppose the sequence $(n, p, k, s)$ satisfies $s \geq \eta n$ and

$$n \leq \min\left\{ C_3 \frac{1}{1-\eta} \frac{\rho_u}{C_{\min}} \, k \log(p-k), \;\; C_4 \frac{\eta}{(1-\eta)^2} \frac{\min\{\rho_l, D^{+}_{\max}\}}{C_{\max}} \, k \log(p-k) \log\!\big((1-\eta)n\big) \left( 1 + \frac{\sqrt{\sigma^2 \log n}}{\lambda_e} \right)^{-1} \right\}, \quad (20)$$

where $C_3$ and $C_4$ are small universal constants. Then with probability tending to one, no solution pair of the extended Lasso (5) has the correct signed support.

3 Illustrative simulations

In this section, we provide simulations illustrating the ability of the extended Lasso to recover the exact regression signed support when a significant fraction of the observations is corrupted by large errors. Simulations are performed for a range of parameters $(n, p, k, s)$, where the design matrix $X$ is Gaussian with rows drawn i.i.d. from $N(0, I_{p \times p})$. For each fixed set $(n, p, k, s)$, we generate sparse vectors $\beta^\star$ and $e^\star$ whose nonzero entries have uniformly random locations and Gaussian-distributed magnitudes.

In our experiments, we consider varying problem sizes $p = \{128, 256, 512\}$ and three types of regression sparsity: sublinear sparsity ($k = 0.2p/\log(0.2p)$), linear sparsity ($k = 0.1p$), and fractional power sparsity ($k = 0.5p^{0.75}$). In all cases, we fix the error support size to $s = n/2$, meaning that half of the observations are corrupted. Under this selection, Theorem 2 suggests that $n \geq 2Ck \log(p-k) \log n$ samples are needed to guarantee exact signed support recovery. We therefore choose $n/\log n = 4\theta k \log(p-k)$, where the parameter $\theta$ is the rescaled sample size.
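A minimal Python sketch of this data-generating procedure, together with the signed-support success criterion used to score a trial, might look as follows. This is a reconstruction for illustration rather than the authors' code; in particular, the tolerance deciding which estimated entries count as nonzero is our own assumption.

```python
import numpy as np

def make_instance(n, p, k, s, sigma=0.1, seed=None):
    """One simulation instance: rows of X are i.i.d. N(0, I_p), the supports
    of beta_star (size k) and e_star (size s) are uniformly random, and the
    nonzero magnitudes are Gaussian, as in the experiments described above."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta_star = np.zeros(p)
    beta_star[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
    e_star = np.zeros(n)
    e_star[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
    y = X @ beta_star + e_star + sigma * rng.standard_normal(n)
    return y, X, beta_star, e_star

def same_signed_support(v_hat, v_star, tol=1e-6):
    """A trial succeeds if the sign patterns agree once entries of v_hat
    below tol in magnitude are treated as zero (tol is an assumed cutoff)."""
    s_hat = np.sign(np.where(np.abs(v_hat) > tol, v_hat, 0.0))
    return bool(np.array_equal(s_hat, np.sign(v_star)))
```

A trial at a given rescaled sample size $\theta$ would draw an instance with $s = n/2$, run the extended Lasso with the regularization parameters given below, and record `same_signed_support` for both the regression and the corruption estimates.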
This parameter controls the success or failure of the extended Lasso.

In the algorithm, we select $\lambda_\beta = 2\sqrt{\sigma^2 \log p \log n}$ and $\lambda_e = 2\sqrt{\sigma^2 \log n}$ as suggested by Theorem 2, where the noise level is fixed at $\sigma = 0.1$. The algorithm reports a success if the solution pair has the same signed support as $(\beta^\star, e^\star)$. In Fig. 1, each point on the curve represents the average of 100 trials.

As demonstrated by the simulations, the extended Lasso is capable of recovering the exact signed supports of both $\beta^\star$ and $e^\star$ even when 50% of the observations are contaminated. Furthermore, up to unknown constants, our Theorems 2 and 3 match the simulation results: once the sample size satisfies $n/\log n \leq 2k \log(p-k)$, the probability of success starts going to zero, indicating the failure of the extended Lasso.

Acknowledgments

We acknowledge support from the Army Research Office (ARO) under Grant 60291-MA and the National Science Foundation (NSF) under Grant CCF-1117545.

Figure 1: Probability of success in recovering the signed supports.

References

[1] A. Agarwal, S. Negahban, and M. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. Proc. 28th Inter. Conf. Mach. Learn. (ICML-11), pages 1129–1136, 2011.

[2] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

[3] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Submitted for publication, 2009.

[4] E. J. Candès and Y. Plan. Near-ideal model selection by ℓ1 minimization. Annals of Statistics, 37:2145–2177, 2009.

[5] E. J. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.

[6] E. Elhamifar and R. Vidal. Sparse subspace clustering.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2790–2797, 2009.

[7] J. N. Laska, M. A. Davenport, and R. G. Baraniuk. Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. In Asilomar Conference on Signals, Systems and Computers, pages 1556–1560, 2009.

[8] X. Li. Compressed sensing and matrix completion with constant proportion of corruptions. Preprint, 2011.

[9] Z. Li, F. Wu, and J. Wright. On the systematic measurement matrix for compressed sensing in the presence of gross error. In Data Compression Conference (DCC), pages 356–365, 2010.

[10] N. Meinshausen and P. Buehlmann. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436–1462, 2008.

[11] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):2246–2270, 2009.

[12] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Preprint, 2010.

[13] N. H. Nguyen and T. D. Tran. Exact recoverability from dense corrupted observations via ℓ1 minimization. Preprint, 2010.

[14] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[15] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[16] S. A. van de Geer and P. Buehlmann. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

[17] M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Trans.
Information Theory, 55(5):2183–2202, 2009.

[18] J. Wright and Y. Ma. Dense error correction via ℓ1 minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.

[19] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.

[20] H. Xu, C. Caramanis, and S. Sanghavi. Robust PCA via outlier pursuit. Advances in Neural Information Processing Systems (NIPS), pages 2496–2504, 2010.

[21] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

[22] T. Zhang. Some sharp performance bounds for least squares regression with ℓ1 regularization. Annals of Statistics, 37(5):2109–2144, 2009.

[23] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

[Figure 1: three panels (sublinear, linear, and fractional power sparsity) plotting probability of success against the rescaled sample size θ for p = 128, 256, 512.]", "award": [], "sourceid": 1065, "authors": [{"given_name": "Nasser", "family_name": "Nasrabadi", "institution": null}, {"given_name": "Trac", "family_name": "Tran", "institution": null}, {"given_name": "Nam", "family_name": "Nguyen", "institution": null}]}