{"title": "Robust Regression and Lasso", "book": "Advances in Neural Information Processing Systems", "page_first": 1801, "page_last": 1808, "abstract": "We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to $\\ell_1$ regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective.", "full_text": "Robust Regression and Lasso\n\nDepartment of Electrical and Computer Engineering\n\nHuan Xu\n\nMcGill University\n\nMontreal, QC Canada\n\nxuhuan@cim.mcgill.ca\n\nConstantine Caramanis\n\nDepartment of Electrical and Computer Engineering\n\nThe University of Texas at Austin\n\nAustin, Texas\n\ncmcaram@ece.utexas.edu\n\nDepartment of Electrical and Computer Engineering\n\nShie Mannor\n\nMcGill University\n\nMontreal, QC Canada\n\nshie.mannor@mcgill.ca\n\nAbstract\n\nWe consider robust least-squares regression with feature-wise disturbance. We\nshow that this formulation leads to tractable convex optimization problems, and\nwe exhibit a particular uncertainty set for which the robust problem is equivalent\nto `1 regularized regression (Lasso). This provides an interpretation of Lasso from\na robust optimization perspective. 
We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalizes known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective.\n\n1 Introduction\n\nIn this paper we consider linear regression problems with least-squares error. The problem is to find a vector x so that the $\\ell_2$ norm of the residual $b - Ax$ is minimized, for a given matrix $A \\in \\mathbb{R}^{n \\times m}$ and vector $b \\in \\mathbb{R}^n$. From a learning/regression perspective, each row of A can be regarded as a training sample, and the corresponding element of b as the target value of this observed sample. Each column of A corresponds to a feature, and the objective is to find a set of weights so that the weighted sum of the feature values approximates the target value.\n\nIt is well known that minimizing the least squared error can lead to sensitive solutions [1, 2]. Many regularization methods have been proposed to decrease this sensitivity. Among them, Tikhonov regularization [3] and Lasso [4, 5] are two widely known and cited algorithms. These methods minimize a weighted sum of the residual norm and a certain regularization term, $\\|x\\|_2$ for Tikhonov regularization and $\\|x\\|_1$ for Lasso. In addition to providing regularity, Lasso is also known for 
Recently this has attracted much attention for its ability to reconstruct sparse solutions when sampling occurs far below the Nyquist rate, and also for its ability to recover the sparsity pattern exactly with probability one, asymptotically as the number of observations increases (there is an extensive literature on this subject, and we refer the reader to [6, 7, 8, 9, 10] and references therein). In many of these approaches, the choice of regularization parameters often has no fundamental connection to an underlying noise model [2].\n\nIn [11], the authors propose an alternative approach to reducing the sensitivity of linear regression, by considering a robust version of the regression problem: they minimize the worst-case residual for the observations under some unknown but bounded disturbances. They show that their robust least squares formulation is equivalent to $\\ell_2$-regularized least squares, and they explore computational aspects of the problem. In that paper, and in most of the subsequent research in this area and the more general area of Robust Optimization (see [12, 13] and references therein), the disturbance is taken to be either row-wise and uncorrelated [14], or given by bounding the Frobenius norm of the disturbance matrix [11].\n\nIn this paper we investigate the robust regression problem under more general uncertainty sets, focusing in particular on the case where the uncertainty set is defined by feature-wise constraints, and also the case where features are meaningfully correlated. This is of interest when values of features are obtained with some noisy pre-processing steps, and the magnitudes of such noises are known or bounded. We prove that all our formulations are computationally tractable. Unlike much of the previous literature, we focus on structural properties of the robust solution. 
In addition to giving new formulations, and new properties of the solutions to these robust problems, we focus on the inherent importance of robustness, and its ability to prove from scratch important properties such as sparseness, and asymptotic consistency of Lasso in the statistical learning context. In particular, our main contributions in this paper are as follows.\n\n\u2022 We formulate the robust regression problem with feature-wise independent disturbances, and show that this formulation is equivalent to a least-squares problem with a weighted $\\ell_1$ norm regularization term. Hence, we provide an interpretation for Lasso from a robustness perspective. This can be helpful in choosing the regularization parameter. We generalize the robust regression formulation to loss functions given by an arbitrary norm, and uncertainty sets that allow correlation between disturbances of different features.\n\n\u2022 We investigate the sparsity properties of the robust regression problem with feature-wise independent disturbances, showing that such formulations encourage sparsity. We thus easily recover standard sparsity results for Lasso using a robustness argument. This also implies a fundamental connection between the feature-wise independence of the disturbance and the sparsity.\n\n\u2022 Next, we relate Lasso to kernel density estimation. This allows us to re-prove consistency in a statistical learning setup, using the new robustness tools and formulation we introduce.\n\nNotation. We use capital letters to represent matrices, and boldface letters to represent column vectors. For a vector z, we let $z_i$ denote the ith element. Throughout the paper, $a_i$ and $r_j^\\top$ denote the ith column and the jth row of the observation matrix A, respectively; $a_{ij}$ is the ij element of A, hence it is the jth element of $r_i$, and the ith element of $a_j$. 
For a convex function $f(\\cdot)$, $\\partial f(z)$ represents any of its sub-gradients evaluated at z.\n\n2 Robust Regression with Feature-wise Disturbance\n\nWe show that our robust regression formulation recovers Lasso as a special case. The regression formulation we consider differs from the standard Lasso formulation, as we minimize the norm of the error, rather than the squared norm. It is known that these two coincide up to a change of the regularization coefficient. Yet our results amount to more than a representation or equivalence theorem. In addition to more flexible and potentially powerful robust formulations, we prove new results, and give new insight into known results. In Section 3, we show the robust formulation gives rise to new sparsity results. Some of our results there (e.g. Theorem 4) fundamentally depend on (and follow from) the robustness argument, which is not found elsewhere in the literature. Then in Section 4, we establish consistency of Lasso directly from the robustness properties of our formulation, thus explaining consistency from a more physically motivated and perhaps more general perspective.\n\n2.1 Formulation\n\nRobust linear regression considers the case that the observed matrix A is corrupted by some disturbance. We seek the optimal weight for the uncorrupted (yet unknown) sample matrix. We consider the following min-max formulation:\n\nRobust Linear Regression: $\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\max_{\\Delta A \\in \\mathcal{U}} \\|b - (A + \\Delta A)x\\|_2 \\big\\}$. (1)\n\nHere, $\\mathcal{U}$ is the set of admissible disturbances of the matrix A. In this section, we consider the specific setup where the disturbance is feature-wise uncorrelated, and norm-bounded for each feature:\n\n$\\mathcal{U} \\triangleq \\big\\{ (\\delta_1, \\cdots, \\delta_m) \\,\\big|\\, \\|\\delta_i\\|_2 \\leq c_i, \\; i = 1, \\cdots, m \\big\\}$, (2)\n\nfor given $c_i \\geq 0$. This formulation recovers the well-known Lasso:\nTheorem 1. 
The robust regression problem (1) with the uncertainty set (2) is equivalent to the following $\\ell_1$ regularized regression problem:\n\n$\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\|b - Ax\\|_2 + \\sum_{i=1}^m c_i |x_i| \\big\\}$. (3)\n\nProof. We defer the full details to [15], and give only an outline of the proof here. Showing that the robust regression is a lower bound for the regularized regression follows from the standard triangle inequality. Conversely, one can take the worst-case noise to be $\\delta_i^* \\triangleq -c_i \\, \\mathrm{sgn}(x_i^*) u$, where u is given by\n\n$u \\triangleq \\begin{cases} \\frac{b - Ax^*}{\\|b - Ax^*\\|_2}, & \\text{if } Ax^* \\neq b, \\\\ \\text{any vector with unit } \\ell_2 \\text{ norm}, & \\text{otherwise;} \\end{cases}$\n\nfrom which the result follows after some algebra.\n\nIf we take $c_i = c$ and normalized $a_i$ for all i, Problem (3) is the well-known Lasso [4, 5].\n\n2.2 Arbitrary norm and correlated disturbance\n\nIt is possible to generalize this result to the case where the $\\ell_2$-norm is replaced by an arbitrary norm, and where the uncertainty is correlated from feature to feature. For space considerations, we refer to the full version ([15]), and simply state the main results here.\nTheorem 2. Let $\\|\\cdot\\|_a$ denote an arbitrary norm. Then the robust regression problem\n\n$\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\max_{\\Delta A \\in \\mathcal{U}_a} \\|b - (A + \\Delta A)x\\|_a \\big\\}$; $\\mathcal{U}_a \\triangleq \\big\\{ (\\delta_1, \\cdots, \\delta_m) \\,\\big|\\, \\|\\delta_i\\|_a \\leq c_i, \\; i = 1, \\cdots, m \\big\\}$;\n\nis equivalent to the regularized regression problem $\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\|b - Ax\\|_a + \\sum_{i=1}^m c_i |x_i| \\big\\}$.\n\nUsing feature-wise uncorrelated disturbance may lead to overly conservative results. We relax this, allowing the disturbances of different features to be correlated. Consider the following uncertainty set:\n\n$\\mathcal{U}' \\triangleq \\big\\{ (\\delta_1, \\cdots, \\delta_m) \\,\\big|\\, f_j(\\|\\delta_1\\|_a, \\cdots, \\|\\delta_m\\|_a) \\leq 0; \\; j = 1, \\cdots, k \\big\\}$,\n\nwhere $f_j(\\cdot)$ are convex functions. 
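As a quick numerical sanity check of the equivalence in Theorem 1 (our own sketch; the problem sizes, data, and budgets below are arbitrary assumptions, not from the paper), one can verify that the worst-case disturbance constructed in the proof outline attains the $\ell_1$-regularized objective (3), and that no feasible disturbance in the set (2) exceeds it:

```python
import numpy as np

# Sanity check of Theorem 1: for the feature-wise uncertainty set (2),
# the worst-case residual equals the l1-regularized objective (3).
rng = np.random.default_rng(0)
n, m = 20, 5
A = rng.standard_normal((n, m))
b = rng.standard_normal(n)
x = rng.standard_normal(m)           # any fixed candidate weight vector
c = np.abs(rng.standard_normal(m))   # per-feature disturbance budgets c_i

# Closed form from Theorem 1: ||b - Ax||_2 + sum_i c_i |x_i|.
closed_form = np.linalg.norm(b - A @ x) + np.sum(c * np.abs(x))

# Worst-case disturbance from the proof outline: delta_i = -c_i sgn(x_i) u,
# with u the normalized residual direction.
r = b - A @ x
u = r / np.linalg.norm(r)
delta = -np.outer(u, c * np.sign(x))       # column i is delta_i
attained = np.linalg.norm(b - (A + delta) @ x)
assert np.isclose(attained, closed_form)

# Random feasible disturbances never exceed the closed form.
for _ in range(200):
    D = rng.standard_normal((n, m))
    D *= c / np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # ||delta_i||_2 = c_i
    assert np.linalg.norm(b - (A + D) @ x) <= closed_form + 1e-9
```

The constructed disturbance matches the regularized objective to machine precision, while randomly sampled feasible disturbances stay below it, as the triangle-inequality direction of the proof requires.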
Notice that both k and $f_j$ can be arbitrary, hence this is a very general formulation and provides us with significant flexibility in designing uncertainty sets and, equivalently, new regression algorithms. The following theorem converts this formulation to a convex and tractable optimization problem.\nTheorem 3. Assume that the set $\\mathcal{Z} \\triangleq \\{ z \\in \\mathbb{R}^m \\mid f_j(z) \\leq 0, \\; j = 1, \\cdots, k; \\; z \\geq 0 \\}$ has non-empty relative interior. The robust regression problem\n\n$\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\max_{\\Delta A \\in \\mathcal{U}'} \\|b - (A + \\Delta A)x\\|_a \\big\\}$,\n\nis equivalent to the following regularized regression problem\n\n$\\min_{\\lambda \\in \\mathbb{R}^k_+, \\, \\kappa \\in \\mathbb{R}^m_+, \\, x \\in \\mathbb{R}^m} \\big\\{ \\|b - Ax\\|_a + v(\\lambda, \\kappa, x) \\big\\}$; (4)\n\nwhere: $v(\\lambda, \\kappa, x) \\triangleq \\max_{c \\in \\mathbb{R}^m} \\big[ (\\kappa + |x|)^\\top c - \\sum_{j=1}^k \\lambda_j f_j(c) \\big]$.\n\nExample 1. Suppose $\\mathcal{U}' = \\big\\{ (\\delta_1, \\cdots, \\delta_m) \\,\\big|\\, \\big\\| \\big( \\|\\delta_1\\|_a, \\cdots, \\|\\delta_m\\|_a \\big) \\big\\|_s \\leq l \\big\\}$ for a symmetric norm $\\|\\cdot\\|_s$, then the resulting regularized regression problem is\n\n$\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\|b - Ax\\|_a + l \\|x\\|_s^* \\big\\}$; where $\\|\\cdot\\|_s^*$ is the dual norm of $\\|\\cdot\\|_s$.\n\nThe robust regression formulation (1) considers disturbances that are bounded in a set, while in practice, often the disturbance is a random variable with unbounded support. In such cases, it is not possible to simply use an uncertainty set that includes all admissible disturbances, and we need to construct a meaningful $\\mathcal{U}$ based on probabilistic information. In the full version [15] we consider computationally efficient ways to use chance constraints to construct uncertainty sets.\n\n3 Sparsity\n\nIn this section, we investigate the sparsity properties of robust regression (1), and equivalently Lasso. Lasso's ability to recover sparse solutions has been extensively discussed (cf. [6, 7, 8, 9]), and takes one of two approaches. 
The first approach investigates the problem from a statistical perspective. That is, it assumes that the observations are generated by a (sparse) linear combination of the features, and investigates the asymptotic or probabilistic conditions required for Lasso to correctly recover the generative model. The second approach treats the problem from an optimization perspective, and studies under what conditions a pair (A, b) defines a problem with sparse solutions (e.g., [16]).\n\nWe follow the second approach and do not assume a generative model. Instead, we consider the conditions that lead to a feature receiving zero weight. In particular, we show that (i) as a direct result of feature-wise independence of the uncertainty set, a slight change of a feature that was originally assigned zero weight still gets zero weight (Theorem 4); (ii) using Theorem 4, we show that \u201cnearly\u201d orthogonal features get zero weight (Corollary 1); and (iii) \u201cnearly\u201d linearly dependent features get zero weight (Theorem 5). Substantial research regarding sparsity properties of Lasso can be found in the literature (cf. [6, 7, 8, 9, 17, 18, 19, 20] and many others). In particular, results similar to point (ii), which rely on an incoherence property, have been established in, e.g., [16], and are used as standard tools in investigating sparsity of Lasso from a statistical perspective. However, a proof exploiting robustness and properties of the uncertainty is novel. Indeed, such a proof shows a fundamental connection between robustness and sparsity, and implies that robustifying w.r.t. a feature-wise independent uncertainty set might be a plausible way to achieve sparsity for other problems.\nTheorem 4. 
Given $(\\tilde{A}, b)$, let $x^*$ be an optimal solution of the robust regression problem:\n\n$\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\max_{\\Delta A \\in \\mathcal{U}} \\|b - (\\tilde{A} + \\Delta A)x\\|_2 \\big\\}$.\n\nLet $I \\subseteq \\{1, \\cdots, m\\}$ be such that for all $i \\in I$, $x^*_i = 0$. Now let\n\n$\\tilde{\\mathcal{U}} \\triangleq \\big\\{ (\\delta_1, \\cdots, \\delta_m) \\,\\big|\\, \\|\\delta_j\\|_2 \\leq c_j, \\; j \\notin I; \\; \\|\\delta_i\\|_2 \\leq c_i + \\ell_i, \\; i \\in I \\big\\}$.\n\nThen, $x^*$ is an optimal solution of\n\n$\\min_{x \\in \\mathbb{R}^m} \\big\\{ \\max_{\\Delta A \\in \\tilde{\\mathcal{U}}} \\|b - (A + \\Delta A)x\\|_2 \\big\\}$,\n\nfor any A that satisfies $\\|a_i - \\tilde{a}_i\\| \\leq \\ell_i$ for $i \\in I$, and $a_j = \\tilde{a}_j$ for $j \\notin I$.\n\nProof. Notice that for $i \\in I$, $x^*_i = 0$, hence the ith column of both A and $\\Delta A$ has no effect on the residual. We have\n\n$\\max_{\\Delta A \\in \\tilde{\\mathcal{U}}} \\|b - (A + \\Delta A)x^*\\|_2 = \\max_{\\Delta A \\in \\mathcal{U}} \\|b - (A + \\Delta A)x^*\\|_2 = \\max_{\\Delta A \\in \\mathcal{U}} \\|b - (\\tilde{A} + \\Delta A)x^*\\|_2$.\n\nFor $i \\in I$, $\\|a_i - \\tilde{a}_i\\| \\leq \\ell_i$, and $a_j = \\tilde{a}_j$ for $j \\notin I$. Thus $\\{\\tilde{A} + \\Delta A \\mid \\Delta A \\in \\mathcal{U}\\} \\subseteq \\{A + \\Delta A \\mid \\Delta A \\in \\tilde{\\mathcal{U}}\\}$. Therefore, for any fixed $x'$, the following holds:\n\n$\\max_{\\Delta A \\in \\mathcal{U}} \\|b - (\\tilde{A} + \\Delta A)x'\\|_2 \\leq \\max_{\\Delta A \\in \\tilde{\\mathcal{U}}} \\|b - (A + \\Delta A)x'\\|_2$.\n\nBy definition of $x^*$,\n\n$\\max_{\\Delta A \\in \\mathcal{U}} \\|b - (\\tilde{A} + \\Delta A)x^*\\|_2 \\leq \\max_{\\Delta A \\in \\mathcal{U}} \\|b - (\\tilde{A} + \\Delta A)x'\\|_2$.\n\nTherefore we have\n\n$\\max_{\\Delta A \\in \\tilde{\\mathcal{U}}} \\|b - (A + \\Delta A)x^*\\|_2 \\leq \\max_{\\Delta A \\in \\tilde{\\mathcal{U}}} \\|b - (A + \\Delta A)x'\\|_2$.\n\nSince this holds for arbitrary $x'$, we establish the theorem.\n\nTheorem 4 is established using the robustness argument, and is a direct result of the feature-wise independence of the uncertainty set. It explains why Lasso tends to assign zero weight to non-relevant features. Consider a generative model\u00b9 $b = \\sum_{i \\in I} w_i a_i + \\tilde{\\xi}$ where $I \\subseteq \\{1, \\cdots, m\\}$ and $\\tilde{\\xi}$ is a random variable, i.e., b is generated by features belonging to I. In this case, for a feature $i_0 \\notin I$, Lasso would assign zero weight as long as there exists a perturbed value of this feature, such that the optimal regression assigned it zero weight. 
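A toy numerical illustration of this zero-weight behavior (entirely our own construction: the generative model, the proximal-gradient/ISTA solver, and the penalty value are assumptions, not part of the paper):

```python
import numpy as np

# Toy illustration: b is generated by the first two features only; Lasso
# (solved here by a simple ISTA / proximal-gradient loop) drives the weight
# of the irrelevant third feature to exactly zero.
rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, 3))
b = A[:, 0] * 2.0 + A[:, 1] * (-1.5) + 0.01 * rng.standard_normal(n)

lam = 5.0                                # l1 penalty weight (assumed value)
x = np.zeros(3)
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L, L = Lipschitz const. of grad
for _ in range(5000):
    grad = A.T @ (A @ x - b)             # gradient of 0.5*||b - Ax||^2
    z = x - step * grad
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

assert abs(x[2]) < 1e-8                  # irrelevant feature: exactly zero
assert abs(x[0] - 2.0) < 0.2 and abs(x[1] + 1.5) < 0.2
```

The soft-thresholding step is what produces an exact zero rather than a merely small coefficient, matching the sparsity behavior discussed here.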
This is also shown in the next corollary, in which we apply Theorem 4 to show that the problem has a sparse solution as long as an incoherence-type property is satisfied (this result is more in line with the traditional sparsity results).\nCorollary 1. Suppose that for all i, $c_i = c$. If there exists $I \\subset \\{1, \\cdots, m\\}$ such that for all $v \\in \\mathrm{span}(\\{a_i, i \\in I\\} \\cup \\{b\\})$, $\\|v\\| = 1$, we have $v^\\top a_j \\leq c$ $\\forall j \\notin I$, then any optimal solution $x^*$ satisfies $x^*_j = 0$, $\\forall j \\notin I$.\nProof. For $j \\notin I$, let $a^=_j$ denote the projection of $a_j$ onto the span of $\\{a_i, i \\in I\\} \\cup \\{b\\}$, and let $a^+_j \\triangleq a_j - a^=_j$. Thus, we have $\\|a^=_j\\| \\leq c$. Let $\\hat{A}$ be such that\n\n$\\hat{a}_i \\triangleq \\begin{cases} a_i & i \\in I; \\\\ a^+_i & i \\notin I. \\end{cases}$\n\nNow let\n\n$\\hat{\\mathcal{U}} \\triangleq \\big\\{ (\\delta_1, \\cdots, \\delta_m) \\,\\big|\\, \\|\\delta_i\\|_2 \\leq c, \\; i \\in I; \\; \\|\\delta_j\\|_2 = 0, \\; j \\notin I \\big\\}$.\n\nConsider the robust regression problem $\\min_{\\hat{x}} \\big\\{ \\max_{\\Delta A \\in \\hat{\\mathcal{U}}} \\|b - (\\hat{A} + \\Delta A)\\hat{x}\\|_2 \\big\\}$, which is equivalent to $\\min_{\\hat{x}} \\big\\{ \\|b - \\hat{A}\\hat{x}\\|_2 + \\sum_{i \\in I} c |\\hat{x}_i| \\big\\}$. Now we show that there exists an optimal solution $\\hat{x}^*$ such that $\\hat{x}^*_j = 0$ for all $j \\notin I$. This is because the $\\hat{a}_j$ are orthogonal to the span of $\\{\\hat{a}_i, i \\in I\\} \\cup \\{b\\}$. Hence for any given $\\hat{x}$, by changing $\\hat{x}_j$ to zero for all $j \\notin I$, the objective does not increase.\nSince $\\|\\hat{a}_j - a_j\\| = \\|a^=_j\\| \\leq c$ $\\forall j \\notin I$ (and recall that $\\mathcal{U} = \\{(\\delta_1, \\cdots, \\delta_m) \\mid \\|\\delta_i\\|_2 \\leq c, \\forall i\\}$), applying Theorem 4 we establish the corollary.\n\nThe next corollary follows easily from Corollary 1.\nCorollary 2. Suppose there exists $I \\subseteq \\{1, \\cdots, m\\}$, such that for all $i \\in I$, $\\|a_i\\| < c_i$. 
Then any optimal solution $x^*$ satisfies $x^*_i = 0$, for $i \\in I$.\n\n\u00b9While we are not assuming generative models to establish the results, it is still interesting to see how these results can help in a generative model setup.\n\nThe next theorem shows that sparsity is achieved when a set of features is \u201calmost\u201d linearly dependent. Again we refer to [15] for the proof.\nTheorem 5. Given $I \\subseteq \\{1, \\cdots, m\\}$ such that there exists a non-zero vector $(w_i)_{i \\in I}$ satisfying\n\n$\\big\\| \\sum_{i \\in I} w_i a_i \\big\\|_2 \\leq \\min_{\\sigma_i \\in \\{-1,+1\\}} \\big| \\sum_{i \\in I} \\sigma_i c_i w_i \\big|$,\n\nthen there exists an optimal solution $x^*$ such that $\\exists i \\in I : x^*_i = 0$.\nNotice that for linearly dependent features, there exists non-zero $(w_i)_{i \\in I}$ such that $\\|\\sum_{i \\in I} w_i a_i\\|_2 = 0$, which leads to the following corollary.\nCorollary 3. Given $I \\subseteq \\{1, \\cdots, m\\}$, let $A_I \\triangleq (a_i)_{i \\in I}$, and $t \\triangleq \\mathrm{rank}(A_I)$. There exists an optimal solution $x^*$ such that $x^*_I \\triangleq (x_i)^\\top_{i \\in I}$ has at most t non-zero coefficients.\nSetting $I = \\{1, \\cdots, m\\}$, we immediately get the following corollary.\nCorollary 4. If $n < m$, then there exists an optimal solution with no more than n non-zero coefficients.\n\n4 Density Estimation and Consistency\n\nIn this section, we investigate the robust linear regression formulation from a statistical perspective and rederive using only robustness properties that Lasso is asymptotically consistent. We note that our result applies to a considerably more general framework than Lasso. 
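The consistency phenomenon studied in this section can be illustrated numerically; the following sketch is our own (the generative model, the penalty scaling, and the simple proximal-gradient Lasso solver are all assumptions, not the paper's proof), and shows the Lasso estimate moving toward the population-optimal weights as n grows while $c_n$ shrinks:

```python
import numpy as np

# Rough illustration of consistency: with c_n -> 0 slowly, the Lasso
# solution approaches the population-optimal x(P) -- here the true
# generating weights, since the noise is independent of the features.
rng = np.random.default_rng(2)
w_true = np.array([1.0, -2.0, 0.0, 0.0])   # m = 4 features, two relevant

def lasso_ista(A, b, lam, iters=4000):
    """Minimize 0.5*||b - Ax||^2 + lam*||x||_1 by proximal gradient."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - b))
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x

errs = []
for n in (50, 5000):
    A = rng.standard_normal((n, 4))
    b = A @ w_true + 0.5 * rng.standard_normal(n)
    c_n = n ** (-0.1)                    # c_n -> 0 while n*c_n^(m+1) -> inf
    x_n = lasso_ista(A, b, lam=c_n * np.sqrt(n))  # assumed penalty scaling
    errs.append(np.linalg.norm(x_n - w_true))
assert errs[1] < errs[0]                 # larger sample => closer to x(P)
```

The schedule $c_n = n^{-0.1}$ is one choice satisfying the rate condition of Theorem 6 for $m = 4$; the `lam = c_n * sqrt(n)` scaling for the squared-loss solver is an assumption of this sketch.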
In the full version ([15]) we use some intermediate results from the consistency proof to show that regularization can be identified with the so-called maxmin expected utility (MMEU) framework, thus tying regularization to a fundamental tenet of decision theory.\n\nWe restrict our discussion to the case where the magnitude of the allowable uncertainty for all features equals c (i.e., the standard Lasso), and establish the statistical consistency of Lasso from a distributional robustness argument. Generalization to the non-uniform case is straightforward. Throughout, we use $c_n$ to represent c where there are n samples (we take $c_n$ to zero).\nRecall the standard generative model in statistical learning: let $\\mathbb{P}$ be a probability measure with bounded support that generates i.i.d. samples $(b_i, r_i)$, and has a density $f^*(\\cdot)$. Denote the set of the first n samples by $S_n$. Define\n\n$x(c_n, S_n) \\triangleq \\arg\\min_x \\Big\\{ \\sqrt{\\tfrac{1}{n} \\sum_{i=1}^n (b_i - r_i^\\top x)^2} + c_n \\|x\\|_1 \\Big\\}$;\n$x(\\mathbb{P}) \\triangleq \\arg\\min_x \\Big\\{ \\sqrt{\\int_{b,r} (b - r^\\top x)^2 \\, d\\mathbb{P}(b, r)} \\Big\\}$.\n\nIn words, $x(c_n, S_n)$ is the solution to Lasso with the tradeoff parameter set to $c_n \\sqrt{n}$, and $x(\\mathbb{P})$ is the \u201ctrue\u201d optimal solution. We have the following consistency result. The theorem itself is a well-known result. However, the proof technique is novel. This technique is of interest because the standard techniques to establish consistency in statistical learning, including VC dimension and algorithm stability, often work for a limited range of algorithms; e.g., SVMs are known to have infinite VC dimension, and we show in the full version ([15]) that Lasso is not stable. In contrast, a much wider range of algorithms have robustness interpretations, allowing a unified approach to prove their consistency.\nTheorem 6. Let $\\{c_n\\}$ be such that $c_n \\downarrow 0$ and $\\lim_{n \\to \\infty} n(c_n)^{m+1} = \\infty$. 
Suppose there exists a constant H such that $\\|x(c_n, S_n)\\|_2 \\leq H$ almost surely. Then,\n\n$\\lim_{n \\to \\infty} \\sqrt{\\int_{b,r} (b - r^\\top x(c_n, S_n))^2 \\, d\\mathbb{P}(b, r)} = \\sqrt{\\int_{b,r} (b - r^\\top x(\\mathbb{P}))^2 \\, d\\mathbb{P}(b, r)}$,\n\nalmost surely.\n\nThe full proof and results we develop along the way are deferred to [15], but we provide the main ideas and outline here. The key to the proof is establishing a connection between robustness and kernel density estimation.\nStep 1: For a given x, we show that the robust regression loss over the training data is equal to the worst-case expected generalization error. To show this we establish a more general result:\nProposition 1. Given a function $h : \\mathbb{R}^{m+1} \\to \\mathbb{R}$ and Borel sets $\\mathcal{Z}_1, \\cdots, \\mathcal{Z}_n \\subseteq \\mathbb{R}^{m+1}$, let\n\n$\\mathcal{P}_n \\triangleq \\{ \\mu \\in \\mathcal{P} \\mid \\forall S \\subseteq \\{1, \\cdots, n\\} : \\mu(\\cup_{i \\in S} \\mathcal{Z}_i) \\geq |S|/n \\}$.\n\nThe following holds:\n\n$\\frac{1}{n} \\sum_{i=1}^n \\sup_{(r_i, b_i) \\in \\mathcal{Z}_i} h(r_i, b_i) = \\sup_{\\mu \\in \\mathcal{P}_n} \\int_{\\mathbb{R}^{m+1}} h(r, b) \\, d\\mu(r, b)$.\n\nStep 2: Next we show that robust regression has a form like that on the left hand side above. Also, the set of distributions we supremize over, on the right hand side above, includes a kernel density estimator for the true (unknown) distribution. 
Indeed, consider the following kernel estimator: given samples $(b_i, r_i)_{i=1}^n$,\n\n$h_n(b, r) \\triangleq (nc^{m+1})^{-1} \\sum_{i=1}^n K\\Big( \\frac{(b - b_i, \\, r - r_i)}{c} \\Big)$, (5)\n\nwhere: $K(x) \\triangleq I_{[-1,+1]^{m+1}}(x) / 2^{m+1}$.\n\nObserve that the estimated distribution given by Equation (5) belongs to the set of distributions\n\n$\\mathcal{P}_n(A, \\Delta, b, c) \\triangleq \\big\\{ \\mu \\in \\mathcal{P} \\,\\big|\\, \\mathcal{Z}_i = [b_i - c, b_i + c] \\times \\prod_{j=1}^m [a_{ij} - \\delta_{ij}, a_{ij} + \\delta_{ij}]; \\; \\forall S \\subseteq \\{1, \\cdots, n\\} : \\mu(\\cup_{i \\in S} \\mathcal{Z}_i) \\geq |S|/n \\big\\}$,\n\nand hence belongs to $\\hat{\\mathcal{P}}(n) \\triangleq \\bigcup_{\\Delta : \\forall j, \\sum_i \\delta_{ij}^2 = nc^2} \\mathcal{P}_n(A, \\Delta, b, c)$, which is precisely the set of distributions used in the representation from Proposition 1.\nStep 3: Combining the last two steps, and using the fact that $\\int_{b,r} |h_n(b, r) - f^*(b, r)| \\, d(b, r)$ goes to zero almost surely when $c_n \\downarrow 0$ and $nc_n^{m+1} \\uparrow \\infty$, since $h_n(\\cdot)$ is a kernel density estimator of $f^*(\\cdot)$ (see e.g. Theorem 3.1 of [21]), we prove consistency of robust regression.\nWe can remove the assumption that $\\|x(c_n, S_n)\\|_2 \\leq H$, and as in Theorem 6, the proof technique rather than the result itself is of interest. We postpone the proof to [15].\nTheorem 7. Let $\\{c_n\\}$ converge to zero sufficiently slowly. Then\n\n$\\lim_{n \\to \\infty} \\sqrt{\\int_{b,r} (b - r^\\top x(c_n, S_n))^2 \\, d\\mathbb{P}(b, r)} = \\sqrt{\\int_{b,r} (b - r^\\top x(\\mathbb{P}))^2 \\, d\\mathbb{P}(b, r)}$,\n\nalmost surely.\n\n5 Conclusion\n\nIn this paper, we consider robust regression with a least-squares-error loss, and extend the results of [11] (i.e., Tikhonov regularization is equivalent to a robust formulation for a Frobenius norm-bounded disturbance set) to a broader range of disturbance sets and hence regularization schemes. A special case of our formulation recovers the well-known Lasso algorithm, and we obtain an interpretation of Lasso from a robustness perspective. 
We consider more general robust regression formulations, allowing correlation between the feature-wise noise terms, and we show that this too leads to tractable convex optimization problems.\n\nWe exploit the new robustness formulation to give direct proofs of sparseness and consistency for Lasso. As our results follow from robustness properties, this suggests that they may be far more general than Lasso, and that in particular, consistency and sparseness may be properties one can obtain more generally from robustified algorithms.\n\nReferences\n\n[1] L. Elden. Perturbation theory for the least-square problem with linear equality constraints. BIT, 24:472\u2013476, 1985.\n[2] G. Golub and C. Van Loan. Matrix Computation. Johns Hopkins University Press, Baltimore, 1989.\n[3] A. Tikhonov and V. Arsenin. Solutions of Ill-Posed Problems. Wiley, New York, 1977.\n[4] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n[5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407\u2013499, 2004.\n[6] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33\u201361, 1998.\n[7] A. Feuer and A. Nemirovski. On sparse representation in pairs of bases. IEEE Transactions on Information Theory, 49(6):1579\u20131581, 2003.\n[8] E. Cand\u00e8s, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489\u2013509, 2006.\n[9] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231\u20132242, 2004.\n[10] M. 
Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\\ell_1$-constrained quadratic programming. Technical report, available from: http://www.stat.berkeley.edu/tech-reports/709.pdf, Department of Statistics, UC Berkeley, 2006.\n[11] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18:1035\u20131064, 1997.\n[12] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1\u201313, August 1999.\n[13] D. Bertsimas and M. Sim. The price of robustness. Operations Research, 52(1):35\u201353, January 2004.\n[14] P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283\u20131314, July 2006.\n[15] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. Submitted, available from http://arxiv.org/abs/0811.1790v1, 2008.\n[16] J. Tropp. Just relax: Convex programming methods for identifying sparse signals. IEEE Transactions on Information Theory, 51(3):1030\u20131051, 2006.\n[17] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1445\u20131480, 1998.\n[18] R. R. Coifman and M. V. Wickerhauser. Entropy-based algorithms for best-basis selection. IEEE Transactions on Information Theory, 38(2):713\u2013718, 1992.\n[19] S. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397\u20133415, 1993.\n[20] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289\u20131306, 2006.\n[21] L. Devroye and L. Gy\u00f6rfi. Nonparametric Density Estimation: The L1 View. 
John Wiley & Sons, 1985.", "award": [], "sourceid": 683, "authors": [{"given_name": "Huan", "family_name": "Xu", "institution": null}, {"given_name": "Constantine", "family_name": "Caramanis", "institution": null}, {"given_name": "Shie", "family_name": "Mannor", "institution": null}]}