{"title": "Outlier-robust estimation of a sparse linear model using $\\ell_1$-penalized Huber's $M$-estimator", "book": "Advances in Neural Information Processing Systems", "page_first": 13188, "page_last": 13198, "abstract": "We study the problem of estimating a $p$-dimensional \n$s$-sparse vector in a linear model with Gaussian design. \nIn the case where the labels are contaminated by at most \n$o$ adversarial outliers, we prove that the $\\ell_1$-penalized \nHuber's $M$-estimator based on $n$ samples attains the \noptimal rate of convergence $(s/n)^{1/2} + (o/n)$, up to a \nlogarithmic factor. For more general design matrices, our results \nhighlight the importance of two properties: the transfer principle\nand the incoherence property. These properties with suitable \nconstants are shown to yield the optimal rates of robust \nestimation with adversarial contamination.", "full_text": "Outlier-robust estimation of a sparse linear model\n\nusing (cid:96)1-penalized Huber\u2019s M-estimator\n\nAbstract\n\nWe study the problem of estimating a p-dimensional s-sparse vector in a linear\nmodel with Gaussian design and additive noise. In the case where the labels are\ncontaminated by at most o adversarial outliers, we prove that the (cid:96)1-penalized\nHuber\u2019s M-estimator based on n samples attains the optimal rate of convergence\n(s/n)1/2 + (o/n), up to a logarithmic factor. For more general design matrices,\nour results highlight the importance of two properties: the transfer principle and\nthe incoherence property. These properties with suitable constants are shown to\nyield the optimal rates, up to log-factors, of robust estimation with adversarial\ncontamination.\n\n1\n\nIntroduction\n\nIs it possible to attain optimal rates of estimation in outlier-robust sparse regression using penalized\nempirical risk minimization (PERM) with convex loss and convex penalties? Current state of literature\non robust estimation does not answer this question. 
Furthermore, it contains some signals suggesting that the answer to this question might be negative. First, it has been shown in (Chen et al., 2013, Theorem 1) that in the case of adversarially corrupted samples, no method based on penalized empirical loss minimization, with a convex loss and a convex penalty, can lead to consistent support recovery. The authors then advocate for robustifying the $\ell_1$-penalized least-squares estimator by replacing the usual scalar products with their trimmed counterparts. Second, (Chen et al., 2018) established that in the multivariate Gaussian model subject to Huber's contamination, the coordinatewise median (the ERM for the $\ell_1$-loss) is sub-optimal. A similar result was proved in (Lai et al., 2016, Prop. 2.1) for the geometric median, the ERM corresponding to the $\ell_2$-loss. These negative results prompted researchers to use other techniques, often of higher computational complexity, to solve the problem of outlier-corrupted sparse linear regression.

In the present work, we prove that the $\ell_1$-penalized empirical risk minimizer based on Huber's loss is minimax-rate-optimal, up to possible logarithmic factors. Naturally, this result is not valid in the most general situation; we demonstrate its validity under the assumptions that the design matrix satisfies some incoherence condition and that only the response is subject to contamination. The incoherence condition is shown to be satisfied by the Gaussian design with a covariance matrix whose diagonal entries are bounded and bounded away from zero. This relatively simple setting is chosen in order to convey the main message of this work: for properly chosen convex loss and convex penalty functions, the PERM is minimax-rate-optimal in sparse linear regression with adversarially corrupted labels.

To describe the aforementioned optimality result more precisely, let $D^\circ_n = \{(X_i, y^\circ_i);\ i = 1, \dots, n\}$
be i.i.d. feature-label pairs such that the $X_i \in \mathbb{R}^p$ are Gaussian with zero mean and covariance matrix $\Sigma$, and the $y^\circ_i$ are defined by the linear model
$$y^\circ_i = X_i^\top \beta^* + \xi_i, \qquad i = 1, \dots, n,$$
where the random noise $\xi_i$, independent of $X_i$, is Gaussian with zero mean and variance $\sigma^2$. Instead of observing the "clean" data $D^\circ_n$, we have access to a contaminated version of it, $D_n = \{(X_i, y_i);\ i = 1, \dots, n\}$, in which a small number $o \in \{1, \dots, n\}$ of labels $y^\circ_i$ are replaced by an arbitrary value. Setting $\theta^*_i = (y_i - y^\circ_i)/\sqrt{n}$, $i = 1, \dots, n$, and using matrix-vector notation, the described model can be written as
$$Y = X\beta^* + \sqrt{n}\,\theta^* + \xi, \qquad (1)$$

Submitted to 33rd Conference on Neural Information Processing Systems (NeurIPS 2019). Do not distribute.

where $X = [X_1^\top; \dots; X_n^\top]$ is the $n \times p$ design matrix, $Y = (y_1, \dots, y_n)^\top$ is the response vector, $\theta^* = (\theta^*_1, \dots, \theta^*_n)^\top$ is the contamination and $\xi = (\xi_1, \dots, \xi_n)^\top$ is the noise vector. The goal is to estimate the vector $\beta^* \in \mathbb{R}^p$.
The dimension $p$ is assumed to be large, possibly larger than $n$, but, for some small value $s \in \{1, \dots, p\}$, the vector $\beta^*$ is assumed to be $s$-sparse: $\|\beta^*\|_0 = \mathrm{Card}\{j : \beta^*_j \neq 0\} \le s$. In such a setting, it is well known that if we have access to the clean data $D^\circ_n$ and measure the quality of an estimator $\hat\beta$ by the Mahalanobis norm¹ $\|\Sigma^{1/2}(\hat\beta - \beta^*)\|_2$, the optimal rate is
$$r^\circ(n, p, s) = \sigma \Big(\frac{s \log(p/s)}{n}\Big)^{1/2}.$$
In the outlier-contaminated setting, i.e., when $D^\circ_n$ is unavailable but one has access to $D_n$, the minimax-optimal rate (Chen et al., 2016) takes the form
$$r(n, p, s, o) = \sigma \Big(\frac{s \log(p/s)}{n}\Big)^{1/2} + \frac{\sigma o}{n}. \qquad (2)$$
The first estimators proved to attain this rate (Chen et al., 2016; Gao, 2017) were computationally intractable² for large $p$, $s$ and $o$. This motivated several authors to search for polynomial-time algorithms attaining a nearly optimal rate; the most relevant results will be reviewed later in this work.

The assumption that only a small number $o$ of labels are contaminated by outliers implies that the vector $\theta^*$ in (1) is $o$-sparse. In order to take advantage of the sparsity of both $\beta^*$ and $\theta^*$ while ensuring computational tractability of the resulting estimator, a natural approach studied in several papers (Laska et al., 2009; Nguyen and Tran, 2013; Dalalyan and Chen, 2012) is to use some version of the $\ell_1$-penalized ERM. This corresponds to defining
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \min_{\theta \in \mathbb{R}^n} \Big\{ \frac{1}{2n} \|Y - X\beta - \sqrt{n}\,\theta\|_2^2 + \lambda_s \|\beta\|_1 + \lambda_o \|\theta\|_1 \Big\}, \qquad (3)$$
where $\lambda_s, \lambda_o > 0$ are tuning parameters.
This estimator is very attractive from a computational perspective, since it can be seen as the Lasso for the augmented design matrix $M = [X, \sqrt{n}\,I_n]$, where $I_n$ is the $n \times n$ identity matrix. To date, the best known rate for this type of estimator is
$$\sigma \Big(\frac{s \log p}{n}\Big)^{1/2} + \sigma \Big(\frac{o}{n}\Big)^{1/2}, \qquad (4)$$
obtained in (Nguyen and Tran, 2013) under some restrictions on $(n, p, s, o)$. A quick comparison of (2) and (4) shows that the latter is sub-optimal. Indeed, the ratio of the two rates may be as large as $(n/o)^{1/2}$. The main goal of the present paper is to show that this sub-optimality is not an intrinsic property of the estimator (3), but rather an artefact of previous proof techniques. By using a refined argument, we prove that $\hat\beta$ defined by (3) does attain the optimal rate under very mild assumptions.

In the sequel, we refer to $\hat\beta$ as the $\ell_1$-penalized Huber's $M$-estimator. The rationale for this term is that the minimization with respect to $\theta$ in (3) can be done explicitly. It yields (Donoho and Montanari, 2016, Section 6)
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \lambda_o^2 \sum_{i=1}^n \Phi\Big(\frac{y_i - X_i^\top \beta}{\sqrt{n}\,\lambda_o}\Big) + \lambda_s \|\beta\|_1 \Big\}, \qquad (5)$$
where $\Phi : \mathbb{R} \to \mathbb{R}$ is Huber's function defined by $\Phi(u) = (1/2)u^2 \wedge (|u| - 1/2)$.

To prove the rate-optimality of the estimator $\hat\beta$, we first establish a risk bound for a general design matrix $X$ not necessarily formed by Gaussian vectors. This is done in the next section. Then, in Section 3, we state and discuss the result showing that all the necessary conditions are satisfied for the Gaussian design. Relevant prior work is presented in Section 4, while Section 5 discusses potential extensions.
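The equivalence between (3) and (5) boils down to the scalar identity $\min_t \{(1/2)(r - t)^2 + \lambda |t|\} = \lambda^2\,\Phi(r/\lambda)$, applied to each coordinate of $\theta$. The following sketch (in Python with NumPy; the helper names are ours and purely illustrative, not from any reference implementation) checks this identity numerically by brute-force minimization on a fine grid:

```python
import numpy as np

def huber_phi(u):
    """Huber's function: Phi(u) = (1/2)u**2 for |u| <= 1, and |u| - 1/2 otherwise."""
    return np.where(np.abs(u) <= 1.0, 0.5 * u**2, np.abs(u) - 0.5)

def profiled_theta_objective(r, lam, grid):
    """min over t of (1/2)(r - t)**2 + lam*|t|, approximated on a grid."""
    vals = 0.5 * (r - grid) ** 2 + lam * np.abs(grid)
    return vals.min()

lam = 0.7
grid = np.linspace(-10, 10, 2_000_001)   # spacing 1e-5; the exact minimizer is soft-thresholding
for r in [-3.0, -0.4, 0.0, 1.2, 5.0]:
    lhs = profiled_theta_objective(r, lam, grid)
    rhs = lam**2 * huber_phi(r / lam)    # the closed form claimed by the identity
    assert abs(lhs - rhs) < 1e-6, (r, lhs, rhs)
print("profiling out theta reproduces Huber's loss")
```

Summed over $i = 1, \dots, n$ (with $r_i = y_i - X_i^\top \beta$ and $\lambda = \sqrt{n}\,\lambda_o$, after rescaling by $1/n$), this is exactly the passage from (3) to (5).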
Section 7 provides a summary of our results and an outlook on future work. The proofs are deferred to the supplementary material.

¹ In the sequel, we use the notation $\|\beta\|_q = (\sum_j |\beta_j|^q)^{1/q}$ for any vector $\beta \in \mathbb{R}^p$ and any $q \ge 1$.
² In the sense that there is no algorithm computing these estimators in time polynomial in $(n, p, s, o)$.

2 Risk bound for the $\ell_1$-penalized Huber's $M$-estimator

This section is devoted to bringing forward sufficient conditions on the design matrix that allow for rate-optimal risk bounds for the estimator $\hat\beta$ defined by (3) or, equivalently, by (5). There are two qualitative conditions that can easily be seen to be necessary: we call them restricted invertibility and incoherence.
Indeed, even when there is no contamination, i.e., the number of outliers is known to be $o = 0$, the matrix $X$ has to satisfy a restricted invertibility condition (such as restricted isometry, restricted eigenvalue or compatibility) in order for the Lasso estimator (3) to achieve the optimal rate $\sigma\sqrt{(s/n)\log(p/s)}$. On the other hand, in the case where $n = p$ and $X = \sqrt{n}\,I_n$, even in the extremely favorable situation where the noise $\xi$ is zero, the only identifiable vector is $\beta^* + \theta^*$. Therefore, it is impossible to consistently estimate $\beta^*$ when the design matrix $X$ is aligned with the identity matrix $I_n$ or close to being so.

The next definition formalizes what we call restricted invertibility and incoherence by introducing three notions: the transfer principle, the incoherence property and the augmented transfer principle. We will show that these notions play a key role in robust estimation by $\ell_1$-penalized least squares.

Definition 1. Let $Z \in \mathbb{R}^{n \times p}$ be a (random) matrix and $\Sigma \in \mathbb{R}^{p \times p}$.
We use the notation $Z^{(n)} = Z/\sqrt{n}$.

(i) We say that $Z$ satisfies the transfer principle with $a_1 \in (0, 1)$ and $a_2 \in (0, \infty)$, denoted by $TP_\Sigma(a_1; a_2)$, if for all $v \in \mathbb{R}^p$,
$$\|Z^{(n)} v\|_2 \ge a_1 \|\Sigma^{1/2} v\|_2 - a_2 \|v\|_1. \qquad (6)$$

(ii) We say that $Z$ satisfies the incoherence property $IP_\Sigma(b_1; b_2; b_3)$ for some positive numbers $b_1$, $b_2$ and $b_3$, if for all $[v; u] \in \mathbb{R}^{p+n}$,
$$|u^\top Z^{(n)} v| \le b_1 \|\Sigma^{1/2} v\|_2 \|u\|_2 + b_2 \|v\|_1 \|u\|_2 + b_3 \|\Sigma^{1/2} v\|_2 \|u\|_1.$$

(iii) We say that $Z$ satisfies the augmented transfer principle $ATP_\Sigma(c_1; c_2; c_3)$ for some positive numbers $c_1$, $c_2$ and $c_3$, if for all $[v; u] \in \mathbb{R}^{p+n}$,
$$\|Z^{(n)} v + u\|_2 \ge c_1 \|[\Sigma^{1/2} v; u]\|_2 - c_2 \|v\|_1 - c_3 \|u\|_1. \qquad (7)$$

These three properties are inter-related, and related to the extreme singular values of the matrix $Z^{(n)}$.

(P1) If $Z$ satisfies $ATP_\Sigma(c_1; c_2; c_3)$ then it also satisfies $TP_\Sigma(c_1; c_2)$.
(P2) If $Z$ satisfies $TP_\Sigma(a_1; a_2)$ and $IP_\Sigma(b_1; b_2; b_3)$ then it also satisfies $ATP_\Sigma(c_1; c_2; c_3)$ with $c_1^2 = a_1^2 - b_1 - \alpha^2$, $c_2 = a_2 + 2b_2/\alpha$ and $c_3 = 2b_3/\alpha$, for any positive $\alpha < \sqrt{a_1^2 - b_1}$.
(P3) If $Z$ satisfies $IP_\Sigma(b_1; b_2; b_3)$, then it also satisfies $IP_\Sigma(0; b_2; b_1 + b_3)$.
(P4) Any matrix $Z$ satisfies $TP_I(s_p(Z^{(n)}); 0)$ and $IP_I(s_1(Z^{(n)}); 0; 0)$, where $s_p(Z^{(n)})$ and $s_1(Z^{(n)})$ are, respectively, the $p$-th largest and the largest singular values of $Z^{(n)}$.

Claim (P1) is true since, choosing $u = 0$ in (7), we obtain (6). Claim (P2) coincides with Lemma 7, proved in the supplement.
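The transfer principle is easy to probe numerically. The sketch below (Python with NumPy; the constants $a_1, a_2$ are loose illustrative choices of ours, much weaker than the sharp ones given for the Gaussian design in Section 3) checks the $TP_I$ inequality for a standard Gaussian matrix on randomly sampled sparse directions. Of course, sampling directions only illustrates the condition; it does not certify it for all $v$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
Z = rng.standard_normal((n, p))       # Gaussian design with Sigma = I_p
Zn = Z / np.sqrt(n)                   # Z^(n) in the notation of Definition 1

# Loose illustrative constants (our choice, far weaker than the sharp ones):
a1 = 0.25
a2 = 1.2 * np.sqrt(2 * np.log(p) / n)

for _ in range(1000):
    v = rng.standard_normal(p)
    v[rng.random(p) < 0.8] = 0.0      # sparse direction, so the l1 term does not dominate
    lhs = np.linalg.norm(Zn @ v)
    rhs = a1 * np.linalg.norm(v) - a2 * np.sum(np.abs(v))
    assert lhs >= rhs                 # TP_I(a1; a2) holds on this draw
print("transfer principle held on all sampled directions")
```

Here $n > 4p$, so the smallest singular value of $Z^{(n)}$ is bounded away from zero with high probability, which by (P4) already yields $TP_I$ with $a_2 = 0$; the $a_2\|v\|_1$ slack only becomes essential when $p > n$.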
(P3) is a direct consequence of the inequality $\|u\|_2 \le \|u\|_1$, valid for any vector $u$. (P4) is a well-known characterization of the smallest and the largest singular values of a matrix. We will show later on that a Gaussian matrix satisfies all these conditions with high probability, with constants $a_1$ and $c_1$ independent of $(n, p)$ and constants $a_2, b_2, b_3, c_2, c_3$ of order $n^{-1/2}$, up to logarithmic factors.

To state the main theorem of this section, we consider the simplified setting in which $\lambda_s = \lambda_o = \lambda$. Recall that in practice it is always recommended to normalize the columns of the matrix $X$ so that their Euclidean norm is of the order $\sqrt{n}$. A more precise version of the next result, with better constants, is provided in the supplement (see Proposition 1). We recall that a matrix $\Sigma$ is said to satisfy the restricted eigenvalue condition $RE(s, c_0)$ with constant $\kappa > 0$ if $\|\Sigma^{1/2} v\|_2 \ge \kappa \|v_J\|_2$ for any vector $v \in \mathbb{R}^p$ and any set $J \subset \{1, \dots, p\}$ such that $\mathrm{Card}(J) \le s$ and $\|v_{J^c}\|_1 \le c_0 \|v_J\|_1$.

Theorem 1. Let $\Sigma$ satisfy the $RE(s, 5)$ condition with constant $\kappa > 0$.
Let $b_1, b_2, b_3, c_1, c_2, c_3$ be some positive real numbers such that $X$ satisfies the $IP_\Sigma(0; b_2; b_3)$ and the $ATP_\Sigma(c_1; c_2; c_3)$. Assume that, for some $\delta \in (0, 1)$, the tuning parameter $\lambda$ satisfies
$$\lambda \sqrt{n} \ge \sqrt{8 \log(n/\delta)} \,\vee\, \Big(\max_{j=1,\dots,p} \|X^{(n)}_{\bullet,j}\|_2\Big) \sqrt{8 \log(p/\delta)}.$$
If the sparsity $s$ and the number of outliers $o$ satisfy the condition
$$\frac{s}{\kappa^2} + o \le \frac{c_1^2}{400\,(c_2 \vee c_3 \vee 5b_2/c_1)^2}, \qquad (8)$$
then, with probability at least $1 - 2\delta$, we have
$$\|\Sigma^{1/2}(\hat\beta - \beta^*)\|_2 \le \frac{24\lambda}{c_1^2} \Big(\frac{2c_2}{c_1} \vee \frac{b_3}{c_1}\Big) \Big(\frac{s}{\kappa^2} + 7o\Big) + \frac{5\lambda\sqrt{s}}{6c_1^2\,\kappa}. \qquad (9)$$

Theorem 1 is somewhat hard to parse. At this stage, let us simply mention that in the case of the Gaussian design considered in the next section, $c_1$ is of order 1 while $b_2, b_3, c_2, c_3$ are of order $n^{-1/2}$, up to a factor logarithmic in $p$, $n$ and $1/\delta$. Here $\delta$ is an upper bound on the probability that the Gaussian matrix $X$ does not satisfy either $IP_\Sigma$ or $ATP_\Sigma$. Since Theorem 1 allows us to choose $\lambda$ of the order $\sqrt{\log\{(p+n)/\delta\}/n}$, we infer from (9) that the error of estimating $\beta^*$, measured in the Euclidean norm, is of order $\frac{o}{n} + \big(\frac{s}{n\kappa^2}\big)^{1/2}$, up to logarithmic factors, under the assumption that $\big(\frac{s}{n\kappa^2} + \frac{o}{n}\big) \log(np/\delta)$ is smaller than a universal constant.

To complete this section, we present a sketch of the proof of Theorem 1.
In order to convey the main ideas without diving too much into technical details, we assume $\Sigma = I_p$. This means that the RE condition is satisfied with $\kappa = 1$ for any $s$ and $c_0$. From the fact that the $ATP_\Sigma$ holds for $X$, we infer that $[X, \sqrt{n}\,I_n]$ satisfies the $RE(s + o, 5)$ condition with the constant $c_1/2$. Using the well-known risk bounds for the Lasso estimator (Bickel et al., 2009), we get
$$\|\hat\beta - \beta^*\|_2^2 + \|\hat\theta - \theta^*\|_2^2 \le C\lambda^2 (s + o) \quad\text{and}\quad \|\hat\beta - \beta^*\|_1 + \|\hat\theta - \theta^*\|_1 \le C\lambda (s + o). \qquad (10)$$
Note that these are the risk bounds established in³ (Candès and Randall, 2008; Dalalyan and Chen, 2012; Nguyen and Tran, 2013). These bounds are most likely unimprovable as long as the estimation of $\theta^*$ is of interest. However, if we focus only on the estimation error of $\beta^*$, considering $\theta^*$ as a nuisance parameter, the following argument leads to a sharper risk bound. First, we note that
$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \Big\{ \frac{1}{2n} \|Y - X\beta - \sqrt{n}\,\hat\theta\|_2^2 + \lambda \|\beta\|_1 \Big\}.$$
The KKT conditions of this convex optimization problem take the following form:
$$\frac{1}{n} X^\top (Y - X\hat\beta - \sqrt{n}\,\hat\theta) \in \lambda \cdot \mathrm{sgn}(\hat\beta),$$
where $\mathrm{sgn}(\hat\beta)$ is the subset of $\mathbb{R}^p$ containing all the vectors $w$ such that $w_j \hat\beta_j = |\hat\beta_j|$ and $|w_j| \le 1$ for every $j \in \{1, \dots, p\}$.
Multiplying the last displayed equation from the left by $\beta^* - \hat\beta$, we get
$$\frac{1}{n} (\beta^* - \hat\beta)^\top X^\top (Y - X\hat\beta - \sqrt{n}\,\hat\theta) \le \lambda \big(\|\beta^*\|_1 - \|\hat\beta\|_1\big).$$
Recall now that $Y = X\beta^* + \sqrt{n}\,\theta^* + \xi$, and set $v = \beta^* - \hat\beta$ and $u = \theta^* - \hat\theta$. We arrive at
$$\frac{1}{n}\|Xv\|_2^2 = \frac{1}{n} v^\top X^\top X v \le -v^\top (X^{(n)})^\top u - \frac{1}{n} v^\top X^\top \xi + \lambda \big(\|\beta^*\|_1 - \|\hat\beta\|_1\big).$$
On the one hand, the duality inequality and the lower bound on $\lambda$ imply that $|v^\top X^\top \xi| \le \|v\|_1 \|X^\top \xi\|_\infty \le n\lambda \|v\|_1 / 2$. On the other hand, well-known arguments yield $\|\beta^*\|_1 - \|\hat\beta\|_1 \le 2\|v_S\|_1 - \|v\|_1$. Therefore, we have
$$\frac{1}{n}\|Xv\|_2^2 \le |v^\top (X^{(n)})^\top u| + \frac{\lambda}{2}\big(4\|v_S\|_1 - \|v\|_1\big). \qquad (11)$$
Since $X$ satisfies the $ATP_I(c_1, c_2, c_3)$, which implies the $TP_I(c_1, c_2)$, we get $c_1^2 \|v\|_2^2 \le \frac{2}{n}\|Xv\|_2^2 + 2c_2^2 \|v\|_1^2$.
Combining with (11), this yields
$$c_1^2 \|v\|_2^2 \le 2|v^\top (X^{(n)})^\top u| + \lambda\big(4\|v_S\|_1 - \|v\|_1\big) + 2c_2^2 \|v\|_1^2$$
$$\le 2b_3 \|v\|_2 \|u\|_1 + 2b_2 \|v\|_1 \|u\|_2 + \lambda\big(4\|v_S\|_1 - \|v\|_1\big) + 2c_2^2 \|v\|_1^2 \qquad \text{(by } IP_I(0, b_2, b_3)\text{)}$$
$$\le \frac{c_1^2}{2}\|v\|_2^2 + \frac{2b_3^2}{c_1^2}\|u\|_1^2 + \|v\|_1\big(2b_2\|u\|_2 - \lambda\big) + 4\lambda \|v_S\|_1 + 2c_2^2 \|v\|_1^2. \qquad (12)$$
Using the first inequality in (10) and condition (8), we upper bound $(2b_2\|u\|_2 - \lambda)$ by 0. To upper bound the second-to-last term, we use the Cauchy-Schwarz inequality: $4\lambda\|v_S\|_1 \le 4\lambda\sqrt{s}\,\|v\|_2 \le (4/c_1)^2 \lambda^2 s + (c_1/2)^2 \|v\|_2^2$. Combining all these bounds and rearranging the terms, we arrive at
$$\frac{c_1^2}{4}\|v\|_2^2 \le 2\{(b_3/c_1) \vee c_2\}^2 \big(\|u\|_1 + \|v\|_1\big)^2 + (4/c_1)^2 \lambda^2 s.$$
Taking the square root of both sides and using the second inequality in (10), we obtain an inequality of the same type as (9), but with slightly larger constants.

³ The first two references deal with the small-dimensional case only, that is, where $s = p \ll n$.
As a concluding remark for this sketch of proof, let us note that if, instead of using the last arguments, we replace all the error terms appearing in (12) by the upper bounds provided by (10), we do not get the optimal rate.

3 The case of Gaussian design

Our main result, Theorem 1, shows that if the design matrix satisfies the transfer principle and the incoherence property with suitable constants, then the $\ell_1$-penalized Huber's $M$-estimator achieves the optimal rate under adversarial contamination. As a concrete example of a design matrix for which the aforementioned conditions are satisfied, we consider the case of correlated Gaussian design. As opposed to most prior work on robust estimation for linear regression with Gaussian design, we allow the covariance matrix to have a nondegenerate null space. We simply assume that the $n$ rows of the matrix $X$ are independently drawn from the Gaussian distribution $N_p(0, \Sigma)$ with a covariance matrix $\Sigma$ satisfying the $RE(s, 5)$ condition. We also assume in this section that all the diagonal entries of $\Sigma$ are equal to 1: $\Sigma_{jj} = 1$. The more formal statements of the results, provided in the supplementary material, do not require this condition.

Theorem 2. Let $\delta \in (0, 1/7)$ be a tolerance level and $n \ge 100$.
For every positive semi-definite matrix $\Sigma$ with all diagonal entries bounded by one, with probability at least $1 - 2\delta$, the matrix $X$ satisfies the $TP_\Sigma(a_1, a_2)$, the $IP_\Sigma(b_1, b_2, b_3)$ and the $ATP_\Sigma(c_1, c_2, c_3)$ with constants
$$a_1 = 1 - \frac{4.3 + \sqrt{2\log(9/\delta)}}{\sqrt{n}}, \qquad b_1 = \frac{4.8\big(2 + \sqrt{2\log(81/\delta)}\big)}{\sqrt{n}}, \qquad c_1 = \frac{3}{4} - \frac{17.5 + 9.6\sqrt{2\log(2/\delta)}}{\sqrt{n}},$$
$$a_2 = b_2 = 1.2\sqrt{\frac{2\log p}{n}}, \qquad b_3 = 1.2\sqrt{\frac{2\log n}{n}}, \qquad c_2 = 3.6\sqrt{\frac{2\log p}{n}}, \qquad c_3 = 2.4\sqrt{\frac{2\log n}{n}}.$$

The proof of this result is provided in the supplementary material. It relies on tools that are by now standard, such as Gordon's comparison inequality, the Gaussian concentration inequality and the peeling argument. Note that the $TP_\Sigma$ and related results have been obtained in Raskutti et al. (2010); Oliveira (2016); Rudelson and Zhou (2013). The $IP_\Sigma$ is basically a combination of a high-probability version of Chevet's inequality (Vershynin, 2018, Exercises 8.7.3-4) and the peeling argument. A property similar to the $ATP_\Sigma$ for Gaussian matrices with nondegenerate covariance was established in (Nguyen and Tran, 2013, Lemma 1) under further restrictions on $n, p, s, o$.

Theorem 3.
There exist universal positive constants $d_1, d_2, d_3$ such that if
$$\frac{s \log p}{\kappa^2} + o \log n \le d_1 n \qquad\text{and}\qquad 1/7 \ge \delta \ge 2e^{-d_2 n},$$
then, with probability at least $1 - 4\delta$, the $\ell_1$-penalized Huber's $M$-estimator with $\lambda_s^2\, n = 9\sigma^2 \log(p/\delta)$ and $\lambda_o^2\, n = 8\sigma^2 \log(n/\delta)$ satisfies
$$\|\Sigma^{1/2}(\hat\beta - \beta^*)\|_2 \le d_3 \sigma \Big\{ \Big(\frac{s \log(p/\delta)}{n\kappa^2}\Big)^{1/2} + \frac{o \log(n/\delta)}{n} \Big\}. \qquad (13)$$

Even though the constants appearing in Theorem 2 are reasonably small, and smaller than in analogous results in prior work, the constants $d_1$, $d_2$ and $d_3$ are large, too large to be of any practical relevance. Finally, let us note that if $s$ and $o$ are known, it is very likely that, following the techniques developed in (Bellec et al., 2018, Theorem 4.2), one can replace the terms $\log(p/\delta)$ and $\log(n/\delta)$ in (13) by $\log(p/s\delta)$ and $\log(n/o\delta)$, respectively.

Comparing Theorem 3 with (Nguyen and Tran, 2013, Theorem 1), we see that our rate improvement is not only in terms of the dependence on the proportion of outliers, $o/n$, but also in terms of the condition number $\kappa$, which is now completely decoupled from $o$ in the risk bound.

While our main focus is on the high-dimensional situation in which $p$ can be larger than $n$, our result also applies to the case of small-dimensional dense vectors, i.e., when $s = p$ is significantly smaller than $n$.
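As a concrete illustration of the estimator analyzed in Theorems 1 and 3, the following sketch solves the augmented Lasso (3) by proximal gradient descent (ISTA) and compares it with ordinary least squares on data with adversarial label outliers. This is not the authors' code; the solver, the sample sizes and the tuning parameters (chosen roughly in the spirit of the scalings $\lambda_s \asymp \sigma\sqrt{\log p / n}$ and $\lambda_o \asymp \sigma\sqrt{\log n / n}$) are our own illustrative choices:

```python
import numpy as np

def l1_huber(X, y, lam_s, lam_o, n_iter=5000):
    """l1-penalized Huber's M-estimator, computed as the Lasso (3) on the
    augmented design M = [X, sqrt(n) I_n], solved by proximal gradient (ISTA)."""
    n, p = X.shape
    M = np.hstack([X, np.sqrt(n) * np.eye(n)])
    lam = np.concatenate([np.full(p, lam_s), np.full(n, lam_o)])
    step = n / np.linalg.norm(M, 2) ** 2          # 1/L for the smooth part
    w = np.zeros(p + n)                           # w = [beta; theta]
    for _ in range(n_iter):
        w = w - step * (M.T @ (M @ w - y)) / n    # gradient step on (1/2n)||y - Mw||^2
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-thresholding
    return w[:p], w[p:]

rng = np.random.default_rng(1)
n, p, s, o, sigma = 200, 20, 3, 20, 0.1
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:s] = 1.0
y = X @ beta_star + sigma * rng.standard_normal(n)
y[:o] += 10.0                                     # adversarial (one-sided) label outliers

beta_hat, theta_hat = l1_huber(X, y, lam_s=0.05, lam_o=0.06)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]    # non-robust least-squares baseline

err_huber = np.linalg.norm(beta_hat - beta_star)
err_ls = np.linalg.norm(beta_ls - beta_star)
print(f"Huber error {err_huber:.3f} vs least-squares error {err_ls:.3f}")
```

On such draws the corrupted labels are absorbed by $\hat\theta$ (its large coordinates sit on the contaminated indices), so $\hat\beta$ is close to $\beta^*$, while the least-squares baseline is visibly biased by the outliers.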
One of the applications of such a setting is the problem of stylized communication considered, for instance, in (Candès and Randall, 2008). The problem is to transmit a signal $\beta^* \in \mathbb{R}^p$ to a remote receiver. What the receiver gets is a linearly transformed codeword $X\beta^*$ corrupted by small noise and malicious errors. While all the entries of the received codeword are affected by noise, only a fraction of them is corrupted by malicious errors, corresponding to outliers. The receiver has access to the corrupted version of $X\beta^*$ as well as to the encoding matrix $X$. Theorem 3.1 from (Candès and Randall, 2008) establishes that the Dantzig selector (Candès and Tao, 2007), for a properly chosen tuning parameter proportional to the noise level, achieves the (sub-optimal) rate $\sigma^2(s + o)/n$, up to a logarithmic factor. A similar result, with a noise-level-free version of the Dantzig selector, was proved in (Dalalyan and Chen, 2012). Our Theorem 3 implies that the error of the $\ell_1$-penalized Huber's estimator goes to zero at the faster rate $\sigma^2\{(s/n) + (o/n)^2\}$.

Finally, one can deduce from Theorem 3 that as soon as the number of outliers satisfies $o = o(\sqrt{sn/\kappa^2})$, the rate of convergence remains the same as in the outlier-free setting.

4 Prior work

As attested by early references such as (Tukey, 1960), robust estimation has a long history. A remarkable, by now classic, result by Huber (1964) shows that among all the shift-invariant $M$-estimators of a location parameter, the one that minimizes the asymptotic variance corresponds to the loss function $\phi(x) = (1/2)\{x^2 \wedge (2|x| - 1)\}$. This result was proved in the case where the reference distribution is univariate Gaussian.
Apart from some exceptions, such as (Yatracos, 1985), for several decades the literature on robust estimation mainly explored the notions of breakdown point, influence function, asymptotic efficiency, etc.; see for instance (Donoho and Gasko, 1992; Hampel et al., 2005; Huber and Ronchetti, 2009) and the recent survey (Yu and Yao, 2017). A more recent trend in statistics is to focus on finite-sample risk bounds that are minimax-rate-optimal when the sample size $n$, the dimension $p$ of the unknown parameter and the number $o$ of outliers tend jointly to infinity (Chen et al., 2018, 2016; Gao, 2017).

In the problem of estimating the mean of a multivariate Gaussian distribution, it was shown that the optimal rate of the estimation error measured in the Euclidean norm scales as $(p/n)^{1/2} + (o/n)$. Similar results were established for the problem of robust linear regression as well. However, the estimator shown to achieve this rate under fairly general conditions on the design is based on minimizing regression depths, which is a hard computational problem. Several alternative robust estimators with polynomial complexity were proposed (Diakonikolas et al., 2016; Lai et al., 2016; Cheng et al., 2019; Collier and Dalalyan, 2017; Diakonikolas et al., 2018).

Many recent papers have studied robust linear regression. (Karmalkar and Price, 2018) considered $\ell_1$-constrained minimization of the $\ell_1$-norm of the residuals and found a sharp threshold on the proportion of outliers determining whether the estimation error tends to zero or not when the noise level goes to zero. From a methodological point of view, the $\ell_1$-penalized Huber's estimator has been considered in (She and Owen, 2011; Lee et al., 2012). These papers also contain comprehensive empirical evaluations and proposals for data-driven choices of the tuning parameters.
Robust sparse regression with an emphasis on contaminated design was investigated in (Chen et al., 2013; Balakrishnan et al., 2017; Diakonikolas et al., 2019; Liu et al., 2018, 2019). Iterative and adaptive hard-thresholding approaches were considered in (Bhatia et al., 2017; Suggala et al., 2019). Methods based on penalizing the vector of outliers were studied by Li (2013); Foygel and Mackey (2014); Adcock et al. (2018), who adopted a more signal-processing point of view in which the noise vector is known to have a small $\ell_2$-norm and nothing else is known about it. We should stress that our proof techniques share many common features with those in (Foygel and Mackey, 2014).

The problem of robust estimation of graphical models, closely related to the present work, was addressed in (Balmand and Dalalyan, 2015; Katiyar et al., 2019; Liu et al., 2019). Quite surprisingly, at least to us, the minimax rate of robust estimation of the precision matrix in the Frobenius norm is not known yet.

5 Extensions

The results presented in the previous sections pave the way for some future investigations, discussed below. None of these extensions is carried out in this work; they are listed here as possible avenues for future research.

Contaminated design In addition to the labels, the features might also be corrupted by outliers. This is the case, for instance, in Gaussian graphical models. Formally, this means that instead of observing the clean data $\{(X^\circ_i, y^\circ_i);\ i = 1, \dots, n\}$ satisfying $y^\circ_i = (X^\circ_i)^\top \beta^* + \xi_i$, we observe $\{(X_i, y_i);\ i = 1, \dots, n\}$ such that $(X_i, y_i) = (X^\circ_i, y^\circ_i)$ for all $i$ except for a fraction of outliers $i \in O$.
In such a setting, we can set $\theta^*_i = (y_i - X_i^\top \beta^* - \xi_i)/\sqrt{n}$ and recover exactly the same model as in (1).

The important difference compared with the setting investigated in the previous sections is that it is no longer reasonable to assume that the feature vectors $\{X_i : i \in O\}$ are i.i.d. Gaussian. In the adversarial setting, they may even be correlated with the noise vector $\xi$. It is then natural to remove all the observations for which $\max_j |X_{ij}| > \sqrt{2\log(np/\delta)}$ and to assume that the $\ell_1$-penalized Huber estimator is applied to data for which $\max_{ij} |X_{ij}| \le \sqrt{2\log(np/\delta)}$. This implies that $\lambda$ can be chosen of the order of⁴ $\sigma\,\tilde O(n^{-1/2} + (o/n))$, which is an upper bound on $\|X^\top \xi\|_\infty / n$.

In addition, the $TP_\Sigma$ is clearly satisfied, since it is satisfied for the submatrix $X_{O^c}$ and $\|Xv\|_2 \ge \|X_{O^c} v\|_2$. As for the $IP_\Sigma$, we know from Theorem 2 that $X_{O^c}$ satisfies the $IP_\Sigma$ with constants $b_1, b_2, b_3$ of order $\tilde O(n^{-1/2})$.
On the other hand,

    |u_O^⊤ X_O v| ≤ ‖X‖_∞ ‖u_O‖_1 ‖v‖_1 ≤ √(2 o log(np/δ)) ‖u_O‖_2 ‖v‖_1.

This implies that X satisfies IP_Σ with b1 = Õ(n^{-1/2}), b2 = Õ((o/n)^{1/2}) and b3 = Õ(n^{-1/2}). Applying Theorem 1, we obtain that if (so + o²) log(np) ≤ cn for a sufficiently small constant c > 0, then with high probability

    ‖Σ^{1/2}(β̂ − β*)‖_2 = σ Õ{ √(s/n) + √(o/n) + √(s+o) (1/√n + o/n) } = σ Õ{ √(s/n) + √(o/n) + o√s/n + √(o³)/n }.

This rate of convergence appears to be slower than those obtained by methods tailored to deal with corruption in the design, see (Liu et al., 2018, 2019) and the references therein. Using a more careful analysis, this rate might be improvable. On the positive side, unlike many of its competitors, the estimator β̂ has the advantage of being independent of the covariance matrix Σ and of the sparsity s. Furthermore, the upper bound does not depend, even logarithmically, on ‖β*‖_2. Finally, if o³ ≤ sn, our bound yields the minimax-optimal rate. To the best of our knowledge, none of the previously studied robust estimators has such a property.

Sub-Gaussian design  The proof of Theorem 2 makes use of some results, such as the Gordon-Sudakov-Fernique inequality or the Gaussian concentration inequality, which are specific to the Gaussian distribution.
A natural question is whether the rate σ{(s log(p/s)/n)^{1/2} + o/n} can be obtained for more general design distributions. In the case of a sub-Gaussian design with scale parameter 1, it should be possible to adapt the methodology developed in this work to show that TP_Σ and IP_Σ are satisfied with high probability. Indeed, for proving IP_Σ, it is possible to replace Gordon's comparison inequality by Talagrand's sub-Gaussian comparison inequality (Vershynin, 2018, Cor. 8.6.2). The Gaussian concentration inequality can be replaced by generic chaining.

Heavier-tailed noise distributions  For simplicity, we assumed in the paper that the random variables ξ_i are drawn from a Gaussian distribution. As usual in the Lasso analysis, all the results extend to the case of sub-Gaussian noise, see (Koltchinskii, 2011). Indeed, we only need to control the tail probabilities of the random variables ‖X^⊤ξ‖_∞ and ‖ξ‖_∞, which can be done using standard tools. We believe that it is possible to extend our results beyond sub-Gaussian noise, by assuming some type of heavy-tailed distribution. The rationale behind this is that any random variable ξ can be written (in many different ways) as a sum of a sub-Gaussian variable ξ^noise and a "sparse" variable ξ^out. By "sparse" we mean that ξ^out takes the value 0 with high probability. The most naive way of getting such a decomposition is to set ξ^noise = ξ·1(|ξ| < τ) and ξ^out = ξ·1(|ξ| ≥ τ). The random noise terms ξ^out_i can be merged with θ_i and considered as outliers.

⁴We use the notation a_n = Õ(b_n) as a shorthand for a_n ≤ C b_n log^c n for some C, c > 0 and for every n.
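As a quick numerical sanity check of this decomposition, the following sketch splits a heavy-tailed sample into a bounded part and a sparse part. The Student-t noise with 3 degrees of freedom and the truncation level τ = 5 are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.standard_t(df=3, size=1000)   # heavy-tailed noise (illustrative choice)
tau = 5.0                               # truncation level (assumption)

xi_noise = xi * (np.abs(xi) < tau)     # bounded, hence sub-Gaussian, component
xi_out = xi * (np.abs(xi) >= tau)      # "sparse" component: zero with high probability

# The decomposition is exact, and only a few entries of xi_out are non-zero
assert np.allclose(xi, xi_noise + xi_out)
print("non-zero outlier terms:", np.count_nonzero(xi_out))
```

In the regression model, the non-zero entries of ξ^out can then be absorbed into the outlier vector θ, so that the remaining noise is sub-Gaussian.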
We hope that this approach can establish a connection between two types of robustness: robustness to outliers, considered in this work, and robustness to heavy tails, considered in many recent papers (Devroye et al., 2016; Catoni, 2012; Minsker, 2018; Lugosi and Mendelson, 2019; Lecué and Lerasle, 2017).

6 Numerical illustration

We performed a synthetic experiment to illustrate the obtained theoretical result and to check that it is in line with numerical results. We chose n = 1000 and p = 100 for 3 different levels of sparsity s = 5, 15, 25. The noise variance was set to 1 and β* was set to have its first s non-zero coordinates equal to 10. Each corrupted response coordinate was set to θ*_j = 10. The fraction ε = o/n of outliers ranged from 0 to 0.25, with a step-size of 5 in the number of outliers o. The MSE was computed using 200 independent repetitions. The optimisation problem in (3) was solved using the glmnet package with the tuning parameters λ_s = λ_o = √((8/n)(log(p/s) + log(n/o))). The obtained plots clearly demonstrate a linear dependence on ε of the square root of the mean squared error. The R-notebook of this experiment can be found in the supplementary material.

7 Conclusion

We provided the first proof of the rate-optimality, up to logarithmic terms that can be avoided, of the ℓ1-penalized Huber's M-estimator in the setting of robust linear regression with adversarial contamination. We established this result under the assumption that the design is Gaussian with a covariance matrix Σ that need not be invertible.
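Before closing, a compact illustration of the Section 6 pipeline may be helpful. The sketch below is a Python analogue of the experiment, not the authors' R/glmnet notebook: the problem sizes, the corruption magnitude, and the tuning parameter are scaled-down assumptions. It exploits the standard equivalence between the ℓ1-penalized Huber M-estimator and a jointly ℓ1-penalized least-squares problem in (β, θ), i.e. a Lasso on the augmented design [X, √n I_n] (in the spirit of She and Owen, 2011), solved here by proximal gradient (ISTA):

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(2)
n, p, s, o = 200, 50, 5, 10            # scaled-down sizes (assumptions)
sigma = 1.0

beta_star = np.zeros(p)
beta_star[:s] = 3.0
X = rng.standard_normal((n, p))
y = X @ beta_star + sigma * rng.standard_normal(n)
y[:o] = 20.0                            # adversarially corrupted labels (illustrative)

# Joint l1-penalized least squares in (beta, theta): a Lasso on Z = [X, sqrt(n) I]
Z = np.hstack([X, np.sqrt(n) * np.eye(n)])
lam = sigma * np.sqrt(2.0 * np.log(n * p) / n)   # illustrative tuning parameter
step = n / np.linalg.norm(Z, 2) ** 2             # 1/L, L = Lipschitz constant of the gradient

w = np.zeros(p + n)
for _ in range(2000):                   # ISTA iterations on (1/2n)||Zw - y||^2 + lam ||w||_1
    w = soft(w - step * Z.T @ (Z @ w - y) / n, lam * step)

beta_hat, theta_hat = w[:p], w[p:]
```

On the corrupted observations the fitted θ̂_i is typically non-zero, so the quadratic loss is effectively replaced there by Huber's linear branch, and β̂ remains close to β* despite the o corrupted labels.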
The condition number governing the risk bound is the ratio of the largest diagonal entry of Σ to its restricted eigenvalue. Thus, in addition to improving the rate of convergence, we also relaxed the assumptions on the design. Furthermore, we outlined some possible extensions, namely to corrupted design and/or sub-Gaussian design, which seem to be fairly easy to carry out building on the current work.
Next on our agenda is a more thorough analysis of robust estimation by ℓ1-penalization in the case of contaminated design. A possible approach, complementary to the one described in Section 5 above, is to adopt an errors-in-variables point of view similar to that developed in (Belloni et al., 2016). Another interesting avenue for future research is the development of scale-invariant robust estimators and their adaptation to Gaussian graphical models. This can be done using the methodology brought forward in (Sun and Zhang, 2013; Balmand and Dalalyan, 2015). Finally, we would like to better understand the largest fraction of outliers for which the ℓ1-penalized Huber's M-estimator has a risk, measured in Euclidean norm, upper bounded by σo/n. Answering this question, even under the stringent assumption of a design with independent standard Gaussian entries X_ij and with (s log p)/n going to zero as n tends to infinity, would be of interest.

[Figure: square root of the MSE as a function of the fraction of outliers ε ∈ [0, 0.25], for s = 5, 15, 25.]

References

Adcock, B., Bao, A., Jakeman, J., and Narayan, A. (2018). Compressed sensing with sparse corruptions: fault-tolerant sparse collocation approximations. SIAM/ASA Journal on Uncertainty Quantification, 6(4):1424–1453.

Balakrishnan, S., Du, S. S., Li, J., and Singh, A. (2017). Computationally efficient robust sparse estimation in high dimensions. Proceedings of the 2017 Conference on Learning Theory, PMLR, 65:169–212.

Balmand, S. and Dalalyan, A. S. (2015). Convex programming approach to robust estimation of a multivariate Gaussian model. arXiv:1512.04734.

Bellec, P. C. (2017). Localized Gaussian width of M-convex hulls with applications to Lasso and convex aggregation. arXiv:1705.10696.

Bellec, P. C., Lecué, G., and Tsybakov, A. B. (2018). Slope meets Lasso: improved oracle bounds and optimality. Ann. Statist., 46(6B):3603–3642.

Belloni, A., Rosenbaum, M., and Tsybakov, A. B. (2016). An {ℓ1, ℓ2, ℓ∞}-regularization approach to high-dimensional errors-in-variables models. Electron. J. Statist., 10(2):1729–1750.

Bhatia, K., Jain, P., Kamalaruban, P., and Kar, P. (2017). Consistent robust regression. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2107–2116.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Candès, E. and Randall, P. A. (2008). Highly robust error correction by convex programming. IEEE Trans. Inform. Theory, 54(7):2829–2840.

Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351.

Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. Henri Poincaré Probab. Stat., 48(4):1148–1185.

Chen, M., Gao, C., and Ren, Z. (2016). A general decision theory for Huber's ε-contamination model. Electron. J. Statist., 10(2):3752–3774.

Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. Ann. Statist., 46(5):1932–1960.

Chen, Y., Caramanis, C., and Mannor, S. (2013). Robust sparse regression under adversarial corruption. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 774–782. PMLR.

Cheng, Y., Diakonikolas, I., and Ge, R. (2019). High-dimensional robust mean estimation in nearly-linear time. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2755–2771.

Collier, O. and Dalalyan, A. S. (2017). Minimax estimation of a p-dimensional linear functional in sparse Gaussian models and robust estimation of the mean. arXiv:1712.05495.

Dalalyan, A. S. and Chen, Y. (2012). Fused sparsity and robust estimation for linear models with unknown variance. In Advances in Neural Information Processing Systems 25: NIPS, pages 1268–1276.

Devroye, L., Lerasle, M., Lugosi, G., and Oliveira, R. I. (2016). Sub-Gaussian mean estimators. Ann. Statist., 44(6):2695–2725.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. (2016). Robust estimators in high dimensions without the computational intractability. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 655–664. IEEE.

Diakonikolas, I., Kamath, G., Kane, D. M., Li, J., Moitra, A., and Stewart, A. (2018). Robustly learning a Gaussian: getting optimal error, efficiently. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2683–2702.

Diakonikolas, I., Kong, W., and Stewart, A. (2019). Efficient algorithms and lower bounds for robust linear regression. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2745–2754.

Donoho, D. and Montanari, A. (2016). High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probability Theory and Related Fields, 166(3):935–969.

Donoho, D. L. and Gasko, M. (1992). Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann. Statist., 20(4):1803–1827.

Foygel, R. and Mackey, L. (2014). Corrupted sensing: novel guarantees for separating structured signals. IEEE Trans. Inform. Theory, 60(2):1223–1247.

Gao, C. (2017). Robust regression via multivariate regression depth. arXiv:1702.04656.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (2005). Robust Statistics: The Approach Based on Influence Functions. Wiley Series in Probability and Mathematical Statistics. Wiley.

Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist., 35(1):73–101.

Huber, P. J. and Ronchetti, E. M. (2009). Robust Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, NJ, second edition.

Karmalkar, S. and Price, E. (2018). Compressed sensing with adversarial sparse noise via ℓ1 regression. arXiv:1809.08055.

Katiyar, A., Hoffmann, J., and Caramanis, C. (2019). Robust estimation of tree structured Gaussian graphical models. arXiv:1901.08770.

Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics. Springer Berlin Heidelberg.

Lai, K. A., Rao, A. B., and Vempala, S. (2016). Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE.

Laska, J. N., Davenport, M. A., and Baraniuk, R. G. (2009). Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. In Asilomar Conference on Signals, Systems and Computers, pages 1556–1560.

Lecué, G. and Lerasle, M. (2017). Robust machine learning by median-of-means: theory and practice. arXiv:1711.10306.

Lee, Y., MacEachern, S. N., and Jung, Y. (2012). Regularization of case-specific parameters for robustness and efficiency. Statist. Sci., 27(3):350–372.

Li, X. (2013). Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99.

Liu, L., Li, T., and Caramanis, C. (2019). High dimensional robust estimation of sparse models via trimmed hard thresholding. CoRR, abs/1901.08237.

Liu, L., Shen, Y., Li, T., and Caramanis, C. (2018). High dimensional robust sparse regression. CoRR, abs/1805.11643.

Lugosi, G. and Mendelson, S. (2019). Sub-Gaussian estimators of the mean of a random vector. Ann. Statist., 47(2):783–794.

Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. Ann. Statist., 46(6A):2871–2903.

Nguyen, N. H. and Tran, T. D. (2013). Robust Lasso with missing and grossly corrupted observations. IEEE Trans. Inform. Theory, 59(4):2036–2058.

Oliveira, R. (2013). The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties. arXiv:1312.2903.

Oliveira, R. (2016). The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3-4):1175–1194.

Raskutti, G., Wainwright, M. J., and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res., 11:2241–2259.

Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory, 59(6):3434–3447.

She, Y. and Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639.

Suggala, A. S., Bhatia, K., Ravikumar, P., and Jain, P. (2019). Adaptive hard thresholding for near-optimal consistent robust regression. CoRR, abs/1903.08192.

Sun, T. and Zhang, C.-H. (2013). Sparse matrix inversion with scaled Lasso. Journal of Machine Learning Research, 14:3385–3418.

Tukey, J. W. (1960). A survey of sampling from contaminated distributions. Contributions to Probability and Statistics.

Vershynin, R. (2018). High-Dimensional Probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.

Yatracos, Y. G. (1985). Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Ann. Statist., 13(2):768–774.

Yu, C. and Yao, W. (2017). Robust linear regression: a review and comparison. Comm. Statist. Simulation Comput., 46(8):6261–6282.