{"title": "Consistent Robust Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2110, "page_last": 2119, "abstract": "We present the first efficient and provably consistent estimator for the robust regression problem. The area of robust learning and optimization has generated a significant amount of interest in the learning and statistics communities in recent years owing to its applicability in scenarios with corrupted data, as well as in handling model mis-specifications. In particular, special interest has been devoted to the fundamental problem of robust linear regression, where estimators that can tolerate corruption in up to a constant fraction of the response variables are widely studied. Surprisingly, however, to date we are not aware of a polynomial time estimator that offers a consistent estimate in the presence of dense, unbounded corruptions. In this work we present such an estimator, called CRR. This solves an open problem put forward in the work of (Bhatia et al., 2015). Our consistency analysis requires a novel two-stage proof technique involving a careful analysis of the stability of ordered lists, which may be of independent interest. We show that CRR not only offers consistent estimates, but is empirically far superior to several other recently proposed algorithms for the robust regression problem, including extended Lasso and the TORRENT algorithm.
In comparison, CRR offers comparable or better model recovery but with runtimes that are faster by an order of magnitude.", "full_text": "Consistent Robust Regression\n\nKush Bhatia*\nUniversity of California, Berkeley\nkushbhatia@berkeley.edu\n\nPrateek Jain\nMicrosoft Research, India\nprajain@microsoft.com\n\nParameswaran Kamalaruban†\nEPFL, Switzerland\nkamalaruban.parameswaran@epfl.ch\n\nPurushottam Kar\nIndian Institute of Technology, Kanpur\npurushot@cse.iitk.ac.in\n\nAbstract\n\nWe present the first efficient and provably consistent estimator for the robust regression problem. The area of robust learning and optimization has generated a significant amount of interest in the learning and statistics communities in recent years owing to its applicability in scenarios with corrupted data, as well as in handling model mis-specifications. In particular, special interest has been devoted to the fundamental problem of robust linear regression, where estimators that can tolerate corruption in up to a constant fraction of the response variables are widely studied. Surprisingly, however, to date we are not aware of a polynomial time estimator that offers a consistent estimate in the presence of dense, unbounded corruptions. In this work we present such an estimator, called CRR. This solves an open problem put forward in the work of [3]. Our consistency analysis requires a novel two-stage proof technique involving a careful analysis of the stability of ordered lists, which may be of independent interest. We show that CRR not only offers consistent estimates, but is empirically far superior to several other recently proposed algorithms for the robust regression problem, including extended Lasso and the TORRENT algorithm.
In comparison, CRR offers comparable or better model recovery but with runtimes that are faster by an order of magnitude.\n\n1 Introduction\n\nThe problem of robust learning involves designing and analyzing learning algorithms that can extract the underlying model despite dense, possibly malicious, corruptions in the training data provided to the algorithm. The problem has been studied in a dizzying variety of models and settings, ranging from regression [19] and classification [11] to dimensionality reduction [4] and matrix completion [8]. In this paper we are interested in the Robust Least Squares Regression (RLSR) problem, which finds several applications in robust methods for face recognition and vision [22, 21], and in economics [19]. In this problem, we are given a set of n covariates in d dimensions, arranged as a data matrix X = [x_1, . . . , x_n], and a response vector y ∈ R^n. However, it is known a priori that a certain number k of these responses cannot be trusted since they are corrupted. These may correspond to corrupted pixels in visual recognition tasks or untrustworthy measurements in general sensing tasks.\n\nUsing these corrupted data points in any standard least-squares solver, especially when k = O(n), is likely to yield a poor model with little predictive power.\n\n*Work done in part while Kush was a Research Fellow at Microsoft Research India.\n†Work done in part while Kamalaruban was interning at Microsoft Research India.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nA solution to this is to exclude corrupted\n\nTable 1: A comparison of different RLSR algorithms and their properties.
CRR is the first efficient RLSR algorithm to guarantee consistency in the presence of a constant fraction of corruptions.\n\nPaper | Breakdown Point | Adversary | Consistent | Technique\nWright & Ma, 2010 [21] | α → 1 | Oblivious | No | L1 regularization\nChen & Dalalyan, 2010 [7] | α ≥ Ω(1) | Adaptive | No | SOCP\nChen et al., 2013 [6] | α ≥ Ω(1/√d) | Adaptive | No | Robust thresholding\nNguyen & Tran, 2013 [16] | α → 1 | Oblivious | No | L1 regularization\nNguyen & Tran, 2013b [17] | α → 1 | Oblivious | No | L1 regularization\nMcWilliams et al., 2014 [14] | α ≥ Ω(1/√d) | Oblivious | No | Weighted subsampling\nBhatia et al., 2015 [3] | α ≥ Ω(1) | Adaptive | No | Hard thresholding\nThis paper | α ≥ Ω(1) | Oblivious | Yes | Hard thresholding\n\npoints from consideration. The RLSR problem formalizes this requirement as follows:\n\n(ŵ, Ŝ) = arg min_{w ∈ R^d, S ⊂ [n], |S| = n−k} Σ_{i ∈ S} (y_i − x_i^⊤ w)^2.   (1)\n\nThis formulation seeks to simultaneously extract the set of uncorrupted points and estimate the least-squares solution over those uncorrupted points. Due to the combinatorial nature of the RLSR formulation (1), solving it directly is challenging and, in fact, NP-hard in general [3, 20].\n\nLiterature in robust statistics suggests several techniques to solve (1). The most common model assumes a realizable setting wherein there exists a gold model w∗ that generates the non-corrupted responses. A vector of corruptions is then introduced to model the corrupted responses, i.e.\n\ny = X^⊤ w∗ + b∗.   (2)\n\nThe goal of RLSR is to recover w∗ ∈ R^d, the true model.
The vector b∗ ∈ R^n is a k-sparse vector that takes non-zero values on at most k corrupted samples out of the n total samples, and is zero elsewhere. A more useful, but challenging, model is one in which (mostly homoscedastic and i.i.d.) Gaussian noise is injected into the responses in addition to the corruptions:\n\ny = X^⊤ w∗ + b∗ + ε.   (3)\n\nNote that the Gaussian noise vector ε is not sparse. In fact, we have ‖ε‖_0 = n almost surely.\n\n2 Related Works\n\nA string of recent works has looked at the RLSR problem in various settings. To facilitate a comparison among these, we set the following benchmarks for RLSR algorithms:\n\n1. (Breakdown Point) The number of corruptions k that an RLSR algorithm can tolerate is a direct measure of its robustness. This limit is formalized as the breakdown point of the algorithm in the statistics literature. The breakdown point k is frequently represented as a fraction α of the total number of data points, i.e. k = α · n.\n\n2. (Adversary Model) RLSR algorithms frequently resort to an adversary model to specify how the corruptions are introduced into the regression problem. The strictest is the adaptive adversarial model, wherein the adversary is able to view X and w∗ (as well as ε if Gaussian noise is present) before deciding upon b∗. A weaker model is the oblivious adversarial model, wherein the adversary generates a k-sparse vector in complete ignorance of X and w∗ (and ε). However, the adversary is still free to make arbitrary choices for the locations and values of the corruptions.\n\n3. (Consistency) RLSR algorithms that are able to operate in the hybrid noise model, with sparse adversarial corruptions as well as dense Gaussian noise, are more valuable.
An RLSR algorithm is said to be consistent if, when invoked in the hybrid noise model on n data points sampled from a distribution with appropriate characteristics, it returns an estimate ŵ_n such that lim_{n→∞} E[‖ŵ_n − w∗‖_2^2] = 0 (for simplicity, assume a fixed covariate design with the expectation being over the random Gaussian noise in the responses).\n\nIn Table 1, we present a summarized view of existing RLSR techniques and their performance vis-à-vis the benchmarks discussed above. Past work has seen the application of a wide variety of algorithmic techniques to this problem, including more expensive methods involving L1 regularization (for example min_{w,b} λ_w ‖w‖_1 + λ_b ‖b‖_1 + ‖X^⊤ w + b − y‖_2^2) and second-order cone programs, such as [21, 7, 16, 17], as well as more scalable methods such as robust thresholding and iterative hard thresholding [6, 3]. As the work of [3] shows, L1 regularization and other expensive methods struggle to scale to even moderately sized problems.\n\nThe adversary models considered by these works are also quite diverse. Half of the works consider an oblivious adversary and the other half brace themselves against an adaptive adversary. The oblivious adversary model, although weaker, can model some important practical situations where there is systematic error in the sensing equipment being used, such as a few pixels in a camera becoming unresponsive. Such errors are surely not random, and hence cannot be modeled as Gaussian noise, but they introduce corruptions into the final measurement in a manner that is oblivious of the signal actually being sensed, in this case the image being photographed.\n\nAn important point of consideration is the breakdown point of these methods.
Among those cited in Table 1, the works of [21] and [16] obtain the best breakdown points, allowing a fraction of corrupted points that is arbitrarily close to 1. They require the data to be generated from either an isotropic Gaussian ensemble or be row-sampled from an incoherent orthogonal matrix. Most results mentioned in the table allow a constant fraction of points to be corrupted, i.e. allow k = α · n corruptions for some fixed constant α > 0. This is still impressive since it allows a dense subset of data points to be corrupted and yet guarantees recovery. However, as we shall see below, these results cannot guarantee consistency while allowing k = α · n corruptions.\n\nWe note that we use the term dense to refer to the corruptions in our model since they are a constant fraction of the total available data. Moreover, as we shall see, this constant is universal and independent of the ambient dimensionality d. This terminology is used to contrast against some other works which can tolerate only o(n) corruptions, which is arguably much sparser. For instance, as we shall see below, the work of [17] can tolerate only o(n/log n) corruptions if a consistent estimate is expected. The work of [6] also offers a weak guarantee wherein they are only able to tolerate a 1/√d fraction of corruptions. However, [6] allows corruptions in the covariates as well.\n\nHowever, we note that none of the algorithms listed here, and to the best of our knowledge elsewhere as well, are able to guarantee a consistent solution, irrespective of assumptions on the adversary model. More specifically, none of these methods are able to guarantee exact recovery of w∗, even with n → ∞ and a constant fraction of corruptions α = Ω(1) (i.e. k = Ω(n)).
At best, they guarantee ‖w − w∗‖_2 ≤ O(σ) when k = Ω(n), where σ is the standard deviation of the white noise (see Equation 3). Thus, their estimation error is of the order of the white noise in the system, even if the algorithm is supplied with an infinite amount of data. This is quite unsatisfactory, given our deep understanding of the consistency guarantees for least squares models.\n\nFor example, consider the work of [17], which considers a corruption model similar to (3). The work makes deterministic assumptions on the data matrix and proposes the following convex program:\n\nmin_{w,b} λ_w ‖w‖_1 + λ_b ‖b‖_1 + ‖X^⊤ w + b − y‖_2^2.   (4)\n\nFor Gaussian designs, which we also consider, their results guarantee that for n = O(s log d),\n\n‖ŵ − w∗‖_2 + ‖b̂ − b∗‖_2 ≤ O( √(σ^2 s log d log n / n) + √(σ^2 k log n / n) ),\n\nwhere s is the sparsity index of the regressor w∗. Note that for k = Θ(n), the right hand side behaves as Ω(σ √(log n)). Thus, the result is unable to ensure lim_{n→∞} E[‖ŵ_n − w∗‖_2^2] = 0.\n\nWe have excluded from the table some classical approaches to the RLSR problem, such as [18, 1, 2], which use the Least Median of Squares (LMS) and Least Trimmed Squares (LTS) methods that guarantee consistency but may require exponential running time. Our focus is on polynomial time algorithms, more specifically those that are efficient and scalable.
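Mechanically, the convex program (4) is just a Lasso over the augmented design [X^⊤ I] with separate penalty weights for the w-block and the b-block, so a generic accelerated proximal-gradient (FISTA-style) solver applies directly. The sketch below is our own illustration of this reduction; the function and parameter names are ours, not those of [17] or of the implementation used later in the experiments:

```python
import numpy as np

def soft(v, t):
    # soft-thresholding: the proximal operator of the weighted L1 penalty
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def extended_lasso(X, y, lam_w, lam_b, iters=2000):
    """FISTA-style sketch of program (4):
    min_{w,b} lam_w*||w||_1 + lam_b*||b||_1 + ||X^T w + b - y||_2^2,
    viewed as a Lasso on the augmented design A = [X^T, I]."""
    d, n = X.shape
    A = np.hstack([X.T, np.eye(n)])                 # n x (d+n) augmented design
    lam = np.concatenate([np.full(d, lam_w), np.full(n, lam_b)])
    L = 2.0 * np.linalg.eigvalsh(A.T @ A).max()     # Lipschitz constant of the gradient
    theta, z, s = np.zeros(d + n), np.zeros(d + n), 1.0
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ z - y)
        theta_new = soft(z - grad / L, lam / L)     # proximal-gradient step
        s_new = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        z = theta_new + ((s - 1.0) / s_new) * (theta_new - theta)  # momentum
        theta, s = theta_new, s_new
    return theta[:d], theta[d:]                     # estimates of (w, b)
```

Consistent with the bound quoted above, the recovered b absorbs large corruptions (up to the λ_b shrinkage), while the error in w grows with the number of corrupted responses.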
We note a recent work [5] in robust stochastic optimization which is able to tolerate a fraction of corruptions α arbitrarily close to 1. However, their algorithms operate in the list-decoding model, wherein they output not one, but as many as O(1/(1−α)) models, of which one (unknown) model is guaranteed to be correct.\n\nRecovering Sparse High-dimensional Models: We note that several previous works extend their methods and analyses to handle the case of sparse robust recovery in high-dimensional settings as well, including [3, 7, 17]. A benefit of such extensions is the ability to work even in data-starved settings n ≪ d if the true model w∗ is s-sparse with s ≪ d. However, previous works continue to require the number of corruptions to be of the order of k = o(n), or else k = O(n/s), in order to ensure that lim_{n→∞} E[‖ŵ_n − w∗‖_2^2] = 0, and cannot ensure consistency if k = O(n). This is evident, for example, from the recovery guarantee offered by [17] discussed above, which requires k = o(n/log n). We do believe our CRR estimator can be adapted to high-dimensional settings as well. However, the details are tedious and we reserve them for an expanded version of the paper.\n\n3 Our Contributions\n\nIn this paper, we remedy the above problem by using a simple and scalable iterative hard-thresholding algorithm called CRR along with a novel two-stage proof technique. Given n covariates that form a Gaussian ensemble, our method, in time poly(n, d), outputs an estimate ŵ_n s.t. ‖ŵ_n − w∗‖_2 → 0 as n → ∞ (see Theorem 4 for a precise statement). In fact, our method guarantees a nearly optimal error rate of ‖ŵ_n − w∗‖_2 ≤ σ√(d/n). It is noteworthy that CRR can tolerate a constant fraction of corruptions, i.e.
tolerate k = α · n corruptions for some fixed α > 0.\n\nWe note that although hard thresholding techniques have been applied to the RLSR problem earlier [3, 6], none of those methods are able to guarantee a consistent solution to the problem. Our results hold in the setting where a constant fraction of the responses are corrupted by an oblivious adversary (i.e. one which corrupts observations without information about the data points themselves). Our algorithm runs in time Õ(d^3 + nd), where d is the dimensionality of the data. Moreover, as we shall see, our technique makes more efficient use of data than previous hard thresholding methods such as TORRENT [3].\n\nTo the best of our knowledge, this is the first efficient and consistent estimator for the RLSR problem in the challenging setting where a constant fraction of the responses may be corrupted in the presence of dense noise. We would like to note that the problem of consistent robust regression is especially challenging because, without the assumption of an oblivious adversary, consistent estimation with a constant fraction of corruptions (even for an arbitrarily small constant) may be impossible even when supplied with infinitely many data points.\n\nHowever, by crucially using the restriction of obliviousness on the adversary along with a novel proof technique, we are able to provide a consistent estimator for RLSR with optimal (up to constants) statistical and computational complexity.\n\nDiscussion on Problem Setting: We clarify that our improvements come at a cost.
Our results assume an oblivious adversary, whereas several previous works allowed a fully adaptive adversary. Indeed, there is no free lunch: it seems unlikely that consistent estimators are even possible in the face of a fully adaptive adversary who can corrupt a constant fraction of responses, since such an adversary can use this power to introduce biased noise into the model in order to defeat any estimator. An oblivious adversary is prohibited from looking at the responses before deciding the corruptions and is thus unable to do the above.\n\nPaper Organization: We begin our discussion by introducing the problem formulation, relevant notation, and tools in Section 4. This is followed by Section 5, where we develop CRR, a near-linear time algorithm that gives consistent estimates for the RLSR problem, which we analyze in Section 6. Finally, in Section 7, we present rigorous experimental benchmarking of this algorithm. In Section 8 we offer some clarifications on how the manuscript was modified in response to reviewer comments.\n\n4 Problem Formulation\n\nWe are given n data points X = [x_1, . . . , x_n] ∈ R^{d×n}, where x_i ∈ R^d are the covariates and, for some true model w∗ ∈ R^d, the vector of responses y ∈ R^n is generated as\n\ny = X^⊤ w∗ + b∗ + ε.   (5)\n\nThe responses suffer two kinds of perturbations: dense white noise ε_i ∼ N(0, σ^2) that is chosen in an i.i.d. fashion independently of the data X and the model w∗, and adversarial corruptions\n\nAlgorithm 1 CRR: Consistent Robust Regression\nInput: Covariates X = [x_1, . . . , x_n], responses y = [y_1, . . .
, y_n]^⊤, corruption index k, tolerance ε\n1: b^0 ← 0, t ← 0, P_X ← X^⊤ (X X^⊤)^{−1} X\n2: while ‖b^t − b^{t−1}‖_2 > ε do\n3:   b^{t+1} ← HT_k(P_X b^t + (I − P_X) y)\n4:   t ← t + 1\n5: end while\n6: return w^t ← (X X^⊤)^{−1} X (y − b^t)\n\nin the form of b∗. We assume that b∗ is a k∗-sparse vector, albeit one with potentially unbounded entries. The constant k∗ will be called the corruption index of the problem. We assume the oblivious adversary model, where b∗ is chosen independently of X, w∗ and ε.\n\nAlthough there exist works that operate under a fully adaptive adversary [3, 7], none of these works are able to give a consistent estimate, whereas our algorithm CRR does provide a consistent estimate. We also note that existing works are unable to give consistent estimates even in the oblivious adversary model. Our result requires a significantly finer analysis; the standard ℓ2-norm style analysis used by existing works [3, 7] seems incapable of offering a consistent estimation result in the robust regression setting.\n\nWe will require the notions of Subset Strong Convexity and Subset Strong Smoothness similar to [3] and reproduce the same below. For any set S ⊂ [n], let X_S := [x_i]_{i∈S} ∈ R^{d×|S|} denote the matrix with columns in that set. We define v_S for a vector v ∈ R^n similarly. λ_min(X) and λ_max(X) will denote, respectively, the smallest and largest eigenvalues of a square symmetric matrix X.\n\nDefinition 1 (SSC Property). A matrix X ∈ R^{d×n} is said to satisfy the Subset Strong Convexity Property at level m with constant λ_m if the following holds:\n\nλ_m ≤ min_{|S|=m} λ_min(X_S X_S^⊤).\n\nDefinition 2 (SSS Property).
A matrix X ∈ R^{d×n} is said to satisfy the Subset Strong Smoothness Property at level m with constant Λ_m if the following holds:\n\nmax_{|S|=m} λ_max(X_S X_S^⊤) ≤ Λ_m.\n\nIntuitively speaking, the SSC and SSS properties ensure that the regression problem remains well conditioned, even if restricted to an arbitrary subset of the data points. This allows the estimator to recover the exact model no matter what portion of the data was left uncorrupted by the adversary. We refer the reader to Appendix A for SSC/SSS bounds for Gaussian ensembles.\n\n5 CRR: A Hard Thresholding Approach to Consistent Robust Regression\n\nWe now present a consistent method CRR for the RLSR problem. CRR takes a significantly different approach to the problem than previous works. Instead of attempting to exclude data points deemed unclean (as done by the TORRENT algorithm proposed by [3]), CRR focuses on correcting the errors. This allows CRR to work with the entire dataset at all times, as opposed to TORRENT, which works with a fraction of the data at any given point of time.\n\nTo motivate the CRR algorithm, we start with the RLSR formulation min_{w ∈ R^d, ‖b‖_0 ≤ k∗} (1/2) ‖X^⊤ w − (y − b)‖_2^2, and realize that given any estimate b̂ of the corruption vector, the optimal model with respect to this estimate is given by the expression ŵ = (X X^⊤)^{−1} X (y − b̂). Plugging this expression for ŵ into the formulation allows us to reformulate the RLSR problem as\n\nmin_{‖b‖_0 ≤ k∗} f(b) = (1/2) ‖(I − P_X)(y − b)‖_2^2,   (6)\n\nwhere P_X = X^⊤ (X X^⊤)^{−1} X. This greatly simplifies the problem by casting it as a sparse parameter estimation problem instead of a data subset selection problem (as done by TORRENT).
CRR directly optimizes (6) by using a form of iterative hard thresholding. Notice that this approach allows CRR to keep using the entire set of data points at all times, all the while using the current estimate of the parameter b to correct the errors in the observations. At each step, CRR performs the following update: b^{t+1} = HT_k(b^t − ∇f(b^t)), where k is a parameter for CRR. Any value k ≥ 2k∗ suffices to ensure convergence and consistency, as will be clarified in the theoretical analysis. The hard thresholding operator HT_k(·) is defined below.\n\nDefinition 3 (Hard Thresholding). For any v ∈ R^n, let the permutation σ_v ∈ S_n order the elements of v in descending order of their magnitudes. Then for any k ≤ n, we define the hard thresholding operator as v̂ = HT_k(v), where v̂_i = v_i if σ_v^{−1}(i) ≤ k, and 0 otherwise.\n\nWe note that CRR functions with a fixed, unit step length, which is convenient in practice as it avoids step length tuning, something most IHT algorithms [12, 13] require. For simplicity of exposition, we will consider only Gaussian ensembles for the RLSR problem, i.e. x_i ∼ N(0, Σ); our proof technique works for general sub-Gaussian ensembles with appropriate distribution-dependent parameters. Since CRR interacts with the data only through the projection matrix P_X, for Gaussian ensembles one can assume without loss of generality that the data points are generated from a spherical Gaussian, i.e. x_i ∼ N(0, I_{d×d}). Our analysis will take care of the condition number of the data ensemble wherever it is apparent in the convergence rates.\n\nBefore moving on to present the consistency and convergence guarantees for CRR, we note that Gaussian ensembles are known to satisfy the SSC/SSS properties with high probability.
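Algorithm 1 is straightforward to implement. The following is a minimal NumPy sketch of CRR in the paper's notation (the stopping rule and parameter names are ours); the projection matrix P_X is formed explicitly here for clarity, though at scale one would apply it implicitly:

```python
import numpy as np

def hard_threshold(v, k):
    # HT_k(v): keep the k largest-magnitude entries of v, zero out the rest
    out = np.zeros_like(v)
    top = np.argsort(np.abs(v))[-k:]
    out[top] = v[top]
    return out

def crr(X, y, k, eps=1e-6, max_iters=500):
    """CRR (Algorithm 1). X is d x n with covariates as columns, y in R^n.
    Iterates b <- HT_k(P_X b + (I - P_X) y); returns the least-squares
    fit on the corrected responses y - b."""
    d, n = X.shape
    P = X.T @ np.linalg.solve(X @ X.T, X)   # P_X = X^T (X X^T)^{-1} X
    r = y - P @ y                           # (I - P_X) y, precomputed
    b = np.zeros(n)
    for _ in range(max_iters):
        b_new = hard_threshold(P @ b + r, k)
        done = np.linalg.norm(b_new - b) <= eps
        b = b_new
        if done:
            break
    return np.linalg.solve(X @ X.T, X @ (y - b))
```

As noted above, any budget k ≥ 2k∗ suffices for the guarantees; in practice k is tuned by a simple binary search.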
For instance, in the case of the standard Gaussian ensemble, SSC/SSS constants of the order of Λ_m ≤ O(√(mn) + m√(log n)) and λ_m ≥ n − O(√(n(n−m)) + (n−m)√(log n)) hold with high probability. These results are known from previous works [3, 10] and are reproduced in Appendix A.\n\n6 Consistency Guarantees for CRR\n\nTheorem 4. Let x_i ∈ R^d, 1 ≤ i ≤ n, be generated i.i.d. from a Gaussian distribution, let the y_i's be generated using (5) for a fixed w∗, and let σ^2 be the noise variance. Also let the number of corruptions k∗ be such that 2k∗ ≤ k ≤ n/10000. Then for any ε, δ > 0, with probability at least 1 − δ, after O(log(‖b∗‖_2 / (σ√k + ε)) + log n) steps, CRR ensures that ‖w^t − w∗‖_2 ≤ ε + O(σ √((d/n) log(nd))).\n\nThe above result establishes consistency of the CRR method with an error rate of Õ(σ√(d/n)), which is known to be statistically optimal. It is notable that this optimal rate is ensured in the presence of gross and unbounded outliers. We reiterate that, to the best of our knowledge, this is the first instance of a poly-time algorithm being shown to be consistent for the RLSR problem. It is also notable that the result allows the corruption index to be k∗ = Ω(n), i.e. allows up to a constant fraction of the total number of data points to be arbitrarily corrupted, while ensuring consistency, which existing results [3, 6, 16] do not ensure.\n\nWe pause a bit to clarify some points regarding the result. Firstly, we note that the upper bound on the recovery error consists of two terms.
The first term is ε, which can be made arbitrarily small simply by executing the CRR algorithm for more iterations. The second term is more crucial and underscores the consistency properties of CRR: it is of the form O(σ √(d log(nd)/n)) and is easily seen to vanish as n → ∞ for any constant d, σ. Secondly, we note that the result requires k∗ ≤ n/20000, i.e. α ≤ 1/20000. Although this constant might seem small, we stress that these constants are not the best possible, since we preferred analyses that were more accessible. Indeed, in our experiments, we found CRR to be robust to much higher corruption levels than what Theorem 4 guarantees. Thirdly, we notice that the result requires CRR to be executed with the corruption index set to a value k ≥ 2k∗. In practice, the value of k can be easily tuned using a simple binary search because of the speed of execution that CRR offers (see Section 7).\n\nFor our analysis, we will divide CRR's execution into two phases: a coarse convergence phase and a fine convergence phase. CRR enjoys a linear rate of convergence in both phases. However, the coarse convergence analysis will only ensure ‖w^t − w∗‖_2 = O(σ). The fine convergence phase will then use a much more careful analysis of the algorithm to show that in at most O(log n) more iterations, CRR ensures ‖w^t − w∗‖_2 = Õ(σ√(d/n)), thus establishing consistency of the method. Existing methods, such as TORRENT, ensure an error level of O(σ), but no better.\n\nAs shorthand notation, let λ^t := (X X^⊤)^{−1} X (b^t − b∗), g := (I − P_X) ε, and v^t = X^⊤ λ^t + g.
Let S∗ := supp(b∗) be the true locations of the corruptions and I^t := supp(b^t) ∪ supp(b∗).\n\nCoarse convergence: Here we establish a result that guarantees that after a certain number of steps T_0, CRR identifies the corruption vector with relatively high accuracy and consequently ensures that ‖w^{T_0} − w∗‖_2 ≤ O(σ).\n\nLemma 5. For any data matrix X that satisfies the SSC and SSS properties such that 2Λ_{k+k∗}/λ_n < 1, CRR, when executed with k ≥ k∗, ensures for any ε, δ > 0, with probability at least 1 − δ (over the random Gaussian noise ε in the responses; see (3)), that after T_0 = O(log(‖b∗‖_2 / (e_0 + ε))) steps, ‖b^{T_0} − b∗‖_2 ≤ 3 e_0 + ε, where e_0 = O(σ √((k + k∗) log(n / (δ(k + k∗))))) for standard Gaussian designs.\n\nUsing Lemma 12 (see the appendix), we can translate the above result to show that ‖w^{T_0} − w∗‖_2 ≤ 0.95 σ + ε, assuming k∗ ≤ k ≤ n/150. However, Lemma 5 will be more useful in the following fine convergence analysis.\n\nFine convergence: We now show that CRR progresses further at a linear rate to achieve a consistent solution. In Lemma 6, we show that ‖X(b^t − b∗)‖_2 decreases linearly in every iteration t > T_0, up to an additive term which is Õ(√(dn)). The proof proceeds by showing that for any fixed λ^t such that ‖λ^t‖_2 ≤ σ/100, we obtain a linear decrease in ‖λ^{t+1}‖_2 = ‖(X X^⊤)^{−1} X (b^{t+1} − b∗)‖_2. We then take a union bound over a fine ε-net over all possible values of λ^t to obtain the final result.\n\nLemma 6. Let X = [x_1, x_2, . . . , x_n] be a data matrix consisting of i.i.d.
standard normal vectors, i.e. x_i ∼ N(0, I_{d×d}), and let ε ∼ N(0, σ^2 · I_{n×n}) be a Gaussian vector of white noise values drawn independently of X. For any λ ∈ R^d such that ‖λ‖_2 ≤ σ/100, define b_new = HT_k(X^⊤ λ + ε + b∗), z_new = b_new − b∗ and λ_new = (X X^⊤)^{−1} X z_new, where k ≥ 2k∗, |supp(b∗)| ≤ k∗, k∗ ≤ n/10000, and d ≤ n/10000. Then, with probability at least 1 − 1/n^5, for every λ s.t. ‖λ‖_2 ≤ σ/100, we have\n\n‖X z_new‖_2 ≤ 0.9 n ‖λ‖_2 + 100 σ √(dn) log^2 n,\n‖λ_new‖_2 ≤ 0.91 ‖λ‖_2 + 110 σ √(d/n) log^2 n.\n\nPutting all these results together establishes Theorem 4. See Appendix B for a detailed proof. Note that while both the coarse and fine stages offer a linear rate of convergence, it is the fine phase that ensures consistency. Indeed, the coarse phase only acts as a sort of good-enough initialization. Several results in non-convex optimization assume a nice initialization “close” to the optimum (alternating minimization, EM, etc.). In our case, we have a happy situation where the initialization and main algorithms are one and the same. Note that we could have actually used other algorithms, e.g. TORRENT, to perform the initialization as well, since TORRENT [3, Theorem 10] essentially offers the same (weak) guarantee as Lemma 5.\n\n7 Experiments\n\nExperiments were carried out on synthetically generated linear regression datasets with corruptions. All implementations were done in Matlab and were run on a single core 2.4GHz machine with 8GB RAM.
The experiments establish the following: 1) CRR gives consistent estimates of the regression model, especially in situations with a large number of corruptions where the ordinary least squares estimator fails catastrophically, and 2) CRR scales better to large datasets than the TORRENT-FC algorithm of [3] (up to 5× faster) and the Extended Lasso algorithm of [17] (up to 20× faster). The main reason behind this speedup is that TORRENT keeps changing its mind on which active set of points it wishes to work with. Consequently, it expends a lot of effort processing each active set. CRR, on the other hand, does not face such issues since it always works with the entire set of points.

Data: The model w* ∈ R^d was chosen to be a random unit norm vector. The data was generated as x_i ~ N(0, I_d). The k* responses to be corrupted were chosen uniformly at random and the value of the corruptions was set as b*_i ~ Unif(10, 20). Responses were then generated as y_i = ⟨x_i, w*⟩ + η_i + b*_i, where η_i ~ N(0, σ²). All reported results were averaged over 20 random trials.

Figure 1: Variation of recovery error with varying number of data points n, dimensionality d, number of corruptions k* and white noise variance σ. CRR and TORRENT show better recovery properties than the non-robust OLS in all experiments. Extended Lasso offers comparable or slightly worse recovery in most settings. Figure 1(a) ascertains the Õ(√(1/n))-consistency of CRR, as shown in the theoretical analysis.

Figure 2: Figure 2(a) shows the average CPU run times of CRR, TORRENT and Extended Lasso with varying sample sizes. CRR can be an order of magnitude faster than TORRENT and Extended Lasso on problems in 1000 dimensions while ensuring similar recovery properties. Figures 2(b), 2(c) and 2(d) show that CRR eventually not only captures the total mass of the corruptions, but also recovers the support of the corrupted points accurately. With every iteration, CRR improves upon its estimate of b* and provides cleaner points for the estimation of w. CRR is also able to very effectively utilize larger datasets to offer much faster convergence. Notice the visibly faster convergence in Figure 2(d), which uses 10× more points than Figure 2(c).

Evaluation Metric: We measure the performance of various algorithms using the standard L2 error: r_ŵ = ||ŵ − w*||_2. For the timing experiments, we deemed an algorithm to have converged on an instance if it obtained a model w^t such that ||w^t − w^{t−1}||_2 ≤ 10^{-4}.

Baseline Algorithms: CRR was compared to three baselines: 1) the Ordinary Least Squares (OLS) estimator, which is oblivious to the presence of any corruptions in the responses, 2) the TORRENT algorithm of [3], a recently proposed method for performing robust least squares regression, and 3) the Extended Lasso (ex-Lasso) approach of [15], for which we use the FISTA implementation of [23] and choose the regularization parameters for our model data as mentioned by the authors.

Recovery Properties & Timing: CRR, TORRENT and ex-Lasso were found to be competitive, and offered much lower residual errors ||w − w*||_2 than the non-robust OLS method when varying the dataset size (Figure 1(a)), dimensionality (Figure 1(b)), number of corrupted responses (Figure 1(c)), and magnitude of white noise (Figure 1(d)). In terms of scaling properties, CRR exhibited faster runtimes than TORRENT-FC, as depicted in Figure 2(a). CRR can be up to 5× faster than TORRENT and up to 20× faster than ex-Lasso on problems of 1000 dimensions.
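The synthetic setup described above can be reproduced in a few lines. The following NumPy sketch (the function name and signature are ours) mirrors the data model: a random unit-norm w*, Gaussian covariates, and k* responses corrupted by Unif(10, 20) offsets on top of N(0, σ²) white noise.

```python
import numpy as np

def make_corrupted_regression(n, d, k_star, sigma, seed=None):
    """Generate data per the setup above: unit-norm model w*, Gaussian
    covariates x_i ~ N(0, I_d), k* responses (chosen uniformly at random)
    corrupted by b*_i ~ Unif(10, 20), and white noise eta_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)          # random unit-norm model
    X = rng.standard_normal((n, d))           # rows are the samples x_i
    b_star = np.zeros(n)
    corrupted = rng.choice(n, size=k_star, replace=False)
    b_star[corrupted] = rng.uniform(10, 20, size=k_star)
    y = X @ w_star + sigma * rng.standard_normal(n) + b_star
    return X, y, w_star, b_star
```

With n = 2000, d = 500, k* = 600, and σ = 1, this matches the configuration used in most panels of Figures 1 and 2.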
Figure 2(a) suggests that executing both TORRENT and ex-Lasso becomes very expensive with an order of magnitude increase in the dimension parameter of the problem, while CRR scales gracefully. Also, Figures 2(c) and 2(d) show the variation of ||b^t − b*||_2 for various values of the noise parameter σ. The plots depict the fact that as σ → 0, CRR is correctly able to identify all the corrupted points and estimate the level of corruption correctly, thereby returning the exact solution w*. Notice that in Figure 2(d), which utilizes more data points, CRR offers uniformly faster convergence across all white noise levels.

Choice of Potential Function: In Lemmata 5 and 6, we show that ||b^t − b*||_2 decreases with every iteration. Figures 2(c) and (d) back this theoretical statement by showing that CRR's estimate of b* improves with every iteration. Along with estimating the magnitude of b*, Figure 2(b) shows that CRR is also able to correctly identify the support of the corrupted points with increasing iterations.

8 Response to Reviewer Comments

We are thankful to the reviewers for their comments aimed at improving the manuscript. Below we offer some clarifications regarding the same.

1. We have fixed all typographical errors pointed out in the reviews.
2. We have included additional references as pointed out in the reviews.
3. We have improved the presentation of the statements of the results to make the theorem and lemma statements more crisp and self-contained.
4. We have fixed minor inconsistencies in the figures by executing the experiments afresh.
5. We note that CRR's reduction of the robust recovery problem to sparse recovery is not only novel, but also one that offers impressive speedups in practice over the fully corrective version of the existing TORRENT algorithm [3]. However, note that the reduction to sparse recovery actually hides a sort of "fully corrective" step wherein the optimal model for a particular corruption estimate is used internally in the formulation. Thus, CRR is implicitly a fully corrective algorithm as well.
6. We agree with the reviewers that further efforts are needed to achieve results with sharper constants. For example, CRR offers robustness up to a breakdown fraction of 1/20000 which, although a constant, nevertheless leaves room for improvement. Having shown for the first time that tolerating a non-trivial, universally constant fraction of corruptions is possible in polynomial time, it is indeed encouraging to study how far the breakdown point can be pushed for various families of algorithms.
7. Our current efforts are aimed at solving robust sparse recovery problems in high-dimensional settings in a statistically consistent manner, as well as extending the consistency properties established in this paper to non-Gaussian, for example fixed, designs.

Acknowledgments

The authors thank the reviewers for useful comments.
PKar is supported by the Deep Singh and Daljeet Kaur Faculty Fellowship and the Research-I Foundation at IIT Kanpur, and thanks Microsoft Research India and Tower Research for research grants. KB gratefully acknowledges the support of the NSF through grant IIS-1619362.

References

[1] J. Ámos Víšek. The least trimmed squares. Part I: Consistency. Kybernetika, 42:1–36, 2006.
[2] J. Ámos Víšek. The least trimmed squares. Part II: √n-consistency. Kybernetika, 42:181–202, 2006.
[3] K. Bhatia, P. Jain, and P. Kar. Robust Regression via Hard Thresholding. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015.
[4] E. J. Candès, X. Li, and J. Wright. Robust Principal Component Analysis? Journal of the ACM, 58(1):1–37, 2009.
[5] M. Charikar, J. Steinhardt, and G. Valiant. Learning from Untrusted Data. arXiv:1611.02315 [cs.LG], 2016.
[6] Y. Chen, C. Caramanis, and S. Mannor. Robust Sparse Regression under Adversarial Corruption. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
[7] Y. Chen and A. S. Dalalyan. Fused sparsity and robust estimation for linear models with unknown variance. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.
[8] Y. Cherapanamjeri, K. Gupta, and P. Jain. Nearly-optimal Robust Matrix Completion. arXiv:1606.07315 [cs.LG], 2016.
[9] F. Cucker and S. Smale. On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1–49, 2001.
[10] M. A. Davenport, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk. A Simple Proof that Random Matrices are Democratic. Technical Report TREE0906, Rice University, Department of Electrical and Computer Engineering, 2009.
[11] J. Feng, H. Xu, S. Mannor, and S. Yan. Robust Logistic Regression and Classification.
In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
[12] R. Garg and R. Khandekar. Gradient Descent with Sparsification: An Iterative Algorithm for Sparse Recovery with Restricted Isometry Property. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.
[13] P. Jain, A. Tewari, and P. Kar. On Iterative Hard Thresholding Methods for High-dimensional M-estimation. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
[14] B. McWilliams, G. Krummenacher, M. Lucic, and J. M. Buhmann. Fast and Robust Least Squares Estimation in Corrupted Linear Models. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
[15] N. M. Nasrabadi, T. D. Tran, and N. Nguyen. Robust Lasso with Missing and Grossly Corrupted Observations. In Advances in Neural Information Processing Systems (NIPS), pages 1881–1889, 2011.
[16] N. H. Nguyen and T. D. Tran. Exact recoverability from dense corrupted observations via ℓ1-minimization. IEEE Transactions on Information Theory, 59(4):2017–2035, 2013.
[17] N. H. Nguyen and T. D. Tran. Robust Lasso With Missing and Grossly Corrupted Observations. IEEE Transactions on Information Theory, 59(4):2036–2058, 2013.
[18] P. J. Rousseeuw. Least Median of Squares Regression. Journal of the American Statistical Association, 79(388):871–880, 1984.
[19] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley and Sons, 1987.
[20] C. Studer, P. Kuppinger, G. Pope, and H. Bölcskei. Recovery of Sparsely Corrupted Signals. IEEE Transactions on Information Theory, 58(5):3115–3130, 2012.
[21] J. Wright and Y. Ma. Dense Error Correction via ℓ1-Minimization. IEEE Transactions on Information Theory, 56(7):3540–3560, 2010.
[22] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma.
Robust Face Recognition via Sparse Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2009.
[23] A. Y. Yang, Z. Zhou, A. G. Balasubramanian, S. S. Sastry, and Y. Ma. Fast ℓ1-minimization algorithms for robust face recognition. IEEE Transactions on Image Processing, 22(8):3234–3246, 2013.