{"title": "Iterative Least Trimmed Squares for Mixed Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 6078, "page_last": 6088, "abstract": "Given a linear regression setting, Iterative Least Trimmed Squares (ILTS) involves alternating between (a) selecting the subset of samples with lowest current loss, and (b) re-fitting the linear model only on that subset. Both steps are very fast and simple. In this paper, we analyze ILTS in the setting of mixed linear regression with corruptions (MLR-C). We first establish deterministic conditions (on the features etc.) under which the ILTS iterate converges linearly to the closest mixture component. We also provide a global algorithm that uses ILTS as a subroutine, to fully solve mixed linear regressions with corruptions. We then evaluate it for the widely studied setting of isotropic Gaussian features, and establish that we match or better existing results in terms of sample complexity. Finally, we provide an ODE analysis for a gradient-descent variant of ILTS that has optimal time complexity. Our results provide initial theoretical evidence that iteratively fitting to the best subset of samples -- a potentially widely applicable idea -- can provably provide state of the art performance in bad training data settings.", "full_text": "Iterative Least Trimmed Squares for Mixed Linear\n\nRegression\n\nYanyao Shen\nECE Department\n\nUniversity of Texas at Austin\n\nAustin, TX 78712\n\nshenyanyao@utexas.edu\n\nSujay Sanghavi\nECE Department\n\nUniversity of Texas at Austin\n\nAustin, TX 78712\n\nsanghavi@mail.utexas.edu\n\nAbstract\n\nGiven a linear regression setting, Iterative Least Trimmed Squares (ILTS) involves\nalternating between (a) selecting the subset of samples with lowest current loss,\nand (b) re-\ufb01tting the linear model only on that subset. Both steps are very fast and\nsimple. 
In this paper we analyze ILTS in the setting of mixed linear regression with corruptions (MLR-C). We first establish deterministic conditions (on the features etc.) under which the ILTS iterate converges linearly to the closest mixture component. We also evaluate it for the widely studied setting of isotropic Gaussian features, and establish that we match or better existing results in terms of sample complexity. We then provide a global algorithm that uses ILTS as a subroutine, to fully solve mixed linear regressions with corruptions. Finally, we provide an ODE analysis for a gradient-descent variant of ILTS that has optimal time complexity. Our results provide initial theoretical evidence that iteratively fitting to the best subset of samples – a potentially widely applicable idea – can provably provide state-of-the-art performance in bad training data settings.

1 Introduction

In vanilla linear regression, one (implicitly) assumes that each sample is a linear measurement of a single unknown vector, which needs to be recovered from these measurements. Statistically, it is typically studied in the setting where the samples come from such a ground truth unknown vector, and we are interested in the (computational/statistical complexity of) recovery of this ground truth vector. Mixed linear regression (MLR for brevity) is the problem where there are multiple unknown vectors, and each sample can come from any one of them (and we do not know which one, a-priori). Our objective is again to recover all (or some, or one) of them from the samples. In this paper, we consider MLR with the additional presence of corruptions – i.e. adversarial additive errors in the responses – for some unknown subset of the samples.
There is now a healthy and quickly growing body of work on algorithms, and corresponding theoretical guarantees, for MLR with and without additive noise and corruptions; we review these in detail in the related work section.
In our paper we start from a classical (but hard to compute) approach from robust statistics: least trimmed squares [19]. This advocates fitting a model so as to minimize the loss on only a fraction τ of the samples, instead of all of them – but crucially, the subset S of samples chosen and the model to fit them are to be estimated jointly. To be more specific, suppose our samples are (x_i, y_i), for i ∈ [n].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Then the least squares (LS) and least trimmed squares (LTS) estimates are:

$$\hat\theta_{\mathrm{LS}} = \arg\min_{\theta} \sum_{i \in [n]} (y_i - \langle x_i, \theta \rangle)^2, \qquad \hat\theta_{\mathrm{LTS}} = \arg\min_{\theta} \; \min_{S :\, |S| = \lfloor \tau n \rfloor} \sum_{i \in S} (y_i - \langle x_i, \theta \rangle)^2.$$

Note that least trimmed squares involves a parameter: the fraction τ of samples we want to fit. Solving for the least trimmed squares estimate θ̂_LTS needs to address the combinatorial issue of finding the best subset to fit, but the goodness of a subset is only known once it is fit. LTS has been shown to have a computational lower bound exponential in the dimension of x [17].
LTS, if one could solve it, would be a candidate algorithm for MLR as follows: suppose we knew a lower bound on the fraction of samples corresponding to a single component (i.e. generated using one of the unknown vectors). Then one would choose the fraction τ in the LTS procedure to be smaller than this lower bound. Ideally, this would lead the LTS to choose a subset S of samples that all correspond to a single component, and the least squares on that set S would find the corresponding unknown vector.
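Since the goodness of a subset is only known once it is fit, the exact LTS estimate requires a search over subsets. The following minimal sketch (our own illustrative code, not from the paper; brute force is feasible only for tiny n, in line with the exponential hardness result of [17]) makes the contrast with ordinary least squares concrete:

```python
import itertools

import numpy as np

def least_squares(X, y):
    # Ordinary least squares fit on all n samples.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def least_trimmed_squares_bruteforce(X, y, tau):
    # Exact LTS: try every subset of size floor(tau * n), fit least squares
    # on it, and keep the subset/model pair with the smallest trimmed loss.
    # The subset loop is combinatorial, so this is a reference
    # implementation for tiny problems only.
    n = len(y)
    k = int(tau * n)
    best_loss, best_theta = np.inf, None
    for subset in itertools.combinations(range(n), k):
        subset = list(subset)
        theta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        loss = np.sum((y[subset] - X[subset] @ theta) ** 2)
        if loss < best_loss:
            best_loss, best_theta = loss, theta
    return best_theta
```

Even at n = 50 and τ = 0.5 this loop would visit more than 10^14 subsets, which is exactly the intractability that motivates the iterative variant studied in this paper.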
This is easiest to see in the noiseless, corruption-less setting where each sample is just a pure linear equation in the corresponding unknown vector. In this case, an S containing samples only from one component, and a θ which is the corresponding ground truth vector, would give 0 error and hence would be the best solution to LTS. Hence, to summarize, one can use LTS to solve MLR by estimating a single ground truth vector at a time.
However, LTS is intractable, and we instead study the natural iterative variant of LTS, which alternates between finding the set S ⊆ [n] of samples to be fit, and the θ that fits it. In particular, our procedure – which we call iterative least trimmed squares (ILTS) – first picks a fraction τ and then proceeds in iterations (denoted by t) as follows: starting from an initial θ_0,

$$S_t = \arg\min_{S :\, |S| = \lfloor \tau n \rfloor} \sum_{i \in S} (y_i - \langle x_i, \theta_t \rangle)^2, \qquad \theta_{t+1} = \arg\min_{\theta} \sum_{i \in S_t} (y_i - \langle x_i, \theta \rangle)^2.$$

Note that now, as opposed to before, finding the subset S_t is trivial: just sort the samples by their current squared errors (y_i − ⟨x_i, θ_t⟩)², and pick the ⌊τn⌋ that have the smallest loss. Similarly, the θ update is now a simple least-squares problem on a pre-selected subset of samples. Note also that each of the above steps decreases the function a(θ, S) := Σ_{i∈S} (y_i − ⟨x_i, θ⟩)². This approach has also been referred to as iterative hard thresholding, and has been studied for the different but related problem of robust regression; again, please see the related work for known results. Our motivations for studying ILTS are several: (1) it is very simple and natural, and easy to implement in much more general scenarios beyond least squares. Linear regression represents in some sense the simplest statistical setting in which to understand this approach.
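The two alternating ILTS steps above translate directly into a few lines of NumPy (a minimal sketch under our own naming, not the authors' implementation):

```python
import numpy as np

def ilts(X, y, theta0, tau, num_rounds=30):
    """Iterative least trimmed squares: alternate between (a) keeping the
    floor(tau * n) samples with the smallest current squared residuals and
    (b) re-fitting ordinary least squares on just that subset."""
    n = len(y)
    k = int(np.floor(tau * n))
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_rounds):
        residuals = (y - X @ theta) ** 2
        subset = np.argsort(residuals)[:k]      # subset-selection step
        # Re-fit step: least squares restricted to the selected subset.
        theta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
    return theta
```

Each round costs one residual pass, one sort, and one least-squares solve on ⌊τn⌋ samples, i.e. roughly O(nd² + n log n) per round.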
(2) In spite of its simplicity, we show in the following that it manages to get state-of-the-art performance for MLR with corruptions, with weaker assumptions than several existing results.
Again as before, one can use ILTS for MLR by choosing a τ that is smaller than the fraction of samples in a component. However, additionally, we now also need to choose an initial θ_0 that is closer to one component than the others. In the following, we thus give two kinds of theoretical guarantees on its performance: a local one that shows linear convergence to the closest ground truth vector, and a global one that adds a step for good initialization.
Main contributions and outline:
• We propose a simple and efficient algorithm, ILTS, for solving MLR with adversarial corruptions; we precisely describe the problem setting in Section 3. ILTS starts with an initial estimate of a single unknown θ, and alternates between selecting the size-⌊τn⌋ subset of the samples best explained by θ, and updating θ to best fit this set. Each of these steps is very fast and easy.
• Our first result, Theorem 4 in Section 4, establishes deterministic conditions – on the features, the initialization, and the number of samples in each component – under which ILTS converges linearly to the ground truth vector that is closest to the initialization. Theorem 7 in Section 4 specializes this to the (widely studied) case when the features are isotropic Gaussians. The sample complexity is nearly optimal in both the dimension d and the number of components m, while previous state-of-the-art results are nearly optimal in d but can be exponential in m.
Our analysis for inputs following an isotropic Gaussian distribution readily generalizes to a broader class of sub-Gaussian distributions.
• To solve the full MLR problem, we identify finding the subspace spanned by the true MLR components as a core problem for initialization. In the case of isotropic Gaussian features, this is known to be possible by existing results in robust PCA (when corruptions exist) or standard spectral methods (when there are no corruptions). Given a good approximation of this subspace, one can use the ILTS process above as a subroutine with an "outer loop" that tries out many initializations (which can be done in parallel, and are not too many when the number of components is fixed and small) and evaluates whether the final estimate is to be accepted as an estimate for a ground truth vector (Global-ILTS). We specify and analyze it in Section 5 for the case of random isotropic Gaussian features, and also discuss the feasibility of finding such a subspace.

2 Related Work

Mixed linear regression. Learning MLR even in the two-mixture setting is NP-hard in general [26]. As a result, provably efficient algorithmic solutions under natural assumptions on the data, e.g., all inputs being i.i.d. isotropic Gaussian, have been studied. Efficient algorithms that provably find both components include the idea of using spectral initialization with local alternating minimization [26], and the classical EM approach with finer analysis [1, 12, 13]. In the multiple-components setting, substituting spectral initialization by tensor decomposition yields provable algorithmic solutions [5, 27, 29, 20]. Recently, [14] proposed an algorithm with nearly optimal complexity using quite different ideas: they relate the MLR problem to learning GMMs and use the black-box algorithm in [16]. In Table 1, we summarize the sample and computation complexity of the three most related works.
Previous literature focuses on the dependency on the dimension d; for all of these algorithms that achieve near-optimal sample complexity, the dependency on m is exponential (notice that the guarantees in [27] contain a term which can be exponentially small in m without further assumptions, as pointed out by [14]), and [14] requires a number of samples exponential in m² for a more general class of Gaussian distributions. Notice that while it is reasonable to assume m is a constant, this exponential dependency on m or m² could dominate the sample complexity in practice. From a robustness point of view, the analyses of all these algorithms rely heavily on exact model assumptions and are restricted to Gaussian distributions. While recent approaches to robust algorithms are able to deal with strongly convex functions, e.g., [9], with corruption in both inputs and outputs, [29] showed local strong convexity of MLR only within a small local region Õ(d(md)m), under Ω̃(dm^m) samples. To the best of our knowledge, no previous work studies the algorithmic behavior under mis-specified MLR model settings. We provide a fine-grained analysis for a simple algorithm that achieves nearly optimal sample and computation complexity.
Robust regression. Our algorithmic idea is similar to the least trimmed squares (LTS) estimator proposed by [19]. The hardness of finding the exact LTS estimator is discussed in [17], which shows a computational lower bound exponential in d, assuming the affine degeneracy conjecture. While our algorithm is similar to the previous hard-thresholding solutions proposed in [2], their analysis does not handle the MLR setting, and only guarantees parameter recovery given a small constant fraction of corruption. Algorithmic solutions based on LTS for solving more general problems have been proposed in [25, 23, 21].
[10] studies ℓ1 regression and gives a tight analysis of the recoverable corruption ratio. Another line of research focuses on robust regression where both the inputs and outputs can be corrupted, e.g., [6]. There are provable recovery guarantees under a constant ratio of corruption using robust gradient methods [9, 18, 15], and the sum-of-squares method [11]. We focus on a computationally efficient method with nearly optimal computation time that is easily scalable in practice.

3 Problem Setup and Preliminaries

We consider the standard (noiseless) MLR model with corruptions, which we will abbreviate to (MLR-C); each sample is a linear measurement of one of m unknown "ground truth" vectors – but we do not know which one. Our task is to find the ground truth vectors, and this is made harder by a constant fraction of all samples having an additional error in their responses. We now specify this formally.

method         setting                                   sample (n)                    computation
[27]           N(0, I_d), linearly independent θ*_(j)'s  poly(m)·d                     nd² + md³
[29]           N(0, I_d), constant Q                     m^m·d                         nd² + poly(m)
[14]           N(0, Σ_(j))                               d·poly(m/Q) + (cm/Q)^{m²}     nd
Ours (local)   robust, not limited to N(0, Σ_(j))        md                            nd² (nd for GD-ILTS)
Ours (global)  good estimate of the subspace             --                            subspace est. + (cm/Q)^m · nd

Table 1: Comparison with previous results in the setting of balanced MLR, i.e., each component has n/m samples. Q represents a separation property of the mixture components (see Definition 1 for details). For conciseness, we only keep the main factors in the complexity terms. The algorithms listed here achieve nearly optimal sample complexity (nearly linear in d) under certain settings, which is helpful for understanding the limits of learning MLR.
Note that we have Õ(nd² log(1/ε)) computation for ILTS and Õ(nd log²(1/ε)) for GD-ILTS (a direct gradient variant, in Section B). The sample complexity of our global step depends on the hardness of finding the subspace. Our local requirement only needs the current estimate to be close to one of the components, which is much easier to satisfy than the local notion in [27]. The methods in [5, 20] require Ω̃(d⁶) and Ω̃(d³) sample complexity (they can handle more general settings), and [28] uses sparse graph codes for sparse MLR; we therefore do not list their results here (they are hard to compare with).

(MLR-C): We are given n samples of the form (x_i, y_i) for i = 1, …, n, where each y_i ∈ ℝ and x_i ∈ ℝ^d. Unknown to us, there are m "ground truth" vectors θ*_(1), …, θ*_(m), each in ℝ^d; correspondingly, and again unknown to us, the set of samples is partitioned into disjoint sets S_(1), …, S_(m). If the i-th sample is in set S_(j) for some j ∈ [m], it satisfies

$$y_i = \langle x_i, \theta^\star_{(j)} \rangle + r_i, \quad \text{for } i \in S_{(j)} \qquad \text{(MLR-C)}.$$

Here, r_i denotes the possible additive corruption – a fraction of the r_1, …, r_n are arbitrary unknown values, and the remaining are 0 (and again, we are not told which).
Our objective is: given only the samples (x_i, y_i), find the ground truth vectors θ*_(1), …, θ*_(m). In particular, we do not have a-priori knowledge of any of the sets S_(j), or the values/support of the corruptions. We now develop some notation for the sizes of the components.
Sizes of sets: Let R* = {i ∈ [n] s.t. r_i ≠ 0} denote the set of corrupted samples; note that this set can overlap with any / all of the components' sets S_(j). Let S*_(j) = S_(j) \ R* be the uncorrupted set of samples from S_(j), for all j ∈ [m]. Let τ*_(j) = |S*_(j)|/n denote the fraction of uncorrupted samples in each component j, and τ*_min = min_{j∈[m]} τ*_(j) denote the smallest such fraction.
Let γ* = |R*|/(n·τ*_min) be the ratio of the number of corrupted samples to the size of the smallest component.¹ Notice that γ* = 0 corresponds to the MLR model without corruption. We do not make any assumptions on which specific samples are corrupted; R* can be any subset of size γ*·τ*_min·n of the set of n samples. Thus a γ* = 1 situation can prevent the recovery of the smallest component.
Finally, for convenience, we denote S*_(−j) := ∪_{l∈[m]\{j}} S*_(l) for all j ∈ [m], X = [x_1, …, x_n]^⊤ ∈ ℝ^{n×d}, and y = [y_1, …, y_n]. Note that we consider the case without additive stochastic noise, which is the same setting as in [27, 29, 14].

3.1 Preliminaries

We now build up to a few basic assumptions on the model setting; our main results show that under these assumptions the simple ILTS algorithm succeeds. The first definition quantifies the separation between the ground truth vectors.

¹ The component with fewest samples.

Algorithm 1 ILTS (for recovering a single component)
1: Input: Samples D_n = {x_i, y_i}_{i=1}^n, initial θ_0, fraction of samples to be retained τ
2: Output: Final estimate θ̂
3: Parameters: Number of rounds T
4: for t = 0 to T − 1 do
5:   S_t ← index set of the ⌊τn⌋ samples with smallest residuals (y_i − ⟨x_i, θ_t⟩)², i ∈ [n]
6:   θ_{t+1} = argmin_θ Σ_{i∈S_t} (y_i − ⟨x_i, θ⟩)²
7: Output: θ̂ = θ_T

Definition 1 (Q-separation). For the set of components {θ*_(1), …, θ*_(m)},
(i) the set of components is Q-separated if Q ≤ min_{i,j∈[m], i≠j} ||θ*_(i) − θ*_(j)||_2 / max_{j∈[m]} ||θ*_(j)||_2;
(ii) the local separation Q_j is defined as Q_j = min_{l∈[m]\{j}} ||θ*_(l) − θ*_(j)||_2 / ||θ*_(j)||_2, for all j ∈ [m].

By definition, it is clear that Q ≤ Q_j for all j ∈ [m].
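Both quantities in Definition 1 are directly computable; a small sketch (our own illustrative helper, with `thetas` stacking the m ground-truth vectors as rows) also makes the inequality Q ≤ Q_j easy to check numerically:

```python
import numpy as np

def separation(thetas):
    """Global Q and per-component local Q_j from Definition 1.
    thetas: (m, d) array whose rows are the ground-truth vectors."""
    # Pairwise gaps ||theta_(l) - theta_(j)||_2, with the diagonal masked
    # out so that minima are taken over l != j only.
    gaps = np.linalg.norm(thetas[:, None, :] - thetas[None, :, :], axis=-1)
    np.fill_diagonal(gaps, np.inf)
    norms = np.linalg.norm(thetas, axis=1)
    Q = gaps.min() / norms.max()          # global separation
    Q_local = gaps.min(axis=1) / norms    # Q_j for each component j
    return Q, Q_local
```

Since the global minimum gap is no larger than any component's nearest-neighbor gap, and the maximum norm is no smaller than any single norm, Q ≤ Q_j follows immediately from this form.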
In fact, Q represents the global separation property, which is required by previous literature for solving MLR [27, 29, 14], while Q_j describes the local separation property of the j-th component, and gives us a better characterization of the local convergence property for a single component. We now turn to the features; let X denote the n × d matrix of features, with the i-th row being x_i^⊤ – the features of the i-th sample.
Definition 2 ((ψ⁺, ψ⁻)-feature regularity). Define S_k to be the set of all subsets of [n] with size k, and let X_S be the sub-matrix of X with rows indexed by some S ⊆ [n]. Define

$$\psi^+(k) = \max_{S \in S_k} \lambda_{\max}(X_S^\top X_S), \quad \text{and} \quad \psi^-(k) = \min_{S \in S_k} \lambda_{\min}(X_S^\top X_S), \qquad (1)$$

where the functions ψ⁺(k), ψ⁻(k) are the feature regularity upper and lower bounds, respectively, and λ_max(A) (λ_min(A)) represents the largest (smallest) eigenvalue of a symmetric matrix A.

Clearly, if ψ⁺ is too large or ψ⁻ is too small, identifying whether samples belong to a certain component, even given a very good estimate of the true component, can become extremely difficult. For example, if the true component coincides with the top eigenvalue direction of its feature covariance matrix, then, even if the current estimate is close in ℓ₂, the prediction error can still be quite large due to X. On the other hand, if each row of X follows an i.i.d. isotropic Gaussian distribution, ψ⁺(k) and ψ⁻(k) are upper and lower bounded by Θ(n) for k a constant fraction of n (when n is large enough). This is shown in Lemma 5. Next, we define the κ-affine error, a property of the data that is closely connected with our analysis of ILTS in Section 4.
Definition 3 (κ-affine error V(κ)). For all j ∈ [m], denote X_(j) as the sub-matrix with rows from S*_(j) (of size n·τ*_(j)), and X_(−j) as the sub-matrix with rows from S*_(−j); let τ_(j) = c_τ·τ*_(j) for some fixed constant c_τ < 1.
Define the κ-affine error V(κ) to be the maximum value of the integer V such that the following holds for some v_1, v_2 ∈ ℝ^d with ||v_1||_2/||v_2||_2 = κ ≤ 1 and some j ∈ [m]:

$$\big[\,|X_{(j)} v_1|\,\big]_{(V + \lceil(\tau^\star_{(j)} - \tau_{(j)})n\rceil)\text{-th largest}} \;\ge\; \big[\,|X_{(-j)} v_2|\,\big]_{(V)\text{-th smallest}}. \qquad (2)$$

This says that, when we pick samples from the sets S*_(j) and S*_(−j) by ranking them and keeping the smallest ⌊τ_(j)n⌋ based on the projected values onto v_1, v_2, the number of samples from S*_(−j) is at most V(κ). For example, given a current estimate θ, the residual of a sample from component j is ⟨x_i, θ*_(j) − θ⟩, and v_1 can be considered as θ*_(j) − θ. As a result, this definition helps quantify the number of mis-classified samples from other components; see Figure 1 for another illustration. If each row of X follows an i.i.d. isotropic Gaussian distribution, V(κ) scales linearly with κ for large enough n. This is shown in Lemma 6.

4 ILTS and Local Analysis

Figure 1: A two-dimensional illustration of the κ-affine error V(κ) in Definition 3 for ||v_1||_2/||v_2||_2 = κ (for simplicity, assume τ*_(j) = τ_(j)). V(κ) can be interpreted as the number of mistakenly filtered samples over all directions. The plot in the middle contains blue dots from one component, X_(j), and red dots from the other components, X_(−j). The plots on the left and right illustrate what the histograms look like for X_(j)v_1 (in blue) and X_(−j)v_2 (in red), for two sets of v_1 and v_2. Areas in blue represent the samples that may be mistakenly filtered out. The maximum V that satisfies (2) is larger in the right-side plot since the projected values for samples from S*_(j) are more concentrated. V(κ) is an upper bound on the maximum V over all possible directions.

Algorithm 1 presents the procedure of ILTS: starting from the initial parameter θ_0, the algorithm alternates between (a) selecting the samples with smallest residuals, and (b) taking the least-squares solution on the selected set of samples as the new parameter. Intuitively, ILTS succeeds if (a) θ_0 is close to the targeted component, and (b) in each round of updates, the new parameter gets closer to the targeted component. For our analysis, we assume the chosen fraction of samples to be retained is strictly less than the fraction of samples from the component of interest, i.e., τ = c_0·τ*_(j) for some universal constant c_0 < 1. We first provide local recovery results using the structural definitions made in Section 3, for both the no-corruption setting and the corruption setting. Then, we present the result under a Gaussian design matrix. All proofs can be found in the Appendix.
Theorem 4 (deterministic features). Consider (MLR-C) using Algorithm 1 with τ < τ*_(j). Given an iterate θ_t at round t which is closer to the j-th component in Euclidean distance and satisfies ||θ_t − θ*_(j)||_2 ≤ (1/2)·min_{l∈[m]\{j}} ||θ*_(l) − θ*_(j)||_2, the next iterate θ_{t+1} of the algorithm satisfies

$$\|\theta_{t+1} - \theta^\star_{(j)}\|_2 \;\le\; \sqrt{\frac{2\,\psi^+\!\Big(V\big(\tfrac{1}{Q_j}\cdot\tfrac{2\|\theta_t - \theta^\star_{(j)}\|_2}{\|\theta^\star_{(j)}\|_2}\big) + \gamma^\star \tau^\star_{\min} n\Big)}{\psi^-(\tau n)}}\;\|\theta_t - \theta^\star_{(j)}\|_2. \qquad (3)$$

The above one-step update rule (3) holds as long as Algorithm 1 uses τ < τ*_(j) and the iterate θ_t is closer to the j-th component. However, in order to make θ_{t+1} closer to θ*_(j), the contraction factor on the RHS of (3) needs to be less than 1, which may require stronger conditions on θ_t, depending on what the x_i are. The denominator term ψ⁻(τn) is due to the selection bias on a subset of samples, and scales with n as long as the inputs have good regularity properties. The numerator
The numerator\nterm is due to the incorrect samples selected by St, which consists of: (a) samples from other mixture\ncomponents, and (b) corrupted samples. (a) is controlled by the af\ufb01ne error, which depends on (a1)\nthe local separation of components Qj, and (a2) the relative closeness of \u2713t to \u2713?\n(j), and scales with n.\n(j). For (b), the\nThe af\ufb01ne error V gets larger if the separation is small, or \u2713t is not close enough to \u2713?\nnumber of all corrupted samples is controlled by ?\u2327 ?\nminn, which is not large given ? being a small\nconstant.\nTheorem 4 gives a general update rule for any given dataset according to De\ufb01nitions 1-2. Next, we\npresent the local convergence result for the speci\ufb01c setting of Gaussian input vectors, by giving a tight\nanalysis for feature regularity in Lemma 5 and a tight bound for the af\ufb01ne error V() in Lemma 6.\nLemma 5. Let +(k), (k) be de\ufb01ned as in (1), and assume each xi \u21e0N (0, Id). Then, for\nk = ckn with constant ck, for n =\u2326 \u2713 d log 1\n\nck \u25c6, with high probability,\n +(k) \uf8ff c1 \u00b7 k, (k) c2 \u00b7 k,\n\nck\n\n6\n\n\fwhere c1, c2 are constants that depend on ck: c1 \uf8ff 1 + 3eq6 log 2\nconstants C1, C2.\nLemma 6. Suppose we have xi \u21e0N (0, Id), \u2327 ?\n\u2326( d log log d/\u2327 ?\n\nck\n\nmin), with high probability, the design matrix satis\ufb01es V() \uf8ff c{n _ log n}.\n\n(j)n samples for each class S(j). Then, for n =\n\n+ C1\nck\n\n, c2 C2ck, for universal\n\nPlug in Lemma 5 and Lemma 6 to Theorem 4, we have:\nTheorem 7 (Gaussian features). For (MLR-C), assume xi \u21e0N (0, Id), consider using Algo-\nrithm 1 with \u2327<\u2327\n(j)k \uf8ff\n(j)) for some j 2 [m],\ncj minl2[m]\\{j} k\u2713?\nthen, w.h.p., the next iterate \u2713t+1 of the algorithm satis\ufb01es\n\nmin \u2318. 
If the iterate satis\ufb01es k\u2713t \u2713?\n\n(j)k2 (where cj is a constant depending on \u2327 and \u2327 ?\n\n(j), 8j 2 [m], and n =\u2326 \u21e3 d log log d\n(j)k2 \uf8ff \uf8fftk\u2713t \u2713?\n\n(l) \u2713?\n\n(j)k2,\n\n(4)\n\n\u2327 ?\n\n?\n\nwhere \uf8fft = c0\n\n\u2327n\u21e3n 1\n\nQj \u00b7\n\nk\u2713t+1 \u2713?\nn _ log no + ?\u2327 ?\n\n(j)k2\n\n2k\u2713t\u2713?\n(j)k2\n\nk\u2713?\n\nminn\u2318 < 1, for some small constant ?.\n\n(j) up to arbitrary accuracy with \u02dcO(d/\u2327 ?\n\nNote that in this Theorem, c0 is a constant such that \uf8fft < 1, and such a c0 corresponds to an upper\nbound on cj, i.e., the local region. Theorem 7 shows that, as long as \u2713t is contant time closer to\nmin) samples. In fact, Lemma 5 and\n(j), we can recover \u2713?\n\u2713?\nLemma 6 (and hence Theorem 7) are generalizable to more general distributions, including the setting\nstudied in [14]. The initial condition simply changes by a factor of , where is the upper bound of\nthe covariance matrix. The formal statement is as follows:\nCorollary 8 (features with non-isotropic Gaussians). Consider (MLR-C), where each xi \u21e0\nN (0, \u2303(j)) for i 2 S(j), I \u2303(j) I. Under the same setting as in Theorem 7, convergence\nproperty (4) holds as long as iterate \u2713t satis\ufb01es k\u2713t \u2713?\nDiscussion We summarize our results from the following four perspectives:\n\u2022 Our results can generalize to a wide class of distributions: e.g., Gaussians or a sub-class of\nsub-Gaussians with different covariance matrix. This is because the proof technique for showing\nLemma 5 and Lemma 6 only exploits the property of (a) concentration of order statistics; (b)\nanti-concentration of Gaussian-type distributions.\n\n minl2[m]\\{j} k\u2713?\n\n(j)k \uf8ff cj\n\n(l) \u2713?\n\n(j)k2.\n\n\u2327\n\n\u2022 Super-linear convergence speed for ? = 0: When ? 
= 0, κ_t ∝ ||θ_t − θ*_(j)||_2 in Theorem 7.
• Optimal local sample dependency on m: Notice that locally, in the balanced setting, where τ*_(j) = 1/m, the sample dependency on m is linear. This dependency is optimal, since for each component we need n/m > d to make the problem identifiable.²
• ILTS learns each component separately: Different from the local alternating minimization approach of [27], recovering one component does not require good estimates of any other components. E.g., if we are only interested in the j-th component, then the sample complexity is Õ(d/τ*_(j)).

5 Global ILTS and Its Analysis

In Section 4, we showed that as long as the initialization is closer to the targeted component by a constant factor, we can locally recover the component, even under a constant fraction of corruptions. In this part, we discuss the initialization condition. Let us define the targeted subspace U_m as U_m := span{θ*_(1), θ*_(2), …, θ*_(m)}, and for any subspace U, we denote by U the corresponding subspace matrix, with orthonormal columns. We define the concept of an ϵ-close subspace as follows:
Definition 9 (ϵ-close subspace). Û ∈ ℝ^{d×m̃} is an ϵ-close subspace to U_m if m̃ = O(m) and their corresponding subspace matrices Û, U_m satisfy ||(I_d − ÛÛ^⊤)·U_m||_2 ≤ ϵ.

² Notice that the larger m becomes, the smaller the local region becomes, since c_j depends on m. However,
However,\n\naccording to our bound for + and , the dependency of cj on m is still polynomial.\n\n7\n\n\fi=1\n\n(j)k2\n\nj=1, small error \n\nRemove samples in set Sj from Dn\n\nAlgorithm 2 GLOBAL-ILTS (for recovering all components )\n1: Input: Samples Dn = {xi, yi}n\n2: Output:b\u27131,\u00b7\u00b7\u00b7 ,b\u2713m\n3: Parameters: Granularity \u270f, estimate {\u2327j}m\n4: Find a \u270f-close subspace Um\n5: Generate an \u270f-net \u21e5\u270f covering the centered sphere in Um with radius k maxj2[m] \u2713?\n6: for j = 1 to m do\n7:\n8:\n9:\n10:\n11:\n12:\n\nfor \u02dc\u2713 randomly drawn from \u21e5\u270f do\n\u2713 ILTS(Dn, \u02dc\u2713, \u2327j)\nSj = {i | (yi hxi,\u2713 i)2 < 2}\nif |Sj| b \u2327jnc then\nb\u2713j = \u2713, break\n13: Return:b\u27131,\u00b7\u00b7\u00b7 ,b\u2713m\nAn interpretation of an \u270f-close subspace U is as follows: for any unit vector v from subspace Um,\nthere exists a vector v0 in subspace U with norm less than 1, such that kv v0k2 \uf8ff \u270f. We also de\ufb01ne\n\"-recovery, to help with stating our theorem.\nDe\ufb01nition 10 (\"-recovery). b\u21e5= hb\u27131,\u00b7\u00b7\u00b7 ,b\u2713mi is a \"-recovery of \u21e5? = h\u2713?\n(m)i if\nminP2Pm kb\u21e5P \u21e5?k2,1 \uf8ff \", where Pm is the class of all m-dimensional permutation matrices.\nThe procedure for Global-ILTS is shown in Algorithm 2. The algorithm takes a subspace as its\ninput, which should be a good approximation of the subspace spanned by the correct \u2713?\n(j)s. Given\nthe subspace, Global-ILTS constructs an \u270f-net over a sphere in subspace Um The algorithm then\niteratively removes samples once the ILTS sub-routine \ufb01nds a valid component. Notice that we\nrequire the estimates \u2327js to satisfy \u2327j <\u2327 ?\nTheorem 11 (Global algorithm). For (MLR-C), assume xi \u21e0N (0, Id). 
Following Algorithm 2, we can find an ϵ-close subspace U with ϵ = c_l·min_{j∈[m]} τ_j, and with small δ, ε (e.g., δ = c·√(log n)·ε with ε small enough) and τ_j < τ*_(j) for all j ∈ [m], we are able to achieve ε-recovery over all components with n = Ω(d log log d / τ*_min²) samples, in O((1/(τ*_min·Q))^{O(m)} · nd² log(1/ε)) time.
Several merits of Theorem 11: First, our result clearly separates the problem into (a) globally finding a subspace, and (b) locally recovering a single component with ILTS. Second, the nd² computational dependency is due to finding the exact least-squares solution. Alternatively, one can take gradient descent steps to find an approximation to the true component. The convergence property of a gradient descent variant of ILTS is shown in Section B, where we further discuss the ideal number of gradient updates to make in each round, so that the algorithm can be more efficient. Third, the exp(O(m)) dependency in the runtime can be practically avoided, since our algorithm is easy to run in parallel.
Feasibility of getting U. Let L = [y_1 x_1; y_2 x_2; ⋯; y_n x_n]; then the column space of L is close to U_m for γ* = 0, when the x_i have identity covariance. For γ* = 0, the standard top-m SVD on L, in O(m²n) time with Ω((d/(ϵ²τ*_min))·polylog(d)) samples, is guaranteed to get an ϵ-close estimate, following the well-known sin-theta theorem [7, 4]. For γ* ≠ 0 under the same setting, we can use robust PCA methods to robustly find the subspace. For example, the state-of-the-art result in [8] provides a near-optimal recovery guarantee, with a slightly larger sample size (i.e., Ω(d²/ϵ²)). Closing this sample complexity gap is an interesting open problem for outlier-robust PCA. Notice that instead of
Notice that instead of estimating the subspace, [14] uses strong distributional assumptions to compute higher moments of the Gaussian, and suffers an exponential dependency on m in its sample complexity.

³To satisfy this, one can always search through the set {1, c, c², c³, · · · } (for some constant c < 1) and get an estimate in the interval [cτ*_{(j)}, τ*_{(j)}).

6 Discussion

Iterative least trimmed squares is the simplest instance of a much more general principle: one can make learning robust to bad training data by iteratively updating a model using only the samples it currently fits best. In this paper we provide rigorous theoretical evidence that it obtains state-of-the-art results for a specific simple setting: mixed linear regression with corruptions. It would be very interesting to see whether this positive evidence can be established in other (and more general) settings.
While it seems similar at first glance, we note that our algorithm is not an instance of the Expectation-Maximization (EM) algorithm. In particular, it is tempting to associate a binary selection "hidden variable" z_i with every sample i, and then use EM to minimize an overall loss that depends on θ and the z's. However, this EM approach requires us to posit a model for the data under both the z_i = 0 (i.e., "discarded sample") and z_i = 1 (i.e., "chosen sample") choices. ILTS, on the other hand, only needs a model for the z_i = 1 case.

Acknowledgement

We would like to acknowledge NSF grants 1302435 and 1564000 for supporting this research.

References

[1] Sivaraman Balakrishnan, Martin J. Wainwright, Bin Yu, et al. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.

[2] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding.
In Advances in Neural Information Processing Systems, pages 721–729, 2015.

[3] Stéphane Boucheron, Maud Thomas, et al. Concentration inequalities for order statistics. Electronic Communications in Probability, 17, 2012.

[4] Tony Cai, Zongming Ma, and Yihong Wu. Optimal estimation and rank detection for sparse spiked covariance matrices. Probability Theory and Related Fields, 161(3-4):781–815, 2015.

[5] Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.

[6] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782, 2013.

[7] Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

[8] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 999–1008. JMLR.org, 2017.

[9] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

[10] Sushrut Karmalkar and Eric Price. Compressed sensing with adversarial sparse noise via L1 regression. In 2nd Symposium on Simplicity in Algorithms (SOSA 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[11] Adam R. Klivans, Pravesh K. Kothari, and Raghu Meka. Efficient algorithms for outlier-robust regression. In Conference on Learning Theory, pages 1420–1430, 2018.

[12] Jason M. Klusowski, Dana Yang, and W. D. Brinda.
Estimating the coefficients of a mixture of two linear regressions by expectation maximization. IEEE Transactions on Information Theory, 2019.

[13] Jeongyeol Kwon and Constantine Caramanis. Global convergence of EM algorithm for mixtures of two component linear regression. arXiv preprint arXiv:1810.05752, 2018.

[14] Yuanzhi Li and Yingyu Liang. Learning mixtures of linear regressions with nearly optimal complexity. In Conference on Learning Theory, pages 1125–1144, 2018.

[15] Liu Liu, Yanyao Shen, Tianyang Li, and Constantine Caramanis. High dimensional robust sparse regression. arXiv preprint arXiv:1805.11643, 2018.

[16] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 93–102. IEEE, 2010.

[17] David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. On the least trimmed squares estimator. Algorithmica, 69(1):148–183, 2014.

[18] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[19] Peter J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.

[20] Hanie Sedghi, Majid Janzamin, and Anima Anandkumar. Provable tensor methods for learning mixtures of generalized linear models. In Artificial Intelligence and Statistics, pages 1223–1231, 2016.

[21] Yanyao Shen and Sujay Sanghavi. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning, 2019.

[22] Arun Suggala, Adarsh Prasad, and Pradeep K. Ravikumar. Connecting optimization and regularization paths.
In Advances in Neural Information Processing Systems, pages 10631–10641, 2018.

[23] Daniel Vainsencher, Shie Mannor, and Huan Xu. Ignoring is a bliss: Learning with large noise through reweighting-minimization. In Conference on Learning Theory, pages 1849–1881, 2017.

[24] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[25] Eunho Yang, Aurélie C. Lozano, Aleksandr Aravkin, et al. A general family of trimmed estimators for robust high-dimensional data analysis. Electronic Journal of Statistics, 12(2):3519–3553, 2018.

[26] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In International Conference on Machine Learning, pages 613–621, 2014.

[27] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Solving a mixture of many random linear equations by tensor decomposition and alternating minimization. arXiv preprint arXiv:1608.05749, 2016.

[28] Dong Yin, Ramtin Pedarsani, Yudong Chen, and Kannan Ramchandran. Learning mixtures of sparse linear regressions using sparse graph codes. IEEE Transactions on Information Theory, 2018.

[29] Kai Zhong, Prateek Jain, and Inderjit S. Dhillon. Mixed linear regression with multiple components. In Advances in Neural Information Processing Systems, pages 2190–2198, 2016.