{"title": "List-decodable Linear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 7425, "page_last": 7434, "abstract": "We give the first polynomial-time algorithm for robust regression in the list-decodable setting where an adversary can corrupt a greater than 1/2 fraction of examples. \n\nFor any \\alpha < 1, our algorithm takes as input a sample {(x_i,y_i)}_{i \\leq n} of n linear equations where \\alpha n of the equations satisfy y_i = \\langle x_i,\\ell^*\\rangle +\\zeta for some small noise \\zeta and (1-\\alpha) n of the equations are {\\em arbitrarily} chosen. It outputs a list L of size O(1/\\alpha) - a fixed constant - that contains an \\ell that is close to \\ell^*.\n\nOur algorithm succeeds whenever the inliers are chosen from a certifiably anti-concentrated distribution D. In particular, this gives a (d/\\alpha)^{O(1/\\alpha^8)} time algorithm to find a O(1/\\alpha) size list when the inlier distribution is a standard Gaussian. For discrete product distributions that are anti-concentrated only in regular directions, we give an algorithm that achieves similar guarantee under the promise that \\ell^* has all coordinates of the same magnitude. To complement our result, we prove that the anti-concentration assumption on the inliers is information-theoretically necessary.\n\nTo solve the problem we introduce a new framework for list-decodable learning that strengthens the ``identifiability to algorithms'' paradigm based on the sum-of-squares method.", "full_text": "List-decodeable Linear Regression\n\nSushrut Karmalkar\n\nUniversity of Texas at Austin\n\nsushrutk@cs.utexas.edu\n\nAdam R. Klivans\n\nUniversity of Texas at Austin\nklivans@cs.utexas.edu\n\nPravesh K. 
Kothari

Princeton University and Institute for Advanced Study

kothari@cs.princeton.edu

Abstract

We give the first polynomial-time algorithm for robust regression in the list-decodable setting where an adversary can corrupt a greater than 1/2 fraction of examples.
For any α < 1, our algorithm takes as input a sample {(xi, yi)}_{i≤n} of n linear equations where αn of the equations satisfy yi = ⟨xi, ℓ*⟩ + ζ for some small noise ζ and (1 − α)n of the equations are arbitrarily chosen. It outputs a list L of size O(1/α) - a fixed constant - that contains an ℓ that is close to ℓ*.
Our algorithm succeeds whenever the inliers are chosen from a certifiably anti-concentrated distribution D. As a corollary of our algorithmic result, we obtain a (d/α)^{O(1/α^8)} time algorithm to find an O(1/α)-size list when the inlier distribution is standard Gaussian. For discrete product distributions that are anti-concentrated only in regular directions, we give an algorithm that achieves a similar guarantee under the promise that ℓ* has all coordinates of the same magnitude. To complement our result, we prove that the anti-concentration assumption on the inliers is information-theoretically necessary.
To solve the problem we introduce a new framework for list-decodable learning that strengthens the "identifiability to algorithms" paradigm based on the sum-of-squares method.

1 Introduction

In this work, we design algorithms for the problem of linear regression that are robust to training sets with an overwhelming (≫ 1/2) fraction of adversarially chosen outliers.
Outlier-robust learning algorithms have been extensively studied (under the name robust statistics) in mathematical statistics [43, 37, 25, 23].
However, the algorithms resulting from this line of work usually run in time exponential in the dimension of the data [6]. An influential line of recent work [29, 1, 16, 33, 8, 30, 31, 24, 14, 17, 28] has focused on designing efficient algorithms for outlier-robust learning.
Our work extends this line of research. Our algorithms work in the "list-decodable learning" framework. In this model, a majority of the training data (a 1 − α fraction) can be adversarially corrupted, leaving only an α ≪ 1/2 fraction of "inliers". Since uniquely recovering the underlying parameters is information-theoretically impossible in such a setting, the goal is to output a list (with an absolute constant size) of parameters, one of which matches the ground truth. This model was introduced in [3] to give a discriminative framework for clustering. More recently, beginning with [8], various works [18, 30] have considered this as a model of "untrusted" data.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

There has been phenomenal progress in developing techniques for outlier-robust learning with a small (≪ 1/2)-fraction of outliers (e.g. outlier "filters" [13, 14, 10, 15], separation oracles for inliers [13], or the sum-of-squares method [31, 24, 30, 28]). In contrast, progress on algorithms that tolerate the significantly harsher conditions in the list-decodable setting has been slower. The only prior works [8, 18, 30] in this direction designed list-decodable algorithms for mean estimation via problem-specific methods.
Recently, [22] addressed the somewhat related problem of conditional linear regression, where the goal is to find a linear function with small square loss conditioned on a subset of training points whose indices satisfy some constant-width k-DNF formula.
In this paper, we develop a principled technique to give the first efficient list-decodable learning algorithm for the fundamental problem of linear regression. Our algorithm takes a corrupted set of linear equations with an α ≪ 1/2 fraction of inliers and outputs an O(1/α)-size list of linear functions, one of which is guaranteed to be close to the ground truth (i.e., the linear function that correctly labels the inliers). A key conceptual insight in this result is that list-decodable regression information-theoretically requires the inlier distribution to be "anti-concentrated". Our algorithm succeeds whenever the distribution satisfies a stronger "certifiable anti-concentration" condition that is algorithmically "usable". This class includes the standard Gaussian distribution and, more generally, any spherically symmetric distribution with strictly sub-exponential tails.
Prior to our work¹, the state-of-the-art outlier-robust algorithms for linear regression [28, 19, 12, 39] could handle only a small (< 0.1)-fraction of outliers even under strong assumptions on the underlying distributions.
List-decodable regression generalizes the well-studied [11, 26, 21, 44, 2, 9, 45, 41, 34] and easier problem of mixed linear regression: given k "clusters" of examples that are labeled by one out of k distinct unknown linear functions, find the unknown set of linear functions. All known techniques for the problem rely on faithfully estimating certain moment tensors from samples and thus cannot tolerate the overwhelming fraction of outliers in the list-decodable setting.
On the other hand, since we can take any cluster as inliers and treat the rest as outliers, our algorithm immediately yields new efficient algorithms for mixed linear regression. Unlike all prior works, our algorithms work without any pairwise separation or bounded condition-number assumptions on the k linear functions.

List-Decodable Learning via the Sum-of-Squares Method  Our algorithm relies on a strengthening of the robust-estimation framework based on the sum-of-squares (SoS) method. This paradigm has recently been used for clustering mixture models [24, 30] and for obtaining algorithms for moment estimation [31] and linear regression [28] that are resilient to a small (≪ 1/2) fraction of outliers under the mildest known assumptions on the underlying distributions. At the heart of this technique is a reduction of outlier-robust algorithm design to just finding "simple" proofs of unique "identifiability" of the unknown parameter of the original distribution from a corrupted sample. However, this principled method works only in the setting with a small (≪ 1/2) fraction of outliers. As a consequence, the work of [30] for mean estimation in the list-decodable setting relied on "supplementing" the SoS method with a somewhat ad hoc, problem-dependent technique.
As an important conceptual contribution, our work yields a framework for list-decodable learning that recovers some of the simplicity of the general blueprint. Central to our framework is a general method of rounding by votes for "pseudo-distributions" in the setting with a ≫ 1/2 fraction of outliers. Our rounding builds on the work of [32], who developed such a method to give a simpler proof of the list-decodable mean estimation result of [30].
In Section 2, we explain our ideas in detail.
The results in all the works above hold for any underlying distribution that has upper-bounded low-degree moments, as long as such bounds are "captured" within the SoS system. Such conditions are called "certified bounded moment" inequalities. An important contribution of this work is to formalize anti-concentration inequalities within the SoS system and prove "certified anti-concentration" for natural distribution families. Unlike bounded moment inequalities, there is no canonical encoding within SoS for such statements. We choose an encoding that allows proving certified anti-concentration for a distribution by showing the existence of a certain approximating polynomial. This allows showing certified anti-concentration of natural distributions via a completely modular approach that relies on a beautiful line of works that construct "weighted" polynomial approximators [35].

¹ There is a long line of work on robust regression algorithms (see e.g. [7, 27]) that can tolerate corruptions only in the labels. We are interested in algorithms robust against corruptions in both examples and labels.

We believe that our framework for list-decodable estimation and our formulation of the certified anti-concentration condition will likely have further applications in outlier-robust learning.

1.1 Our Results

We first define our model for generating samples for list-decodable regression.
Model 1.1 (Robust Linear Regression). For 0 < α < 1 and ℓ* ∈ R^d with ‖ℓ*‖₂ ≤ 1, let Lin_D(α, ℓ*) denote the following probabilistic process to generate n noisy linear equations S = {⟨xi, a⟩ = yi | 1 ≤ i ≤ n} in variable a ∈ R^d with αn inliers I and (1 − α)n outliers O:

1. Construct I by choosing αn i.i.d.
samples xi ∼ D and set yi = ⟨xi, ℓ*⟩ + ζ for additive noise ζ,

2. Construct O by choosing the remaining (1 − α)n equations arbitrarily and potentially adversarially w.r.t. the inliers I.

Note that α measures the "signal" (fraction of inliers) and can be ≪ 1/2. The bound on the norm of ℓ* is without any loss of generality. For the sake of exposition, we will restrict to ζ = 0 for most of this paper and discuss (see Remarks 1.6 and 4.4) how our algorithms can tolerate additive noise.
An η-approximate algorithm for list-decodable regression takes as input a sample from Lin_D(α, ℓ*) and outputs a constant (depending only on α) size list L of linear functions such that there is some ℓ ∈ L that is η-close to ℓ*.
One of our key conceptual contributions is to identify the strong relationship between anti-concentration inequalities and list-decodable regression. Anti-concentration inequalities are well-studied [20, 42, 40] in probability theory and combinatorics. The simplest of these inequalities upper bound the probability that a high-dimensional random variable has zero projection in any direction.
Definition 1.2 (Anti-Concentration). An R^d-valued zero-mean random variable Y has a δ-anti-concentrated distribution if, for every non-zero v, Pr[⟨Y, v⟩ = 0] < δ.
In Proposition 2.4, we provide a simple but conceptually illuminating proof that anti-concentration is sufficient for list-decodable regression. In Theorem 6.1, we prove a sharp converse and show that anti-concentration is information-theoretically necessary for even noiseless list-decodable regression. This lower bound surprisingly holds for a natural distribution: the uniform distribution on {0, 1}^d and, more generally, the uniform distribution on [q]^d for [q] = {0, 1, 2, . . . , q}.
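The role of anti-concentration, and the failure of the hypercube, can be seen in a small simulation. This is our own illustration, not part of the paper, and all names and parameters are hypothetical: for Gaussian inliers a wrong vector exactly fits essentially none of the equations, while on {0, 1}^d the sparse direction v = e₁ − e₂ has ⟨x, v⟩ = 0 for half the points, so ℓ* + v exactly fits about half of them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 20000
l_star = rng.standard_normal(d)
l_star /= np.linalg.norm(l_star)

# Gaussian inliers are anti-concentrated: a wrong l' exactly fits ~no equations.
x_gauss = rng.standard_normal((n, d))
y_gauss = x_gauss @ l_star
l_wrong = l_star + rng.standard_normal(d)
frac_gauss = np.mean(np.isclose(x_gauss @ l_wrong, y_gauss))

# The hypercube is not anti-concentrated in sparse directions:
# v = e1 - e2 has <x, v> = 0 whenever x[0] == x[1], i.e. for half the points.
x_cube = rng.integers(0, 2, size=(n, d)).astype(float)
y_cube = x_cube @ l_star
v = np.zeros(d)
v[0], v[1] = 1.0, -1.0
frac_cube = np.mean(np.isclose(x_cube @ (l_star + v), y_cube))

print(frac_gauss)  # ~ 0.0: no spurious exact fits
print(frac_cube)   # ~ 0.5: l* + v explains about half the inliers exactly
```

This is exactly the obstruction behind the lower bound: on the hypercube, ℓ* and ℓ* + v agree on half the sample, so the two are indistinguishable there.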
And in fact, our lower bound shows the impossibility of even the "easier" problem of mixed linear regression on this distribution.
Theorem 1.3 (See Proposition 2.4 and Theorem 6.1). There is an (inefficient) list-decodable regression algorithm for Lin_D(α, ℓ*) with list size O(1/α) whenever D is α-anti-concentrated. Further, there exists a distribution D on R^d that is (α + ε)-anti-concentrated for every ε > 0, but there is no algorithm for α/2-approximate list-decodable regression for Lin_D(α, ℓ*) that returns a list of size < d.

To handle additive noise of variance ζ², we need control of Pr[|⟨x, v⟩| ≤ ζ]. For our efficient algorithms, we need, in addition, a certified version of the anti-concentration condition. Informally, this means that there is a "low-degree sum-of-squares proof" of anti-concentration of I. We give a precise definition and background in Section 3. For this section, we will use this phrase informally and encourage the reader to think of it as a version of anti-concentration that the SoS method can reason about.
Definition 1.4 (Certifiable Anti-Concentration). A random variable Y has a k-certifiably (C, δ)-anti-concentrated distribution if there is a univariate polynomial p satisfying p(0) = 1 such that there is a degree-k sum-of-squares proof of the following two inequalities:
1. ∀v, ⟨Y, v⟩² ≤ δ² E⟨Y, v⟩² implies (p(⟨Y, v⟩) − 1)² ≤ δ².
2. ∀v, ‖v‖₂² ≤ 1 implies E p²(⟨Y, v⟩) ≤ Cδ.

Intuitively, certified anti-concentration asks for a certificate of the anti-concentration property of Y in the "sum-of-squares" proof system (see Section 3 for precise definitions).
SoS is a proof system that reasons about polynomial inequalities. Since the "core indicator" 1(|⟨x, v⟩| ≤ δ) is not a polynomial, we phrase the condition in terms of an approximating polynomial p. We are now ready to state our main result.

Please note that sections 3-6 are in the supplementary material.

Theorem 1.5 (List-Decodable Regression). For every α, η > 0 and a k-certifiably (C, α²η²/10C)-anti-concentrated distribution D on R^d, there exists an algorithm that takes as input a sample generated according to Lin_D(α, ℓ*) and outputs a list L of size O(1/α) such that there is an ℓ ∈ L satisfying ‖ℓ − ℓ*‖₂ < η with probability at least 0.99 over the draw of the sample. The algorithm needs a sample of size n = (kd)^{O(k)} and runs in time n^{O(k)} = (kd)^{O(k²)}.
Remark 1.6 (Tolerating Additive Noise). For additive noise (not necessarily independent across samples) of variance ζ² in the inlier labels, our algorithm, in the same running time and sample complexity, outputs a list of size O(1/α) that contains an ℓ satisfying ‖ℓ − ℓ*‖₂ ≤ ζ/α + η. Since we normalize ℓ* to have unit norm, this guarantee is meaningful only when ζ ≪ α.
Remark 1.7 (Exponential Dependence on 1/α). List-decodable regression algorithms immediately yield algorithms for mixed linear regression (MLR) without any assumptions on the components. The state-of-the-art algorithms for MLR with Gaussian components [34, 41] have an exponential dependence on k = 1/α in the running time in the absence of strong pairwise separation or a small condition number of the components.
Liang and Liu [34] (see Page 10 of their paper) use the relationship to learning mixtures of k Gaussians (with an exp(k) lower bound [38]) to note that there may not exist any algorithm with polynomial dependence on 1/α for MLR and thus also for list-decodable regression.

Certifiably anti-concentrated distributions  In Section 5, we show certifiable anti-concentration of some well-studied families of distributions. This includes the standard Gaussian distribution and, more generally, any anti-concentrated spherically symmetric distribution with strictly sub-exponential tails. We also show that simple operations such as scaling, applying well-conditioned linear transformations, and sampling preserve certifiable anti-concentration. This yields:
Corollary 1.8 (List-Decodable Regression for Gaussian Inliers). For every α, η > 0 there is an algorithm for list-decodable regression for the model Lin_D(α, ℓ*) with D = N(0, Σ) and λ_max(Σ)/λ_min(Σ) = O(1) that needs n = (d/αη)^{O(1/(α⁴η⁴))} samples and runs in time n^{O(1/(α⁴η⁴))} = (d/αη)^{O(1/(α⁸η⁸))}.
We note that certifiably anti-concentrated distributions are more restrictive compared to the families of distributions for which the most general robust estimation algorithms work [31, 30, 28]. To a certain extent, this is inherent. The families of distributions considered in these prior works do not satisfy anti-concentration in general. And as we discuss in more detail in Section 2, anti-concentration is information-theoretically necessary (see Theorem 1.3) for list-decodable regression.
This surprisingly rules out families of distributions that might appear natural and "easy", for example, the uniform distribution on {0, 1}^n.
We rescue this to an extent for the special case when ℓ* in the model Lin(α, ℓ*) is a "Boolean vector", i.e., has all coordinates of the same magnitude. Intuitively, this helps because while the uniform distribution on {0, 1}^n (and, more generally, any discrete product distribution) is badly anti-concentrated in sparse directions, such distributions are well anti-concentrated [20] in directions that are far from any sparse vector.
As before, for obtaining efficient algorithms, we need to work with a certified version (see Definition 4.5) of such a restricted anti-concentration condition. As a specific corollary (see Theorem 4.6 for a more general statement), this allows us to show:
Theorem 1.9 (List-Decodable Regression for Hypercube Inliers). For every α, η > 0 there is an η-approximate algorithm for list-decodable regression for the model Lin_D(α, ℓ*) with D uniform on {0, 1}^d that needs n = (d/αη)^{O(1/(α⁴η⁴))} samples and runs in time n^{O(1/(α⁴η⁴))} = (d/αη)^{O(1/(α⁸η⁸))}.
In Section 4.1, we obtain similar results for general product distributions. It is an important open problem to prove certified anti-concentration for a broader family of distributions.

2 Overview of our Technique

In this section, we give a bird's-eye view of our approach and illustrate the important ideas in our algorithm for list-decodable regression. Thus, given a sample S = {(xi, yi)}_{i=1}^n from Lin_D(α, ℓ*), we must construct a constant-size list L of linear functions containing an ℓ close to ℓ*.
Our algorithm is based on the sum-of-squares method.
We build on the "identifiability to algorithms" paradigm developed in several prior works [5, 4, 36, 31, 24, 30, 28] with some important conceptual differences.

An inefficient algorithm  Let's start by designing an inefficient algorithm for the problem. This may seem simple at the outset. But as we'll see, solving this relaxed problem will rely on some important conceptual ideas that will serve as a starting point for our efficient algorithm.
Without computational constraints, it is natural to just return the list L of all linear functions ℓ that correctly label all examples in some subset S ⊆ S of size αn. We call such an S a large, soluble set. The true inliers I satisfy our search criteria, so ℓ* ∈ L. However, it is not hard to show (Proposition B.1) that one can choose outliers so that the list so generated has size exp(d) (far from a fixed constant!).
A potential fix is to search instead for a coarse soluble partition of S, if it exists, into disjoint S1, S2, . . . , Sk and linear functions ℓ1, ℓ2, . . . , ℓk so that every |Si| ≥ αn and ℓi correctly computes the labels in Si. In this setting, our list is small (k ≤ 1/α). But it is easy to construct samples S for which this fails because there are coarse soluble partitions of S where every ℓi is far from ℓ*.

Anti-Concentration  It turns out that any (even inefficient) algorithm for list-decodable regression provably (see Theorem 6.1) requires that the distribution of inliers² be sufficiently anti-concentrated:
Definition 2.1 (Anti-Concentration). An R^d-valued random variable Y with mean 0 is δ-anti-concentrated³ if for all non-zero v, Pr[⟨Y, v⟩ = 0] < δ.
A set T ⊆ R^d is δ-anti-concentrated if the uniform distribution on T is δ-anti-concentrated.

As we discuss next, anti-concentration is also sufficient for list-decodable regression. Intuitively, this is because anti-concentration of the inliers prevents the existence of a soluble set that intersects significantly with I and yet can be labeled correctly by some ℓ ≠ ℓ*. This is simple to prove in the special case when S admits a coarse soluble partition.
Proposition 2.2. Suppose I is α-anti-concentrated. Suppose there exists a partition S1, S2, . . . , Sk ⊆ S such that each |Si| ≥ αn, and there exist ℓ1, ℓ2, . . . , ℓk such that yj = ⟨ℓi, xj⟩ for every j ∈ Si. Then, there is an i such that ℓi = ℓ*.
Proof. Since k ≤ 1/α, there is a j such that |I ∩ Sj| ≥ α|I|. Then, ⟨xi, ℓj⟩ = ⟨xi, ℓ*⟩ for every i ∈ I ∩ Sj. Thus, Pr_{i∼I}[⟨xi, ℓj − ℓ*⟩ = 0] ≥ α. This contradicts anti-concentration of I unless ℓj − ℓ* = 0.

The above proposition allows us to use any soluble partition as a certificate of correctness for the associated list L. Two aspects of this certificate were crucial in the above argument: 1) largeness: each Si is of size αn, so the generated list is small, and 2) uniformity: every sample is used in exactly one of the sets, so I must intersect one of the Si's in at least an α-fraction of the points.

Identifiability via anti-concentration  For arbitrary S, a coarse soluble partition might not exist. So we will generalize coarse soluble partitions to obtain certificates that exist for every sample S and guarantee largeness and a relaxation of uniformity (formalized below).
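On a toy mixed-linear-regression instance, the certificate of Proposition 2.2 can be exercised directly: solve the equations inside each block of a coarse soluble partition and collect the solutions. This sketch is ours, with made-up sizes; ells[0] plays the role of ℓ*.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k = 8, 500, 4                   # k clusters of m points each, so alpha = 1/k
ells = [rng.standard_normal(d) for _ in range(k)]  # ells[0] plays l*
blocks = []
for ell in ells:
    X = rng.standard_normal((m, d))   # each block is soluble: labeled exactly by its own ell
    blocks.append((X, X @ ell))

# The coarse soluble partition certifies a list of size k <= 1/alpha:
# one candidate per block, obtained by solving that block's equations.
L = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in blocks]

# As Proposition 2.2 guarantees, some member of the list matches l*.
err = min(np.linalg.norm(ell_hat - ells[0]) for ell_hat in L)
print(err)  # essentially 0
```

Any block that captures at least an α-fraction of the inliers is forced, by anti-concentration of the inlier distribution, to be labeled by ℓ* itself.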
For this purpose, it is convenient to view such certificates as distributions µ on soluble subsets of S of size ≥ αn, so any collection C ⊆ 2^S of αn-size sets corresponds to the uniform distribution µ on C.
To precisely define uniformity, let Wi(µ) = E_{S∼µ}[1(i ∈ S)] be the "frequency of i", that is, the probability that the ith sample is chosen to be in a set drawn according to µ. Then, the uniform distribution µ on any coarse soluble k-partition satisfies Wi = 1/k for every i. That is, all samples i ∈ S are used uniformly in such a µ. To generalize this idea, we define ∑_i Wi(µ)² as the distance to uniformity of µ. Up to a shift, this is simply the variance in the frequencies of the points in S used in draws from µ. Our generalization of a coarse soluble partition of S is any µ that minimizes ∑_i Wi(µ)², the distance to uniformity, and is thus maximally uniform among all distributions supported on large soluble sets. Such a µ can be found by convex programming.

² As in the standard robust estimation setting, the outliers are arbitrary and potentially adversarially chosen.
³ Definition 1.4 differs slightly to handle list-decodable regression with additive noise in the inliers.

The following claim generalizes Proposition 2.2 to derive the same conclusion starting from any maximally uniform distribution supported on large soluble sets.
Proposition 2.3. For a maximally uniform µ on αn-size soluble subsets of S, ∑_{i∈I} E_{S∼µ}[1(i ∈ S)] ≥ α|I|.
The proof proceeds by contradiction (see Lemma 4.3). We show that if ∑_{i∈I} Wi(µ) ≤ α|I|, then we can strictly reduce the distance to uniformity by taking a mixture of µ with the distribution that places all its probability mass on I.
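The frequencies W_i(µ) and the distance to uniformity are easy to compute by hand on a toy example (our own illustration; sizes hypothetical): the uniform distribution on a coarse k-partition has W_i = 1/k for every i, and any µ that reuses one set at the expense of another is strictly farther from uniform.

```python
import numpy as np

n, k = 12, 3                       # n samples, a coarse partition into k blocks (alpha = 1/k)
blocks = [range(0, 4), range(4, 8), range(8, 12)]

def frequencies(sets, probs):
    # W_i(mu) = Pr_{S ~ mu}[i in S] for mu putting probs[j] on sets[j]
    W = np.zeros(n)
    for S, p in zip(sets, probs):
        for i in S:
            W[i] += p
    return W

# (a) uniform mu on the partition: W_i = 1/k, distance to uniformity n/k^2
W_part = frequencies(blocks, [1 / k] * k)
dist_part = np.sum(W_part ** 2)

# (b) a skewed mu that uses one block twice and drops another
W_skew = frequencies([blocks[0], blocks[0], blocks[1]], [1 / 3] * 3)
dist_skew = np.sum(W_skew ** 2)

print(W_part)                # all equal to 1/3
print(dist_part, dist_skew)  # the partition is strictly more uniform
```

The same objective ∑_i W_i(µ)² is what the efficient algorithm later minimizes over pseudo-distributions.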
This allows us to obtain an (inefficient) algorithm for list-decodable regression, establishing identifiability.
Proposition 2.4 (Identifiability for List-Decodable Regression). Let S be a sample from Lin(α, ℓ*) such that I is δ-anti-concentrated for δ < α. Then, there is an (inefficient) algorithm that finds a list L of size 20/(α − δ) such that ℓ* ∈ L with probability at least 0.99.

Proof. Let µ be any maximally uniform distribution over αn-size soluble subsets of S. For k = 20/(α − δ), let S1, S2, . . . , Sk be independent samples from µ. Output the list L of k linear functions that correctly compute the labels in each Si.
To see why ℓ* ∈ L, observe that E|Sj ∩ I| = ∑_{i∈I} E 1(i ∈ Sj) ≥ α|I|. By averaging, Pr[|Sj ∩ I| ≥ ((α + δ)/2)|I|] ≥ (α − δ)/2. Thus, there is a j ≤ k so that |Sj ∩ I| ≥ ((α + δ)/2)|I| with probability at least 1 − (1 − (α − δ)/2)^{20/(α−δ)} ≥ 0.99. We can now repeat the argument in the proof of Proposition 2.2 to conclude that any linear function that correctly labels Sj must equal ℓ*.

An efficient algorithm  Our identifiability proof suggests the following simple algorithm: 1) find any maximally uniform distribution µ on soluble subsets of S of size αn, 2) take O(1/α) samples Si from µ, and 3) return the list of linear functions that correctly label the equations in the Si's.
This is inefficient because searching over distributions is NP-hard in general.
To make this into an efficient algorithm, we start by observing that soluble subsets S ⊆ S of size αn can be described by the following set of quadratic equations, where w stands for the indicator of S and ℓ for the linear function that correctly labels the examples in S:

A_{w,ℓ} :  { ∑_{i=1}^n wi = αn;   wi² = wi for all i ∈ [n];   wi · (yi − ⟨xi, ℓ⟩) = 0 for all i ∈ [n];   ‖ℓ‖² ≤ 1 }   (2.1)

Our efficient algorithm searches for a maximally uniform pseudo-distribution on w satisfying (2.1). Degree-k pseudo-distributions (see Section 3 for precise definitions) are generalizations of distributions that nevertheless "behave" just as distributions whenever we take (pseudo-)expectations (denoted by Ẽ) of a class of degree-k polynomials. And unlike distributions, degree-k pseudo-distributions satisfying⁴ polynomial constraints (such as (2.1)) can be computed in time n^{O(k)}.
For the sake of intuition, it might be helpful to (falsely) think of pseudo-distributions µ̃ as simply distributions where we only get access to moments of degree ≤ k. Thus, we are allowed to compute expectations of all degree ≤ k polynomials with respect to µ̃. Since Wi(µ̃) = Ẽ_µ̃[wi] are just first moments of µ̃, our notion of maximally uniform distributions extends naturally to pseudo-distributions. This allows us to prove an analog of Proposition 2.3 for pseudo-distributions and gives us an efficient replacement for Step 1.

⁴ See Fact 3.3 for a precise statement.

Proposition 2.5.
For any maximally uniform µ̃ of degree ≥ 2, ∑_{i∈I} Ẽ_µ̃[wi] ≥ α|I| = α ∑_{i∈[n]} Ẽ_µ̃[wi].

For Step 2, however, we hit a wall: it is not possible to obtain independent samples from µ̃ given only low-degree moments.

Rounding by Votes  To circumvent this hurdle, our algorithm departs from the rounding strategies for pseudo-distributions used in prior works and instead "rounds" each sample to a candidate linear function. While a priori this method produces n different candidates instead of one, we will be able to extract from them a list of size O(1/α) that contains the true vector. This step will crucially rely on anti-concentration properties of I.
Consider the vector vi = Ẽ_µ̃[wi ℓ] / Ẽ_µ̃[wi] whenever Ẽ_µ̃[wi] ≠ 0 (set vi to zero otherwise). This is simply the (scaled) average, according to µ̃, of all the linear functions ℓ that are used to label the sets S of size αn in the support of µ̃ whenever i ∈ S. Further, vi depends only on the first two moments of µ̃.
We think of the vi's as "votes" cast by the ith sample for the unknown linear function. Let us focus our attention on the votes vi for i ∈ I, the inliers. We will show that, according to the distribution proportional to Ẽ[w], the average ℓ2 distance of vi from ℓ* is at most η:

(1 / ∑_{i∈I} Ẽ[wi]) ∑_{i∈I} Ẽ[wi] ‖vi − ℓ*‖₂ < η .   (⋆)

Before diving into (⋆), let's see how it gives us our efficient list-decodable regression algorithm:

1. Find a pseudo-distribution µ̃ satisfying (2.1) that minimizes the distance to uniformity ∑_i Ẽ_µ̃[wi]².

2.
For O( 1\n\n\u03b1 ) times, independently choose a random index i \u2208 [n] with probability proportional\n\nto \u02dcE\u02dc\u00b5[wi] and return the list of corresponding vis.\n\nStep 1 above is a convex program - it minimizes a norm subject on the convex set of pseudo-\ndistributions - and can be solved in polynomial time. Let\u2019s analyze step 2 to see why the algorithm\nworks. Using ((cid:63)) and Markov\u2019s inequality, conditioned on i \u2208 I, (cid:107)vi \u2212 (cid:96)\u2217(cid:107)2 \u2264 2\u03b7 with probability\n\u2265 1/2. By Proposition 2.5,\n\u2265 \u03b1 so i \u2208 I with probability at least \u03b1. Thus in each\niteration of step 2, with probability at least \u03b1/2, we choose an i such that vi is 2\u03b7-close to (cid:96)\u2217.\nRepeating O(1/\u03b1) times gives us the 0.99 chance of success.\n\ni\u2208I \u02dcE[wi]\ni\u2208[n] \u02dcE[wi]\n\n(cid:80)\n(cid:80)\n\n((cid:63)) via anti-concentration As in the information-theoretic argument, ((cid:63)) relies on the anti-\nconcentration of I. Let\u2019s do a quick proof for the case when \u02dc\u00b5 is an actual distribution \u00b5.\n\ni (cid:107)E\u00b5[wi(cid:96)]\u2212E\u00b5[wi](cid:96)\u2217(cid:107) \u2264 E\u00b5[(cid:80)\n\nProof of ((cid:63)) for actual distributions \u00b5. Observe that \u00b5 is a distribution over (w, (cid:96)) satisfying (2.1).\nRecall that w indicates a subset S \u2286 S of size \u03b1n and wi = 1 iff i \u2208 S. And (cid:96) \u2208 Rd satis\ufb01es all the\nequations in S.\ni\u2208I wi(cid:107)(cid:96)\u2212(cid:96)\u2217(cid:107)]. Next, as in Proposition 2.2,\nsince I is \u03b7-anti-concentrated, and for all S such that |I \u2229 S| \u2265 \u03b7|I|, (cid:96) \u2212 (cid:96)\u2217 = 0. Thus, any such S\nin the support of \u00b5 contributes 0 to the expectation above. We will now show that the contribution\nfrom the remaining terms is upper bounded by \u03b7. 
Observe that since ‖ℓ − ℓ*‖ ≤ 2,

    E_μ[Σ_{i∈I} w_i ‖ℓ − ℓ*‖] = E_μ[1(|S ∩ I| < η|I|) Σ_{i∈S∩I} ‖ℓ − ℓ*‖] ≤ 2η|I| .

SoSizing Anti-Concentration. The key to proving (⋆) for pseudo-distributions is a sum-of-squares (SoS) proof of the anti-concentration inequality Pr_{x∼I}[⟨x, v⟩ = 0] ≤ η in the variable v. SoS is a restricted system for proving polynomial inequalities subject to polynomial inequality constraints. Thus, to even ask for an SoS proof, we must first phrase anti-concentration as a polynomial inequality.

To do this, let p(z) be a low-degree polynomial approximator for the function 1(z = 0). Then we can hope to "replace" the use of the inequality Pr_{x∼I}[⟨x, v⟩ = 0] ≤ η, i.e., E_{x∼I}[1(⟨x, v⟩ = 0)] ≤ η, in the argument above by E_{x∼I}[p(⟨x, v⟩)²] ≤ η. Since polynomials grow unboundedly for large enough inputs, it is necessary for the uniform distribution on I to have sufficiently light tails to ensure that E_{x∼I}[p(⟨x, v⟩)²] is small. In Lemma A.1, we show that anti-concentration and strictly sub-exponential tails are sufficient to construct such a polynomial.

We can finally ask for an SoS proof of E_{x∼I}[p(⟨x, v⟩)²] ≤ η in the variable v. We prove such certified anti-concentration inequalities for broad families of inlier distributions in Section 5.

(Please note that Sections 3-6 are in the supplementary material.)

3 Acknowledgements

The authors would like to thank the following sources of support. Sushrut Karmalkar was supported by NSF Award CNS-1414023.
Adam Klivans was supported by NSF Award CCF-1717896. Pravesh Kothari was supported by a Schmidt Foundation Fellowship and Avi Wigderson's NSF Award CCF-1412958.