{"title": "Estimating Learnability in the Sublinear Data Regime", "book": "Advances in Neural Information Processing Systems", "page_first": 5455, "page_last": 5464, "abstract": "We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is often possible to accurately estimate this ``learnability'' even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a $d$-dimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with $O(\\sqrt{d})$ samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. We extend these techniques to binary classification, and show that the prediction error of the best linear classifier can be accurately estimated given $O(\\sqrt{d})$ labeled samples. For comparison, in both the linear regression and binary classification settings, even if there is no noise in the labels, a sample size linear in the dimension, $d$, is required to \\emph{learn} any function correlated with the underlying model. We further extend our estimation approach to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. We demonstrate the practical viability of our approaches on synthetic and real data.
This ability to estimate the explanatory value of a set of features (or dataset), even in the regime in which there is too little data to realize that explanatory value, may be relevant to the scientific and industrial settings for which data collection is expensive and there are many potentially relevant feature sets that could be collected.", "full_text": "Estimating Learnability in the Sublinear Data Regime

Weihao Kong
Stanford University
whkong@stanford.edu

Gregory Valiant
Stanford University
gvaliant@cs.stanford.edu

Abstract

We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is possible to accurately estimate this "learnability" even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a d-dimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with O(√d) samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. For comparison, even if the labels are noiseless linear functions of the data, a sample size linear in the dimension, d, is required to learn any function correlated with the underlying model. Our estimation approach also applies to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. In this setting we give a consistent estimator of the fraction of explainable variance that uses o(d) samples.
Finally, our techniques also extend to the setting of binary classification, where we obtain analogous results under the logistic model, for estimating the classification accuracy of the best linear classifier. We demonstrate the practical viability of our approaches on synthetic and real data. This ability to estimate the explanatory value of a set of features (or dataset), even in the regime in which there is too little data to realize that explanatory value, may be relevant to the scientific and industrial settings for which data collection is expensive and there are many potentially relevant feature sets that could be collected.

1 Introduction

Given too little labeled data to learn a model or classifier, is it possible to determine whether an accurate classifier or predictor exists? For example, consider a setting where you are given n datapoints with real-valued labels drawn from some distribution of interest, D. Suppose you are in the regime in which n is too small to learn an accurate prediction model; might it still be possible to estimate the performance that would likely be obtained if, hypothetically, you were to gather more data, say a dataset of size n′ ≫ n, and train a model on that data? We answer this question affirmatively, and show that in the settings of linear regression and binary classification via linear (or logistic) classifiers, it is possible to estimate the likely performance of a (hypothetical) predictor trained on a larger hypothetical dataset, even given an amount of data that is sublinear in the amount that would be required to learn such a predictor.
For concreteness, we begin by describing the flavor of our results in a very basic setting: learning a noisy linear function of high-dimensional data.
Suppose we are given access to independent samples from a d-dimensional isotropic Gaussian, and each sample x ∈ R^d is labeled according to a noisy linear function y = ⟨x, β⟩ + η, where β is the true model and the noise η is drawn (independently) from a distribution of (unknown) variance δ². One natural goal is to estimate the signal-to-noise ratio, 1 − δ²/Var[Y], namely estimating how much of the variation in the label we could hope to explain.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Even in the noiseless setting (δ = 0), it is information theoretically impossible to learn any function that has even a small constant correlation with the labels unless we are given an amount of data that is linear in the dimension, d. Nevertheless, as was recently shown by Dicker [6] in this Gaussian setting with independent noise, it is possible to estimate the magnitude of the noise, δ, and the variance of the label, given only O(√d) samples.
Our results (summarized in Section 1.1) explore this striking ability to estimate the "learnability" of a distribution over labeled data based on relatively few samples. Our results significantly extend previous results, including those of Dicker, in the following senses: 1) We present a unified approach that yields accurate estimation of this learnability when n = o(d), which applies even when the x portion of the datapoints is drawn from a distribution with arbitrary (unknown) covariance. This is very surprising, and was conjectured to be impossible [14], because the best linear model cannot be approximated with o(d) data, nor can the covariance be consistently estimated with o(d) datapoints.
2) Agnostic setting: Our techniques do not require any distributional assumptions on the label, y, in contrast to most previous work, which assumed y is a linear function plus independent noise (not a realistic assumption for many of the practical settings of interest). Instead, our approach directly estimates the fraction of the variance in the label that can be explained via a linear function of x. 3) Binary classification setting: Our techniques naturally extend to the setting of binary classification, provided a strong distributional assumption is made, namely that the data is drawn according to the logistic model (see Section 1.1 for a formal description of this model).
Throughout, we focus on linear models and classifiers. Because some of our results apply when the covariance matrix of the distribution is non-isotropic (and non-Gaussian), the results extend to the many non-linear models that can be represented as a linear function applied to a non-linear embedding of the data, for example settings where the label is a noisy polynomial function of the features. Still, our assumptions on the data generating distribution are very specific for the binary classification results, and our algorithm for this setting does not apply if the two classes have unequal probabilities. We are optimistic that our techniques may be extended in future work to address that setting, as well as more general function classes and losses.
Motivating Applications: Estimating the value of data and dataset selection. In some data-analysis settings, the ultimate goal is to quantify the signal and noise, namely to understand how much of the variation in the quantity of interest can be explained via some set of explanatory variables. For example, in some medical settings, the goal is to understand how much disease risk is associated with genomic factors (versus random luck, or environmental factors, etc.).
In other settings, the goal is to accurately predict a quantity of interest. The key question then becomes "what data should we collect; what features or variables should we try to measure?" The traditional pipeline is to collect a lot of data, train a model, and then evaluate the value of the data based on the performance (or improvement in performance) of the model.
Our results demonstrate the possibility of evaluating the explanatory utility of additional features, even in the regime in which too few data points have been collected to leverage these data points to learn a model. For example, suppose we wish to build a predictor for whether or not someone will get a certain disease. We could begin by collecting a modest amount of genetic data (e.g., for a few hundred patients, record the presence of genetic abnormalities for each of the 20k genes), and a modest amount of epigenetic data. Even if we have data for too few patients to learn a good predictor, we can at least evaluate how much the model would improve if we were to collect a lot more genetic data, versus collecting more epigenetic data. This ability to explore the potential of different features with less data than would be required to exploit those features seems extremely relevant to the many industry and research settings where it is expensive or difficult to gather data.
Alternately, these techniques could be leveraged by data providers in the context of a "verify then buy" model: Suppose I have a large dataset of customer behaviors that I think will be useful for your goal of predicting customer clicks/purchases. Before you purchase access to my dataset, I could give you a tiny sample of the data, too little to be useful to you, but sufficient for you to verify the utility of the dataset.
One final downstream application of our techniques is to the data aggregation and "federated learning" settings.
The approaches of this work can be re-purposed to measure the extent to which two or more labeled datasets have the same (or similar) labeling functions, even in the regime in which there is too little data to learn such labeling functions. (This can be accomplished, for example, by applying the techniques of this paper to the aggregate of the datasets versus individually, and seeing whether the signal-to-noise ratio degrades upon aggregation.) Such a primitive might have fruitful applications in the realm of federated learning, since one of the key questions in such settings is how to decide which entities have similar models, and hence which subsets of entities might benefit from training a model on their combined data.

1.1 Summary of Results

Our first result applies to the setting where the data is drawn according to a d-dimensional distribution with identity covariance, and the labels are noisy linear functions. This result generalizes the results of Dicker [6] and Verzelen and Gassiat [14] beyond the Gaussian setting. Provided there are more than O(√d) datapoints, the magnitude of the noise can be accurately determined:

Proposition 1. [Slight generalization of Lemma 2 in [6] and Corollary 2.2 in [14]] Suppose we are given n labeled examples, (x1, y1), . . . , (xn, yn), with xi drawn independently from a d-dimensional distribution of mean zero, identity covariance, and fourth moments bounded by C. Assume that each label yi = xiβ + η, where the noise η is drawn independently from an (unknown) distribution E with mean 0 and variance δ², and that the labels have been normalized to have unit variance. Then there is an estimator δ̂² that, with probability 1 − τ, approximates δ² with additive error O(C·√(d+n)/(τn)).

The fourth moment condition of the above theorem is formally defined as follows: for all vectors u, v ∈ R^d, E[(xᵀu)²(xᵀv)²] ≤ C·E[(xᵀu)²]·E[(xᵀv)²]. In the case that the data distribution is an isotropic Gaussian, this fourth moment bound is satisfied with C = 3.
In the above setting, it is information theoretically impossible to approximate β, or to accurately predict the yi's, without a sample size that is linear in the dimension, d. As we show, the above result is also optimal, to constant factors, in the constant-error regime: no algorithm can distinguish the case that the label is pure noise from the case that the label has a significant signal using o(√d) datapoints (see, e.g., Proposition 4.2 in [15]).
Our estimation machinery extends beyond the isotropic setting, and we prove an analog of Proposition 1 in the more general setting where the datapoints xi are drawn from a d-dimensional distribution with (unknown) non-isotropic covariance. This setting is considerably more challenging than the isotropic setting because the geometry of the datapoints depends heavily on the covariance, yet the covariance cannot be accurately estimated with o(d) samples. Though our results are weaker than in the isotropic setting, we still establish accurate estimation of the unexplained variance in the sublinear regime, though we require a sample size O_ε(d^(1−√ε)) to obtain an estimate within error O(ε).
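The identity-covariance guarantee can be sanity-checked in simulation. The sketch below is our own illustration (dimensions, seed, and variable names are arbitrary choices, not the paper's exact algorithm); it uses the pair statistic E[y_i y_j x_iᵀx_j] = ‖β‖² for i ≠ j, the identity-covariance special case of the estimator described in Section 3:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2000, 1000                 # n well below d: too little data to learn beta
beta = rng.standard_normal(d)
beta *= np.sqrt(0.5) / np.linalg.norm(beta)   # signal ||beta||^2 = 0.5
delta2 = 0.5                                  # noise variance, so Var[y] = 1

X = rng.standard_normal((n, d))               # isotropic (identity-covariance) data
y = X @ beta + np.sqrt(delta2) * rng.standard_normal(n)

# For i != j, E[y_i y_j x_i^T x_j] = ||beta||^2 when Cov(x) = I, so averaging
# over distinct pairs gives an unbiased estimate of the explained variance.
v = X.T @ y                                   # v @ v sums y_i y_j x_i^T x_j over all (i, j)
pairs = v @ v - np.sum(y**2 * np.sum(X**2, axis=1))   # drop the i == j terms
beta_sq_hat = pairs / (n * (n - 1))

delta2_hat = np.var(y) - beta_sq_hat          # estimated unexplained variance
print(delta2_hat)                             # should land near the true value 0.5
```

Even though no consistent estimate of β itself is possible at this sample size, the scalar δ̂² concentrates, degrading at the √(d+n)/(τn) rate of Proposition 1 as n shrinks toward √d.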
In the case where the covariance matrix has a constant condition number, the sample size can be reduced to n = O_ε(d^(1−1/log(1/ε))).
Our results in the non-isotropic setting apply to the following standard model of non-isotropic distributions: the distribution is specified by an arbitrary d × d real-valued matrix S, and a univariate random variable Z with mean 0, variance 1, and bounded fourth moment. Each sample x ∈ R^d is then obtained by computing x = Sz, where z ∈ R^d has entries drawn independently according to Z. In this model, the covariance of x will be SSᵀ. This model is fairly general (by taking Z to be a standard Gaussian, this model can represent any d-dimensional Gaussian distribution, and it can also represent any rotated and scaled hypercube, etc.), and is widely considered in the statistics literature (see, e.g., [17, 1]). While our theoretical results rely on this modeling assumption, our algorithm is not tailored to this specific model, and likely performs well in more general settings.

Theorem 1. Suppose we are given n < d labeled examples, (x1, y1), . . . , (xn, yn), with xi = Szi, where S is an unknown arbitrary d × d real matrix and each entry of zi is drawn independently from a one-dimensional distribution with mean zero, variance 1, and constant fourth moment. Assume that each label yi = xiβ + η, where the noise η is drawn independently from an unknown distribution with mean 0 and variance δ², and that the labels have been normalized to have unit variance. Then there is an algorithm that takes the n labeled samples and parameters k, σ_max, σ_min satisfying σ_max·I ⪰ SᵀS ⪰ σ_min·I, and with probability 1 − τ outputs an estimate δ̂² with additive error

|δ̂² − δ²| ≤ min(1/k², 2e^(−(k−1)√(σ_min/σ_max)))·σ_max·‖β‖² + (f(k)/τ)·Σ_{i=2}^{k} d^((i−1)/2)/n^(i/2),

where f(k) = k^O(k).

Considering the case where ‖β‖ and σ_max are constants, setting k = 1/√ε in the case where σ_min = 0, or k = O(log(1/ε)) in the case where σ_min is a constant greater than 0, yields the following corollary:

Corollary 1. In the setting of Theorem 1, with constant ‖β‖ and σ_max, the noise can be approximated to error O(ε) with n = O(poly(1/ε)·d^(1−√ε)). With the additional assumption that σ_min is a constant greater than 0, the noise can be approximated to error O(ε) with n = O(poly(log(1/ε))·d^(1−1/log(1/ε))).

Finally, we establish a lower bound demonstrating that, without any assumptions on ‖Σ‖ or ‖β‖, no sublinear sample estimation is possible. See Theorem 3 of the supplementary material for the lower bound statement.
The Agnostic Setting. Our algorithms and techniques do not rely on the assumption that the labels consist of a linear function plus independent noise, and our results partially extend to the agnostic setting. Formally, assuming that the label, y, can have any joint distribution with x, we show that our algorithms will accurately estimate the fraction of the variance in y that can be explained via (the best) linear function of x, namely the quantity inf_β E[(βᵀx − y)²]. The analog of Proposition 1 in the agnostic setting is the following:
Theorem 2.
Suppose we are given n labeled examples, (x1, y1), . . . , (xn, yn), with (xi, yi) drawn independently from a (d + 1)-dimensional distribution where xi has mean zero and identity covariance, yi has mean zero and variance 1, and the fourth moments of the joint distribution (x, y) are bounded by C. Then there is an estimator δ̂² that, with probability 1 − τ, approximates inf_β E[(βᵀx − y)²] with additive error O(C·√(d+n)/(τn)).

In the setting where the distribution of x is non-isotropic, the algorithm to which Theorem 1 applies still extends to this agnostic setting. The estimate of the unexplained variance is still accurate in expectation, though additional assumptions on the (joint) distribution of (x, y) would be required to bound the variance of the estimator in this agnostic and non-isotropic setting. Such conditions are likely to be satisfied in many practical settings, though a fully general agnostic and non-isotropic analog of Theorem 1 likely does not hold.
Binary Classification. Our approaches and techniques for the linear regression setting can also be applied to the important setting of binary classification, namely estimating the performance of the best linear classifier in the regime in which there is insufficient data to learn any accurate classifier. As an initial step along these lines, we obtain strong results in a restricted model of Gaussian data with labels corresponding to the latent variable interpretation of logistic regression:
Theorem 3. Suppose we are given n < d labeled examples, (x1, y1), . . . , (xn, yn), with xi drawn independently from a Gaussian distribution with mean 0 and covariance Σ, where Σ is an unknown arbitrary d × d real matrix.
Assume that each label yi takes value 1 with probability g(βᵀxi) and −1 with probability 1 − g(βᵀxi), where g(x) = 1/(1 + e^(−x)) is the sigmoid function and β is an unknown parameter vector. Then there is an algorithm that takes the n labeled samples and parameters k, σ_max, σ_min satisfying σ_max·I ⪰ Σ ⪰ σ_min·I, and with probability 1 − τ outputs an estimate err̂_opt with additive error

|err̂_opt − err_opt| ≤ c·( min(1/k², 2e^(−(k−1)√(σ_min/σ_max)))·σ_max·‖β‖² + (f(k)/τ)·Σ_{i=2}^{k} d^((i−1)/2)/n^(i/2) )^(1/2),

where err_opt is the classification error of the best linear classifier, f(k) = k^O(k), and c is an absolute constant.

In the setting where the distribution of x is an isotropic Gaussian, we obtain the simpler result that the classification error of the best linear classifier can be accurately estimated with O(√d) samples. This is information theoretically optimal; see Theorem 8 of the supplementary material.
Corollary 2. In the setting of Theorem 3, but where each datapoint xi is drawn according to a d-dimensional isotropic Gaussian distribution, there is an algorithm that takes n labeled samples and, with probability 1 − τ, outputs an estimate err̂_opt with additive error |err̂_opt − err_opt| ≤ c·(√d/n)^(1/2), where err_opt is the classification error of the best linear classifier and c is an absolute constant.
Despite the strong assumptions on the data-generating distribution in the above theorem and corollary, the algorithm to which they apply seems to perform quite well on real-world data, and is capable of accurately estimating the classification error of the best linear predictor, even in the data regime where it is impossible to learn any good predictor.
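The isotropic logistic setting can be simulated in a few lines. The sketch below is our own illustrative construction (not the paper's algorithm; dimensions, seed, and the bisection-based inversion are our choices): it estimates ‖E[yx]‖² from pairs of distinct samples, inverts the monotone map from ‖β‖ to that quantity using Stein's identity for Gaussian x, and reads off the error of the classifier sign(βᵀx):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5000, 2500                      # n < d: too few samples to learn the classifier
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)           # true ||beta|| = 1

def g(t):                              # sigmoid link, written via tanh for stability
    return 0.5 * (1.0 + np.tanh(0.5 * t))

X = rng.standard_normal((n, d))        # isotropic Gaussian features
y = np.where(rng.random(n) < g(X @ beta), 1.0, -1.0)

# For i != j, E[y_i y_j x_i^T x_j] = ||E[yx]||^2, and Stein's identity gives
# E[yx] = 2 E[g'(beta^T x)] beta, so this scalar pins down ||beta||.
v = X.T @ y
m2_hat = max((v @ v - np.sum(X**2)) / (n * (n - 1)), 0.0)

z = rng.standard_normal(200_000)       # 1-D Monte Carlo grid for the link expectations
def m2_of(r):                          # ||E[yx]||^2 when ||beta|| = r
    s = g(r * z)
    return (2.0 * r * np.mean(s * (1.0 - s))) ** 2

lo, hi = 0.0, 50.0                     # invert the monotone map r -> m2_of(r)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if m2_of(mid) < m2_hat else (lo, mid)
r_hat = 0.5 * (lo + hi)

err_hat = np.mean(g(-r_hat * np.abs(z)))   # error of sign(beta^T x) at ||beta|| = r_hat
print(err_hat)
```

For ‖β‖ = 1 the optimal error is about 0.33, and the estimate lands close to it even in this regime where reliably recovering β itself is not possible.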
One partial explanation is that our approach can be easily adapted to a wide class of "link functions," beyond just the sigmoid function addressed by the above results. Additionally, for many smooth, monotonic functions, the resulting algorithm is almost identical to the algorithm corresponding to the sigmoid link function.

1.2 Related Work

For the specific question of estimating the signal to variance ratio (or signal to noise), also referred to as the "unexplained variance," there are many classic and more recent estimators that perform well in the linear and super-linear data regime. These estimators apply to the most restrictive setting we consider, where each label y = βᵀx + η is given as a linear function of x plus independent noise η of variance δ². Two common estimators for δ² involve first computing the parameter vector β̂ that minimizes the squared error on the n datapoints. These estimators are 1) the "naive" or "maximum likelihood" estimator, (y − Xβ̂)ᵀ(y − Xβ̂)/n, and 2) the "unbiased" estimator, (y − Xβ̂)ᵀ(y − Xβ̂)/(n − d), where y refers to the vector of n labels and X is the n × d matrix whose rows represent the n datapoints. Of course, both of these estimators are zero (or undefined) in the regime where n ≤ d, as the prediction error (y − Xβ̂) is identically zero in this regime. Additionally, the variance of the unbiased estimator increases as n approaches d, as is evident in our empirical experiments where we compare our estimators with this unbiased estimator. In the regime where n < d, variants of these estimators might still be applied, but where β̂ is computed as the solution to a regularized regression (see, e.g.
[16]); however, such approaches seem unlikely to apply in the sublinear regime where n = o(d), as the recovered parameter vector β̂ is not significantly correlated with the true β in this regime, unless strong assumptions are made on β.
Indeed, there has been a line of work on estimating the noise level δ² assuming that β is sparse [9, 7, 13, 12, 2]. These works give consistent estimates of δ² in the regime where n = Ω(k log(d)), where k is the sparsity of β. More generally, there is an enormous body of work on the related problem of feature selection. The basis-dependent nature of this question (i.e., identifying which features are relevant) and the setting of sparse β are quite different from the setting we consider, where the signal may be a dense vector.
There have been recent results on estimating the variance of the noise, without assumptions on β, in the n < d regime. In the case where n < d but n/d approaches a constant c ≤ 1, Janson et al. proposed the EigenPrism [10] to estimate the noise level. Their theoretical results rely on the assumptions that the data x is drawn from an isotropic Gaussian distribution and that the label is a linear function plus independent noise, and the performance bounds become trivial if n/d → 0.
The most similar work to our paper is the work of Dicker [6], which proposed an estimator of δ² with error rate O(√d/n) in the setting where the data x is drawn from an isotropic Gaussian distribution and the label is a linear function plus independent Gaussian noise. Their estimator is fairly similar to ours in the identity covariance setting and gives the same error rate. However, our result is more general in the following senses: 1) Our estimator and analysis do not rely on Gaussianity assumptions; 2) Our results apply beyond the setting where the label y is a linear function of x plus independent noise, and estimate the fraction of the variance that can be explained via a linear function (the "agnostic" setting); and 3) our approach extends to the unknown non-isotropic covariance setting.
In case the sparsity of β is unknown, Verzelen and Gassiat [14] introduced a hybrid approach which combines Dicker's result in the dense regime and the Lasso in the sparse regime to achieve consistent estimation of δ² using min(k log(d), √(d log(d))) samples in the isotropic covariance setting, where k is the unknown sparsity of β, and they showed the optimality of the algorithm. In the unknown covariance, dense β setting, they conjectured that consistent estimation of δ² is not possible with o(d) samples; our Theorem 1 shows that this conjecture is false.
Finally, there is a body of work from the theoretical computer science community on "testing" whether a function belongs to a certain class, including work on testing linearity [3, 4] and monotonicity [8, 5]. Most of this work is in the "query model," where the algorithm can (adaptively) choose a point x and obtain its label ℓ(x), and the flavor of results is very different from the setting we consider.

2 Intuition for Sublinear Estimation

We begin by describing one intuition for why it is possible to estimate the magnitude of the noise using only O(√d) samples, in the isotropic setting. Suppose we are given data x1, . . . , xn drawn i.i.d. from N(0, I_d), and let y1, . . . , yn represent the labels, with yi = βᵀxi + η for a random vector β ∈ R^d and η drawn independently from N(0, δ²).
Fix β, and consider partitioning the datapoints into two sets according to whether the label is positive or negative. In the case where the labels are complete noise (δ² = 1), the expected value of a positively labeled point is the same as that of a negatively labeled point, namely the zero vector. In the case where there is little noise, the expected value μ+ of a positive point will be different from that of a negative point, μ−, and the distance between these points corresponds to the distance between the mean of the 'top' half of a Gaussian and the 'bottom' half of a Gaussian. Furthermore, this distance between the expected means will smoothly vary between 0 and 2√(2/π) as the variance of the noise, δ², varies between 1 and 0.
The crux of the intuition for the ability to estimate δ² in the regime where n = O(√d) is the following observation: while the empirical means of the positive and negative points have high variance in the n = o(d) regime, it is possible to accurately estimate the distance between μ+ and μ− from these empirical means! At a high level, this is because the empirical means consist of d coordinates, each of which has a significant amount of noise. However, their squared distance is just a single number which is a sum of d quantities, and we can leverage concentration in the amount of noise contributed by these d summands to save a √d factor. This closely mirrors the folklore result that it requires O(d) samples to accurately estimate the mean of an identity covariance Gaussian with unknown mean, N(μ, I_d), though the norm of the mean ‖μ‖ can be estimated to error ε using only n = O(√d/ε) samples.
Our actual estimators, even in the isotropic case, do not directly correspond to the intuitive argument sketched in this section.
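The folklore gap just mentioned is easy to see numerically. The snippet below is our own illustration (arbitrary d, n, and seed; it is not the paper's estimator): with n ≪ d, the naive plug-in estimate of ‖μ‖² carries a d/n bias, while the cross-term estimate over distinct pairs concentrates around the truth.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10_000, 400                     # far fewer samples than dimensions
mu = rng.standard_normal(d)
mu *= 2.0 / np.linalg.norm(mu)         # true ||mu||^2 = 4
X = mu + rng.standard_normal((n, d))   # n samples from N(mu, I_d)

# Plug-in estimate: ||x_bar||^2 is biased upward by exactly d/n in expectation.
xbar = X.mean(axis=0)
plug_in = xbar @ xbar                  # expectation ||mu||^2 + d/n = 4 + 25

# Cross-term estimate: for i != j, E[x_i^T x_j] = ||mu||^2, with no bias.
s = X.sum(axis=0)
cross = (s @ s - np.sum(X**2)) / (n * (n - 1))
print(plug_in, cross)                  # plug_in lands near 29, cross near 4
```

Estimating μ itself to comparable accuracy would require n = Ω(d) samples; the scalar ‖μ‖², being a sum over d noisy coordinates, concentrates with far fewer.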
In particular, there is no partitioning of the data according to the sign of the label, and the unbiased estimator that we construct does not rely on any Gaussianity assumption.

3 The Estimators, Regression Setting

The basic idea of our proposed estimator is as follows. Given a joint distribution over (x, y) where x has mean 0 and covariance Σ, the classical least squares estimator which minimizes the unexplained variance takes the form β = E[xxᵀ]⁻¹E[yx] = Σ⁻¹E[yx], and the corresponding value of the unexplained variance is E[(y − βᵀx)²] = E[y²] − βᵀΣβ. Note that this definition of β coincides with the model parameter in the case that y = βᵀx + η. The variance of the labels y can be estimated up to 1/√n error with n samples, after which the problem reduces to estimating βᵀΣβ. While we do not have an unbiased estimator of βᵀΣβ, as we show, we can construct an unbiased estimator for βᵀΣᵏβ for any integer k ≥ 2.
To see the utility of estimating these "higher moments," assume for simplicity that Σ is a diagonal matrix. Consider the distribution over R consisting of d point masses, with the ith point mass located at Σ_{i,i} with probability mass β_i²/‖β‖². The problem of estimating βᵀΣβ is now precisely the problem of approximating the first moment of this distribution, and we are claiming that we can compute unbiased (and low variance) estimates of βᵀΣᵏβ for k = 2, 3, . . ., which exactly correspond to the 2nd, 3rd, etc. moments of this distribution of point masses. Our main theorem follows from the following two components: 1) There is an unbiased estimator that can estimate the kth (k ≥ 2) moment of the distribution using only O(d^(1−1/k)) samples. 2) Given accurate estimates of the 2nd, 3rd, . . .
, kth moments, one can approximate the first moment with error O(1/k²). The main technical challenge is the first component: constructing and analyzing the unbiased estimators for the higher moments. Analyzing the variance of these unbiased estimators is quite complicated, as a significant amount of machinery needs to be developed to deal with the combinatorial number of types of "cross-terms" in the variance expression for the estimators. Fortunately, we are able to leverage some techniques from [11], which bounds similar looking moments (with the rather different goal of recovering the covariance spectrum). The second component of our approach, leveraging estimates of the "higher moments" to estimate the "first moment," follows easily from standard results on polynomial approximation. Algorithm 1, to which Theorem 1 applies, describes our estimator in the general setting where the data has an arbitrary covariance matrix. The proof of correctness of Algorithm 1, establishing Theorem 1, is given in a self-contained form in the supplementary material.
In the special case where the data distribution has identity covariance, βᵀΣ²β = βᵀΣβ simply because I² = I, and hence we do have a simple unbiased estimator, which is constructed by Algorithm 1 by letting the input polynomial p(x) = 1. To see the intuition for the unbiased estimators of βᵀΣᵏβ computed in Algorithm 1, we examine the k = 2 case: consider drawing two independent samples (x1, y1), (x2, y2) from our linear model plus noise. Indeed, y1·y2·x1ᵀx2 is an unbiased estimator of βᵀΣ²β, because E[y1·y2·x1ᵀx2] = E[y1x1ᵀ]·E[y2x2] = βᵀΣ²β. Given n samples, by linearity of expectation, a natural unbiased estimate is hence to compute this quantity for each pair

Algorithm 1 Estimating Linearity, General Covariance
Input: X = [x1; . . . ; xn], y = [y1; . . . ; yn], and a degree-k polynomial p(x) = Σ_{i=0}^{k−2} a_i·x^(i+2) that approximates the function f(x) = x for all x ∈ [σ_min, σ_max], where σ_min and σ_max are the minimum and maximum singular values of the covariance of the distribution from which the xi's are drawn.
• Set A = XXᵀ, and let G = A_up be the matrix A with the diagonal and lower-triangular entries set to zero.
Output: yᵀy/n − Σ_{i=0}^{k−2} a_i·yᵀG^(i+1)y/(n choose i+2)

(of distinct) samples, and take the average of these (n choose 2) estimates, which is what the output expression computes, since E[yᵀGy] = E[Σ_i