{"title": "Asymptotics for Sketching in Least Squares Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 3675, "page_last": 3685, "abstract": "We consider a least squares regression problem where the data has been generated from a linear model, and we are interested to learn the unknown regression parameters. We consider \"sketch-and-solve\" methods that randomly project the data first, and do regression after. Previous works have analyzed the statistical and computational performance of such methods. However, the existing analysis is not fine-grained enough to show the fundamental differences between various methods, such as the Subsampled Randomized Hadamard Transform (SRHT) and Gaussian projections. In this paper, we make progress on this problem, working in an asymptotic framework where the number of datapoints and dimension of features goes to infinity. We find the limits of the accuracy loss (for estimation and test error) incurred by popular sketching methods. We show separation between different methods, so that SRHT is better than Gaussian projections. Our theoretical results are verified on both real and synthetic data. The analysis of SRHT relies on novel methods from random matrix theory that may be of independent interest.", "full_text": "Asymptotics for Sketching in Least Squares\n\nEdgar Dobriban\n\nDepartment of Statistics\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\ndobriban@wharton.upenn.edu\n\nSifan Liu\u2217\n\nDepartment of Statistics\n\nStanford University\nStanford, CA 94305\n\nsfliu@stanford.edu\n\nAbstract\n\nWe consider a least squares regression problem where the data has been gener-\nated from a linear model, and we are interested to learn the unknown regression\nparameters. We consider \"sketch-and-solve\" methods that randomly project the\ndata \ufb01rst, and do regression after. Previous works have analyzed the statistical and\ncomputational performance of such methods. 
However, the existing analysis is not\n\ufb01ne-grained enough to show the fundamental differences between various methods,\nsuch as the Subsampled Randomized Hadamard Transform (SRHT) and Gaussian\nprojections. In this paper, we make progress on this problem, working in an asymp-\ntotic framework where the number of datapoints and dimension of features goes\nto in\ufb01nity. We \ufb01nd the limits of the accuracy loss (for estimation and test error)\nincurred by popular sketching methods. We show separation between different\nmethods, so that SRHT is better than Gaussian projections. Our theoretical results\nare veri\ufb01ed on both real and synthetic data. The analysis of SRHT relies on novel\nmethods from random matrix theory that may be of independent interest.\n\n1\n\nIntroduction\n\nTo enable learning from large datasets, randomized algorithms such as sketching or random pro-\njections are an effective approach of wide applicability (Mahoney, 2011; Woodruff, 2014; Drineas\nand Mahoney, 2016). In this work, we study the statistical performance of sketching algorithms\nin linear regression. Various versions of this fundamental problem have been studied before (see\ne.g., Drineas et al., 2006, 2011; Dhillon et al., 2013; Ma et al., 2015; Raskutti and Mahoney, 2016;\nThanei et al., 2017, and the references therein). Speci\ufb01cally, in a generative model where the data are\nsampled from a linear regression model, Raskutti and Mahoney (2016) have recently compared the\nstatistical performance of various sketching algorithms, such as Gaussian projections and subsampled\nrandomized Hadamard transforms (SRHT) (introduced earlier in Sarlos (2006); Ailon and Chazelle\n(2006)).\nHowever, the known results are not precise enough to enable us to distinguish between the various\nsketching methods. 
For instance, the statistical performance of Gaussian projections and SRHT is predicted to be the same (Raskutti and Mahoney, 2016), whereas the SRHT has been observed to work better in practice (Mahoney, 2011; Woodruff, 2014; Drineas and Mahoney, 2016). To address this issue, in this paper we introduce a new approach to studying sketching in least squares linear regression. As a key difference from prior work, we adopt a "large-data" asymptotic limit, where the relevant dimensions and sample sizes tend to infinity and can have arbitrary aspect ratios. By leveraging very recent results from asymptotic random matrix theory and free probability theory, we get more accurate results for the performance of sketching.

*The bulk of this work was performed while SL was a student at Tsinghua University.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Summary of main results. We have a linear model Y = Xβ + ε of size n × p and do regression after sketching on the data (SY, SX). We show the increase in three loss functions due to sketching: VE (variance efficiency, the increase in parameter estimation error), PE (prediction efficiency), and OE (out-of-sample prediction efficiency). The assumptions on X depend on the sketching method. Here γ = p/n is the limiting aspect ratio, w and s are the scale and selection variables of the elliptical model (Section 2.6), and η⁻¹ denotes the inverse η-transform.

| Assumption on X | Assumption on S | VE | PE | OE |
| Arbitrary | iid entries | 1 + (n − p)/(r − p) | 1 + (n − p)/(r − p) | (nr − p²)/[n(r − p)] |
| Arbitrary | Haar/Hadamard | (n − p)/(r − p) | (n − p)/(r − p) | r(n − p)/[n(r − p)] |
| Ortho-invariant | Uniform sampling | (n − p)/(r − p) | (n − p)/(r − p) | r(n − p)/[n(r − p)] |
| Elliptical: WZΣ^(1/2) | Leverage sampling | η⁻¹_{sw²}(1 − γ)/η⁻¹_{w²}(1 − γ) | 1 + E[w²(1 − s)]η⁻¹_{sw²}(1 − γ)/γ | [1 + E(w²)η⁻¹_{sw²}(1 − γ)]/[1 + E(w²)η⁻¹_{w²}(1 − γ)] |

We study many of the most popular and important sketching methods in a unified framework, including random projection methods (Gaussian and iid projections, uniform orthogonal, i.e. Haar, projections, and subsampled randomized Hadamard transforms) as well as random sampling methods (including uniform, randomized leverage-based, and greedy leverage sampling). We find clean formulas for the accuracy loss of these methods, compared to standard least squares. As an improvement over prior work, our formulas are accurate down to the constant. We verify these results in extensive simulations and on two empirical datasets.

1.1 Problem setup

Suppose we observe n datapoints (xᵢ, yᵢ), i = 1, . . . , n, where xᵢ are the p-dimensional features (or predictors, covariates) of the i-th datapoint, and yᵢ are the continuous outcomes (or responses). We assume the usual linear model yᵢ = xᵢ⊤β + εᵢ, where β is an unknown p-dimensional parameter. Also, εᵢ is the zero-mean noise, with entries uncorrelated and of equal variance σ² across samples. In matrix form, we have Y = Xβ + ε, where X is the n × p data matrix with i-th row xᵢ⊤, and Y is the n × 1 outcome vector with i-th entry yᵢ.
Then the usual ordinary least squares (OLS) estimator is

β̂ = (X⊤X)⁻¹X⊤Y,

if rank(X) = p. This estimator is a gold standard when n > p, extremely popular in practice, and with many optimality properties. However, when n, p are large, say on the order of millions or billions, the natural O(np²) time-complexity algorithms for computing it can be prohibitively expensive. Sketching reduces the size of the problem by multiplying (X, Y) by the r × n matrix S to obtain the sketched data (X̃, Ỹ) = (SX, SY). The dimensions are now r × p and r × 1. Then, instead of doing regression of Y on X, we do regression of Ỹ on X̃. The solution is

β̂ₛ = (X̃⊤X̃)⁻¹X̃⊤Ỹ,

if rank(SX) = p. In the remainder, we assume that both X and SX have full column rank, which happens with probability one in the generic case if r > p. The computational cost decreases from np² to rp², which is significant if r ≪ n. In parallel, the statistical error increases. There is a tradeoff between the computational cost and the statistical error. The natural question is then, how much does the error increase?

Error Criteria. To compare the statistical efficiency of the estimators β̂ and β̂ₛ, we evaluate the relative value of their mean squared errors. If we use the full OLS estimator, we incur a mean squared error of E‖β̂ − β‖². If we use the sketched OLS estimator, we incur a mean squared error of E‖β̂ₛ − β‖² instead. To see how much efficiency we lose, it is natural and customary in classical statistics to consider the relative efficiency, which is their ratio (e.g., Van der Vaart, 1998). We call this the variance efficiency (VE), because the MSE for estimation can be viewed as the sum of variances of the OLS estimator.
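To make the sketch-and-solve pipeline above concrete, here is a minimal numpy sketch (our own illustrative script, not the authors' code; the Gaussian S used here is just one of the choices of sketching matrix studied below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, sigma = 1000, 20, 200, 1.0

# Generate data from the linear model Y = X beta + eps
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
Y = X @ beta + sigma * rng.standard_normal(n)

# Full OLS: beta_hat = (X^T X)^{-1} X^T Y, via a stable least-squares solver
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Sketch-and-solve: compress to (SX, SY) with an r x n sketching matrix S,
# then solve the same regression on the smaller r x p problem
S = rng.standard_normal((r, n))
beta_s, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)

err_full = np.sum((beta_hat - beta) ** 2)
err_sketch = np.sum((beta_s - beta) ** 2)
print(err_full, err_sketch)  # the sketched estimator has larger error
```

The sketched solve touches an r × p problem instead of an n × p one, which is the computational gain; the increase from `err_full` to `err_sketch` is exactly the statistical price that the efficiencies defined next quantify.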
Hence, we define

VE(β̂ₛ, β̂) = E‖β̂ₛ − β‖² / E‖β̂ − β‖².

This quantity is greater than or equal to unity, so VE ≥ 1, and smaller is better. An accurate sketching method would achieve an efficiency close to unity, VE ≈ 1. Our goal will be to find VE. For completeness, we also consider the relative prediction efficiency (PE), residual efficiency (RE), and out-of-sample efficiency (OE):

PE = E‖Xβ̂ₛ − Xβ‖² / E‖Xβ̂ − Xβ‖²,  RE = E‖Y − Xβ̂ₛ‖² / E‖Y − Xβ̂‖²,  OE = E(xₜ⊤β̂ₛ − yₜ)² / E(xₜ⊤β̂ − yₜ)²,

where (xₜ, yₜ) is a test datapoint generated from the same model, yₜ = xₜ⊤β + εₜ, where xₜ, εₜ are independent of X, ε, and only xₜ is observable. The PE quantifies the loss of accuracy in predicting the regression function E[Y|X] = Xβ, the RE quantifies the increase in residuals, while the OE quantifies the increase in test error.

1.2 Our contributions

We consider a "large data" asymptotic limit, where both the dimension p and the sample size n tend to infinity, and their aspect ratio converges to a constant. The size r of the sketched data is also proportional to the sample size. Specifically, n, p, and r tend to infinity such that the original aspect ratio converges, p/n → γ ∈ (0, 1), while the data reduction factor also converges, r/n → ξ ∈ (γ, 1). Under these asymptotics, we find the limits of the relative efficiencies under various conditions on X and S.
This asymptotic setting is different from the usual one under which sketching is studied, where n ≫ r (e.g., Mahoney, 2011; Woodruff, 2014; Drineas and Mahoney, 2016). However, our results are accurate even in that regime. It may be possible to get convergence rates for the projections with iid entries using known results on convergence rates of Stieltjes transforms.

In practice, we do not think of n or p as growing. Instead, for any given dataset with given n and p, we use our results with γ = p/n as an approximation. If n and p are both relatively large (say larger than 20), then our results are already quite accurate.

It turns out that the different methods have different performance, and they are applicable to different data matrices. Our main results are summarized in Table 1. For instance, when X is arbitrary and S is a matrix with iid entries, the variance efficiency is 1 + (n − p)/(r − p), so the estimation error increases by that factor due to sketching. The results are stated formally in theorems in the remainder of the paper.

The formulas are accurate and simple. We observe that our results are accurate, both in simulations and in two empirical data analysis examples; see Section 3. In particular, they go beyond earlier work (Raskutti and Mahoney, 2016) because they are accurate not just up to the rate, but also down to the precise constants, even in relatively small samples (see Section A.16 in the supplement for a comparison). Moreover, they have simple expressions and do not depend on any un-estimable parameters of the data.

Separation between sketching methods. Our results enable us to compare the different sketching methods at a greater level of detail than previously known. For instance, in estimation error (VE), we have VE_iid = VE_Haar + 1 = VE_Hadamard + 1.
This shows that the estimation error for uniform orthogonal (Haar) random projections and the subsampled randomized Hadamard transform (SRHT) (Ailon and Chazelle, 2006) is less than for iid random projections. This shows a separation between orthogonal and iid random projections.

Tradeoff between computation and statistical accuracy. Each sketching method becomes more accurate as the projection dimension increases. However, this comes at an increased computational cost. We give a summary of the algorithmic complexity and statistical accuracy (variance efficiency) of each method in Section A.12, as well as a numerical comparison in Section A.17 in the supplement. As an illustrative example, consider a dataset with n = 10⁷ and p = 10⁵, where we want to use the SRHT before doing least squares. Our results show that if we project down to r < n samples, then the test error increases by a factor of r(n − p)/[n(r − p)]. Suppose now that we are willing to tolerate an increase of 1.1x in our test error. Setting r(n − p)/[n(r − p)] = 1.1 gives r = 10⁶. So we can reduce the data size 10x, and only incur an increase of 1.1x in test error! This is a striking illustration of the power of sketching.

Technical contributions. As a specific technical contribution, our results rely on asymptotic random matrix theory (e.g., Bai and Silverstein, 2010; Couillet and Debbah, 2011; Yao et al., 2015). However, we emphasize that "standard" results such as the Marchenko-Pastur law are not enough. For instance, to study the subsampled randomized Hadamard transform (SRHT), we discovered that we can use the results of Anderson and Farrell (2014) on asymptotically liberating sequences; see also Tulino et al. (2010) for prior work. To our knowledge, this is the first time that these results are used in any statistical learning application.
Given the importance of the SRHT, and the notoriously difficult nature of analyzing it, we view this as a technical innovation of broader interest.

Since there are already many different sketching methods proposed before, we do not attempt to introduce new ones here. Our goal is instead to develop a clear theory. This can lead to an increased understanding of the performance of the various methods, helping practitioners choose between them. Our theoretical framework may also help in analyzing and understanding new methods.

1.3 Related work

In this section we review some recent related work. Due to space limitations, we can only mention a small subset of it. For overviews of sketching and random projection methods from a numerical linear algebra perspective, see Halko et al. (2011); Mahoney (2011); Woodruff (2014); Drineas and Mahoney (2017). For a theoretical computer science perspective, see Vempala (2005).

Drineas et al. (2006) show that leverage score sampling leads to better results than uniform sampling. Drineas et al. (2012) show furthermore that leverage scores can be approximated quickly using the Hadamard transform. Drineas et al. (2011) propose the fast Hadamard transform for sketching in regression. They prove strong relative error bounds on the realized in-sample prediction error for arbitrary input data. Our results concern a different setting that assumes a generative statistical model.

One of the most closely related works is Raskutti and Mahoney (2016). They study sketching algorithms from both statistical and algorithmic perspectives. However, they focus on a different setting, where n ≫ r, and prove bounds on RE and PE. For instance, they discover that RE can be bounded even when r is not too large, proving bounds such as RE ≤ 1 + 44p/r for subsampling and subgaussian projections. In contrast, we show more precise results such as |RE − r/(r − p)| = o(1) (without the constant 44).
This holds without additional assumptions for iid projections, and under the slightly stronger condition of ortho-invariance for subsampling. We show that these conditions are reasonable, because our results are accurate both in simulations and in empirical data analysis examples.

Other related works include sketching with convex constraints (Pilanci and Wainwright, 2015), column-wise sketching (Maillard and Munos, 2009; Kabán, 2014; Thanei et al., 2017), tensor sketching (Pham and Pagh, 2013; Diao et al., 2017; Malik and Becker, 2018), subspace embeddings for nonlinear kernel mappings (Avron et al., 2014), partial sketching (Dhillon et al., 2013; Ahfock et al., 2017), frequent directions in the streaming model (Liberty, 2013; Huang, 2018), the count-min sketch (Cormode and Muthukrishnan, 2005), and randomized dimension reduction in stochastic geometry (Oymak and Tropp, 2017). Sketching also has numerous applications to problems in machine learning and data science, such as clustering (Cannings and Samworth, 2017), hypothesis testing (Lopes et al., 2011), bandits (Kuzborskij et al., 2018), etc.

2 Theoretical results

We present our theoretical results in this section. All proofs are in the supplemental material.

2.1 Gaussian projection

For Gaussian random projection, the sketching matrix S is generated from the Gaussian distribution. An advantage of Gaussian projections is that generating and multiplying Gaussian matrices is embarrassingly parallel, making them appropriate for certain distributed and cloud-computing architectures.

For the performance of Gaussian sketching, we have the following result. The first part gives exact formulas for the variance, prediction, and out-of-sample efficiencies VE, PE, and OE. The second part simplifies the OE expression for a special class of design matrices X.

Theorem 2.1 (Gaussian projection). Suppose S is an r × n Gaussian random matrix with iid standard normal entries. Let X be an arbitrary n × p matrix with full column rank p, and suppose that r − p > 1. Then the efficiencies have the following form:

VE(β̂ₛ, β̂) = PE(β̂ₛ, β̂) = 1 + (n − p)/(r − p − 1),

OE(β̂ₛ, β̂) = [1 + (1 + (n − p)/(r − p − 1)) · xₜ⊤(X⊤X)⁻¹xₜ] / [1 + xₜ⊤(X⊤X)⁻¹xₜ].

Second, suppose in addition that X is also random, having the form X = ZΣ^(1/2), where Z ∈ R^(n×p) has iid entries of zero mean, unit variance and finite fourth moment, and Σ ∈ R^(p×p) is a deterministic positive definite matrix. If the test datapoint is drawn independently from the same population as X, i.e. xₜ = Σ^(1/2)zₜ, then as n, p, r grow to infinity proportionally, with p/n → γ ∈ (0, 1) and r/n → ξ ∈ (γ, 1), we have the simple formula for OE:

lim_{n→∞} OE(β̂ₛ, β̂) = (ξ − γ²)/(ξ − γ) ≈ (nr − p²)/[n(r − p)].

These results are complementary to Raskutti and Mahoney (2016), who showed that PE ≤ 44(1 + n/r) and RE ≤ 1 + 44p/r with fixed probability under slightly different assumptions. Our formulas have all the properties we claimed before: they are simple, accurate, and easy to interpret. The relative efficiencies decrease with r/n, the ratio of samples preserved after sketching. This is because a larger number of samples leads to a higher accuracy. Also, when ξ = lim r/n = 1, VE and PE reach a minimum of 2. Thus, taking a random Gaussian projection will degrade the performance of OLS even if we do not reduce the sample size. This is because iid projections distort the geometry of Euclidean space due to their non-orthogonality.
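The exact finite-sample formula VE = 1 + (n − p)/(r − p − 1) of Theorem 2.1 can be checked by direct Monte Carlo (a small sketch of ours; the dimensions and number of repetitions are arbitrary choices, and we exploit the fact that the estimation MSE does not depend on the true β):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r, sigma = 500, 10, 50, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p)  # WLOG: the MSE of OLS does not depend on beta

# Monte Carlo estimate of VE = E||beta_s - beta||^2 / E||beta_hat - beta||^2
num_full, num_sketch = 0.0, 0.0
for _ in range(200):
    Y = X @ beta + sigma * rng.standard_normal(n)
    bh, *_ = np.linalg.lstsq(X, Y, rcond=None)          # full OLS
    S = rng.standard_normal((r, n))                     # fresh Gaussian sketch
    bs, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)  # sketched OLS
    num_full += np.sum((bh - beta) ** 2)
    num_sketch += np.sum((bs - beta) ** 2)

ve_empirical = num_sketch / num_full
ve_theory = 1 + (n - p) / (r - p - 1)  # Theorem 2.1
print(ve_empirical, ve_theory)
```

With these values the theoretical VE is about 13.6, and the Monte Carlo estimate should land close to it, illustrating that the formula is exact rather than an order-of-magnitude bound.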
We will see how to overcome this using orthogonal random projections.

The proofs have three stages. The first stage, common to all sketching methods, expresses the VE and other desired quantities in terms of traces of appropriate matrices. The second stage involves finding the implicit limits of those traces using random matrix theory, in terms of certain fixed-point equations from the Marchenko-Pastur law. The final stage involves finding the explicit limit. In the Gaussian case, the second and third stages simplify into explicit calculations with the Wishart distribution.

2.2 iid projections

For iid projections, the entries of S are generated independently from the same distribution (not necessarily Gaussian). This includes sparse projections with iid 0, ±1 entries, which can speed up computation (Achlioptas, 2001). We show that in the "large-data" limit the performance of sketching is the same as for Gaussian projections. This is an instance of universality.

Theorem 2.2 (Universality for iid projections). Suppose that S has iid entries of zero mean and finite fourth moment. Suppose also that X is a deterministic matrix whose singular values are uniformly bounded away from zero and infinity. Then as n goes to infinity, while p/n → γ ∈ (0, 1) and r/n → ξ ∈ (γ, 1), the efficiencies have the limits

lim_{n→∞} VE(β̂ₛ, β̂) = lim_{n→∞} PE(β̂ₛ, β̂) = 1 + (1 − γ)/(ξ − γ).

Suppose in addition that X is also random, under the same model as in Theorem 2.1. Then the formula for OE given there still holds in this more general case.

The proof is based on a Lindeberg exchange argument.

2.3 Orthogonal (Haar) random projection

We saw that a random projection with iid entries will degrade the performance of OLS even if we do not reduce the sample size. Matrices with iid entries are not ideal for sketching, because they distort the geometry of Euclidean space due to their non-orthogonality. Is it possible to overcome this using orthogonal random projections? Here S is a Haar random matrix, uniformly distributed over the space of all r × n partial orthogonal matrices.

We need the following definition. Recall that for an n × p matrix M with n ≥ p, such that the eigenvalues of n⁻¹M⊤M are λⱼ, the empirical spectral distribution (e.s.d.) of M is the mixture (1/p) Σⱼ₌₁ᵖ δ_{λⱼ}, where δ_λ denotes a point mass distribution at λ.

Theorem 2.3 (Haar projection). Suppose that S is an r × n Haar-distributed random matrix. Suppose also that X is a deterministic matrix such that the e.s.d. of X⊤X converges weakly to some fixed probability distribution with compact support bounded away from the origin. Then as n tends to infinity, while p/n → γ ∈ (0, 1) and r/n → ξ ∈ (γ, 1), the efficiencies have the limits

lim_{n→∞} VE(β̂ₛ, β̂) = lim_{n→∞} PE(β̂ₛ, β̂) = (1 − γ)/(ξ − γ).

Suppose in addition that the training and test data X and xₜ are also random, under the same model as in Theorem 2.1. Then lim_{n→∞} OE(β̂ₛ, β̂) = (1 − γ)/(1 − γ/ξ).

The proof uses the limiting e.s.d. of a product of Haar and fixed matrices. Orthogonal projections are uniformly better than iid projections in terms of statistical accuracy. For variance efficiency, VE_iid = VE_Haar + 1.
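For reference, an r × n Haar-distributed partial orthogonal matrix can be generated by QR-decomposing a Gaussian matrix and applying a sign correction (a standard construction, not specific to this paper; the function name is ours):

```python
import numpy as np

def haar_partial_orthogonal(r, n, rng):
    """r x n partial orthogonal matrix, Haar-uniform over such matrices.

    QR-decompose an n x r Gaussian matrix; fixing the signs of the columns
    of Q via the diagonal of R makes the distribution exactly Haar."""
    G = rng.standard_normal((n, r))
    Q, R = np.linalg.qr(G)           # Q is n x r with orthonormal columns
    Q = Q * np.sign(np.diag(R))      # sign correction for Haar uniformity
    return Q.T                       # rows are orthonormal: S S^T = I_r

rng = np.random.default_rng(2)
S = haar_partial_orthogonal(200, 1000, rng)
print(np.allclose(S @ S.T, np.eye(200)))  # True
```

This is the dense, general-purpose way to realize the Haar sketch; its cost motivates the faster structured transform discussed next.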
However, there is still a tradeoff between statistical accuracy and computational cost, since the time complexity of generating a Haar matrix using the Gram-Schmidt procedure is O(nr²).

2.4 Subsampled randomized Hadamard transform

A faster way to do an orthogonal projection is the subsampled randomized Hadamard transform (SRHT) (Ailon and Chazelle, 2006), also known as the Fast Johnson-Lindenstrauss Transform (FJLT). This is faster as it relies on the Fast Fourier Transform, and it is often viewed as a standard reference point for comparing sketching algorithms.

An n × n, possibly complex-valued, matrix H is called a Hadamard matrix if H/√n is orthogonal and the absolute values of its entries are unity, |Hᵢⱼ| = 1 for i, j = 1, . . . , n. A prominent example, the Walsh-Hadamard matrix, is defined recursively by

H_n = [ H_{n/2}   H_{n/2} ;  H_{n/2}   −H_{n/2} ],

with H₁ = (1). This requires n to be a power of 2. Another construction is the discrete Fourier transform (DFT) matrix, with (u, v)-th entry equal to H_{uv} = n^(−1/2) e^(−2πi(u−1)(v−1)/n). Multiplying this matrix from the right by X is equivalent to applying the discrete Fourier transform to each column of X, up to scaling. The time complexity of the matrix-matrix multiplication for both transforms is O(np log n) due to the Fast Fourier Transform, faster than other random projections.

Now we consider the subsampled randomized Hadamard transform. Define the n × n subsampled randomized Hadamard matrix as S = BHDP, where B ∈ R^(n×n) is a diagonal sampling matrix of iid Bernoulli random variables with success probability r/n, H ∈ R^(n×n) is a Hadamard matrix, D ∈ R^(n×n) is a diagonal matrix of iid random variables equal to ±1 with probability one half, and P ∈ R^(n×n) is a uniformly distributed permutation matrix.
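The construction S = BHDP can be sketched directly (our simplified illustration: it materializes the Walsh-Hadamard matrix via the recursion above, so it runs in O(n²) rather than the O(np log n) of an FFT-style implementation, and n must be a power of 2):

```python
import numpy as np

def walsh_hadamard(n):
    """Walsh-Hadamard matrix built from the recursion H_n = [[H, H], [H, -H]]."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def srht(A, r, rng):
    """Apply S = B H D P to the rows of A (n = A.shape[0], a power of 2)."""
    n = A.shape[0]
    A = A[rng.permutation(n)]                          # P: uniform permutation
    A = A * rng.choice([-1.0, 1.0], size=n)[:, None]   # D: random signs
    A = (walsh_hadamard(n) / np.sqrt(n)) @ A           # H: orthogonal Hadamard
    keep = rng.random(n) < r / n                       # B: Bernoulli(r/n) rows
    return A[keep]                                     # discard the zero rows

rng = np.random.default_rng(3)
n, p, r = 1024, 10, 256
X, Y = rng.standard_normal((n, p)), rng.standard_normal((n, 1))
sketched = srht(np.hstack([X, Y]), r, rng)  # sketch X and Y consistently
X_s, Y_s = sketched[:, :p], sketched[:, p:]
print(X_s.shape)
```

Note that X and Y are stacked before sketching so that the same permutation, signs, and subsampling act on both, matching the (SX, SY) pairing used throughout the paper.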
In the definition of S, the Hadamard matrix H is deterministic, while the other matrices B, D and P are random. At the last step, we discard the zero rows of S, so it becomes an r̃ × n orthogonal matrix, where r̃ ≈ r. We expect the SRHT to be similar to uniform orthogonal projections. The following theorem verifies our intuition. The proof uses free probability theory (Tulino et al., 2010; Anderson and Farrell, 2014).

Theorem 2.4 (Subsampled randomized Hadamard projection). Let S be an n × n subsampled randomized Hadamard matrix. Suppose also that X is an n × p deterministic matrix whose e.s.d. converges weakly to some fixed probability distribution with compact support bounded away from the origin. Then as n tends to infinity, while p/n → γ ∈ (0, 1) and r/n → ξ ∈ (γ, 1), the efficiencies have the same limits as for Haar projection in Theorem 2.3.

2.5 Uniform random sampling

Fast orthogonal transforms such as the Hadamard transform are considered a baseline for sketching methods, because they are efficient and work well quite generally. However, if the data are very uniform, for instance if the data matrix can be assumed to be nearly rotationally invariant, then sampling methods can work just as well, as will be shown below.

The simplest sampling method is uniform subsampling, where we take r of the n rows of X with equal probability, with or without replacement. Here we analyze a nearly equivalent method, where we sample each row of X independently with probability r/n, so that the expected number of sampled rows is r. For large r and n, the number of sampled rows concentrates around r.

Moreover, we also assume that X is random, and that the distribution of X is rotationally invariant, i.e., for any n × n orthogonal matrix U and any p × p orthogonal matrix V, the distribution of UXV⊤ is the same as the distribution of X.
This holds for instance if X has iid Gaussian entries. Then the following theorem states the surprising fact that uniform sampling performs just like Haar projection.

Theorem 2.5 (Uniform sampling). Let S be an n × n diagonal uniform sampling matrix with iid Bernoulli(r/n) entries. Let X be an n × p rotationally invariant random matrix. Suppose that n tends to infinity, while p/n → γ ∈ (0, 1) and r/n → ξ ∈ (γ, 1), and that the e.s.d. of X converges almost surely in distribution to a compactly supported probability measure bounded away from the origin. Then the efficiencies have the same limits as for Haar matrices in Theorem 2.3.

2.6 Leverage-based sampling

Uniform sampling can work poorly when the data are highly non-uniform and some datapoints are more influential than others for the regression fit. In that case, it has been proposed to sample proportionally to the leverage scores hᵢᵢ = xᵢ⊤(X⊤X)⁻¹xᵢ. These can be thought of as the "leverage of the response value Yᵢ on the corresponding fitted value Ŷᵢ". One can also do greedy leverage sampling, deterministically taking the r rows with the largest leverage scores (Papailiopoulos et al., 2014).

In this section, we give a unified framework to study these sampling methods. Since leverage-based sampling does not introduce enough randomness for the results to be as simple and universal as before, we need to assume some more randomness via a model for X. Here we consider the elliptical model

xᵢ = wᵢ Σ^(1/2) zᵢ,  i = 1, . . . , n,    (1)

where the scale variables wᵢ are deterministic scalars bounded away from zero, and Σ^(1/2) is a p × p positive definite matrix. Also, the zᵢ are iid p × 1 random vectors whose entries are all iid random variables of zero mean and unit variance.
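To illustrate the elliptical model (1) and leverage-based sampling, here is a small numpy sketch (the scale distribution, the diagonal Σ, and all constants are our illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r = 2000, 50, 400

# Elliptical model (1): x_i = w_i * Sigma^{1/2} z_i, with a few large scales
w = np.where(rng.random(n) < 0.05, 5.0, 1.0)      # ~5% of rows have scale 5
Sigma_half = np.diag(np.linspace(0.5, 2.0, p))    # an illustrative Sigma^{1/2}
Z = rng.standard_normal((n, p))
X = w[:, None] * (Z @ Sigma_half)

# Leverage scores h_ii = x_i^T (X^T X)^{-1} x_i, computed via a thin QR
Q, _ = np.linalg.qr(X)
h = np.sum(Q ** 2, axis=1)          # the h_ii always sum to p

# Randomized leverage sampling: keep row i with probability ~ r * h_ii / p
pi = np.minimum(r * h / p, 1.0)
X_s = X[rng.random(n) < pi]
print(h.sum(), X_s.shape)
```

Rows with large scale wᵢ receive large leverage scores and are therefore kept with higher probability, which is exactly the non-uniformity that Theorem 2.6 below captures through the joint distribution of s and w².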
This model has a long history in multivariate statistics; see (Mardia et al., 1979). If a scale variable wᵢ is much larger than the rest, then xᵢ will have a large leverage score. This model allows us to study the effect of unequal leverage scores. Similarly to uniform sampling, we analyze the model where each row is sampled independently with some probability.

Recall that the η-transform of a distribution F is defined by η_F(z) = ∫ 1/(1 + zx) dF(x), for z ∈ C⁺ (e.g., Tulino and Verdú, 2004; Couillet and Debbah, 2011). In the next result, we assume that the scalars wᵢ², i = 1, . . . , n, have a limiting distribution F_{w²} as the dimension increases. In that case, the η-transform gives the limit of the leverage scores. First we give a result for arbitrary sampling with probability πᵢ depending only on wᵢ, and then specialize it to leverage sampling.

Theorem 2.6 (Sampling for the elliptical model). Suppose X is sampled from the elliptical model defined in (1). Suppose the e.s.d. of Σ converges in distribution to some probability measure with compact support bounded away from the origin. Let n tend to infinity, while p/n → γ ∈ (0, 1) and r/n → ξ ∈ (γ, 1). Suppose also that the (4 + η)-th moment of zᵢ is uniformly bounded, for some η > 0.

Consider the sketching method where we sample the i-th row of X with probability πᵢ independently, where πᵢ may only depend on wᵢ, and πᵢ, i = 1, . . . , n, have a limiting distribution F_π.
Let s|π be a Bernoulli random variable with success probability π. Then

lim_{n→∞} VE(β̂ₛ, β̂) = η⁻¹_{sw²}(1 − γ) / η⁻¹_{w²}(1 − γ),

lim_{n→∞} PE(β̂ₛ, β̂) = 1 + (1/γ) · E[w²(1 − s)] · η⁻¹_{sw²}(1 − γ),

lim_{n→∞} OE(β̂ₛ, β̂) = [1 + E(w²) η⁻¹_{sw²}(1 − γ)] / [1 + E(w²) η⁻¹_{w²}(1 − γ)],

where η_{w²} and η_{sw²} are the η-transforms of w² (where w is the distribution of the scales of the xᵢ) and of sw² (where s is defined above), respectively. Moreover, the expectation is taken with respect to the joint distribution of s and w² as defined above. In particular, for leverage score sampling, s is a Bernoulli variable with success probability min{(r/p)(1 − 1/(1 + w² η⁻¹_{w²}(1 − γ))), 1}.

Figure 1: Verification of our theory. Solid lines show the theoretical formulas for the variance efficiency, while dashed lines show the simulation results, for γ = 0.05 (left, log of VE shown) and γ = 0.4 (right). Showing the SD over 10 trials of Gaussian, iid, Haar, and Hadamard sketching, and sampling.

Figure 2: Empirical data analysis. Left: Million Song dataset. Right: Flight dataset.

If the wᵢ are all equal to unity, one can check that the results are the same as for orthogonal projection or uniform sampling on a rotationally invariant X. This is because all leverage scores are nearly equal. We specialize this result to greedy sampling in Section A.11 in the supplement.

3 Simulations and data analysis

We report some simulations to verify our results. In Figure 1, we take n = 2000 and p = 100 or 800, respectively. Each row of X is generated iid from N(0, I_p). The simulation results for VE and the error bars are the mean and one standard deviation over 10 repetitions.
We also plot our theoretical results (bold lines) in the figures. The x-axis is on a log scale. We observe that the simulation results match the theoretical results very well. Also note that in this case, where the data is rotationally invariant and the leverage scores are nearly uniform, sampling methods work as well as orthogonal and Hadamard projections, while Gaussian and iid projections perform worse. Additional simulations with correlated t-distributed data and with leverage sampling are in Sections A.14 and A.13 in the supplement.

We test our results on the Million Song Year Prediction Dataset (MSD) (Bertin-Mahieux et al., 2011) (n = 515344, p = 90) and the New York flights dataset (Wickham, 2018) (n = 60449, p = 21). The columns are standardized to have zero mean and unit standard deviation. We compare three different sketching methods: Gaussian projection, randomized Hadamard projection, and uniform sampling. For each target dimension r, we show the mean, as well as the 5% and 95% quantiles, over 10 repetitions. The results for RE are in Figure 2, and the results for OE are in Section A.15 in the supplement. For Gaussian and Hadamard projections our theory agrees well with the experiments. However, uniform sampling has very large variance, especially on the flight dataset. Our theory is less accurate here, because it requires the data matrix to be rotationally invariant, which may not hold.

Discussion

A direction for future work is to study sketching in (kernel) ridge regression (perhaps possible using random matrix theory) and in the lasso (perhaps possible using approximate message passing). Another question is to understand the variability of sketching methods.

Acknowledgments

The authors thank Ken Clarkson, Miles Lopes, Michael Mahoney, Mert Pilanci, Garvesh Raskutti, and David Woodruff for helpful discussions. ED was partially supported by NSF BIGDATA grant IIS 1837992. SL was partially supported by a Tsinghua University Summer Research award.
A version of our manuscript is available on arXiv at https://arxiv.org/abs/1810.06089.

References

Dimitris Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 274–281. ACM, 2001.

Daniel Ahfock, William J Astle, and Sylvia Richardson. Statistical properties of sketching algorithms. arXiv preprint arXiv:1706.03665, 2017.

Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 557–563. ACM, 2006.

Greg W Anderson and Brendan Farrell. Asymptotically liberating sequences of random unitary matrices. Advances in Mathematics, 255:381–413, 2014.

Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polynomial kernel. In Advances in Neural Information Processing Systems, pages 2258–2266, 2014.

Zhidong Bai and Jack W Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. Springer, New York, 2nd edition, 2010.

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

Timothy I Cannings and Richard J Samworth. Random-projection ensemble classification. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):959–1035, 2017.

Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.

Romain Couillet and Merouane Debbah. Random Matrix Methods for Wireless Communications. Cambridge University Press, 2011.

Paramveer Dhillon, Yichao Lu, Dean P Foster, and Lyle Ungar.
New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, pages 360–368, 2013.

Huaian Diao, Zhao Song, Wen Sun, and David P Woodruff. Sketching for Kronecker product regression and P-splines. arXiv preprint arXiv:1712.09473, 2017.

Petros Drineas and Michael W Mahoney. RandNLA: randomized numerical linear algebra. Communications of the ACM, 59(6):80–90, 2016.

Petros Drineas and Michael W Mahoney. Lectures on randomized numerical linear algebra. arXiv preprint arXiv:1712.08880, 2017.

Petros Drineas, Michael W Mahoney, and S Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, pages 1127–1136. Society for Industrial and Applied Mathematics, 2006.

Petros Drineas, Michael W Mahoney, S Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2011.

Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506, 2012.

Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, 2009.

Zengfeng Huang. Near optimal frequent directions for sketching dense and sparse matrices. In International Conference on Machine Learning, pages 2053–2062, 2018.

Ata Kabán. New bounds on compressive linear least squares regression.
In Artificial Intelligence and Statistics, pages 448–456, 2014.

Ilja Kuzborskij, Leonardo Cella, and Nicolò Cesa-Bianchi. Efficient linear bandits through matrix sketching. arXiv preprint arXiv:1809.11033, 2018.

Edo Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 581–588. ACM, 2013.

Miles Lopes, Laurent Jacob, and Martin J Wainwright. A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems, pages 1206–1214, 2011.

Ping Ma, Michael W Mahoney, and Bin Yu. A statistical perspective on algorithmic leveraging. The Journal of Machine Learning Research, 16(1):861–911, 2015.

Michael W Mahoney. Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224, 2011.

Odalric Maillard and Rémi Munos. Compressed least-squares regression. In Advances in Neural Information Processing Systems, pages 1213–1221, 2009.

Osman Asif Malik and Stephen Becker. Low-rank Tucker decomposition of large tensors using TensorSketch. In Advances in Neural Information Processing Systems, pages 10096–10106, 2018.

Kanti Mardia, John T Kent, and John M Bibby. Multivariate Analysis. Academic Press, 1979.

Samet Oymak and Joel A Tropp. Universality laws for randomized dimension reduction, with applications. Information and Inference: A Journal of the IMA, 2017.

Dimitris Papailiopoulos, Anastasios Kyrillidis, and Christos Boutsidis. Provable deterministic leverage score sampling. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 997–1006. ACM, 2014.

Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps.
In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 239–247. ACM, 2013.

Mert Pilanci and Martin J Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE Transactions on Information Theory, 61(9):5096–5115, 2015.

Garvesh Raskutti and Michael W Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538, 2016.

Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 143–152. IEEE, 2006.

Gian-Andrea Thanei, Christina Heinze, and Nicolai Meinshausen. Random projections for large-scale regression. In Big and Complex Data Analysis, pages 51–68. Springer, 2017.

Antonia M Tulino, Giuseppe Caire, Shlomo Shamai, and Sergio Verdú. Capacity of channels with frequency-selective and time-selective fading. IEEE Transactions on Information Theory, 56(3):1187–1215, 2010.

Antonia M Tulino and Sergio Verdú. Random matrix theory and wireless communications. Foundations and Trends® in Communications and Information Theory, 1(1):1–182, 2004.

Aad W Van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.

Santosh S Vempala. The Random Projection Method, volume 65. American Mathematical Society, 2005.

Hadley Wickham. nycflights13: Flights that Departed NYC in 2013, 2018. URL https://CRAN.R-project.org/package=nycflights13. R package version 1.0.0.

David P Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.

Jianfeng Yao, Zhidong Bai, and Shurong Zheng. Large Sample Covariance Matrices and High-Dimensional Data Analysis.
Cambridge University Press, New York, 2015.