{"title": "Gradient-based Sampling: An Adaptive Importance Sampling for Least-squares", "book": "Advances in Neural Information Processing Systems", "page_first": 406, "page_last": 414, "abstract": "In modern data analysis, random sampling is an efficient and widely-used strategy to overcome the computational difficulties brought by large sample size. In previous studies, researchers conducted random sampling which is according to the input data but independent on the response variable, however the response variable may also be informative for sampling. In this paper we propose an adaptive sampling called the gradient-based sampling which is dependent on both the input data and the output for fast solving of least-square (LS) problems. We draw the data points by random sampling from the full data according to their gradient values. This sampling is computationally saving, since the running time of computing the sampling probabilities is reduced to O(nd) where n is the full sample size and d is the dimension of the input. Theoretically, we establish an error bound analysis of the general importance sampling with respect to LS solution from full data. The result establishes an improved performance of the use of our gradient-based sampling. Synthetic and real data sets are used to empirically argue that the gradient-based sampling has an obvious advantage over existing sampling methods from two aspects of statistical efficiency and computational saving.", "full_text": "Gradient-based Sampling: An Adaptive Importance\n\nSampling for Least-squares\n\nAcademy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China.\n\nRong Zhu\n\nrongzhu@amss.ac.cn\n\nAbstract\n\nIn modern data analysis, random sampling is an ef\ufb01cient and widely-used strategy\nto overcome the computational dif\ufb01culties brought by large sample size. 
In previous studies, researchers conducted random sampling according to the input data but independently of the response variable; however, the response variable may also be informative for sampling. In this paper we propose an adaptive sampling scheme, called gradient-based sampling, which depends on both the input data and the output for fast solving of least-squares (LS) problems. We draw data points from the full data by random sampling according to their gradient values. This sampling is computationally cheap, since the running time of computing the sampling probabilities is reduced to O(nd), where n is the full sample size and d is the dimension of the input. Theoretically, we establish an error bound analysis of general importance sampling with respect to the LS solution from the full data. The result establishes an improved performance guarantee for our gradient-based sampling. Synthetic and real data sets are used to argue empirically that gradient-based sampling has a clear advantage over existing sampling methods in both statistical efficiency and computational savings.

1 Introduction

Modern data analysis routinely addresses enormous data sets. Faced with ever larger samples, computational savings play a major role in the analysis. One simple way to reduce the computational cost is to perform random sampling, that is, to use a small proportion of the data as a surrogate for the full sample in model fitting and statistical inference. Among random sampling strategies, uniform sampling is simple but naive, since it fails to exploit the unequal importance of the data points. As an alternative, leverage-based sampling performs random sampling with respect to nonuniform sampling probabilities that depend on the empirical statistical leverage scores of the input matrix X. 
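For concreteness, the leverage scores just mentioned are the diagonal entries of the hat matrix X(X^T X)^{-1} X^T. A minimal NumPy sketch (our own illustration, not the authors' code) computes them via a thin QR factorization, which avoids forming (X^T X)^{-1} explicitly:

```python
import numpy as np

def leverage_scores(X):
    # h_i = x_i^T (X^T X)^{-1} x_i, the diagonal of the hat matrix.
    # With a thin QR, h_i equals the squared norm of the i-th row of Q;
    # the cost is O(n d^2).
    Q, _ = np.linalg.qr(X)
    return np.sum(Q * Q, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
h = leverage_scores(X)
pi_lev = h / h.sum()  # leverage-based sampling probabilities
```

For a full-column-rank X the scores lie in [0, 1] and sum to d, so normalizing them yields a valid sampling distribution.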
It has been intensively studied in the machine learning community and has been proved to achieve much better worst-case results than uniform sampling [1, 2, 3, 4]. However, leverage-based sampling relies on the input data but is independent of the output variable, and so does not make use of the information in the output. Another shortcoming is that computing the leverage scores is expensive, although approximate leverage scores have been proposed to further reduce the computational cost [5, 6, 7].

In this paper, we propose an adaptive importance sampling, the gradient-based sampling, for solving the least-squares (LS) problem. This sampling attempts to make full use of the data information, including both the input data and the output variable. The adaptive process can be summarized as follows: given a pilot estimate (a good "guess") for the LS solution, determine the importance of each data point by calculating its gradient value, then sample from the full data by importance sampling according to the gradient values. One key contribution of this sampling is that it saves more computational time than leverage-based sampling: the running time of computing the probabilities is reduced to O(nd), where n is the sample size and d is the input dimension. It is worth noting that, although we apply gradient-based sampling to the LS problem, we believe it may be extended to fast solving of other large-scale optimization problems as long as the gradient of the objective function is available. However, this is beyond our scope, so we do not pursue it in this paper.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Theoretically, we give a risk analysis, i.e., an error bound for the LS solution obtained from random sampling. [8] and [9] gave risk analyses for approximating LS by Hadamard-based projection and covariance-thresholded regression, respectively. 
However, no such analysis has been given for importance sampling. Our error bound analysis is a general result that applies to any importance sampling as long as the stated conditions hold. Using this result, we establish an improved performance guarantee for our gradient-based sampling. It is improved in the sense that gradient-based sampling makes the bound approximately attain its minimum, while previous sampling methods cannot achieve this. Additionally, the non-asymptotic result also provides a way of balancing the tradeoff between the subsample size and the statistical accuracy.

Empirically, we conduct detailed experiments on datasets generated from mixture Gaussians and on real datasets. These empirical studies show that gradient-based sampling is not only more statistically efficient than leverage-based sampling but also much cheaper computationally. Another important aim of the detailed experiments on synthetic datasets is to guide the use of the sampling method in the different situations users may encounter in practice.

The remainder of the paper is organized as follows. In Section 2, we formally describe the random sampling algorithm for solving LS; we then introduce the gradient-based sampling in Section 3. The non-asymptotic analysis is provided in Section 4. We study the empirical performance on synthetic and real-world datasets in Section 5.

Notation: For a symmetric matrix M \in R^{d \times d}, we define \lambda_{min}(M) and \lambda_{max}(M) as its smallest and largest eigenvalues. For a vector v \in R^d, we define \|v\| as its L2 norm.

2 Problem Set-up

For the LS problem, suppose we have an n \times d matrix X = (x_1, ..., x_n)^T and an n \times 1 response vector y = (y_1, ..., y_n)^T. We focus on the setting n \gg d. 
The LS problem is to minimize the sample risk function of the parameters \beta:

    \min_{\beta} \sum_{i=1}^n (y_i - x_i^T \beta)^2 / 2 =: \sum_{i=1}^n l_i.    (1)

The solution of equation (1) takes the form

    \hat{\beta}_n = (n^{-1} X^T X)^{-1} (n^{-1} X^T y) =: \Sigma_n^{-1} b_n,    (2)

where \Sigma_n = n^{-1} X^T X and b_n = n^{-1} X^T y. However, the challenge of large sample size exists even in this simple problem: the sample size n may be so large that the computational cost of calculating the LS solution (2) is very expensive or even unaffordable.

We perform the random sampling algorithm as follows:
(a) Assign sampling probabilities {\pi_i}_{i=1}^n to all data points such that \sum_{i=1}^n \pi_i = 1;
(b) Get a subsample S = {(x_i, y_i) : i is drawn} by random sampling according to the probabilities;
(c) Minimize a weighted loss function to get an estimate \tilde{\beta}:

    \tilde{\beta} = \arg\min_{\beta \in R^d} \sum_{i \in S} \frac{1}{2\pi_i} \|y_i - x_i^T \beta\|^2 = \Sigma_s^{-1} b_s,    (3)

where \Sigma_s = \frac{1}{n} X_s^T \Phi_s^{-1} X_s, b_s = \frac{1}{n} X_s^T \Phi_s^{-1} y_s, and X_s, y_s and \Phi_s are the partitions of X, y and \Phi = diag{r\pi_i}_{i=1}^n (with r the subsample size) corresponding to the subsample S. Note that the last equality in (3) holds under the assumption that \Sigma_s is invertible. Throughout this paper we assume, for convenience, that \Sigma_s is invertible, since d \ll n in our setting; it can be replaced by a regularized version if it is not.

How to construct {\pi_i}_{i=1}^n is a key component of the random sampling algorithm. One simple method is uniform sampling, i.e., \pi_i = n^{-1}; another is leverage-based sampling, i.e., \pi_i \propto x_i^T (X^T X)^{-1} x_i. In the next section, we introduce a new efficient method, gradient-based sampling, which draws data points according to the gradient value of each data point.

Related Work. [10, 11, 4] developed leverage-based sampling in matrix decomposition. [10, 12] applied the sampling method to approximate the LS solution. [13] derived bias and variance formulas for the leverage-based sampling algorithm in linear regression using Taylor series expansions. [14] further provided upper bounds on the mean-squared error and the worst-case error of randomized sketching for the LS problem. [15] proposed a sampling-dependent error bound and derived a better sampling distribution from this bound. Fast algorithms for approximating the leverage scores {x_i^T (X^T X)^{-1} x_i}_{i=1}^n have been proposed to further reduce the computational cost [5, 6, 7].

3 Gradient-based Sampling Algorithm

The gradient-based sampling uses a pilot solution of the LS problem to compute the gradient of the objective function, and then draws a subsample according to the calculated gradient values. It differs from leverage-based sampling in that the sampling probability \pi_i is allowed to depend on the input data X as well as on y. Given a pilot estimate (good guess) \beta_0 for the parameters \beta, we calculate the gradient for the i-th data point:

    g_i = \frac{\partial l_i(\beta_0)}{\partial \beta_0} = x_i (y_i - x_i^T \beta_0).    (4)

The gradient represents the slope of the tangent of the loss function, so logically, if the gradients of some data points are large in some sense, those data points are important for finding the optimum. 
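Since g_i = x_i (y_i - x_i^T \beta_0), its norm factorizes as \|g_i\| = \|x_i\| |y_i - x_i^T \beta_0|, so all n norms can be computed in two vectorized O(nd) passes. A minimal NumPy sketch (our illustration; the variable names are ours):

```python
import numpy as np

def gradient_norms(X, y, beta0):
    # ||g_i|| = ||x_i|| * |y_i - x_i^T beta0|; each factor costs O(nd) in total.
    residuals = y - X @ beta0
    row_norms = np.linalg.norm(X, axis=1)
    return row_norms * np.abs(residuals)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
y = X @ np.ones(4) + rng.standard_normal(500)
beta0 = np.zeros(4)            # a deliberately crude pilot estimate
g = gradient_norms(X, y, beta0)
pi_grad = g / g.sum()          # normalized sampling probabilities
```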
Our sampling strategy makes use of the gradient upon observing y_i given x_i; specifically,

    \pi_i^0 = \|g_i\| / \sum_{i=1}^n \|g_i\|.    (5)

Equations (4) and (5) show that \|g_i\| combines two pieces of information: \|x_i\|, which is the information provided by the input data, and |y_i - x_i^T \beta_0|, which measures the adjustment needed from the pilot estimate \beta_0 to a better estimate. Figure 1 illustrates the efficiency benefit of gradient-based sampling with a simple constructed example. The figure shows that data points with larger |y_i - x_i \beta_0| are likely to be more important for approximating the solution. On the other hand, given |y_i - x_i \beta_0|, we prefer data points with larger \|x_i\|, since larger \|x_i\| values tend to make the approximate solution more efficient. From the computational view, calculating {\pi_i^0}_{i=1}^n costs O(nd), so gradient-based sampling greatly saves computational cost.

Figure 1: An illustrative example. 12 data points are generated from y_i = x_i + e_i, where x_i = (\pm 3, \pm 2.5, \pm 2, \pm 1.5, \pm 1, \pm 0.5) and e_i ~ N(0, 0.5). The LS solution, denoted by the red line, is \hat{\beta} = \sum_{i=1}^{12} x_i y_i / \sum_{i=1}^{12} x_i^2. The pilot estimate, denoted by the dashed line, is \beta_0 = 0.5.

Choosing the pilot estimate \beta_0. In many applications there is a natural choice of pilot estimate \beta_0; for instance, the fit from the last round of analysis is a natural choice for the current one. Another simple way is to use a pilot estimate \beta_0 computed from an initial subsample of size r_0 obtained by uniform sampling. 
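Such a pilot fit can be sketched as follows (a hypothetical helper of our own, assuming the data are already in memory):

```python
import numpy as np

def pilot_estimate(X, y, r0, rng):
    # Uniform subsample of size r0, followed by an O(r0 d^2) LS solve.
    idx = rng.choice(X.shape[0], size=r0, replace=False)
    beta0, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta0

rng = np.random.default_rng(2)
X = rng.standard_normal((10000, 10))
beta_true = rng.standard_normal(10)
y = X @ beta_true + rng.standard_normal(10000)
beta0 = pilot_estimate(X, y, r0=500, rng=rng)
```

Even a modest r_0 gives a usable beta0, since the pilot only needs to rank gradient norms, not to be accurate itself.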
The extra computational cost is O(r_0 d^2), which is assumed to be small since a choice r_0 \le r is good enough. We empirically show the effect of a small r_0 (r_0 \le r) on the performance of gradient-based sampling by simulations, and argue that one need not be overly careful when choosing r_0 to get a pilot estimate (see Supplementary Material, Section S1).

Poisson sampling vs. sampling with replacement. In this study, we do not use sampling with replacement, as previous studies did, but instead apply Poisson sampling in the algorithm. Poisson sampling proceeds down the list of elements and carries out one randomized experiment for each element, resulting either in the selection or the non-selection of that element [16]. Poisson sampling can thus improve efficiency in some contexts compared to sampling with replacement, since it avoids repeatedly drawing the same data points, especially as the sampling ratio increases. We empirically illustrate this advantage of Poisson sampling over sampling with replacement (see Supplementary Material, Section S2).

Independence of model assumptions. The LS solution is well known to be statistically efficient under the linear regression model with homogeneous errors, but model misspecification is ubiquitous in real applications. On the other hand, the LS solution is also an optimization problem that makes no linear model assumption from the algorithmic view. To show numerically that gradient-based sampling does not depend on model assumptions, we conduct simulation studies and find that it is an efficient sampling method from the algorithmic perspective. 
(see Supplementary Material, Section S3). As a summary, we present the gradient-based sampling in Algorithm 1.

Algorithm 1 Gradient-based sampling algorithm

• Pilot estimate \beta_0:
  (1) Take a good guess as the pilot estimate \beta_0, or use the estimate \beta_0 computed from an initial subsample of size r_0 drawn by uniform sampling.
• Gradient-based sampling:
  (2) Assign sampling probabilities {\pi_i \propto \|g_i\|}_{i=1}^n to all data points such that \sum_{i=1}^n \pi_i = 1.
  (3) Generate independent s_i ~ Bernoulli(1, p_i), where p_i = r\pi_i and r is the expected subsample size.
  (4) Form the subsample from the elements with s_i = 1: if s_i = 1, the i-th data point is chosen; otherwise it is not.
• Estimation:
  (5) Solve the LS problem on the subsample using equation (3) to get the subsample estimator \tilde{\beta}.

Remarks on Algorithm 1. (a) The subsample size r^* under Poisson sampling is random in Algorithm 1. Since r^* is a sum of independent Bernoulli variables with expectation E(r^*) = \sum_{i=1}^n p_i = r and variance Var(r^*) = \sum_{i=1}^n p_i(1 - p_i), the range of probable values of r^* can be assessed by an interval. In practice we only need to set the expected subsample size r. (b) If some \pi_i are so large that p_i = r\pi_i > 1, we take p_i = 1, i.e., \pi_i = 1/r, for those data points.

4 Error Bound Analysis of Sampling Algorithms

Our main theoretical result establishes the excess risk, i.e., an upper error bound of the subsample estimator \tilde{\beta} for approximating \hat{\beta}_n under a random sampling method. Given sampling probabilities {\pi_i}_{i=1}^n, the excess risk of the subsample estimator \tilde{\beta} with respect to \hat{\beta}_n is given in Theorem 1 (see Section S4 in the Supplementary Material for the proof). 
By this general result, we provide an explanation of why the gradient-based sampling algorithm is statistically efficient.

Theorem 1 Define \sigma_\Sigma^2 = \frac{1}{n^2} \sum_{i=1}^n \pi_i^{-1} \|x_i\|^4, \sigma_b^2 = \frac{1}{n^2} \sum_{i=1}^n \pi_i^{-1} \|x_i\|^2 e_i^2 where e_i = y_i - x_i^T \hat{\beta}_n, and R = \max{\|x_i\|^2}_{i=1}^n. If

    r > \frac{\sigma_\Sigma^2 \log d}{\delta^2 (2^{-1}\lambda_{min}(\Sigma_n) - (3n\delta)^{-1} R \log d)^2}

holds, then the excess risk of \tilde{\beta} for approximating \hat{\beta}_n is bounded, with probability 1 - \delta, for \delta > \frac{R \log d}{3n\lambda_{min}(\Sigma_n)}, as

    \|\tilde{\beta} - \hat{\beta}_n\| \le C r^{-1/2},    (6)

where C = 2\lambda_{min}^{-1}(\Sigma_n)\delta^{-1}\sigma_b.

Theorem 1 indicates that \|\tilde{\beta} - \hat{\beta}_n\| is bounded by Cr^{-1/2}. From (6), the choice of sampling method has no effect on the decreasing rate of the bound, r^{-1/2}, but it influences the constant C. Thus, a theoretical measure of the efficiency of a sampling method is whether it makes the constant C attain its minimum. In Corollary 1 (see Section S5 in the Supplementary Material for the proof), we show that Algorithm 1 approximately achieves this.

Remarks on Theorem 1. (a) Theorem 1 can be used to guide the choice of r in practice so as to guarantee the desired accuracy of the solution with high probability. (b) The constants \sigma_b, \lambda_{min}(\Sigma_n) and \sigma_\Sigma can be estimated from the subsample. (c) The risk of X\tilde{\beta} for predicting X\hat{\beta}_n follows from equation (6), giving \|X\tilde{\beta} - X\hat{\beta}_n\|/n \le Cr^{-1/2}\lambda_{max}^{1/2}(\Sigma_n). 
(d) Although Theorem 1 is established under Poisson sampling, the error bound extends easily to sampling with replacement by following the technical proofs in the Supplementary Material, since each draw in sampling with replacement is independent.

Corollary 1 If \beta_0 - \hat{\beta}_n = o_p(1), then C is approximately minimized by Algorithm 1, that is,

    C(\pi_i^0) - \min_\pi C = o_p(1),    (7)

where C(\pi_i^0) denotes the value of C corresponding to our gradient-based sampling.

The significance of Corollary 1 is that it explains why gradient-based sampling is statistically efficient. The corollary establishes an improved performance guarantee for the use of gradient-based sampling. It is improved in the sense that gradient-based sampling makes the bound approximately attain its minimum as long as the condition is satisfied, while neither uniform sampling nor leverage-based sampling achieves this. The condition \beta_0 - \hat{\beta}_n = o_p(1) provides a benchmark for whether the pilot estimate \beta_0 is a good guess of \hat{\beta}_n. Note that the condition is satisfied by the initial estimate \beta_0 from an initial subsample of size r_0 drawn by uniform sampling, since \beta_0 - \hat{\beta}_n = O_p(r_0^{-1/2}).

5 Numerical Experiments

Detailed numerical experiments are conducted to compare the excess risk of \tilde{\beta} based on the L2 loss against the expected subsample size r for different synthetic datasets and real data examples. 
In this section, we report several representative studies.

5.1 Performance of gradient-based sampling

The n \times d design matrix X is generated with elements drawn independently from the mixture Gaussian distribution \frac{1}{2} N(-\mu, \sigma_x^2) + \frac{1}{2} N(\mu, \theta_{mg}^2 \sigma_x^2), under the settings: (1) \mu = 0 and \theta_{mg} = 1, i.e., a Gaussian distribution (referred to as GA data); (2) \mu = 0 and \theta_{mg} = 2, i.e., a mixture of small and relatively large variances (MG1 data); (3) \mu = 0 and \theta_{mg} = 5, i.e., a mixture of small and very large variances (MG2 data); (4) \mu = 5 and \theta_{mg} = 1, i.e., a mixture of two symmetric peaks (MG3 data). We also ran simulations with X generated from multivariate mixture Gaussian distributions with an AR(1) covariance matrix, but obtained performance similar to the settings above, so we do not report them here. Given X, we generate y from the model y = X\beta + \epsilon, where each element of \beta is drawn from the normal distribution N(0, 1) and then fixed, and \epsilon ~ N(0, \sigma^2 I_n) with \sigma = 10. We also considered a heteroscedastic setting in which \epsilon follows a mixture Gaussian, and obtained results similar to the homoscedastic setting, so we do not report them here. We set d to 100, and n to one of 20K, 50K, 100K, 200K, 500K.

We calculate the full-sample LS solution \hat{\beta}_n for each dataset, and repeatedly apply the various sampling methods B = 1000 times to get subsample estimates \tilde{\beta}_b for b = 1, ..., B. We calculate the empirical risk based on the L2 loss (MSE) as

    MSE = B^{-1} \sum_{b=1}^B \|\tilde{\beta}_b - \hat{\beta}_n\|^2.

Two sampling ratios r/n are considered: 0.01 and 0.05. We compare uniform sampling (UNIF), leverage-based sampling (LEV) and gradient-based sampling (GRAD) on these data sets. 
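As a self-contained sketch of one such run (our own illustration of Algorithm 1, at a much smaller scale than the settings above), the following draws an MG1-style design, performs gradient-based Poisson sampling, and compares the subsample estimator against the full-sample LS solution:

```python
import numpy as np

def gradient_based_ls(X, y, r, r0, rng):
    # Sketch of Algorithm 1: pilot fit, gradient-norm probabilities,
    # Poisson sampling, then weighted LS on the subsample.
    n = X.shape[0]
    idx0 = rng.choice(n, size=r0, replace=False)
    beta0, *_ = np.linalg.lstsq(X[idx0], y[idx0], rcond=None)
    g = np.linalg.norm(X, axis=1) * np.abs(y - X @ beta0)
    pi = g / g.sum()
    p = np.minimum(r * pi, 1.0)          # inclusion probabilities, capped at 1
    keep = rng.random(n) < p             # one Bernoulli trial per point
    w = 1.0 / p[keep]                    # inverse-probability weights
    Xs = X[keep] * np.sqrt(w)[:, None]   # weighted LS via row rescaling
    ys = y[keep] * np.sqrt(w)
    beta_tilde, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta_tilde

rng = np.random.default_rng(3)
n, d = 20000, 10
# MG1-style design: each entry N(0,1) or N(0,4) with probability 1/2 each
comp = rng.random((n, d)) < 0.5
X = np.where(comp, rng.standard_normal((n, d)), 2.0 * rng.standard_normal((n, d)))
beta = rng.standard_normal(d)
y = X @ beta + 10.0 * rng.standard_normal(n)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_tilde = gradient_based_ls(X, y, r=1000, r0=1000, rng=rng)
err = np.linalg.norm(beta_tilde - beta_full)
```

The weights 1/p_i match the \Phi^{-1} weighting in equation (3); repeating the call B times and averaging \|\tilde{\beta} - \hat{\beta}_n\|^2 reproduces the MSE criterion used in this section.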
For GRAD, we set r_0 = r to get the pilot estimate \beta_0.

Figure 2: Boxplots of the logarithm of the different sampling probabilities for X matrices with n = 50K. From left to right: GA, MG1, MG2 and MG3 data sets.

Figure 2 gives boxplots of the logarithm of the sampling probabilities of LEV and GRAD; the logarithm is taken to show the distributions clearly. We make some observations from the figure. (1) For all four datasets, GRAD has heavier tails than LEV, that is, GRAD makes the sampling probabilities more dispersed than LEV does. (2) MG2 tends to have the most heterogeneous sampling probabilities, MG1 less so, whereas MG3 and GA have the most homogeneous sampling probabilities. This indicates that mixing large and small variances affects the distribution of the sampling probabilities, while mixing different peak locations has no effect.

We plot the logarithm of the MSE values for GA, MG1, and MG2 in Figure 3; the logarithm is taken to show the relative values clearly. We do not report results for MG3, as there is little difference between MG3 and GA. Several interesting results are shown in Figure 3. (1) GRAD performs better than the others, and its advantage becomes more obvious as r/n increases. (2) For GA, LEV performs similarly to UNIF, while GRAD performs clearly better than UNIF. (3) As r/n increases, a smaller n suffices to ensure that GRAD outperforms the others.

From the computational view, we compare the cost for UNIF, approximate LEV (ALEV) [5, 6] and GRAD in Table 1, since ALEV is known to be a computationally efficient approximation of LEV. From the table, UNIF costs the least time, and the time cost of GRAD is much less than that of ALEV. This indicates that GRAD is also efficient from the computational view, since its running time is O(nd). 
Additionally, Table 2 summarizes the computational complexity of several sampling methods for fast solving of LS problems.

5.2 Real Data Examples

In this section, we compare the performance of the various sampling algorithms on two UCI datasets: CASP (n = 45730, d = 9) and OnlineNewsPopularity (NEWS) (n = 39644, d = 59). First, we plot boxplots of the logarithm of the sampling probabilities of LEV and GRAD in Figure 4. As with the synthetic datasets, the sampling probabilities of GRAD look more dispersed than those of LEV.

The MSE values are reported in Table 3, from which we make two observations. First, GRAD has smaller MSE values than the others when r is large. Second, as r increases, the advantage of Poisson sampling over sampling with replacement becomes obvious for the various methods. A similar observation is made in the simulations (see Supplementary Material, Section S2).
Figure 3: Empirical mean-squared error of \tilde{\beta} for approximating \hat{\beta}_n. From top to bottom: the upper panels are r/n = 0.01, the lower panels r/n = 0.05. From left to right: GA, MG1, and MG2 data, respectively.

Table 1: The time cost of obtaining \tilde{\beta} for various subsample sizes r by UNIF, ALEV and GRAD for n = 500K and 5M, where () denotes the time for calculating the full-sample LS solution \hat{\beta}_n. The computation was performed in R on a PC with a 3 GHz Intel i7 processor, 8 GB memory and the OS X operating system.

n = 500K
         System Time (0.406)          User Time (7.982)
r        200     500     2000        200     500     2000
UNIF     0.000   0.002   0.003       0.010   0.018   0.050
ALEV     0.494   0.642   0.797       2.213   2.592   4.353
GRAD     0.099   0.105   0.114       0.338   0.390   0.412

n = 5M
         System Time (121.4)          User Time (129.88)
r        500     2000    10000       500     2000    10000
UNIF     0.057   0.115   0.159       2.81    5.94    14.28
ALEV     50.86   53.64   81.85       86.12   88.36   120.15
GRAD     5.836   6.107   6.479       28.85   30.06   37.51

6 Conclusion

In this paper we have proposed the gradient-based sampling algorithm for approximating the LS solution. The algorithm is not only statistically efficient but also computationally cheap. 
Theoretically, we provide the error bound analysis, which supplies a justification for the algorithm and gives a tradeoff between the subsample size and approximation efficiency. We also argue from empirical studies that: (1) since the gradient-based sampling algorithm is justified without a linear model assumption, it works better than leverage-based sampling under different model specifications; (2) Poisson sampling is much better than sampling with replacement as the sampling ratio r/n increases.

Table 2: The running time of obtaining \tilde{\beta} by various sampling strategies. 
Stage D1 is computing the sampling weights, D2 is computing the LS solution on the subsample, and "overall" is the total running time.

Stage    D1             D2                   overall
Full     -              O(max{nd², d³})      O(max{nd², d³})
UNIF     -              O(max{rd², d³})      O(max{rd², d³})
LEV      O(nd²)         O(max{rd², d³})      O(max{nd², rd², d³})
ALEV     O(nd log n)    O(max{rd², d³})      O(max{nd log n, rd², d³})
GRAD     O(nd)          O(max{rd², d³})      O(max{nd, rd², d³})

Figure 4: Boxplots of the logarithm of sampling probabilities for LEV and GRAD on the datasets CASP and NEWS.

Table 3: The MSE comparison among various methods for real datasets, where "SR" denotes sampling with replacement and "PS" denotes Poisson sampling.

CASP (n = 45730, d = 9)
r           45          180         450         1800        4500
UNIF-SR     2.998e-05   9.285e-06   4.411e-06   1.330e-06   4.574e-07
UNIF-PS     2.702e-05   9.669e-06   4.243e-06   1.369e-06   4.824e-07
LEV-SR      1.962e-05   4.379e-06   1.950e-06   4.594e-07   2.050e-07
LEV-PS      2.118e-05   5.240e-06   1.689e-06   4.685e-07   1.694e-07
GRAD-SR     2.069e-05   5.711e-06   1.861e-06   4.322e-07   1.567e-07
GRAD-PS     2.411e-05   5.138e-06   1.678e-06   3.687e-07   1.179e-07

NEWS (n = 39644, d = 59)
r           300       600       1200      2400      4800
UNIF-SR     22.050    14.832    10.790    7.110     4.722
UNIF-PS     27.215    19.607    15.258    9.504     4.378
LEV-SR      22.487    11.047    5.519     2.641     1.392
LEV-PS      21.971    9.419     4.072     2.101     0.882
GRAD-SR     10.997    5.508     3.074     1.505     0.752
GRAD-PS     9.729     5.252     2.403     1.029     0.399

There is an interesting problem to address in future work. Although gradient-based sampling is proposed to approximate the LS solution in this paper, we believe that this sampling method can be applied to other optimization problems in large-scale data analysis, since the gradient gives the steepest descent direction toward a (local) optimum.
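As one illustration of this extension (our own sketch, not an algorithm from the paper), the same O(nd) recipe carries over to logistic regression by swapping the LS residual for the logistic pseudo-residual; the function name and pilot-estimate convention are hypothetical.

```python
import numpy as np

def gradient_sampling_probs_logistic(X, y, beta0):
    """Hypothetical extension of gradient-based sampling probabilities to
    logistic regression. The gradient of the log-loss at point i is
    (sigmoid(x_i' beta0) - y_i) x_i, so, exactly as in the LS case, its
    norm factorizes into |pseudo-residual_i| * ||x_i|| and the sampling
    probabilities still cost O(nd) to compute."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta0)))  # predicted class probabilities
    grad_norms = np.abs(p - y) * np.linalg.norm(X, axis=1)
    return grad_norms / grad_norms.sum()
```

Any loss whose per-point gradient factorizes as a scalar residual times x_i admits the same cheap computation, which is why the extension looks plausible.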
Applying this idea to other optimization problems is thus an interesting direction for future study.

Acknowledgments

This research was supported by National Natural Science Foundation of China grants 11301514 and 71532013. We thank Xiuyuan Cheng for comments on a preliminary version.
References

[1] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices I: approximating matrix multiplication. SIAM Journal on Computing, 36:132-157, 2006.

[2] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices II: computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36:158-183, 2006.

[3] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices III: computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36:184-206, 2006.

[4] M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106:697-702, 2009.

[5] P. Drineas, M. Magdon-Ismail, M.W. Mahoney, and D.P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475-3506, 2012.

[6] K.L. Clarkson and D.P. Woodruff. Low rank approximation and regression in input sparsity time.
STOC, 2013.

[7] M.B. Cohen, Y.T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. Uniform sampling for matrix approximation. arXiv:1408.5099, 2014.

[8] P. Dhillon, Y. Lu, D.P. Foster, and L. Ungar. New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, volume 26, pages 360-368, 2013.

[9] D. Shender and J. Lafferty. Computation-risk tradeoffs for covariance-thresholded regression. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[10] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127-1136, 2006.

[11] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decomposition. SIAM Journal on Matrix Analysis and Applications, 30:844-881, 2008.

[12] P. Drineas, M.W. Mahoney, S. Muthukrishnan, and T. Sarlos. Faster least squares approximation. Numerische Mathematik, 117:219-249, 2011.

[13] P. Ma, M.W. Mahoney, and B. Yu. A statistical perspective on algorithmic leveraging. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[14] G. Raskutti and M.W. Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[15] T. Yang, L. Zhang, R. Jin, and S. Zhu. An explicit sampling dependent spectral error bound for column subset selection. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[16] C.E. Särndal, B. Swensson, and J.H. Wretman. Model Assisted Survey Sampling. Springer, New York, 2003.