{"title": "Compressed Least-Squares Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1213, "page_last": 1221, "abstract": "We consider the problem of learning, from K input data, a regression function in a function space of high dimension N using projections onto a random subspace of lower dimension M. From any linear approximation algorithm using empirical risk minimization (possibly penalized), we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We apply the analysis to the ordinary Least-Squares regression and show that by choosing M=O(\\sqrt{K}), the estimation error (for the quadratic loss) of the ``Compressed Least Squares Regression'' is O(1/\\sqrt{K}) up to logarithmic factors. We also discuss the numerical complexity of several algorithms (both in initial and compressed domains) as a function of N, K, and M.", "full_text": "Compressed Least-Squares Regression\n\nOdalric-Ambrym Maillard and R\u00e9mi Munos\n\nSequeL Project, INRIA Lille - Nord Europe, France\n\n{odalric.maillard, remi.munos}@inria.fr\n\nAbstract\n\nWe consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). 
We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting \u201cCompressed Least Squares Regression\u201d (CLSR) in terms of N, K, and M. When we choose M = O(\u221aK), we show that CLSR has an estimation error of order O(log K/\u221aK).\n\n1 Problem setting\nWe consider a regression problem where the observed data D_K = ({x_k, y_k}_{k\u2264K}) (where x_k \u2208 X and y_k \u2208 R) are assumed to be independently and identically distributed (i.i.d.) from some distribution P, where x_k \u223c P_X and y_k = f\u2217(x_k) + \u03b7_k(x_k), where f\u2217 is the (unknown) target function and \u03b7_k is a centered independent noise of variance \u03c3^2(x_k). For a given class of functions F, and f \u2208 F, we define the empirical (quadratic) error\n\nL_K(f) \\overset{\\mathrm{def}}{=} \\frac{1}{K} \\sum_{k=1}^{K} [y_k - f(x_k)]^2,\n\nand the generalization (quadratic) error\n\nL(f) \\overset{\\mathrm{def}}{=} E_{(X,Y) \\sim P}[(Y - f(X))^2].\n\nOur goal is to return a regression function f\u0302 \u2208 F with lowest possible generalization error L(f\u0302).\n\nNotations: In the sequel we will make use of the following notations about norms: for h : X \u2192 R, we write ||h||_P for the L2 norm of h with respect to (w.r.t.) the measure P, ||h||_{P_K} for the L2 norm of h w.r.t. the empirical measure P_K, and for u \u2208 R^n, ||u|| denotes by default (\u2211_{i=1}^{n} u_i^2)^{1/2}.\n\nThe measurable function minimizing the generalization error is f\u2217, but it may be the case that f\u2217 \u2209 F. For any regression function f\u0302, we define the excess risk\n\nL(\\hat f) - L(f^*) = \\|\\hat f - f^*\\|_P^2,\n\nwhich decomposes as the sum of the estimation error L(f\u0302) \u2212 inf_{f\u2208F} L(f) and the approximation error inf_{f\u2208F} L(f) \u2212 L(f\u2217) = inf_{f\u2208F} ||f \u2212 f\u2217||^2_P, which measures the distance between f\u2217 and the function space F.\n\nIn this paper we consider a class of linear functions F_N defined as the span of a set of N functions {\u03d5_n}_{1\u2264n\u2264N} called features. Thus: F_N def= {f_\u03b1 = \u2211_{n=1}^{N} \u03b1_n \u03d5_n, \u03b1 \u2208 R^N}.\nWhen the number of data K is larger than the number of features N, the ordinary Least-Squares Regression (LSR) provides the LS solution f_\u03b1\u0302, which is the minimizer of the empirical risk L_K(f) in F_N. Note that here L_K(f_\u03b1) rewrites (1/K)||\u03a6\u03b1 \u2212 Y||^2, where \u03a6 is the K \u00d7 N matrix with elements (\u03d5_n(x_k))_{1\u2264n\u2264N,1\u2264k\u2264K} and Y the K-vector with components (y_k)_{1\u2264k\u2264K}.\nUsual results provide bounds on the estimation error as a function of the capacity of the function space and the number of data. In the case of linear approximation, the capacity measures (such as covering numbers [23] or the pseudo-dimension [16]) depend on the number of features (for example the pseudo-dimension is at most N + 1). 
For example, let f_\u03b1\u0302 be a LS estimate (minimizer of L_K in F_N), then (a more precise statement will be given later in Subsection 3.1) the expected estimation error is bounded as:\n\nE\\big[L(f_{\\hat\\alpha}) - \\inf_{f \\in F_N} L(f)\\big] \\le c \\sigma^2 \\frac{N \\log K}{K},    (1)\n\nwhere c is a universal constant, \u03c3 def= sup_{x\u2208X} \u03c3(x), and the expectation is taken with respect to P. Now, the excess risk is the sum of this estimation error and the approximation error inf_{f\u2208F_N} ||f \u2212 f\u2217||^2_P of the class F_N. Since the latter usually decreases when the number of features N increases [13] (e.g. when \u22c3_N F_N is dense in L2(P)), we see the usual tradeoff between small estimation error (low N) and small approximation error (large N).\nIn this paper we are interested in the setting when N is large so that the approximation error is small. Whenever N is larger than K we face the overfitting problem since there are more parameters than actual data (more variables than constraints), as illustrated by the bound (1), which then provides no information about the generalization ability of any LS estimate. In addition, there are many minimizers of the empirical risk (in fact an affine space of the same dimension as the null space of \u03a6^T\u03a6). To overcome the problem, several approaches have been proposed in the literature:\n\n\u2022 LS solution with minimal norm: The solution is the minimizer of the empirical error with minimal (\u21131 or \u21132)-norm: \u03b1\u0302 = arg min_{\u03a6\u03b1=Y} ||\u03b1||_{1 or 2} (or a robust solution arg min_{||\u03a6\u03b1\u2212Y||_2\u2264\u03b5} ||\u03b1||_1). The choice of \u21132-norm yields the ordinary LS solution. The choice of \u21131-norm has been used for generating sparse solutions (e.g. 
the Basis Pursuit [10]), and assuming that the target function admits a sparse decomposition, the field of Compressed Sensing [9, 21] provides sufficient conditions for recovering the exact solution. However, such conditions (e.g. that \u03a6 possesses a Restricted Isometry Property (RIP)) do not hold in general in this regression setting. On another aspect, solving these problems (both for \u21131 or \u21132-norm) when N is large is numerically expensive.\n\n\u2022 Regularization: The solution is the minimizer of the empirical error plus a penalty term, for example\n\n\\hat f = \\arg\\min_{f \\in F_N} L_K(f) + \\lambda \\|f\\|_p^p, \\quad \\text{for } p = 1 \\text{ or } 2,\n\nwhere \u03bb is a parameter and usual choices for the norm are \u21132 (ridge-regression [20]) and \u21131 (LASSO [19]). A close alternative is the Dantzig selector [8, 5] which solves: \u03b1\u0302 = arg min_{||\u03b1||_1\u2264\u03bb} ||\u03a6^T(Y \u2212 \u03a6\u03b1)||_\u221e. The numerical complexity and generalization bounds of those methods depend on the sparsity of the target function decomposition in F_N.\n\nNow if we possess a sequence of function classes (F_N)_{N\u22651} with increasing capacity, we may perform structural risk minimization [22] by solving in each model the empirical risk penalized by a term that depends on the size of the model: f\u0302_N = arg min_{f\u2208F_N, N\u22651} L_K(f) + pen(N, K), where the penalty term measures the capacity of the function space.\nIn this paper we follow another approach where instead of searching in the large space F_N (where N > K) for a solution that minimizes the empirical error plus a penalty term, we simply search for the empirical error minimizer in a (randomly generated) lower dimensional subspace G_M \u2282 F_N (where M < K).\n\nOur contribution: We consider a set of M random linear combinations of the initial N features and perform our favorite LS regression algorithm (possibly regularized) using those 
\u201ccompressed features\u201d. This is equivalent to projecting the K points {\u03d5(x_k) \u2208 R^N, k = 1..K} from the initial domain (of size N) onto a random subspace of dimension M, and then performing the regression in the \u201ccompressed domain\u201d (i.e. the span of the compressed features). This is made possible because random projections approximately preserve inner products between vectors (by a variant of the Johnson-Lindenstrauss Lemma stated in Proposition 1).\nOur main result is a bound on the excess risk of a linear estimator built in the compressed domain in terms of the excess risk of the linear estimator built in the initial domain (Section 2). We further detail the case of ordinary Least-Squares Regression (Section 3) and discuss, in terms of M, N, K, the different tradeoffs concerning the excess risk (reduced estimation error in the compressed domain versus increased approximation error introduced by the random projection) and the numerical complexity (reduced complexity of solving the LSR in the compressed domain versus the additional load of performing the projection).\nAs a consequence, we show that by choosing M = O(\u221aK) projections we define a Compressed Least-Squares Regression which uses O(NK^{3/2}) elementary operations to compute a regression function with estimation error (relative to the initial function space F_N) of order log K/\u221aK up to a multiplicative factor which depends on the best approximation of f\u2217 in F_N. This is competitive with the best methods, to our knowledge.\n\nRelated works: Using dimension reduction and random projections in various learning areas has received considerable interest over the past few years. In [7], the authors use an SVM algorithm in a compressed space for the purpose of classification and show that their resulting algorithm has good generalization properties. 
In [25], the authors consider a notion of compressed linear regression. For data Y = X\u03b2 + \u03b5, where \u03b2 is the target and \u03b5 a standard noise, they use compression of the set of data, thus considering AY = AX\u03b2 + A\u03b5, where A has a Restricted Isometry Property. They provide an analysis of the LASSO estimator built from these compressed data, and discuss a property called sparsistency, i.e. the number of random projections needed to recover \u03b2 (with high probability) when it is sparse. These works differ from our approach in that we do not consider a compressed (input and/or output) data space but a compressed feature space instead.\nIn [11], the authors discuss how compressed measurements may be useful to solve many detection, classification and estimation problems without ever having to reconstruct the signal. Interestingly, like in our work, they make no assumption about the signal being sparse. In [6, 17], the authors show how to map a kernel k(x, y) = \u03d5(x) \u00b7 \u03d5(y) into a low-dimensional space, while still approximately preserving the inner products. Thus they build a low-dimensional feature space specific for (translation invariant) kernels.\n\n2 Linear regression in the compressed domain\nRecall that the initial set of features is {\u03d5_n : X \u2192 R, 1 \u2264 n \u2264 N} and the initial domain F_N def= {f_\u03b1 = \u2211_{n=1}^{N} \u03b1_n \u03d5_n, \u03b1 \u2208 R^N} is the span of those features. We write \u03d5(x) for the N-vector of components (\u03d5_n(x))_{n\u2264N}. Let us now define the random projection. Let A be an M \u00d7 N matrix of i.i.d. elements drawn from some distribution \u03c1. Examples of distributions are:\n\n\u2022 Gaussian random variables N(0, 1/M),\n\u2022 \u00b1 Bernoulli distributions, i.e. 
which takes values \u00b11/\u221aM with equal probability 1/2,\n\u2022 Distribution taking values \u00b1\u221a(3/M) with probability 1/6 and 0 with probability 2/3.\n\nThe following result (proof in the supplementary material) states the property that inner products are approximately preserved through random projections (this is a simple consequence of the Johnson-Lindenstrauss Lemma):\nProposition 1 Let (u_k)_{1\u2264k\u2264K} and v be vectors of R^N. Let A be an M \u00d7 N matrix of i.i.d. elements drawn from one of the previously defined distributions. For any \u03b5 > 0, \u03b4 > 0, for M \u2265 log(4K/\u03b4) / (\u03b5^2/4 \u2212 \u03b5^3/6), we have, with probability at least 1 \u2212 \u03b4, for all k \u2264 K,\n\n|A u_k \\cdot A v - u_k \\cdot v| \\le \\varepsilon \\|u_k\\| \\|v\\|.\n\nWe now introduce the set of M compressed features (\u03c8_m)_{1\u2264m\u2264M} such that \u03c8_m(x) def= \u2211_{n=1}^{N} A_{m,n} \u03d5_n(x). We also write \u03c8(x) for the M-vector of components (\u03c8_m(x))_{m\u2264M}. Thus \u03c8(x) = A\u03d5(x). We define the compressed domain G_M def= {g_\u03b2 = \u2211_{m=1}^{M} \u03b2_m \u03c8_m, \u03b2 \u2208 R^M} as the span of the compressed features (vector space of dimension at most M). Note that each \u03c8_m \u2208 F_N, thus G_M is a subspace of F_N.\n\n2.1 Approximation error\nWe now compare the approximation error assessed in the compressed domain G_M versus in the initial space F_N. This applies to the linear algorithms mentioned in the introduction such as ordinary LS regression (analyzed in detail in Section 3), but also its penalized versions, e.g. LASSO and ridge regression. Define \u03b1+ = arg min_{\u03b1\u2208R^N} L(f_\u03b1) \u2212 L(f\u2217) the parameter of the best regression function in F_N.\nTheorem 1 For any \u03b4 > 0, any M \u2265 15 log(8K/\u03b4), let A be a random M \u00d7 N matrix defined like in Proposition 1, and G_M be the compressed domain resulting from this choice of A. 
Then with probability at least 1 \u2212 \u03b4,\n\n\\inf_{g \\in G_M} \\|g - f^*\\|_P^2 \\le \\frac{8 \\log(8K/\\delta)}{M} \\|\\alpha^+\\|^2 \\Big( E\\big[\\|\\varphi(X)\\|^2\\big] + 2 \\sup_{x \\in X} \\|\\varphi(x)\\|^2 \\sqrt{\\frac{\\log(4/\\delta)}{2K}} \\Big) + \\inf_{f \\in F_N} \\|f - f^*\\|_P^2.    (2)\n\nThis theorem shows the tradeoff in terms of estimation and approximation errors for an estimator g\u0302 obtained in the compressed domain compared to an estimator f\u0302 obtained in the initial domain:\n\u2022 Bounds on the estimation error of g\u0302 in G_M are usually smaller than that of f\u0302 in F_N when M < N (since the capacity of F_N is larger than that of G_M).\n\u2022 Theorem 1 says that the approximation error assessed in G_M increases by at most O(log(K/\u03b4)/M) ||\u03b1+||^2 E||\u03d5(X)||^2 compared to that in F_N.\n\nProof: Let us write f+ def= f_{\u03b1+} = arg min_{f\u2208F_N} ||f \u2212 f\u2217||_P and g+ def= g_{A\u03b1+}. The approximation error assessed in the compressed domain G_M is bounded as\n\n\\inf_{g \\in G_M} \\|g - f^*\\|_P^2 \\le \\|g^+ - f^*\\|_P^2 = \\|g^+ - f^+\\|_P^2 + \\|f^+ - f^*\\|_P^2,    (3)\n\nsince f+ is the orthogonal projection of f\u2217 on F_N and g+ belongs to F_N. We now bound ||g+ \u2212 f+||^2_P using concentration inequalities. Define Z(x) def= A\u03b1+ \u00b7 A\u03d5(x) \u2212 \u03b1+ \u00b7 \u03d5(x). Define \u03b5^2 def= (8/M) log(8K/\u03b4). For M \u2265 15 log(8K/\u03b4) we have \u03b5 < 3/4, thus M \u2265 log(8K/\u03b4) / (\u03b5^2/4 \u2212 \u03b5^3/6). Proposition 1 applies and says that on an event E of probability at least 1 \u2212 \u03b4/2, we have for all k \u2264 K,\n\n|Z(x_k)| \\le \\varepsilon \\|\\alpha^+\\| \\|\\varphi(x_k)\\| \\le \\varepsilon \\|\\alpha^+\\| \\sup_{x \\in X} \\|\\varphi(x)\\| \\overset{\\mathrm{def}}{=} C.    (4)\n\nOn the event E, we have with probability at least 1 \u2212 \u03b4\u2032,\n\n\\|g^+ - f^+\\|_P^2 = E_{X \\sim P_X} |Z(X)|^2 \\le \\frac{1}{K} \\sum_{k=1}^{K} |Z(x_k)|^2 + C^2 \\sqrt{\\frac{\\log(2/\\delta')}{2K}}\n\\le \\varepsilon^2 \\|\\alpha^+\\|^2 \\Big( \\frac{1}{K} \\sum_{k=1}^{K} \\|\\varphi(x_k)\\|^2 + \\sup_{x \\in X} \\|\\varphi(x)\\|^2 \\sqrt{\\frac{\\log(2/\\delta')}{2K}} \\Big)\n\\le \\varepsilon^2 \\|\\alpha^+\\|^2 \\Big( E\\big[\\|\\varphi(X)\\|^2\\big] + 2 \\sup_{x \\in X} \\|\\varphi(x)\\|^2 \\sqrt{\\frac{\\log(2/\\delta')}{2K}} \\Big),\n\nwhere we applied Chernoff-Hoeffding\u2019s inequality twice. Combining with (3), unconditioning, and setting \u03b4\u2032 = \u03b4/2, then with probability at least (1 \u2212 \u03b4/2)(1 \u2212 \u03b4\u2032) \u2265 1 \u2212 \u03b4 we have (2). \u25a1\n\n2.2 Computational issues\n\nWe now discuss the relative computational costs of a given algorithm applied either in the initial or in the compressed domain. Let us write Cx(D_K, F_N, P) for the complexity (e.g. number of elementary operations) of an algorithm A to compute the regression function f\u0302 when provided with the data D_K and function space F_N.\nWe report in the table below, both for the initial and the compressed versions of the algorithm A, the order of complexity for (i) the cost for building the feature matrix, (ii) the cost for computing the estimator, (iii) the cost for making one prediction (i.e. 
computing f\u0302(x) for any x):\n\n                                    Initial domain     Compressed domain\nConstruction of the feature matrix  NK                 NKM\nComputing the regression function   Cx(D_K, F_N, P)    Cx(D_K, G_M, P)\nMaking one prediction               N                  NM\n\nNote that the values mentioned for the compressed domain are upper bounds on the real complexity and do not take into account the possible sparsity of the projection matrix A (which would speed up matrix computations, see e.g. [2, 1]).\n\n3 Compressed Least-Squares Regression\n\nWe now analyze the specific case of Least-Squares Regression.\n\n3.1 Excess risk of ordinary Least Squares regression\n\nIn order to bound the estimation error, we follow the approach of [13] which truncates (up to the level \u00b1L where L is a bound, assumed to be known, on ||f\u2217||_\u221e) the prediction of the LS regression function. The ordinary LS regression provides the regression function f_\u03b1\u0302 where\n\n\\hat\\alpha = \\arg\\min_{\\alpha \\in \\arg\\min_{\\alpha' \\in R^N} \\|Y - \\Phi\\alpha'\\|} \\|\\alpha\\|.\n\nNote that \u03a6^T\u03a6\u03b1\u0302 = \u03a6^T Y, hence \u03b1\u0302 = \u03a6\u2020Y \u2208 R^N where \u03a6\u2020 is the Penrose pseudo-inverse of \u03a6 (in the full rank case, \u03a6\u2020 = (\u03a6^T\u03a6)^{\u22121}\u03a6^T when K \u2265 N and \u03a6\u2020 = \u03a6^T(\u03a6\u03a6^T)^{\u22121} when K \u2264 N). Then the truncated predictor is: f\u0302_L(x) def= T_L[f_\u03b1\u0302(x)], where\n\nT_L(u) \\overset{\\mathrm{def}}{=} \\begin{cases} u & \\text{if } |u| \\le L, \\\\ L \\, \\mathrm{sign}(u) & \\text{otherwise.} \\end{cases}\n\nTruncation after the computation of the parameter \u03b1\u0302 \u2208 R^N, which is the solution of an unconstrained optimization problem, is easier than solving an optimization problem under the constraint that ||\u03b1|| is small (which is the approach followed in [23]) and allows for consistency results and prediction bounds. Indeed, the excess risk of f\u0302_L is bounded as\n\nE(\\|\\hat f_L - f^*\\|_P^2) \\le c' \\max\\{\\sigma^2, L^2\\} \\frac{1 + \\log K}{K} N + 8 \\inf_{f \\in F_N} \\|f - f^*\\|_P^2,    (5)\n\nwhere a bound on c\u2032 is 9216 (see [13]). We have a simpler bound when we consider the expectation E_Y conditionally on the input data:\n\nE_Y(\\|\\hat f - f^*\\|_{P_K}^2) \\le \\sigma^2 \\frac{N}{K} + \\inf_{f \\in F_N} \\|f - f^*\\|_{P_K}^2.    (6)\n\nRemark: Note that because we use the quadratic loss function, by following the analysis in [3], or by deriving tight bounds on the Rademacher complexity [14] and following Theorem 5.2 of Koltchinskii\u2019s Saint Flour course, it is actually possible to state assumptions under which we can remove the log K term in (5). We will not further detail such bounds since our motivation here is not to provide the tightest possible bounds, but rather to show how the excess risk bound for LS regression in the initial domain extends to the compressed domain.\n\n3.2 Compressed Least-Squares Regression (CLSR)\n\nCLSR is defined as the ordinary LSR in the compressed domain. Let \u03b2\u0302 = \u03a8\u2020Y \u2208 R^M, where \u03a8 is the K \u00d7 M matrix with elements (\u03c8_m(x_k))_{1\u2264m\u2264M,1\u2264k\u2264K}. The CLSR estimate is defined as g\u0302_L(x) def= T_L[g_\u03b2\u0302(x)]. From Theorem 1, (5) and (6), we deduce the following excess risk bounds for the CLSR estimate:\n\nCorollary 1 For any \u03b4 > 0, set M = 8 (||\u03b1+|| \u221a(E||\u03d5(X)||^2) / max(\u03c3, L)) \u221a(K log(8K/\u03b4) / (c\u2032(1 + log K))). Then whenever M \u2265 15 log(8K/\u03b4), with probability at least 1 \u2212 \u03b4, the expected excess risk of the CLSR estimate is bounded as\n\nE(\\|\\hat g_L - f^*\\|_P^2) \\le 16 \\sqrt{c'} \\max\\{\\sigma, L\\} \\|\\alpha^+\\| \\sqrt{E\\|\\varphi(X)\\|^2} \\sqrt{\\frac{(1 + \\log K) \\log(8K/\\delta)}{K}} \\times \\Big( 1 + \\frac{\\sup_x \\|\\varphi(x)\\|^2}{E\\|\\varphi(X)\\|^2} \\sqrt{\\frac{\\log(4/\\delta)}{2K}} \\Big) + 8 \\inf_{f \\in F_N} \\|f - f^*\\|_P^2.    (7)\n\nNow set M = \u221a(8K log(8K/\u03b4)) ||\u03b1+|| \u221a(E||\u03d5(X)||^2) / \u03c3. Assume N > K and that the features (\u03d5_k)_{1\u2264k\u2264K} are linearly independent. Then whenever M \u2265 15 log(8K/\u03b4), with probability at least 1 \u2212 \u03b4, the expected excess risk of the CLSR estimate conditionally on the input samples is upper bounded as\n\nE_Y(\\|\\hat g_L - f^*\\|_{P_K}^2) \\le 4 \\sigma \\|\\alpha^+\\| \\sqrt{E\\|\\varphi(X)\\|^2} \\sqrt{\\frac{2 \\log(8K/\\delta)}{K}} \\Big( 1 + \\frac{\\sup_x \\|\\varphi(x)\\|^2}{E\\|\\varphi(X)\\|^2} \\sqrt{\\frac{\\log(4/\\delta)}{2K}} \\Big).\n\nProof: Whenever M \u2265 15 log(8K/\u03b4) we deduce from Theorem 1 and (5) that the excess risk of g\u0302_L is bounded as\n\nE(\\|\\hat g_L - f^*\\|_P^2) \\le c' \\max\\{\\sigma^2, L^2\\} \\frac{1 + \\log K}{K} M + 8 \\Big[ \\frac{8 \\log(8K/\\delta)}{M} \\|\\alpha^+\\|^2 \\Big( E\\|\\varphi(X)\\|^2 + 2 \\sup_x \\|\\varphi(x)\\|^2 \\sqrt{\\frac{\\log(4/\\delta)}{2K}} \\Big) + \\inf_{f \\in F_N} \\|f - f^*\\|_P^2 \\Big].\n\nBy optimizing on M, we deduce (7). Similarly, using (6) we deduce the following bound on E_Y(||g\u0302_L \u2212 f\u2217||^2_{P_K}):\n\n\\sigma^2 \\frac{M}{K} + \\frac{8}{M} \\log(8K/\\delta) \\|\\alpha^+\\|^2 \\Big( E\\|\\varphi(X)\\|^2 + 2 \\sup_x \\|\\varphi(x)\\|^2 \\sqrt{\\frac{\\log(4/\\delta)}{2K}} \\Big) + \\inf_{f \\in F_N} \\|f - f^*\\|_{P_K}^2.\n\nBy optimizing on M and noticing that inf_{f\u2208F_N} ||f \u2212 f\u2217||^2_{P_K} = 0 whenever N > K and the features (\u03d5_k)_{1\u2264k\u2264K} are linearly independent, we deduce the second result. \u25a1\n\nRemark 1 Note that the second term in the parenthesis of (7) is negligible whenever K \u226b log 1/\u03b4. Thus we have the expected excess risk\n\nE(\\|\\hat g_L - f^*\\|_P^2) = O\\Big( \\|\\alpha^+\\| \\sqrt{E\\|\\varphi(X)\\|^2} \\frac{\\log(K/\\delta)}{\\sqrt{K}} + \\inf_{f \\in F_N} \\|f - f^*\\|_P^2 \\Big).    (8)\n\nThe choice of M in the previous corollary depends on ||\u03b1+|| and E||\u03d5(X)|| which are a priori unknown (since f\u2217 and P_X are unknown). If we set M independently of ||\u03b1+||, then an additional multiplicative factor of ||\u03b1+|| appears in the bound, and if we replace E||\u03d5(X)|| by its bound sup_x ||\u03d5(x)|| (which is known) then this latter factor will appear instead of the former in the bound.\n\nComplexity of CLSR: The complexity of LSR for computing the regression function in the compressed domain only depends on M and K, and is (see e.g. [4]) Cx(D_K, G_M, P) = O(MK^2), which is of order O(K^{5/2}) when we choose the optimized number of projections M = O(\u221aK). However the leading term when using CLSR is the cost for building the \u03a8 matrix: O(NK^{3/2}).\n\n4 Discussion\n\n4.1 The factor ||\u03b1+|| \u221a(E||\u03d5(X)||^2)\n\nIn light of Corollary 1, the important factor which will determine whether the CLSR provides low generalization error or not is ||\u03b1+|| \u221a(E||\u03d5(X)||^2). This factor indicates that a good set of features (for CLSR) should be such that the norm of those features as well as the norm of the parameter \u03b1+ of the projection of f\u2217 onto the span of those features should be small. A natural question is whether this product can be made small for appropriate choices of features. We now provide two specific cases for which this is actually the case: (1) when the features are rescaled orthonormal basis functions, and (2) when the features are specific wavelet functions. In both cases, we relate the bound to an assumption of regularity on the function f\u2217, and show that the dependency w.r.t. N decreases when the regularity increases, and may even vanish.\n\nRescaled Orthonormal Features: Consider a set of orthonormal functions (\u03b7_i)_{i\u22651} w.r.t. a measure \u03bc, i.e. \u27e8\u03b7_i, \u03b7_j\u27e9_\u03bc = \u03b4_{i,j}. In addition we assume that the law of the input data is dominated by \u03bc, i.e. P_X \u2264 C\u03bc where C is a constant. For instance, this is the case when the set X is compact, \u03bc is the uniform measure and P_X has bounded density.\n\nWe define the set of N features as: \u03d5_i def= c_i \u03b7_i, where c_i > 0, for i \u2208 {1, . . . , N}. Then any f \u2208 F_N decomposes as f = \u2211_{i=1}^{N} \u27e8f, \u03b7_i\u27e9 \u03b7_i = \u2211_{i=1}^{N} (b_i/c_i) \u03d5_i, where b_i def= \u27e8f, \u03b7_i\u27e9. Thus we have: ||\u03b1||^2 = \u2211_{i=1}^{N} (b_i/c_i)^2 and E||\u03d5||^2 = \u2211_{i=1}^{N} c_i^2 \u222b_X \u03b7_i^2(x) dP_X(x) \u2264 C \u2211_{i=1}^{N} c_i^2. Thus ||\u03b1+||^2 E||\u03d5||^2 \u2264 C \u2211_{i=1}^{N} (b_i/c_i)^2 \u2211_{i=1}^{N} c_i^2.\n\nNow, linear approximation theory (Jackson-type theorems) tells us that assuming a function f\u2217 \u2208 L2(\u03bc) is smooth, it may be decomposed onto the span of the N first (\u03b7_i)_{i\u2208{1,...,N}} functions with decreasing coefficients |b_i| \u2264 i^{\u2212\u03bb} for some \u03bb \u2265 0 that depends on the smoothness of f\u2217. For example the class of functions with bounded total variation may be decomposed with Fourier basis (in dimension 1) with coefficients |b_i| \u2264 ||f||_V/(2\u03c0i). Thus here \u03bb = 1. Other classes (such as Sobolev spaces) lead to larger values of \u03bb related to the order of differentiability.\n\nBy choosing c_i = i^{\u2212\u03bb/2}, we have ||\u03b1+|| \u221a(E||\u03d5||^2) \u2264 \u221aC \u2211_{i=1}^{N} i^{\u2212\u03bb}. Thus if \u03bb > 1, then this term is bounded by a constant that does not depend on N. If \u03bb = 1 then it is bounded by O(log N), and if 0 < \u03bb < 1, then it is bounded by O(N^{1\u2212\u03bb}).\n\nHowever any orthonormal basis, even rescaled, would not necessarily yield a small ||\u03b1+|| \u221a(E||\u03d5||^2) term (this is all the more true when the dimension of X is large). The desired property that the coefficients (\u03b1+)_i of the decomposition of f\u2217 rapidly decrease to 0 indicates that hierarchical bases, such as wavelets, that would decompose the function at different scales, may be interesting.\n\nWavelets: Consider an infinite family of wavelets in [0, 1]: (\u03d5^0_n) = (\u03d5^0_{h,l}) (indexed by n \u2265 1 or equivalently by the scale h \u2265 0 and translation 0 \u2264 l \u2264 2^h \u2212 1) where \u03d5^0_{h,l}(x) = 2^{h/2} \u03d5^0(2^h x \u2212 l) and \u03d5^0 is the mother wavelet. 
Then consider N = 2^H features (\u03d5_{h,l})_{1\u2264h\u2264H} defined as the rescaled wavelets \u03d5_{h,l} def= c_h 2^{\u2212h/2} \u03d5^0_{h,l}, where c_h > 0 are some coefficients. Assume the mother wavelet is C^p (for p \u2265 1), has at least p vanishing moments, and that for all h \u2265 0, sup_x \u2211_l \u03d5^0(2^h x \u2212 l)^2 \u2264 1. Then the following result (proof in the supplementary material) provides a bound on sup_{x\u2208X} ||\u03d5(x)||^2 (thus on \u221a(E||\u03d5(X)||^2)) by a constant independent of N:\n\nProposition 2 Assume that f\u2217 is (L, \u03b3)-Lipschitz (i.e. for all v \u2208 X there exists a polynomial p_v of degree \u230a\u03b3\u230b such that for all u \u2208 X, |f(u) \u2212 p_v(u)| \u2264 L|u \u2212 v|^\u03b3) with 1/2 < \u03b3 \u2264 p. Then setting c_h = 2^{h(1\u22122\u03b3)/4}, we have\n\n\\|\\alpha^+\\| \\sup_x \\|\\varphi(x)\\| \\le L \\frac{2^{\\gamma}}{1 - 2^{1/2 - \\gamma}} \\int_0^1 |\\varphi^0|,\n\nwhich is independent of N.\n\nNotice that the Haar wavelet has p = 1 vanishing moment but is not C^1, thus the Proposition does not apply directly. However direct computations show that if f\u2217 is L-Lipschitz (i.e. \u03b3 = 1) then \u03b1^0_{h,l} \u2264 L 2^{\u22123h/2\u22122}, and thus ||\u03b1+|| sup_x ||\u03d5(x)|| \u2264 L / (4(1 \u2212 2^{\u22121/2})) with c_h = 2^{\u2212h/4}.\n\n4.2 Comparison with other methods\n\nIn the case when the factor ||\u03b1+|| \u221a(E||\u03d5(X)||^2) does not depend on N (such as in the previous example), the bound (8) on the excess risk of CLSR states that the estimation error (assessed in terms of F_N) of CLSR is O(log K/\u221aK). It is clear that whenever N > \u221aK (which is the case of interest here), this is better than the ordinary LSR in the initial domain, whose estimation error is O(N log K/K).\nIt is difficult to compare this result with LASSO (or the Dantzig selector that has similar properties [5]) for which an important aspect is to design sparse regression functions or to recover a solution assumed to be sparse. From [12, 15, 24] one deduces that under some assumptions, the estimation error of LASSO is of order S log N / K where S is the sparsity (number of non-zero coefficients) of the best regressor f+ in F_N. If S < \u221aK then LASSO is more interesting than CLSR in terms of excess risk. Otherwise CLSR may be an interesting alternative although this method does not make any assumption about the sparsity of f+ and its goal is not to recover a possible sparse f+ but only to make good predictions. However, in some sense our method finds a sparse solution, in that the regression function g\u0302_L lies in a space G_M of small dimension M \u226a N and can thus be expressed using only M coefficients.\nNow in terms of numerical complexity, CLSR requires O(NK^{3/2}) operations to build the matrix and compute the regression function, whereas according to [18], the (heuristic) complexity of the LASSO algorithm is O(NK^2) in the best cases (assuming that the number of steps required for convergence is O(K), which is not proved theoretically). Thus CLSR seems to be a good and simple competitor to LASSO.\n\n5 Conclusion\n\nWe considered the case when the number of features N is larger than the number of data K. 
The result stated in Theorem 1 makes it possible to analyze the excess risk of any linear regression algorithm (LS or its penalized versions) performed in the compressed domain G_M versus in the initial space F_N. In the compressed domain the estimation error is reduced but an additional (controlled) approximation error (when compared to the best regressor in F_N) comes into the picture. In the case of LS regression, when the term ||\u03b1+|| \u221a(E||\u03d5(X)||^2) has a mild dependency on N, then by choosing a random subspace of dimension M = O(\u221aK), CLSR has an estimation error (assessed in terms of F_N) bounded by O(log K/\u221aK) and has numerical complexity O(NK^{3/2}).\nIn short, CLSR provides an alternative to usual penalization techniques where one first selects a random subspace of lower dimension and then performs empirical risk minimization in this subspace. Further work needs to be done to provide additional settings (when the space X is of dimension > 1) for which the term ||\u03b1+|| \u221a(E||\u03d5(X)||^2) is small.\n\nAcknowledgements: The authors wish to thank Laurent Jacques for numerous comments and Alessandro Lazaric and Mohammad Ghavamzadeh for exciting discussions. This work has been supported by French National Research Agency (ANR) through COSINUS program (project EXPLO-RA, ANR-08-COSI-004).\n\nReferences\n[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671\u2013687, June 2003.\n\n[2] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC \u201906: Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, pages 557\u2013563, New York, NY, USA, 2006. ACM.\n\n[3] Jean-Yves Audibert and Olivier Catoni. Risk bounds in linear regression through pac-bayesian truncation. 
Technical Report HAL: hal-00360268, 2009.\n\n[4] David Bau III and Lloyd N. Trefethen. Numerical linear algebra. Philadelphia: Society for Industrial and Applied Mathematics, 1997.\n\n[5] Peter J. Bickel, Ya\u2019acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. To appear in Annals of Statistics, 2008.\n\n[6] Avrim Blum. Random projection, margins, kernels, and feature-selection. Subspace, Latent Structure and Feature Selection, pages 52\u201368, 2006.\n\n[7] Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain. Technical Report, 2009.\n\n[8] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35:2313, 2007.\n\n[9] Emmanuel J. Candes and Justin K. Romberg. Signal recovery from random projections. volume 5674, pages 76\u201386. SPIE, 2005.\n\n[10] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33\u201361, 1998.\n\n[11] Mark A. Davenport, Michael B. Wakin, and Richard G. Baraniuk. Detection and estimation with compressive measurements. Technical Report TREE 0610, Department of Electrical and Computer Engineering, Rice University, 2006.\n\n[12] E. Greenshtein and Y. Ritov. Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli, 10:971\u2013988, 2004.\n\n[13] L. Gy\u00f6rfi, M. Kohler, A. Krzy\u017cak, and H. Walk. A distribution-free theory of nonparametric regression. Springer-Verlag, 2002.\n\n[14] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. 
In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Leon Bottou, editors, Neural Information Processing Systems, pages 793\u2013800. MIT Press, 2008.\n\n[15] Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group Lasso estimator for linear models. Electron. J. Statist., 2:605\u2013633, 2008.\n\n[16] D. Pollard. Convergence of Stochastic Processes. Springer Verlag, New York, 1984.\n\n[17] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Neural Information Processing Systems, 2007.\n\n[18] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012, 2007.\n\n[19] Robert Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267\u2013288, 1994.\n\n[20] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math Dokl 4, pages 1035\u20131038, 1963.\n\n[21] Yaakov Tsaig and David L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52:1289\u20131306, 2006.\n\n[22] Vladimir N. Vapnik. The nature of statistical learning theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.\n\n[23] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527\u2013550, 2002.\n\n[24] Tong Zhang. Some sharp performance bounds for least squares regression with L1 regularization. To appear in Annals of Statistics, 2009.\n\n[25] Shuheng Zhou, John D. Lafferty, and Larry A. Wasserman. Compressed regression. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, Neural Information Processing Systems. MIT Press, 2007.", "award": [], "sourceid": 899, "authors": [{"given_name": "Odalric", "family_name": "Maillard", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}