{"title": "Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness", "book": "Advances in Neural Information Processing Systems", "page_first": 1563, "page_last": 1570, "abstract": "This paper uses information-theoretic techniques to determine minimax rates for estimating nonparametric sparse additive regression models under high-dimensional scaling. We assume an additive decomposition of the form $f^*(X_1, \\ldots, X_p) = \\sum_{j \\in S} h_j(X_j)$, where each component function $h_j$ lies in some Hilbert space $\\mathcal{H}$ and $S \\subset \\{1, \\ldots, p\\}$ is an unknown subset with cardinality $s = |S|$. Given $n$ i.i.d. observations of $f^*(X)$ corrupted with additive white Gaussian noise, where the covariate vectors $(X_1, X_2, \\ldots, X_p)$ are drawn with i.i.d. components from some distribution $\\mathbb{P}$, we determine tight lower bounds on the minimax rate for estimating the regression function with respect to squared $L^2(\\mathbb{P})$ error. The main result shows that the minimax rate is $\\max\\big(\\frac{s \\log(p/s)}{n}, \\; s\\epsilon_n^2(\\mathcal{H})\\big)$. The first term reflects the difficulty of performing \\emph{subset selection} and is independent of the Hilbert space $\\mathcal{H}$; the second term $s\\epsilon_n^2(\\mathcal{H})$ is an \\emph{$s$-dimensional estimation} term, depending only on the low dimension $s$ but not the ambient dimension $p$, that captures the difficulty of estimating a sum of $s$ univariate functions in the Hilbert space $\\mathcal{H}$. As a special case, if $\\mathcal{H}$ corresponds to the $m$-th order Sobolev space $W^m$ of functions that are $m$-times differentiable, the $s$-dimensional estimation term takes the form $s\\epsilon_n^2(\\mathcal{H}) \\asymp s \\, n^{-2m/(2m+1)}$. 
The minimax rates are compared with rates achieved by an $\\ell_1$-penalty based approach; it can be shown that a certain $\\ell_1$-based approach achieves the minimax-optimal rate.", "full_text": "Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness\n\nGarvesh Raskutti1, Martin J. Wainwright1,2, Bin Yu1,2\n\n1UC Berkeley Department of Statistics\n2UC Berkeley Department of Electrical Engineering and Computer Science\n\nAbstract\n\nWe study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f* : R^p -> R that has an additive decomposition of the form f*(X_1, ..., X_p) = sum_{j in S} h*_j(X_j), where each component function h*_j lies in some class H of \"smooth\" functions, and S subset {1, ..., p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L2(P) error. Our main result is a lower bound on the minimax rate that scales as max( s log(p/s)/n , s eps_n^2(H) ). The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s eps_n^2(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an m-th-order Sobolev space), then the s-dimensional estimation term takes the form s eps_n^2(H) ~ s n^{-2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.\n\n1 Introduction\n\nMany problems in modern science and engineering involve high-dimensional data, by which we mean that the ambient dimension p in which the data lie is of the same order as, or larger than, the sample size n. A simple example is parametric linear regression under high-dimensional scaling, in which the goal is to estimate a regression vector beta* in R^p based on n samples. In the absence of additional structure, it is impossible to obtain consistent estimators unless the ratio p/n converges to zero, which precludes the regime p >> n. In many applications, it is natural to impose sparsity conditions, such as requiring that beta* have at most s non-zero parameters for some s << p. The method of ell_1-regularized least squares, also known as the Lasso [14], has been shown to have a number of attractive theoretical properties for such high-dimensional sparse models (e.g., [1, 19, 10]).\n\nOf course, the assumption of a parametric linear model may be too restrictive for some applications. Accordingly, a natural extension is the non-parametric regression model y = f*(x_1, ..., x_p) + w, where w ~ N(0, sigma^2) is additive observation noise. Unfortunately, this general non-parametric model is known to suffer severely from the \"curse of dimensionality\": for most natural function classes, the sample size n required to achieve a given estimation accuracy grows exponentially in the dimension. This challenge motivates the use of additive non-parametric models (see the book [6] and references therein), in which the function f* is decomposed additively as a sum f*(x_1, x_2, ..., x_p) = sum_{j=1}^p h*_j(x_j) of univariate functions h*_j. A natural sub-class of these models are the sparse additive models, studied by Ravikumar et al. [12], in which\n\nf*(x_1, x_2, ..., x_p) = sum_{j in S} h*_j(x_j),    (1)\n\nwhere S subset {1, 2, ..., p} is some unknown subset of cardinality |S| = s.\n\nA line of past work has proposed and analyzed computationally efficient algorithms for estimating regression functions of this form. Just as ell_1-based relaxations such as the Lasso have desirable properties for sparse parametric models, similar ell_1-based approaches have proven successful here. Ravikumar et al. [12] propose a back-fitting algorithm to recover the component functions h_j, and prove consistency both in subset recovery and in empirical L2(P_n) norm. Meier et al. [9] propose a method involving a sparsity-smoothness penalty term, and also demonstrate consistency in L2(P) norm. In the special case that H is a reproducing kernel Hilbert space (RKHS), Koltchinskii and Yuan [7] analyze a least-squares estimator based on imposing an ell_1-ell_H penalty. The analysis in these papers demonstrates that, under certain conditions on the covariates, such regularized procedures can yield estimators that are consistent in the L2(P)-norm even when n << p.\n\nOf complementary interest to the rates achievable by practical methods are the fundamental limits of estimating sparse additive models, meaning lower bounds that apply to any algorithm. Although such lower bounds are well known under classical scaling (where p remains fixed, independent of n), to the best of our knowledge, lower bounds on minimax rates for sparse additive models have not been determined. 
In this paper, our main result is to establish a lower bound on the minimax rate in L2(P) norm that scales as max( s log(p/s)/n , s eps_n^2(H) ). The first term s log(p/s)/n is a subset selection term, independent of the univariate function space H in which the additive components lie, that reflects the difficulty of finding the subset S. The second term s eps_n^2(H) is an s-dimensional estimation term, which depends on the low dimension s but not the ambient dimension p, and reflects the difficulty of estimating the sum of s univariate functions, each drawn from the function class H. Either the subset selection or the s-dimensional estimation term dominates, depending on the relative sizes of n, p and s, as well as on H. Importantly, our analysis applies both in the low-dimensional setting (n >> p) and in the high-dimensional setting (p >> n), provided that n, p and s all tend to infinity. Our analysis is based on information-theoretic techniques centered around the use of metric entropy, mutual information and Fano's inequality in order to obtain lower bounds. Such techniques are standard in the analysis of non-parametric procedures under classical scaling [5, 2, 17], and have also been used more recently to develop lower bounds for high-dimensional inference problems [16, 11].\n\nThe remainder of the paper is organized as follows. In the next section, we set up the problem, including the necessary preliminary concepts, notation and assumptions. In Section 3, we state the main results and provide some comparisons to the rates achieved by existing algorithms. In Section 4, we provide an overview of the proof. We discuss and summarize the main consequences in Section 5.\n\n2 Background and problem formulation\n\nIn this paper, we consider a non-parametric regression model with random design, meaning that we make n observations of the form\n\ny^(i) = f*(X^(i)) + w^(i),  for i = 1, 2, ..., n.    (2)\n\nHere the random vectors X^(i) in R^p are the covariates, and have elements X^(i)_j drawn i.i.d. from some underlying distribution P. We assume that the noise variables w^(i) ~ N(0, sigma^2) are drawn independently, and independently of all the X^(i)'s. Given a base class H of univariate functions with norm ||.||_H, consider the class of functions f : R^p -> R that have an additive decomposition:\n\nF := { f : R^p -> R | f(x_1, x_2, ..., x_p) = sum_{j=1}^p h_j(x_j), and ||h_j||_H <= 1 for all j = 1, ..., p }.\n\nGiven some integer s in {1, ..., p}, we define the function class F_0(s), which is a union of (p choose s) s-dimensional subspaces of F, given by\n\nF_0(s) := { f in F | sum_{j=1}^p I(h_j != 0) <= s }.    (3)\n\nThe minimax rate of estimation over F_0(s) is defined by the quantity min_{fhat} max_{f* in F_0(s)} E ||fhat - f*||^2_{L2(P)}, where the expectation is taken over the noise w and the randomness in the sampling, and fhat ranges over all (measurable) functions of the observations {(y^(i), X^(i))}_{i=1}^n. The goal of this paper is to determine lower bounds on this minimax rate.\n\n2.1 Inner products and norms\n\nGiven univariate functions h_j, h'_j in H, we define the usual L2(P) inner product\n\n<h_j, h'_j>_{L2(P)} := int_R h_j(x) h'_j(x) dP(x).\n\n(With a slight abuse of notation, we use P to refer to the measure over R^p as well as the induced marginal measure in each direction over R.) Without loss of generality (re-centering the functions as needed), we may assume\n\nE[h_j(X)] = int_R h_j(x) dP(x) = 0\n\nfor all h_j in H. As a consequence, we have E[f(X_1, ..., X_p)] = 0 for all functions f in F_0(s). Given our assumption that the covariate vector X = (X_1, ..., X_p) has independent components, the L2(P) inner product on F has the additive decomposition <f, f'>_{L2(P)} = sum_{j=1}^p <h_j, h'_j>_{L2(P)}. 
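This additive decomposition of the inner product is easy to check numerically. The following sketch is not from the paper; the choice of P and of the component functions is ours, purely for illustration. It verifies by Monte Carlo that the cross-terms E[h_j(X_j) h'_k(X_k)] vanish for j != k when the coordinates are independent and each component is centered, so that <f, f'> reduces to the sum of coordinate-wise inner products.

```python
import numpy as np

# Monte Carlo check (illustrative, not from the paper): for independent,
# mean-zero components, <f, f'>_{L2(P)} = sum_j <h_j, h'_j>_{L2(P)},
# i.e. the cross-terms E[h_j(X_j) h'_k(X_k)] vanish for j != k.
rng = np.random.default_rng(0)
n_mc, p = 1_000_000, 3
X = rng.uniform(-1, 1, size=(n_mc, p))  # assumed P = Uniform(-1, 1), i.i.d. coordinates

# Two additive functions with centered components: odd functions on a
# symmetric distribution, so E[h_j(X_j)] = 0 for every component.
h = [np.sin, lambda x: x, lambda x: x**3]
h2 = [lambda x: x, np.sin, lambda x: x**5]

f = sum(h[j](X[:, j]) for j in range(p))
f2 = sum(h2[j](X[:, j]) for j in range(p))

lhs = np.mean(f * f2)                       # empirical <f, f'>
rhs = sum(np.mean(h[j](X[:, j]) * h2[j](X[:, j])) for j in range(p))
print(abs(lhs - rhs))                       # small: only Monte Carlo noise remains
```

The difference between the two sides consists exactly of the empirical cross-terms, which are mean-zero and shrink at the usual 1/sqrt(n_mc) Monte Carlo rate.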
(Note that if independence were not assumed, the L2(P) inner product over F would involve cross-terms.)\n\n2.2 Kullback-Leibler divergence\n\nSince we use information-theoretic techniques, we measure the \"distance\" between distributions with the Kullback-Leibler (KL) divergence. For a given pair of functions f and f~, consider the n-dimensional vectors f(X) = (f(X^(1)), f(X^(2)), ..., f(X^(n)))^T and f~(X) = (f~(X^(1)), f~(X^(2)), ..., f~(X^(n)))^T. Since Y | f(X) ~ N(f(X), sigma^2 I_{n x n}) and Y | f~(X) ~ N(f~(X), sigma^2 I_{n x n}), we have\n\nD( Y | f(X) || Y | f~(X) ) = (1 / (2 sigma^2)) ||f(X) - f~(X)||_2^2.    (4)\n\nWe also use the notation D(f || f~) for the average KL divergence between the distributions of Y induced by the functions f and f~ respectively. Therefore we have the relation\n\nD(f || f~) = E_X[ D( Y | f(X) || Y | f~(X) ) ] = (n / (2 sigma^2)) ||f - f~||^2_{L2(P)}.    (5)\n\nThis relation between average KL divergence and squared L2(P) distance plays an important role in our proof.\n\n2.3 Metric entropy for function classes\n\nIn this section, we define the notion of metric entropy, which provides a way to measure the relative sizes of different function classes with respect to some metric rho. Central to our results is the metric entropy of F_0(s) with respect to the L2(P) norm.\n\nDefinition 1 (Covering and packing numbers). Consider a metric space consisting of a set S and a metric rho : S x S -> R_+.\n\n(a) An eps-covering of S in the metric rho is a collection {f^1, ..., f^N} subset S such that for all f in S, there exists some i in {1, ..., N} with rho(f, f^i) <= eps. The eps-covering number N_rho(eps) is the cardinality of the smallest eps-covering.\n\n(b) An eps-packing of S in the metric rho is a collection {f^1, ..., f^M} subset S such that rho(f^i, f^j) >= eps for all i != j. The eps-packing number M_rho(eps) is the cardinality of the largest eps-packing.\n\nThe covering and packing entropies (denoted by log N_rho(eps) and log M_rho(eps) respectively) are simply the logarithms of the covering and packing numbers. It can be shown that for any convex set, the quantities log N_rho(eps) and log M_rho(eps) are of the same order (within constant factors independent of eps).\n\nIn this paper, we are interested in packing (and covering) subsets of the function class F_0(s) in the L2(P) metric, and so drop the subscript rho from here onwards. En route to characterizing the metric entropy of F_0(s), we need to understand the metric entropy of the unit ball of our univariate function class H, namely the set\n\nB_H(1) := { h in H | ||h||_H <= 1 }.\n\nThe metric entropy (both covering and packing) is known for many classes of functions. We provide some concrete examples here:\n\n(i) Consider the class H = { h_beta : R -> R | h_beta(x) = beta x } of all univariate linear functions, with the norm ||h_beta||_H = |beta|. Then it is known [15] that the metric entropy of B_H(1) scales as log M(eps; H) ~ log(1/eps).\n\n(ii) Consider the class H = { h : [0,1] -> [0,1] | |h(x) - h(y)| <= |x - y| } of all 1-Lipschitz functions on [0,1], with the norm ||h||_H = sup_{x in [0,1]} |h(x)|. In this case, it is known [15] that the metric entropy scales as log M(eps; H) ~ 1/eps. Compared to the previous example of linear models, note that the metric entropy grows much faster as eps -> 0, indicating that the class of Lipschitz functions is much richer.\n\n(iii) Consider the class of Sobolev spaces W^m for m >= 1, consisting of all functions that have m derivatives, with the m-th derivative bounded in L2(P) norm. 
In this case, it is known that log M(eps; H) ~ eps^{-1/m} (e.g., [3]). Clearly, increasing the smoothness index m leads to smaller classes. Such Sobolev spaces are a particular family of function classes whose packing/covering entropy grows at a rate polynomial in 1/eps.\n\nIn our analysis, we require that the metric entropy of B_H(1) satisfy the following technical condition:\n\nAssumption 1. Using log M(eps; H) to denote the packing entropy of the unit ball B_H(1) in the L2(P)-norm, we assume that there exists some alpha in (0, 1) such that\n\nlim_{eps -> 0} log M(alpha eps; H) / log M(eps; H) > 1.\n\nThe condition is required to ensure that log M(c eps) / log M(eps) can be made arbitrarily small or large, uniformly over small eps, by changing c, so that a bound due to Yang and Barron [17] can be applied. It is satisfied for most non-parametric classes, including (for instance) the Lipschitz and Sobolev classes defined in Examples (ii) and (iii) above. It may fail to hold for certain parametric classes, such as the set of linear functions considered in Example (i); however, we can use an alternative technique to derive bounds for the parametric case (see Corollary 2).\n\n3 Main result and some consequences\n\nIn this section, we state our main result and then develop some of its consequences. We begin with a theorem covering the function class F_0(s) in which the univariate function class H has metric entropy satisfying Assumption 1. We then state a corollary for the special case of univariate classes H with metric entropy growing polynomially in 1/eps, and a corollary for the special case of sparse linear regression.\n\nConsider the observation model (2), where the covariate vectors have i.i.d. elements X_j ~ P, and the regression function f* in F_0(s). 
Suppose that the univariate function class H underlying F_0(s) satisfies Assumption 1. Under these conditions, we have the following result:\n\nTheorem 1. Given n i.i.d. samples from the sparse additive model (2), the minimax risk in squared-L2(P) norm is lower bounded as\n\nmin_{fhat} max_{f* in F_0(s)} E ||fhat - f*||^2_{L2(P)} >= max[ sigma^2 s log(p/s) / (32 n) , (s/16) eps_n^2(H) ],    (6)\n\nwhere, for a fixed constant c, the quantity eps_n(H) = eps_n > 0 is the largest positive number satisfying the inequality\n\nn eps_n^2 / (2 sigma^2) <= log M(c eps_n).    (7)\n\nFor the case where H has an entropy growing to infinity at a polynomial rate as eps -> 0, say log M(eps; H) = Theta(eps^{-1/m}) for some m > 1/2, we can compute the rate for the s-dimensional estimation term explicitly.\n\nCorollary 1. For the sparse additive model (2) with univariate function space H such that log M(eps; H) = Theta(eps^{-1/m}), we have\n\nmin_{fhat} max_{f* in F_0(s)} E ||fhat - f*||^2_{L2(P)} >= max[ sigma^2 s log(p/s) / (32 n) , C s (sigma^2 / n)^{2m/(2m+1)} ]    (8)\n\nfor some constant C > 0.\n\n3.1 Some consequences\n\nIn this section, we discuss some consequences of our results.\n\nEffect of smoothness: Focusing on Corollary 1, for spaces with m bounded derivatives (i.e., functions in the Sobolev space W^m), the univariate minimax rate is n^{-2m/(2m+1)} (for details, see e.g. Stone [13]). Clearly, faster rates are obtained for larger smoothness indices m, and as m -> infinity, the rate approaches the parametric rate n^{-1}. Since we are estimating over an s-dimensional space (under the assumption of independence), we are effectively estimating s univariate functions, each lying within the function space H. 
Therefore the uni-dimensional rate is multiplied by s.\n\nSmoothness versus sparsity: It is worth noting that, depending on the relative scalings of s, n and p and on the metric entropy of H, either the subset selection term or the s-dimensional estimation term may dominate the lower bound. In general, if log(p/s)/n = o(eps_n^2(H)), the s-dimensional estimation term dominates, and vice versa (at the boundary, either term determines the minimax rate). In the case of a univariate function class H with polynomial entropy as in Corollary 1, it can be seen that for n = o((log(p/s))^{2m+1}) the subset selection term dominates, while for n = Omega((log(p/s))^{2m+1}) the s-dimensional estimation term dominates.\n\nRates for linear models: Using an alternative proof technique (not the one used in this paper), it is possible [11] to derive the exact minimax rate for estimation in the sparse linear regression model, in which we observe\n\ny^(i) = sum_{j in S} beta_j X^(i)_j + w^(i),  for i = 1, 2, ..., n.    (9)\n\nNote that this is a special case of the general model (2) in which H corresponds to the class of univariate linear functions (see Example (i)).\n\nCorollary 2. For the sparse linear regression model (9), the minimax rate scales as max( s log(p/s)/n , s/n ).\n\nIn this case, we see clearly that the subset selection term dominates as p -> infinity, meaning that the subset selection problem is always \"harder\" (in a statistical sense) than the s-dimensional estimation problem. As shown by Bickel et al. [1], the rate achieved by ell_1-regularized methods is s log(p)/n under suitable conditions on the covariates X.\n\nUpper bounds: To show that the lower bounds are tight, matching upper bounds need to be derived. Upper bounds (matching up to constant factors) can be derived via a classical information-theoretic approach (e.g., [5, 2]), which involves constructing an estimator based on a covering set and bounding the covering entropy of F_0(s). While this approach does not lead to an implementable algorithm, it is a simple theoretical device for demonstrating that lower bounds are tight. We turn our focus to implementable algorithms in the next point.\n\nComparison to existing bounds: We now provide a brief comparison of the minimax lower bounds with upper bounds on rates achieved by existing implementable algorithms from past work [12, 7, 9]. Ravikumar et al. [12] propose a back-fitting algorithm to minimize the least-squares objective with a sparsity constraint on the function f. The rates derived in Koltchinskii and Yuan [7] do not match the lower bounds derived in Theorem 1. Further, it is difficult to directly compare the rates in Ravikumar et al. [12] and Meier et al. [9] with our minimax lower bounds, since their analysis does not explicitly track the sparsity index s. We are currently in the process of conducting a thorough comparison with the above-mentioned ell_1-based methods.\n\n4 Proof outline\n\nIn this section, we provide an outline of the proof of Theorem 1; due to space constraints, we defer some of the technical details to the full-length version. The proof is based on a combination of information-theoretic techniques and the concepts of packing and covering entropy, as defined in Section 2.3. First, we provide a high-level overview of the proof. 
The basic idea is to carefully choose two subsets T_1 and T_2 of the function class F_0(s), and to lower bound the minimax rate over each of these two subsets. In Section 4.1, an application of the generalized Fano method (a technique based on Fano's inequality) to the set T_1 defined in equation (10) yields a lower bound corresponding to the subset selection term. In Section 4.2, we apply an alternative method to a second set T_2, defined in equation (11), that captures the difficulty of estimating the sum of s univariate functions. The second technique also exploits Fano's inequality, but uses a more refined upper bound on the mutual information developed by Yang and Barron [17].\n\nBefore proceeding, we first note that for any T subset F_0(s), we have\n\nmin_{fhat} max_{f* in F_0(s)} E ||fhat - f*||^2_{L2(P)} >= min_{fhat} max_{f* in T} E ||fhat - f*||^2_{L2(P)}.\n\nMoreover, for any subsets T_1, T_2 subset F_0(s), we have\n\nmin_{fhat} max_{f* in F_0(s)} E ||fhat - f*||^2_{L2(P)} >= max( min_{fhat} max_{f* in T_1} E ||fhat - f*||^2_{L2(P)} , min_{fhat} max_{f* in T_2} E ||fhat - f*||^2_{L2(P)} ),\n\nsince the bound holds for each of the two terms. We apply this lower bound with the subsets T_1 and T_2 defined in equations (10) and (11).\n\n4.1 Bounding the complexity of subset selection\n\nFor this part of the proof, we use the generalized Fano method [4], which we state below without proof. Given some parameter space, we let d be a metric on it.\n\nLemma 1 (Generalized Fano method). For a given integer r >= 2, consider a collection M_r = {P_1, ..., P_r} of r probability distributions such that\n\nd(theta(P_i), theta(P_j)) >= alpha_r  for all i != j,\n\nand the pairwise KL divergences satisfy\n\nD(P_i || P_j) <= beta_r  for all i, j = 1, ..., r.\n\nThen the minimax risk over the family is lower bounded as\n\nmax_j E_j d(theta(P_j), theta_hat) >= (alpha_r / 2) (1 - (beta_r + log 2) / log r).\n\nThe proof of Lemma 1 involves applying Fano's inequality over the discrete set of parameters theta in Theta indexed by the set of distributions M_r. We now construct the set T_1 that generates the set of probability distributions M_r. Let g be an arbitrary function in H such that ||g||_{L2(P)} = (sigma/4) sqrt(log(p/s)/n). The set T_1 is defined as\n\nT_1 := { f : f(X_1, X_2, ..., X_p) = sum_{j=1}^p c_j g(X_j), c_j in {-1, 0, 1}, ||c||_0 = s }.    (10)\n\nT_1 may be viewed as a hypercube inside F_0(s), and will lead to the lower bound for the subset selection term. This hypercube construction is often used to prove lower bounds (see Yu [18]). Next, we require a further reduction of the set T_1 to a set A (defined in Lemma 2) to ensure that the elements of A are well separated in L2(P) norm. The construction of A is as follows:\n\nLemma 2. There exists a subset A subset T_1 such that:\n(i) log |A| >= (1/2) s log(p/s),\n(ii) ||f - f'||^2_{L2(P)} >= sigma^2 s log(p/s) / (16 n) for all f, f' in A, and\n(iii) D(f || f') <= (1/8) s log(p/s) for all f, f' in A.\n\nThe proof uses a combinatorial argument to construct the set A; for an argument on how such a set is constructed, see Kühn [8]. For s log(p/s) >= 8 log 2, applying the generalized Fano method (Lemma 1) together with Lemma 2 yields the bound\n\nmin_{fhat} max_{f* in F_0(s)} E ||fhat - f*||^2_{L2(P)} >= min_{fhat} max_{f* in A} E ||fhat - f*||^2_{L2(P)} >= sigma^2 s log(p/s) / (32 n).\n\nThis completes the proof for the subset selection term s log(p/s)/n in Theorem 1.\n\n4.2 Bounding the complexity of s-dimensional estimation\n\nNext, we derive a bound for the s-dimensional estimation term by determining a lower bound over T_2. 
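Before turning to T_2, the arithmetic behind the subset-selection bound can be sanity-checked numerically. The sketch below uses constants of our own choosing (not from the paper) and plugs the quantities supplied by Lemma 2 into the bound of Lemma 1.

```python
import math

# Sketch (illustrative constants, not the paper's): plug Lemma 2's separation
# and KL diameter into the generalized Fano bound of Lemma 1,
#   bound = (alpha_r / 2) * (1 - (beta_r + log 2) / log r),
# with log r >= (1/2) s log(p/s), alpha_r = sigma^2 s log(p/s) / (16 n),
# beta_r = (1/8) s log(p/s).
sigma, s, p, n = 1.0, 10, 1000, 5000          # assumed problem sizes
slp = s * math.log(p / s)                     # shorthand for s log(p/s)
assert slp >= 8 * math.log(2)                 # condition used in the argument
log_r = 0.5 * slp
alpha_r = sigma**2 * slp / (16 * n)
beta_r = slp / 8
bound = (alpha_r / 2) * (1 - (beta_r + math.log(2)) / log_r)
# The bound is a positive constant multiple of sigma^2 * s log(p/s) / n.
print(bound / (sigma**2 * slp / n))
```

With these constants the mutual-information penalty (beta_r + log 2)/log r stays below 1, so the bound remains a positive fraction of sigma^2 s log(p/s)/n, matching the order claimed in Theorem 1.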
Let S be an arbitrary subset of {1, 2, ..., p} of cardinality s, and define the set F_S as\n\nT_2 := F_S := { f in F : f(X) = sum_{j in S} h_j(X_j) }.    (11)\n\nClearly F_S subset F_0(s), which means that\n\nmin_{fhat} max_{f* in F_0(s)} E ||fhat - f*||^2_{L2(P)} >= min_{fhat} max_{f* in F_S} E ||fhat - f*||^2_{L2(P)}.\n\nWe use a technique from Yang and Barron [17] to lower bound the minimax rate over F_S. The idea is to construct a maximal delta_n-packing set and a minimal eps_n-covering set for F_S, and then to apply Fano's inequality to a carefully chosen mixture distribution involving the covering and packing sets (see the full-length version for details). Following these steps yields the following result:\n\nLemma 3.\n\nmin_{fhat} max_{f* in F_S} E ||fhat - f*||^2_{L2(P)} >= (delta_n^2 / 4) (1 - (log N(eps_n; F_S) + n eps_n^2 / (2 sigma^2) + log 2) / log M(delta_n; F_S)).\n\nWe now have a bound involving the covering and packing entropies of the s-dimensional space F_S. The following lemma bounds log M(eps; F_S) and log N(eps; F_S) in terms of the univariate packing and covering entropies, respectively:\n\nLemma 4. Let H be a function space with a packing entropy log M(eps; H) that satisfies Assumption 1. Then we have the bounds\n\nlog M(eps; F_S) >= s log M(eps / sqrt(s); H),  and  log N(eps; F_S) <= s log N(eps / sqrt(s); H).\n\nThe proof involves constructing (eps / sqrt(s))-packing and covering sets in each of the s coordinates, and showing that these yield eps-packing and eps-covering sets in F_S (respectively). 
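To preview where this machinery leads, the following sketch (our constants; the entropy exponent and noise level are assumed for illustration) solves the critical-radius inequality (7), n eps^2 / (2 sigma^2) = log M(c eps), in closed form for the polynomial entropy log M(eps) = eps^{-1/m}, and checks the scaling eps_n^2 ~ n^{-2m/(2m+1)} that appears in Corollary 1.

```python
import numpy as np

# Illustrative computation (a sketch, not code from the paper): with the
# polynomial entropy log M(eps) = eps**(-1/m) of an m-th order Sobolev ball,
# the critical radius defined by n*eps**2/(2*sigma**2) = log M(c*eps) has the
# closed form eps = (2*sigma**2 * c**(-1/m) / n)**(m/(2m+1)),
# so eps_n**2 scales as n**(-2m/(2m+1)).
m, sigma, c = 2.0, 1.0, 0.5   # smoothness, noise level, constant (all assumed)

def eps_n(n):
    return (2 * sigma**2 * c**(-1 / m) / n) ** (m / (2 * m + 1))

ns = np.array([1e3, 1e4, 1e5, 1e6])
rates = eps_n(ns) ** 2
# Empirical log-log slope should match -2m/(2m+1), i.e. -0.8 for m = 2.
slope = np.polyfit(np.log(ns), np.log(rates), 1)[0]
print(slope)  # approximately -0.8
```

Larger smoothness m pushes the exponent 2m/(2m+1) toward 1, recovering the near-parametric behavior discussed in Section 3.1.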
Combining Lemmas 3 and 4 leads to the inequality\n\nmin_{fhat} max_{f* in F_S} E ||fhat - f*||^2_{L2(P)} >= (delta_n^2 / 4) (1 - (s log N(eps_n / sqrt(s); H) + n eps_n^2 / (2 sigma^2) + log 2) / (s log M(delta_n / sqrt(s); H))).    (12)\n\nNow we choose eps_n and delta_n to meet the following constraints:\n\nn eps_n^2 / (2 sigma^2) <= s log N(eps_n / sqrt(s); H),    (13a)\n\nand\n\n4 log N(eps_n / sqrt(s); H) <= log M(delta_n / sqrt(s); H).    (13b)\n\nCombining Assumption 1 with the well-known relations log M(2 eps; H) <= log N(eps; H) <= log M(eps; H), we conclude that in order to satisfy inequalities (13a) and (13b), it is sufficient to choose eps_n = c delta_n for a constant c, and then to require that s log M(c delta_n / sqrt(s); H) >= n delta_n^2 / (2 sigma^2). Furthermore, if we define delta~_n = delta_n / sqrt(s), then this inequality can be re-expressed as log M(c delta~_n) >= n delta~_n^2 / (2 sigma^2). For n eps_n^2 / (2 sigma^2) >= log 2, using inequalities (13a) and (13b) together with inequality (12) yields the desired rate\n\nmin_{fhat} max_{f* in F_S} E ||fhat - f*||^2_{L2(P)} >= s delta~_n^2 / 16,\n\nthereby completing the proof.\n\n5 Discussion\n\nIn this paper, we have derived lower bounds on the minimax risk in squared L2(P) error for estimating sparse additive models built from sums of univariate functions in a function class H. The rates show that the estimation problem effectively decomposes into a subset selection problem and an s-dimensional estimation problem, and the \"harder\" of the two problems (in a statistical sense) determines the rate of convergence. More concretely, we demonstrated that the subset selection term scales as s log(p/s)/n, depending linearly on the number of components s and only logarithmically on the ambient dimension p. 
This subset selection term is independent of the univariate function space H. On the other hand, the s-dimensional estimation term depends on the \"richness\" of the univariate function class, as measured by its metric entropy; it scales linearly with s and is independent of p. Ongoing work suggests that our lower bounds are tight in many cases, meaning that the rates derived in Theorem 1 are minimax-optimal for many function classes.\n\nThere are a number of ways in which this work can be extended. One implicit and strong assumption in our analysis was that the covariates X_j, j = 1, 2, ..., p, are independent. It would be interesting to investigate the case where the covariates are endowed with some correlation structure. One would expect the rates to change, particularly if many of the variables are collinear. It would also be interesting to develop a more complete understanding of whether computationally efficient algorithms [7, 12, 9] based on regularization achieve the lower bounds on the minimax rate derived in this paper.\n\nReferences\n\n[1] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of the Lasso and Dantzig selector. Annals of Statistics, 2009. To appear.\n[2] L. Birgé. Approximation dans les espaces métriques et théorie de l'estimation. Z. Wahrsch. verw. Gebiete, 65:181-327, 1983.\n[3] M. S. Birman and M. Z. Solomjak. Piecewise-polynomial approximations of functions of the classes W^alpha_p. Math. USSR-Sbornik, 2(3):295-317, 1967.\n[4] T. S. Han and S. Verdu. Generalizing the Fano inequality. IEEE Transactions on Information Theory, 40:1247-1251, 1994.\n[5] R. Z. Has'minskii. A lower bound on the risks of nonparametric estimates of densities in the uniform metric. Theory Prob. Appl., 23:794-798, 1978.\n[6] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman and Hall, Boca Raton, 1999.\n[7] V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In Proceedings of COLT, 2008.\n[8] T. Kühn. A lower estimate for entropy numbers. Journal of Approximation Theory, 110:120-124, 2001.\n[9] L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. Annals of Statistics, to appear.\n[10] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246-270, 2009.\n[11] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ell_q-balls. Technical Report arXiv:0910.2042, UC Berkeley, Department of Statistics, 2009.\n[12] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, to appear.\n[13] C. J. Stone. Optimal global rates of convergence for nonparametric regression. Annals of Statistics, 10:1040-1053, 1982.\n[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.\n[15] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.\n[16] M. J. Wainwright. Information-theoretic bounds for sparsity recovery in the high-dimensional and noisy setting. IEEE Transactions on Information Theory, December 2009. Presented at the International Symposium on Information Theory, June 2007.\n[17] Y. Yang and A. Barron. Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27(5):1564-1599, 1999.\n[18] B. Yu. Assouad, Fano and Le Cam. In Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam, pages 423-435, 1996.\n[19] C. H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567-1594, 2008.", "award": [], "sourceid": 1088, "authors": [{"given_name": "Garvesh", "family_name": "Raskutti", "institution": null}, {"given_name": "Bin", "family_name": "Yu", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}]}