{"title": "SpAM: Sparse Additive Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1201, "page_last": 1208, "abstract": null, "full_text": "SpAM: Sparse Additive Models

Pradeep Ravikumar†  Han Liu†‡  John Lafferty∗†  Larry Wasserman‡

†Machine Learning Department  ‡Department of Statistics  ∗Computer Science Department
Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given, together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models to high-dimensional data.

1 Introduction

Substantial progress has been made recently on the problem of fitting high-dimensional linear regression models of the form Y_i = X_i^T \beta + \epsilon_i, for i = 1, ..., n. Here Y_i is a real-valued response, X_i is a p-dimensional predictor, and \epsilon_i is a mean-zero error term. Finding an estimate of \beta when p > n that is both statistically well-behaved and computationally efficient has proved challenging; however, the lasso estimator (Tibshirani, 1996) has been remarkably successful. The lasso estimator \hat\beta minimizes the \ell_1-penalized sum of squares

    \sum_i (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|    (1)

with the \ell_1 penalty \|\beta\|_1 encouraging sparse solutions, in which many components \hat\beta_j are zero. 
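The \ell_1-penalized objective (1) can be minimized by coordinate-wise soft-thresholding; the following minimal NumPy sketch (the synthetic data and the choice \lambda = 0.1 are illustrative, not from the paper) shows how the penalty zeros out irrelevant coefficients:

```python
import numpy as np

def soft_threshold(z, t):
    # Soft-thresholding operator: sign(z) * max(|z| - t, 0).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    # Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            rho = X[:, j] @ r_j / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)
b_hat = lasso_cd(X, y, lam=0.1)   # coordinates 1, 2, 4 should be driven to zero
```

The same coordinate-wise structure reappears below in the backfitting algorithm for SpAM, with the scalar soft-thresholding replaced by a functional analogue.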
The good empirical success of this estimator has recently been backed up by results confirming that it has strong theoretical properties; see (Greenshtein and Ritov, 2004; Zhao and Yu, 2007; Meinshausen and Yu, 2006; Wainwright, 2006).

The nonparametric regression model Y_i = m(X_i) + \epsilon_i, where m is a general smooth function, relaxes the strong assumptions made by a linear model, but is much more challenging in high dimensions. Hastie and Tibshirani (1999) introduced the class of additive models of the form

    Y_i = \sum_{j=1}^p m_j(X_{ij}) + \epsilon_i    (2)

which is less general, but can be more interpretable and easier to fit; in particular, an additive model can be estimated using a coordinate descent Gauss-Seidel procedure called backfitting. An extension of the additive model is the functional ANOVA model

    Y_i = \sum_{1 \le j \le p} m_j(X_{ij}) + \sum_{j<k} m_{j,k}(X_{ij}, X_{ik}) + \sum_{j<k<\ell} m_{j,k,\ell}(X_{ij}, X_{ik}, X_{i\ell}) + \cdots + \epsilon_i    (3)

which allows interactions among the variables. Unfortunately, additive models only have good statistical and computational behavior when the number of variables p is not large relative to the sample size n.

In this paper we introduce sparse additive models (SpAM), which extend the advantages of sparse linear models to the additive, nonparametric setting. The underlying model is the same as in (2), but constraints are placed on the component functions {m_j}_{1 \le j \le p} to simultaneously encourage smoothness of each component and sparsity across components; the penalty is similar to that used by the COSSO of Lin and Zhang (2006). The SpAM estimation procedure we introduce allows the use of arbitrary nonparametric smoothing techniques and, in the case where the underlying component functions are linear, reduces to the lasso. It naturally extends to classification problems using generalized additive models. 
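The Gauss-Seidel backfitting procedure for the additive model (2) can be sketched in a few lines; this is a minimal illustration (a low-degree polynomial fit stands in for a generic smoother, and the data are synthetic):

```python
import numpy as np

def poly_smooth(x, r, degree=3):
    # Fit a low-degree polynomial of x to the partial residual r;
    # a simple stand-in for an arbitrary univariate smoother S_j.
    return np.polyval(np.polyfit(x, r, degree), x)

def backfit(X, y, n_iter=20):
    # Gauss-Seidel backfitting for y = sum_j m_j(x_j) + noise, as in (2).
    n, p = X.shape
    f = np.zeros((n, p))
    mean_y = y.mean()
    for _ in range(n_iter):
        for j in range(p):
            r = y - mean_y - f.sum(axis=1) + f[:, j]  # j-th partial residual
            f[:, j] = poly_smooth(X[:, j], r)
            f[:, j] -= f[:, j].mean()                 # center for identifiability
    return mean_y, f

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(-1, 1, size=(n, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=n)
mu, f = backfit(X, y)
fit = mu + f.sum(axis=1)
mse = np.mean((fit - y) ** 2)
```

Note that plain backfitting fits every coordinate, including the irrelevant third one; the sparsity mechanism introduced next is what zeroes out entire components.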
The main results of the paper are (i) the formulation of a convex optimization problem for estimating a sparse additive model, (ii) an efficient backfitting algorithm for constructing the estimator, (iii) simulations showing that the estimator has excellent behavior on synthetic and real data, even when p is large, and (iv) a statistical analysis of the theoretical properties of the estimator that supports its good empirical performance.

2 The SpAM Optimization Problem

In this section we describe the key idea underlying SpAM. We first present a population version of the procedure that intuitively suggests how sparsity is achieved. We then present an equivalent convex optimization problem. In the following section we derive a backfitting procedure for solving this optimization problem in the finite sample setting.

To motivate our approach, we first consider a formulation that scales each component function g_j by a scalar \beta_j, and then imposes an \ell_1 constraint on \beta = (\beta_1, ..., \beta_p)^T. For j \in {1, ..., p}, let H_j denote the Hilbert space of measurable functions f_j(x_j) of the single scalar variable x_j, such that E(f_j(X_j)) = 0 and E(f_j(X_j)^2) < \infty, furnished with the inner product

    \langle f_j, f'_j \rangle = E( f_j(X_j) f'_j(X_j) ).    (4)

Let H_add = H_1 + H_2 + \cdots + H_p denote the Hilbert space of functions of (x_1, ..., x_p) that have an additive form: f(x) = \sum_j f_j(x_j). The standard additive model optimization problem, in the population setting, is

    \min_{f_j \in H_j, 1 \le j \le p}  E( Y - \sum_{j=1}^p f_j(X_j) )^2    (5)

where m(x) = E(Y | X = x) is the unknown regression function. 
Now consider the following modification of this problem that imposes additional constraints:

(P)    \min_{\beta \in R^p, g_j \in H_j}  E( Y - \sum_{j=1}^p \beta_j g_j(X_j) )^2    (6a)
       subject to  \sum_{j=1}^p |\beta_j| \le L    (6b)
                   E(g_j^2) = 1,  j = 1, ..., p    (6c)
                   E(g_j) = 0,  j = 1, ..., p    (6d)

noting that g_j is a function while \beta is a vector. Intuitively, the constraint that \beta lies in the \ell_1-ball {\beta : \|\beta\|_1 \le L} encourages sparsity of the estimated \beta, just as for the parametric lasso. When \beta is sparse, the estimated additive function f(x) = \sum_{j=1}^p f_j(x_j) = \sum_{j=1}^p \beta_j g_j(x_j) will also be sparse, meaning that many of the component functions f_j(·) = \beta_j g_j(·) are identically zero. The constraints (6c) and (6d) are imposed for identifiability; without (6c), for example, one could always satisfy (6b) by rescaling.

While this optimization problem makes plain the role of \ell_1 regularization of \beta in achieving sparsity, it has the unfortunate drawback of not being convex. More specifically, while the optimization problem is convex in \beta and {g_j} separately, it is not convex in \beta and {g_j} jointly.

However, consider the following related optimization problem:

(Q)    \min_{f_j \in H_j}  E( Y - \sum_{j=1}^p f_j(X_j) )^2    (7a)
       subject to  \sum_{j=1}^p \sqrt{E(f_j^2(X_j))} \le L    (7b)
                   E(f_j) = 0,  j = 1, ..., p.    (7c)

This problem is convex in {f_j}. 
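Convexity of the sample analogue of (Q) is easy to check numerically: the squared error is convex in the component vectors, and the sum of per-component \ell_2 norms is a norm, hence convex. A small sketch (the data, \lambda, and dimensions are illustrative) verifies the midpoint inequality on random pairs:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 4, 0.7
y = rng.normal(size=n)

def objective(F):
    # Sample analogue of the Lagrangian form of (Q):
    # squared error plus a sum of per-component L2 norms.
    resid = y - F.sum(axis=1)
    return 0.5 * np.mean(resid ** 2) + lam * sum(
        np.sqrt(np.mean(F[:, j] ** 2)) for j in range(F.shape[1]))

# Midpoint convexity check on random pairs of component matrices:
# J((F+G)/2) <= (J(F) + J(G))/2 should always hold.
violations = 0
for _ in range(200):
    F = rng.normal(size=(n, p))
    G = rng.normal(size=(n, p))
    if objective(0.5 * (F + G)) > 0.5 * (objective(F) + objective(G)) + 1e-9:
        violations += 1
```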
Moreover, the solutions to problems (P) and (Q) are equivalent:

    ({\beta^*_j}, {g^*_j}) optimizes (P)  implies  {f^*_j = \beta^*_j g^*_j} optimizes (Q);
    {f^*_j} optimizes (Q)  implies  ({\beta^*_j = \|f^*_j\|_2}, {g^*_j = f^*_j / \|f^*_j\|_2}) optimizes (P).

While optimization problem (Q) has the important virtue of being convex, the way it encourages sparsity is not intuitive; the following observation provides some insight. Consider the set C \subset R^4 defined by

    C = { (f_{11}, f_{12}, f_{21}, f_{22})^T \in R^4 : \sqrt{f_{11}^2 + f_{12}^2} + \sqrt{f_{21}^2 + f_{22}^2} \le L }.

Then the projection \pi_{12} C onto the first two components is an \ell_2 ball. However, the projection \pi_{13} C onto the first and third components is an \ell_1 ball. In this way, it can be seen that the constraint \sum_j \|f_j\|_2 \le L acts as an \ell_1 constraint across components to encourage sparsity, while it acts as an \ell_2 constraint within components to encourage smoothness, as in a ridge regression penalty. It is thus crucial that the norm \|f_j\|_2 appears in the constraint, and not its square \|f_j\|_2^2. For the purposes of sparsity, this constraint could be replaced by \sum_j \|f_j\|_q \le L for any q \ge 1. In case each f_j is linear, (f_j(x_{1j}), ..., f_j(x_{nj})) = \beta_j (x_{1j}, ..., x_{nj}), the optimization problem reduces to the lasso.

The use of scaling coefficients together with a nonnegative garrote penalty, similar to our problem (P), is considered by Yuan (2007). However, the component functions g_j are fixed, so that the procedure is not asymptotically consistent. 
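The geometry of the set C above can be checked by sampling (L = 1 here is an arbitrary illustrative choice): every point of C projects into an \ell_2 ball on coordinates (f_11, f_12) and into an \ell_1 ball on coordinates (f_11, f_21), and the extreme points of both balls are attained:

```python
import numpy as np

L = 1.0
rng = np.random.default_rng(3)

# Rejection-sample points of C = { f in R^4 :
#   sqrt(f11^2 + f12^2) + sqrt(f21^2 + f22^2) <= L }.
pts = rng.uniform(-L, L, size=(20000, 4))
r12 = np.hypot(pts[:, 0], pts[:, 1])
r34 = np.hypot(pts[:, 2], pts[:, 3])
inside = pts[r12 + r34 <= L]

# Projection onto (f11, f12) lies in an l2 ball of radius L ...
l2_max = np.hypot(inside[:, 0], inside[:, 1]).max()
# ... while projection onto (f11, f21) lies in an l1 ball of radius L.
l1_max = (np.abs(inside[:, 0]) + np.abs(inside[:, 2])).max()

# Boundary points of both balls are reachable from within C.
on_l2_sphere = np.hypot(L, 0) + np.hypot(0, 0) <= L        # (L, 0, 0, 0)
on_l1_sphere = np.hypot(L/2, 0) + np.hypot(L/2, 0) <= L    # (L/2, 0, L/2, 0)
```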
The form of the optimization problem (Q) is similar to that of the COSSO for smoothing spline ANOVA models (Lin and Zhang, 2006); however, our method differs significantly from the COSSO, as discussed below. In particular, our method is scalable and easy to implement even when p is much larger than n.

3 A Backfitting Algorithm for SpAM

We now derive a coordinate descent algorithm for fitting a sparse additive model. We assume that we observe Y = m(X) + \epsilon, where \epsilon is mean-zero Gaussian noise. We write the Lagrangian for the optimization problem (Q) as

    L(f, \lambda, \mu) = (1/2) E( Y - \sum_{j=1}^p f_j(X_j) )^2 + \lambda \sum_{j=1}^p \sqrt{E(f_j^2(X_j))} + \sum_j \mu_j E(f_j).    (8)

Let R_j = Y - \sum_{k \ne j} f_k(X_k) be the jth residual. The stationary condition for minimizing L as a function of f_j, holding the other components f_k fixed for k \ne j, is expressed in terms of the Frechet derivative \delta L as

    \delta L(f, \lambda, \mu; \delta f_j) = E[ (f_j - R_j + \lambda v_j) \delta f_j ] = 0    (9)

for any \delta f_j \in H_j satisfying E(\delta f_j) = 0, where v_j \in \partial \sqrt{E(f_j^2)} is an element of the subgradient, satisfying \sqrt{E(v_j^2)} \le 1 and v_j = f_j / \sqrt{E(f_j^2)} if E(f_j^2) \ne 0. Therefore, conditioning on X_j, the stationary condition (9) implies

    f_j + \lambda v_j = E(R_j | X_j).    (10)

Letting P_j = E[R_j | X_j] denote the projection of the residual onto H_j, the solution satisfies

    ( 1 + \lambda / \sqrt{E(f_j^2)} ) f_j = P_j    if \sqrt{E(P_j^2)} > \lambda    (11)

and f_j = 0 otherwise. Condition (11), in turn, implies

    ( 1 + \lambda / \sqrt{E(f_j^2)} ) \sqrt{E(f_j^2)} = \sqrt{E(P_j^2)},  or  \sqrt{E(f_j^2)} = \sqrt{E(P_j^2)} - \lambda.    (12)

Thus, we arrive at the following multiplicative soft-thresholding update for f_j:

    f_j = [ 1 - \lambda / \sqrt{E(P_j^2)} ]_+ P_j    (13)

where [·]_+ denotes the positive part. In the finite sample case, as in standard backfitting (Hastie and Tibshirani, 1999), we estimate the projection E[R_j | X_j] by a smooth of the residuals:

    \hat{P}_j = S_j R_j    (14)

where S_j is a linear smoother, such as a local linear or kernel smoother. Let \hat{s}_j be an estimate of \sqrt{E[P_j^2]}. A simple but biased estimate is

    \hat{s}_j = (1/\sqrt{n}) \|\hat{P}_j\|_2 = \sqrt{mean(\hat{P}_j^2)}.    (15)

More accurate estimators are possible; an example is given in the appendix. We have thus derived the SpAM backfitting algorithm given in Figure 1.

Figure 1: THE SPAM BACKFITTING ALGORITHM

Input: Data (X_i, Y_i), regularization parameter \lambda.
Initialize f_j = f_j^{(0)}, for j = 1, ..., p.
Iterate until convergence:
  For each j = 1, ..., p:
    Compute the residual: R_j = Y - \sum_{k \ne j} f_k(X_k);
    Estimate the projection P_j = E[R_j | X_j] by smoothing: \hat{P}_j = S_j R_j;
    Estimate the norm s_j = \sqrt{E[P_j^2]} using, for example, (15);
    Soft-threshold: f_j = [1 - \lambda / \hat{s}_j]_+ \hat{P}_j;
    Center: f_j \leftarrow f_j - mean(f_j).
Output: Component functions f_j and estimator \hat{m}(X_i) = \sum_j f_j(X_{ij}).

While the motivating optimization problem (Q) is similar to that considered in the COSSO (Lin and Zhang, 2006) for smoothing splines, the SpAM backfitting algorithm decouples smoothing and sparsity, through a combination of soft-thresholding and smoothing. In particular, SpAM backfitting can be carried out with any nonparametric smoother; it is not restricted to splines. 
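The backfitting loop of Figure 1 can be sketched directly in NumPy. This is a minimal illustration, not the paper's implementation: a polynomial fit stands in for the kernel smoother S_j, the norm estimate is the plug-in (15), and the synthetic data and \lambda = 0.1 are arbitrary choices:

```python
import numpy as np

def smooth(x, r, degree=3):
    # A simple polynomial smoother standing in for S_j; any linear
    # smoother (kernel, local linear, spline) could be used instead.
    return np.polyval(np.polyfit(x, r, degree), x)

def spam_backfit(X, y, lam, n_iter=30):
    # Sketch of the SpAM backfitting algorithm of Figure 1.
    n, p = X.shape
    f = np.zeros((n, p))
    y_c = y - y.mean()
    for _ in range(n_iter):
        for j in range(p):
            r = y_c - f.sum(axis=1) + f[:, j]     # residual R_j
            P = smooth(X[:, j], r)                # P_j-hat = S_j R_j
            s = np.sqrt(np.mean(P ** 2))          # plug-in norm estimate (15)
            f[:, j] = 0.0 if s <= lam else (1 - lam / s) * P   # soft-threshold
            f[:, j] -= f[:, j].mean()             # center
    return f

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.uniform(-1, 1, size=(n, p))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)
f = spam_backfit(X, y, lam=0.1)
norms = np.sqrt(np.mean(f ** 2, axis=0))   # empirical L2 norm per component
```

In contrast to plain backfitting, the soft-thresholding step sets the eight irrelevant components identically to zero while keeping the two relevant ones.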
Moreover, by iteratively estimating over the components and using soft thresholding, our procedure is simple to implement and scales to high dimensions.

3.1 SpAM for Nonparametric Logistic Regression

The SpAM backfitting procedure can be extended to nonparametric logistic regression for classification. The additive logistic model is

    P(Y = 1 | X) \equiv p(X; f) = \exp( \sum_{j=1}^p f_j(X_j) ) / ( 1 + \exp( \sum_{j=1}^p f_j(X_j) ) )    (16)

where Y \in {0, 1}, and the population log-likelihood is \ell(f) = E[ Y f(X) - \log(1 + \exp f(X)) ]. Recall that in the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1999) in the logistic case, one runs the backfitting procedure within Newton's method. Here one iteratively computes the transformed response for the current estimate f_0

    Z_i = f_0(X_i) + ( Y_i - p(X_i; f_0) ) / ( p(X_i; f_0)(1 - p(X_i; f_0)) )    (17)

and weights w(X_i) = p(X_i; f_0)(1 - p(X_i; f_0)), and carries out a weighted backfitting of (Z, X) with weights w. The weighted smooth is given by

    \hat{P}_j = S_j(w R_j) / (S_j w).    (18)

To incorporate the sparsity penalty, we first note that the Lagrangian is given by

    L(f, \lambda, \mu) = E[ \log(1 + \exp f(X)) - Y f(X) ] + \lambda \sum_{j=1}^p \sqrt{E(f_j^2(X_j))} + \sum_j \mu_j E(f_j)    (19)

and the stationary condition for component function f_j is E( p - Y | X_j ) + \lambda v_j = 0, where v_j is an element of the subgradient \partial \sqrt{E(f_j^2)}. As in the unregularized case, this condition is nonlinear in f, and so we linearize the gradient of the log-likelihood around f_0. This yields the linearized condition E[ w(X)(f(X) - Z) | X_j ] + \lambda v_j = 0. 
When E(f_j^2) \ne 0, this implies the condition

    ( E(w | X_j) + \lambda / \sqrt{E(f_j^2)} ) f_j(X_j) = E(w R_j | X_j).    (20)

In the finite sample case, in terms of the smoothing matrix S_j, this becomes

    f_j = S_j(w R_j) / ( S_j w + \lambda / \sqrt{E(f_j^2)} ).    (21)

If \|S_j(w R_j)\|_2 < \lambda, then f_j = 0. Otherwise, this implicit, nonlinear equation for f_j cannot be solved explicitly, so we propose to iterate until convergence:

    f_j \leftarrow S_j(w R_j) / ( S_j w + \lambda \sqrt{n} / \|f_j\|_2 ).    (22)

When \lambda = 0, this yields the standard local scoring update (18). An example of logistic SpAM is given in Section 5.

4 Properties of SpAM

4.1 SpAM is Persistent

The notion of risk consistency, or persistence, was studied by Juditsky and Nemirovski (2000) and Greenshtein and Ritov (2004) in the context of linear models. Let (X, Y) denote a new pair (independent of the observed data) and define the predictive risk when predicting Y with f(X) by

    R(f) = E(Y - f(X))^2.    (23)

Since we consider predictors of the form f(x) = \sum_j \beta_j g_j(x_j), we also write the risk as R(\beta, g), where \beta = (\beta_1, ..., \beta_p) and g = (g_1, ..., g_p). Following Greenshtein and Ritov (2004), we say that an estimator \hat{m}_n is persistent relative to a class of functions M_n if

    R(\hat{m}_n) - R(m^*_n) \to 0 in probability,    (24)

where m^*_n = argmin_{f \in M_n} R(f) is the predictive oracle. Greenshtein and Ritov (2004) showed that the lasso is persistent for the class of linear models M_n = { f(x) = x^T \beta : \|\beta\|_1 \le L_n } if L_n = o((n / \log n)^{1/4}). We show a similar result for SpAM.

Theorem 4.1. Suppose that p_n \le e^{n\xi} for some \xi < 1. 
Then SpAM is persistent relative to the class of additive models M_n = { f(x) = \sum_{j=1}^p \beta_j g_j(x_j) : \|\beta\|_1 \le L_n } if L_n = o( n^{(1-\xi)/4} ).

4.2 SpAM is Sparsistent

In the case of linear regression, with m_j(X_j) = \beta_j^T X_j, Wainwright (2006) shows that under certain conditions on n, p, s = |supp(\beta)|, and the design matrix X, the lasso recovers the sparsity pattern asymptotically; that is, the lasso estimator \hat\beta_n is sparsistent: P( supp(\beta) = supp(\hat\beta_n) ) \to 1. We show a similar result for SpAM with the sparse backfitting procedure.

For the purpose of analysis, we use orthogonal function regression as the smoothing procedure. For each j = 1, ..., p let \psi_j be an orthogonal basis for H_j. We truncate the basis to finite dimension d_n, and let d_n \to \infty such that d_n / n \to 0. Let \Psi_j denote the n \times d_n matrix \Psi_j(i, k) = \psi_{jk}(X_{ij}). If A \subset {1, ..., p}, we denote by \Psi_A the n \times d_n|A| matrix where for each j \in A, \Psi_j appears as a submatrix in the natural way. The SpAM optimization problem can then be written as

    \min_\beta  (1/2n) \| Y - \sum_{j=1}^p \Psi_j \beta_j \|_2^2 + \lambda_n \sum_{j=1}^p \sqrt{ (1/n) \beta_j^T \Psi_j^T \Psi_j \beta_j }    (25)

where each \beta_j is a d_n-dimensional vector. Let S denote the true set of variables { j : m_j \ne 0 }, with s = |S|, and let S^c denote its complement. 
Let \hat{S}_n = { j : \hat\beta_j \ne 0 } denote the estimated set of variables from the minimizer \hat\beta_n of (25).

Theorem 4.2. Suppose that \Psi satisfies the conditions

    \Lambda_max( (1/n) \Psi_S^T \Psi_S ) \le C_max < \infty  and  \Lambda_min( (1/n) \Psi_S^T \Psi_S ) \ge C_min > 0    (26)

    \| ( (1/n) \Psi_{S^c}^T \Psi_S ) ( (1/n) \Psi_S^T \Psi_S )^{-1} \|_2 \le \sqrt{ C_min / C_max } \cdot (1 - \delta) / \sqrt{s},  for some 0 < \delta \le 1.    (27)

Let the regularization parameter \lambda_n \to 0 be chosen to satisfy

    \lambda_n \sqrt{s d_n} \to 0,    s / (d_n \lambda_n) \to 0,    and    d_n( \log d_n + \log(p - s) ) / (n \lambda_n^2) \to 0.    (28)

Then SpAM is sparsistent: P( \hat{S}_n = S ) \to 1.

5 Experiments

In this section we present experimental results for SpAM applied to both synthetic and real data, including regression and classification examples that illustrate the behavior of the algorithm in various conditions. We first use simulated data to investigate the performance of the SpAM backfitting algorithm, where the true sparsity pattern is known. We then apply SpAM to some real data. If not explicitly stated otherwise, the data are always rescaled to lie in a d-dimensional cube [0, 1]^d, and a kernel smoother with Gaussian kernel is used. To tune the penalization parameter \lambda, we use a C_p statistic, defined as

    C_p(\hat{f}) = (1/n) \sum_{i=1}^n ( Y_i - \sum_{j=1}^p \hat{f}_j(X_{ij}) )^2 + (2\hat\sigma^2 / n) \sum_{j=1}^p trace(S_j) 1[\hat{f}_j \ne 0]    (29)

where S_j is the smoothing matrix for the j-th dimension and \hat\sigma^2 is the estimated variance.

5.1 Simulations

We first apply SpAM to an example from (Härdle et al., 2004). A dataset with sample size n = 150 is generated from the following 200-dimensional additive model:

    Y_i = f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_4(x_{i4}) + \epsilon_i    (30)

    f_1(x) = -2\sin(2x),   f_2(x) = x^2 - 1/3,   f_3(x) = x - 1/2,   f_4(x) = e^{-x} + e^{-1} - 1,    (31)

and f_j(x) = 0 for j \ge 5, with noise \epsilon_i \sim N(0, 1). These data therefore have 196 irrelevant dimensions. The results of applying SpAM with the plug-in bandwidths are summarized in Figure 2.

[Figure 2 plots omitted.]

Figure 2: (Simulated data) Upper left: The empirical \ell_2 norm of the estimated components plotted against the tuning parameter \lambda; the value on the x-axis is proportional to \sum_j \|\hat{f}_j\|_2. Upper center: The C_p scores against the tuning parameter \lambda; the dashed vertical line corresponds to the value of \lambda with the smallest C_p score. Upper right: The proportion of 200 trials in which the correct relevant variables are selected, as a function of sample size n. 
Lower (from left to right): Estimated (solid lines) versus true additive component functions (dashed lines) for the first 6 dimensions; the remaining components are zero.

5.2 Boston Housing

The Boston housing data was collected to study house values in the suburbs of Boston; there are altogether 506 observations with 10 covariates. The dataset has been studied by many other authors (Härdle et al., 2004; Lin and Zhang, 2006), with various transformations proposed for different covariates. To explore the sparsistency properties of our method, we add 20 irrelevant variables. Ten of them are randomly drawn from Uniform(0, 1); the remaining ten are a random permutation of the original ten covariates, so that they have the same empirical densities.

The full model (containing all 10 chosen covariates) for the Boston housing data is

    medv = \alpha + f_1(crim) + f_2(indus) + f_3(nox) + f_4(rm) + f_5(age) + f_6(dis) + f_7(tax) + f_8(ptratio) + f_9(b) + f_10(lstat).    (32)

The result of applying SpAM to this 30-dimensional dataset is shown in Figure 3. SpAM identifies 6 nonzero components. It correctly zeros out both types of irrelevant variables. From the full solution path, the important variables are seen to be rm, lstat, ptratio, and crim. The importance of variables nox and b is borderline. These results are basically consistent with those obtained by other authors (Härdle et al., 2004). However, using C_p as the selection criterion, the variables indus, age, dis, and tax are estimated to be irrelevant, a result not seen in other studies.

5.3 SpAM for Spam

Here we consider an email spam classification problem, using the logistic SpAM backfitting algorithm from Section 3.1. This dataset has been studied by Hastie et al. 
(2001), using a set of 3,065 emails as a training set, and conducting hypothesis tests to choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all numeric. The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters. To demonstrate how SpAM performs with sparse data, we sample only n = 300 emails as the training set, with the remaining 4,301 data points used as the test set. We also use the test data as the hold-out set to tune the penalization parameter \lambda. The results of a typical run of logistic SpAM are summarized in Figure 4, using plug-in bandwidths.

[Figure 3 plots omitted.]

Figure 3: (Boston housing) Left: The empirical \ell_2 norm of the estimated components versus the regularization parameter \lambda. Center: The C_p scores against \lambda; the dashed vertical line corresponds to the best C_p score. Right: Additive fits for four relevant variables.

Figure 4: (Email spam) Classification accuracies and variable selection for logistic SpAM.

    \lambda (x 10^-3)   ERROR        # ZEROS   SELECTED VARIABLES
    5.5                 0.2009       55        { 8, 54 }
    5.0                 0.1725       51        { 8, 9, 27, 53, 54, 57 }
    4.5                 0.1354       46        { 7, 8, 9, 17, 18, 27, 53, 54, 57, 58 }
    4.0                 0.1083 (*)   20        { 4, 6-10, 14-22, 26, 27, 38, 53-58 }
    3.5                 0.1117       0         ALL
    3.0                 0.1174       0         ALL
    2.5                 0.1251       0         ALL
    2.0                 0.1259       0         ALL

6 Acknowledgments

This research was supported in part by NSF grant CCF-0625879 and a Siebel Scholarship to PR.

References

GREENSHTEIN, E. and RITOV, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli 10 971-988.

HÄRDLE, W., MÜLLER, M., SPERLICH, S. and WERWATZ, A. (2004). Nonparametric and Semiparametric Models. Springer-Verlag Inc.

HASTIE, T. and TIBSHIRANI, R. (1999). Generalized Additive Models. Chapman & Hall Ltd.

HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.

JUDITSKY, A. and NEMIROVSKI, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681-712.

LIN, Y. and ZHANG, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272-2297.

MEINSHAUSEN, N. and YU, B. (2006). Lasso-type recovery of sparse representations for high-dimensional data. Tech. Rep. 720, Department of Statistics, UC Berkeley.

TIBSHIRANI, R. (1996). 
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58 267-288.

WAINWRIGHT, M. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Tech. Rep. 709, Department of Statistics, UC Berkeley.

YUAN, M. (2007). Nonnegative garrote component selection in functional ANOVA models. In Proceedings of AI and Statistics, AISTATS.

ZHAO, P. and YU, B. (2007). On model selection consistency of lasso. J. Mach. Learn. Res. 7 2541-2567.
", "award": [], "sourceid": 415, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Larry", "family_name": "Wasserman", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}