{"title": "Nonparametric Greedy Algorithms for the Sparse Learning Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 1141, "page_last": 1149, "abstract": "This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate regression, we propose an algorithm called generalized forward regression. Both of them simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-of-the-art competitors, including the LASSO, a nonparametric version of the LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called the Foba. Some theoretical justifications are also provided.", "full_text": "Nonparametric Greedy Algorithms for the Sparse\n\nLearning Problem\n\nHan Liu and Xi Chen\n\nSchool of Computer Science\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nThis paper studies the forward greedy strategy in sparse nonparametric regres-\nsion. For additive models, we propose an algorithm called additive forward re-\ngression; for general multivariate models, we propose an algorithm called gen-\neralized forward regression. Both algorithms simultaneously conduct estimation\nand variable selection in nonparametric settings for the high dimensional sparse\nlearning problem. Our main emphasis is empirical: on both simulated and real\ndata, these two simple greedy methods can clearly outperform several state-of-\nthe-art competitors, including LASSO, a nonparametric version of LASSO called\nthe sparse additive model (SpAM) and a recently proposed adaptive parametric\nforward-backward algorithm called Foba. We also provide some theoretical justi-\n\ufb01cations of speci\ufb01c versions of the additive forward regression.\n\n1\n\nIntroduction\n\nThe linear model is a mainstay of statistical inference. At present, there are two major approaches\nto \ufb01t sparse linear models: convex regularization and greedy pursuit. The convex regularization\napproach regularizes the model by adding a sparsity constraint, leading to methods like LASSO\n[19, 7] or the Dantzig selector [6]. The greedy pursuit approach regularizes the model by iteratively\nselecting the current optimal approximation according to some criteria, leading to methods like the\nmatching pursuit [14] or orthogonal matching pursuit (OMP) [20].\nSubstantial progress has been made recently on applying the convex regularization idea to \ufb01t sparse\nadditive models. For splines, Lin and Zhang [12] propose a method called COSSO, which uses\nthe sum of reproducing kernel Hilbert space norms as a sparsity inducing penalty, and can simul-\ntaneously conduct estimation and variable selection; Ravikumar et al. [17, 16] develop a method\ncalled SpAM. The population version of SpAM can be viewed as a least squares problem penalized\nby the sum of L2(P )-norms; Meier et al. [15] develop a similar method using a different sparsity-\nsmoothness penalty, which guarantees the solution to be a spline. All these methods can be viewed\nas different nonparametric variants of LASSO. They have similar drawbacks: (i) it is hard to extend\nthem to handle general multivariate regression where the mean functions are no longer additive; (ii)\ndue to the large bias induced by the regularization penalty, the model estimation is suboptimal. One\nway to avoid this is to resort to two-stage procedures as in [13], but the method becomes less robust\ndue to the inclusion of an extra tuning parameter in the \ufb01rst stage.\nIn contrast to the convex regularization methods, the greedy pursuit approaches do not suffer from\nsuch problems. Instead of trying to formulate the whole learning task to be a global convex opti-\nmization, the greedy pursuit approaches adopt iterative algorithms with a local view. During each\niteration, only a small number of variables are actually involved in the model \ufb01tting so that the whole\ninference only involves low dimensional models. Thus they naturally extend to the general multi-\nvariate regression and do not induce large estimation bias, which makes them especially suitable for\nhigh dimensional nonparametric inference. However, the greedy pursuit approaches do not attract as\n\n1\n\n\fmuch attention as the convex regularization approaches in the nonparametric literature. For additive\nmodels, the only work we know of are the sparse boosting [4] and multivariate adaptive regression\nsplines (MARS) [9]. These methods mainly target on additive models or lower-order functional\nANOVA models, but without much theoretical analysis. For general multivariate regression, the\nonly available method we are aware of is rodeo [11]. However, rodeo requires the total number of\nvariables to be no larger than a double-logarithmic of the data sample size, and does not explicitly\nidentify relevant variables.\nIn this paper, we propose two new greedy algorithms for sparse nonparametric learning in high di-\nmensions. By extending the idea of the orthogonal matching pursuit to nonparametric settings, the\nmain contributions of our work include: (i) we formulate two greedy nonparametric algorithms:\nadditive forward regression (AFR) for sparse additive models and generalized forward regression\n(GFR) for general multivariate regression models. Both of them can simultaneously conduct esti-\nmation and variable selection in high dimensions. (ii) We present theoretical results for AFR using\nspeci\ufb01c smoothers. (iii) We report thorough numerical results on both simulated and real-world\ndatasets to demonstrate the superior performance of these two methods over the state-of-the-art\ncompetitors, including LASSO, SpAM, and an adaptive parametric forward-backward algorithm\ncalled Foba [22].\nThe rest of this paper is organized as follows:\nin the next section we review the basic problem\nformulation and notations. In Section 3 we present the AFR algorithm, in section 4, we present the\nGFR algorithm. Some theoretical results are given in Section 5. In Section 6 we present numerical\nresults on both simulated and real datasets, followed by a concluding section at the end.\n\n2 Sparse Nonparametric Learning in High Dimensions\n\nWe begin by introducing some notations. Assuming n data points(cid:8)(X i, Y i)(cid:9)n\n\na high dimensional regression model\n\ni=1 are observed from\n\n1, . . . , X i\n\nY i = m(X i) + \u0001i, \u0001i \u223c N(0, \u03c32) i = 1, . . . , n,\n\n(1)\np)T \u2208 Rp is a p-dimensional design point, m : Rp \u2192 R is an unknown\nwhere X i = (X i\nsmooth mean function. Here we assume m lies in a p-dimensional second order Sobolev ball with\nIn the sequel, we denote the response vector (Y 1, . . . , Y n)T by Y and the vector\n\ufb01nite radius.\n(X 1\nj , . . . , X n\nWe assume m is functional sparse, i.e. there exists an index set S \u2282 {1, . . . , p}, such that\n\nj )T by Xj for 1 \u2264 j \u2264 p.\n\n(General) m(x) = m(xS),\n\n(2)\n\nwhere |S| = r (cid:28) p and xS denotes the sub-vector of x with elements indexed by S.\nSometimes, the function m can be assumed to have more structures to obtain a better estimation\nresult. The most popular one is additivity assumption [10]. In this case, m decomposes into the sum\nof r univariate functions {mj}j\u2208S:\n\n(Additive) m(x) = \u03b1 +(cid:88)\n\nmj(xj),\n\nj\u2208S\n\n(3)\n\nwhere each component function mj is assumed to lie in a second order Sobolev ball with \ufb01nite\nradius so that each element in the space is smooth enough. For the sake of identi\ufb01ability, we also\nassume Emj(Xj) = 0 for j = 1, . . . , p, where the expectation is taken with respect to the marginal\ndistribution of Xj.\nGiven the models in (2) or (3), we have two tasks: function estimation and variable selection. For\n\nthe \ufb01rst task, we try to \ufb01nd an estimate (cid:98)m, such that (cid:107)(cid:98)m \u2212 m(cid:107) \u2192 0 as n goes to in\ufb01nity, where (cid:107) \u00b7 (cid:107)\nis some function norm. For the second task, we try to \ufb01nd an estimate (cid:98)S, which is an index set of\n\n(cid:16)(cid:98)S = S\n\n(cid:17) \u2192 1 as n goes to in\ufb01nity.\n\nvariables, such that P\n\n3 Additive Forward Regression\n\nIn this section, we assume the true model is additive, i.e. m(x) = \u03b1 +(cid:80)\n\nj\u2208S mj(xj). In general,\nif the true index set for the relevant variables is known, the back\ufb01tting algorithm can be directly\n\n2\n\n\fequations in a function space. In particular, we denote the estimates on the jth variable Xj to be\n\napplied to estimate (cid:98)m [10]. It is essentially a Gauss-Seidel iteration for solving a set of normal\nj ), . . . ,(cid:98)mj(X n\n(cid:98)mj \u2261 ((cid:98)mj(X 1\nj ))T \u2208 Rn. Then(cid:98)mj can be estimated by regressing the partial residual\nvector Rj = Y \u2212 \u03b1 \u2212(cid:80)\nk(cid:54)=j (cid:98)mk on the variable Xj. This can be calculated by (cid:98)mj = SjRj, where\n(cid:98)mj is updated, the algorithm holds it \ufb01xed and repeats this process by cycling through each variable\n\nSj : Rn \u2192 Rn is a smoothing matrix, which only depends on X 1, . . . , X n but not on Y . Once\nuntil convergence. Under mild conditions on the smoothing matrices S1, . . . ,Sp, the back\ufb01tting\nalgorithm is a \ufb01rst order algorithm that guarantees to converge [5] and achieves the minimax rate\nof convergence as if only estimating a univariate function. However, for sparse learning problems,\nsince the true index set is unknown, the back\ufb01tting algorithm no longer works due to the uncontrolled\nestimation variance.\nBy extending the idea of the orthogonal matching pursuit to sparse additive models, we design a\nforward greedy algorithm called the additive forward regression (AFR), which only involves a few\nvariables in each iteration. Under this framework, we only need to conduct the back\ufb01tting algorithm\non a small set of variables. Thus the variance can be well controlled. The algorithm is described in\nFigure 1, where we use (cid:104)\u00b7,\u00b7(cid:105)n to denote the inner product of two vectors.\n\ni=1 and \u03b7 > 0\n\nfor k = 1, 2, 3, . . .\n\ni=1 Y i/n and the residual R(0) = Y \u2212 \u03b1\n|(cid:104)(cid:98)mj, R(k\u22121)(cid:105)n|\n\nInput:(cid:8)(X i, Y i)(cid:9)n\nlet A(0) = \u2205, \u03b1 =(cid:80)n\nfor each j (cid:54)\u2208 A(k\u22121), estimate (cid:98)mj by smoothing: (cid:98)mj = SjR(k\u22121)\ncompute the residual R(k) = Y \u2212 \u03b1 \u2212(cid:80)\n\nlet j(k) = arg max\nj(cid:54)\u2208A(k\u22121)\nlet A(k) = A(k\u22121) \u222a j(k)\nestimate M(k) = {mj : j \u2208 A(k)} by the back\ufb01tting algorithm\nif ((cid:107)R(k\u22121)(cid:107)2\nk = k \u2212 1\nbreak\n\nmj\u2208M(k) mj(Xj)\n\n2 \u2212 (cid:107)R(k)(cid:107)2\n\n2)/n \u2264 \u03b7\n\nend if\n\nend for\n\nOutput: selected variables A(k) and estimated component functions M(k) = {mj : j \u2208 A(k)}\n\nFigure 1: THE ADDITIVE FORWARD REGRESSION ALGORITHM\n\nThe algorithm uses an active set A to index the variables included in the model during each iteration\nand then performs a full optimization over all \u201cactive\u201d variables via the back\ufb01tting algorithm. The\nmain advantage of this algorithm is that during each iteration, the model inference is conducted in\nlow dimensions and thus avoids the curse of dimensionality. The stopping criterion is controlled\nby a prede\ufb01ned parameter \u03b7 which is equivalent to the regularization tuning parameter in convex\nregularization methods. Other stopping criteria, such as the maximum number of steps, can also\nbe adopted. In practice, we always recommend to use data-dependent technique, such as cross-\nvalidation, to automatically tune this parameter.\nMoreover, the smoothing matrix Sj can be fairly general, e.g. univariate local linear smoothers as\ndescribed below, kernel smoothers or spline smoothers [21], etc.\n\n4 Generalized Forward Regression\n\nThis section only assume m(x) to be functional sparse, i.e. m(x) = m(xS), without restricting the\n\nmodel to be additive. In this case, to \ufb01nd a good estimate (cid:98)m becomes more challenging.\nlocal linear regression: given an evaluation point x = (x1, . . . , xp)T , the estimate (cid:98)m(x) is the\n\nTo estimate the general multivariate mean function m(x), one of the most popular methods is the\n\n3\n\n\fi=1 and \u03b7 > 0\n\nInput:(cid:8)(X i, Y i)(cid:9)n\nlet A(0) = \u2205, \u03b1 =(cid:80)n\nlet \u03b4(k) =(cid:80)n\n\nfor k = 1, 2, 3, . . .\n\nlet j(k) = arg min\nj(cid:54)\u2208A(k\u22121)\nlet A(k) = A(k\u22121) \u222a j(k)\n\ni=1 Y i/n and \u03b4(0) =(cid:80)n\n(cid:80)n\n(cid:0)Y i \u2212 S(A(k))X iY(cid:1)2\n\n/n\n\ni=1\n\nif (\u03b4(k\u22121) \u2212 \u03b4(k)) \u2264 \u03b7\n\ni=1\n\ni=1(Y i \u2212 \u03b1)2/n\n\n(cid:0)Y i \u2212 S(A(k\u22121) \u222a j)X iY(cid:1)2\n\n/n\n\nk = k \u2212 1\nbreak\n\nend if\n\nend for\n\nOutput: selected variables A(k) and local linear estimates (S(A(k))X1 Y, . . . ,S(A(k))X n Y )\n\nFigure 2: THE GENERALIZED FORWARD REGRESSION ALGORITHM\n\nsolution(cid:98)\u03b1x to the following locally kernel weighted least squares problem:\nj \u2212 xj),\n\n(cid:8)Y i \u2212 \u03b1x \u2212 \u03b2T\n\nx (X i \u2212 x)(cid:9)2\n\nn(cid:88)\n\np(cid:89)\n\nKhj (X i\n\nmin\n\u03b1x,\u03b2x\n\ni=1\n\nj=1\n\n(4)\n\nwhere K(\u00b7) is a one dimensional kernel function and the kernel weight function in (4) is taken as a\nproduct kernel with the diagonal bandwidth matrix H 1/2 = diag{h1, . . . , hp}. Such a problem can\nbe re-casted as a standard weighted least squares regression. Therefore a closed-form solution to the\nthe local linear estimate can be explicitly given by\n\nx WxY = SxY,\n\nx WxXx)\u22121X T\np(cid:89)\n\nj=1\n\nKhj (X 1\n\nKhj (X n\n\nWx = diag\n\nj \u2212 xj), . . . ,\n\n\uf8eb\uf8ec\uf8ed 1\n\n\uf8f1\uf8f2\uf8f3 p(cid:89)\n\nwhere e1 = (1, 0, . . . , 0)T is the \ufb01rst canonical vector in Rp+1 and\n\n\uf8f6\uf8f7\uf8f8 .\nhas been characterized in [8]: |(cid:98)m(x) \u2212 m(x)|2 = OP (n\u22124/(4+p)), which is extremely slow when\n\nHere, Sx is the local linear smoothing matrix. Note that if we constrain \u03b2x = 0, then the local linear\nestimate reduces to the kernel estimate. The pointwise rate of convergence of such an estimate\n\n\uf8fc\uf8fd\uf8fe , Xx =\n\np > 10.\nTo handle the large p case, we again extend the idea of the orthogonal matching pursuit to this\nsetting. For an index subset A \u2282 {1, . . . , p} and the evaluation point x, the local linear smoother\nrestricted on A is denoted as S(A) and\n\n(X n \u2212 x)T\n\n(X 1 \u2212 x)T\n\nj \u2212 xj)\n\n...\n1\n\nj=1\n\n...\n\n(cid:98)\u03b1x = eT\n\n1 (X T\n\n(cid:0)X(A)T\n\n(cid:1)\u22121\n\nS(A)x = eT\n\n1\n\nx W (A)xX(A)x\n\nX(A)T\n\nx W (A)x,\n\nwhere W (A)x is a diagonal matrix whose diagonal entries are the product of univariate kernels over\nthe set A and X(A)x is a submatrix of Xx that only contains the columns indexed by A.\nGiven these de\ufb01nitions, the generalized forward regression (GFR) algorithm is described in Figure\n2. Similar to AFR, GFR also uses an active set A to index the variables included in the model.\nSuch mechanism allows all the statistical inference to be conducted only in low-dimensional spaces.\nThe GFR algorithm using the multivariate local linear smoother can be computationally heavy for\nvery high dimensional problems. However, GFR is a generic framework and can be equipped with\narbitrary multivariate smoothers, e.g. kernel/Nearest Neighbor/spline smoothers. These smoothers\nlead to much better scalability. The only reason we use the local linear smoother as an illustrative\nexample in this paper is due to its popularity and potential advantage on correcting the boundary\nbias.\n\n4\n\n\f(cid:107)m \u2212 (cid:98)m(cid:107)2 = OP\n\n(cid:32)(cid:114)log n\n(cid:33)\n(cid:17)1/4(cid:19)\n(cid:18)(cid:16) log n\n\nn\n\n.\n\nn\n\n, then P\n\n(5)\n\n5 Theoretical Properties\n\nIn this section, we provide the theoretical properties of the additive forward regression estimates\nusing the spline smoother. Due to the asymptotic equivalence of the spline smoother and the local\nlinear smoother [18], we deduce that these results should also hold for the local linear smoother.\nOur main result in Theorem 1 says when using the spline smoother with certain truncation rate\nto implement AFR algorithm, the resulting estimator is consistent with a certain rate. When the\nunderlying true component functions do not go to zeroes too fast, we also achieve variable selection\nconsistency. Our analysis relies heavily on [3]. A similar analysis has also been reported in the\ntechnical report version of [16].\nTheorem 1. Assuming there exists some \u03be > 0 which can be arbitrarily large, such that p = O(n\u03be).\nFor \u2200j \u2208 {1, . . . , p}, we assume mj lies in a second-order Sobolev ball with \ufb01nite radius, and\nj=1 mj. For the additive forward regression algorithm using the spline smoother with\n\nm = \u03b1 +(cid:80)p\n\na truncation rate at n1/4, after (n/log n)1/2 steps, we obtain that\n\n(cid:16)(cid:98)S = S\n\nFurthermore, if we also assume minj\u2208S (cid:107)mj(cid:107) = \u2126\n\n(cid:17) \u2192 1 as n goes\nto in\ufb01nity. Here, (cid:98)S is the index set for nonzero component functions in (cid:98)m.\nThe rate for (cid:107)(cid:98)m \u2212 m(cid:107)2 obtained from Theorem 1 is only O(n\u22121/2), which is slower than the\n\nminimax rate O(n\u22124/5). This is mainly an artifact of our analysis instead of a drawback of the\nadditive forward regression algorithm. In fact, if we perform a basis expansion for each component\nfunction to \ufb01rst cast the problem to be a \ufb01nite dimensional linear model with group structure, under\nsome more stringent smallest eigenvalue conditions on the augmented design as in [23], we can\nshow that AFR using spline smoothers can actually achieves the minimax rate O(n\u22124/5) up to a\nlogarithmic factor. A detailed treatment will be reported in a follow up paper.\nSketch of Proof: We \ufb01rst describe an algorithm called group orthogonal greedy algorithm (GOGA),\nwhich solves a noiseless function approximation problem in a direct-sum Hilbert space. AFR can\nthen be viewed as an empirical realization of such an \u201cideal\u201d algorithm.\nGOGA is a group extension of the orthogonal greedy algorithm (OGA) in [3]. For j = 1, . . . , p, let\nHj be a Hilbert space of continuous functions with a Hamel basis Dj. Then for a function m in the\ndirect-sum Hilbert space H = H1 + H2 + . . . + Hp, we want to approximate m using the union of\nmany truncated bases D = D(cid:48)\n\nWe equip an inner product (cid:104)\u00b7,\u00b7(cid:105) on H: \u2200f, g \u2208 H, (cid:104)f, g(cid:105) = (cid:82) f(X)g(X)dPX where PX is the\n\nmarginal distribution for X. Let (cid:107) \u00b7 (cid:107) be the norm induced by the inner product (cid:104)\u00b7,\u00b7(cid:105) on H. GOGA\nbegins by setting m(0) = 0, and then recursively de\ufb01nes the approximant m(k) based on m(k\u22121)\nand its residual r(k\u22121) \u2261 m \u2212 m(k\u22121). More speci\ufb01cally: we proceed as the following: de\ufb01ne f (k)\nto be the projection of r(k\u22121) onto the truncated basis D(cid:48)\n(cid:107)r(k\u22121) \u2212 g(cid:107)2.\nWe calculate j(k) as j(k) = arg maxj |(cid:104)r(k\u22121), f (k)\n(cid:105)|. m(k) can then be calculated by projecting m\nonto the additive function space generated by A(k) = D(cid:48)\n(cid:98)m(k) = arg min\n(cid:80)n\n\nAFR using regression splines is exactly GOGA when there is no noise. For noisy samples, we\nreplace the unknown function m by its n-dimensional output vector Y , and replace the inner product\n(cid:104)\u00b7,\u00b7(cid:105) by (cid:104)\u00b7,\u00b7(cid:105)n, which is de\ufb01ned as (cid:104)f, g(cid:105)n = 1\ni=1 f(X i)g(X i). The projection of the current\nresidual vector onto each dictionary D(cid:48)\nj is replaced by the corresponding nonparametric smoothers.\nConsidering any function m \u2208 H, we proceed in the same way as in [3], but replacing the OGA\narguments in their analysis by those of GOGA. The desired results of the theorem follow from a\nsimple argument on bounding the random random covering number of spline spaces.\n\nj(1) + \u00b7\u00b7\u00b7 + D(cid:48)\n(cid:107)m \u2212 g(cid:107)2.\n\np, where for all j, D(cid:48)\n\nj, i.e. f (k)\n\nj = arg ming\u2208D(cid:48)\n\n1 \u222a . . . \u222a D(cid:48)\n\nj\n\nn\n\nj\n\nj\n\nj \u2282 Dj.\n\ng\u2208span(A(k))\n\nj(k):\n\n5\n\n\f6 Experimental Results\n\nIn this section, we present numerical results for AFR and GFR applied to both synthetic and real\ndata. The main conclusion is that, in many cases, their performance on both function estimation\nand variable selection can clearly outperform those of LASSO, Foba, and SpAM. For all the re-\nported experiments, we use local linear smoothers to implement AFR and GFR. The results for\nother smoothers, such as smoothing splines, are similar. Note that different bandwidth parameters\nwill have big effects on the performances of local linear smoothers. Our experiments simply use\nthe plug-in bandwidths according to [8] and set the bandwidth for each variable to be the same.\nFor AFR, the bandwidth h is set to be 1.06n\u22121/5 and for GFR, the bandwidth is varying over each\niteration such that h = 1.06n\u22121/(4+|A|), where |A| is the size of the current active set.\n\nFor an estimate(cid:98)m, the estimation performance for the synthetic data is measured by the mean square\nerror (MSE), which is de\ufb01ned as MSE((cid:98)m) = 1\n\n. For the real data, since\nwe do not know the true function m(x), we approximate the mean squared error using 5-fold cross-\nvalidation scores.\n\n(cid:0)m(X i) \u2212 (cid:98)m(X i)(cid:1)2\n\n(cid:80)n\n\ni=1\n\nn\n\n6.1 The Synthetic Data\n\nFor the synthetic data experiments, we consider the compound symmetry covariance structure of the\ndesign matrix X \u2208 Rn\u00d7p with n = 400 and p = 20. Each dimension Xj is generated according to\n\nXj = Wj + tU\n1 + t\n\n,\n\nj = 1, . . . , p,\n\nwhere W1, . . . , Wp and U are i.i.d. sampled from Uniform(0,1). Therefore the correlation between\nXj and Xk is t2/(1 + t2) for j (cid:54)= k. We assume the true regression functions have r = 4 relevant\nvariables:\n\nY = m(X) + \u0001 = m(X1, . . . , X4) + \u0001.\n\n(6)\n\nTo evaluate the variable selection performance of different methods, we generate 50 designs and 50\ntrials for each design. For each trial, we run the greedy forward algorithm r steps. If all the relevant\nvariables are included in, the variable selection task for this trial is said to be successful. We report\nthe mean and standard deviation of the success rate in variable selection for various correlation\nbetween covariates by varying the values of t.\nWe adopt some synthetic examples as in [12] and de\ufb01ne the following four functions: g1(x) = x,\ng2(x) = (2x\u2212 1)2, g3(x) = sin(2\u03c0x)/(2\u2212 sin(2\u03c0x)), and g4(x) = 0.1 sin(2\u03c0x) + 0.2 cos(2\u03c0x) +\n0.3 sin2(2\u03c0x) + 0.4 cos3(2\u03c0x) + 0.5 sin3(2\u03c0x).\nThe following four regression models are studied. The \ufb01rst model is linear; the second is additive;\nthe third and forth are more complicated nonlinear models with at least two way interactions:\n\n1 + 3X i\n\n(Model1) : Y i = 2X i\n(Model2) : Y i = 5g1(X i\n(Model3) : Y i = exp(2X i\n\n(Model4) : Y i = (cid:88)4\n\n2 + 4X i\n1) + 3g2(X i\n2 + X i\n1X i\ngj(X i\n\n3 + 5X i\n2) + 4g3(X i\n3) + 2X i\n3X i\n\nj) + g1(X i\n\nj=1\nwith t = 0.5.\n\n4 + 2N(0, 1), with t = 1 ;\n\n3) + 6g4(X i\n\n4) + 4N(0, 1), with t = 1 ;\n\n4 + N(0, 1), with t = 0.5 ;\n4) + g2((X i\n\n1 + X i\n\n3)/2) + g3(X i\n\n1X i\n\n2) + N(0, 1)\n\nCompared with LASSO, Foba, and SpAM, the estimation performance using MSE as evaluation\ncriterion is presented in Figure 3. And Table 1 shows the rate of success for variable selection of\nthese models with different correlations controlled by t.\nFrom Figure 3, we see that AFR and GFR methods provide very good estimates for the underlying\ntrue regression functions as compared to others. Firstly, LASSO and SpAM perform very poorly\nwhen the selected model is very sparse. This is because they are convex regularization based ap-\nproaches: to obtain a very sparse model, they induce very large estimation bias. On the other hand,\nthe greedy pursuit based methods like Foba, AFR and GFR do not suffer from such a problem. Sec-\nondly, when the true model is linear, all methods perform similarly. For the nonlinear true regression\n\n6\n\n\fModel 1\n\nModel 2\n\nModel 3\n\nModel 4\n\nFigure 3: Performance of the different algorithms on synthetic data: MSE versus sparsity level\n\nfunction, AFR, GFR and SpAM outperform LASSO and Foba. It is expectable since LASSO and\nFoba are based on linear assumptions. Furthermore, we notice that when the true model is addi-\ntive (Model 2) or nearly additive (Model 4), AFR performs the best. However, for the non-additive\ngeneral multivariate regression function (Model 3), GFR performs the best. For all examples, when\nmore and more irrelevant variables are included in the model, SpAM has a better generalization\nperformance due to the regularization effect.\n\nTable 1: Comparison of variable selection\n\nLASSO(sd)\n1.000 (0.0000)\n0.879 (0.0667)\n0.559 (0.0913)\n\nFoba\n1.000 (0.0000)\n0.882 (0.0557)\n0.553 (0.0777)\n\nSpAM\n0.999 (0.0028)\n0.683 (0.1805)\n0.190 (0.1815)\n\nAFR\n0.999 (0.0039)\n0.879 (0.0525)\n0.564 (0.0739)\n\nGFR\n0.990 (0.0229)\n0.839 (0.0707)\n0.515 (0.0869)\n\nLASSO(sd)\n0.062 (0.0711)\n0.056 (0.0551)\n0.004 (0.0106)\n\nFoba\n0.069 (0.0774)\n0.060 (0.0550)\n0.029 (0.0548)\n\nSpAM\n0.842 (0.1128)\n0.118 (0.0872)\n0.008 (0.0056)\n\nAFR\n0.998 (0.0055)\n0.819 (0.1293)\n0.260 (0.1439)\n\nGFR\n0.769 (0.1751)\n0.199 (0.2102)\n0.021 (0.0364)\n\nLASSO(sd)\n0.997 (0.0080)\n0.818 (0.1137)\n0.522 (0.1520)\n\nFoba\n0.999 (0.0039)\n0.802 (0.1006)\n0.391 (0.1577)\n\nSpAM\n0.980 (0.1400)\n0.934 (0.1799)\n0.395 (0.3107)\n\nAFR\n1.000 (0.0000)\n1.000 (0.0000)\n0.902 (0.1009)\n\nGFR\n1.000 (0.0000)\n0.995 (0.0103)\n0.845 (0.1623)\n\nLASSO(sd)\n0.043 (0.0482)\n0.083 (0.0823)\n0.048 (0.0456)\n\nFoba\n0.043 (0.0437)\n0.049 (0.0511)\n0.085 (0.0690)\n\nSpAM\n0.553 (0.1864)\n0.157 (0.1232)\n0.095 (0.0754)\n\nAFR\n0.732 (0.1234)\n0.126 (0.0688)\n0.192 (0.0679)\n\nGFR\n0.967 (0.0365)\n0.708 (0.1453)\n0.171 (0.1067)\n\nModel 1\nt = 0\nt = 1\nt = 2\n\nModel 2\nt = 0\nt = 1\nt = 2\n\nModel 3\nt = 0\nt = 1\nt = 2\n\nModel 4\nt = 0\nt = 0.5\nt = 1\n\nThe variable selection performances of different methods in Table 1 are very similar to their estima-\ntion performances. We observe that, when correlation parameter t becomes larger, the performances\nof all methods decrease. But SpAM is most sensitive to the correlation increase. In all models, the\nperformance of SpAM can decrease more than 70% for the larger t; in contrast, AFR and GFR are\nmore robust to the increased correlation between different covariates. Another interesting observa-\ntion is on model 4. From the previous discussion, on this model, AFR achieves a better estimation\nperformance. However, when comparing the variable selection performance, GFR is the best. This\nsuggests that for nonparametric inference, the goals of estimation consistency and variable selection\nconsistency might not be always coherent. Some tradeoffs might be needed to balance them.\n\n6.2 The real data\n\nIn this subsection, we compare \ufb01ve methods on three real datasets: Boston Housing, AutoMPG, and\nIonosphere data set 1. Boston Housing contains 556 data points, with 13 features; AutoMPG 392\ndata points (we delete those with missing values), with 7 features and Ionosphere 351 data points,\nwith 34 features and the binary output. We treat Ionosphere as a regression problem although the\n\n1Available from UCI Machine Learning Database Repository: http:archive.ics.uci.edu/ml.\n\n7\n\n123456789100123456sparsityMSE  GFRAFRSpAMFobaLASSO123456789104567891011121314sparsityMSE  GFRAFRSpAMFobaLASSO1234567891000.511.522.533.5sparsityMSE  GFRAFRSpAMFobaLASSO123456789100.20.250.30.350.40.450.5sparsityMSE  GFRAFRSpAMFobaLASSO\fresponse is binary. We run 10 times 5-fold cross validation on each dataset and plot the mean and\nstandard deviation of MSE versus different sparsity levels in Figure 4.\n\nBoston Housing\n\nAutoMPG\n\nIonosphere\n\nFigure 4: Performance of the different algorithms on real datasets: CV error versus sparsity level\nFrom Figure 4, since all the error bars are tiny, we deem all the results signi\ufb01cant. On the Boston\nHousing and AutoMPG datasets, the generalization performances of AFR and GFR are clearly bet-\nter than LASSO, Foba, and SpAM. For all these datasets, if we prefer very sparse models, the\nperformance of the greedy methods are much better than the convex regularization methods due\nto the much less bias being induced. On the Ionosphere data, we only need to run GFR up to 15\nselected variables, since the generalization performance with 15 variables is already worse than the\nnull model due to the curse of dimensionality. Both AFR and GFR on this dataset achieve the best\nperformances when there are no more than 10 variables included; while SpAM achieves the best\nCV score with 25 variables. However, this is not to say that the true model is not sparse. The main\nreason that SpAM can achieve good generalization performance when many variables included is\ndue to its regularization effect. We think the true model should be sparse but not additive. Similar\ntrend among different methods has also appeared in Model 4 of previous synthetic datasets.\n\n7 Conclusions and Discussions\n\nWe presented two new greedy algorithms for nonparametric regression with either additive mean\nfunctions or general multivariate regression functions. Both methods utilize the iterative forward\nstepwise strategy, which guarantees the model inference is always conducted in low dimensions in\neach iteration. These algorithms are very easy to implement and have good empirical performance\non both simulated and real datasets.\nOne thing worthy to note is: people sometimes criticize the forward greedy algorithms since they\ncan never have the chance to correct the errors made in the early steps. This is especially true for\nhigh dimensional linear models, which motivates the outcome of some adaptive forward-backward\nprocedures such as Foba [22]. We addressed a similar question: Whether a forward-backward pro-\ncedure also helps in the nonparametric settings? AFR and GFR can be trivially extended to be\nforward-backward procedures using the same way as in [22]. We conducted a comparative study to\nsee whether the backward steps help or not. However, the backward step happens very rarely and\nthe empirical performance is almost the same as the purely forward algorithm. This is very different\nfrom the linear model cases, where the backward step can be crucial. In summary, in the nonpara-\nmetric settings, the backward ingredients will cost much more computational efforts with very tiny\nperformance improvement. We will investigate more on this phenomenon in the near future.\nA very recent research strand is to learn nonlinear models by the multiple kernel learning machinery\n[1, 2], another future work is to compare our methods with the multiple kernel learning approach\nfrom both theoretical and computational perspectives.\n\nAcknowledgements\nWe thank John Lafferty, Larry Wasserman, Pradeep Ravikumar, and Jamie Carbonell for very help-\nful discussions on this work. This research was supported in part by NSF grant CCF-0625879 and a\nGoogle Fellowship to Han Liu.\n\n8\n\n12345678910111213102030405060708090sparsityCV Error  GFRAFRSpAMFobaLASSO1234567010203040506070sparsityCV Error  GFRAFRSpAMFobaLASSO1510152025300.060.080.10.120.140.160.180.20.220.24sparsityCV Error  GFRAFRSpAMFobaLASSO\fReferences\n[1] Francis Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine\n\nLearning Research, 8:1179\u20131225, 2008.\n\n[2] Francis Bach. Exploring large feature spaces with hierarchical multiple kernel learning. In\n\nAdvances in Neural Information Processing Systems 21. MIT Press, 2008.\n\n[3] Andrew R. Barron, Albert Cohen, Wolfgang Dahmen, and Ronald A. DeVore. Approximation\n\nand learning by greedy algorithms. The Annals of Statistics, 36:64\u201394, 2008.\n\n[4] Peter B\u00a8uhlmann and Bin Yu. Sparse boosting. Journal of Machine Learning Research, 7:1001\u2013\n\n1024, 2006.\n\n[5] Andreas Buja, Trevor Hastie, and Robert Tibshirani. Linear smoothers and additive models.\n\nThe Annals of Statistics, 17:453\u2013510, 1989.\n\n[6] Emmanuel Candes and Terence Tao. The dantzig selector: statistical estimation when p is\n\nmuch larger than n. The Annals of Statistics, 35:2313\u20132351, 2007.\n\n[7] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by\n\nbasis pursuit. SIAM Journal on Scienti\ufb01c and Statistical Computing, 20:33\u201361, 1998.\n\n[8] Jianqing Fan and Ir`ene Gijbels. Local polynomial modelling and its applications. Chapman\n\nand Hall, 1996.\n\n[9] Jerome H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1\u2013\n\n67, 1991.\n\n[10] Trevor Hastie and Robert Tibshirani. Generalized additive models. Chapman & Hall Ltd.,\n\n1999.\n\n[11] John Lafferty and Larry Wasserman. Rodeo: Sparse, greedy nonparametric regression. The\n\nAnnals of Statistics, 36(1):28\u201363, 2008.\n\n[12] Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonpara-\n\nmetric regression. The Annals of Statistics., 34(5):2272\u20132297, 2006.\n\n[13] Han Liu and Jian Zhang. On the estimation consistency of the group lasso and its applications.\nProceedings of the Twelfth International Conference on Arti\ufb01cial Intelligence and Statistics,\n2009.\n\n[14] S. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Transactions\n\non Signal Processing, 41:3397\u20133415, 1993.\n\n[15] Lukas Meier, Sara van de Geer, and Peter B\u00a8uhlmann. High-dimensional additive modelling.\n\nThe Annals of Statistics (to appear), 2009.\n\n[16] Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models.\n\nJournal of the Royal Statistical Society, Series B, Methodological, 2009. To appear.\n\n[17] Pradeep Ravikumar, Han Liu, John Lafferty, and Larry Wasserman. Spam: Sparse additive\n\nmodels. In Advances in Neural Information Processing Systems 20, 2007.\n\n[18] B. W. Silverman. Spline smoothing: The equivalent variable kernel method. The Annals of\n\nStatistics, 12:898\u2013916, 1984.\n\n[19] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal\n\nStatistical Society, Series B, Methodological, 58:267\u2013288, 1996.\n\n[20] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation.\n\nInform. Theory, 50(10):2231\u20132241, October 2004.\n\nIEEE Trans.\n\n[21] Grace Wahba. Spline models for observational data. SIAM [Society for Industrial and Applied\n\nMathematics], 1990.\n\n[22] Tong Zhang. Adaptive forward-backward greedy algorithm for learning sparse representations.\n\nTechnical report, Rutgers University, 2008.\n\n[23] Tong Zhang. On the consistency of feature selection usinggreedy least squares regression.\n\nJournal of Machine Learning Research, 10:555\u2013568, 2009.\n\n9\n\n\f", "award": [], "sourceid": 696, "authors": [{"given_name": "Han", "family_name": "Liu", "institution": null}, {"given_name": "Xi", "family_name": "Chen", "institution": null}]}