{"title": "Group Sparse Additive Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 198, "page_last": 208, "abstract": "A family of learning algorithms generated from additive models has attracted much attention recently for its flexibility and interpretability in high-dimensional data analysis. Among them, learning models with grouped variables have shown competitive performance for prediction and variable selection. However, previous works mainly focus on the least squares regression problem rather than the classification task. It is therefore desirable to design a new additive classification model with variable selection capability for the many real-world applications that center on high-dimensional data classification. To address this challenging problem, in this paper we investigate classification with group sparse additive models in reproducing kernel Hilbert spaces. A novel classification method, called the \\emph{group sparse additive machine} (GroupSAM), is proposed to explore and utilize the structure information among the input variables. A generalization error bound is derived by integrating a sample error analysis based on empirical covering numbers with a hypothesis error estimate based on the stepping stone technique. Our new bound shows that GroupSAM can achieve a satisfactory learning rate with polynomial decay. 
Experimental results on synthetic data and seven benchmark datasets consistently show the effectiveness of our new approach.", "full_text": "Group Sparse Additive Machine\n\nHong Chen1, Xiaoqian Wang1, Cheng Deng2, Heng Huang1\u2217\n\n1 Department of Electrical and Computer Engineering, University of Pittsburgh, USA\n\n2 School of Electronic Engineering, Xidian University, China\n\nchenh@mail.hzau.edu.cn,xqwang1991@gmail.com\nchdeng@mail.xidian.edu.cn,heng.huang@pitt.edu\n\nAbstract\n\nA family of learning algorithms generated from additive models have attracted\nmuch attention recently for their \ufb02exibility and interpretability in high dimensional\ndata analysis. Among them, learning models with grouped variables have shown\ncompetitive performance for prediction and variable selection. However, the\nprevious works mainly focus on the least squares regression problem, not the\nclassi\ufb01cation task. Thus, it is desired to design the new additive classi\ufb01cation\nmodel with variable selection capability for many real-world applications which\nfocus on high-dimensional data classi\ufb01cation. To address this challenging problem,\nin this paper, we investigate the classi\ufb01cation with group sparse additive models\nin reproducing kernel Hilbert spaces. A novel classi\ufb01cation method, called as\ngroup sparse additive machine (GroupSAM), is proposed to explore and utilize\nthe structure information among the input variables. Generalization error bound is\nderived and proved by integrating the sample error analysis with empirical covering\nnumbers and the hypothesis error estimate with the stepping stone technique. Our\nnew bound shows that GroupSAM can achieve a satisfactory learning rate with\npolynomial decay. 
Experimental results on synthetic data and seven benchmark datasets consistently show the effectiveness of our new approach.\n\n1 Introduction\n\nAdditive models based on statistical learning methods have been playing important roles in high-dimensional data analysis due to their good performance on prediction tasks and variable selection (deep learning models often do not work well when the amount of training data is not large). In essence, additive models inherit the representation flexibility of nonlinear models and the interpretability of linear models. For a learning approach under additive models, there are two key components: the hypothesis function space and the regularizer that imposes certain restrictions on the estimator. Different from traditional learning methods, the hypothesis space used in additive models relies on a decomposition of the input vector. Usually, each input vector $X \\in \\mathbb{R}^p$ is divided into $p$ parts directly [17, 30, 6, 28] or into subgroups according to prior structural information among the input variables [27, 26]. A component function is defined on each decomposed input, and the hypothesis function is constructed as the sum of all component functions. Typical examples of hypothesis spaces include kernel-based function spaces [16, 6, 11] and spline-based function spaces [13, 15, 10, 30]. Moreover, the Tikhonov regularization scheme has been used extensively for constructing additive models, where the regularizer is employed to control the complexity of the hypothesis space. 
Examples of regularizers include the kernel-norm regularization associated with the reproducing kernel Hilbert space (RKHS) [5, 6, 11] and various sparse regularizers [17, 30, 26]. More recently, several group sparse additive models have been proposed to tackle the high-dimensional regression problem due to their nice theoretical properties and empirical effectiveness [15, 10, 26].\n\n\u2217Corresponding author\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nHowever, most existing additive-model-based learning approaches are limited to the least squares regression problem and spline-based hypothesis spaces. Surprisingly, there has been no algorithmic design or theoretical analysis for the classification problem with group sparse additive models in RKHS. This paper focuses on filling this gap in algorithmic design and learning theory for additive models. A novel sparse classification algorithm, called the group sparse additive machine (GroupSAM), is proposed under a coefficient-based regularized framework, which is connected to the linear programming support vector machine (LPSVM) [22, 24]. By incorporating grouped variables with prior structural information and the $\\ell_{2,1}$-norm based structured sparse regularizer, the new GroupSAM model can conduct nonlinear classification and variable selection simultaneously. Similar to the sparse additive machine (SAM) in [30], our GroupSAM model can be efficiently solved via a proximal gradient descent algorithm. The main contributions of this paper are two-fold:\n\n\u2022 A new group sparse nonlinear classification algorithm (GroupSAM) is proposed by extending previous additive regression models to the classification setting; it contains the LPSVM with additive kernel as a special case. 
To the best of our knowledge, this is the first algorithmic exploration of additive classification models with group sparsity.\n\n\u2022 Theoretical analysis and empirical evaluations of generalization ability are presented to support the effectiveness of GroupSAM. Based on a constructive analysis of the hypothesis error, we obtain an estimate of the excess generalization error, which shows that our GroupSAM model can achieve the fast convergence rate $O(n^{-1})$ under mild conditions. Experimental results demonstrate the competitive performance of GroupSAM over the related methods on both simulated and real data.\n\nBefore ending this section, we discuss related works. In [5], a support vector machine (SVM) with additive kernels was proposed and its classification consistency was established. Although this method can also be used for grouped variables, it only focuses on the kernel-norm regularizer without addressing the sparseness needed for variable selection. In [30], the SAM was proposed to deal with sparse representation on the orthogonal basis of the hypothesis space. Despite good computational and generalization performance, SAM does not explore the structure information of input variables and ignores the interactions among variables. More importantly, different from the finite spline approximation in [30], our approach enables us to estimate each component function directly in an RKHS. As illustrated in [20, 14], the RKHS-based method is flexible and only depends on a few tuning parameters, while the commonly used spline methods need to specify the number of basis functions and the sequence of knots.\nIt should be noticed that the group sparse additive models (GroupSpAM in [26]) also address sparsity on grouped variables. However, there are key differences between GroupSAM and GroupSpAM: 1) Hypothesis space. 
The component functions in our model are obtained by searching in kernel-based, data-dependent hypothesis spaces, whereas the method in [26] uses a data-independent hypothesis space (not associated with a kernel). As shown in [19, 18, 4, 25], data-dependent hypothesis spaces can provide much more adaptivity and flexibility for nonlinear prediction. The advantage of kernel-based hypothesis spaces for additive models is also discussed in [14]. 2) Loss function. The hinge loss used in our classification model is different from the least squares loss in [26]. 3) Optimization. Our GroupSAM only needs to construct one component function for each variable group, but the model in [26] needs to find a component function for each variable in a group. Thus, our method is usually more efficient. Due to the kernel-based component functions and the non-smooth hinge loss, the optimization of GroupSpAM cannot be extended to our model directly. 4) Learning theory. We establish the generalization bound of GroupSAM by an error estimate technique for data-dependent hypothesis spaces, while no such error bound is given in [26].\nNow, we present a brief summary in Table 1 to better illustrate the differences between our GroupSAM and other methods.\nThe rest of this paper is organized as follows. In the next section, we revisit the related classification formulations and propose the new GroupSAM model. Theoretical analysis of the generalization error bound is established in Section 3. In Section 4, experimental results on both simulated examples and real data are presented and discussed. 
Finally, Section 5 concludes this paper.\n\nTable 1: Properties of different additive models.\n\n                      SAM [30]          Group Lasso [27]   GroupSpAM [26]     GroupSAM\nHypothesis space      data-independent  data-independent   data-independent   data-dependent\nLoss function         hinge loss        least-square       least-square       hinge loss\nGroup sparsity        No                Yes                Yes                Yes\nGeneralization bound  Yes               No                 No                 Yes\n\n2 Group sparse additive machine\n\nIn this section, we first revisit the basic background of binary classification and additive models, and then introduce our new GroupSAM model.\nLet $Z := (\\mathcal{X}, \\mathcal{Y}) \\subset \\mathbb{R}^{p+1}$, where $\\mathcal{X} \\subset \\mathbb{R}^p$ is a compact input space and $\\mathcal{Y} = \\{-1, 1\\}$ is the set of labels. We assume that the training samples $\\mathbf{z} := \\{z_i\\}_{i=1}^n = \\{(x_i, y_i)\\}_{i=1}^n$ are independently drawn from an unknown distribution $\\rho$ on $Z$, where each $x_i \\in \\mathcal{X}$ and $y_i \\in \\{-1, 1\\}$. Let us denote the marginal distribution of $\\rho$ on $\\mathcal{X}$ as $\\rho_{\\mathcal{X}}$ and denote its conditional distribution for given $x \\in \\mathcal{X}$ as $\\rho(\\cdot|x)$.\nFor a real-valued function $f : \\mathcal{X} \\to \\mathbb{R}$, we define its induced classifier as $\\mathrm{sgn}(f)$, where $\\mathrm{sgn}(f)(x) = 1$ if $f(x) \\ge 0$ and $\\mathrm{sgn}(f)(x) = -1$ if $f(x) < 0$. The prediction performance of $f$ is measured by the misclassification error:\n\n$$R(f) = \\mathrm{Prob}\\{Yf(X) \\le 0\\} = \\int_{\\mathcal{X}} \\mathrm{Prob}(Y \\ne \\mathrm{sgn}(f)(x)\\,|\\,x)\\,d\\rho_{\\mathcal{X}}. \\quad (1)$$\n\nIt is well known that the minimizer of $R(f)$ is the Bayes rule:\n\n$$f_c(x) = \\mathrm{sgn}\\Big(\\int_{\\mathcal{Y}} y\\,d\\rho(y|x)\\Big) = \\mathrm{sgn}\\Big(\\mathrm{Prob}(y=1|x) - \\mathrm{Prob}(y=-1|x)\\Big).$$\n\nSince the Bayes rule involves the unknown distribution $\\rho$, it cannot be computed directly. 
In the machine learning literature, a classification algorithm usually aims to find a good approximation of $f_c$ by minimizing the empirical misclassification risk:\n\n$$R_{\\mathbf{z}}(f) = \\frac{1}{n}\\sum_{i=1}^n \\mathbb{I}(y_i f(x_i) \\le 0), \\quad (2)$$\n\nwhere $\\mathbb{I}(A) = 1$ if $A$ is true and $0$ otherwise. However, the minimization problem associated with $R_{\\mathbf{z}}(f)$ is NP-hard due to the $0$-$1$ loss $\\mathbb{I}$. To alleviate the computational difficulty, various convex losses have been introduced to replace the $0$-$1$ loss, e.g., the hinge loss, the least squares loss, and the exponential loss [29, 1, 7]. Among them, the hinge loss is the most popular error metric for classification problems due to its nice theoretical properties. In this paper, following [5, 30], we use the hinge loss:\n\n$$\\ell(y, f(x)) = (1 - yf(x))_+ = \\max\\{1 - yf(x), 0\\}$$\n\nto measure the misclassification cost. The expected and empirical risks associated with the hinge loss are defined respectively as:\n\n$$\\mathcal{E}(f) = \\int_Z (1 - yf(x))_+\\,d\\rho(x, y)$$\n\nand\n\n$$\\mathcal{E}_{\\mathbf{z}}(f) = \\frac{1}{n}\\sum_{i=1}^n (1 - y_i f(x_i))_+.$$\n\nIn theory, the excess misclassification error $R(\\mathrm{sgn}(f)) - R(f_c)$ can be bounded by the excess convex risk $\\mathcal{E}(f) - \\mathcal{E}(f_c)$ [29, 1, 7]. Therefore, the classification algorithm is usually constructed under structural risk minimization [22] associated with $\\mathcal{E}_{\\mathbf{z}}(f)$.\n\nIn this paper, we propose a novel group sparse additive machine (GroupSAM) for nonlinear classification. Let $\\{1, \\cdots, p\\}$ be partitioned into $d$ groups. For each $j \\in \\{1, ..., d\\}$, we set $\\mathcal{X}^{(j)}$ as the grouped input space and denote $f^{(j)} : \\mathcal{X}^{(j)} \\to \\mathbb{R}$ as the corresponding component function. 
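The empirical risks introduced above, the 0-1 risk $R_{\mathbf{z}}(f)$ and its convex hinge surrogate $\mathcal{E}_{\mathbf{z}}(f)$, can be illustrated with a minimal sketch (the numbers are toy values, not from the paper):

```python
import numpy as np

def empirical_01_risk(y, fx):
    # R_z(f) = (1/n) sum_i I(y_i f(x_i) <= 0): the non-convex 0-1 risk.
    return np.mean(y * fx <= 0.0)

def empirical_hinge_risk(y, fx):
    # E_z(f) = (1/n) sum_i (1 - y_i f(x_i))_+: the convex hinge surrogate.
    return np.mean(np.maximum(1.0 - y * fx, 0.0))

y = np.array([1.0, -1.0, 1.0])    # labels in {-1, +1}
fx = np.array([0.5, -2.0, -0.5])  # real-valued predictions f(x_i)
# hinge terms: 0.5, 0.0, 1.5; only the third sample is misclassified
```

The hinge surrogate upper-bounds the 0-1 loss at the margin, which is why minimizing it controls the misclassification error.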
Usually, the groups can be obtained from prior knowledge [26] or be explored by considering combinations of input variables [11].\nLet each $K^{(j)} : \\mathcal{X}^{(j)} \\times \\mathcal{X}^{(j)} \\to \\mathbb{R}$ be a Mercer kernel and let $\\mathcal{H}_{K^{(j)}}$ be the corresponding RKHS with norm $\\|\\cdot\\|_{K^{(j)}}$. It has been proved in [5] that\n\n$$\\mathcal{H} = \\Big\\{\\sum_{j=1}^d f^{(j)} : f^{(j)} \\in \\mathcal{H}_{K^{(j)}}, \\ 1 \\le j \\le d\\Big\\}$$\n\nwith norm\n\n$$\\|f\\|_K^2 = \\inf\\Big\\{\\sum_{j=1}^d \\|f^{(j)}\\|_{K^{(j)}}^2 : f = \\sum_{j=1}^d f^{(j)}\\Big\\}$$\n\nis an RKHS associated with the additive kernel $K = \\sum_{j=1}^d K^{(j)}$.\nFor any given training set $\\mathbf{z} = \\{(x_i, y_i)\\}_{i=1}^n$, the additive model in $\\mathcal{H}$ can be formulated as:\n\n$$\\bar{f}_{\\mathbf{z}} = \\arg\\min_{f = \\sum_{j=1}^d f^{(j)} \\in \\mathcal{H}} \\Big\\{\\mathcal{E}_{\\mathbf{z}}(f) + \\eta \\sum_{j=1}^d \\tau_j \\|f^{(j)}\\|_{K^{(j)}}^2\\Big\\}, \\quad (3)$$\n\nwhere $\\eta = \\eta(n)$ is a positive regularization parameter and $\\{\\tau_j\\}$ are positive bounded weights for the different variable groups.\nThe solution $\\bar{f}_{\\mathbf{z}}$ in (3) has the following representation:\n\n$$\\bar{f}_{\\mathbf{z}}(x) = \\sum_{j=1}^d \\bar{f}_{\\mathbf{z}}^{(j)}(x^{(j)}) = \\sum_{j=1}^d \\sum_{i=1}^n \\bar{\\alpha}_{\\mathbf{z},i}^{(j)} y_i K^{(j)}(x_i^{(j)}, x^{(j)}), \\quad \\bar{\\alpha}_{\\mathbf{z},i}^{(j)} \\in \\mathbb{R}, \\ 1 \\le i \\le n, \\ 1 \\le j \\le d. \\quad (4)$$\n\nObserve that $\\bar{f}_{\\mathbf{z}}^{(j)}(x) \\equiv 0$ is equivalent to $\\bar{\\alpha}_{\\mathbf{z},i}^{(j)} = 0$ for all $i$. Hence, we expect $\\|\\bar{\\alpha}_{\\mathbf{z}}^{(j)}\\|_2 = 0$ for $\\bar{\\alpha}_{\\mathbf{z}}^{(j)} = (\\bar{\\alpha}_{\\mathbf{z},1}^{(j)}, \\cdots, \\bar{\\alpha}_{\\mathbf{z},n}^{(j)})^T \\in \\mathbb{R}^n$ if the $j$-th variable group is not truly informative. This motivates us to consider the sparsity-induced penalty:\n\n$$\\Omega(f) = \\inf\\Big\\{\\sum_{j=1}^d \\tau_j \\|\\alpha^{(j)}\\|_2 : f = \\sum_{j=1}^d \\sum_{i=1}^n \\alpha_i^{(j)} y_i K^{(j)}(x_i^{(j)}, \\cdot)\\Big\\}.$$\n\nThis group sparse penalty aims at variable selection [27] and was introduced into the additive regression model in [26].\nInspired by learning with data dependent hypothesis spaces [19], we introduce the following hypothesis space associated with the training samples $\\mathbf{z}$:\n\n$$\\mathcal{H}_{\\mathbf{z}} = \\Big\\{f = \\sum_{j=1}^d f^{(j)} : f^{(j)} \\in \\mathcal{H}_{\\mathbf{z}}^{(j)}\\Big\\},$$\n\nwhere\n\n$$\\mathcal{H}_{\\mathbf{z}}^{(j)} = \\Big\\{f^{(j)} = \\sum_{i=1}^n \\alpha_i^{(j)} K^{(j)}(x_i^{(j)}, \\cdot) : \\alpha_i^{(j)} \\in \\mathbb{R}\\Big\\}.$$\n\nUnder the group sparse penalty and data dependent hypothesis space, the group sparse additive machine (GroupSAM) can be written as:\n\n$$f_{\\mathbf{z}} = \\arg\\min_{f \\in \\mathcal{H}_{\\mathbf{z}}} \\Big\\{\\frac{1}{n}\\sum_{i=1}^n (1 - y_i f(x_i))_+ + \\lambda\\,\\Omega(f)\\Big\\}, \\quad (5)$$\n\nwhere $\\lambda > 0$ is a regularization parameter.\nLet us denote $\\alpha^{(j)} = (\\alpha_1^{(j)}, \\cdots, \\alpha_n^{(j)})^T$ and $K_i^{(j)} = (K^{(j)}(x_1^{(j)}, x_i^{(j)}), \\cdots, K^{(j)}(x_n^{(j)}, x_i^{(j)}))^T$. The GroupSAM in (5) can be rewritten as:\n\n$$f_{\\mathbf{z}} = \\sum_{j=1}^d f_{\\mathbf{z}}^{(j)} = \\sum_{j=1}^d \\sum_{t=1}^n \\alpha_{\\mathbf{z},t}^{(j)} K^{(j)}(x_t^{(j)}, \\cdot),$$\n\nwith\n\n$$\\{\\alpha_{\\mathbf{z}}^{(j)}\\} = \\arg\\min_{\\alpha^{(j)} \\in \\mathbb{R}^n, \\, 1 \\le j \\le d} \\Big\\{\\frac{1}{n}\\sum_{i=1}^n \\Big(1 - y_i \\sum_{j=1}^d (K_i^{(j)})^T \\alpha^{(j)}\\Big)_+ + \\lambda \\sum_{j=1}^d \\tau_j \\|\\alpha^{(j)}\\|_2\\Big\\}. \\quad (6)$$\n\nThe formulation (6) transforms the function-based learning problem (5) into a coefficient-based learning problem in a finite dimensional vector space. 
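The non-smooth group penalty $\lambda \sum_j \tau_j \|\alpha^{(j)}\|_2$ in (6) is what makes a proximal gradient method attractive: its proximal operator is block soft-thresholding, applied group by group, and it zeroes out entire coefficient blocks for uninformative groups. A minimal sketch of that operator (illustrative names; the paper's actual solver is the accelerated proximal gradient method developed in [30]):

```python
import numpy as np

def group_soft_threshold(alpha, thresh):
    # Proximal operator of thresh * ||alpha||_2 (block soft-thresholding):
    # the whole block is shrunk toward zero and vanishes if its norm
    # is below the threshold -- this is what produces group sparsity.
    norm = np.linalg.norm(alpha)
    if norm <= thresh:
        return np.zeros_like(alpha)
    return (1.0 - thresh / norm) * alpha

def prox_l21(alphas, lam, taus):
    # One proximal step for lam * sum_j taus[j] * ||alpha^(j)||_2,
    # applied independently to each group's coefficient vector.
    return [group_soft_threshold(a, lam * t) for a, t in zip(alphas, taus)]
```

In a full solver, each iteration would take a gradient step on the smooth (smoothed-hinge) data-fit term and then apply `prox_l21` to the per-group coefficient vectors.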
The solution of (5) is spanned naturally by the kernelized functions $\\{K^{(j)}(\\cdot, x_i^{(j)})\\}$, rather than by B-spline basis functions [30]. When $d = 1$, our GroupSAM model degenerates to the special case which includes the LPSVM loss and the sparsity regularization term. Compared with LPSVM [22, 24] and SVM with additive kernels [5], our GroupSAM model imposes sparsity on variable groups to improve the prediction interpretation of the additive classification model.\nFor given $\\{\\tau_j\\}$, the optimization problem of GroupSAM can be solved efficiently via the accelerated proximal gradient descent algorithm developed in [30]. Due to space limitations, we do not recall the optimization algorithm here.\n\n3 Generalization error bound\n\nIn this section, we derive an estimate of the excess misclassification error $R(\\mathrm{sgn}(f_{\\mathbf{z}})) - R(f_c)$. Before providing the main theoretical result, we introduce some necessary assumptions for the learning theory analysis.\nAssumption A. The intrinsic distribution $\\rho$ on $Z := \\mathcal{X} \\times \\mathcal{Y}$ satisfies the Tsybakov noise condition with exponent $0 \\le q \\le \\infty$. That is, for some $q \\in [0, \\infty)$ and $\\Delta > 0$,\n\n$$\\rho_{\\mathcal{X}}\\Big(\\{x \\in \\mathcal{X} : |\\mathrm{Prob}(y=1|x) - \\mathrm{Prob}(y=-1|x)| \\le \\Delta t\\}\\Big) \\le t^q, \\quad \\forall t > 0. \\quad (7)$$\n\nThe Tsybakov noise condition was proposed in [21] and has been used extensively for the theoretical analysis of classification algorithms [24, 7, 23, 20]. Indeed, (7) holds with exponent $q = 0$ for any distribution and with $q = \\infty$ for well separated classes.\nNow we introduce the empirical covering numbers [8] to measure the capacity of the hypothesis space.\nDefinition 1 Let $\\mathcal{F}$ be a set of functions on $Z$ with $\\mathbf{u} = \\{u_i\\}_{i=1}^k \\subset Z$. Define the $\\ell_2$-empirical metric as $\\ell_{2,\\mathbf{u}}(f, g) = \\big\\{\\frac{1}{k}\\sum_{t=1}^k (f(u_t) - g(u_t))^2\\big\\}^{1/2}$. The covering number of $\\mathcal{F}$ with the $\\ell_2$-empirical metric is defined as $\\mathcal{N}_2(\\mathcal{F}, \\varepsilon) = \\sup_{n \\in \\mathbb{N}} \\sup_{\\mathbf{u} \\in \\mathcal{X}^n} \\mathcal{N}_{2,\\mathbf{u}}(\\mathcal{F}, \\varepsilon)$, where\n\n$$\\mathcal{N}_{2,\\mathbf{u}}(\\mathcal{F}, \\varepsilon) = \\inf\\Big\\{l \\in \\mathbb{N} : \\exists \\{f_i\\}_{i=1}^l \\subset \\mathcal{F} \\ \\mathrm{s.t.} \\ \\mathcal{F} = \\bigcup_{i=1}^l \\{f \\in \\mathcal{F} : \\ell_{2,\\mathbf{u}}(f, f_i) \\le \\varepsilon\\}\\Big\\}.$$\n\nLet $B_r = \\{f \\in \\mathcal{H}_K : \\|f\\|_K \\le r\\}$ and $B_r^{(j)} = \\{f^{(j)} \\in \\mathcal{H}_{K^{(j)}} : \\|f^{(j)}\\|_{K^{(j)}} \\le r\\}$.\nAssumption B. Assume that $\\kappa = \\sum_{j=1}^d \\sup_{x^{(j)}} \\sqrt{K^{(j)}(x^{(j)}, x^{(j)})} < \\infty$ and, for some $s \\in (0, 2)$ and $c_s > 0$,\n\n$$\\log \\mathcal{N}_2(B_1^{(j)}, \\varepsilon) \\le c_s \\varepsilon^{-s}, \\quad \\forall \\varepsilon > 0, \\ j \\in \\{1, ..., d\\}.$$\n\nIt has been asserted in [6] that under Assumption B the following holds:\n\n$$\\log \\mathcal{N}_2(B_1, \\varepsilon) \\le c_s d^{1+s} \\varepsilon^{-s}, \\quad \\forall \\varepsilon > 0.$$\n\nIt is worth noticing that the empirical covering number has been studied extensively in the learning theory literature [8, 20]. Detailed examples are provided in Theorem 2 of [19], Lemma 3 of [18], and Examples 1 and 2 of [9]. The capacity condition of the additive hypothesis space depends only on the dimension of the subspaces $\\mathcal{X}^{(j)}$. 
When $K^{(j)} \\in C^{\\nu}(\\mathcal{X}^{(j)} \\times \\mathcal{X}^{(j)})$ for every $j \\in \\{1, \\cdots, d\\}$, the theoretical analysis in [19] assures that Assumption B holds true with:\n\n$$s = \\begin{cases} \\frac{2d_0}{d_0 + 2\\nu}, & \\nu \\in (0, 1]; \\\\ \\frac{2d_0}{d_0 + \\nu}, & \\nu \\in [1, 1 + d_0/2]; \\\\ \\frac{d_0}{\\nu}, & \\nu \\in (1 + d_0/2, \\infty). \\end{cases}$$\n\nHere $d_0$ denotes the maximum dimension among the $\\{\\mathcal{X}^{(j)}\\}$.\nWith respect to (3), we introduce the data-free regularized function $f_{\\eta}$ defined by:\n\n$$f_{\\eta} = \\arg\\min_{f = \\sum_{j=1}^d f^{(j)} \\in \\mathcal{H}} \\Big\\{\\mathcal{E}(f) + \\eta \\sum_{j=1}^d \\tau_j \\|f^{(j)}\\|_{K^{(j)}}^2\\Big\\}. \\quad (8)$$\n\nInspired by the analysis in [6], we define:\n\n$$D(\\eta) = \\mathcal{E}(f_{\\eta}) - \\mathcal{E}(f_c) + \\eta \\sum_{j=1}^d \\tau_j \\|f_{\\eta}^{(j)}\\|_{K^{(j)}}^2 \\quad (9)$$\n\nas the approximation error, which reflects the learning ability of the hypothesis space $\\mathcal{H}$ under the Tikhonov regularization scheme.\nThe following approximation condition has been studied and used extensively for classification problems, e.g., in [3, 7, 24, 23]. Please see Examples 3 and 4 in [3] for the explicit version for the Sobolev kernel and the Gaussian kernel induced reproducing kernel Hilbert spaces.\nAssumption C. There exist an exponent $\\beta \\in (0, 1)$ and a positive constant $c_{\\beta}$ such that:\n\n$$D(\\eta) \\le c_{\\beta}\\eta^{\\beta}, \\quad \\forall \\eta > 0.$$\n\nNow we introduce our main theoretical result on the generalization bound as follows.\n\nTheorem 1 Let $0 < \\min_j \\tau_j \\le \\max_j \\tau_j \\le c_0 < \\infty$ and let Assumptions A-C hold true. Take $\\lambda = n^{-\\theta}$ in (5) for $0 < \\theta \\le \\min\\{\\frac{2-s}{2s}, \\frac{3+5\\beta}{2-2\\beta}\\}$. For any $\\delta \\in (0, 1)$, there exists a constant $C$ independent of $n$ and $\\delta$ such that\n\n$$R(\\mathrm{sgn}(f_{\\mathbf{z}})) - R(f_c) \\le C \\log(3/\\delta)\\, n^{-\\vartheta}$$\n\nwith confidence $1 - \\delta$, where\n\n$$\\vartheta = \\min\\Big\\{\\frac{q+1}{q+2}, \\ \\frac{\\beta(2\\theta+1)}{2\\beta+2}, \\ \\frac{(q+1)(2-s-2s\\theta)}{4+2q+sq}, \\ \\frac{3+5\\beta+2\\beta\\theta-2\\theta}{4+4\\beta}\\Big\\}.$$\n\nTheorem 1 demonstrates that GroupSAM in (5) can achieve a convergence rate with polynomial decay under mild conditions on the hypothesis function space. When $q \\to \\infty$, $\\beta \\to 1$, and each $K^{(j)} \\in C^{\\infty}$, the error decay rate of GroupSAM can be arbitrarily close to $O(n^{-\\min\\{1, \\frac{1+2\\theta}{4}\\}})$. Hence, the fast convergence rate $O(n^{-1})$ can be obtained under proper selections of the parameters. To verify the optimality of the bound, we would need to provide a lower bound for the excess misclassification error. This is beyond the main focus of this paper and we leave it for future study.\nAdditionally, the consistency of GroupSAM is guaranteed as the number of training samples increases.\nCorollary 1 Under the conditions of Theorem 1, $R(\\mathrm{sgn}(f_{\\mathbf{z}})) - R(f_c) \\to 0$ as $n \\to \\infty$.\nTo better understand our theoretical result, we compare it with related works below:\n\n1) Compared with group sparse additive models. Although the asymptotic theory of group sparse additive models has been well studied in [15, 10, 26], all of these works only consider the regression task under the mean square error criterion and basis function expansion. Due to the kernel-based component functions and the non-smooth hinge loss, the previous analysis cannot be extended to GroupSAM directly.\n2) Compared with classification with additive models. 
In [30], a convergence rate is presented for the sparse additive machine (SAM), where the input space $\\mathcal{X}$ is divided into $p$ subspaces directly, without considering the interactions among variables. Different from the sparsity on variable groups in this paper, SAM is based on a sparse representation over an orthonormal basis, similar to [15]. In [5], the consistency of SVM with additive kernels is established, where the kernel-norm regularizer is used. However, sparsity on variables and the learning rate are not investigated in these previous articles.\n3) Compared with related analysis techniques. While the analysis technique used here is inspired by [24, 23], it is the first exploration for an additive classification model with group sparsity. In particular, the hypothesis error analysis extends the stepping stone technique from the $\\ell_1$-norm regularizer to the group sparse $\\ell_{2,1}$-norm regularizer. Our analysis technique can also be applied to other additive models. For example, we can extend the shrunk additive regression model in [11] to the sparse classification setting and investigate its generalization bound with the current technique.\n\nProof sketches of Theorem 1\n\nTo get a tight error estimate, we introduce the clipping operator $\\pi(f)(x) = \\max\\{-1, \\min\\{f(x), 1\\}\\}$, which has been widely used in the learning theory literature, such as [7, 20, 24, 23]. 
Since $R(\\mathrm{sgn}(f_{\\mathbf{z}})) - R(f_c)$ can be bounded by $\\mathcal{E}(\\pi(f_{\\mathbf{z}})) - \\mathcal{E}(f_c)$, we focus on bounding the excess convex risk.\nUsing $f_{\\eta}$ as an intermediate function, we can obtain the following error decomposition.\n\nProposition 1 For $f_{\\mathbf{z}}$ defined in (5), there holds\n\n$$R(\\mathrm{sgn}(f_{\\mathbf{z}})) - R(f_c) \\le \\mathcal{E}(\\pi(f_{\\mathbf{z}})) - \\mathcal{E}(f_c) \\le E_1 + E_2 + E_3 + D(\\eta),$$\n\nwhere $D(\\eta)$ is defined in (9),\n\n$$E_1 = \\mathcal{E}(\\pi(f_{\\mathbf{z}})) - \\mathcal{E}(f_c) - \\big(\\mathcal{E}_{\\mathbf{z}}(\\pi(f_{\\mathbf{z}})) - \\mathcal{E}_{\\mathbf{z}}(f_c)\\big),$$\n\n$$E_2 = \\mathcal{E}_{\\mathbf{z}}(f_{\\eta}) - \\mathcal{E}_{\\mathbf{z}}(f_c) - \\big(\\mathcal{E}(f_{\\eta}) - \\mathcal{E}(f_c)\\big),$$\n\nand\n\n$$E_3 = \\mathcal{E}_{\\mathbf{z}}(\\pi(f_{\\mathbf{z}})) + \\lambda\\Omega(f_{\\mathbf{z}}) - \\Big(\\mathcal{E}_{\\mathbf{z}}(f_{\\eta}) + \\eta \\sum_{j=1}^d \\tau_j \\|f_{\\eta}^{(j)}\\|_{K^{(j)}}^2\\Big).$$\n\nIn the learning theory literature, $E_1 + E_2$ is called the sample error and $E_3$ is called the hypothesis error. Detailed proofs for these error terms are provided in the supplementary materials.\nThe upper bound of the hypothesis error demonstrates that the divergence induced by the regularization and the hypothesis space tends to zero as $n \\to \\infty$ under properly selected parameters. To estimate the hypothesis error $E_3$, we choose $\\bar{f}_{\\mathbf{z}}$ as the stepping stone function to bridge $\\mathcal{E}_{\\mathbf{z}}(\\pi(f_{\\mathbf{z}})) + \\lambda\\Omega(f_{\\mathbf{z}})$ and $\\mathcal{E}_{\\mathbf{z}}(f_{\\eta}) + \\eta\\sum_{j=1}^d \\tau_j\\|f_{\\eta}^{(j)}\\|_{K^{(j)}}^2$. The proof is inspired by the stepping stone technique for support vector machine classification [24]. Notice that our analysis is associated with the $\\ell_{2,1}$-norm regularizer, while the previous analysis focuses on the $\\ell_1$-norm regularization.\nThe error term $E_1$ reflects the divergence between the expected excess risk $\\mathcal{E}(\\pi(f_{\\mathbf{z}})) - \\mathcal{E}(f_c)$ and the empirical excess risk $\\mathcal{E}_{\\mathbf{z}}(\\pi(f_{\\mathbf{z}})) - \\mathcal{E}_{\\mathbf{z}}(f_c)$. Since $f_{\\mathbf{z}}$ depends on the given sample $\\mathbf{z} = \\{(x_i, y_i)\\}_{i=1}^n$, we introduce the concentration inequality in [23] to bound $E_1$. 
We also bound the error term $E_2$ in terms of the one-sided Bernstein inequality [7].\n\n4 Experiments\n\nTo evaluate the performance of our proposed GroupSAM model, we compare our model with the following methods: SVM (linear SVM with $\\ell_2$-norm regularization), L1SVM (linear SVM with $\\ell_1$-norm regularization), GaussianSVM (nonlinear SVM using a Gaussian kernel), SAM (Sparse Additive Machine) [30], and GroupSpAM (Group Sparse Additive Models) [26], which is adapted to the classification setting.\n\nTable 2: Classification accuracy comparison on the synthetic data. The upper half shows the results with 24 feature groups, while the lower half corresponds to the results with 300 feature groups. The table shows the average classification accuracy and the standard deviation in 2-fold cross validation.\n\n           SVM           GaussianSVM   L1SVM         SAM           GroupSpAM     GroupSAM\n\u03c3 = 0.8    0.943\u00b10.011   0.935\u00b10.028   0.925\u00b10.035   0.895\u00b10.021   0.880\u00b10.021   0.953\u00b10.018\n\u03c3 = 0.85   0.943\u00b10.004   0.938\u00b10.011   0.938\u00b10.004   0.783\u00b10.088   0.868\u00b10.178   0.945\u00b10.000\n\u03c3 = 0.9    0.935\u00b10.014   0.925\u00b10.007   0.938\u00b10.011   0.853\u00b10.117   0.883\u00b10.011   0.945\u00b10.007\n\u03c3 = 0.8    0.975\u00b10.035   0.975\u00b10.035   0.975\u00b10.035   0.700\u00b10.071   0.275\u00b10.106   1.000\u00b10.000\n\u03c3 = 0.85   0.975\u00b10.035   0.975\u00b10.035   0.975\u00b10.035   0.600\u00b10.141   0.953\u00b10.004   1.000\u00b10.000\n\u03c3 = 0.9    0.975\u00b10.035   0.975\u00b10.035   0.975\u00b10.035   0.525\u00b10.035   0.983\u00b10.004   1.000\u00b10.000\n\nAs for the evaluation metric, we calculate the classification accuracy, i.e., the percentage of correctly labeled samples in the prediction. In the comparison, we adopt 2-fold cross validation and report the average performance of each method.\nWe implement SVM, L1SVM and GaussianSVM using the LIBSVM toolbox [2]. 
We determine the hyper-parameters of all models, i.e., the parameter $C$ of SVM, L1SVM and GaussianSVM, the parameter $\\lambda$ of SAM, the parameter $\\lambda$ of GroupSpAM, and the parameter $\\lambda$ in Eq. (6) of GroupSAM, over the range $\\{10^{-3}, 10^{-2}, \\ldots, 10^{3}\\}$. We tune the hyper-parameters via 2-fold cross validation on the training data and report the result for the best parameter w.r.t. classification accuracy for each method. In the accelerated proximal gradient descent algorithm for both SAM and GroupSAM, we set $\\mu = 0.5$ and the maximum number of iterations to 2000.\n\n4.1 Performance comparison on synthetic data\n\nWe first examine the classification performance on synthetic data as a sanity check. Our synthetic data is randomly generated as a mixture of Gaussian distributions. In each class, data points are sampled i.i.d. from a multivariate Gaussian distribution with covariance $\\sigma I$, where $I$ is the identity matrix. This setting gives independent covariates. We set the number of classes to 4, the number of samples to 400, and the number of dimensions to 24. We set the value of $\\sigma$ in the range $\\{0.8, 0.85, 0.9\\}$. Following the experimental setup in [31], we make three replicates of each feature in the data to form 24 feature groups (each group has three replicated features). We randomly pick 6 feature groups to generate the data so that we can evaluate the capability of GroupSAM in identifying truly useful feature groups. To make the classification task more challenging, we add random noise drawn from the uniform distribution $U(0, \\theta)$, where $\\theta$ is 0.8 times the maximum value in the data. In addition, we test a high-dimensional case by generating 300 feature groups (i.e., a total of 900 features) with 40 samples in a similar way.\nWe summarize the classification performance comparison on the synthetic data in Table 2. 
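The grouped-feature construction described above can be sketched as follows (a rough illustration; the class means, random seed, and exact noise application are assumptions, not the paper's exact protocol):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_grouped_synthetic(n_per_class=100, n_classes=4, p=24,
                           sigma=0.8, replicates=3):
    # One Gaussian blob per class with covariance sigma * I
    # (independent covariates); class means are arbitrary here.
    means = rng.normal(scale=2.0, size=(n_classes, p))
    X = np.vstack([rng.multivariate_normal(m, sigma * np.eye(p), n_per_class)
                   for m in means])
    y = np.repeat(np.arange(n_classes), n_per_class)
    # Replicate each feature to form p groups of `replicates` copies each.
    X = np.repeat(X, replicates, axis=1)
    groups = np.repeat(np.arange(p), replicates)
    # Additive uniform noise U(0, theta), theta = 0.8 * max value in the data.
    theta = 0.8 * X.max()
    X = X + rng.uniform(0.0, theta, size=X.shape)
    return X, y, groups
```

With the defaults this yields 400 samples, 24 groups of 3 identical (pre-noise) columns, matching the dimensions reported for the first synthetic setting.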
From the experimental results, we notice that GroupSAM outperforms the other approaches under all settings. This comparison verifies the validity of our method. We can see that GroupSAM significantly improves the performance of SAM, which shows that the incorporation of group information is indeed beneficial for classification. Moreover, we can notice the superiority of GroupSAM over GroupSpAM, which illustrates that our GroupSAM model is more suitable for classification. We also present the comparison of feature groups in Table 3. For illustration purposes, we use the case with 24 feature groups as an example. Table 3 shows that the feature groups identified by GroupSAM are exactly the same as the ground-truth feature groups used for synthetic data generation. Such results further demonstrate the effectiveness of the GroupSAM method, from which we know that GroupSAM is able to select the truly informative feature groups and thus improve the classification performance.\n\nTable 3: Comparison between the true feature group IDs (used for data generation) and the feature group IDs selected by our GroupSAM method on the synthetic data. The order of the true feature group IDs does not represent the order of importance.\n\n           True Feature Group IDs   Selected Feature Group IDs via GroupSAM\n\u03c3 = 0.8    2,3,4,8,10,17            3,10,17,8,2,4\n\u03c3 = 0.85   1,5,10,12,17,21          5,12,17,21,1,10\n\u03c3 = 0.9    2,6,7,9,12,22            6,22,7,9,2,12\n\n4.2 Performance comparison on benchmark data\n\nIn this subsection, we use 7 benchmark datasets from the UCI repository [12] to compare the classification performance of the different methods. The 7 benchmark datasets are: Ecoli, Indians Diabetes, Breast Cancer, Stock, Balance Scale, Contraceptive Method Choice (CMC) and Fertility. Similar to the settings for the synthetic data, we construct feature groups by replicating each feature 3 times. In each feature group, we add random noise drawn from the uniform distribution $U(0, \\theta)$, where $\\theta$ is 0.3 times the maximum value in each dataset.\n\nTable 4: Classification accuracy comparison on the benchmark data. The table shows the average classification accuracy and the standard deviation in 2-fold cross validation.\n\n                  SVM           GaussianSVM   L1SVM         SAM           GroupSpAM     GroupSAM\nEcoli             0.815\u00b10.054   0.818\u00b10.049   0.711\u00b10.051   0.816\u00b10.039   0.771\u00b10.009   0.839\u00b10.028\nIndians Diabetes  0.651\u00b10.000   0.652\u00b10.002   0.638\u00b10.018   0.652\u00b10.000   0.643\u00b10.004   0.660\u00b10.013\nBreast Cancer     0.965\u00b10.017   0.833\u00b10.008   0.833\u00b10.224   0.958\u00b10.027   0.966\u00b10.014   0.968\u00b10.017\nStock             0.913\u00b10.001   0.911\u00b10.002   0.873\u00b10.001   0.617\u00b10.005   0.875\u00b10.005   0.917\u00b10.005\nBalance Scale     0.864\u00b10.003   0.869\u00b10.004   0.870\u00b10.003   0.763\u00b10.194   0.848\u00b10.003   0.893\u00b10.003\nCMC               0.420\u00b10.011   0.445\u00b10.015   0.437\u00b10.014   0.427\u00b10.000   0.433\u00b10.003   0.456\u00b10.003\nFertility         0.880\u00b10.000   0.880\u00b10.000   0.750\u00b10.184   0.860\u00b10.028   0.780\u00b10.000   0.880\u00b10.000\n\nWe display the comparison results in Table 4. We find that GroupSAM performs equal to or better than the compared methods on all benchmark datasets. Compared with SVM and L1SVM, our method uses an additive model to incorporate nonlinearity and is thus more appropriate for finding complex decision boundaries. Moreover, the comparison with GaussianSVM and SAM illustrates that, by involving the group information in classification, GroupSAM makes better use of the structure information among features such that the classification ability can be enhanced. 
Compared with GroupSpAM, our GroupSAM model is formulated in data-dependent hypothesis spaces and employs the hinge loss in its objective, and is thus better suited to classification.

5 Conclusion

In this paper, we proposed a novel group sparse additive machine (GroupSAM) by incorporating group sparsity into the additive classification model in a reproducing kernel Hilbert space. By developing an error analysis technique for data-dependent hypothesis spaces, we obtained a generalization error bound for the proposed GroupSAM, which demonstrates that our model can achieve a satisfactory learning rate under mild conditions. Experimental results on both synthetic and real-world benchmark datasets validate the effectiveness of the algorithm and support our learning theory analysis. In the future, it would be interesting to investigate the learning performance of robust group sparse additive machines with loss functions induced by quantile regression [6, 14].

Acknowledgments

This work was partially supported by U.S. NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, NSF-IIS 1619308, NSF-IIS 1633753, and NIH AG049371. Hong Chen was partially supported by National Natural Science Foundation of China (NSFC) 11671161. We are grateful to the anonymous NIPS reviewers for their insightful comments.

References

[1] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. J. Amer. Statist. Assoc., 101(473):138–156, 2006.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(27):1–27, 2011.

[3] D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou. Support vector machine soft margin classifiers: error analysis. J. Mach. Learn. Res., 5:1143–1175, 2004.

[4] H. Chen, Z. Pan, L. Li, and Y. Tang. Learning rates of coefficient-based regularized classifier for density level detection.
Neural Comput., 25(4):1107–1121, 2013.

[5] A. Christmann and R. Hable. Consistency of support vector machines using additive kernels for additive models. Comput. Stat. Data Anal., 56:854–873, 2012.

[6] A. Christmann and D. X. Zhou. Learning rates for the risk of kernel-based quantile regression estimators in additive models. Anal. Appl., 14(3):449–477, 2016.

[7] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge Univ. Press, Cambridge, U.K., 2007.

[8] D. Edmunds and H. Triebel. Function Spaces, Entropy Numbers, Differential Operators. Cambridge Univ. Press, Cambridge, U.K., 1996.

[9] Z. Guo and D. X. Zhou. Concentration estimates for learning with unbounded sampling. Adv. Comput. Math., 38(1):207–223, 2013.

[10] J. Huang, J. Horowitz, and F. Wei. Variable selection in nonparametric additive models. Ann. Statist., 38(4):2282–2313, 2010.

[11] K. Kandasamy and Y. Yu. Additive approximation in high dimensional nonparametric regression via the SALSA. In ICML, 2016.

[12] M. Lichman. UCI machine learning repository, 2013.

[13] Y. Lin and H. H. Zhang. Component selection and smoothing in smoothing spline analysis of variance models. Ann. Statist., 34(5):2272–2297, 2006.

[14] S. Lv, H. Lin, H. Lian, and J. Huang. Oracle inequalities for sparse additive quantile regression in reproducing kernel Hilbert space. Ann. Statist., preprint, 2017.

[15] L. Meier, S. van de Geer, and P. Bühlmann. High-dimensional additive modeling. Ann. Statist., 37(6B):3779–3821, 2009.

[16] G. Raskutti, M. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13:389–427, 2012.

[17] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. J. Royal. Statist. Soc. B, 71:1009–1030, 2009.

[18] L. Shi.
Learning theory estimates for coefficient-based regularized regression. Appl. Comput. Harmon. Anal., 34(2):252–265, 2013.

[19] L. Shi, Y. Feng, and D. X. Zhou. Concentration estimates for learning with ℓ1-regularizer and data-dependent hypothesis spaces. Appl. Comput. Harmon. Anal., 31(2):286–302, 2011.

[20] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.

[21] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135–166, 2004.

[22] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.

[23] Q. Wu, Y. Ying, and D. X. Zhou. Multi-kernel regularized classifiers. J. Complexity, 23:108–134, 2007.

[24] Q. Wu and D. X. Zhou. SVM soft margin classifiers: linear programming versus quadratic programming. Neural Comput., 17:1160–1187, 2005.

[25] L. Yang, S. Lv, and J. Wang. Model-free variable selection in reproducing kernel Hilbert space. J. Mach. Learn. Res., 17:1–24, 2016.

[26] J. Yin, X. Chen, and E. Xing. Group sparse additive models. In ICML, 2012.

[27] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Royal. Statist. Soc. B, 68(1):49–67, 2006.

[28] M. Yuan and D. X. Zhou. Minimax optimal rates of estimation in high dimensional additive models. Ann. Statist., 44(6):2564–2593, 2016.

[29] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist., 32:56–85, 2004.

[30] T. Zhao and H. Liu. Sparse additive machine. In AISTATS, 2012.

[31] L. W. Zhong and J. T. Kwok. Efficient sparse modeling with automatic feature grouping.
In ICML, 2011.