{"title": "Hierarchical Penalization", "book": "Advances in Neural Information Processing Systems", "page_first": 1457, "page_last": 1464, "abstract": null, "full_text": "Hierarchical Penalization\n\nMarie Szafranski 1, Yves Grandvalet 1, 2 and Pierre Morizet-Mahoudeaux 1\n\nHeudiasyc 1, UMR CNRS 6599\n\nUniversit\u00b4e de Technologie de Compi`egne\n\nBP 20529, 60205 Compi`egne Cedex, France\n\nIDIAP Research Institute 2\n\nAv. des Pr\u00b4es-Beudin 20\n\nP.O. Box 592, 1920 Martigny, Switzerland\nmarie.szafranski@hds.utc.fr\n\nAbstract\n\nHierarchical penalization is a generic framework for incorporating prior informa-\ntion in the \ufb01tting of statistical models, when the explicative variables are organized\nin a hierarchical structure. The penalizer is a convex functional that performs soft\nselection at the group level, and shrinks variables within each group. This favors\nsolutions with few leading terms in the \ufb01nal combination. The framework, orig-\ninally derived for taking prior knowledge into account, is shown to be useful in\nlinear regression, when several parameters are used to model the in\ufb02uence of one\nfeature, or in kernel regression, for learning multiple kernels.\nKeywords \u2013 Optimization: constrained and convex optimization. Supervised\nlearning: regression, kernel methods, sparsity and feature selection.\n\n1 Introduction\n\nIn regression, we want to explain or to predict a response variable y from a set of explanatory\nvariables x = (x1, . . . , xj, . . . , xd), where y \u2208 R and \u2200j, xj \u2208 R. 
For this purpose, we use a model such that y = f(x) + ε, where f is a function able to characterize y when x is observed, and ε is a residual error.

Supervised learning consists in estimating f from the available training dataset S = {(x_i, y_i)}_{i=1}^{n}. It can be achieved in a predictive or a descriptive perspective: to predict accurate responses for future observations, or to show the correlations that exist between the set of explanatory variables and the response variable, and thus give an interpretation to the model.

In the linear case, the function f applies an estimate β = (β1, ..., βj, ..., βd)^t to x, that is, f(x) = xβ. In a predictive perspective, xβ produces an estimate of y for any observation x. In a descriptive perspective, |βj| can be interpreted as a degree of relevance of variable xj.

Ordinary Least Squares (OLS) minimizes the sum of the residual squared errors. When the explanatory variables are numerous and many of them are correlated, the variability of the OLS estimate tends to increase. This leads to reduced prediction accuracy, and an interpretation of the model becomes tricky.

Coefficient shrinkage is a major approach of regularization procedures in linear regression models. It overcomes the drawbacks described above by adding a constraint on the norm of the estimate β. According to the chosen norm, coefficients associated to variables with little predictive information may be shrunk, or even removed when variables are irrelevant. This latter case is referred to as variable selection. 
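The contrast between shrinking and removing coefficients is easiest to see in the orthonormal-design case, where ridge regression and the lasso both have closed forms: ridge rescales the OLS coefficients, while the lasso soft-thresholds them. A minimal NumPy sketch, with made-up OLS coefficients and a made-up regularization level α (not taken from the paper):

```python
import numpy as np

# On an orthonormal design (X^t X = I), both regularized estimators have
# closed forms in terms of the OLS coefficients. Values below are made up.
beta_ols = np.array([3.0, 0.4, -0.05])

alpha = 0.5
beta_ridge = beta_ols / (1.0 + alpha)                              # shrunk, never exactly zero
beta_lasso = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

print(beta_ridge)   # all coefficients survive, just rescaled
print(beta_lasso)   # small coefficients are removed (set exactly to zero)
```

The lasso output keeps only the leading coefficient (2.5) and zeroes the two small ones, which is exactly the "variable selection" behaviour described above.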
In particular, ridge regression shrinks coefficients with regard to the ℓ2-norm, while the lasso (Least Absolute Shrinkage and Selection Operator) [1] and the lars (Least Angle Regression Stepwise) [2] both shrink and remove coefficients using the ℓ1-norm.

Figure 1: left: toy-example of the original structure of variables; right: equivalent tree structure considered for the formalization of the scaling problem. (In the tree, branches from the root to the nodes k are labelled σ1,k, branches reaching the leaves x1, ..., x6 are labelled σ2,j, and the leaves are partitioned into groups J1, J2, J3.)

In some applications, explanatory variables that share a similar characteristic can be gathered into groups, or factors. Sometimes, they can be organized hierarchically. For instance, in genomics, where explanatory variables are (products of) genes, some factors can be identified from the prior information available in the hierarchies of Gene Ontology. Then, it becomes necessary to find methods that retain meaningful factors instead of individual variables.

Group-lasso and group-lars [3] can be considered as hierarchical penalization methods, with trees of height two defining the hierarchies. They perform variable selection by encouraging sparseness over predefined factors. These techniques seem perfectible in the sense that hierarchies can be extended to more than two levels, and sparseness can be integrated within groups. 
This paper proposes a penalizer, derived from an adaptive penalization formulation [4], that highlights factors of interest by balancing constraints on each element, at each level of a hierarchy. It performs soft selection at the factor level, and shrinks variables within groups, to favor solutions with few leading terms.

Section 2 introduces the framework of hierarchical penalization, and the associated algorithm is presented in Section 3. Section 4 shows how this framework can be applied to linear and kernel regression. We conclude with an outline of future work.

2 Hierarchical Penalization

2.1 Formalization

We introduce hierarchical penalization by considering problems where the variables are organized in a tree structure of height two, such as the example displayed in figure 1. The nodes of height one are labelled in {1, ..., K}. The set of children (that is, leaves) of node k is denoted J_k and its cardinality is d_k. As displayed on the right-hand side of figure 1, a branch stemming from the root and going to node k is labelled by σ_{1,k}, and the branch reaching leaf j is labelled by σ_{2,j}.

We consider the problem of minimizing a differentiable loss function L(·), subject to sparseness constraints on β and the subsets of β defined in a tree hierarchy. This reads

  min_{β,σ}  L(β) + λ \sum_{k=1}^{K} \sum_{j \in J_k} \frac{β_j^2}{\sqrt{σ_{1,k}\, σ_{2,j}}}   (1a)

  subject to  \sum_{k=1}^{K} d_k σ_{1,k} = 1 ,  \sum_{j=1}^{d} σ_{2,j} = 1 ,   (1b)

  σ_{1,k} ≥ 0 , k = 1, ..., K ,  σ_{2,j} ≥ 0 , j = 1, ..., d ,   (1c)

where λ > 0 is a Lagrangian parameter that controls the amount of shrinkage, and x/y is defined by continuation at zero as x/0 = ∞ if x ≠ 0 and 0/0 = 0.

The second term of expression (1a) penalizes β, according to the tree structure, via the scaling factors σ1 and σ2. The constraints (1b) shrink the coefficients β at the group level and inside groups. In what follows, we show that problem (1) is convex and that this joint shrinkage encourages sparsity at the group level.

2.2 Two important properties

We first prove that the optimization problem (1) is tractable and moreover convex. Then, we show an equivalence with another optimization problem, which exhibits the exact nature of the constraints applied to the coefficients β.

Proposition 1 Provided L(·) is convex, problem (1) is convex.

Proof: A problem minimizing a convex criterion on a convex set is convex. Since L(·) is convex and λ is positive, the criterion (1a) is convex provided f(x, y, z) = x^2 / \sqrt{yz} is convex (for y, z > 0). To show this, we compute the Hessian:

  4 (yz)^{1/2} \nabla^2 f(x, y, z) =
  \begin{pmatrix} 8 & -4x/y & -4x/z \\ -4x/y & 3x^2/y^2 & x^2/(yz) \\ -4x/z & x^2/(yz) & 3x^2/z^2 \end{pmatrix}
  = 2 \begin{pmatrix} 2 \\ -x/y \\ -x/z \end{pmatrix} \begin{pmatrix} 2 \\ -x/y \\ -x/z \end{pmatrix}^{t}
  + \begin{pmatrix} 0 \\ x/y \\ -x/z \end{pmatrix} \begin{pmatrix} 0 \\ x/y \\ -x/z \end{pmatrix}^{t} .

Hence, the Hessian is positive semi-definite, and criterion (1a) is convex. Next, constraints (1c) define half-spaces for σ1 and σ2, which are convex sets. Equality constraints (1b) define affine subspaces of dimension K − 1 and d − 1, which are also convex sets. 
The intersection of convex sets being a convex set, the constraints define a convex admissible set, and problem (1) is convex. □

Proposition 2 Problem (1) is equivalent to

  min_{β}  L(β) + λ \Big( \sum_{k=1}^{K} d_k^{1/4} \Big( \sum_{j \in J_k} |β_j|^{4/3} \Big)^{3/4} \Big)^{2} .   (2)

Sketch of proof: The Lagrangian of problem (1) is

  L = L(β) + λ \sum_{k=1}^{K} \sum_{j \in J_k} \frac{β_j^2}{\sqrt{σ_{1,k} σ_{2,j}}} + ν_1 \Big( \sum_{k=1}^{K} d_k σ_{1,k} − 1 \Big) + ν_2 \Big( \sum_{j=1}^{d} σ_{2,j} − 1 \Big) − \sum_{k=1}^{K} ξ_{1,k} σ_{1,k} − \sum_{j=1}^{d} ξ_{2,j} σ_{2,j} .

Hence, the optimality conditions for σ_{1,k} and σ_{2,j} are

  ∂L/∂σ_{1,k} = 0  ⇒  −(λ/2) \sum_{j \in J_k} β_j^2 / (σ_{1,k}^{3/2} σ_{2,j}^{1/2}) + ν_1 d_k − ξ_{1,k} = 0 ,
  ∂L/∂σ_{2,j} = 0  ⇒  −(λ/2) β_j^2 / (σ_{1,k}^{1/2} σ_{2,j}^{3/2}) + ν_2 − ξ_{2,j} = 0 .

After some tedious algebra, the optimality conditions for σ_{1,k} and σ_{2,j} can be expressed as

  σ_{1,k} = \frac{d_k^{−3/4} s_k^{3/4}}{\sum_{k'=1}^{K} d_{k'}^{1/4} s_{k'}^{3/4}} ,  and  σ_{2,j} = \frac{d_k^{1/4} |β_j|^{4/3}}{s_k^{1/4} \sum_{k'=1}^{K} d_{k'}^{1/4} s_{k'}^{3/4}}  for j ∈ J_k ,

where s_k = \sum_{j \in J_k} |β_j|^{4/3}. Plugging these conditions in criterion (1a) yields the claimed result. □

2.3 Sparseness

Proposition 2 shows how the penalization influences the groups of variables and each variable in each group. 
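The closed-form scale factors in the proof of Proposition 2 can be checked numerically. The NumPy sketch below (the coefficient vector β and the grouping are made up for illustration) builds σ1 and σ2 from the optimality conditions, then verifies that both equality constraints (1b) hold and that the penalty term of (1a) collapses to the squared mixed norm of (2):

```python
import numpy as np

rng = np.random.default_rng(1)
groups = [[0, 1, 2], [3, 4], [5]]     # a made-up tree of height two, K = 3 groups
beta = rng.standard_normal(6)

# Closed-form optimal scales, with s_k = sum_{j in J_k} |beta_j|^(4/3).
s = np.array([np.sum(np.abs(beta[J]) ** (4 / 3)) for J in groups])
d = np.array([len(J) for J in groups])
D = np.sum(d ** 0.25 * s ** 0.75)
sigma1 = d ** (-0.75) * s ** 0.75 / D
sigma2 = np.concatenate(
    [d[k] ** 0.25 * np.abs(beta[J]) ** (4 / 3) / (s[k] ** 0.25 * D)
     for k, J in enumerate(groups)])

# Both equality constraints (1b) hold, and the penalty of (1a) at these
# scales equals the squared mixed norm of (2), i.e. D**2.
penalty = sum(np.sum(beta[J] ** 2 / np.sqrt(sigma1[k] * sigma2[J]))
              for k, J in enumerate(groups))
print(np.isclose(np.sum(d * sigma1), 1.0),
      np.isclose(np.sum(sigma2), 1.0),
      np.isclose(penalty, D ** 2))    # expect three Trues
```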
Note that, thanks to the positivity of the squared term in (2), the expression can be further simplified to

  min_{β}  L(β) + ν \sum_{k=1}^{K} d_k^{1/4} \Big( \sum_{j \in J_k} |β_j|^{4/3} \Big)^{3/4} ,   (3)

where, for any L(β), there is a one-to-one mapping from λ in (2) to ν in (3). This expression can be interpreted as the Lagrangian formulation of a constrained optimization problem, where the admissible set for β is defined by the multiplicand of ν.

We display the shape of the admissible set in figure 2, and compare it to ridge regression, which does not favor sparsity, to lasso, which encourages sparsity for all variables but does not take into account the group structure, and to group-lasso, which is invariant to rotations of within-group variables. One sees that hierarchical penalization combines some features of lasso and group-lasso.

Figure 2: Admissible sets for various penalties; the two horizontal axes are the (β1, β2) plane (first group) and the vertical axis is for β3 (second group). Ridge regression: β1^2 + β2^2 + β3^2 ≤ 1; lasso: |β1| + |β2| + |β3| ≤ 1; group-lasso: 2^{1/2} (β1^2 + β2^2)^{1/2} + |β3| ≤ 1; hierarchical penalization: 2^{1/4} (|β1|^{4/3} + |β2|^{4/3})^{3/4} + |β3| ≤ 1.

By looking at the curvature of these sets where they meet the axes, one gets a good intuition on why ridge regression does not suppress variables, why lasso does, why group-lasso suppresses groups of variables but not within-group variables, and why hierarchical penalization should do both. This intuition is however not correct for hierarchical penalization, because the boundary of the admissible set is differentiable in the within-group hyper-plane (β1, β2) at β1 = 0 and β2 = 0. 
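The "few leading terms" effect can be quantified directly from (3): the derivative of the group term d_k^{1/4} s_k^{3/4} with respect to |β_j| is d_k^{1/4} |β_j|^{1/3} / s_k^{1/4} (times ν), a smooth step that rises steeply from zero and saturates near d_k^{1/4}. A small numerical sketch, where the values of |β_j|, the remaining within-group mass, and d_k are all made up:

```python
import numpy as np

def marginal_strength(b, rest=1.0, d_k=2):
    """Derivative of the group penalty d_k^(1/4) * s_k^(3/4) with respect to
    |beta_j| = b, where s_k = b^(4/3) + rest and `rest` stands for the 4/3-power
    mass of the other coefficients in the group (made-up value)."""
    s_k = b ** (4 / 3) + rest
    return d_k ** 0.25 * b ** (1 / 3) / s_k ** 0.25

# The strength climbs steeply for small |beta_j|, then saturates near d_k^(1/4):
for b in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    print(f"|beta_j| = {b:8g}   penalization strength = {marginal_strength(b):.3f}")
```

Because the strength is near its maximum for all but the smallest coefficients, small |β_j| pay almost the full marginal price, which discourages many medium-sized coefficients in favor of a few leading ones.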
However, as its curvature is very high, solutions with few leading terms in the within-group variables are encouraged.

To go beyond the hints provided by these figures, we detail here the optimality conditions for β minimizing (3). The first-order optimality conditions are

  1. for β_j = 0, j ∈ J_k and \sum_{j \in J_k} |β_j| = 0:  ∂L(β)/∂β_j + ν d_k^{1/4} v_j = 0, where v_j ∈ [−1, 1];
  2. for β_j = 0, j ∈ J_k and \sum_{j \in J_k} |β_j| ≠ 0:  ∂L(β)/∂β_j = 0;
  3. for β_j ≠ 0, j ∈ J_k:  ∂L(β)/∂β_j + ν d_k^{1/4} sign(β_j) \Big( 1 + \sum_{ℓ \in J_k, ℓ ≠ j} |β_ℓ|^{4/3} / |β_j|^{4/3} \Big)^{−1/4} = 0.

These equations signify respectively that

  1. the variables belonging to groups that are estimated to be irrelevant are penalized with the highest strength, thus limiting the number of groups influencing the solution;
  2. when a group has some non-zero relevance, all variables enter the set of active variables provided they influence the fitting criterion;
  3. however, the penalization strength increases very rapidly (as a smooth step function) for small values of |β_j|, thus limiting the number of β_j with large magnitude.

Overall, hierarchical penalization is thus expected to provide solutions with few active groups and few leading variables within each group.

3 Algorithm

To solve problem (3), we use an active set algorithm, based on the approach proposed by Osborne et al. [5] for the lasso. 
This algorithm iterates two phases. First, the optimization problem is solved with a sub-optimal set of active (that is, non-zero) variables: we define A = {j | β_j ≠ 0}, the current active set of variables, γ = {β_j}_{j ∈ A}, the vector of coefficients associated to A, and G_k = J_k ∩ A, the subset of coefficients of γ associated to group k. At each iteration, we solve the problem

  min_{γ}  L(γ) = L(γ) + ν \sum_{k=1}^{K} d_k^{1/4} \Big( \sum_{j \in G_k} |γ_j|^{4/3} \Big)^{3/4} ,   (4)

by alternating steps A and B described below. Second, the set of active variables is incrementally updated as detailed in steps C and D.

A. Compute a candidate update from an admissible vector γ.
The goal is to solve min_h L(γ + h), where γ is the current estimate of the solution and h ∈ R^{|A|}. The difficulties in solving (4) stem from the discontinuities of the derivative due to the absolute values. These difficulties are circumvented by replacing |γ_j + h_j| by sign(γ_j)(γ_j + h_j). This enables the use of powerful continuous optimizers based either on the Newton, quasi-Newton or conjugate gradient methods, according to the size of the problem.

B. Obtain a new admissible vector γ†.
Let γ† = γ + h. If, for all j, sign(γ†_j) = sign(γ_j), then γ is sign-feasible, and we go to step C; otherwise:
  B.1 Let S be the set of indices m such that sign(γ†_m) ≠ sign(γ_m). Let μ = min_{m ∈ S} −γ_m/h_m, that is, μ is the largest step in direction h such that sign(γ_m + μ h_m) = sign(γ_m), except for one variable, ℓ = arg min_{m ∈ S} −γ_m/h_m, for which γ_ℓ + μ h_ℓ = 0.
  B.2 Set γ = γ + μh and sign(γ_ℓ) = −sign(γ_ℓ), and compute a new direction h as in step A. If, for the new solution γ†, sign(γ†_ℓ) ≠ sign(γ_ℓ), then ℓ is removed from A. Go to step A.
  B.3 Iterate step B until γ is sign-feasible.

C. Test optimality of γ.
If the appropriate optimality condition holds for all inactive variables β_ℓ (β_ℓ = 0), that is,
  C.1 for ℓ ∈ J_k where \sum_{j \in J_k} |β_j| = 0:  |∂L(β)/∂β_ℓ| ≤ ν d_k^{1/4},
  C.2 for ℓ ∈ J_k where \sum_{j \in J_k} |β_j| ≠ 0:  ∂L(β)/∂β_ℓ = 0,
then γ is a solution. Else, go to step D.

D. Select the variable that enters the active set.
  D.1 Select the variable ℓ, ℓ ∉ A, that maximizes d_k^{−1/4} |∂L(β)/∂β_ℓ|, where k is the group of variable ℓ.
  D.2 Update the active set: A ← A ∪ {ℓ}, with initial vector γ = [γ, 0]^t, where the sign of the new zero component is −sign(∂L(β)/∂β_ℓ).
  D.3 Go to step A.

The algorithm is initialized with A = ∅, and the first variable is selected with the process described at step D.

4 Experiments

We illustrate on two datasets how hierarchical penalization can be useful in exploratory analysis and in prediction. 
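As a small synthetic check, problem (3) can also be handed to a generic derivative-free optimizer, here SciPy's Powell method standing in for the active set algorithm above; the data, grouping and ν below are made up. On a design where only the first group is informative, the second group's coefficients are driven to (numerically) zero:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
groups = [[0, 1, 2], [3, 4, 5]]          # two made-up groups of d_k = 3 variables
X = rng.standard_normal((n, 6))
beta_true = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0])   # second group irrelevant
y = X @ beta_true + 0.1 * rng.standard_normal(n)

def objective(beta, nu=5.0):
    # squared loss plus the hierarchical penalty of eq. (3)
    pen = sum(len(J) ** 0.25 * np.sum(np.abs(beta[J]) ** (4 / 3)) ** 0.75
              for J in groups)
    return 0.5 * np.sum((y - X @ beta) ** 2) + nu * pen

beta_hat = minimize(objective, np.zeros(6), method="Powell").x
print(np.round(beta_hat, 3))   # first three near beta_true, last three near 0
```

A generic solver like Powell only approaches the group-level zeros, whereas the active set algorithm handles them exactly; this sketch is a sanity check, not a replacement.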
Then, we show how the algorithm can be applied to multiple kernel learning in kernel regression.

4.1 Abalone Database

The Abalone problem [6] consists in predicting the age of abalone from physical measurements. The dataset is composed of 8 attributes. One concerns the sex of the abalone, and has been encoded with dummy variables, that is, x_i^sex = (100) for male, x_i^sex = (010) for female, and x_i^sex = (001) for infant. This variable defines the first group. The second group is composed of 3 attributes concerning size parameters (length, diameter and height), and the last group is composed of weight parameters (whole, shucked, viscera and shell weight).

We randomly selected 2920 examples for training, including the tuning of ν by 10-fold cross-validation, and left the 1257 others for testing. The mean squared test error is at par with lasso (4.3). The coefficients estimated on the training set are reported in table 1. Weight parameters are a main contributor to the estimation of the age of an abalone, while sex is not essential, except for infant.

Table 1: Coefficients obtained on the Abalone dataset. The last column reports the group norm d_k^{1/4} ( \sum_{j \in J_k} |β_j|^{4/3} )^{3/4}.

  group  | coefficients                    | group norm
  sex    |  0.051   0.036  −0.360          |  0.516
  size   | −0.044   1.134   0.358          |  1.7405
  weight |  4.370  −4.499  −1.110   1.399  | 11.989

4.2 Delve Census Database

The Delve Census problem [7] consists in predicting the median price of a house in different survey regions. Each of the 22732 survey regions is represented by 134 demographic measurements. Several prototypes are available. We focussed on the prototype "house-price-16L", composed of 16 variables. 
We derived this prototype by including all the other variables related to these 16 variables. The final dataset is then composed of 37 variables, split up into 10 groups¹.

We randomly selected 8000 observations for training and left the 14732 others for testing. We divided the training observations into 10 distinct datasets. For each dataset, the parameter ν was selected by a 10-fold cross-validation, and the mean squared error was computed on the testing set. We report in table 2 the mean squared test errors obtained with the hierarchical penalization (hp), the group-lasso (gl) and the lasso estimates.

Table 2: Mean squared test errors obtained with different methods for the 10 datasets.

  dataset       |   1     2     3     4     5     6     7     8     9    10   | mean error
  hp (×10⁹)     | 2.363 2.745 2.289 4.481 2.211 2.364 2.460 2.298 2.461 2.286 |   2.596
  gl (×10⁹)     | 2.429 2.460 2.289 4.653 2.230 2.364 2.472 2.308 2.454 2.291 |   2.595
  lasso (×10⁹)  | 2.380 2.716 2.293 4.656 2.216 2.368 2.490 2.295 2.483 2.288 |   2.618

Hierarchical penalization performs better than lasso on 8 datasets. It also performs better than group-lasso on 6 datasets, and obtains equal results on 2 datasets. However, the lowest overall mean error is achieved by group-lasso.

¹ A description of the dataset is available at http://www.hds.utc.fr/~mszafran/nips07/.

4.3 Multiple Kernel Learning

Multiple kernel learning has drawn much interest in classification with support vector machines (SVMs), starting from the work of Lanckriet et al. [8]. The problem consists in learning a convex combination of kernels in the SVM optimization algorithm. Here, we show that hierarchical penalization is well suited for this purpose with other kernel predictors, and we illustrate its effect on kernel smoothing in the regression setup. Kernel smoothing has been studied in nonparametric statistics since the 60's [9]. 
Here, we consider the model where the response variable y is estimated by a sum of kernel functions:

  y_i = \sum_{j=1}^{n} β_j κ_h(x_i, x_j) + ε_i ,

where κ_h is the kernel with scale factor (or bandwidth) h, and ε_i is a residual error. For the purpose of combining K bandwidths, the general criterion (3) reads

  min_{\{β_k\}_{k=1}^{K}}  \sum_{i=1}^{n} \Big( y_i − \sum_{k=1}^{K} \sum_{j=1}^{n} β_{k,j} κ_{h_k}(x_i, x_j) \Big)^{2} + ν \sum_{k=1}^{K} n^{1/4} \Big( \sum_{j=1}^{n} |β_{k,j}|^{4/3} \Big)^{3/4} .   (5)

The penalized model (5) has been applied to the motorcycle dataset [9]. This uni-dimensional problem makes it possible to display the contribution of each bandwidth to the solution. We used Gaussian kernels, with 7 bandwidths ranging from 10⁻¹ to 10².

Figure 3 displays the results obtained for different penalization parameters: the estimated function obtained by the combination of the selected bandwidths, and the contribution of each bandwidth to the model. We display three settings for the penalization parameter ν, corresponding to slight over-fitting, good fit and slight under-fitting. The coefficients of bandwidths h2, h6 and h7 were always null and are thus not displayed. As expected, when the penalization parameter ν increases, the fit becomes smoother, and the number of contributing bandwidths decreases. We also observe that the effective contribution of some bandwidths is limited to a few kernels: there are few leading terms in the expansion.

5 Conclusion and further works

Hierarchical penalization is a generic framework enabling hierarchically structured variables to be processed by usual statistical models. The structure is provided to the model via constraints on the subgroups of variables defined at each level of the hierarchy. 
The fitted model is then biased towards statistical explanations that are "simple" with respect to this structure, that is, solutions which promote a small number of groups of variables, with a few leading components.

In this paper, we detailed the general framework of hierarchical penalization for tree structures of height two, and discussed its specific properties in terms of convexity and parsimony. Then, we proposed an efficient active set algorithm that incrementally builds an optimal solution to the problem. We illustrated how the approach can be used when groups of features exist, or when discrete variables, once encoded by several binary variables, result in groups of variables. Finally, we also showed how the algorithm can be used to learn from multiple kernels in regression. We are now performing quantitative empirical evaluations, with applications to regression, classification and clustering, and comparisons to other regularization schemes, such as the group-lasso.

We then plan to extend the formalization to hierarchies of arbitrary height, whose properties are currently under study. We will then be able to tackle new applications, such as genomics, where the available gene ontologies are hierarchical structures that can be faithfully approximated by trees.

References

[1] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.

[2] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.

[3] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society. 
Series B, 68(1):49–67, 2006.

Figure 3: Hierarchical penalization applied to kernel smoothing on the motorcycle data, for ν = 10, 25, 50 and displayed bandwidths h1 = 10⁻¹, h3 = 1, h4 = 10^{1/2}, h5 = 10. Combined: the points represent data and the solid line the function of estimated responses. Isolated bandwidths: the points represent partial residuals and the solid line represents the contribution of the bandwidth to the model.

[4] Y. Grandvalet and S. Canu. Adaptive scaling for feature selection in SVMs. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[5] M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, June 2000.

[6] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz. UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

[7] Delve: Data for evaluating learning in valid experiments. URL http://www.cs.toronto.edu/~delve/.

[8] G. Lanckriet, T. De Bie, N. Cristianini, M. Jordan, and W. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004.

[9] W. Härdle. Applied Nonparametric Regression, volume 19 of Econometric Society Monographs. 1990.
", "award": [], "sourceid": 613, "authors": [{"given_name": "Marie", "family_name": "Szafranski", "institution": null}, {"given_name": "Yves", "family_name": "Grandvalet", "institution": null}, {"given_name": "Pierre", "family_name": "Morizet-mahoudeaux", "institution": null}]}