{"title": "Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 105, "page_last": 112, "abstract": "For supervised and unsupervised learning, positive definite kernels make it possible to use large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the L1-norm or the block L1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.", "full_text": "Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Francis Bach
INRIA - Willow Project, École Normale Supérieure
45, rue d'Ulm, 75230 Paris, France
francis.bach@mines.org

Abstract

For supervised and unsupervised learning, positive definite kernels make it possible to use large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. 
We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

1 Introduction

In the last two decades, kernel methods have been a prolific theoretical and algorithmic machine learning framework. By using appropriate regularization by Hilbertian norms, representer theorems make it possible to consider large and potentially infinite-dimensional feature spaces while working within an implicit feature space no larger than the number of observations. This has led to numerous works on kernel design adapted to specific data types and generic kernel-based algorithms for many learning tasks (see, e.g., [1, 2]).

Regularization by sparsity-inducing norms, such as the ℓ1-norm, has also attracted a lot of interest in recent years. While early work focused on efficient algorithms to solve the convex optimization problems, recent research has looked at the model selection properties and predictive performance of such methods, in the linear case [3] or within the multiple kernel learning framework (see, e.g., [4]). In this paper, we aim to bridge the gap between these two lines of research by trying to use ℓ1-norms inside the feature space. Indeed, feature spaces are large and we expect the estimated predictor function to require only a small number of features, which is exactly the situation where ℓ1-norms have proven advantageous. 
This leads to two natural questions that we try to answer in this paper: (1) Is it feasible to perform optimization in this very large feature space with a cost which is polynomial in the size of the input space? (2) Does it lead to better predictive performance and feature selection?

More precisely, we consider a positive definite kernel that can be expressed as a large sum of positive definite basis or local kernels. This exactly corresponds to the situation where a large feature space is the concatenation of smaller feature spaces, and we aim to do selection among these many kernels, which may be done through multiple kernel learning. One major difficulty, however, is that the number of these smaller kernels is usually exponential in the dimension of the input space, and applying multiple kernel learning directly to this decomposition would be intractable.

In order to perform selection efficiently, we make the extra assumption that these small kernels can be embedded in a directed acyclic graph (DAG). Following [5], we consider in Section 2 a specific combination of ℓ2-norms that is adapted to the DAG and restricts the authorized sparsity patterns; in our specific kernel framework, we are able to use the DAG to design an optimization algorithm which has polynomial complexity in the number of selected kernels (Section 3). In simulations (Section 5), we focus on directed grids, where our framework makes it possible to perform nonlinear variable selection. 
We provide extensive experimental validation of our novel regularization framework; in particular, we compare it to regular ℓ2-regularization and show that it is always competitive and often leads to better performance, both on synthetic examples and on standard regression and classification datasets from the UCI repository.

Finally, we extend in Section 4 some of the known consistency results for the Lasso and multiple kernel learning [3, 4], and give a partial answer to the model selection capabilities of our regularization framework by giving necessary and sufficient conditions for model consistency. In particular, we show that our framework is adapted to estimating consistently only the hull of the relevant variables. Hence, by restricting the statistical power of our method, we gain computational efficiency.

2 Hierarchical multiple kernel learning (HKL)

We consider the problem of predicting a random variable Y ∈ 𝒴 ⊂ ℝ from a random variable X ∈ 𝒳, where 𝒳 and 𝒴 may be quite general spaces. We assume that we are given n i.i.d. observations (x_i, y_i) ∈ 𝒳 × 𝒴, i = 1, ..., n. We define the empirical risk of a function f from 𝒳 to ℝ as (1/n) Σ_{i=1}^{n} ℓ(y_i, f(x_i)), where ℓ : 𝒴 × ℝ → ℝ₊ is a loss function. We only assume that ℓ is convex with respect to the second parameter (but not necessarily differentiable). 
Typical examples of loss functions are the square loss for regression, i.e., ℓ(y, ŷ) = ½(y − ŷ)² for y ∈ ℝ, and the logistic loss ℓ(y, ŷ) = log(1 + e^{−yŷ}) or the hinge loss ℓ(y, ŷ) = max{0, 1 − yŷ} for binary classification, where y ∈ {−1, 1}, leading respectively to logistic regression and support vector machines [1, 2].

2.1 Graph-structured positive definite kernels

We assume that we are given a positive definite kernel k : 𝒳 × 𝒳 → ℝ, and that this kernel can be expressed as the sum, over an index set V, of basis kernels k_v, v ∈ V, i.e., for all x, x′ ∈ 𝒳, k(x, x′) = Σ_{v∈V} k_v(x, x′). For each v ∈ V, we denote by F_v and Φ_v the feature space and feature map of k_v, i.e., for all x, x′ ∈ 𝒳, k_v(x, x′) = ⟨Φ_v(x), Φ_v(x′)⟩. Throughout the paper, we denote by ‖u‖ the Hilbertian norm of u and by ⟨u, v⟩ the associated dot product, where the precise space is omitted and can always be inferred from the context.

Our sum assumption corresponds to a situation where the feature map Φ(x) and feature space F for k are the concatenations of the feature maps Φ_v(x) and feature spaces F_v for each kernel k_v, i.e., F = ∏_{v∈V} F_v and Φ(x) = (Φ_v(x))_{v∈V}. Thus, looking for a certain β ∈ F and a predictor function f(x) = ⟨β, Φ(x)⟩ is equivalent to looking jointly for β_v ∈ F_v, for all v ∈ V, and f(x) = Σ_{v∈V} ⟨β_v, Φ_v(x)⟩.

As mentioned earlier, we make the assumption that the set V can be embedded into a directed acyclic graph. Directed acyclic graphs (DAGs) naturally define the notions of parents, children, descendants and ancestors. Given a node w ∈ V, we denote by A(w) ⊂ V the set of its ancestors, and by D(w) ⊂ V the set of its descendants. 
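As a toy illustration of these graph notions, the following Python sketch computes ancestors and descendants on a small 2D directed grid (the grid structure used later in Section 2.1); all function names here are ours, purely for illustration, and follow the paper's convention that a node is its own ancestor and descendant.

```python
from itertools import product

# Hypothetical 2D directed grid V = {0,...,q} x {0,...,q}, with edges
# (j1, j2) -> (j1+1, j2) and (j1, j2) -> (j1, j2+1).
q = 3
V = list(product(range(q + 1), repeat=2))

def parents(w):
    """Parents of w: decrement one coordinate (staying inside the grid)."""
    return [tuple(w[k] - (k == i) for k in range(len(w)))
            for i in range(len(w)) if w[i] > 0]

def ancestors(w):
    """A(w): w itself plus all nodes reachable by repeatedly taking parents."""
    seen, stack = {w}, [w]
    while stack:
        for p in parents(stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def descendants(w):
    """D(w): all nodes v such that w is an ancestor of v."""
    return {v for v in V if w in ancestors(v)}

# On the grid, A((j1, j2)) is the lower-left rectangle below (j1, j2).
print(sorted(ancestors((1, 1))))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

On the directed grid, A(w) and D(w) are simply the axis-aligned "boxes" below and above w, which is what makes the later computations over hulls tractable.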
We use the convention that any w is a descendant and an ancestor of itself, i.e., w ∈ A(w) and w ∈ D(w). Moreover, for W ⊂ V, we denote by sources(W) the set of sources of the graph G restricted to W (i.e., nodes in W with no parents belonging to W). Given a subset of nodes W ⊂ V, we can define the hull of W as the union of all ancestors of w ∈ W, i.e., hull(W) = ∪_{w∈W} A(w). Given a set W, we define the set of extreme points of W as the smallest subset T ⊂ W such that hull(T) = hull(W) (note that it is always well defined, as ∩_{T⊂V, hull(T)=hull(W)} T). See Figure 1 for examples of these notions.

The goal of this paper is to perform kernel selection among the kernels k_v, v ∈ V. We essentially use the graph to limit the search to specific subsets of V. Namely, instead of considering all possible subsets of active (relevant) vertices, we are only interested in estimating correctly the hull of these relevant vertices; in Section 2.2, we design a specific sparsity-inducing norm adapted to hulls.

In this paper, we primarily focus on kernels that can be expressed as "products of sums", and on the associated p-dimensional directed grids, while noting that our framework is applicable to many other kernels. Namely, we assume that the input space 𝒳 factorizes into p components 𝒳 = 𝒳_1 × ··· × 𝒳_p and that we are given p sequences of length q + 1 of kernels k_{ij}(x_i, x′_i), i ∈ {1, ..., p}, j ∈

Figure 1: Example of graph and associated notions. (Left) Example of a 2D-grid. (Middle) Example of sparsity pattern (× in light blue) and the complement of its hull (+ in light red). 
(Right) Dark blue points (×) are extreme points of the set of all active points (blue ×); dark red points (+) are the sources of the set of all red points (+).

{0, ..., q}, such that k(x, x′) = Σ_{j_1,...,j_p=0}^{q} ∏_{i=1}^{p} k_{i j_i}(x_i, x′_i) = ∏_{i=1}^{p} (Σ_{j_i=0}^{q} k_{i j_i}(x_i, x′_i)). We thus have a sum of (q+1)^p kernels that can be computed efficiently as a product of p sums. A natural DAG on V = ∏_{i=1}^{p} {0, ..., q} is defined by connecting each (j_1, ..., j_p) to (j_1 + 1, j_2, ..., j_p), ..., (j_1, ..., j_{p−1}, j_p + 1). As shown in Section 2.2, this DAG will correspond to the constraint of selecting a given product of kernels only after all the subproducts are selected. Those DAGs are especially suited to nonlinear variable selection, in particular with the polynomial and Gaussian kernels. In this context, products of kernels correspond to interactions between certain variables, and our DAG implies that we select an interaction only after all sub-interactions have already been selected.

Polynomial kernels. We consider 𝒳_i = ℝ and k_{ij}(x_i, x′_i) = (q choose j) (x_i x′_i)^j; the full kernel is then equal to k(x, x′) = ∏_{i=1}^{p} Σ_{j=0}^{q} (q choose j) (x_i x′_i)^j = ∏_{i=1}^{p} (1 + x_i x′_i)^q. Note that this is not exactly the usual polynomial kernel (whose feature space is the space of multivariate polynomials of total degree less than q), since our kernel considers polynomials of maximal degree q.

Gaussian kernels. We also consider 𝒳_i = ℝ, and the Gaussian-RBF kernel e^{−b(x−x′)²}. 
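The "product of sums" identity above, with the binomial decomposition of the polynomial kernel, can be checked numerically for small p and q; this is an illustrative sketch (values and names are ours), showing why the sum of (q+1)^p basis kernels never has to be enumerated explicitly.

```python
import itertools
import math

# Toy check of the product-of-sums identity for the polynomial
# decomposition k_ij(x_i, x'_i) = C(q, j) (x_i x'_i)^j:
#   sum over all (j_1,...,j_p) of prod_i k_ij_i  ==  prod_i sum_j k_ij,
# i.e. (q+1)^p basis kernels computed as a product of p sums.
p, q = 3, 4
x = [0.5, -1.0, 2.0]
xp = [1.5, 0.25, 0.5]

def kij(i, j):
    return math.comb(q, j) * (x[i] * xp[i]) ** j

product_of_sums = math.prod(sum(kij(i, j) for j in range(q + 1))
                            for i in range(p))
sum_of_products = sum(math.prod(kij(i, ji) for i, ji in enumerate(js))
                      for js in itertools.product(range(q + 1), repeat=p))

# Both equal prod_i (1 + x_i x'_i)^q, the "maximal degree q" kernel.
closed_form = math.prod((1 + x[i] * xp[i]) ** q for i in range(p))
assert abs(product_of_sums - sum_of_products) < 1e-9
assert abs(product_of_sums - closed_form) < 1e-9
```

Here the brute-force sum already has 5³ = 125 terms for p = 3; the product-of-sums form needs only p(q+1) = 15 kernel evaluations.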
The following decomposition is the eigendecomposition of the non-centered covariance operator for a normal distribution with variance 1/(4a) (see, e.g., [6]):

e^{−b(x−x′)²} = Σ_{k=0}^{∞} [(b/A)^k / (2^k k!)] [e^{−(b/A)(a+c)x²} H_k(√(2c) x)] [e^{−(b/A)(a+c)(x′)²} H_k(√(2c) x′)],

where c² = a² + 2ab, A = a + b + c, and H_k is the k-th Hermite polynomial. By appropriately truncating the sum, i.e., by considering that the first q basis kernels are obtained from the first q single Hermite polynomials, and the (q+1)-th kernel is summing over all other kernels, we obtain a decomposition of a uni-dimensional Gaussian kernel into q + 1 components (q of them are one-dimensional, the last one is infinite-dimensional, but can be computed by differencing). The decomposition ends up being close to a polynomial kernel of infinite degree, modulated by an exponential [2]. One may also use an adaptive decomposition using kernel PCA (see, e.g., [2, 1]), which is equivalent to using the eigenvectors of the empirical covariance operator associated with the data (and not the population one associated with the Gaussian distribution with the same variance). In simulations, we tried both with no significant differences.

ANOVA kernels. When q = 1, the directed grid is isomorphic to the power set (i.e., the set of subsets) with the inclusion DAG. In this setting, we can decompose the ANOVA kernel [2] as ∏_{i=1}^{p} (1 + e^{−b(x_i−x′_i)²}) = Σ_{J⊂{1,...,p}} ∏_{i∈J} e^{−b(x_i−x′_i)²} = Σ_{J⊂{1,...,p}} e^{−b‖x_J−x′_J‖²}, and our framework will select the relevant subsets for the Gaussian kernels.

Kernels or features? In this paper, we emphasize the kernel view, i.e., we are given a kernel (and thus a feature space) and we explore it using ℓ1-norms. Alternatively, we could use the feature view, i.e., we have a large structured set of features that we try to select from; however, the techniques developed in this paper assume that (a) each feature might be infinite-dimensional and (b) we can sum all the local kernels efficiently (see in particular Section 3.2). Following the kernel view thus seems slightly more natural.

2.2 Graph-based structured regularization

Given β ∈ ∏_{v∈V} F_v, the natural Hilbertian norm ‖β‖ is defined through ‖β‖² = Σ_{v∈V} ‖β_v‖². Penalizing with this norm is efficient because summing all kernels k_v is assumed feasible in polynomial time and we can bring to bear the usual kernel machinery; however, it does not lead to sparse solutions, where many β_v will be exactly equal to zero.

As said earlier, we are only interested in the hull of the selected elements β_v ∈ F_v, v ∈ V; the hull of a set I is characterized by the set of v such that D(v) ⊂ I^c, i.e., such that all descendants of v are in the complement I^c: hull(I) = {v ∈ V, D(v) ⊂ I^c}^c. Thus, if we try to estimate hull(I), we need to determine which v ∈ V are such that D(v) ⊂ I^c. In our context, we are hence looking at selecting vertices v ∈ V for which β_{D(v)} = (β_w)_{w∈D(v)} = 0.

We thus consider the following structured block ℓ1-norm, defined as Σ_{v∈V} d_v ‖β_{D(v)}‖ = Σ_{v∈V} d_v (Σ_{w∈D(v)} ‖β_w‖²)^{1/2}, where (d_v)_{v∈V} are positive weights. Penalizing by such a norm will indeed impose that some of the vectors β_{D(v)} ∈ ∏_{w∈D(v)} F_w are exactly zero. 
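The structured block ℓ1-norm can be sketched concretely on a small grid with scalar β_v (i.e., F_v = ℝ); the weights and names below are illustrative (we use weights growing with depth, as in the simulations), not taken from the paper's code.

```python
import math
from itertools import product

# Sketch of Omega(beta) = sum_v d_v ||beta_{D(v)}|| on a 2D grid with
# scalar beta_v; zeroing a block beta_{D(v)} kills a whole upper box.
q = 2
V = list(product(range(q + 1), repeat=2))

def descendants(w):
    # On the grid, D(w) is the upper-right box above w (w included).
    return [v for v in V if all(v[k] >= w[k] for k in range(2))]

def omega(beta, d=lambda v: 2.0 ** sum(v)):  # weights grow with depth
    return sum(d(v) * math.sqrt(sum(beta[w] ** 2 for w in descendants(v)))
               for v in V)

# A hull-shaped support {(0,0), (0,1), (1,0)}: every ancestor of an
# active node is active, which is the kind of pattern the norm favors.
beta = {v: 0.0 for v in V}
beta[(0, 0)], beta[(0, 1)], beta[(1, 0)] = 1.0, 0.5, -0.5
print(omega(beta))
```

Only the three blocks rooted at (0,0), (0,1) and (1,0) contribute here; all blocks β_{D(v)} rooted outside the support are exactly zero, which is precisely the sparsity pattern discussed above.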
We thus consider the following minimization problem¹:

min_{β∈∏_{v∈V} F_v} (1/n) Σ_{i=1}^{n} ℓ(y_i, Σ_{v∈V} ⟨β_v, Φ_v(x_i)⟩) + (λ/2) (Σ_{v∈V} d_v ‖β_{D(v)}‖)².   (1)

Our Hilbertian norm is a Hilbert space instantiation of the hierarchical norms recently introduced by [5] and also considered by [7] in the MKL setting. If all Hilbert spaces are finite-dimensional, our particular choice of norms corresponds to an "ℓ1-norm of ℓ2-norms". While with uni-dimensional groups/kernels, the "ℓ1-norm of ℓ∞-norms" allows an efficient path algorithm for the square loss when the DAG is a tree [5], this is not possible anymore with groups of size larger than one, or when the DAG is not a tree. In Section 3, we propose a novel algorithm to solve the associated optimization problem in time polynomial in the number of selected groups/kernels, for all group sizes, DAGs and losses. Moreover, in Section 4, we show under which conditions a solution to the problem in Eq. (1) consistently estimates the hull of the sparsity pattern.

Finally, note that in certain settings (finite-dimensional Hilbert spaces and distributions with absolutely continuous densities), these norms have the effect of selecting a given kernel only after all of its ancestors [5]. This is another explanation why hulls end up being selected, since to include a given vertex in the models, the entire set of ancestors must also be selected.

3 Optimization problem

In this section, we give optimality conditions for the problem in Eq. (1), as well as optimization algorithms with polynomial time complexity in the number of selected kernels. 
In simulations we consider total numbers of kernels larger than 10^30, and thus such efficient algorithms are essential to the success of hierarchical multiple kernel learning (HKL).

3.1 Reformulation in terms of multiple kernel learning

Following [8, 9], we can simply derive an equivalent formulation of Eq. (1). Using the Cauchy-Schwarz inequality, we have, for all η ∈ ℝ^V such that η ≥ 0 and Σ_{v∈V} d_v² η_v ≤ 1,

(Σ_{v∈V} d_v ‖β_{D(v)}‖)² ≤ Σ_{v∈V} ‖β_{D(v)}‖²/η_v = Σ_{w∈V} (Σ_{v∈A(w)} η_v^{-1}) ‖β_w‖²,

with equality if and only if η_v = d_v^{-1} ‖β_{D(v)}‖ (Σ_{v∈V} d_v ‖β_{D(v)}‖)^{-1}. We associate to the vector η ∈ ℝ^V the vector ζ ∈ ℝ^V such that, for all w ∈ V, ζ_w^{-1} = Σ_{v∈A(w)} η_v^{-1}. We use the natural convention that if η_v is equal to zero, then ζ_w is equal to zero for all descendants w of v. We denote by H the set of allowed η and by Z the set of all associated ζ. The sets H and Z are in bijection, and we can interchangeably use η ∈ H or the corresponding ζ(η) ∈ Z. Note that Z is in general not convex² (unless the DAG is a tree, see [10]), and that if ζ ∈ Z, then ζ_w ≤ ζ_v for all w ∈ D(v), i.e., weights of descendant kernels are smaller, which is consistent with the known fact that kernels should always be selected after all their ancestors.

The problem in Eq. 
(1) is thus equivalent to

min_{η∈H} min_{β∈∏_{v∈V} F_v} (1/n) Σ_{i=1}^{n} ℓ(y_i, Σ_{v∈V} ⟨β_v, Φ_v(x_i)⟩) + (λ/2) Σ_{w∈V} ζ_w(η)^{-1} ‖β_w‖².   (2)

Using the change of variable β̃_v = β_v ζ_v^{-1/2} and Φ̃(x) = (ζ_v^{1/2} Φ_v(x))_{v∈V}, this implies that, given the optimal η (and associated ζ), β corresponds to the solution of the regular supervised learning problem with kernel matrix K = Σ_{w∈V} ζ_w K_w, where K_w is the n × n kernel matrix associated with kernel k_w. Moreover, the solution is then β_w = ζ_w Σ_{i=1}^{n} α_i Φ_w(x_i), where α ∈ ℝ^n are the dual parameters associated with the single kernel learning problem.

Thus, the solution is entirely determined by α ∈ ℝ^n and η ∈ ℝ^V (and its corresponding ζ ∈ ℝ^V). More precisely, we have (see proof in [10]):

Proposition 1. The pair (α, η) is optimal for Eq. (1), with, for all w, β_w = ζ_w Σ_{i=1}^{n} α_i Φ_w(x_i), if and only if (a) given η, α is optimal for the single kernel learning problem with kernel matrix K = Σ_{w∈V} ζ_w(η) K_w, and (b) given α, η ∈ H maximizes Σ_{w∈V} (Σ_{v∈A(w)} η_v^{-1})^{-1} α⊤K_w α.

Moreover, the total duality gap can be upper-bounded as the sum of the two separate duality gaps for the two optimization problems, which will be useful in Section 3.2 (see [10] for more details). 

¹We consider the square of the norm, which does not change the regularization properties, but allows simple links with multiple kernel learning.
²Although Z is not convex, we can still maximize positive linear combinations over Z, which is the only needed operation (see [10] for details).
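The η → ζ reparameterization of Section 3.1 is simple to state in code; the following sketch on a 2D grid (illustrative names, ours) also exercises the zero convention and the fact that descendant weights ζ_w can only be smaller.

```python
from itertools import product

# Sketch of the eta -> zeta map: zeta_w^{-1} = sum over ancestors v of w
# of eta_v^{-1}, with the convention that eta_v = 0 forces zeta_w = 0
# for every descendant w of v.
q = 2
V = list(product(range(q + 1), repeat=2))

def ancestors(w):
    # On the grid, A(w) is the lower-left box below w (w included).
    return [v for v in V if all(v[k] <= w[k] for k in range(2))]

def zeta(eta):
    z = {}
    for w in V:
        if any(eta[v] == 0.0 for v in ancestors(w)):
            z[w] = 0.0                      # a switched-off ancestor
        else:
            z[w] = 1.0 / sum(1.0 / eta[v] for v in ancestors(w))
    return z

eta = {v: 1.0 for v in V}
eta[(1, 1)] = 0.0                  # switch one node off
z = zeta(eta)
assert z[(0, 0)] == 1.0            # a source: single ancestor, zeta = eta
assert z[(1, 0)] == 0.5            # two unit-weight ancestors: 1/(1+1)
assert z[(2, 2)] == 0.0            # descendant of the switched-off node
```

Note how ζ decreases along every directed path, matching the statement that descendant kernels always get smaller weights than their ancestors.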
Note that in the case of "flat" regular multiple kernel learning, where the DAG has no edges, we obtain back the usual optimality conditions [8, 9].

Following a common practice for convex sparsity problems [11], we will try to solve a small problem where we assume we know the set of v such that ‖β_{D(v)}‖ is equal to zero (Section 3.3). We then "simply" need to check that variables in that set may indeed be left out of the solution. In the next section, we show that this can be done in polynomial time, although the number of kernels to consider leaving out is exponential (Section 3.2).

3.2 Conditions for global optimality of reduced problem

We denote by J the complement of the set of norms which are set to zero. We thus consider the optimal solution β of the reduced problem (on J), namely,

min_{β_J∈∏_{v∈J} F_v} (1/n) Σ_{i=1}^{n} ℓ(y_i, Σ_{v∈J} ⟨β_v, Φ_v(x_i)⟩) + (λ/2) (Σ_{v∈V} d_v ‖β_{D(v)∩J}‖)²,   (3)

with optimal primal variables β_J, dual variables α and optimal pair (η_J, ζ_J). We now consider necessary conditions and sufficient conditions for this solution (augmented with zeros for non-active variables, i.e., variables in J^c) to be optimal with respect to the full problem in Eq. (1). We denote by δ = Σ_{v∈J} d_v ‖β_{D(v)∩J}‖ the optimal value of the norm for the reduced problem.

Proposition 2 (N_J). If the reduced solution is optimal for the full problem in Eq. (1) and all kernels in the extreme points of J are active, then max_{t∈sources(J^c)} α⊤K_t α / d_t² ≤ δ².

Proposition 3 (S_{J,ε}). If max_{t∈sources(J^c)} Σ_{w∈D(t)} α⊤K_w α / (Σ_{v∈A(w)∩D(t)} d_v)² ≤ δ² + ε/λ, then the total duality gap is less than ε.

The proof is fairly technical and can be found in [10]; this result constitutes the main technical contribution of the paper: it essentially makes it possible to solve a very large optimization problem over exponentially many dimensions in polynomial time.

The necessary condition (N_J) does not cause any computational problems. However, the sufficient condition (S_{J,ε}) requires summing over all descendants of the active kernels, which is impossible in practice (as shown in Section 5, we consider V of cardinality often greater than 10^30). Here, we need to bring to bear the specific structure of the kernel k. In the context of the directed grids we consider in this paper, if d_v can also be decomposed as a product, then Σ_{v∈A(w)∩D(t)} d_v is also factorized, and we can compute the sum over all v ∈ D(t) in linear time in p. Moreover, we can cache the sums Σ_{w∈D(t)} K_w / (Σ_{v∈A(w)∩D(t)} d_v)² in order to save running time.

3.3 Dual optimization for reduced or small problems

When the kernels k_v, v ∈ V, have low-dimensional feature spaces, we may use a primal representation and solve the problem in Eq. (1) using generic optimization toolboxes adapted to conic constraints (see, e.g., [12]). However, in order to reuse existing optimized supervised learning code and use high-dimensional kernels, it is preferable to use a dual optimization. Namely, we use the same technique as [8]: we consider, for ζ ∈ Z, the function B(ζ) = min_{β∈∏_{v∈V} F_v} (1/n) Σ_{i=1}^{n} ℓ(y_i, Σ_{v∈V} ⟨β_v, Φ_v(x_i)⟩) + (λ/2) Σ_{w∈V} ζ_w^{-1} ‖β_w‖², which is the optimal value of the single kernel learning problem with kernel matrix Σ_{w∈V} ζ_w K_w. Solving Eq. (2) is equivalent to minimizing B(ζ(η)) with respect to η ∈ H.

If a ridge (i.e., a positive diagonal) is added to the kernel matrices, the function B is differentiable [8]. Moreover, the function η ↦ ζ(η) is differentiable on (ℝ*₊)^V. 
Thus, the function η ↦ B[ζ((1 − ε)η + (ε/|V|) d^{-2})], where d^{-2} is the vector with elements d_v^{-2}, is differentiable if ε > 0. We can then use the same projected gradient descent strategy as [8] to minimize it. The overall complexity of the algorithm is then proportional to O(|V|n²) (to form the kernel matrices), plus the complexity of solving a single kernel learning problem (typically between O(n²) and O(n³)). Note that this algorithm is only used for small reduced subproblems, for which V has small cardinality.

3.4 Kernel search algorithm

We are now ready to present the detailed algorithm, which extends the feature search algorithm of [11]. Note that the kernel matrices are never all needed explicitly, i.e., we only need them (a) explicitly to solve the small problems (but we need only a few of those) and (b) implicitly to compute the sufficient condition (S_{J,ε}), which requires summing over all kernels, as shown in Section 3.2.

• Input: kernel matrices K_v ∈ ℝ^{n×n}, v ∈ V, maximal gap ε, maximal number of kernels Q
• Algorithm:
  1. Initialization: set J = sources(V); compute (α, η), solutions of Eq. (3), obtained using Section 3.3
  2. While (N_J) and (S_{J,ε}) are not satisfied and #(J) ≤ Q:
     – If (N_J) is not satisfied, add violating variables in sources(J^c) to J; else, add violating variables in sources(J^c) of (S_{J,ε}) to J
     – Recompute (α, η), optimal solutions of Eq. (3)
• Output: J, α, η

The previous algorithm will stop either when the duality gap is less than ε or when the maximal number of kernels Q has been reached. 
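The active-set loop above can be summarized in a short, language-agnostic sketch; the reduced solver and the (N_J)/(S_{J,ε}) checks are abstracted as callables, and all names here are illustrative, not from a released implementation.

```python
# High-level sketch of the kernel search algorithm. solve_reduced solves
# the small problem of Eq. (3); check_necessary / check_sufficient each
# return (satisfied, violating sources of J^c).
def kernel_search(V, sources, solve_reduced, check_necessary,
                  check_sufficient, Q, max_iter=100):
    """Grow the active set J until both optimality conditions hold or
    the kernel budget Q is exceeded."""
    J = set(sources(V))                    # start from the DAG sources
    alpha, eta = solve_reduced(J)          # small problem of Eq. (3)
    for _ in range(max_iter):
        ok_n, violators_n = check_necessary(J, alpha)
        ok_s, violators_s = check_sufficient(J, alpha)
        if (ok_n and ok_s) or len(J) > Q:
            break
        # add violating sources of J^c, necessary condition first
        J |= set(violators_n if not ok_n else violators_s)
        alpha, eta = solve_reduced(J)      # re-solve on the enlarged set
    return J, alpha, eta
```

Because kernels are only ever added at sources of J^c, the loop runs at most O(|J|) times, which is where the polynomial complexity in the number of selected kernels comes from.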
In practice, when the weights d_v increase with the depth of v in the DAG (which we use in simulations), the small duality gap generally occurs before we reach a problem larger than Q. Note that some of the iterations only increase the size of the active sets to check the sufficient condition for optimality; forgetting those does not change the solution, only the fact that we may actually know that we have an ε-optimal solution.

In order to obtain a polynomial complexity, the maximal out-degree of the DAG (i.e., the maximal number of children of any given node) should be polynomial as well. Indeed, for the directed p-grid (with maximum out-degree equal to p), the total running time complexity is a function of the number of observations n and the number R of selected kernels; with proper caching, we obtain the following complexity, assuming O(n³) for the single kernel learning problem, which is conservative: O(n³R + n²Rp² + n²R²p), which decomposes into solving O(R) single kernel learning problems, caching O(Rp) kernels, and computing O(R²p) quadratic forms for the sufficient conditions.

4 Consistency conditions

As said earlier, the sparsity pattern of the solution of Eq. (1) will be equal to its hull, and thus we can only hope to obtain consistency of the hull of the pattern, which we consider in this section. For simplicity, we consider the case of finite-dimensional Hilbert spaces (i.e., F_v = ℝ^{f_v}) and the square loss. 
We also hold fixed the vertex set V, i.e., we assume that the total number of features is fixed, and we let n tend to infinity and λ = λ_n decrease with n.

Following [4], we make the following assumptions on the underlying joint distribution of (X, Y): (a) the joint covariance matrix Σ of (Φ(x_v))_{v∈V} (defined with appropriate blocks of size f_v × f_w) is invertible, (b) E(Y|X) = Σ_{w∈W} ⟨β_w, Φ_w(x)⟩ with W ⊂ V, and var(Y|X) = σ² > 0 almost surely. With these simple assumptions, we obtain (see proof in [10]):

Proposition 4 (Sufficient condition). If max_{t∈sources(W^c)} Σ_{w∈D(t)} ‖Σ_{wW} Σ_{WW}^{-1} Diag(d_v/‖β_{D(v)}‖)_{v∈W} β_W‖² / (Σ_{v∈A(w)∩D(t)} d_v)² < 1, then β and the hull of W are consistently estimated when λ_n n^{1/2} → ∞ and λ_n → 0.

Proposition 5 (Necessary condition). If β and the hull of W are consistently estimated for some sequence λ_n, then max_{t∈sources(W^c)} ‖Σ_{tW} Σ_{WW}^{-1} Diag(d_v/‖β_{D(v)}‖)_{v∈W} β_W‖² / d_t² ≤ 1.

Note that the last two propositions are not consequences of the similar results for flat MKL [4], because the groups that we consider are overlapping. Moreover, the last propositions show that we can indeed estimate the correct hull of the sparsity pattern if the sufficient condition is satisfied. In particular, if we can make the groups such that the between-group correlation is as small as possible,

[Figure 2: two panels plotting test set error against log₂(p) for HKL, greedy and L2; left panel non-rotated data, right panel rotated data.]

Figure 2: Comparison on synthetic examples: mean squared error over 40 replications (with halved standard deviations). Left: non-rotated data, right: rotated data. 
See text for details.

dataset        n     p   k     #(V)     L2         greedy      lasso-α    MKL        HKL
abalone        4177  10  pol4  ≈10^7    44.2±1.3   43.9±1.4    47.9±0.7   44.5±1.1   43.3±1.0
abalone        4177  10  rbf   ≈10^10   43.0±0.9   45.0±1.7    49.0±1.7   43.7±1.0   43.0±1.1
bank-32fh      8192  32  pol4  ≈10^22   40.1±0.7   39.2±0.8    41.3±0.7   38.7±0.7   38.9±0.7
bank-32fh      8192  32  rbf   ≈10^31   39.0±0.7   39.7±0.7    66.1±6.9   38.4±0.7   38.4±0.7
bank-32fm      8192  32  pol4  ≈10^22   6.0±0.1    5.7±0.2     7.0±0.2    6.1±0.3    5.1±0.1
bank-32fm      8192  32  rbf   ≈10^31   5.0±0.2    5.8±0.4     36.3±4.1   5.9±0.2    4.6±0.2
bank-32nh      8192  32  pol4  ≈10^22   44.3±1.2   46.3±1.4    45.8±0.8   46.0±1.2   43.6±1.1
bank-32nh      8192  32  rbf   ≈10^31   44.3±1.2   49.4±1.6    93.0±2.8   46.1±1.1   43.5±1.0
bank-32nm      8192  32  pol4  ≈10^22   17.2±0.6   18.2±0.8    19.5±0.4   21.0±0.7   16.8±0.6
bank-32nm      8192  32  rbf   ≈10^31   16.9±0.6   21.0±0.6    62.3±2.5   20.9±0.7   16.4±0.6
boston         506   13  pol4  ≈10^9    17.1±3.6   24.7±10.8   29.3±2.3   22.2±2.2   18.1±3.8
boston         506   13  rbf   ≈10^12   16.4±4.0   32.4±8.2    29.4±1.6   20.7±2.1   17.1±4.7
pumadyn-32fh   8192  32  pol4  ≈10^22   57.3±0.7   56.4±0.8    57.5±0.4   56.4±0.7   56.4±0.8
pumadyn-32fh   8192  32  rbf   ≈10^31   57.7±0.6   72.2±22.5   89.3±2.0   56.5±0.8   55.7±0.7
pumadyn-32fm   8192  32  pol4  ≈10^22   6.9±0.1    7.0±0.1     6.4±1.6    7.5±0.2    3.1±0.0
pumadyn-32fm   8192  32  rbf   ≈10^31   5.0±0.1    46.2±51.6   44.7±5.7   7.1±0.1    3.4±0.0
pumadyn-32nh   8192  32  pol4  ≈10^22   84.2±1.3   73.3±25.4   84.8±0.5   83.6±1.3   36.7±0.4
pumadyn-32nh   8192  32  rbf   ≈10^31   56.5±1.1   81.3±25.0   98.1±0.7   83.7±1.3   35.5±0.5
pumadyn-32nm   8192  32  pol4  ≈10^22   60.1±1.9   69.9±32.8   78.5±1.1   77.5±0.9   5.5±0.1
pumadyn-32nm   8192  32  rbf   ≈10^31   15.7±0.4   67.3±42.4   95.9±1.9   77.6±0.9   7.2±0.1

Table 1: Mean squared errors (multiplied by 100) on UCI regression datasets, normalized so that the total variance to explain is 100. See text for details.

we can ensure correct hull selection. Finally, it is worth noting that if the ratios d_w / max_{v∈A(w)} d_v tend to infinity slowly with n, then we always consistently estimate the depth of the hull, i.e., the optimal interaction complexity. We are currently investigating extensions to the non-parametric case [4], in terms of pattern selection and universal consistency.

5 Simulations

Synthetic examples. We generated regression data as follows: n = 1024 samples of p ∈ [2^2, 2^7] variables were generated from a random covariance matrix, and the label y ∈ ℝ was sampled as a random sparse fourth-order polynomial of the input variables (with a constant number of monomials). We then compare the performance of our hierarchical multiple kernel learning method (HKL) with the polynomial kernel decomposition presented in Section 2 to other methods that use the same kernel and/or decomposition: (a) the greedy strategy of selecting basis kernels one after the other, a procedure similar to [13], and (b) the regular polynomial kernel regularization with the full kernel (i.e., the sum of all basis kernels). In Figure 2, we compare the two approaches on 40 replications in the following two situations: original data (left) and rotated data (right), i.e., after the input variables were transformed by a random rotation (in this situation, the generating polynomial is not sparse anymore). 
We can see that in situations where the underlying predictor function is sparse (left), HKL outperforms the two other methods when the total number of variables p increases, while in the other situation, where the best predictor is not sparse (right), it performs only slightly better: in non sparse problems, ℓ1-norms do not really help, but they help a lot when sparsity is expected.

UCI datasets   For regression datasets, we compare HKL with polynomial (degree 4) and Gaussian-RBF kernels (each dimension decomposed into 9 kernels) to the following approaches with the same kernel: regular Hilbertian regularization (L2), the same greedy approach as earlier (greedy), regularization by the ℓ1-norm directly on the vector α, a strategy sometimes used in the context of sparse kernel learning [14] that does not use the Hilbertian structure of the kernel (lasso-α), and multiple kernel learning with the p kernels obtained by summing all kernels associated with a single variable (MKL).

dataset     n     p    k     #(V)      L2         greedy     HKL
mushrooms   1024  117  pol4  ≈10^82    0.4±0.4    0.1±0.1    0.1±0.2
mushrooms   1024  117  rbf   ≈10^112   0.1±0.2    0.1±0.2    0.1±0.2
ringnorm    1024  20   pol4  ≈10^14    3.8±1.1    5.9±1.3    2.0±0.3
ringnorm    1024  20   rbf   ≈10^19    1.2±0.4    2.4±0.5    1.6±0.4
spambase    1024  57   pol4  ≈10^40    8.3±1.0    9.7±1.8    8.1±0.7
spambase    1024  57   rbf   ≈10^54    9.4±1.3    10.6±1.7   8.4±1.0
twonorm     1024  20   pol4  ≈10^14    2.9±0.5    4.7±0.5    3.2±0.6
twonorm     1024  20   rbf   ≈10^19    2.8±0.6    5.1±0.7    3.2±0.6
magic04     1024  10   pol4  ≈10^7     15.9±1.0   16.0±1.6   15.6±0.8
magic04     1024  10   rbf   ≈10^10    15.7±0.9   17.7±1.3   15.6±0.9

Table 2: Error rates (multiplied by 100) on UCI binary classification datasets. See text for details.
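The per-variable kernel construction used by the MKL baseline (one kernel per input variable, obtained by summing all basis kernels associated with that variable) can be sketched as follows. The 9-bandwidth Gaussian-RBF decomposition per dimension follows the description above, but the particular geometric bandwidth grid is an illustrative assumption.

```python
import numpy as np

def rbf_1d(x, gamma):
    """Gaussian-RBF basis kernel on a single input variable x of shape (n,)."""
    d = x[:, None] - x[None, :]
    return np.exp(-gamma * d**2)

def per_variable_kernels(X, gammas=tuple(2.0**k for k in range(-4, 5))):
    """One kernel per variable for the MKL baseline: the sum of that
    variable's basis kernels (here 9 RBF bandwidths; the geometric grid
    of bandwidths is an assumption, not taken from the paper)."""
    n, p = X.shape
    Ks = np.empty((p, n, n))
    for j in range(p):
        Ks[j] = sum(rbf_1d(X[:, j], g) for g in gammas)
    return Ks  # these p kernels are then handed to any MKL solver

Ks = per_variable_kernels(np.random.default_rng(0).standard_normal((6, 3)))
```

Summing a variable's basis kernels before running MKL keeps the number of kernel weights at p, whereas HKL keeps the basis kernels separate and exploits the DAG structure to select among all of them.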
For all methods, the kernels were held fixed, while in Table 1 we report the performance for the best regularization parameters obtained by 10 random half splits.

We can see from Table 1 that HKL outperforms the other methods, in particular for the datasets bank-32nm, bank-32nh, pumadyn-32nm and pumadyn-32nh, which are datasets dedicated to non linear regression. Note also that we efficiently explore DAGs with very large numbers of vertices #(V).

For binary classification datasets, we compare HKL (with the logistic loss) to two other methods (L2, greedy) in Table 2. For some datasets (e.g., spambase), HKL works better, but for some others, in particular when the generating problem is known to be non sparse (ringnorm, twonorm), it performs slightly worse than the other approaches.

6 Conclusion

We have shown how to perform hierarchical multiple kernel learning (HKL) in polynomial time in the number of selected kernels. This framework may be applied to many positive definite kernels, and we have focused on polynomial and Gaussian kernels used for nonlinear variable selection. In particular, this paper shows that trying to use ℓ1-type penalties may be advantageous inside the feature space. We are currently investigating applications to string and graph kernels [2].

References

[1] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
[2] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[3] P. Zhao and B. Yu. On model selection consistency of Lasso. JMLR, 7:2541–2563, 2006.
[4] F. Bach. Consistency of the group Lasso and multiple kernel learning. JMLR, 9:1179–1225, 2008.
[5] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Ann. Stat., to appear, 2008.
[6] C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-based classifiers. In Proc. ICML, 2000.
[7] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. In Proc. ICML, 2008.
[8] A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. JMLR, 9:2491–2521, 2008.
[9] M. Pontil and C. A. Micchelli. Learning the kernel function via regularization. JMLR, 6:1099–1125, 2005.
[10] F. Bach. Exploring large feature spaces with hierarchical MKL. Technical Report 00319660, HAL, 2008.
[11] H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In NIPS, 2007.
[12] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2003.
[13] K. Bennett, M. Momma, and J. Embrechts. MARK: A boosting algorithm for heterogeneous kernel models. In Proc. SIGKDD, 2002.
[14] V. Roth. The generalized Lasso. IEEE Trans. on Neural Networks, 15(1), 2004.