{"title": "Multiple Operator-valued Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2429, "page_last": 2437, "abstract": "Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an lr-norm constraint on the combination coefficients. The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces.", "full_text": "Multiple Operator-valued Kernel Learning\n\nHachem Kadri\n\nLIF - CNRS / INRIA Lille - Sequel Project\n\nUniversit\u00b4e Aix-Marseille\n\nMarseille, France\n\nAlain Rakotomamonjy\n\nLITIS EA 4108\n\nUniversit\u00b4e de Rouen\n\nSt Etienne du Rouvray, France\n\nhachem.kadri@lif.univ-mrs.fr\n\nalain.rakotomamony@insa-rouen.fr\n\nFrancis Bach\n\nINRIA - Sierra Project\n\nEcole Normale Sup\u00b4erieure\n\nParis, France\n\nfrancis.bach@inria.fr\n\nPhilippe Preux\n\nINRIA Lille - Sequel Project\n\nLIFL - CNRS, Universit\u00b4e de Lille\n\nVilleneuve d\u2019Ascq, France\n\nphilippe.preux@inria.fr\n\nAbstract\n\nPositive de\ufb01nite operator-valued kernels generalize the well-known notion of\nreproducing kernels, and are naturally adapted to multi-output learning situa-\ntions. 
This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an ℓ_r-norm constraint on the combination coefficients (r ≥ 1). The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning, since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinate-descent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces.

1 Introduction

During the past decades, a large number of algorithms have been proposed to deal with learning problems in the case of single-valued functions (e.g., binary-output functions for classification or real outputs for regression). Recently, there has been considerable interest in estimating vector-valued functions [21, 5, 7]. Much of this interest has arisen from the need to learn tasks where the target is a complex entity, not a scalar variable. Typical learning situations include multi-task learning [11], functional regression [12], and structured output prediction [4].
In this paper, we are interested in the problem of functional regression with functional responses in the context of brain-computer interface (BCI) design. More precisely, we are interested in finger movement prediction from electrocorticographic signals [23].
Indeed, from a set of signals measuring brain surface electrical activity on d channels during a given period of time, we want to predict, for any instant of that period, whether a finger is moving or not and the amplitude of the finger flexion. Formally, the problem consists in learning a functional dependency between a set of d signals and a sequence of labels (a step function indicating whether a finger is moving or not), and between the same set of signals and a vector of real values (the amplitude function). While it is clear that this problem can be formalized as a functional regression problem, from our point of view such a problem can benefit from the multiple operator-valued kernel learning framework. Indeed, for these problems, one of the difficulties arises from the unknown latency between the signal related to the finger movement and the actual movement [23]. Hence, instead of fixing in advance some value for this latency in the regression model, our framework allows us to learn it from the data by means of several operator-valued kernels.
If we wish to address the functional regression problem in the principled framework of reproducing kernel Hilbert spaces (RKHS), we have to consider RKHSs whose elements are operators that map a function to another function space, the source and target function spaces being possibly different. Working in such RKHSs, we are able to draw on the important body of work on scalar-valued and vector-valued RKHSs [28, 21]. Such a functional RKHS framework and associated operator-valued kernels have been introduced recently [12, 13]. A basic question with reproducing kernels is how to build these kernels and what is the optimal kernel choice for a given application.
In order to overcome the need to choose a kernel before the learning process, several works have addressed the problem of learning the scalar-valued kernel jointly with the decision function [18, 29]. Since these seminal works, many efforts have been made to theoretically analyze the kernel learning framework [9, 3] or to provide efficient algorithms [24, 1, 15]. While many works have been devoted to multiple scalar-valued kernel learning, kernel learning has barely been investigated for operator-valued kernels. One motivation of this work is to bridge the gap between multiple kernel learning (MKL) and operator-valued kernels by proposing a framework and an algorithm for learning a finite linear combination of operator-valued kernels. While each step of the scalar-valued MKL framework can be extended without major difficulty to operator-valued kernels, technical challenges arise at all stages because we deal with infinite-dimensional spaces. It should be pointed out that in a recent work [10], the problem of learning the output kernel was formulated as an optimization problem over the cone of positive semidefinite matrices, and a block-coordinate descent method was proposed to solve it. However, the authors did not focus on learning the input kernel.
In contrast, our multiple operator-valued\nkernel learning formulation can be seen as a way of learning simultaneously input and output ker-\nnels, although we consider a linear combination of kernels that are \ufb01xed in advance.\nIn this paper, we make the following contributions: 1) we introduce a novel approach to in\ufb01nite-\ndimensional multiple operator-valued kernel learning (MovKL) suitable for learning the functional\ndependencies and interactions between continuous data; 2) we extend the original formulation of\nridge regression in dual variables to the functional data analysis domain, showing how to perform\nnonlinear functional regression with functional responses by constructing a linear regression opera-\ntor in an operator-valued kernel feature space (Section 2); 3) we derive a dual form of the MovKL\nproblem with functional ridge regression, and show that a solution of the related optimization prob-\nlem exists (Section 2); 4) we propose a block-coordinate descent algorithm to solve the MovKL\noptimization problem which involves solving a challenging linear system with a sum of block op-\nerator matrices (Section 3); 5) we provide an empirical evaluation of MovKL performance which\ndemonstrates its effectiveness on a BCI dataset (Section 4).\n\n2 Problem Setting\n\nBefore describing the multiple operator-valued kernel learning algorithm that we will study and ex-\nperiment with in this paper, we \ufb01rst review notions and properties of reproducing kernel Hilbert\nspaces with operator-valued kernels, show their connection to learning from multiple response\ndata (multiple outputs; see [21] for discrete data and [12] for continuous data), and describe the\noptimization problem for learning kernels with functional response ridge regression.\n\n2.1 Notations and Preliminaries\n\nWe start by some standard notations and de\ufb01nitions used all along the paper. 
Given a Hilbert space H, ⟨·,·⟩_H and ‖·‖_H refer to its inner product and norm, respectively. We denote by G_x and G_y the separable real Hilbert spaces of input and output functional data, respectively. In the functional data analysis domain, continuous data are generally assumed to belong to the space of square-integrable functions L². In this work, we consider that G_x and G_y are the Hilbert space L²(Ω), which consists of all equivalence classes of square-integrable functions on a finite set Ω, Ω being potentially different for G_x and G_y. We denote by F(G_x, G_y) the vector space of functions f : G_x → G_y, and by L(G_y) the set of bounded linear operators from G_y to G_y.

We consider the problem of estimating a function f such that f(x_i) = y_i for observed functional data (x_i, y_i)_{i=1,...,n} ∈ (G_x, G_y). Since G_x and G_y are spaces of functions, the problem can be thought of as an operator estimation problem, where the desired operator maps a Hilbert space of factors to a Hilbert space of targets. We can define the regularized operator estimate of f ∈ F as:

  f_λ ≜ argmin_{f∈F} (1/n) ∑_{i=1}^{n} ‖y_i − f(x_i)‖²_{G_y} + λ‖f‖²_F .   (1)

In this work, we look for a solution to this minimization problem in a function-valued reproducing kernel Hilbert space F. More precisely, we mainly focus on the RKHS F whose elements are continuous linear operators on G_x with values in G_y. The continuity property is obtained by considering a special class of reproducing kernels called Mercer kernels [7, Proposition 2.2]. Note that in this case, F is separable since G_x and G_y are separable [6, Corollary 5.2].
We now introduce (function) G_y-valued reproducing kernel Hilbert spaces and show the correspondence between such spaces and positive definite (operator) L(G_y)-valued kernels.
These extend the traditional properties of scalar-valued kernels.

Definition 1 (function-valued RKHS)
A Hilbert space F of functions from G_x to G_y is called a reproducing kernel Hilbert space if there is a positive definite L(G_y)-valued kernel K_F(w, z) on G_x × G_x such that:
i. the function z ↦ K_F(w, z)g belongs to F, ∀z ∈ G_x, w ∈ G_x, g ∈ G_y;
ii. ∀f ∈ F, w ∈ G_x, g ∈ G_y, ⟨f, K_F(w, ·)g⟩_F = ⟨f(w), g⟩_{G_y} (reproducing property).

Definition 2 (operator-valued kernel)
An L(G_y)-valued kernel K_F(w, z) on G_x is a function K_F(·, ·) : G_x × G_x → L(G_y); furthermore:
i. K_F is Hermitian if K_F(w, z) = K_F(z, w)*, where * denotes the adjoint operator;
ii. K_F is positive definite on G_x if it is Hermitian and, for every natural number r and all {(w_i, u_i)}_{i=1,...,r} ∈ G_x × G_y, ∑_{i,j} ⟨K_F(w_i, w_j)u_j, u_i⟩_{G_y} ≥ 0.

Theorem 1 (bijection between function-valued RKHS and operator-valued kernel)
An L(G_y)-valued kernel K_F(w, z) on G_x is the reproducing kernel of some Hilbert space F if and only if it is positive definite.

The proof of Theorem 1 can be found in [21]. For further reading on operator-valued kernels and their associated RKHSs, see, e.g., [5, 6, 7].

2.2 Functional Response Ridge Regression in Dual Variables

We can write the ridge regression with functional responses optimization problem (1) as follows:

  min_{f∈F} (1/2)‖f‖²_F + (1/(2nλ)) ∑_{i=1}^{n} ‖ξ_i‖²_{G_y}  with ξ_i = y_i − f(x_i).   (2)

Now, we introduce the Lagrange multipliers α_i, i = 1, . . . , n, which are functional variables since the output space is the space of functions G_y. For the optimization problem (2), the Lagrange multipliers exist and the Lagrangian function is well defined.
The method of Lagrange multipliers on Banach spaces, a generalization of the classical (finite-dimensional) Lagrange multiplier method suitable for certain infinite-dimensional constrained optimization problems, is applied here; for more details, see [16]. Let α = (α_i)_{i=1,...,n} ∈ G_y^n be the vector of functions containing the Lagrange multipliers. The Lagrangian function is defined as

  L(f, α, ξ) = (1/2)‖f‖²_F + (1/(2nλ))‖ξ‖²_{G_y^n} + ⟨α, y − f(x) − ξ⟩_{G_y^n} ,   (3)

where α = (α_1, . . . , α_n) ∈ G_y^n, y = (y_1, . . . , y_n) ∈ G_y^n, f(x) = (f(x_1), . . . , f(x_n)) ∈ G_y^n, ξ = (ξ_1, . . . , ξ_n) ∈ G_y^n, and ∀a, b ∈ G_y^n, ⟨a, b⟩_{G_y^n} = ∑_{i=1}^{n} ⟨a_i, b_i⟩_{G_y}.

Differentiating (3) with respect to f ∈ F and setting the derivative to zero, we obtain

  f(·) = ∑_{i=1}^{n} K(x_i, ·) α_i ,   (4)

where K : G_x × G_x → L(G_y) is the operator-valued kernel of F.
Substituting this into (3) and minimizing with respect to ξ, we obtain the dual of the functional response kernel ridge regression (KRR) problem

  max_α −(nλ/2)‖α‖²_{G_y^n} − (1/2)⟨Kα, α⟩_{G_y^n} + ⟨α, y⟩_{G_y^n} ,   (5)

where K = [K(x_i, x_j)]_{i,j=1}^{n} is the block operator kernel matrix. The computational details regarding the dual formulation of functional KRR are derived in Appendix B of [14].

2.3 MovKL in Dual Variables

Let us now consider that the function f(·) is a sum of M functions {f_k(·)}_{k=1}^{M}, where each f_k belongs to a G_y-valued RKHS with kernel K_k(·, ·).
Similarly to scalar-valued multiple kernel learning, we can cast the problem of learning these functions f_k as

  min_{d∈D} min_{f_k∈F_k} ∑_{k=1}^{M} ‖f_k‖²_{F_k}/(2d_k) + (1/(2nλ)) ∑_{i=1}^{n} ‖ξ_i‖²_{G_y}  with ξ_i = y_i − ∑_{k=1}^{M} f_k(x_i),   (6)

where d = [d_1, · · · , d_M], D = {d : ∀k, d_k ≥ 0 and ∑_k d_k^r ≤ 1}, and 1 ≤ r ≤ ∞. Note that this problem can equivalently be rewritten as an unconstrained optimization problem. Before deriving the dual of this problem, it can be shown by means of the generalized Weierstrass theorem [17] that this problem admits a solution. We report the proof in Appendix A of [14].
Now, following the lines of [24], a dualization of this problem leads to the following equivalent one:

  min_{d∈D} max_{α∈G_y^n} −(nλ/2)‖α‖²_{G_y^n} − (1/2)⟨Kα, α⟩_{G_y^n} + ⟨α, y⟩_{G_y^n} ,   (7)

where K = ∑_{k=1}^{M} d_k K_k and K_k is the block operator kernel matrix associated with the operator-valued kernel K_k. The KKT conditions also state that at optimality we have f_k(·) = ∑_{i=1}^{n} d_k K_k(x_i, ·) α_i.

3 Solving the MovKL Problem

After having presented the framework, we now devise an algorithm for solving this MovKL problem.

3.1 Block-coordinate descent algorithm

Since the optimization problem (6) has the same structure as a multiple scalar-valued kernel learning problem, we can build our MovKL algorithm upon the MKL literature. Hence, we propose to borrow from [15] and consider a block-coordinate descent method. The convergence of block coordinate descent algorithms, which are closely related to the Gauss-Seidel method, was studied in the work of [30] and others.
The difference here is that we have operators and block operator matrices rather than matrices and block matrices, but this does not increase the complexity provided that the inverses of the operators are computable (typically analytically or by spectral decomposition). Our algorithm iteratively solves the problem with respect to α with d fixed, and then with respect to d with α fixed (see Algorithm 1). After having initialized {d_k} to non-zero values, this boils down to the following steps:

1. With {d_k} fixed, the resulting optimization problem with respect to α has the following form:

  (K + λI)α = y ,   (8)

where K = ∑_{k=1}^{M} d_k K_k. While the form of the solution is rather simple, solving this linear system is still challenging in the operator setting, and we propose below an algorithm for its resolution.

2. With {f_k} fixed, according to problem (6), we can rewrite the problem as

  min_{d∈D} ∑_{k=1}^{M} ‖f_k‖²_{F_k}/d_k ,   (9)

which has a closed-form solution, for which optimality occurs at [20]:

  d_k = ‖f_k‖^{2/(r+1)} / (∑_k ‖f_k‖^{2r/(r+1)})^{1/r} .   (10)

This algorithm is similar to those of [8] and [15], both being based on alternating optimization. The difference here is that we have to solve a linear system involving a block operator kernel matrix which is a combination of basic kernel matrices associated with M operator-valued kernels.
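In a finite-dimensional surrogate, where each block operator kernel matrix K_k is replaced by a plain Gram matrix, the alternating scheme above can be sketched as follows (a minimal illustration, not the authors' implementation; the function name and tolerances are our own):

```python
import numpy as np

def movkl_bcd(kernels, y, lam=0.1, r=2.0, n_iter=500, tol=1e-8):
    """Alternating scheme of the block-coordinate descent, on plain matrices.

    kernels: list of M (n, n) PSD Gram matrices standing in for the block
    operator kernel matrices K_k. Returns the weights d and dual variables alpha.
    """
    M, n = len(kernels), len(y)
    d = np.full(M, 1.0 / M)          # initialize {d_k} to non-zero values
    alpha = np.zeros(n)
    for _ in range(n_iter):
        # Step 1: with d fixed, solve (K + lam*I) alpha = y, i.e. eq. (8)
        K = sum(dk * Kk for dk, Kk in zip(d, kernels))
        alpha_new = np.linalg.solve(K + lam * np.eye(n), y)
        converged = np.linalg.norm(alpha_new - alpha) < tol
        alpha = alpha_new
        if converged:
            break
        # Step 2: closed-form weight update of eq. (10), using
        # ||f_k|| = d_k * sqrt(alpha^T K_k alpha)
        norms = np.array([dk * np.sqrt(max(alpha @ Kk @ alpha, 0.0))
                          for dk, Kk in zip(d, kernels)])
        denom = np.sum(norms ** (2.0 * r / (r + 1.0))) ** (1.0 / r)
        d = norms ** (2.0 / (r + 1.0)) / (denom + 1e-12)
    return d, alpha
```

After each weight update, the constraint ∑_k d_k^r = 1 holds by construction. In the operator setting, by contrast, step 1 cannot be handled by an ordinary matrix factorization.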
This makes the system very challenging, and we present an algorithm for solving it in the next subsection. We also report in Appendix C of [14] a convergence proof of a modified version of the MovKL algorithm that minimizes a perturbation of the objective function (6) with a small positive parameter, as required to guarantee convergence [2].

3.2 Solving a linear system involving multiple operator-valued kernel matrices

One common way to construct operator-valued kernels is to build scalar-valued ones which are carried over to the vector-valued (resp. function-valued) setting by a positive definite matrix (resp. operator). In this setting, an operator-valued kernel has the following form:

  K(w, z) = G(w, z) T ,

where G is a scalar-valued kernel and T is a positive operator in L(G_y). In multi-task learning, T is a finite-dimensional matrix that is expected to share information between tasks [11, 5]. More recently, for supervised functional output learning problems, T has been chosen to be a multiplication or an integral operator [12, 13]. This choice is motivated by the fact that functional linear models for functional responses [25] are based on these operators, so such kernels provide an interesting way to extend these models to nonlinear contexts. In addition, some works on functional regression and structured-output learning consider operator-valued kernels constructed from the identity operator, as in [19] and [4]. In this work, we adopt a functional data analysis point of view and are therefore interested in a finite combination of operator-valued kernels constructed from identity, multiplication and integral operators. A problem encountered when working with operator-valued kernels in infinite-dimensional spaces is that of solving the system of linear operator equations (8).
In the following, we show how to solve this problem for two cases of operator-valued kernel combinations.

Case 1: multiple scalar-valued kernels and one operator. This is the simpler case, where the combination of operator-valued kernels has the following form:

  K(w, z) = ∑_{k=1}^{M} d_k G_k(w, z) T ,   (11)

where G_k is a scalar-valued kernel, T is a positive operator in L(G_y), and d_k are the combination coefficients. In this setting, the block operator kernel matrix K can be expressed as a Kronecker product between the multiple scalar-valued kernel matrix G = ∑_{k=1}^{M} d_k G_k, where G_k = [G_k(x_i, x_j)]_{i,j=1}^{n}, and the operator T. Thus we can compute an analytic solution of the system of equations (8) by inverting K + λI using the eigendecompositions of G and T, as in [13].

Case 2: multiple scalar-valued kernels and multiple operators. This is the general case, where multiple operator-valued kernels are combined as follows:

  K(w, z) = ∑_{k=1}^{M} d_k G_k(w, z) T_k ,   (12)

where G_k is a scalar-valued kernel, T_k is a positive operator in L(G_y), and d_k are the combination coefficients.

Algorithm 1 ℓ_r-norm MovKL
  Input: K_k for k = 1, . . . , M
  d_k^1 ← 1/M;  α ← 0
  for t = 1, 2, . . . do
    α′ ← α
    K ← ∑_k d_k^t K_k
    α ← solution of (K + λI)α = y
    if ‖α − α′‖ < ε then break
    d_k^{t+1} ← ‖f_k‖^{2/(r+1)} / (∑_k ‖f_k‖^{2r/(r+1)})^{1/r}  for k = 1, . . . , M
  end for

Algorithm 2 Gauss-Seidel Method
  choose an initial vector of functions α^(0)
  repeat
    for i = 1, 2, . . . , n
      α_i^(t) ← solution of (13): [K(x_i, x_i) + λI] α_i^(t) = s_i
    end for
  until convergence
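For Case 1, the Kronecker structure makes the solve direct. In a discretized setting, where G_y is represented on a grid of p points and the α_i are stacked as rows of a matrix A, equation (8) becomes G A T + λA = Y, which the two eigendecompositions diagonalize (a minimal sketch under these assumptions; the function name is ours):

```python
import numpy as np

def solve_case1(G, T, Y, lam):
    """Solve G A T + lam * A = Y for A, the discretized form of
    (K + lam*I) alpha = y when K = G kron T (Case 1).

    G: (n, n) combined scalar Gram matrix, symmetric PSD.
    T: (p, p) symmetric PSD matrix discretizing the operator T.
    Y: (n, p) output curves, one sampled curve per row.
    """
    eg, U = np.linalg.eigh(G)          # G = U diag(eg) U^T
    et, V = np.linalg.eigh(T)          # T = V diag(et) V^T
    B = U.T @ Y @ V                    # rotate into the joint eigenbasis
    B = B / (np.outer(eg, et) + lam)   # divide by eig_i(G)*eig_j(T) + lam
    return U @ B @ V.T                 # rotate back
```

The cost is one n×n and one p×p eigendecomposition, instead of factoring the full np × np Kronecker matrix.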
Inverting the associated block operator kernel matrix K is not feasible in this case, which is why we propose a Gauss-Seidel iterative procedure (see Algorithm 2) to solve the system of linear operator equations (8). Starting from an initial vector of functions α^(0), the idea is to iteratively compute, until a convergence condition is satisfied, the functions α_i according to the following expression:

  [K(x_i, x_i) + λI] α_i^(t) = y_i − ∑_{j=1}^{i−1} K(x_i, x_j) α_j^(t) − ∑_{j=i+1}^{n} K(x_i, x_j) α_j^(t−1) ,   (13)

where t is the iteration index. This problem is still challenging because the kernel K(·, ·) still involves a positive combination of operator-valued kernels. Our algorithm is based on the idea that, instead of inverting the finite combination of operator-valued kernels [K(x_i, x_i) + λI], we can consider the following variational formulation of this system:

  min_{α_i^(t)} (1/2)⟨ ∑_{k=1}^{M+1} K_k(x_i, x_i) α_i^(t), α_i^(t) ⟩_{G_y} − ⟨s_i, α_i^(t)⟩_{G_y} ,

where s_i = y_i − ∑_{j=1}^{i−1} K(x_i, x_j) α_j^(t) − ∑_{j=i+1}^{n} K(x_i, x_j) α_j^(t−1), K_k = d_k G_k T_k for all k ∈ {1, . . . , M}, and K_{M+1} = λI.
Now, by means of a variable-splitting approach, we are able to decouple the roles of the different kernels. Indeed, the above problem is equivalent to the following one:

  min_{α_i^(t)} (1/2)⟨K̂(x_i, x_i) α_i^(t), α_i^(t)⟩ − ⟨s_i, α_i^(t)⟩  with α_{i,1}^(t) = α_{i,k}^(t) for k = 2, . . . , M+1,

where K̂(x_i, x_i) is the (M+1) × (M+1) diagonal matrix [K_k(x_i, x_i)]_{k=1}^{M+1}, α_i^(t) is the vector (α_{i,1}^(t), . . . , α_{i,M+1}^(t)), and the (M+1)-dimensional vector s_i = (s_i, 0, . . . , 0).
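Setting the splitting aside for a moment, the outer sweep of Algorithm 2 over equation (13) can be sketched in a discretized setting, where each operator block K(x_i, x_j) is a (p, p) matrix and each α_i a p-vector (a finite-dimensional illustration only; in that setting the diagonal blocks can be assembled and factored directly, which is exactly what the splitting below avoids in infinite dimensions):

```python
import numpy as np

def block_gauss_seidel(Kb, y, lam, n_sweeps=200, tol=1e-10):
    """Gauss-Seidel sweeps for the block system (K + lam*I) alpha = y.

    Kb[i][j]: (p, p) matrix discretizing the operator block K(x_i, x_j).
    y[i]: p-vector standing in for the function y_i in G_y.
    """
    n, p = len(y), len(y[0])
    alpha = [np.zeros(p) for _ in range(n)]
    for _ in range(n_sweeps):
        delta = 0.0
        for i in range(n):
            # right-hand side of eq. (13): blocks j < i use the freshly
            # updated alpha_j, blocks j > i the previous sweep's values
            s = y[i] - sum(Kb[i][j] @ alpha[j] for j in range(n) if j != i)
            new = np.linalg.solve(Kb[i][i] + lam * np.eye(p), s)
            delta = max(delta, np.linalg.norm(new - alpha[i]))
            alpha[i] = new
        if delta < tol:
            break
    return np.array(alpha)
```

Since K is positive semidefinite, K + λI is symmetric positive definite and the sweeps converge to the solution of (8).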
We now have to deal with a quadratic optimization problem with equality constraints. Writing down the Lagrangian of this optimization problem and deriving its first-order optimality conditions leads to the following set of linear equations:

  K_1(x_i, x_i) α_{i,1} − s_i + ∑_k γ_k = 0
  K_k(x_i, x_i) α_{i,k} − γ_k = 0
  α_{i,1} − α_{i,k} = 0   (14)

where k = 2, . . . , M+1 and {γ_k} are the Lagrange multipliers related to the M equality constraints. In this set of equations, the operator-valued kernels have been decoupled; thus, if their inverses can be easily computed (which is the case in our experiments), one can solve problem (14) with respect to {α_{i,k}} and γ_k by means of another Gauss-Seidel algorithm, after a simple reorganization of the linear system.

Figure 1: Example of a couple of input-output signals in our BCI task. (left) Amplitude modulation features extracted from ECoG signals over 5 pre-defined channels. (middle) Signal of labels denoting whether the finger is moving or not. (right) Real amplitude movement of the finger.

4 Experiments

In order to highlight the benefit of our multiple operator-valued kernel learning approach, we have considered a series of experiments on a real dataset, involving functional output prediction in a brain-computer interface framework. The problem we address is a sub-problem of finger movement decoding from electrocorticographic (ECoG) signals. We focus on the problem of estimating whether a finger is moving or not, and also on the direct estimation of the finger movement amplitude from the ECoG signals.
The development of the full BCI application is beyond the scope of this paper; our objective here is to show that this finger movement prediction problem can benefit from multiple kernel learning.
To this aim, the fourth dataset from the BCI Competition IV [22] was used. The subjects were 3 epileptic patients who had platinum electrode grids placed on the surface of their brains. The number of electrodes varies between 48 and 64 depending on the subject, and their positions on the cortex were unknown. ECoG signals of the subjects were recorded at a 1 kHz sampling rate using BCI2000 [27]. A band-pass filter from 0.15 to 200 Hz was applied to the ECoG signals. The finger flexion of each subject was recorded at 25 Hz and up-sampled to 1 kHz by means of a data glove which measures the finger movement amplitude. Due to the acquisition process, a delay appears between the finger movement and the measured ECoG signal [22]. One of our hopes is that this time lag can be properly learnt by means of multiple operator-valued kernels. Features from the ECoG signals are built by computing band-specific amplitude modulation features, defined as the sum of the squares of the band-specific filtered ECoG signal samples during a fixed time window.
For our finger movement prediction task, we have kept 5 channels that were manually selected, and we split the ECoG signals into portions of 200 samples. For each of these time segments, we have the label of whether, at each time sample, the finger is moving or not, as well as the real movement amplitudes. The dataset is composed of 487 couples of input-output signals, the output signals being either the binary movement labels or the real amplitude movements. An example of a couple of input-output signals is depicted in Figure 1.
In a nutshell, the problem boils down to a functional regression task with functional responses.
To evaluate the performance of the multiple operator-valued kernel learning approach, we use: (1) the percentage of labels correctly recognized (LCR), defined by (Wr/Tn) × 100%, where Wr is the number of well-recognized labels and Tn the total number of labels to be recognized; (2) the residual sum of squares error (RSSE) as the evaluation criterion for curve prediction:

  RSSE = ∫ ∑_i {y_i(t) − ŷ_i(t)}² dt ,   (15)

where ŷ_i(t) is the prediction of the function y_i(t) corresponding to the real finger movement or the finger movement state.
For the multiple operator-valued kernels having the form (12), we have used a Gaussian kernel with 5 different bandwidths and polynomial kernels of degree 1 to 3, combined with three operators T: the identity, T y(t) = y(t); the multiplication operator associated with the function e^{−t²}, defined by T y(t) = e^{−t²} y(t); and the integral Hilbert-Schmidt operator with kernel e^{−|t−s|} proposed in [13], T y(t) = ∫ e^{−|t−s|} y(s) ds. The inverses of these operators can be computed analytically.

Table 1: (Left) Results for the movement state prediction. Residual Sum of Squares Error (RSSE) and the percentage number of Labels Correctly Recognized (LCR) of: (1) baseline KRR with the Gaussian kernel, (2) functional response KRR with the integral operator-valued kernel, (3) MovKL with ℓ∞-, ℓ1- and ℓ2-norm constraints.
(Right) Residual Sum of Squares Error (RSSE) results for finger movement prediction.

Movement state prediction:
  Algorithm                     RSSE    LCR (%)
  KRR - scalar-valued -         68.32   72.91
  KRR - functional response -   49.40   80.20
  MovKL - ℓ∞ norm -             45.44   81.34
  MovKL - ℓ1 norm -             48.12   80.66
  MovKL - ℓ2 norm -             39.36   84.72

Finger movement prediction:
  Algorithm                     RSSE
  KRR - scalar-valued -         88.21
  KRR - functional response -   79.86
  MovKL - ℓ∞ norm -             76.52
  MovKL - ℓ1 norm -             78.24
  MovKL - ℓ2 norm -             75.15

While the inverses of the identity and multiplication operators are directly computable from the analytic expressions of the operators, the inverse of the integral operator is computed from its spectral decomposition, as in [13]. The number of eigenfunctions as well as the regularization parameter λ are fixed using "one-curve-leave-out cross-validation" [26], with the aim of minimizing the residual sum of squares error.
Empirical results on the BCI dataset are summarized in Table 1. The dataset was randomly partitioned into 65% training and 35% test sets. We compare our approach, in the case of ℓ1- and ℓ2-norm constraints on the combination coefficients, with: (1) the baseline scalar-valued kernel ridge regression algorithm, considering each output independently of the others; (2) functional response ridge regression using an integral operator-valued kernel [13]; (3) kernel ridge regression with an evenly-weighted sum of operator-valued kernels, which we denote by ℓ∞-norm MovKL.
As in the scalar case, using multiple operator-valued kernels leads to better results. By directly combining kernels constructed from identity, multiplication and integral operators, we could reduce the residual sum of squares error and enhance the label classification accuracy.
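For reference, the two evaluation criteria can be computed from sampled curves as follows (a small helper sketch; the function names and the sampling step `dt`, which turns the integral of eq. (15) into a Riemann sum, are our own assumptions):

```python
import numpy as np

def rsse(Y_true, Y_pred, dt=1.0):
    """Residual sum of squares error between sets of sampled curves,
    approximating the time integral by a Riemann sum with step dt."""
    diff = np.asarray(Y_true) - np.asarray(Y_pred)
    return float(np.sum(diff ** 2) * dt)

def lcr(labels_true, labels_pred):
    """Percentage of labels correctly recognized, (Wr / Tn) * 100."""
    lt, lp = np.asarray(labels_true), np.asarray(labels_pred)
    return 100.0 * np.mean(lt == lp)
```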
Best results are obtained\nusing the MovKL algorithm with (cid:96)2-norm constraint on the combination coef\ufb01cients. RSSE and\nLCR of the baseline kernel ridge regression are signi\ufb01cantly outperformed by the operator-valued\nkernel based functional response regression. These results con\ufb01rm that by taking into account the\nrelationship between outputs we can improve performance. This is due to the fact that an operator-\nvalued kernel induces a similarity measure between two pairs of input/output.\n\n5 Conclusion\n\nIn this paper we have presented a new method for learning simultaneously an operator and a \ufb01-\nnite linear combination of operator-valued kernels. We have extended the MKL framework to deal\nwith functional response kernel ridge regression and we have proposed a block coordinate descent\nalgorithm to solve the resulting optimization problem. The method is applied on a BCI dataset to\npredict \ufb01nger movement in a functional regression setting. Experimental results show that our algo-\nrithm achieves good performance outperforming existing methods. It would be interesting for future\nwork to thoroughly compare the proposed MKL method for operator estimation with previous re-\nlated methods for multi-class and multi-label MKL in the contexts of structured-output learning and\ncollaborative \ufb01ltering.\n\nAcknowledgments\n\nWe would like to thank the anonymous reviewers for their valuable comments. This research was\nfunded by the Ministry of Higher Education and Research, Nord-Pas-de-Calais Regional Council\nand FEDER (Contrat de Projets Etat Region CPER 2007-2013), ANR projects LAMPADA (ANR-\n09-EMER-007) and ASAP (ANR-09-EMER-001), and by the IST Program of the European Com-\nmunity under the PASCAL2 Network of Excellence (IST-216886). This publication only re\ufb02ects the\nauthors\u2019 views. Francis Bach was partially supported by the European Research Council (SIERRA\nProject).\n\n8\n\n\fReferences\n\n[1] J. A\ufb02alo, A. 
Ben-Tal, C. Bhattacharyya, J. Saketha Nath, and S. Raman. Variable sparsity kernel learning. JMLR, 12:565-592, 2011.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
[3] F. Bach. Consistency of the group Lasso and multiple kernel learning. JMLR, 9:1179-1225, 2008.
[4] C. Brouard, F. d'Alché-Buc, and M. Szafranski. Semi-supervised penalized output kernel regression for link prediction. In Proc. ICML, 2011.
[5] A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying. Universal multi-task kernels. JMLR, 68:1615-1646, 2008.
[6] C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4:377-408, 2006.
[7] C. Carmeli, E. De Vito, and A. Toigo. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8:19-61, 2010.
[8] C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Proc. UAI, 2009.
[9] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In Proc. ICML, 2010.
[10] F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In Proc. ICML, 2011.
[11] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. JMLR, 6:615-637, 2005.
[12] H. Kadri, E. Duflos, P. Preux, S. Canu, and M. Davy. Nonlinear functional regression: a functional RKHS approach. In Proc. AISTATS, pages 111-125, 2010.
[13] H. Kadri, A. Rabaoui, P. Preux, E. Duflos, and A. Rakotomamonjy. Functional regularized least squares classification with operator-valued kernels. In Proc. ICML, 2011.
[14] H. Kadri, A. Rakotomamonjy, F. Bach, and P. Preux. Multiple operator-valued kernel learning. Technical Report 00677012, INRIA, 2012.
[15] M.
Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. ℓp-norm multiple kernel learning. JMLR, 12:953-997, 2011.
[16] S. Kurcyusz. On the existence and nonexistence of Lagrange multipliers in Banach spaces. Journal of Optimization Theory and Applications, 20:81-110, 1976.
[17] A. Kurdila and M. Zabarankin. Convex Functional Analysis. Birkhäuser Verlag, 2005.
[18] G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004.
[19] H. Lian. Nonlinear functional models for functional responses in reproducing kernel Hilbert spaces. The Canadian Journal of Statistics, 35:597-606, 2007.
[20] C. Micchelli and M. Pontil. Learning the kernel function via regularization. JMLR, 6:1099-1125, 2005.
[21] C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Comput., 17:177-204, 2005.
[22] K. J. Miller and G. Schalk. Prediction of finger flexion: 4th brain-computer interface data competition. BCI Competition IV, 2008.
[23] T. Pistohl, T. Ball, A. Schulze-Bonhage, A. Aertsen, and C. Mehring. Prediction of arm movement trajectories from ECoG-recordings in humans. Journal of Neuroscience Methods, 167(1):105-114, 2008.
[24] A. Rakotomamonjy, F. Bach, Y. Grandvalet, and S. Canu. SimpleMKL. JMLR, 9:2491-2521, 2008.
[25] J. O. Ramsay and B. W. Silverman. Functional Data Analysis, 2nd ed. Springer Verlag, New York, 2005.
[26] J. A. Rice and B. W. Silverman. Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society, Series B, 53(1):233-243, 1991.
[27] G. Schalk, D. J. McFarland, T. Hinterberger, N. Birbaumer, and J. R. Wolpaw. BCI2000: a general-purpose brain-computer interface system. Biomedical Engineering, IEEE Trans. on, 51:1034-1043, 2004.
[28] B. Schölkopf and A. J. Smola.
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002.
[29] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. JMLR, 7:1531-1565, 2006.
[30] P. Tseng. Convergence of block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109:475-494, 2001.