{"title": "Consistent Multitask Learning with Nonlinear Output Relations", "book": "Advances in Neural Information Processing Systems", "page_first": 1986, "page_last": 1996, "abstract": "Key to multitask learning is exploiting the relationships between different tasks to improve prediction performance. Most previous methods have focused on the case where tasks relations can be modeled as linear operators and regularization approaches can be used successfully. However, in practice assuming the tasks to be linearly related is often restrictive, and allowing for nonlinear structures is a challenge. In this paper, we tackle this issue by casting the problem within the framework of structured prediction. Our main contribution is a novel algorithm for learning multiple tasks which are related by a system of nonlinear equations that their joint outputs need to satisfy. We show that our algorithm can be efficiently implemented and study its generalization properties, proving universal consistency and learning rates. Our theoretical analysis highlights the benefits of non-linear multitask learning over learning the tasks independently. 
Encouraging experimental results show the benefits of the proposed method in practice.", "full_text": "Consistent Multitask Learning with Nonlinear Output Relations

Carlo Ciliberto •,1  Alessandro Rudi •,∗,2  Lorenzo Rosasco 3,4,5  Massimiliano Pontil 1,5
{c.ciliberto,m.pontil}@ucl.ac.uk  alessandro.rudi@inria.fr  lrosasco@mit.edu
1 Department of Computer Science, University College London, London, UK.
2 INRIA - Sierra Project-team and École Normale Supérieure, Paris, France.
3 Massachusetts Institute of Technology, Cambridge, USA.
4 Università degli Studi di Genova, Genova, Italy.
5 Istituto Italiano di Tecnologia, Genova, Italy.
• Equal contribution.

Abstract

Key to multitask learning is exploiting the relationships between different tasks in order to improve prediction performance. Most previous methods have focused on the case where task relations can be modeled as linear operators and regularization approaches can be used successfully. However, in practice assuming the tasks to be linearly related is often restrictive, and allowing for nonlinear structures is a challenge. In this paper, we tackle this issue by casting the problem within the framework of structured prediction. Our main contribution is a novel algorithm for learning multiple tasks which are related by a system of nonlinear equations that their joint outputs need to satisfy. We show that our algorithm can be efficiently implemented and study its generalization properties, proving universal consistency and learning rates. Our theoretical analysis highlights the benefits of nonlinear multitask learning over learning the tasks independently. Encouraging experimental results show the benefits of the proposed method in practice.

1 Introduction

Improving the efficiency of learning from human supervision is one of the great challenges in machine learning.
Multitask learning is one of the key approaches in this sense. It is based on the assumption that different learning problems (i.e. tasks) are often related, a property that can be exploited to reduce the amount of data needed to learn each individual task and, in particular, to learn novel tasks efficiently (a.k.a. transfer learning, learning to learn [1]). Special cases of multitask learning include vector-valued regression and multi-category classification; applications are numerous, including classic ones in geophysics, recommender systems, co-kriging or collaborative filtering (see [2, 3, 4] and references therein). Diverse methods have been proposed to tackle this problem, for example based on kernel methods [5], sparsity approaches [3] or neural networks [6]. Furthermore, recent theoretical results have allowed quantifying the benefits of multitask learning from a generalization point of view for specific methods [7, 8].

A common challenge for multitask learning approaches is how to incorporate prior assumptions on task relatedness into the learning process. This can be done implicitly, as in neural networks [6], or explicitly, as done in regularization methods by designing suitable regularizers [5]. This latter approach is flexible enough to incorporate different notions of task relatedness expressed, for example, in terms of clusters or a graph, see e.g. [9, 10]. Further, it can be extended to learn the task structures when they are unknown [3, 11, 12, 13, 14, 15, 16]. However, most regularization approaches are currently limited to imposing, or learning, task structures expressed by linear relations (see Sec. 5).

∗ Work performed while A.R. was at the Istituto Italiano di Tecnologia.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
For example, an adjacency matrix in the case of graphs, or a block matrix in the case of clusters. While such a restriction makes the problem more amenable to statistical and computational analysis, in practice it can be a severe limitation.

Encoding and exploiting nonlinear task relatedness is the problem we consider in this paper. Previous literature on the topic is scarce. Neural networks naturally allow learning with nonlinear relations, but it is unclear whether such relations can be imposed a priori. As explained below, our problem has some connections to that of manifold-valued regression [17]. To our knowledge this is the first work addressing the problem of explicitly imposing nonlinear output relations among multiple tasks. Close to our perspective is [18], where however a different approach is proposed, implicitly enforcing a nonlinear structure on the problem by requiring the parameters of each task predictor to lie on a shared manifold in the hypothesis space.

Our main contribution is a novel method for learning multiple tasks which are nonlinearly related. We address this problem from the perspective of structured prediction (see [19, 20] and references therein), building upon ideas recently proposed in [21]. Specifically, we look at multitask learning as the problem of learning a vector-valued function taking values in a prescribed set, which models task interactions. We also discuss how to deal with possible violations of such a constraint set. We study the generalization properties of the proposed approach, proving universal consistency and learning rates. Our theoretical analysis also identifies specific training regimes in which multitask learning is clearly beneficial in contrast to learning all tasks independently.

2 Problem Formulation

Multitask learning (MTL) studies the problem of estimating multiple (real-valued) functions

  f_1, ..., f_T : X → R    (1)

from corresponding training sets (x_it, y_it)_{i=1}^{n_t}, with x_it ∈ X and y_it ∈ R, for t = 1, ..., T. The key idea in MTL is to estimate f_1, ..., f_T jointly, rather than independently. The intuition is that if the different tasks are related, this strategy can lead to a substantial decrease in sample complexity, that is, the amount of data needed to achieve a given accuracy. The crucial question is then how to encode and exploit such relations among the tasks.

Previous work on MTL has mostly focused on the case where the tasks are linearly related (see Sec. 5). Indeed, this captures a wide range of relevant situations, and the resulting problem can often be cast as a convex optimization, which can be solved efficiently. However, it is not hard to imagine situations where different tasks are nonlinearly related. As a simple example, consider the problem of learning two functions f_1, f_2 : [0, 2π] → R with f_1(x) = cos(x) and f_2(x) = sin(x). Clearly the two tasks are strongly related to one another (they need to satisfy f_1(x)^2 + f_2(x)^2 − 1 = 0 for all x ∈ [0, 2π]), but such structure is nonlinear (here an equation of degree 2). More realistic examples can be found, for instance, in the context of modeling physical systems, such as a robot manipulator. A prototypical learning problem (see e.g. [22]) is to associate the current state of the system (position, velocity, acceleration) to a variety of measurements (e.g. torques) that are nonlinearly related to one another by physical constraints (see e.g. [23]).

Following the intuition above, in this work we model task relations as a set of P equations. Specifically, we consider a constraint function γ : R^T → R^P and require that γ(f_1(x), ..., f_T(x)) = 0 for all x ∈ X. When γ is linear, the problem reverts to linear MTL and can be addressed via standard approaches (see Sec.
5). On the contrary, the nonlinear case is significantly more challenging and it is not clear how to address it in general. The starting point of our study is to consider the task predictors as a vector-valued function f = (f_1, ..., f_T) : X → R^T and to observe that γ imposes constraints on its range. Specifically, in this work we restrict f : X → C to take values in the constraint set

  C = { y ∈ R^T | γ(y) = 0 } ⊆ R^T    (2)

and formulate the nonlinear multitask learning problem as that of finding a good approximation f̂ : X → C to the solution of the multitask expected risk minimization problem

  minimize_{f : X → C} E(f),    E(f) = (1/T) Σ_{t=1}^T ∫_{X×R} ℓ(f_t(x), y) dρ_t(x, y)    (3)

where ℓ : R × R → R is a prescribed loss function measuring prediction errors for each individual task and, for every t = 1, ..., T, ρ_t is the distribution on X × R from which the training points (x_it, y_it)_{i=1}^{n_t} have been independently sampled.

Nonlinear MTL poses several challenges to standard machine learning approaches. Indeed, when C is a linear space (e.g. γ is a linear map), the typical strategy to tackle problem (3) is to minimize the empirical risk (1/T) Σ_{t=1}^T (1/n_t) Σ_{i=1}^{n_t} ℓ(f_t(x_it), y_it) over some suitable space of hypotheses f : X → C within which optimization can be performed efficiently. However, if C is a nonlinear subset of R^T, it is not clear how to parametrize a "good" space of functions, since most basic properties typically needed by optimization algorithms are lost (e.g. f_1, f_2 : X → C does not necessarily imply f_1 + f_2 : X → C).
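This failure of linearity is easy to check numerically in the cosine/sine example; a small numpy sketch (the function and variable names are ours, for illustration only):

```python
import numpy as np

def gamma(y):
    """Constraint defining C = {y in R^2 : y1^2 + y2^2 - 1 = 0}.

    Works on a single point y = (y1, y2) or, vectorized, on an
    array of shape (2, m) holding m points column-wise.
    """
    return y[0] ** 2 + y[1] ** 2 - 1.0

x = np.linspace(0.0, 2.0 * np.pi, 100)
f = np.stack([np.cos(x), np.sin(x)])   # a valid predictor: f(x) in C for all x
g = np.stack([np.sin(x), np.cos(x)])   # another valid predictor

# f and g both satisfy the constraint everywhere...
print(np.max(np.abs(gamma(f))), np.max(np.abs(gamma(g))))  # both ~0

# ...but their sum does not: C is not closed under addition.
print(np.max(np.abs(gamma(f + g))))
```

The sum (cos x + sin x, sin x + cos x) violates the quadratic constraint by as much as 3, confirming that C is not a linear space.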
To address this issue, in this paper we adopt the structured prediction perspective proposed in [21], which we review in the following.

2.1 Background: Structured Prediction and the SELF Framework

The term structured prediction typically refers to supervised learning problems with discrete outputs, such as strings or graphs [19, 20, 24]. The framework in [21] generalizes this perspective to a more flexible formulation of structured prediction, where the goal is to learn an estimator approximating the minimizer of

  minimize_{f : X → C} ∫_{X×Y} L(f(x), y) dρ(x, y)    (4)

given a training set (x_i, y_i)_{i=1}^n of points independently sampled from an unknown distribution ρ on X × Y, where L : Y × Y → R is a loss function. The output sets Y and C ⊆ Y are not assumed to be linear spaces but can be either discrete (e.g. strings, graphs, etc.) or dense (e.g. manifolds, distributions, etc.) sets of "structured" objects. This generalization will be key to tackling multitask learning with nonlinear output relations in Sec. 3, since it allows the case where C is a generic subset of Y = R^T. The analysis in [21] hinges on the assumption that the loss L is "bi-linearizable", namely:

Definition 1 (SELF). Let Y be a compact set. A function ℓ : Y × Y → R is a Structure Encoding Loss Function (SELF) if there exists a continuous feature map ψ : Y → H, with H a reproducing kernel Hilbert space on Y, and a continuous linear operator V : H → H such that for all y, y′ ∈ Y

  ℓ(y, y′) = ⟨ψ(y), V ψ(y′)⟩_H.    (5)

In the original work the SELF definition was dubbed the "loss trick", as a parallel to the kernel trick [25]. As we discuss in Sec. 4, most MTL loss functions indeed satisfy the SELF property.
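For intuition, the square loss on a bounded interval admits an explicit finite-dimensional decomposition of the form (5). The feature map and operator below are our own illustrative choice (with H = R^3), not the construction used in the paper's proofs:

```python
import numpy as np

def psi(y):
    """Feature map psi : R -> H = R^3 (our choice for this example)."""
    return np.array([y ** 2, y, 1.0])

# Fixed linear operator V : R^3 -> R^3 chosen so that
# <psi(y), V psi(y')> = y^2 - 2*y*y' + y'^2 = (y - y')^2.
V = np.array([[0.0,  0.0, 1.0],
              [0.0, -2.0, 0.0],
              [1.0,  0.0, 0.0]])

for y, yp in [(0.3, -1.2), (2.0, 0.5), (-0.7, -0.7)]:
    assert np.isclose(psi(y) @ V @ psi(yp), (y - yp) ** 2)
```

Note that V is a fixed matrix, independent of y and y′, as Definition 1 requires; the dependence on the arguments enters only through the feature map.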
Under this assumption, it can be shown that a solution f∗ : X → C of Eq. (4) must satisfy

  f∗(x) = argmin_{c∈C} ⟨ψ(c), V g∗(x)⟩_H,    with    g∗(x) = ∫_Y ψ(y) dρ(y|x)    (6)

for all x ∈ X (see [21] or the Appendix). Since g∗ : X → H is a function with values in a linear space, we can apply standard regression techniques to learn a ĝ : X → H approximating g∗ given (x_i, ψ(y_i))_{i=1}^n, and then obtain the estimator f̂ : X → C as

  f̂(x) = argmin_{c∈C} ⟨ψ(c), V ĝ(x)⟩_H    ∀ x ∈ X.    (7)

The intuition here is that if ĝ is close to g∗, then f̂ will be close to f∗ (see Sec. 4 for a rigorous analysis of this relation). If ĝ is the kernel ridge regression estimator obtained by minimizing the empirical risk (1/n) Σ_{i=1}^n ‖g(x_i) − ψ(y_i)‖²_H (plus regularization), Eq. (7) becomes

  f̂(x) = argmin_{c∈C} Σ_{i=1}^n α_i(x) L(c, y_i),    α(x) = (α_1(x), ..., α_n(x))ᵀ = (K + nλI)^{−1} K_x    (8)

since ĝ can be written as the linear combination ĝ(x) = Σ_{i=1}^n α_i(x) ψ(y_i) and the loss function L is SELF. In the formula above, λ > 0 is a hyperparameter, I ∈ R^{n×n} is the identity matrix, K ∈ R^{n×n} the kernel matrix with elements K_ij = k(x_i, x_j), K_x ∈ R^n the vector with entries (K_x)_i = k(x, x_i), and k : X × X → R a reproducing kernel on X.

The SELF structured prediction approach is therefore conceptually divided into two distinct phases: a learning step, in which the score functions α_i : X → R are estimated by solving the kernel ridge regression for ĝ, followed by a prediction step, in which the vector c ∈ C minimizing the weighted sum in Eq. (8) is identified. Interestingly, while the feature map ψ, the space H and the operator V allow deriving the SELF estimator, their knowledge is not needed to evaluate f̂(x) in practice, since the optimization in Eq. (8) depends exclusively on the loss L and the score functions α_i.

3 Structured Prediction for Nonlinear MTL

In this section we present the main contribution of this work, namely the extension of the SELF framework to the MTL setting. Furthermore, we discuss how to cope with possible violations of the constraint set in practice. We study the theoretical properties of the proposed estimator in Sec. 4. We begin our analysis by applying the SELF approach to vector-valued regression, which will then lead to the MTL formulation.

3.1 Nonlinear Vector-valued Regression

Vector-valued regression (VVR) is a special instance of MTL where, for each input, all output examples are available during training. In other words, the training sets can be combined into a single dataset (x_i, y_i)_{i=1}^n, with y_i = (y_i1, ..., y_iT)ᵀ ∈ R^T. If we denote by L : R^T × R^T → R the separable loss L(y, y′) = (1/T) Σ_{t=1}^T ℓ(y_t, y′_t), nonlinear VVR coincides with the structured prediction problem in Eq. (4). If L is SELF, we can therefore obtain an estimator according to Eq. (8).

Example 1 (Nonlinear VVR with Square Loss). Let L(y, y′) = Σ_{t=1}^T (y_t − y′_t)². Then the SELF estimator for nonlinear VVR obtained from Eq. (8) corresponds to the projection onto C

  f̂(x) = argmin_{c∈C} ‖c − b(x)/a(x)‖²₂ = Π_C(b(x)/a(x))    (9)

with a(x) = Σ_{i=1}^n α_i(x) and b(x) = Σ_{i=1}^n α_i(x) y_i. Interestingly, b(x) = Yᵀ(K + nλI)^{−1} K_x corresponds to the solution of standard vector-valued kernel ridge regression without constraints (we denoted by Y ∈ R^{n×T} the matrix with rows y_iᵀ). Therefore, nonlinear VVR consists in: 1) computing the unconstrained kernel ridge regression estimator b(x), 2) normalizing it by a(x), and 3) projecting the result onto C.

The example above shows that for specific loss functions the estimation of f̂(x) can be significantly simplified. In general, such optimization will depend on the properties of the constraint set C (e.g. convex, connected, etc.) and of the loss ℓ (e.g. convex, smooth, etc.). In practice, if C is a discrete (or discretized) subset of R^T, the computation can be performed efficiently via a nearest neighbor search (e.g. using k-d tree based approaches to speed up computations [26]). If C is a manifold, recent geometric optimization methods [27] (e.g. SVRG [28]) can be applied to find critical points of Eq. (8). This setting suggests a connection with manifold regression, as discussed below.

Remark 1 (Connection to Manifold Regression). When C is a Riemannian manifold, the problem of learning f : X → C shares some similarities with the manifold regression setting studied in [17] (see also [29] and references therein). Manifold regression can be interpreted as a vector-valued learning setting where outputs are constrained to be in C ⊆ R^T and prediction errors are measured according to the geodesic distance.
However, note that the two problems are also significantly different since, 1) in MTL, noise could make output examples y_i lie close to, but not exactly on, the constraint set C and, moreover, 2) the loss functions used in MTL typically measure errors independently for each task (as in Eq. (3), see also [5]) and rarely coincide with a geodesic distance.

3.2 Nonlinear Multitask Learning

Differently from nonlinear vector-valued regression, the SELF approach introduced in Sec. 2.1 cannot be applied directly to the MTL setting. Indeed, the estimator in Eq. (8) requires knowledge of all task outputs y_i ∈ Y = R^T for every training input x_i ∈ X, while in MTL we have a separate dataset (x_it, y_it)_{i=1}^{n_t} for each task, with y_it ∈ R (this could be interpreted as the vectors y_i having "missing entries"). Therefore, in this work we extend the SELF framework to nonlinear MTL. We begin by proving a characterization of the minimizer f∗ : X → C of the expected risk E(f), akin to Eq. (6).

Proposition 2. Let ℓ : R × R → R be SELF, with ℓ(y, y′) = ⟨ψ(y), V ψ(y′)⟩_H. Then the expected risk E(f) introduced in Eq. (3) admits a measurable minimizer f∗ : X → C. Moreover, any such minimizer satisfies, almost everywhere on X,

  f∗(x) = argmin_{c∈C} Σ_{t=1}^T ⟨ψ(c_t), V g∗_t(x)⟩_H,    with    g∗_t(x) = ∫_R ψ(y) dρ_t(y|x).    (10)

Prop. 2 extends Eq. (6) by relying on the linearity induced by the SELF assumption combined with Aumann's principle [30], which guarantees the existence of a measurable selector f∗ for the minimization problem in Eq. (10) (see Appendix). Following the strategy outlined in Sec. 2.1, we propose to learn T independent functions ĝ_t : X → H, each aiming to approximate the corresponding g∗_t : X → H, and then to define f̂ : X → C such that

  f̂(x) = argmin_{c∈C} Σ_{t=1}^T ⟨ψ(c_t), V ĝ_t(x)⟩_H    ∀ x ∈ X.    (11)

We choose the ĝ_t to be the solutions of T independent kernel ridge regression problems

  minimize_{g∈H⊗G} (1/n_t) Σ_{i=1}^{n_t} ‖g(x_it) − ψ(y_it)‖² + λ_t ‖g‖²_{H⊗G}    (12)

for t = 1, ..., T, where G is a reproducing kernel Hilbert space on X associated to a kernel k : X × X → R, and the candidate solution g : X → H is an element of H ⊗ G. The following result shows that in this setting, evaluating the estimator f̂ can be significantly simplified.

Proposition 3 (The Nonlinear MTL Estimator). Let k : X × X → R be a reproducing kernel with associated reproducing kernel Hilbert space G. Let ĝ_t : X → H be the solution of Eq. (12) for t = 1, ..., T. Then the estimator f̂ : X → C defined in Eq. (11) is such that

  f̂(x) = argmin_{c∈C} Σ_{t=1}^T Σ_{i=1}^{n_t} α_it(x) ℓ(c_t, y_it),    (α_1t(x), ..., α_{n_t t}(x))ᵀ = (K_t + n_t λ_t I)^{−1} K_tx    (13)

for all x ∈ X and t = 1, ..., T, where K_t ∈ R^{n_t×n_t} denotes the kernel matrix of the t-th task, namely (K_t)_ij = k(x_it, x_jt), and K_tx ∈ R^{n_t} is the vector with i-th component equal to k(x, x_it).

Prop. 3 provides an equivalent characterization of the nonlinear MTL estimator in Eq. (11) that is more amenable to computations (it does not require explicit knowledge of H, ψ or V) and generalizes the SELF approach (indeed, for VVR, Eq. (13) reduces to the SELF estimator in Eq. (8)).
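To make this concrete, here is a minimal numpy sketch of evaluating Eq. (13) with the square loss over a discretized constraint set (the unit circumference of the running example). All function names and parameter values are ours, chosen for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Gaussian kernel matrix between 1-d input arrays A and B."""
    return np.exp(-((A[:, None] - B[None, :]) ** 2) / (2 * sigma ** 2))

def task_scores(x_train, x, lam):
    """Score functions of Eq. (13): alpha_t(x) = (K_t + n_t*lam*I)^{-1} K_tx."""
    n = len(x_train)
    K = gaussian_kernel(x_train, x_train)
    Kx = gaussian_kernel(x_train, np.array([x]))[:, 0]
    return np.linalg.solve(K + n * lam * np.eye(n), Kx)

def nl_mtl_predict(x, tasks, candidates, lam=1e-3):
    """argmin over a discretized C of sum_t sum_i alpha_it(x) * (c_t - y_it)^2."""
    obj = np.zeros(len(candidates))
    for t, (x_t, y_t) in enumerate(tasks):
        alpha = task_scores(x_t, x, lam)
        obj += ((candidates[:, t][:, None] - y_t[None, :]) ** 2) @ alpha
    return candidates[np.argmin(obj)]

# Running example: two tasks f1 = cos, f2 = sin with separate training sets;
# C is the unit circumference, discretized into 500 candidate points.
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(0, 2 * np.pi, 80), rng.uniform(0, 2 * np.pi, 80)
tasks = [(x1, np.cos(x1)), (x2, np.sin(x2))]
theta = np.linspace(0, 2 * np.pi, 500, endpoint=False)
C = np.stack([np.cos(theta), np.sin(theta)], axis=1)

pred = nl_mtl_predict(1.0, tasks, C)  # lies on C by construction
```

Note that each task contributes its own score vector computed from its own kernel matrix, and the constraint enters only through the candidate set, exactly as in Prop. 3.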
Interestingly, the proposed strategy learns the score functions α_it : X → R separately for each task and then combines them in the joint minimization over C. This can be interpreted as the estimator weighting predictions according to how "reliable" each task is on the input x ∈ X. We make this intuition clearer in the following.

Example 2 (Nonlinear MTL with Square Loss). Let ℓ be the square loss. Then, analogously to Example 1, we have that for any x ∈ X the multitask estimator in Eq. (13) is

  f̂(x) = argmin_{c∈C} Σ_{t=1}^T a_t(x) (c_t − b_t(x)/a_t(x))²    (14)

with a_t(x) = Σ_{i=1}^{n_t} α_it(x) and b_t(x) = Σ_{i=1}^{n_t} α_it(x) y_it, which corresponds to performing the projection f̂(x) = Π^{A(x)}_C(w(x)) of the vector w(x) = (b_1(x)/a_1(x), ..., b_T(x)/a_T(x)) according to the metric deformation induced by the matrix A(x) = diag(a_1(x), ..., a_T(x)). This suggests interpreting a_t(x) as a measure of confidence of task t with respect to x ∈ X. Indeed, tasks with small a_t(x) affect the weighted projection Π^{A(x)}_C less.

3.3 Extensions: Violating C

In practice, it is natural to expect the knowledge of the constraint set C to be inexact, for instance due to noise or modeling inaccuracies. To address this issue, we consider two extensions of nonlinear MTL that allow candidate predictors to slightly violate the constraints C, and introduce a hyperparameter to control this effect.

Robustness w.r.t. perturbations of C. We soften the effect of the constraint set by requiring candidate predictors to take values within a radius δ > 0 of C, namely f : X → C_δ with

  C_δ = { c + r | c ∈ C, r ∈ R^T, ‖r‖ ≤ δ }.    (15)

The scalar δ > 0 is now a hyperparameter ranging from 0 (C_0 = C) to +∞ (C_∞ = R^T).

Penalizing w.r.t. the distance from C.
We can penalize predictions depending on their distance from the set C by introducing a perturbed version ℓ^t_μ : R^T × R^T → R of the loss,

  ℓ^t_μ(y, z) = ℓ(y_t, z_t) + ‖z − Π_C(z)‖²/μ    for all y, z ∈ R^T    (16)

where Π_C : R^T → C denotes the orthogonal projection onto C (see Example 1). Below we report the closed-form solution for nonlinear vector-valued regression with the square loss.

Example 3 (VVR and Violations of C). With the same notation as Example 1, let f_0 : X → C denote the solution in Eq. (9) of nonlinear VVR with exact constraints, and let r = b(x)/a(x) − f_0(x) ∈ R^T. Then the solutions to the problem with robust constraints C_δ and with perturbed loss function L_μ = (1/T) Σ_t ℓ^t_μ are, respectively (see Appendix for the MTL case),

  f̂_δ(x) = f_0(x) + r min(1, δ/‖r‖)    and    f̂_μ(x) = f_0(x) + r μ/(1 + μ).    (17)

4 Generalization Properties of Nonlinear MTL

We now study the statistical properties of the proposed nonlinear MTL estimator. Interestingly, this will allow us to identify specific training regimes in which nonlinear MTL achieves learning rates significantly faster than those available when learning the tasks independently. Our analysis revolves around the assumption that the loss function used to measure prediction errors is SELF. To this end, we observe that most multitask loss functions are indeed SELF.

Proposition 4. Let ℓ̄ : [a, b] → R be differentiable almost everywhere with derivative Lipschitz continuous almost everywhere. Let ℓ : [a, b] × [a, b] → R be such that ℓ(y, y′) = ℓ̄(y − y′) or ℓ(y, y′) = ℓ̄(yy′) for all y, y′ ∈ R.
Then: (i) ℓ is SELF, and (ii) the separable function L : Y^T × Y^T → R such that L(y, y′) = (1/T) Σ_{t=1}^T ℓ(y_t, y′_t) for all y, y′ ∈ Y^T is SELF.

Interestingly, most (mono-variate) loss functions used in multitask and supervised learning satisfy the assumptions of Prop. 4. Typical examples are the square loss (y − y′)², the hinge loss max(0, 1 − yy′) or the logistic loss log(1 + exp(−yy′)): the corresponding derivative with respect to z = y − y′ or z = yy′ exists and is Lipschitz almost everywhere on compact sets.

The nonlinear MTL estimator introduced in Sec. 3.2 relies on the intuition that if, for each x ∈ X, the kernel ridge regression solutions ĝ_t(x) are close to the conditional expectations g∗_t(x), then f̂(x) will also be close to f∗(x). The following result formally characterizes the relation between the two problems, proving what is often referred to as a comparison inequality in the context of surrogate frameworks [31]. Throughout the rest of this section we assume ρ_t(x, y) = ρ_t(y|x) ρ_X(x) for each t = 1, ..., T, and denote by ‖g‖_{L²_ρX} the L²(X, H, ρ_X) norm of a function g : X → H according to the marginal distribution ρ_X.

Theorem 5 (Comparison Inequality). Under the same assumptions of Prop. 2, for t = 1, ..., T let f∗ : X → C and g∗_t : X → H be defined as in Eq. (10), let ĝ_t : X → H be measurable functions and let f̂ : X → C satisfy Eq. (11). Let V∗ be the adjoint of V.
Then,

  E(f̂) − E(f∗) ≤ q_{C,ℓ,T} √( (1/T) Σ_{t=1}^T ‖ĝ_t − g∗_t‖²_{L²_ρX} ),    q_{C,ℓ,T} = 2 sup_{c∈C} √( (1/T) Σ_{t=1}^T ‖V∗ψ(c_t)‖²_H ).    (18)

The comparison inequality in Eq. (18) is key to studying the generalization properties of our nonlinear MTL estimator: it shows that we can control the excess risk in terms of how well the ĝ_t approximate the true g∗_t (see the Appendix for a proof of Thm. 5).

Theorem 6. Let C ⊆ [a, b]^T, let X be a compact set and k : X × X → R a continuous universal reproducing kernel (e.g. Gaussian). Let ℓ : [a, b] × [a, b] → R be a SELF. Let f̂_N : X → C denote the estimator in Eq. (13) with N = (n_1, ..., n_T) training points independently sampled from ρ_t for each task t = 1, ..., T, and λ_t = n_t^{−1/4}. Let n_0 = min_{1≤t≤T} n_t. Then, with probability 1,

  lim_{n_0→+∞} E(f̂_N) = inf_{f:X→C} E(f).    (19)

The proof of Thm. 6 relies on the comparison inequality in Thm. 5, which links the excess risk of the MTL estimator to the square error between ĝ_t and g∗_t. Standard results from kernel ridge regression allow concluding the proof [32] (see a more detailed discussion in the Appendix). By imposing further standard assumptions, we can also obtain generalization bounds on ‖ĝ_t − g∗_t‖_{L²_ρX} that automatically apply to nonlinear MTL, again via the comparison inequality, as shown below.

Theorem 7. With the same assumptions and notation of Thm. 6, let f̂_N : X → C denote the estimator in Eq. (13) with λ_t = n_t^{−1/2}, and assume g∗_t ∈ H ⊗ G for all t = 1, ..., T. Then for any τ > 0 we have, with probability at least 1 − 8e^{−τ},

  E(f̂_N) − inf_{f:X→C} E(f) ≤ q_{C,ℓ,T} h_ℓ τ² n_0^{−1/4} log T,    (20)

where q_{C,ℓ,T} is defined as in Eq. (18) and h_ℓ is a constant independent of C, N, n_t, λ_t, τ, T. The excess risk bound in Thm. 7 is comparable to that in [21] (Thm. 5). To our knowledge this is the first result studying the generalization properties of a learning approach to MTL with constraints.

4.1 Benefits of Nonlinear MTL

The rates in Thm. 7 strongly depend on the constraints C via the constant q_{C,ℓ,T}. The following result studies two special cases that allow us to appreciate this effect.

Lemma 8. Let B ≥ 1, let B = [−B, B]^T, let S ⊂ R^T be the sphere of radius B centered at the origin and let ℓ be the square loss. Then q_{B,ℓ,T} ≤ 2√5 B² and q_{S,ℓ,T} ≤ 2√5 B² T^{−1/2}.

To explain the effect of C on MTL, define n = Σ_{t=1}^T n_t and assume that n_0 = n_t = n/T. Lemma 8 together with Thm. 7 shows that when the tasks are assumed not to be related (i.e. C = B), the learning rate of nonlinear MTL is of order Õ((T/n)^{1/4}), as if the tasks were learned independently. On the other hand, when the tasks have a relation (e.g. C = S, implying a quadratic relation between the tasks), nonlinear MTL achieves a learning rate of Õ((1/(nT))^{1/4}), which improves as the number of tasks increases and as the total number of observed examples increases. Specifically, for T of the same order as n, we obtain a rate of Õ(n^{−1/2}), which is comparable to the optimal rates available for kernel ridge regression with only one task trained on the total number n of examples [32].
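Ignoring constants and logarithmic factors, the gap between the two regimes is easy to quantify; a back-of-the-envelope illustration of the bound scalings (ours, not from the paper):

```python
# Compare the bound scalings O((T/n)^{1/4}) (unrelated tasks, C = B)
# and O((1/(n*T))^{1/4}) (quadratically related tasks, C = S),
# for n total examples split evenly across T tasks.
n, T = 10_000, 100

rate_independent = (T / n) ** 0.25       # effectively per-task learning
rate_shared = (1.0 / (n * T)) ** 0.25    # exploiting the constraint set

print(rate_independent, rate_shared, rate_independent / rate_shared)
# the ratio is sqrt(T) = 10: the shared rate is 10x smaller here
```

The improvement factor is exactly √T, which is why the advantage grows with the number of related tasks.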
This observation corresponds to the intuition that if we have many related tasks with few training examples each, we can expect to achieve significantly better generalization by taking advantage of such relations rather than learning each task independently.

5 Connection to Previous Work: Linear MTL

In this work we formulated the nonlinear MTL problem as that of learning a function f : X → C taking values in a set of constraints C ⊆ R^T implicitly identified by a set of equations γ(f(x)) = 0. An alternative approach would be to characterize the set C via an explicit parametrization θ : R^Q → C, for Q ∈ N, so that the multitask predictor can be decomposed as f = θ ∘ h, with h : X → R^Q. We can learn h : X → R^Q using empirical risk minimization strategies such as Tikhonov regularization,

    minimize_{h=(h_1,…,h_Q) ∈ H^Q}  (1/n) Σ_{i=1}^n L(θ ∘ h(x_i), y_i) + λ Σ_{q=1}^Q ‖h_q‖²_H,    (21)

since candidate h take values in R^Q and therefore H can be a standard linear space of hypotheses. However, while Eq. (21) is interesting from the modeling standpoint, it also poses several problems: 1) θ can be nonlinear or even non-continuous, making Eq. (21) hard to solve in practice even for L convex; 2) θ is not uniquely identified by C, and therefore different parametrizations may lead to very different f̂ = θ ∘ ĥ, which is not always desirable; 3) there are few results on empirical risk minimization applied to generic loss functions L(θ(·), ·) (via so-called oracle inequalities, see [30] and references therein), and it is unclear what generalization properties to expect in this setting.

Figure 1: (Bottom) MSE (logarithmic scale) of MTL methods for learning constrained on a circumference (Left) or a lemniscate (Right). Results are reported in a boxplot across 10 trials. (Top) Sample predictions of the three methods trained on 100 points and compared with the ground truth.

A relevant exception to the issues above is the case where θ is linear. In this setting Eq. (21) becomes more amenable to both computations and statistical analysis, and indeed most previous MTL literature has focused on this setting, either by designing ad-hoc output metrics [33], linear output encodings [34] or regularizers [5]. Specifically, in this latter case the problem is cast as that of minimizing the functional

    minimize_{f=(f_1,…,f_T) ∈ H^T}  Σ_{i=1}^n L(f(x_i), y_i) + λ Σ_{t,s=1}^T A_{ts} ⟨f_t, f_s⟩_H,    (22)

where the psd matrix A = (A_{ts})_{t,s=1}^T encourages linear relations between the tasks. It can be shown that this problem is equivalent to Eq. (21) when θ ∈ R^{T×Q} is linear and A is set to the pseudoinverse of θθ^⊤. As shown in [14], a variety of situations are recovered by the approach above, such as the case where tasks are centered around a common average [9], clustered in groups [10] or share the same subset of features [3, 35]. Interestingly, the above framework can be further extended to estimate the structure matrix A directly from data, an idea initially proposed in [12] and further developed in [2, 14, 16].

6 Experiments

Synthetic Dataset. We considered a model of the form y = f*(x) + ε, with ε ∼ N(0, σI) noise sampled according to a normal distribution and f* : X → C, where C ⊂ R² was either a circumference or a lemniscate (see Fig. 1), of equations γ_circ(y) = y_1² + y_2² − 1 = 0 and γ_lemn(y) = y_1⁴ − (y_1² − y_2²) = 0 for y ∈ R².
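For concreteness, the synthetic generation model above (y = f*(x) + ε, with outputs lying on one of the two curves) can be sketched as follows. This is an illustrative reconstruction of our own, not the authors' code; it uses the trigonometric parametrizations f*_circ(x) = (cos x, sin x) and f*_lemn(x) = (sin x, sin 2x) reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_circ(n, sigma=0.05):
    """Noisy outputs on the circumference: y = f*(x) + eps, with
    f*(x) = (cos x, sin x), so gamma_circ(y) = y1^2 + y2^2 - 1 ~ 0."""
    x = rng.uniform(-np.pi, np.pi, size=n)
    y = np.stack([np.cos(x), np.sin(x)], axis=1)
    return x, y + sigma * rng.standard_normal(y.shape)

def sample_lemn(n, sigma=0.05):
    """Noisy outputs from the lemniscate parametrization used in the
    paper, f*(x) = (sin x, sin 2x)."""
    x = rng.uniform(-np.pi, np.pi, size=n)
    y = np.stack([np.sin(x), np.sin(2 * x)], axis=1)
    return x, y + sigma * rng.standard_normal(y.shape)

# Without noise, the circumference constraint holds exactly:
x, y = sample_circ(1000, sigma=0.0)
print(np.abs(y[:, 0] ** 2 + y[:, 1] ** 2 - 1).max())  # ~0
```

With σ = 0.05 the observed outputs hover around the curve, which is exactly the regime in which enforcing the constraint set C should help the estimator.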
We set X = [−π, π] and f*_circ(x) = (cos(x), sin(x)) or f*_lemn(x) = (sin(x), sin(2x)) as the parametric functions associated respectively with the circumference and the lemniscate. We sampled from 10 to 1000 points for training and 1000 for testing, with noise σ = 0.05.
We trained and tested three regression models over 10 trials. We used a Gaussian kernel on the input and chose the corresponding bandwidth and the regularization parameter λ by hold-out cross-validation on 30% of the training set (see details in the appendix). Fig. 1 (Bottom) reports the mean square error (MSE) of our nonlinear MTL approach (NL-MTL) compared with the standard least squares single task learning (STL) baseline and the multitask relations learning (MTRL) from [11], which encourages tasks to be linearly dependent. However, for both the circumference and the lemniscate, the tasks are strongly nonlinearly related. As a consequence, our approach consistently outperforms its two competitors, which assume only linear relations (or none at all). Fig. 1 (Top) provides a qualitative comparison of the three methods (when trained with 100 examples) during a single trial.

Sarcos Dataset. We report experiments on the Sarcos dataset [22]. The goal is to predict the torque measured at each joint of a 7 degrees-of-freedom robotic arm, given the current state, velocities and accelerations measured at each joint (7 tasks/torques for a 21-dimensional input). We used the 10 dataset splits available online for the dataset in [13], each containing 2000 examples per task, with 15 examples used for training/validation while the rest is used to measure errors in terms of the explained variance, namely 1 − nMSE (as a percentage). To compare with results in [13] we used the linear kernel on the input. We refer to the Appendix for details on model selection.

Table 1: Explained variance of the robust (NL-MTL[R]) and perturbed (NL-MTL[P]) variants of nonlinear MTL, compared with linear MTL methods on the Sarcos dataset, reported from [16].

                 STL       MTL[36]    CMTL[10]   MTRL[11]  MTFL[13]  FMTL[16]  NL-MTL[R]  NL-MTL[P]
Expl. Var. (%)   40.5±7.6  34.5±10.2  33.0±13.4  41.6±7.1  49.9±6.3  50.3±5.8  55.4±6.5   54.6±5.1

Tab. 1 reports results from [13, 16] for a wide range of previous linear MTL methods [36, 10, 3, 11, 13, 16], together with our NL-MTL approach (both robust and perturbed versions). Since we did not find the Sarcos robot model parameters online, we approximated the constraint set C as a point cloud, by collecting 1000 random output vectors that did not belong to the training or test sets in [13] (we sampled them from the original dataset [22]). NL-MTL clearly outperforms the "linear" competitors. Note indeed that the torques measured at different joints of a robot are related in a highly nonlinear way (see for instance [23]), and therefore taking such structure into account can be beneficial to the learning process.

Table 2: Rank prediction error according to the weighted binary loss in [37, 21].

            NL-MTL       SELF[21]     Linear[37]   Hinge[38]    Logistic[39]  SVMStruct[20]  STL          MTRL[11]
Rank Loss   0.271±0.004  0.396±0.003  0.430±0.004  0.432±0.008  0.432±0.012   0.451±0.008    0.581±0.003  0.613±0.005

Ranking by Pair-wise Comparison. We consider a ranking problem formulated within the MTL setting: given D documents, we learn T = D(D − 1)/2 functions f_{p,q} : X → {−1, 0, 1}, for each pair of documents p, q = 1, …, D, that predict whether one document is more relevant than the other for a given input query x.
The problem can be formulated as multi-label MTL with 0-1 loss: for a given training query x only some labels y_{p,q} ∈ {−1, 0, 1} are available in output (with 1 corresponding to document p being more relevant than q, −1 the opposite, and 0 that the two are equivalent). We therefore have T separate training sets, one for each task (i.e. pair of documents). Clearly, not all possible combinations of outputs f : X → {−1, 0, 1}^T are allowed, since predictions need to be consistent (e.g. if p ≻ q (read "p more relevant than q") and q ≻ r, then we cannot have r ≻ p). As shown in [37], these constraints are naturally encoded in a set DAG(D) ⊂ R^T of all vectors G ∈ R^T that correspond to (the vectorized, upper triangular part of the adjacency matrix of) a directed acyclic graph with D vertices. The problem can be cast in our nonlinear MTL framework with f : X → C = DAG(D) (see Appendix for details on how to perform the projection onto C).
We performed experiments on Movielens100k [40] (movies = documents, users = queries) to compare our NL-MTL estimator with both standard MTL baselines and methods designed for ranking problems. We used the (linear) input kernel and the train, validation and test splits adopted in [21] to perform 10 independent trials with 5-fold cross-validation for model selection. Tab. 2 reports the average ranking error and standard deviation of the (weighted) 0-1 loss function considered in [37, 21] for the ranking methods proposed in [38, 39, 37], the SVMStruct estimator [20], the SELF estimator considered in [21] for ranking, and the MTRL and STL baselines, the latter corresponding to individual SVMs trained for each pairwise comparison. Results for previous methods are reported from [21]. NL-MTL outperforms all competitors, achieving better performance than the original SELF estimator.
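To build intuition for the DAG constraint, the following hypothetical helper (our own sketch, not the projection actually used by the method, which is described in the Appendix; ties, i.e. labels equal to 0, are simply ignored here) checks whether a set of pairwise predictions induces an acyclic preference graph:

```python
def is_consistent(D, pred):
    """Check that pairwise predictions pred[(p, q)] in {-1, 0, 1} over D
    documents induce a DAG: an edge p -> q whenever p is predicted more
    relevant than q, and no directed cycles."""
    adj = {p: set() for p in range(D)}
    for (p, q), v in pred.items():
        if v == 1:
            adj[p].add(q)
        elif v == -1:
            adj[q].add(p)

    # Detect a cycle by depth-first search with three colors.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {p: WHITE for p in range(D)}

    def dfs(u):
        color[u] = GREY
        for v in adj[u]:
            if color[v] == GREY or (color[v] == WHITE and dfs(v)):
                return True  # back edge found: cycle
        color[u] = BLACK
        return False

    return not any(color[p] == WHITE and dfs(p) for p in range(D))

# p > q, q > r, r > p is a cycle, hence not a valid joint prediction:
print(is_consistent(3, {(0, 1): 1, (1, 2): 1, (0, 2): -1}))  # False
print(is_consistent(3, {(0, 1): 1, (1, 2): 1, (0, 2): 1}))   # True
```

Vectors in DAG(D) pass this check by construction, which is why projecting the T pairwise predictions onto C = DAG(D) rules out cyclic rankings.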
For the sake of brevity we refer to the Appendix for more details on the experiments.

Acknowledgments. This work was supported in part by EPSRC grant EP/P009069/1.

References
[1] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
[2] Mauricio A. Álvarez, Neil Lawrence, and Lorenzo Rosasco. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
[3] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41, 2007.
[4] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[5] Charles A. Micchelli and Massimiliano Pontil. Kernels for multi-task learning. In Advances in Neural Information Processing Systems, pages 921–928, 2004.
[6] Christopher M. Bishop. Pattern recognition and machine learning. Information Science and Statistics. Springer, Heidelberg, 2006.
[7] Andreas Maurer and Massimiliano Pontil. Excess risk bounds for multitask learning with trace norm regularization. In Conference on Learning Theory (COLT), volume 30, pages 55–76, 2013.
[8] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32, 2016.
[9] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
[10] Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered multi-task learning: a convex formulation. Advances in Neural Information Processing Systems, 2008.
[11] Yu Zhang and Dit-Yan Yeung.
A convex formulation for learning task relationships in multi-task learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
[12] Francesco Dinuzzo, Cheng S. Ong, Peter V. Gehler, and Gianluigi Pillonetto. Learning output kernels with block coordinate descent. International Conference on Machine Learning, 2011.
[13] Pratik Jawanpuria and J. Saketha Nath. A convex feature learning formulation for latent task structure discovery. International Conference on Machine Learning, 2012.
[14] Carlo Ciliberto, Youssef Mroueh, Tomaso A. Poggio, and Lorenzo Rosasco. Convex learning of multiple tasks and their structure. In International Conference on Machine Learning (ICML), 2015.
[15] Carlo Ciliberto, Lorenzo Rosasco, and Silvia Villa. Learning multiple visual tasks while discovering their structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 131–139, 2015.
[16] Pratik Jawanpuria, Maksim Lapin, Matthias Hein, and Bernt Schiele. Efficient output kernel learning for multiple tasks. In Advances in Neural Information Processing Systems, pages 1189–1197, 2015.
[17] Florian Steinke and Matthias Hein. Non-parametric regression between manifolds. In Advances in Neural Information Processing Systems, pages 1561–1568, 2009.
[18] Arvind Agarwal, Samuel Gerber, and Hal Daumé. Learning multiple tasks using manifold regularization. In Advances in Neural Information Processing Systems, pages 46–54, 2010.
[19] Gökhan Bakir, Thomas Hofmann, Bernhard Schölkopf, Alexander J. Smola, Ben Taskar, and S.V.N. Vishwanathan. Predicting structured data. MIT Press, 2007.
[20] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 2005.
[21] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi.
A consistent regularization approach for structured prediction. Advances in Neural Information Processing Systems 29 (NIPS), pages 4412–4420, 2016.
[22] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT Press, 2006.
[23] Lorenzo Sciavicco and Bruno Siciliano. Modeling and control of robot manipulators, volume 8. McGraw-Hill, New York, 1996.
[24] Sebastian Nowozin and Christoph H. Lampert. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision, 2011.
[25] Bernhard Schölkopf and Alexander J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[26] Thomas H. Cormen. Introduction to algorithms. MIT Press, 2009.
[27] Suvrit Sra and Reshad Hosseini. Geometric optimization in machine learning. In Algorithmic Advances in Riemannian Geometry and Applications, pages 73–91. Springer, 2016.
[28] Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems 29, 2016.
[29] Florian Steinke, Matthias Hein, and Bernhard Schölkopf. Nonparametric regression between general Riemannian manifolds. SIAM Journal on Imaging Sciences, 3(3):527–563, 2010.
[30] Ingo Steinwart and Andreas Christmann. Support vector machines. Information Science and Statistics. Springer, New York, 2008.
[31] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
[32] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
[33] Vikas Sindhwani, Aurelie C. Lozano, and Ha Quang Minh.
Scalable matrix-valued kernel learning and high-dimensional nonlinear causal inference. CoRR, abs/1210.4792, 2012.
[34] Rob Fergus, Hector Bernal, Yair Weiss, and Antonio Torralba. Semantic label sharing for learning with many categories. European Conference on Computer Vision, 2010.
[35] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.
[36] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004.
[37] John C. Duchi, Lester W. Mackey, and Michael I. Jordan. On the consistency of ranking algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 327–334, 2010.
[38] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. Advances in Neural Information Processing Systems, pages 115–132, 1999.
[39] Ofer Dekel, Yoram Singer, and Christopher D. Manning. Log-linear models for label ranking. In Advances in Neural Information Processing Systems, 2004.
[40] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2015.