{"title": "Efficient Output Kernel Learning for Multiple Tasks", "book": "Advances in Neural Information Processing Systems", "page_first": 1189, "page_last": 1197, "abstract": "The paradigm of multi-task learning is that one can achieve better generalization by learning tasks jointly and thus exploiting the similarity between the tasks rather than learning them independently of each other. While previously the relationship between tasks had to be user-defined in the form of an output kernel, recent approaches jointly learn the tasks and the output kernel. As the output kernel is a positive semidefinite matrix, the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step. Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. This leads to an unconstrained dual problem which can be solved efficiently. Experiments on several multi-task and multi-class data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance.", "full_text": "Ef\ufb01cient Output Kernel Learning for Multiple Tasks\n\nPratik Jawanpuria1, Maksim Lapin2, Matthias Hein1 and Bernt Schiele2\n\n1Saarland University, Saarbr\u00a8ucken, Germany\n\n2Max Planck Institute for Informatics, Saarbr\u00a8ucken, Germany\n\nAbstract\n\nThe paradigm of multi-task learning is that one can achieve better generalization\nby learning tasks jointly and thus exploiting the similarity between the tasks rather\nthan learning them independently of each other. While previously the relationship\nbetween tasks had to be user-de\ufb01ned in the form of an output kernel, recent ap-\nproaches jointly learn the tasks and the output kernel. 
As the output kernel is a positive semidefinite matrix, the resulting optimization problems are not scalable in the number of tasks as an eigendecomposition is required in each step. Using the theory of positive semidefinite kernels we show in this paper that for a certain class of regularizers on the output kernel, the constraint of being positive semidefinite can be dropped as it is automatically satisfied for the relaxed problem. This leads to an unconstrained dual problem which can be solved efficiently. Experiments on several multi-task and multi-class data sets illustrate the efficacy of our approach in terms of computational efficiency as well as generalization performance.

1 Introduction

Multi-task learning (MTL) advocates sharing relevant information among several related tasks during the training stage. The advantage of MTL over learning tasks independently has been shown theoretically as well as empirically [1, 2, 3, 4, 5, 6, 7].

The focus of this paper is the question of how the task relationships can be inferred from the data. It has been noted that naively grouping all the tasks together may be detrimental [8, 9, 10, 11]. In particular, outlier tasks may lead to worse performance. Hence, clustered multi-task learning algorithms [10, 12] aim to learn groups of closely related tasks. The information is then shared only within these clusters of tasks. This corresponds to learning the task covariance matrix, which we denote as the output kernel in this paper. Most of these approaches lead to non-convex problems.

In this work, we focus on the problem of directly learning the output kernel in the multi-task learning framework. The multi-task kernel on input and output is assumed to be decoupled as the product of a scalar kernel and the output kernel, which is a positive semidefinite matrix [1, 13, 14, 15].
In classical multi-task learning algorithms [1, 16], the degree of relatedness between distinct tasks is set to a constant and is optimized as a hyperparameter. However, constant similarity between tasks is a strong assumption and is unlikely to hold in practice. Thus recent approaches have tackled the problem of directly learning the output kernel. [17] solves a multi-task formulation in the framework of vector-valued reproducing kernel Hilbert spaces involving the squared loss, where they penalize the Frobenius norm of the output kernel as a regularizer. They formulate an invex optimization problem that they solve optimally. In comparison, [18] recently proposed an efficient barrier method to optimize a generic convex output kernel learning formulation. On the other hand, [9] proposes a convex formulation to learn a low rank output kernel matrix by enforcing a trace constraint. The above approaches [9, 17, 18] solve the resulting optimization problem via alternate minimization between task parameters and the output kernel. Each step of the alternate minimization requires an eigenvalue decomposition of a matrix having as size the number of tasks and a problem corresponding to learning all tasks independently.

In this paper we study a similar formulation as [17]. However, we allow arbitrary convex loss functions and employ general p-norms for p ∈ (1, 2] (including the Frobenius norm) as regularizer for the output kernel. Our problem is jointly convex over the task parameters and the output kernel. Small p leads to sparse output kernels, which allows for an easier interpretation of the learned task relationships in the output kernel. Under certain conditions on p we show that one can drop the constraint that the output kernel should be positive definite as it is automatically satisfied for the unconstrained problem.
This significantly simplifies the optimization, and our result could also be of interest in other areas where one optimizes over the cone of positive definite matrices. The resulting unconstrained dual problem is amenable to efficient optimization methods such as stochastic dual coordinate ascent [19], which scale well to large data sets. Overall we do not require any eigenvalue decomposition operation at any stage of our algorithm and no alternate minimization is necessary, leading to a highly efficient methodology. Furthermore, we show that this trick not only applies to p-norms but also to a large class of regularizers for which we provide a characterization.

Our contributions are as follows: (a) we propose a generic p-norm regularized output kernel matrix learning formulation, which can be extended to a large class of regularizers; (b) we show that the constraint on the output kernel to be positive definite can be dropped as it is automatically satisfied, leading to an unconstrained dual problem; (c) we propose an efficient stochastic dual coordinate ascent based method for solving the dual formulation; (d) we empirically demonstrate the superiority of our approach in terms of generalization performance as well as a significant reduction in training time compared to other methods learning the output kernel.

The paper is organized as follows. We introduce our formulation in Section 2. Our main technical result is discussed in Section 3. The proposed optimization algorithm is described in Section 4. In Section 5, we report the empirical results. All the proofs can be found in the supplementary material.

2 The Output Kernel Learning Formulation

We first introduce the setting considered in this paper. We denote the number of tasks by T. We assume that all tasks have a common input space X and a common positive definite kernel function k : X × X → R.
We denote by ψ(·) the feature map and by H_k the reproducing kernel Hilbert space (RKHS) [20] associated with k. The training data is (x_i, y_i, t_i), i = 1, ..., n, where x_i ∈ X, t_i is the task the i-th instance belongs to and y_i is the corresponding label. Moreover, we have a positive semidefinite matrix Θ ∈ S^T_+ on the set of tasks {1, ..., T}, where S^T_+ is the set of T × T symmetric and positive semidefinite (p.s.d.) matrices.

If one arranges the predictions of all tasks in a vector, one can see multi-task learning as learning a vector-valued function in an RKHS [see 1, 13, 14, 15, 18, and references therein]. However, in this paper we use the one-to-one correspondence between real-valued and matrix-valued kernels, see [21], in order to limit the technical overhead. In this framework we define the joint kernel of input space and the set of tasks, M : (X × {1, ..., T}) × (X × {1, ..., T}) → R, as

    M((x, s), (z, t)) = k(x, z) Θ(s, t).    (1)

We denote the corresponding RKHS of functions on X × {1, ..., T} as H_M and by ||·||_{H_M} the corresponding norm. We formulate the output kernel learning problem for multiple tasks as

    min_{Θ ∈ S^T_+, F ∈ H_M}  C Σ_{i=1}^n L(y_i, F(x_i, t_i)) + (1/2) ||F||^2_{H_M} + λ V(Θ),    (2)

where L : R × R → R is the convex loss function (convex in the second argument), V(Θ) is a convex regularizer penalizing the complexity of the output kernel Θ and λ ∈ R_+ is the regularization parameter. Note that ||F||^2_{H_M} implicitly depends also on Θ. In the following we show that (2) can be reformulated into a jointly convex problem in the parameters of the prediction function and the output kernel Θ.
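The decoupled joint kernel in Eq. (1) is straightforward to compute once a scalar input kernel and an output kernel are fixed. A minimal numpy sketch follows; the Gaussian input kernel, the random p.s.d. Θ, and the toy data layout are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # scalar input kernel k(x, z) = exp(-gamma * ||x - z||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def joint_gram(X, tasks, Theta, gamma=1.0):
    # Eq. (1): M((x_i, t_i), (x_j, t_j)) = k(x_i, x_j) * Theta(t_i, t_j)
    K = gaussian_kernel(X, X, gamma)
    return K * Theta[np.ix_(tasks, tasks)]

rng = np.random.default_rng(0)
n, T = 8, 3
X = rng.standard_normal((n, 2))
tasks = rng.integers(0, T, size=n)       # task label t_i of each instance
A = rng.standard_normal((T, T))
Theta = A @ A.T                          # some p.s.d. output kernel
M = joint_gram(X, tasks, Theta)
eigmin = np.linalg.eigvalsh(M).min()     # joint Gram matrix stays p.s.d.
```

By the Schur product theorem, the entrywise product of the two p.s.d. Gram matrices is again p.s.d., which the eigenvalue check confirms numerically.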
Using the standard representer theorem [20] (see the supplementary material) for fixed output kernel Θ, one can show that the optimal solution F* ∈ H_M of (2) can be written as

    F*(x, t) = Σ_{s=1}^T Σ_{i=1}^n γ_is M((x_i, s), (x, t)) = Σ_{s=1}^T Σ_{i=1}^n γ_is k(x_i, x) Θ(s, t).    (3)

With the explicit form of the prediction function one can rewrite the main problem (2) as

    min_{Θ ∈ S^T_+, γ ∈ R^{n×T}}  C Σ_{i=1}^n L(y_i, Σ_{s=1}^T Σ_{j=1}^n γ_js k_ji Θ_{s t_i}) + (1/2) Σ_{r,s=1}^T Σ_{i,j=1}^n γ_ir γ_js k_ij Θ_rs + λ V(Θ),    (4)

where Θ_rs = Θ(r, s) and k_ij = k(x_i, x_j). Unfortunately, problem (4) is not jointly convex in Θ and γ due to the product in the second term. A similar problem has been analyzed in [17]. They could show that for the squared loss and V(Θ) = ||Θ||_F^2 the corresponding optimization problem is invex and directly optimize it. For an invex function every stationary point is globally optimal [22]. We follow a different path which leads to a formulation similar to the one of [2] used for learning an input mapping (see also [9]). Our formulation for the output kernel learning problem is jointly convex in the task kernel Θ and the task parameters. We present a derivation for the general RKHS H_k, analogous to the linear case presented in [2, 9]. We use the following variable transformation,

    β_it = Σ_{s=1}^T Θ_ts γ_is, i = 1, ..., n, t = 1, ..., T,  resp.  γ_is = Σ_{t=1}^T (Θ^{-1})_st β_it.

In the last expression Θ^{-1} has to be understood as the pseudo-inverse if Θ is not invertible.
Note that this causes no problems as in case Θ is not invertible, we can without loss of generality restrict γ in (4) to the range of Θ. The transformation leads to our final problem formulation, where the prediction function F and its squared norm ||F||^2_{H_M} can be written as

    F(x, t) = Σ_{i=1}^n β_it k(x_i, x),    ||F||^2_{H_M} = Σ_{r,s=1}^T Σ_{i,j=1}^n (Θ^{-1})_sr β_is β_jr k(x_i, x_j).    (5)

We get our final primal optimization problem

    min_{Θ ∈ S^T_+, β ∈ R^{n×T}}  C Σ_{i=1}^n L(y_i, Σ_{j=1}^n β_{j t_i} k_ji) + (1/2) Σ_{r,s=1}^T Σ_{i,j=1}^n (Θ^{-1})_sr β_is β_jr k_ij + λ V(Θ).    (6)

Before we analyze the convexity of this problem, we want to illustrate the connection to the formulations in [9, 17]. With the task weight vectors w_t = Σ_{j=1}^n β_jt ψ(x_j) ∈ H_k we get predictions as F(x, t) = ⟨w_t, ψ(x)⟩ and one can rewrite

    Σ_{r,s=1}^T Σ_{i,j=1}^n (Θ^{-1})_sr β_is β_jr k(x_i, x_j) = Σ_{r,s=1}^T (Θ^{-1})_sr ⟨w_s, w_r⟩.

This identity is known for vector-valued RKHS, see [15] and references therein. When Θ is κ times the identity matrix, then ||F||^2_{H_M} = (1/κ) Σ_{t=1}^T ||w_t||^2 and thus (2) is learning the tasks independently. As mentioned before the convexity of the expression of ||F||^2_{H_M} is crucial for the convexity of the full problem (6).
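The variable transformation above can be checked numerically: with β = γΘ, the predictions of Eq. (3) and Eq. (5) coincide, and the quadratic term of (4) written in γ equals the quadratic term of (6) written in β (up to the common factor 1/2). A small sketch, where the random data and the invertible Θ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 6, 4
X = rng.standard_normal((n, 3))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gram matrix k_ij

A = rng.standard_normal((T, T))
Theta = A @ A.T + np.eye(T)        # invertible p.s.d. output kernel
gamma = rng.standard_normal((n, T))
beta = gamma @ Theta               # beta_it = sum_s Theta_ts gamma_is

# predictions on the training points in both parameterizations
F_gamma = K @ gamma @ Theta        # Eq. (3)
F_beta = K @ beta                  # Eq. (5)

# quadratic term of (4) in gamma vs. quadratic term of (6) in beta
norm_gamma = np.trace(gamma.T @ K @ gamma @ Theta)
norm_beta = np.trace(beta.T @ K @ beta @ np.linalg.inv(Theta))
```

Both pairs agree because substituting β = γΘ into the β-parameterized expressions cancels one factor of Θ against Θ^{-1} inside the trace.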
The following result has been shown in [2] (see also [9]).

Lemma 1 Let R(Θ) denote the range of Θ ∈ S^T_+ and let Θ† be the pseudoinverse. The extended function f : S^T_+ × R^{n×T} → R ∪ {∞} defined as

    f(Θ, β) = Σ_{r,s=1}^T Σ_{i,j=1}^n (Θ†)_sr β_is β_jr k(x_i, x_j)  if β_i· ∈ R(Θ) ∀ i = 1, ..., n,  and  f(Θ, β) = ∞  else,

is jointly convex.

The formulation in (6) is similar to [9, 17, 18]. [9] uses the constraint Trace(Θ) ≤ 1 instead of a regularizer V(Θ), enforcing low rank of the output kernel. On the other hand, [17] employs the squared Frobenius norm for V(Θ) with the squared loss function. [18] proposed an efficient algorithm for convex V(Θ). Instead we think that sparsity of Θ is better to avoid the emergence of spurious relations between tasks and also leads to output kernels which are easier to interpret. Thus we propose to use the following regularization functional for the output kernel Θ:

    V(Θ) = Σ_{t,t'=1}^T |Θ_tt'|^p = ||Θ||_p^p,

for p ∈ [1, 2]. Several approaches [9, 17, 18] employ an alternate minimization scheme, involving costly eigendecompositions of a T × T matrix per iteration (as Θ ∈ S^T_+). In the next section we show that for a certain set of values of p one can derive an unconstrained dual optimization problem which thus avoids the explicit minimization over the S^T_+ cone. The resulting unconstrained dual problem can then be easily optimized by stochastic coordinate ascent.
Having explicit expressions of the primal variables Θ and β in terms of the dual variables allows us to get back to the original problem.

3 Unconstrained Dual Problem Avoiding Optimization over S^T_+

The primal formulation (6) is a convex multi-task output kernel learning problem. The next lemma derives the Fenchel dual function of (6). This still involves the optimization over the primal variable Θ ∈ S^T_+. A main contribution of this paper is to show that this optimization problem over the S^T_+ cone can be solved with an analytical solution for a certain class of regularizers V(Θ). In the following we denote by α_r := {α_i | t_i = r} the dual variables corresponding to task r and by K_rs the kernel matrix (k(x_i, x_j) | t_i = r, t_j = s) corresponding to the dual variables of tasks r and s.

Lemma 2 Let L*_i be the conjugate function of the loss L_i : R → R, u ↦ L(y_i, u), then

    q : R^n → R,  q(α) = −C Σ_{i=1}^n L*_i(−α_i / C) − λ max_{Θ ∈ S^T_+} ( (1/(2λ)) Σ_{r,s=1}^T Θ_rs ⟨α_r, K_rs α_s⟩ − V(Θ) )    (7)

is the dual function of (6), where α ∈ R^n are the dual variables.
The primal variable β ∈ R^{n×T} in (6) and the prediction function F can be expressed in terms of Θ and α as β_is = α_i Θ_{s t_i} and F(x, s) = Σ_{j=1}^n α_j Θ_{s t_j} k(x_j, x) respectively, where t_j is the task of the j-th training example.

We now focus on the remaining maximization problem in the dual function in (7):

    max_{Θ ∈ S^T_+}  (1/(2λ)) Σ_{r,s=1}^T Θ_rs ⟨α_r, K_rs α_s⟩ − V(Θ).    (8)

This is a semidefinite program which is computationally expensive to solve and thus prohibits scaling the output kernel learning problem to a large number of tasks. However, we show in the following that this problem has an analytical solution for a subset of the regularizers V(Θ) = (1/2) Σ_{r,s=1}^T |Θ_rs|^p for p ≥ 1. For better readability we defer a more general result towards the end of the section. The basic idea is to relax the constraint Θ ∈ S^T_+ in (8) to Θ ∈ R^{T×T} so that the problem becomes the computation of the conjugate V* of V. If the maximizer of the relaxed problem is positive semidefinite, one has found the solution of the original problem.

Theorem 3 Let k ∈ N and p = 2k/(2k−1), then with ρ_rs = (1/(2λ)) ⟨α_r, K_rs α_s⟩ we have

    max_{Θ ∈ S^T_+}  Σ_{r,s=1}^T Θ_rs ρ_rs − (1/2) Σ_{r,s=1}^T |Θ_rs|^p = (1/(4k−2)) ((2k−1)/(2kλ))^{2k} Σ_{r,s=1}^T ⟨α_r, K_rs α_s⟩^{2k},    (9)

and the maximizer is given by the positive semi-definite matrix

    Θ*_rs = ((2k−1)/(2kλ))^{2k−1} ⟨α_r, K_rs α_s⟩^{2k−1},  r, s = 1, ..., T.    (10)

Plugging the result of the previous theorem into the dual function of Lemma 2, we get for k ∈ N and p = 2k/(2k−1) with V(Θ) = ||Θ||_p^p the following unconstrained dual of our main problem (6):

    max_{α ∈ R^n}  −C Σ_{i=1}^n L*_i(−α_i / C) − (λ/(4k−2)) ((2k−1)/(2kλ))^{2k} Σ_{r,s=1}^T ⟨α_r, K_rs α_s⟩^{2k}.    (11)

Note that by doing the variable transformation κ_i := α_i / C we effectively have only one hyper-parameter in (11). This allows us to cross-validate more efficiently. The range of admissible values for p in Theorem 3 lies in the interval (1, 2], where we get for k = 1 the value p = 2 and as k → ∞ we have p → 1. The regularizer for p = 2 together with the squared loss has been considered in the primal in [17, 18]. Our analytical expression of the dual is novel and allows us to employ stochastic dual coordinate ascent to solve the involved primal optimization problem. Please also note that by optimizing the dual, we have access to the duality gap and thus a well-defined stopping criterion. This is in contrast to the alternating scheme of [17, 18] for the primal problem which involves costly matrix operations. Our runtime experiments show that our solver for (11) outperforms the solvers of [17, 18]. Finally, note that even for suboptimal dual variables α, the corresponding Θ matrix in (10) is positive semidefinite. Thus we always get a feasible set of primal variables.

Table 1: Examples of regularizers V(Θ) together with their generating function φ and the explicit form of Θ* in terms of the dual variables, ρ_rs = (1/(2λ)) ⟨α_r, K_rs α_s⟩. The optimal value of (8) is given in terms of φ as max_{Θ ∈ R^{T×T}} ⟨ρ, Θ⟩ − V(Θ) = Σ_{r,s=1}^T φ(ρ_rs).

    φ(z)                                    V(Θ)                                                                  Θ*_rs
    z^{2k}/(2k), k ∈ N                      ((2k−1)/(2k)) Σ_{r,s=1}^T |Θ_rs|^{2k/(2k−1)}                          ρ_rs^{2k−1}
    e^z = Σ_{k=0}^∞ z^k/k!                  Σ_{r,s=1}^T (Θ_rs log(Θ_rs) − Θ_rs) if Θ_rs > 0 ∀ r, s; ∞ else        e^{ρ_rs}
    cosh(z) − 1 = Σ_{k=1}^∞ z^{2k}/(2k)!    Σ_{r,s=1}^T (Θ_rs arcsinh(Θ_rs) − √(1 + Θ_rs^2)) + T^2                sinh(ρ_rs)

Characterizing the set of convex regularizers V which allow an analytic expression for the dual function. The previous theorem raises the question for which class of convex, separable regularizers we can get an analytical expression of the dual function by explicitly solving the optimization problem (8) over the positive semidefinite cone.
A key element in the proof of the previous theorem is the characterization of functions f : R → R which, when applied elementwise to a positive semidefinite matrix A ∈ S^T_+, that is f(A) = (f(a_ij))_{i,j=1}^T, result in a p.s.d. matrix, f(A) ∈ S^T_+. This set of functions has been characterized by Hiai [23].

Theorem 4 ([23]) Let f : R → R and A ∈ S^T_+. We denote by f(A) = (f(a_ij))_{i,j=1}^T the elementwise application of f to A. It holds ∀ T ≥ 2, A ∈ S^T_+ =⇒ f(A) ∈ S^T_+ if and only if f is analytic and f(x) = Σ_{k=0}^∞ a_k x^k with a_k ≥ 0 for all k ≥ 0.

Note that in the previous theorem the condition on f is only necessary when we require the implication to hold for all T. If T is fixed, the set of functions is larger and includes even (large) fractional powers, see [24]. We use the stronger formulation as we want the result to hold without any restriction on the number of tasks T. Theorem 4 is the key element used in our following characterization of separable regularizers of Θ which allow an analytical expression of the dual function.

Theorem 5 Let φ : R → R be analytic on R and given as φ(z) = Σ_{k=0}^∞ (a_k/(k+1)) z^{k+1} where a_k ≥ 0 ∀ k ≥ 0. If φ is convex, then V(Θ) := Σ_{r,s=1}^T φ*(Θ_rs) is a convex function V : R^{T×T} → R and

    max_{Θ ∈ R^{T×T}}  ⟨ρ, Θ⟩ − V(Θ) = V*(ρ) = Σ_{r,s=1}^T φ(ρ_rs),    (12)

where the global maximizer fulfills Θ* ∈ S^T_+ if ρ ∈ S^T_+ and Θ*_rs = Σ_{k=0}^∞ a_k ρ_rs^k.

Table 1 summarizes examples
of functions φ, the corresponding V(Θ), and the maximizer Θ* in (12).

4 Optimization Algorithm

The dual problem (11) can be efficiently solved via decomposition based methods like the stochastic dual coordinate ascent algorithm (SDCA) [19]. SDCA enjoys low computational complexity per iteration and has been shown to scale effortlessly to large scale optimization problems.

Algorithm 1 Fast MTL-SDCA
  Input: Gram matrix K, label vector y, regularization parameter and relative duality gap parameter ε
  Output: α (Θ is computed from α using our result in (10))
  Initialize α = α^(0)
  repeat
    Randomly choose a dual variable α_i
    Solve for Δ in (13) corresponding to α_i
    α_i ← α_i + Δ
  until Relative duality gap is below ε

Our algorithm for learning the output kernel matrix and task parameters is summarized in Algorithm 1 (refer to the supplementary material for more details). At each step of the iteration we optimize the dual objective over a randomly chosen α_i variable. Let t_i = r be the task corresponding to α_i. We apply the update α_i ← α_i + Δ. The optimization problem of solving (11) with respect to Δ is as follows:

    min_{Δ ∈ R}  L*_i((−α_i − Δ)/C) + η ( (aΔ^2 + 2b_rr Δ + c_rr)^{2k} + 2 Σ_{s ≠ r} (b_rs Δ + c_rs)^{2k} + Σ_{s,z ≠ r} c_sz^{2k} ),    (13)

where a = k_ii, b_rs = Σ_{j : t_j = s} k_ij α_j ∀ s, c_sz = ⟨α_s, K_sz α_z⟩ ∀ s, z and η = (λ/(C(4k−2))) ((2k−1)/(2kλ))^{2k}. This one-dimensional convex optimization problem is solved efficiently via Newton method.
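The overall method can be sketched end to end for the squared loss: coordinate-wise ascent on the unconstrained dual (11), with each one-dimensional subproblem solved numerically. This is a simplified illustration, not the paper's implementation: it recomputes the full dual objective per update and replaces the Newton step on (13) with a generic scalar solver.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dual_objective(alpha, y, K, tasks, T, C, lam, k):
    # Eq. (11) with the squared loss L(y, u) = 0.5 * (y - u)^2, whose
    # conjugate gives  -C * L*(-a/C) = a * y - a^2 / (2 * C)
    U = np.zeros((len(alpha), T))
    U[np.arange(len(alpha)), tasks] = alpha
    Q = U.T @ K @ U                      # Q[r, s] = <alpha_r, K_rs alpha_s>
    reg = lam / (4 * k - 2) * ((2 * k - 1) / (2 * k * lam)) ** (2 * k) \
          * np.sum(Q ** (2 * k))
    return np.sum(alpha * y - alpha ** 2 / (2 * C)) - reg

def sdca(y, K, tasks, T, C=1.0, lam=1.0, k=1, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            # 1-D concave subproblem in the update alpha_i <- alpha_i + delta;
            # the paper uses Newton's method, we use a generic scalar solver
            def neg_obj(delta):
                a = alpha.copy()
                a[i] += delta
                return -dual_objective(a, y, K, tasks, T, C, lam, k)
            alpha[i] += minimize_scalar(neg_obj).x
    return alpha

rng = np.random.default_rng(3)
n, T = 10, 2
X = rng.standard_normal((n, 2))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))
tasks = np.arange(n) % T
y = rng.standard_normal(n)
alpha = sdca(y, K, tasks, T)
improved = dual_objective(alpha, y, K, tasks, T, 1.0, 1.0, 1) \
           >= dual_objective(np.zeros(n), y, K, tasks, T, 1.0, 1.0, 1)
```

Even this naive variant never touches the S^T_+ cone during optimization: each coordinate step only evaluates the unconstrained dual, and Θ is recovered afterwards in closed form via (10).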
The complexity of the proposed algorithm is O(T) per iteration. The proposed algorithm can also be employed for learning output kernels regularized by generic V(Θ), as discussed in the previous section.

Special case p = 2 (k = 1): For certain loss functions such as the hinge loss, the squared loss, etc., L*_i(−(α_i + Δ)/C) yields a linear or a quadratic expression in Δ. In such cases problem (13) reduces to finding the roots of a cubic equation, which has a closed form expression. Hence, our algorithm is highly efficient with the above loss functions when Θ is regularized by the squared Frobenius norm.

5 Empirical Results

In this section, we present our results on benchmark data sets comparing our algorithm with existing approaches in terms of generalization accuracy as well as computational efficiency. Please refer to the supplementary material for additional results and details.

5.1 Multi-Task Data Sets

We begin with the generalization results in multi-task setups. The data sets are as follows: a) Sarcos: a regression data set, aim is to predict 7 degrees of freedom of a robotic arm, b) Parkinson: a regression data set, aim is to predict the Parkinson's disease symptom score for 42 patients, c) Yale: a face recognition data set with 28 binary classification tasks, d) Landmine: a data set containing binary classifications from 19 different landmines, e) MHC-I: a bioinformatics data set having 10 binary classification tasks, f) Letter: a handwritten letters data set with 9 binary classification tasks.

We compare the following algorithms: single task learning (STL), multi-task methods learning the output kernel matrix (MTL [16], CMTL [12], MTRL [9]) and approaches that learn both input and output kernel matrices (MTFL [11], GMTL [10]). Our proposed formulation (11) is denoted by FMTLp.
We consider three different values for the p-norm: p = 2 (k = 1), p = 4/3 (k = 2) and p = 8/7 (k = 4). Hinge and ε-SVR loss functions were employed for classification and regression problems respectively. We follow the experimental protocol1 described in [11].

Table 2 reports the performance of the algorithms averaged over ten random train-test splits. The proposed FMTLp attains the best generalization accuracy in general. It outperforms the baseline MTL as well as MTRL and CMTL, which solely learn the output kernel matrix. Moreover, it achieves an overall better performance than GMTL and MTFL. The FMTLp with p = 4/3, 8/7 gives comparable generalization to the p = 2 case, with the additional benefit of learning a sparser and more interpretable output kernel matrix (see Figure 1).

1The performance of STL, MTL, CMTL and MTFL are reported from [11].

Table 2: Mean generalization performance and the standard deviation over ten train-test splits.

    Data set    STL        MTL        CMTL       MTFL       GMTL       MTRL       FMTLp p=2  p=4/3      p=8/7
    Regression data sets: Explained Variance (%)
    Sarcos      40.5±7.6   34.5±10.2  33.0±13.4  49.9±6.3   45.8±10.6  41.6±7.1   46.7±6.9   50.3±5.8   48.4±5.8
    Parkinson   2.8±7.5    2.7±3.6    4.9±20.0   16.8±10.8  33.6±9.4   12.0±6.8   27.0±4.4   27.0±4.4   27.0±4.4
    Classification data sets: AUC (%)
    Yale        93.4±2.3   96.4±1.6   95.2±2.1   97.0±1.6   91.9±3.2   96.1±2.1   97.0±1.2   97.0±1.4   96.8±1.4
    Landmine    74.6±1.6   76.4±1.0   75.9±0.7   76.4±0.8   76.7±1.2   76.1±1.0   76.8±0.8   76.7±1.0   76.4±0.9
    MHC-I       69.3±2.1   72.3±1.9   72.6±1.4   71.7±2.2   72.5±2.7   71.5±1.7   71.7±1.9   70.8±2.1   70.7±1.9
    Letter      61.2±0.8   60.5±1.8   61.0±1.6   61.2±0.9   60.5±1.1   60.3±1.4   61.4±0.7   61.5±1.0   61.4±1.0

(p = 2)    (p = 4/3)    (p = 8/7)

Figure 1: Plots of |Θ| matrices (rescaled to [0, 1] and averaged over ten splits) computed by our solver FMTLp for the Landmine data set for different p-norms, with cross-validated hyper-parameter values. The darker regions indicate higher values. Tasks (landmines) numbered 1-10 correspond to highly foliated regions and those numbered 11-19 correspond to bare earth or desert regions. Hence, we expect two groups of tasks (indicated by the red squares). We can observe that the learned Θ matrix at p = 2 depicts much more spurious task relationships than the ones at p = 4/3 and p = 8/7. Thus, our sparsifying regularizer improves interpretability.

5.2 Multi-Class Data Sets

The multi-class setup is cast as T one-vs-all binary classification tasks, corresponding to T classes. In this section we experimented with two loss functions: a) FMTLp-H, the hinge loss employed in SVMs, and b) FMTLp-S, the squared loss employed in OKL [17]. In these experiments, we also compare our results with MTL-SDCA, a state-of-the-art multi-task feature learning method [25].

USPS & MNIST Experiments: We followed the experimental protocol detailed in [10]. Results are tabulated in Table 3.

Table 3: Mean accuracy and the standard deviation over five train-test splits.

    Data set   STL       MTL-SDCA  GMTL      MTRL      FMTLp-H p=2  p=4/3     p=8/7     FMTLp-S p=2  p=4/3     p=8/7
    MNIST      84.1±0.3  86.0±0.2  84.8±0.3  85.6±0.4  86.1±0.4     85.8±0.4  86.2±0.4  82.2±0.6     82.5±0.4  82.4±0.3
    USPS       90.5±0.3  90.6±0.2  91.6±0.3  92.4±0.2  92.4±0.2     92.6±0.2  92.6±0.1  87.2±0.4     87.7±0.3  87.5±0.3
Our approach FMTLp-H obtains better accuracy than GMTL, MTRL and MTL-SDCA [25] on both data sets.

MIT Indoor67 Experiments: We report results on the MIT Indoor67 benchmark [26] which covers 67 indoor scene categories. We use the train/test split (80/20 images per class) provided by the authors. FMTLp-S achieved an accuracy of 73.3% with p = 8/7. Note that this is better than the results reported in [27] (70.1%) and [26] (68.24%).

SUN397 Experiments: SUN397 [28] is a challenging scene classification benchmark [26] with 397 classes. We use m = 5, 50 images per class for training, 50 images per class for testing and report the average accuracy over the 10 standard splits. We employed the CNN features extracted with the convolutional neural network (CNN) [26] using the Places 205 database. The results are tabulated in Table 4.

Table 4: Mean accuracy and the standard deviation over ten train-test splits on SUN397.

    m    STL       MTL       MTL-SDCA  FMTLp-H p=2  p=4/3     p=8/7     FMTLp-S p=2  p=4/3     p=8/7
    5    40.5±0.9  42.0±1.4  41.2±1.3  41.5±1.1     41.6±1.3  41.6±1.2  44.1±1.3     44.1±1.1  44.0±1.2
    50   55.0±0.4  57.0±0.2  54.8±0.3  55.1±0.2     55.6±0.3  55.1±0.3  58.6±0.1     58.5±0.1  58.6±0.2

(a)    (b)

Figure 2: (a) Plot compares the runtime of various algorithms with varying number of tasks on SUN397. Our approach FMTL2-S is 7 times faster than OKL [17] and 4.3 times faster than ConvexOKL [18] when the number of tasks is maximum. (b) Plot showing the factor by which FMTL2-S outperforms OKL and ConvexOKL over the hyper-parameter range on various data sets. On SUN397, we outperform OKL and ConvexOKL by factors of 5.2 and 7 respectively. On MIT Indoor67, we are better than OKL and ConvexOKL by factors of 8.4 and 2.4 respectively.
The Θ matrices computed by FMTLp-S are discussed in the supplementary material.

5.3 Scaling Experiment

We compare the runtime of our solver for FMTL2-S with the OKL solver of [17] and the ConvexOKL solver of [18] on several data sets. All three methods solve the same optimization problem. Figure 2a shows the result of the scaling experiment, where we vary the number of tasks (classes). The parameters employed are the ones obtained via cross-validation. Note that neither the OKL nor the ConvexOKL algorithm has a well-defined stopping criterion, whereas our approach can easily compute the relative duality gap (set to 10⁻³). We terminate them when they reach the primal objective value achieved by FMTL2-S. Our optimization approach is 7 times and 4.3 times faster than the alternating-minimization-based OKL and ConvexOKL, respectively, when the number of tasks is maximal. The generic FMTLp=4/3,8/7 solvers are also considerably faster than OKL and ConvexOKL.
Figure 2b compares the average runtime of our FMTLp-S with OKL and ConvexOKL over the cross-validated range of hyper-parameter values. FMTLp-S outperforms them on both the MIT Indoor67 and SUN397 data sets. On the MNIST and USPS data sets, FMTLp-S is more than 25 times faster than OKL, and more than 6 times faster than ConvexOKL. Additional details of the above experiments are discussed in the supplementary material.

6 Conclusion

We proposed a novel formulation for learning the positive semidefinite output kernel matrix for multiple tasks. Our main technical contribution is the analysis of a certain class of regularizers on the output kernel matrix for which the positive semidefinite constraint can be dropped from the optimization problem while the problem is still solved optimally. This leads to a dual formulation that can be efficiently solved using the stochastic dual coordinate ascent algorithm.
Results on benchmark multi-task and multi-class data sets demonstrate the effectiveness of the proposed multi-task algorithm in terms of runtime as well as generalization accuracy.
Acknowledgments. P.J. and M.H. acknowledge the support by the Cluster of Excellence (MMCI).

References

[1] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. JMLR, 6:615-637, 2005.
[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. ML, 73:243-272, 2008.
[3] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. In COLT, 2009.
[4] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan. A dirty model for multi-task learning. In NIPS, 2010.
[5] P. Jawanpuria and J. S. Nath. Multi-task multiple kernel learning. In SDM, 2011.
[6] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In ICML, 2013.
[7] P. Jawanpuria, J. S. Nath, and G. Ramakrishnan. Generalized hierarchical kernel learning. JMLR, 16:617-652, 2015.
[8] R. Caruana. Multitask learning. ML, 28:41-75, 1997.
[9] Y. Zhang and D. Y. Yeung. A convex formulation for learning task relationships in multi-task learning. In UAI, 2010.
[10] Z. Kang, K. Grauman, and F. Sha. Learning with whom to share in multi-task feature learning. In ICML, 2011.
[11] P. Jawanpuria and J. S. Nath. A convex feature learning formulation for latent task structure discovery. In ICML, 2012.
[12] L. Jacob, F. Bach, and J. P. Vert. Clustered multi-task learning: A convex formulation. In NIPS, 2008.
[13] C. A. Micchelli and M. Pontil.
Kernels for multitask learning. In NIPS, 2005.
[14] A. Caponnetto, C. A. Micchelli, M. Pontil, and Y. Ying. Universal multi-task kernels. JMLR, 9:1615-1646, 2008.
[15] M. A. Álvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4:195-266, 2012.
[16] T. Evgeniou and M. Pontil. Regularized multi-task learning. In KDD, 2004.
[17] F. Dinuzzo, C. S. Ong, P. Gehler, and G. Pillonetto. Learning output kernels with block coordinate descent. In ICML, 2011.
[18] C. Ciliberto, Y. Mroueh, T. Poggio, and L. Rosasco. Convex learning of multiple tasks and their structure. In ICML, 2015.
[19] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. JMLR, 14(1):567-599, 2013.
[20] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[21] M. Hein and O. Bousquet. Kernels, associated structures and generalizations. Technical Report TR-127, Max Planck Institute for Biological Cybernetics, 2004.
[22] A. Ben-Israel and B. Mond. What is invexity? J. Austral. Math. Soc. Ser. B, 28:1-9, 1986.
[23] F. Hiai. Monotonicity for entrywise functions of matrices. Linear Algebra and its Applications, 431(8):1125-1146, 2009.
[24] R. A. Horn. The theory of infinitely divisible matrices and kernels. Trans. Amer. Math. Soc., 136:269-286, 1969.
[25] M. Lapin, B. Schiele, and M. Hein. Scalable multitask representation learning for scene classification. In CVPR, 2014.
[26] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.
[27] M. Koskela and J. Laaksonen. Convolutional network features for scene recognition. In Proceedings of the ACM International Conference on Multimedia, 2014.
[28] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba.
SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.