{"title": "Parametric Task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1358, "page_last": 1366, "abstract": "We introduce a novel formulation of multi-task learning (MTL) called parametric task learning (PTL) that can systematically handle infinitely many tasks parameterized by a continuous parameter. Our key finding is that, for a certain class of PTL problems, the path of optimal task-wise solutions can be represented as piecewise-linear functions of the continuous task parameter. Based on this fact, we employ a parametric programming technique to obtain the common shared representation across all the continuously parameterized tasks efficiently. We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression, and demonstrate the usefulness of the proposed method experimentally in these scenarios.", "full_text": "Parametric Task Learning\n\nIchiro Takeuchi\n\nNagoya Institute of Technology\n\nNagoya, 466-8555, Japan\n\nTatsuya Hongo\n\nNagoya Institute of Technology\n\nNagoya, 466-8555, Japan\n\ntakeuchi.ichiro@nitech.ac.jp\n\nhongo.mllab.nit@gmail.com\n\nMasashi Sugiyama\n\nTokyo Institute of Technology\n\nTokyo, 152-8552, Japan\n\nShinichi Nakajima\nNikon Corporation\n\nTokyo, 140-8601, Japan\n\nsugi@cs.titech.ac.jp\n\nnakajima.s@nikon.co.jp\n\nAbstract\n\nWe introduce an extended formulation of multi-task learning (MTL) called para-\nmetric task learning (PTL) that can systematically handle in\ufb01nitely many tasks\nparameterized by a continuous parameter. Our key \ufb01nding is that, for a certain\nclass of PTL problems, the path of the optimal task-wise solutions can be repre-\nsented as piecewise-linear functions of the continuous task parameter. Based on\nthis fact, we employ a parametric programming technique to obtain the common\nshared representation across all the continuously parameterized tasks. 
We show that our PTL formulation is useful in various scenarios such as learning under non-stationarity, cost-sensitive learning, and quantile regression. We demonstrate the advantage of our approach in these scenarios.

1 Introduction

Multi-task learning (MTL) has been studied for learning multiple related tasks simultaneously. A key assumption behind MTL is that there exists a common shared representation across the tasks. Many MTL algorithms attempt to find such a common representation and at the same time to learn multiple tasks under that shared representation. For example, we can enforce all the tasks to share a common feature subspace or a common set of variables by using an algorithm introduced in [1, 2] that alternately optimizes the shared representation and the task-wise solutions.

Although the standard MTL formulation can handle only a finite number of tasks, it is sometimes more natural to consider infinitely many tasks parameterized by a continuous parameter, e.g., in learning under non-stationarity [3] where learning problems change over continuous time, cost-sensitive learning [4] where loss functions are asymmetric with a continuous cost balance, and quantile regression [5] where the quantile is a continuous variable between zero and one. In order to handle these infinitely many parameterized tasks, we propose in this paper an extended formulation of MTL called parametric-task learning (PTL).

The key contribution of this paper is to show that, for a certain class of PTL problems, the optimal common representation shared across infinitely many parameterized tasks can be obtained. Specifically, we develop an alternating minimization algorithm a la [1, 2] for finding the entire continuum of solutions and the common feature subspace (or the common set of variables) among infinitely many parameterized tasks. 
Our algorithm exploits the fact that, for this class of PTL problems, the path of task-wise solutions is piecewise-linear in the task parameter. We use the parametric programming technique [6, 7, 8, 9] for computing those piecewise-linear solutions.

Notations: Let us denote by R, R_+, and R_++ the set of real, nonnegative, and positive numbers, respectively, while we define N_n := {1, ..., n} for every natural number n. We denote by S^d_++ the set of d × d positive definite matrices, and let I(·) be the indicator function.

2 Review of Multi-Task Learning (MTL)

In this section, we review an MTL method developed in [1, 2]. Let {(x_i, y_i)}_{i∈N_n} be the set of n training instances, where x_i ∈ X ⊆ R^d is the input and y_i ∈ Y is the output. We define w_i(t) ∈ [0, 1], t ∈ N_T, as the weight of the i-th instance for the t-th task, where T is the number of tasks. We consider an affine model f_t(x) = β_{t,0} + β_tᵀx for each task, where β_{t,0} ∈ R and β_t ∈ R^d. For notational simplicity, we define the augmented vectors β̃ := (β_0, β_1, ..., β_d)ᵀ ∈ R^{d+1} and x̃ := (1, x_1, ..., x_d)ᵀ ∈ R^{d+1}, and write the affine model as f_t(x) = β̃_tᵀx̃.

The multi-task feature learning method discussed in [1] is formulated as

  min_{ {β̃_t}_{t∈N_T}, D∈S^d_++, tr(D)≤1 }  ∑_{t∈N_T} ∑_{i∈N_n} w_i(t) ℓ_t(r(y_i, β̃_tᵀx̃_i)) + (γ/T) ∑_{t∈N_T} β_tᵀD⁻¹β_t,   (1)

where tr(D) is the trace of D, ℓ_t : R → R_+ is the loss function for the t-th task incurred on the residual r(y_i, β̃_tᵀx̃_i) (footnote 1), and γ > 0 is the regularization parameter (footnote 2). 
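As an illustrative aside (our own sketch, not code from the paper), objective (1) and the analytic representation-matrix update D = C^{1/2}/tr(C^{1/2}) used by the alternating scheme of [1] can be written in a few lines of NumPy; the function names and the squared-loss regression residual are our assumptions:

```python
import numpy as np

def mtl_objective(betas, D, X, y, w, gamma, loss):
    # Objective (1): weighted task-wise losses plus (gamma/T) * sum_t beta_t' D^{-1} beta_t.
    # betas: (T, d) per-task weight vectors (intercepts omitted for brevity);
    # w: (n, T) array, w[i, t] is the weight of instance i in task t;
    # loss: elementwise loss on residuals, e.g. np.square (a regression residual is assumed).
    T = betas.shape[0]
    D_inv = np.linalg.inv(D)
    data_term = sum(np.dot(w[:, t], loss(y - X @ betas[t])) for t in range(T))
    reg_term = (gamma / T) * sum(b @ D_inv @ b for b in betas)
    return data_term + reg_term

def update_D(betas):
    # Closed-form Step 2 of the alternating scheme: D = C^{1/2} / tr(C^{1/2}) with C = B B'.
    # (In practice a small ridge is added so that D stays invertible.)
    B = betas.T                          # d x T matrix whose t-th column is beta_t
    C = B @ B.T
    lam, U = np.linalg.eigh(C)           # C is symmetric positive semidefinite
    C_half = (U * np.sqrt(np.clip(lam, 0.0, None))) @ U.T
    return C_half / np.trace(C_half)
```

Under this update, tr(D) = 1 holds exactly, matching the constraint in (1).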
It was shown in [1] that the problem (1) is equivalent to

  min_{ {β̃_t}_{t∈N_T} }  ∑_{t∈N_T} ∑_{i∈N_n} w_i(t) ℓ_t(r(y_i, β̃_tᵀx̃_i)) + (γ/T) ‖B‖²_tr,

where B is the d × T matrix whose t-th column is given by the vector β_t, and ‖B‖_tr := tr((BBᵀ)^{1/2}) is the trace norm of B. As shown in [10], the trace norm is the convex upper envelope of the rank of B, and (1) can be interpreted as the problem of finding a common feature subspace across T tasks. This problem is often referred to as multi-task feature learning. If the matrix D is restricted to be diagonal, the formulation (1) is reduced to multi-task variable selection [11, 12].

In order to solve the problem (1), the alternating minimization algorithm was suggested in [1] (see Algorithm 1). This algorithm alternately optimizes the task-wise solutions {β̃_t}_{t∈N_T} and the common representation matrix D. It is worth noting that, when D is fixed, each β̃_t can be independently optimized (Step 1). On the other hand, when {β̃_t}_{t∈N_T} are fixed, the optimization of the matrix D can be reduced to the minimization over the d eigenvalues λ_1, ..., λ_d of the matrix C := BBᵀ, and the optimal D can be analytically computed (Step 2).

3 Parametric-Task Learning (PTL)

We consider the case where we have infinitely many tasks parameterized by a single continuous parameter. Let θ ∈ [θ_L, θ_U] be a continuous task parameter. Instead of the set of weights w_i(t), t ∈ N_T, we consider a weight function w_i : [θ_L, θ_U] → [0, 1] for each instance i ∈ N_n. 
In PTL, we learn a parameter vector β̃_θ ∈ R^{d+1} as a continuous function of the task parameter θ:

  min_{ {β̃_θ}_{θ∈[θ_L,θ_U]}, D∈S^d_++, tr(D)≤1 }  ∫_{θ_L}^{θ_U} ∑_{i∈N_n} w_i(θ) ℓ_θ(r(y_i, β̃_θᵀx̃_i)) dθ + γ ∫_{θ_L}^{θ_U} β_θᵀD⁻¹β_θ dθ,   (2)

where we note that the loss function ℓ_θ possibly depends on θ.

As we will explain in the next section, the above PTL formulation is useful in various important machine learning scenarios including learning under non-stationarity, cost-sensitive learning, and quantile regression.

Footnote 1: For example, r(y_i, β̃ᵀx̃_i) = (y_i − β̃ᵀx̃_i)² for regression problems with y_i ∈ R, while r(y_i, β̃ᵀx̃_i) = 1 − y_i β̃ᵀx̃_i for binary classification problems with y_i ∈ {−1, 1}.

Footnote 2: In [1], w_i(t) takes either 1 or 0. It takes 1 only if the i-th instance is used in the t-th task. 
We slightly generalize the setup so that each instance can be used in multiple tasks with different weights.

Algorithm 1: ALTERNATING MINIMIZATION ALGORITHM FOR MTL [1]
1: Input: Data {(x_i, y_i)}_{i∈N_n} and weights {w_i(t)}_{i∈N_n, t∈N_T};
2: Initialize: D ← I_d/d (I_d is the d × d identity matrix)
3: while convergence condition is not true do
4:   Step 1: For t = 1, ..., T do
       β̃_t ← arg min_{β̃} ∑_{i∈N_n} w_i(t) ℓ_t(r(y_i, β̃ᵀx̃_i)) + (γ/T) βᵀD⁻¹β;
5:   Step 2:
       D ← C^{1/2}/tr(C^{1/2}) = arg min_{D∈S^d_++, tr(D)≤1} ∑_{t∈N_T} β_tᵀD⁻¹β_t,
     where C := BBᵀ, whose (j, k)-th element is defined as C_{j,k} := ∑_{t∈N_T} β_{t,j}β_{t,k};
6: end while
7: Output: {β̃_t}_{t∈N_T} and D;

However, at first glance, the PTL optimization problem (2) seems computationally intractable, since we need to find infinitely many task-wise solutions as well as the common feature subspace (or the common set of variables if D is restricted to be diagonal) shared by infinitely many tasks.

Our key finding is that, for a certain class of PTL problems, when D is fixed, the optimal path of the task-wise solutions β̃_θ is shown to be piecewise-linear in θ. 
By exploiting this piecewise-linearity, we can efficiently handle infinitely many parameterized tasks, and the optimal solutions of this class of PTL problems can be computed exactly.

In the following theorem, we prove that the task-wise solution β̃_θ is piecewise-linear in θ if the weight functions and the loss function satisfy certain conditions.

Theorem 1. For any d × d positive-definite matrix D ∈ S^d_++, the optimal solution path of

  β̃_θ ← arg min_{β̃} ∑_{i∈N_n} w_i(θ) ℓ_θ(r(y_i, β̃ᵀx̃_i)) + γ βᵀD⁻¹β   (3)

for θ ∈ [θ_L, θ_U] is written as a piecewise-linear function of θ if the residual r(y, β̃ᵀx̃) can be written as an affine function of β̃, and the weight functions w_i : [θ_L, θ_U] → [0, 1], i ∈ N_n, and the loss function ℓ : R → R_+ satisfy either of the following conditions (a) or (b):

(a) All the weight functions are piecewise-linear functions, and the loss function is a convex piecewise-linear function which does not depend on θ;

(b) All the weight functions are piecewise-constant functions, and the loss function is a convex piecewise-linear function which depends on θ in the following form:

  ℓ_θ(r) = ∑_{h∈N_H} max{(a_h + b_h r)(c_h + d_h θ), 0},   (4)

where H is a positive integer, and a_h, b_h, c_h, d_h ∈ R are constants such that c_h + d_h θ ≥ 0 for all θ ∈ [θ_L, θ_U].

In the proof in Appendix A, we show that, if the weight functions and the loss function satisfy condition (a) or (b), the problem (3) is reformulated as a parametric quadratic program (parametric QP), where the parameter θ only appears in the linear term of the objective function. 
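To make the piecewise-linearity concrete, the following small NumPy sketch (ours, not from the paper) shows how a path delivered by a parametric-programming solver — a sorted list of breakpoints θ_k together with the optimal solutions at them — can be evaluated anywhere and integrated exactly; on each segment the integrand of C_{j,k} = ∫ β_{θ,j} β_{θ,k} dθ is quadratic in θ, so per-segment Simpson's rule incurs no error:

```python
import numpy as np

def path_eval(theta, knots, betas):
    # Evaluate a piecewise-linear solution path at theta.
    # knots: increasing breakpoints theta_0 < ... < theta_K; betas[k] is the
    # optimal solution at knots[k]. Between breakpoints the optimum moves
    # linearly, so linear interpolation reproduces it exactly.
    k = int(np.searchsorted(knots, theta, side='right')) - 1
    k = min(max(k, 0), len(knots) - 2)
    t = (theta - knots[k]) / (knots[k + 1] - knots[k])
    return (1.0 - t) * betas[k] + t * betas[k + 1]

def path_second_moment(knots, betas):
    # Accumulate the matrix C whose (j, k) entry is the integral of
    # beta_theta[j] * beta_theta[k] over the whole parameter range.
    # The integrand is quadratic on each segment, so per-segment
    # Simpson's rule (endpoints + midpoint) is exact.
    d = betas.shape[1]
    C = np.zeros((d, d))
    for k in range(len(knots) - 1):
        h = knots[k + 1] - knots[k]
        mid = 0.5 * (betas[k] + betas[k + 1])
        C += (h / 6.0) * (np.outer(betas[k], betas[k])
                          + 4.0 * np.outer(mid, mid)
                          + np.outer(betas[k + 1], betas[k + 1]))
    return C
```

Such an exact segment-wise integral is what makes the Step 2 update over a continuum of tasks computable in closed form.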
As shown, for example, in [9], the optimal solution path of this class of parametric QPs has a piecewise-linear form. If β̃_θ is piecewise-linear in θ, we can exactly compute the entire solution path by using parametric programming. In the machine learning literature, parametric programming is often used in the context of regularization path-following [13, 14, 15] (footnote 3). We start from the solution at θ = θ_L, and follow the path of the optimal solutions while θ is continuously increased. This is efficiently conducted by exploiting the piecewise-linearity.

Our proposed algorithm for solving the PTL problem (2) is described in Algorithm 2, which is essentially a continuous version of the MTL algorithm shown in Algorithm 1.

Algorithm 2: ALTERNATING MINIMIZATION ALGORITHM FOR PTL
1: Input: Data {(x_i, y_i)}_{i∈N_n} and weight functions w_i : [θ_L, θ_U] → [0, 1] for all i ∈ N_n;
2: Initialize: D ← I_d/d (I_d is the d × d identity matrix)
3: while convergence condition is not true do
4:   Step 1: For all the continuum of θ ∈ [θ_L, θ_U] do
       β̃_θ ← arg min_{β̃} ∑_{i∈N_n} w_i(θ) ℓ_θ(r(y_i, β̃ᵀx̃_i)) + γ βᵀD⁻¹β
     by using parametric programming;
5:   Step 2:
       D ← C^{1/2}/tr(C^{1/2}) = arg min_{D∈S^d_++, tr(D)≤1} ∫_{θ_L}^{θ_U} β_θᵀD⁻¹β_θ dθ,   (5)
     where the (j, k)-th element of C ∈ R^{d×d} is defined as C_{j,k} := ∫_{θ_L}^{θ_U} β_{θ,j}β_{θ,k} dθ;
6: end while
7: Output: {β̃_θ} for θ ∈ [θ_L, θ_U] and D;

Note that, by exploiting the piecewise linearity of β_θ, we can compute the integral at Step 2 (Eq. 
(5)) in Algorithm 2. Algorithm 2 can be changed to parametric-task variable selection if Step 2 is replaced with

  D ← diag(λ_1, ..., λ_d), where λ_j = sqrt(∫_{θ_L}^{θ_U} β_{θ,j}² dθ) / ∑_{j'∈N_d} sqrt(∫_{θ_L}^{θ_U} β_{θ,j'}² dθ) for all j ∈ N_d,

which can also be computed efficiently by exploiting the piecewise linearity of β_θ.

4 Examples of PTL Problems

In this section, we present three examples where our PTL formulation (2) is useful.

Binary Classification Under Non-Stationarity: Suppose that we observe n training instances sequentially, and denote them as {(x_i, y_i, τ_i)}_{i∈N_n}, where x_i ∈ R^d, y_i ∈ {−1, 1}, and τ_i is the time when the i-th instance is observed. Without loss of generality, we assume that τ_1 < ... < τ_n. Under non-stationarity, if we are requested to learn a classifier to predict the output for a test input x observed at time τ, the training instances observed around time τ should have more influence on the classifier than others.

Let w_i(τ) denote the weight of the i-th instance when training a classifier for a test point at time τ. We can, for example, use the following triangular weight function (see Figure 1):

  w_i(τ) = 1 + s⁻¹(τ_i − τ) if τ − s ≤ τ_i < τ;  1 − s⁻¹(τ_i − τ) if τ ≤ τ_i < τ + s;  0 otherwise,   (6)

where s > 0 determines the width of the triangular time windows. 
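As a quick illustration (our own sketch, not code from the paper), the triangular window (6) is simply a clipped absolute deviation; note that it is piecewise-linear in τ, exactly as condition (a) of Theorem 1 requires:

```python
import numpy as np

def triangular_weight(tau_i, tau, s):
    # Triangular time-window weight from Eq. (6): rises linearly from 0 at
    # tau - s to 1 at tau, then falls back to 0 at tau + s (s > 0 is the width).
    return float(np.clip(1.0 - abs(tau_i - tau) / s, 0.0, 1.0))
```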
The problem of training a classifier for time τ is then formulated as

  min_{β̃} ∑_{i∈N_n} w_i(τ) max(0, 1 − y_i β̃ᵀx̃_i) + γ‖β‖₂²,

where we used the hinge loss.

Footnote 3: In regularization path-following, one computes the optimal solution path w.r.t. the regularization parameter, whereas we compute the optimal solution path w.r.t. the task parameter θ.

Figure 1: Examples of weight functions {w_i(τ)}_{i∈N_n} in non-stationary time-series learning. Given training instances (x_i, y_i) at times τ_i for i = 1, ..., n under a non-stationary condition, it is reasonable to use the weights {w_i(τ)}_{i∈N_n} shown here when we learn a classifier to predict the output of a test input at time τ.

If we have the belief that a set of classifiers for different times should have some common structure, we can apply our PTL approach to this problem. If we consider a time interval τ ∈ [τ_L, τ_U], the parametric-task feature learning problem is formulated as

  min_{ {β̃_τ}_{τ∈[τ_L,τ_U]}, D∈S^d_++, tr(D)≤1 }  ∫_{τ_L}^{τ_U} ∑_{i∈N_n} w_i(τ) max(0, 1 − y_i β̃_τᵀx̃_i) dτ + γ ∫_{τ_L}^{τ_U} β_τᵀD⁻¹β_τ dτ.   (7)

Note that the problem (7) satisfies condition (a) in Theorem 1.

Joint Cost-Sensitive Learning: Next, let us consider cost-sensitive binary classification. When the costs of false positives and false negatives are unequal, or when the numbers of positive and negative training instances are highly imbalanced, it is effective to use the cost-sensitive learning approach [16]. Suppose that we are given a set of training instances {(x_i, y_i)}_{i∈N_n} with x_i ∈ R^d and y_i ∈ {−1, 1}. 
If we know that the ratio of the false positive and false negative costs is approximately θ : (1 − θ), it is reasonable to solve the following cost-sensitive SVM [17]:

  min_{β̃} ∑_{i∈N_n} w_i(θ) max(0, 1 − y_i β̃ᵀx̃_i) + γ‖β‖₂²,

where the weight w_i(θ) is defined as

  w_i(θ) = θ if y_i = −1;  1 − θ if y_i = +1.

When the exact false positive and false negative costs in the test scenario are unknown [4], it is often desirable to train several cost-sensitive SVMs with different values of θ. If we have the belief that a set of classifiers for different cost ratios should have some common structure, we can apply our PTL approach to this problem. If we consider an interval θ ∈ [θ_L, θ_U], 0 < θ_L < θ_U < 1, the parametric-task feature learning problem is formulated as

  min_{ {β̃_θ}_{θ∈[θ_L,θ_U]}, D∈S^d_++, tr(D)≤1 }  ∫_{θ_L}^{θ_U} ∑_{i∈N_n} w_i(θ) max(0, 1 − y_i β̃_θᵀx̃_i) dθ + γ ∫_{θ_L}^{θ_U} β_θᵀD⁻¹β_θ dθ.   (8)

The problem (8) also satisfies condition (a) in Theorem 1. Figure 2 shows an example of joint cost-sensitive learning applied to a toy 2D binary classification problem.

Joint Quantile Regression: Given a set of training instances {(x_i, y_i)}_{i∈N_n} with x_i ∈ R^d and y_i ∈ R drawn from a joint distribution P(X, Y), quantile regression [19] is used to estimate the conditional τ-th quantile F⁻¹_{Y|X=x}(τ) as a function of x, where τ ∈ (0, 1) and F_{Y|X=x} is the cumulative distribution function of the conditional distribution P(Y|X = x). 
Jointly estimating multiple conditional quantile functions is often useful for exploring the stochastic relationship between X and Y (see Section 5 for an example of joint quantile regression problems). Linear quantile regression along with L2 regularization [20] at order τ ∈ (0, 1) is formulated as

  min_{β̃} ∑_{i∈N_n} ρ_τ(y_i − β̃ᵀx̃_i) + γ‖β‖₂²,  where ρ_τ(r) := (1 − τ)|r| if r ≤ 0;  τ|r| if r > 0.

(a) Independent cost-sensitive learning  (b) Joint cost-sensitive learning

Figure 2: An example of joint cost-sensitive learning on a 2D toy dataset (the 2D input x is expanded to n dimensions by radial basis functions centered on each x_i). In each plot, the decision boundaries of five cost-sensitive SVMs (θ = 0.1, 0.25, 0.5, 0.75, 0.9) are shown. (a) The left plot shows the results obtained by independently training each cost-sensitive SVM. (b) The right plot shows the results obtained by jointly training infinitely many cost-sensitive SVMs for the entire continuum of θ ∈ [0.05, 0.95] using the methodology we present in this paper (both are trained with the same regularization parameter γ). 
When independently trained, the inter-relationships among different cost-sensitive SVMs can look inconsistent (cf. [18]).

If we have the belief that a family of quantile regressions at various τ ∈ (0, 1) has some common structure, we can apply our PTL framework to joint estimation of the family of quantile regressions. This PTL problem satisfies condition (b) in Theorem 1, and is written as

  min_{ {β_τ}_{τ∈(0,1)}, D∈S^d_++, tr(D)≤1 }  ∫_0^1 ∑_{i∈N_n} ρ_τ(y_i − β_τᵀx_i) dτ + γ ∫_0^1 β_τᵀD⁻¹β_τ dτ,

where we do not need any weighting and omit w_i(τ) = 1 for all i ∈ N_n and τ ∈ [0, 1].

5 Numerical Illustrations

In this section, we illustrate various aspects of PTL with the three examples discussed in the previous section.

Artificial Example for Learning under Non-stationarity: We first consider a simple artificial problem with non-stationarity, where the data-generating mechanism gradually changes. We assume that our data-generating mechanism produces the training set {(x_i, y_i, τ_i)}_{i∈N_n} with n = 100 as follows. For each τ_i ∈ {0, 2π/n, 2·(2π/n), ..., (n − 1)·(2π/n)}, the output y_i is first determined as y_i = 1 if i is odd, while y_i = −1 if i is even. Then, x_i ∈ R^d is generated as

  x_{i1} ~ N(y_i cos τ_i, 1²),  x_{i2} ~ N(y_i sin τ_i, 1²),  x_{ij} ~ N(0, 1²) for all j ∈ {3, ..., d},   (9)

where N(μ, σ²) is the normal distribution with mean μ and variance σ². 
Namely, only the first two dimensions of x differ between the two classes, and the remaining d − 2 dimensions can be regarded as noise. In addition, according to the value of τ_i, the means of the class-wise distributions in the first two dimensions gradually change. The data distributions of the first two dimensions for τ = 0, 0.5π, π, 1.5π are illustrated in Figure 3. Here, we applied our PT feature learning approach with the triangular time windows in (6) with s = 0.25π. Figure 4 shows the mis-classification rate of PT feature learning (PTFL) and ordinary independent learning (IND) on a similarly generated test sample of size 1000. When the input dimension is d = 2, there is no advantage in learning common features since these two input dimensions are both important for classification. On the other hand, as d increases, PT feature learning becomes more and more advantageous. Especially when the regularization parameter γ is large, the independent learning approach deteriorates completely as d increases, while PTFL works reasonably well in all the setups.

Figure 3: The first 2 input dimensions of the artificial example at τ = 0, 0.5π, π, 1.5π. The class-wise distributions in these two dimensions gradually change with τ ∈ [0, 2π].

Figure 4: Experimental results on the artificial example under non-stationarity. Mis-classification rates on a test sample of size 1000 for the setups d ∈ {2, 5, 10, 20, 50, 100} and γ ∈ {0.1, 1, 10} are shown. The red symbols indicate the results of our PT feature learning (PTFL), whereas the blue symbols indicate ordinary independent learning (IND). Plotted are averages (and standard deviations) over 100 replications with different random seeds. 
All the differences except d = 2 are statistically significant (p < 0.01).

Joint Cost-Sensitive SVM Learning on Benchmark Datasets: Here, we report the experimental results on the joint cost-sensitive SVM learning discussed in Section 4. Although our main contribution is not to claim favorable generalization properties of parametric task learning solutions, we compared, as an illustration, the generalization performance of PT feature learning (PTFL) and PT variable selection (PTVS) with the ordinary independent learning approach (IND). In PTFL and PTVS, we learned common feature subspaces and common sets of variables shared across the continuum of cost-sensitive SVMs for θ ∈ [0.05, 0.95] on 10 benchmark datasets (see Table 1). For each dataset, we divided the entire sample into training, validation, and test sets of almost equal size. The average test errors (and standard deviations) over 10 different data splits are reported in Table 1. The total test error for cost-sensitive SVMs with θ = 0.1, 0.2, ..., 0.9 is defined as

  ∑_{θ∈{0.1,...,0.9}} ( θ ∑_{i: y_i=−1} I(f_θ(x_i) > 0) + (1 − θ) ∑_{i: y_i=1} I(f_θ(x_i) ≤ 0) ),

where f_θ is the trained SVM with the cost ratio θ. Model selection was conducted by using the same criterion on the validation sets. We see that, in most cases, PTFL or PTVS had better generalization performance than IND.

Joint Quantile Regression: Finally, we applied PT feature learning to joint quantile regression problems. Here, we took a slightly different approach from what was described in the previous section. Given a training set {(x_i, y_i)}_{i∈N_n}, we first estimated the conditional mean function E[Y|X = x] by least-squares regression, and computed the residuals r_i := y_i − Ê[Y|X = x_i], where Ê is the estimated conditional mean function. 
Then, we applied PT feature learning to {(x_i, r_i)}_{i∈N_n}, and estimated the conditional τ-th quantile function as F̂⁻¹_{Y|X=x}(τ) := Ê[Y|X = x] + f̂_res(x|τ), where f̂_res(·|τ) is the estimated τ-th quantile regression fitted to the residuals.

When multiple quantile regressions with different τs are independently learned, we often encounter a notorious problem known as quantile crossing (see Section 2.5 in [5]). For example, in Figure 5(a), some of the estimated conditional quantile functions cross each other (which never happens among the true conditional quantile functions). One possible approach to mitigate this problem is to assume a model on the heteroscedastic structure. In the simplest case, if we assume that the data are homoscedastic (i.e., the conditional distribution P(Y|x) does not depend on x except for its location),

Table 1: Average (and standard deviation) of test errors obtained by joint cost-sensitive SVMs on benchmark datasets. n is the sample size, d is the input dimension, Ind indicates the results when each cost-sensitive SVM was trained independently, while PTFL and PTVS indicate the results from PT feature learning and PT variable selection, respectively. 
The bold numbers in the table indicate the best performance among the three methods.

Data Name                | n    | d  | Ind            | PTFL           | PTVS
Parkinson                | 195  | 20 | 32.30 (10.60)  | 30.21 (9.09)   | 30.25 (8.53)
Breast Cancer Diagnostic | 569  | 30 | 20.36 (7.77)   | 18.49 (6.15)   | 19.46 (5.89)
Breast Cancer Prognostic | 194  | 33 | 48.97 (12.92)  | 49.28 (9.83)   | 48.68 (5.89)
Australian               | 690  | 14 | 117.97 (22.97) | 106.25 (12.66) | 111.22 (15.95)
Diabetes                 | 768  | 8  | 185.90 (21.13) | 179.89 (16.31) | 175.95 (16.26)
Fourclass                | 862  | 2  | 181.69 (22.13) | 179.30 (14.25) | 178.67 (19.24)
German                   | 1000 | 24 | 242.21 (18.35) | 219.66 (16.22) | 237.20 (15.78)
Splice                   | 1000 | 60 | 179.80 (24.22) | 151.69 (18.02) | 183.54 (21.27)
SVM Guide                | 300  | 10 | 175.70 (15.55) | 170.16 (9.99)  | 179.76 (14.76)
DVowel                   | 528  | 10 | 175.16 (13.78) | 175.74 (9.37)  | 175.50 (7.38)

quantile regressions at different τs can be obtained by just vertically shifting another quantile regression function (see Figure 5(f)).

Our PT feature learning approach, when applied to the joint quantile regression problem, allows us to interpolate between these two extreme cases. Figure 5 shows a joint QR example on the bone mineral density (BMD) data [21]. We applied our approach after expanding the univariate input x to a d = 5 dimensional vector by using evenly allocated RBFs. When (a) γ → 0, our approach is identical to independently estimating each quantile regression, while it coincides with the homoscedastic case when (f) γ → ∞. In our experience, the best solution is usually found somewhere between these two extremes: in this example, (d) γ = 5 was chosen as the best model by 10-fold cross-validation.

(a) γ → 0  (b) γ = 0.1  (c) γ = 1  (d) γ = 5  (e) γ = 10  (f) γ → 
∞

Figure 5: Joint quantile regression examples on the BMD data [21] for six different values of γ. Each panel plots the estimated 0.05, 0.10, ..., 0.95 conditional quantile functions of (standardized) relative BMD change against (standardized) age.

6 Conclusions

In this paper, we introduced the parametric-task learning (PTL) approach, which can systematically handle infinitely many tasks parameterized by a continuous parameter. We illustrated the usefulness of this approach by providing three examples that can be naturally formulated as PTL. We believe that there are many other practical problems that fall into this PTL framework.

Acknowledgments

The authors thank the reviewers for fruitful comments. IT, MS, and SN thank the support from MEXT Kakenhi 23700165, JST CREST Program, and MEXT Kakenhi 23120004, respectively.

References

[1] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems, volume 19, pages 41-48, 2007.

[2] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems, volume 20, pages 25-32, 2008.

[3] L. Cao and F. Tay. 
Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks, 14(6):1506-1518, 2003.

[4] F. R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713-1741, 2006.

[5] R. Koenker. Quantile Regression. Cambridge University Press, 2005.

[6] K. Ritter. On parametric linear and quadratic programming problems. Mathematical Programming: Proceedings of the International Congress on Mathematical Programming, pages 307-335, 1984.

[7] E. L. Allgower and K. Georg. Continuation and path following. Acta Numerica, 2:1-63, 1993.

[8] T. Gal. Postoptimal Analysis, Parametric Programming, and Related Topics. Walter de Gruyter, 1995.

[9] M. J. Best. An algorithm for the solution of the parametric quadratic programming problem. Applied Mathematics and Parallel Computing, pages 57-76, 1996.

[10] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, volume 6, pages 4734-4739, 2001.

[11] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47:349-363, 2005.

[12] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231-252, 2010.

[13] M. R. Osborne, B. Presnell, and B. A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20:389-404, 2000.

[14] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.

[15] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. 
Journal of Machine Learning Research, 5:1391-1415, 2004.

[16] Y. Lin, Y. Lee, and G. Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46:191-202, 2002.

[17] M. A. Davenport, R. G. Baraniuk, and C. D. Scott. Tuning support vector machines for minimax and Neyman-Pearson classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

[18] G. Lee and C. Scott. Nested support vector machines. IEEE Transactions on Signal Processing, 58(3):1648-1660, 2010.

[19] R. Koenker. Quantile Regression. Cambridge University Press, 2005.

[20] I. Takeuchi, Q. V. Le, T. Sears, and A. J. Smola. Nonparametric quantile estimation. Journal of Machine Learning Research, 7:1231-1264, 2006.

[21] L. K. Bachrach, T. Hastie, M. C. Wang, B. Narasimhan, and R. Marcus. Bone mineral acquisition in healthy Asian, Hispanic, Black, and Caucasian youth: a longitudinal study. The Journal of Clinical Endocrinology and Metabolism, 84:4702-4712, 1999.

[22] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
", "award": [], "sourceid": 693, "authors": [{"given_name": "Ichiro", "family_name": "Takeuchi", "institution": "Nagoya Institute of Technology"}, {"given_name": "Tatsuya", "family_name": "Hongo", "institution": "Nagoya Institute of Technology"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "Tokyo Institute of Technology"}, {"given_name": "Shinichi", "family_name": "Nakajima", "institution": "Nikon"}]}