{"title": "Heterogeneous multitask learning with joint sparsity constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2151, "page_last": 2159, "abstract": "Multitask learning addresses the problem of learning related tasks that can share information with each other. Traditional approaches usually deal with homogeneous tasks, such as regression or classification, individually. In this paper we consider the problem of learning multiple related tasks where the tasks consist of both continuous and discrete outputs from a common set of input variables that lie in a high-dimensional space. All of the tasks are related in the sense that they share the same set of relevant input variables, but the amount of influence of each input on different outputs may vary. We formulate this problem as a combination of linear regression and logistic regression and model the joint sparsity as the L1/Linf or L1/L2 norm of the model parameters. Among several possible applications, our approach addresses an important open problem in genetic association mapping, where we are interested in discovering genetic markers that influence multiple correlated traits jointly. In our experiments, we demonstrate our method in the scenario of association mapping, using simulated and asthma data, and show that the algorithm can effectively recover the relevant inputs with respect to all of the tasks.", "full_text": "Heterogeneous Multitask Learning with Joint Sparsity Constraints\n\nXiaolin Yang\nDepartment of Statistics\nCarnegie Mellon University\nPittsburgh, PA 15213\nxyang@stat.cmu.edu\n\nSeyoung Kim\nMachine Learning Department\nCarnegie Mellon University\nPittsburgh, PA 15213\nsssykim@cs.cmu.edu\n\nEric P. Xing\nMachine Learning Department\nCarnegie Mellon University\nPittsburgh, PA 15213\nepxing@cs.cmu.edu\n\nAbstract\n\nMultitask learning addresses the problem of learning related tasks that presum-\nably share some commonalities on their input-output mapping functions. Previ-\nous approaches to multitask learning usually deal with homogeneous tasks, such\nas purely regression tasks, or entirely classi\ufb01cation tasks. In this paper, we con-\nsider the problem of learning multiple related tasks of predicting both continu-\nous and discrete outputs from a common set of input variables that lie in a high-\ndimensional feature space. All of the tasks are related in the sense that they share\nthe same set of relevant input variables, but the amount of in\ufb02uence of each input\non different outputs may vary. We formulate this problem as a combination of lin-\near regressions and logistic regressions, and model the joint sparsity as L1/L\u221e or\nL1/L2 norm of the model parameters. Among several possible applications, our\napproach addresses an important open problem in genetic association mapping,\nwhere the goal is to discover genetic markers that in\ufb02uence multiple correlated\ntraits jointly. In our experiments, we demonstrate our method in this setting, using\nsimulated and clinical asthma datasets, and we show that our method can effec-\ntively recover the relevant inputs with respect to all of the tasks.\n\n1 Introduction\n\nIn multitask learning, one is interested in learning a set of related models for predicting multiple\n(possibly) related outputs (i.e., tasks) given a set of input variables [4]. In many applications, the\nmultiple tasks share a common input space, but have different functional mappings to different\noutput variables corresponding to different tasks. 
When the tasks and their corresponding models\nare believed to be related, it is desirable to learn all of the models jointly rather than treating the\ntasks as independent of each other and \ufb01tting each model separately. Such a learning strategy that\nallows us to borrow information across tasks can potentially increase the predictive power of the\nlearned models.\nDepending on the type of information shared among the tasks, a number of different algorithms have\nbeen proposed. For example, hierarchical Bayesian models have been applied when the parameter\nvalues themselves are thought to be similar across tasks [2, 14]. A probabilistic method for modeling\nthe latent structure shared across multiple tasks has been proposed [16]. For problems in which the\ninput lies in a high-dimensional space and the goal is to recover the shared sparsity structure across\ntasks, a regularized regression method has been proposed [10].\nIn this paper, we consider an interesting and not uncommon scenario of multitask learning, where\nthe tasks are heterogeneous and share a union support. That is, each task can be either a regression\nor classi\ufb01cation problem, with the inputs lying in a very high-dimensional feature space, but only a\nsmall number of the input variables (i.e., predictors) are relevant to each of the output variables (i.e.,\nresponses). Furthermore, we assume that all of the related tasks possibly share common relevant\npredictors, but with varying amounts of in\ufb02uence on each task.\nPrevious approaches for multitask learning usually consider a set of homogeneous tasks, such as re-\ngressions only, or classi\ufb01cations only. 
When each of these discrete or continuous prediction tasks is\ntreated separately, given a high-dimensional design, the lasso method that penalizes the loss function\nwith an L1 norm of the parameters has been a popular approach for variable selection [13, 11], since\nthe L1 regularization has the property of shrinking parameters corresponding to irrelevant predictors\nexactly to zero. One of the successful extensions of the standard lasso is the group lasso that uses an\nL1/L2 penalty de\ufb01ned over predictor groups [15], instead of just the L1 penalty ubiquitously over\nall predictors. Recently, a more general L1/Lq-regularized regression scheme with q > 0 has been\nthoroughly investigated [17]. When the L1/Lq penalty is used in estimating the regression function\nfor a single predictive task, it makes use of information about the grouping of input variables, and\napplies the L1 penalty over the Lq norm of the regression coef\ufb01cients for each group of inputs. As\na result, variable selection can be effectively achieved on each group rather than on each individual\ninput variable. This type of regularization scheme can be also used against the output variables in\na single classi\ufb01cation task with multi-way (rather than binary) prediction, where the output is ex-\npanded from univariate to multivariate with dummy variables for each prediction category. In this\nsituation the group lasso can promote selecting the same set of relevant predictors across all of the\ndummy variables (which is desirable since these dummy variables indeed correspond to only a sin-\ngle multi-way output). In our multitask learning problem, when the L1/L2 penalty of group lasso is\nused for multitask regression [9, 10, 1], the L2 norm is applied to the regression coef\ufb01cients for each\ninput across all tasks, and the L1 norm is applied to these L2 norms, playing the role of selecting\ncommon input variables relevant to one or more tasks via a sparse union support recovery. 
Since the\nparameter estimation problem formulated with such penalty terms has a convex objective function,\nmany of the algorithms developed for a general convex optimization problem can be used for solving\nthe learning problem. For example, an interior point method and a preconditioned conjugate gra-\ndient algorithm have been used to solve a large-scale L1-regularized linear regression and logistic\nregression [8]. In [6, 13], a coordinate-descent method was used in solving an L1-regularized linear\nregression and generalized linear models, where the soft thresholding operator gives a closed-form\nsolution for each coordinate in each iteration.\nIn this paper, we consider the more challenging but realistic scenario of having heterogeneous out-\nputs, i.e., both continuous and discrete responses, in multitask learning. This means that the tasks\nin question consist of both regression and classi\ufb01cation problems. Assuming a linear regression for\ncontinuous-valued output and a logistic regression for discrete-valued output with dummy variables\nfor multiple categories, an L1/Lq penalty can be used to learn both types of tasks jointly for a sparse\nunion support recovery. Since the L1/Lq penalty selects the same relevant inputs for all dummy out-\nputs for each classi\ufb01cation task, the desired consistency in chosen relevant inputs across the dummy\nvariables corresponding to the same multi-way response is automatically maintained. We consider\nparticular cases of L1/Lq regularizations with q = 2 and q = \u221e.\nOur work is primarily motivated by the problem of genetic association mapping based on genome-\nwide genotype data of single nucleotide polymorphisms (SNPs), and phenotype data such as disease\nstatus, clinical traits, and microarray data collected over a large number of individuals. 
The goal in\nthis study is to identify the SNPs (or inputs) that explain the variation in the phenotypes (or outputs),\nwhile reducing false positives in the presence of a large number of irrelevant SNPs from the genome-\nscale data. Since many clinical traits for a given disease are highly correlated, it is greatly bene\ufb01cial\nto combine information across multiple such related phenotypes because the inputs often involve\nmillions of SNPs and the association signals of causal (or relevant) SNPs tend to be very weak\nwhen computed individually. However, statistically signi\ufb01cant patterns can emerge when the joint\nassociations to multiple related traits are estimated properly. In recent years, researchers\nhave started to recognize the importance of the joint analysis of multiple correlated phenotypes [5, 18],\nbut there has been a lack of statistical tools to systematically perform such analysis. In our previous\nwork [7], we developed a regularized regression method, called a graph-guided fused lasso, for the\nmultitask regression problem that takes advantage of the graph structure over tasks to encourage a\nselection of common inputs across highly correlated traits in the graph. However, this method can\nonly be applied to the restricted case of correlated continuous-valued outputs. In reality, the set of\nclinical traits related to a disease often contains both continuous- and discrete-valued traits. As we\ndemonstrate in our experiments, the L1/Lq regularization for the joint regression and classi\ufb01cation\ncan successfully handle this situation.\nThe paper is organized as follows. In Section 2, we introduce the notation and the basic formulation\nfor the joint regression-classi\ufb01cation problem, and describe the L1/L\u221e and L1/L2 regularized regres-\nsions for heterogeneous multitask learning in this setting. 
In Section 4, we formulate the parameter\nestimation as a convex optimization problem, and present an interior-point method for solving it.\nSection 5 presents experimental results on simulated and asthma datasets. In Section 6, we conclude\nwith a brief discussion of future work.\n\n2 Joint Multitask Learning of Linear Regressions and Multinomial Logistic Regressions\n\nSuppose that we have $K$ tasks of learning a predictive model for the output variable, given a common\nset of $P$ input variables. In our joint regression-classi\ufb01cation setting, we assume that the $K$ tasks\nconsist of $K_r$ tasks with continuous-valued outputs and $K_c$ tasks with discrete-valued outputs of an\narbitrary number of categories.\nFor each of the $K_r$ regression problems, we assume a linear relationship between the input vector\n$X$ of size $P$ and the $k$th output $Y_k$ as follows:\n\n$$Y_k = \\beta_{k0}^{(r)} + X\\beta_k^{(r)} + \\epsilon, \\quad k = 1, \\ldots, K_r,$$\n\nwhere $\\beta_k^{(r)} = (\\beta_{k1}^{(r)}, \\ldots, \\beta_{kP}^{(r)})'$ represents a vector of $P$ regression coef\ufb01cients for the $k$th regression\ntask, with the superscript $(r)$ indicating that this is a parameter for regression; $\\beta_{k0}^{(r)}$ represents the\nintercept; and $\\epsilon$ denotes the residual.\nLet $y_k = (y_{k1}, \\ldots, y_{kN})'$ represent the vector of observations for the $k$th output over $N$ samples;\nand $X$ represent an $N \\times P$ matrix $X = (x_1, \\ldots, x_N)'$ of the input shared across all of the $K$ tasks,\nwhere $x_i = (x_{i1}, \\ldots, x_{iP})'$ denotes the $i$th sample. Given these data, we can estimate the $\\beta_k^{(r)}$\u2019s by\nminimizing the sum of squared error:\n\n$$L_r = \\sum_{k=1}^{K_r} (y_k - \\mathbf{1}\\beta_{k0}^{(r)} - X\\beta_k^{(r)})' \\cdot (y_k - \\mathbf{1}\\beta_{k0}^{(r)} - X\\beta_k^{(r)}), \\tag{1}$$\n\nwhere $\\mathbf{1}$ is an $N$-vector of 1\u2019s.\nFor the tasks with discrete-valued output, we set up a multinomial (i.e., softmax) logistic regression\nfor each of the $K_c$ tasks, assuming that the $k$th task has $M_k$ categories:\n\n$$P(Y_k = m \\mid X = x) = \\frac{\\exp(\\beta_{k0}^{(c)} + x\\beta_{km}^{(c)})}{1 + \\sum_{l=1}^{M_k-1} \\exp(\\beta_{k0}^{(c)} + x\\beta_{kl}^{(c)})}, \\quad \\text{for } m = 1, \\ldots, M_k - 1,$$\n\n$$P(Y_k = M_k \\mid X = x) = \\frac{1}{1 + \\sum_{l=1}^{M_k-1} \\exp(\\beta_{k0}^{(c)} + x\\beta_{kl}^{(c)})}, \\tag{2}$$\n\nwhere $\\beta_{km}^{(c)} = (\\beta_{km1}^{(c)}, \\ldots, \\beta_{kmP}^{(c)})'$, $m = 1, \\ldots, (M_k - 1)$, is the parameter vector for the $m$th\ncategory of the $k$th classi\ufb01cation task, and $\\beta_{k0}^{(c)}$ is the intercept.\nAssuming that the measurements for the $K_c$ output variables are collected for the same set of $N$\nsamples as in the regression tasks, we expand each output data $y_{ki}$ for the $k$th task of the $i$th sample\ninto a set of $M_k$ binary variables $y_{ki}' = (y_{k1i}, \\ldots, y_{k M_k i})$, where each $y_{kmi}$, $m = 1, \\ldots, M_k$, takes\nvalue 1 if the $i$th sample for the $k$th classi\ufb01cation task belongs to the $m$th category and value 0 oth-\nerwise, and thus $\\sum_m y_{kmi} = 1$. Using the observations for the output variable in this representation\nand the shared input data $X$, one can estimate the parameters $\\beta_{km}^{(c)}$\u2019s by minimizing the negative\nlog-likelihood given as below:\n\n$$L_c = -\\sum_{i=1}^{N} \\sum_{k=1}^{K_c} \\left( \\sum_{m=1}^{M_k-1} y_{kmi} \\Big( \\beta_{k0}^{(c)} + \\sum_{j=1}^{P} x_{ij}\\beta_{kmj}^{(c)} \\Big) - \\log \\Big( 1 + \\sum_{m=1}^{M_k-1} \\exp \\Big( \\beta_{k0}^{(c)} + \\sum_{j=1}^{P} x_{ij}\\beta_{kmj}^{(c)} \\Big) \\Big) \\right). \\tag{3}$$\n\nIn this joint regression-classi\ufb01cation problem, we form a global objective function by combining the\ntwo empirical loss functions in Equations (1) and (3):\n\n$$L = L_r + L_c. \\tag{4}$$\n\nThis is equivalent to estimating the $\\beta_k^{(r)}$\u2019s and $\\beta_{km}^{(c)}$\u2019s independently for each of the $K$ tasks, assum-\ning that there are no shared patterns in the way that each of the $K$ output variables is dependent\non the input variables. Our goal is to increase the performance of variable selection and prediction\npower by allowing the sharing of information among the heterogeneous tasks.\n\n3 Heterogeneous Multitask Learning with Joint Sparse Feature Selection\n\nIn real-world applications, often the covariates lie in a very high-dimensional space with only a\nsmall fraction of them being involved in determining the output, and the goal is to recover the\nsparse structure in the predictive model by selecting the true relevant covariates. For example, in\ngenetic association mapping, often millions of genetic markers over a population of individuals\nare examined to \ufb01nd associations with the given phenotype such as clinical traits, disease status,\nor molecular phenotypes. The challenge in this type of study is to locate the true causal SNPs that\nin\ufb02uence the phenotype. 
We consider the case where the related tasks share the same sparsity pattern\nsuch that they have a common set of relevant input variables for both the regression and classi\ufb01cation\ntasks and the amount of in\ufb02uence of the relevant input variables on the output may vary across the\ntasks. We introduce an L1/Lq regularization to the problem of the heterogeneous multitask learning\nin Equation (4) as below:\n\n$$L = L_r + L_c + \\lambda P_q, \\tag{5}$$\n\nwhere $P_q$ is the group penalty added to the sum of the linear regression loss and the logistic loss, and $\\lambda$ is a\nregularization parameter which determines the sparsity level and could be chosen by cross validation.\nWe consider two extreme cases of the L1/Lq penalty for group variable selection in our problem,\nnamely the L\u221e norm and the L2 norm taken across the different tasks for each input dimension:\n\n$$P_\\infty = \\sum_{j=1}^{P} \\max_{k,m} \\left( |\\beta_{kj}^{(r)}|, |\\beta_{kmj}^{(c)}| \\right) \\quad \\text{or} \\quad P_2 = \\sum_{j=1}^{P} \\left| \\beta_j^{(r)}, \\beta_j^{(c)} \\right|_{L_2}, \\tag{6}$$\n\nwhere $\\beta_j^{(r)}$ and $\\beta_j^{(c)}$ are the vectors of parameters over all regression and classi\ufb01cation tasks, respectively,\nfor the $j$th dimension. Here, the L\u221e and L2 norms over the parameters across different tasks can\nregulate the joint sparsity among tasks. The L1/L\u221e and L1/L2 norms encourage group sparsity\nin a similar way in that the $\\beta_{kj}^{(r)}$\u2019s and $\\beta_{kmj}^{(c)}$\u2019s are set to 0 simultaneously for all of the tasks for\ndimension $j$ if the L\u221e or L2 norm for that dimension is set to be 0. Similarly, if the L1 operator\nselects a non-zero value for the L\u221e or L2 norm of the $\\beta_{kj}^{(r)}$\u2019s and $\\beta_{kmj}^{(c)}$\u2019s for the $j$th input, the\nsame input is considered as relevant possibly to all of the tasks, and the $\\beta_{kj}^{(r)}$\u2019s and $\\beta_{kmj}^{(c)}$\u2019s can\nhave any non-zero values smaller than the maximum or satisfying the L2-norm constraints. 
The\nL1/L\u221e penalty tends to encourage the parameter values to be the same across all tasks for a given\ninput [17], whereas under the L1/L2 penalty the values of the parameters across tasks tend to be more\ndifferent for a given input than under the L1/L\u221e penalty.\n\n4 Optimization Method\n\nDifferent methods such as gradient descent, steepest descent, Newton\u2019s method and quasi-Newton\nmethods can be used to solve the problem in Equation (5). Although second-order methods have a\nfast convergence near the global minimum of the convex objective functions, they involve comput-\ning a Hessian matrix and inverting it, which can be infeasible in a high-dimensional setting. The\ncoordinate-descent method iteratively updates each element of the parameter vector one at a time,\nusing a closed-form update equation given all of the other elements. However, since it is a \ufb01rst-order\nmethod, the speed of convergence becomes slow as the number of tasks and dimension increase. In\n[8], the truncated Newton\u2019s method that uses a preconditioner and solves the linear system instead of\ninverting the Hessian matrix has been proposed as a fast optimization method for a very large-scale\nproblem. The linear regression loss and logistic regression loss have different forms. The interior-point\nmethod optimizes their original loss functions without any transformation so that it is more intuitive\nto see how the two heterogeneous tasks affect each other.\nIn this section, we discuss the case of the L1/L\u221e penalty since the same optimization method can be\neasily extended to handle the L1/L2 penalty. First, we re-write the problem of minimizing Equation\n(5) with the nondifferentiable L1/L\u221e penalty as\n\n$$\\begin{aligned} \\text{minimize} \\quad & L_r + L_c + \\lambda \\sum_{j=1}^{P} u_j \\\\ \\text{subject to} \\quad & \\max_{k,m} \\left( |\\beta_{kj}^{(r)}|, |\\beta_{kmj}^{(c)}| \\right) < u_j, \\quad \\text{for } j = 1, \\ldots, P, \\; k = 1, \\ldots, K_r + K_c. \\end{aligned} \\tag{7}$$\n\nFurther re-writing the constraints in the above problem, we obtain $2 \\cdot P \\cdot (K_r + \\sum_{k=1}^{K_c} (M_k - 1))$\ninequality constraints as follows:\n\n$$\\begin{aligned} -u_j < \\beta_{kj}^{(r)} < u_j, \\quad & \\text{for } k = 1, \\ldots, K_r, \\; j = 1, \\ldots, P, \\\\ -u_j < \\beta_{kmj}^{(c)} < u_j, \\quad & \\text{for } k = 1, \\ldots, K_c, \\; j = 1, \\ldots, P, \\; m = 1, \\ldots, M_k - 1. \\end{aligned}$$\n\nUsing the barrier method [3], we re-formulate the objective function in Equation (7) into an uncon-\nstrained problem given as\n\n$$\\begin{aligned} L_{\\mathrm{Barrier}} = {} & L_r + L_c + \\lambda \\sum_{j=1}^{P} u_j + \\sum_{k=1}^{K_r} \\sum_{j=1}^{P} \\left( I_-(-\\beta_{kj}^{(r)} - u_j) + I_-(\\beta_{kj}^{(r)} - u_j) \\right) \\\\ & + \\sum_{k=1}^{K_c} \\sum_{j=1}^{P} \\sum_{m=1}^{M_k-1} \\left( I_-(-\\beta_{kmj}^{(c)} - u_j) + I_-(\\beta_{kmj}^{(c)} - u_j) \\right), \\end{aligned}$$\n\nwhere\n\n$$I_-(x) = \\begin{cases} 0 & x \\le 0 \\\\ \\infty & x > 0. \\end{cases}$$\n\nThen, we apply the log barrier function $I_-(f(x)) = -(1/t) \\log(-f(x))$, where $t$ is an additional\nparameter that determines the accuracy of the approximation.\n\nLet $\\Theta$ denote the set of parameters $\\beta_k^{(r)}$\u2019s and $\\beta_{km}^{(c)}$\u2019s. Given a strictly feasible $\\Theta$, $t = t^{(0)} > 0$,\n$\\mu > 1$, and tolerance $\\epsilon > 0$, we iterate the following steps until convergence.\nStep 1 Compute $\\Theta^*(t)$ by minimizing $L_{\\mathrm{Barrier}}$, starting at $\\Theta$.\nStep 2 Update: $\\Theta := \\Theta^*(t)$.\nStep 3 Stopping criterion: quit if $m/t < \\epsilon$, where $m$ is the number of constraint functions.\nStep 4 Increase $t$: $t := \\mu t$.\nIn Step 1, we use Newton\u2019s method to minimize $L_{\\mathrm{Barrier}}$ at $t$.\nIn each iteration, we in-\ncrease $t$ in Step 4, so that we have a more accurate approximation of $I_-(u)$ through $I_-(f(x)) = -(1/t) \\log(-f(x))$.\nIn Step 1, we \ufb01nd the direction towards the optimal solution using Newton\u2019s method:\n\n$$H \\begin{bmatrix} \\Delta\\beta \\\\ \\Delta u \\end{bmatrix} = -g,$$\n\nwhere $\\Delta\\beta$ and $\\Delta u$ are the search directions of the model parameters and bounding parameters.\nThe $g$ in the above equation is the gradient vector given as $g = [g^{(r)}, g^{(c)}, g^{(u)}]^T$, where $g^{(r)}$ has\n$K_r$ components for regression tasks, $g^{(c)}$ has $K_c \\times (M_k - 1)$ components for classi\ufb01cation tasks,\nand $H$ is the Hessian matrix given as:\n\n$$H = \\begin{bmatrix} R & 0 & D^{(r)} \\\\ 0 & L & D^{(c)} \\\\ D^{(r)} & D^{(c)} & F \\end{bmatrix},$$\n\nwhere $R$ and $L$ are the second derivatives with respect to the parameters $\\beta$ for the regression and classi\ufb01cation tasks, respectively, in the form of $R =\n\\nabla^2 L_r + \\nabla^2 P_g|_{\\partial\\beta^{(r)}\\partial\\beta^{(r)}}$, $L = \\nabla^2 L_c + \\nabla^2 P_g|_{\\partial\\beta^{(c)}\\partial\\beta^{(c)}}$, $D = \\nabla^2 P_g|_{\\partial\\beta\\partial u}$ and $F = D^{(r)} + D^{(c)}$.\nIn the overall interior-point method, the process of constructing and inverting the Hessian matrix is the\nmost time-consuming part. In order to make the algorithm scalable to a large problem, we use a\npreconditioner diag($H$) of the Hessian matrix $H$, and apply the preconditioned conjugate-gradient\nalgorithm to compute the search direction.\n\nFigure 1: The regularization path for L1/L\u221e-regularized methods. (a) Regression parameters esti-\nmated from the heterogeneous task learning method, (b) regression parameters estimated from re-\ngression tasks only, (c) logistic-regression parameters estimated from the heterogeneous task learn-\ning method, and (d) logistic-regression parameters estimated from classi\ufb01cation tasks only. Blue\ncurves: irrelevant inputs; Red curves: relevant inputs.\n\n5 Experiments\n\nWe demonstrate our methods for heterogeneous multitask learning with L1/L\u221e and L1/L2 regular-\nizations on simulated and asthma datasets, and compare their performances with those from solving\ntwo types of multitask-learning problems for regressions and classi\ufb01cations separately.\n\n5.1 Simulation Study\n\nIn the context of genetic association analysis, we simulate the input and output data with known\nmodel parameters as follows. We start from the 120 haplotypes of chromosome 7 from the popu-\nlation of European ancestry in HapMap data [12], and randomly mate the haplotypes to generate\ngenotype data for 500 individuals. We randomly select 50 SNPs across the chromosome as inputs.\nIn order to simulate the parameters $\\beta_k^{(r)}$\u2019s and $\\beta_{km}^{(c)}$\u2019s, we assume six regression tasks and a single\nclassi\ufb01cation task with \ufb01ve categories, and choose \ufb01ve common SNPs from the total of 50 SNPs as\nrelevant covariates across all of the tasks. 
We \ufb01ll the non-zero entries in the regression coef\ufb01cients\n$\\beta_k^{(r)}$\u2019s with values uniformly distributed in the interval [a, b] with 5 \u2264 a, b \u2264 10, and the non-zero\nentries in the logistic-regression parameters $\\beta_{km}^{(c)}$\u2019s such that the \ufb01ve categories are separated in the\noutput space. Given these inputs and the model parameters, we generate the output values, using\nthe noise for regression tasks distributed as $N(0, \\sigma_{\\mathrm{sim}}^2)$. In the classi\ufb01cation task, we expand the\nsingle output into \ufb01ve dummy variables representing different categories that take values of 0 or 1\ndepending on which category each sample belongs to. We repeat this whole process of simulating\ninputs and outputs to obtain 50 datasets, and report the results averaged over these datasets.\nThe regularization paths of the different multitask-learning methods with an L1/L\u221e regularization\nobtained from a single simulated dataset are shown in Figure 1. The results from learning all of the\ntasks jointly are shown in Figures 1(a) and 1(c) for regression and classi\ufb01cation tasks, respectively,\nwhereas the results from learning the two sets of regression and classi\ufb01cation tasks separately are\nshown in Figures 1(b) and 1(d). The red curves indicate the parameters for true relevant inputs, and\nthe blue curves for true irrelevant inputs. We \ufb01nd that when learning both types of tasks jointly, the\nparameters of the irrelevant inputs are more reliably set to zero along the regularization path than\nlearning the two types of tasks separately.\nIn order to evaluate the performance of the methods, we use two criteria of sensitivity/speci\ufb01city\nplotted as receiver operating characteristic (ROC) curves and prediction errors on test data. 
To obtain\nROC curves, we estimate the parameters, sort the input-output pairs according to the magnitude of\nthe estimated $\\beta_{kj}^{(r)}$\u2019s and $\\beta_{kmj}^{(c)}$\u2019s, and compare the sorted list with the list of input-output pairs with\ntrue non-zero $\\beta_{kj}^{(r)}$\u2019s and $\\beta_{kmj}^{(c)}$\u2019s.\n\n[Figure 1 plots: x-axis \u03bb/max|\u03bb|, y-axis Parameters.]\n\nFigure 2: ROC curves for detecting true relevant input variables when the sample size N varies. (a)\nRegression tasks with N = 100, (b) classi\ufb01cation tasks with N = 100, (c) regression tasks with\nN = 200, and (d) classi\ufb01cation tasks with N = 200. Noise level N(0,1) was used. The joint\nregression-classi\ufb01cation methods achieve nearly perfect accuracy, and their ROC curves are com-\npletely aligned with the axes. \u2018M\u2019 indicates homogeneous multitask learning, and \u2018HM\u2019 heterogeneous\nmultitask learning (this notation is used in the remaining \ufb01gures).\n\nFigure 3: Prediction errors when the sample size N varies. (a) Regression tasks with N = 100, (b)\nclassi\ufb01cation tasks with N = 100, (c) regression tasks with N = 200, and (d) classi\ufb01cation tasks\nwith N = 200. Noise level N(0,1) was used.\n\nWe vary the sample size to N = 100 and 200, and show the ROC curves for detecting true relevant\ninputs using different methods in Figure 2. We use $\\sigma_{\\mathrm{sim}} = 1$ to generate noise in the regression\ntasks. Results for the regression and classi\ufb01cation tasks with N = 100 are shown in Figure 2(a) and\n(b) respectively, and similarly, the results with N = 200 in Figure 2(c) and (d). The results with the\nL1/L\u221e penalty are shown in blue and green to compare the homogeneous and heteroge-\nneous methods. Red and yellow show the results using the L1/L2 penalty. 
Although the performance of\nlearning the two types of tasks separately improves with a larger sample size, the joint estimation\nperforms signi\ufb01cantly better for both sample sizes. A similar trend can be seen in the prediction\nerrors for the same simulated datasets in Figure 3.\nIn order to see how different signal-to-noise ratios affect the performance, we vary the noise level\nto $\\sigma_{\\mathrm{sim}}^2 = 5$ and $\\sigma_{\\mathrm{sim}}^2 = 8$, and plot the ROC curves averaged over 50 datasets with a sample size\nN = 300 in Figure 4. Our results show that for both of the signal-to-noise ratios, learning regression\nand classi\ufb01cation tasks jointly improves the performance signi\ufb01cantly. The same observation can be\nmade from the prediction errors in Figure 5. We can see that the L1/L2 method tends to improve\nthe variable selection, but the tradeoff is that the prediction error will be high when the noise level\nis low. While L1/L\u221e has a good balance between the variable selection accuracy and prediction\nerror at a lower noise level, as the noise increases, the L1/L2 outperforms L1/L\u221e in both variable\nselection and prediction accuracy.\n\nFigure 4: ROC curves for detecting true relevant input variables when the noise level varies. (a)\nRegression tasks with noise level N(0, 5), (b) classi\ufb01cation tasks with noise level N(0, 5), (c) re-\ngression tasks with noise level N(0, 8), and (d) classi\ufb01cation tasks with noise level N(0, 8). Sample\nsize N=300 was used.\n\n[Figures 2 and 4 plots: ROC curves, x-axis 1\u2212Specificity, y-axis Sensitivity. Figure 3 plots: bar charts of prediction error (regression) and classification error. Legend in all panels: M (L1/L\u221e), HM (L1/L\u221e), M (L1/L2), HM (L1/L2).]\n\nFigure 5: Prediction errors when the noise level varies.\n(a) Regression tasks with noise level\n$N(0, 5^2)$, (b) classi\ufb01cation tasks with noise level $N(0, 5^2)$, (c) regression tasks with noise level\n$N(0, 8^2)$, and (d) classi\ufb01cation tasks with noise level $N(0, 8^2)$. Sample size N=300 was used.\n\nFigure 6: Parameters estimated from the asthma dataset for discovery of causal SNPs for the cor-\nrelated phenotypes. 
(a) Heterogeneous task learning method, and (b) separate analysis of multitask\nregressions and multitask classi\ufb01cations. The rows represent tasks, and the columns represent SNPs.\n\n5.2 Analysis of Asthma Dataset\n\nWe apply our method to the asthma dataset with 34 SNPs in the IL4R gene of chromosome 11\nand \ufb01ve asthma-related clinical traits collected over 613 patients. The set of traits includes four\ncontinuous-valued traits related to lung physiology, namely baseline predrug FEV1, maximum\nFEV1, baseline predrug FVC, and maximum FVC, as well as a single discrete-valued trait with \ufb01ve\ncategories. The goal of this analysis is to discover whether any of the SNPs (inputs) are in\ufb02uenc-\ning each of the asthma-related traits (outputs). We \ufb01t the joint regression-classi\ufb01cation method with\nL1/L\u221e and L1/L2 regularizations, and compare the results with those from \ufb01tting L1/L\u221e and L1/L2 regular-\nized methods only for the regression tasks or only for the classi\ufb01cation task. We show the estimated\nparameters for the joint learning with L1/L\u221e penalty in Figure 6(a) and the separate learning with\nL1/L\u221e penalty in Figure 6(b), where the \ufb01rst four rows correspond to the four regression tasks,\nthe next four rows are parameters for the four dummy variables of the classi\ufb01cation task, and the\ncolumns represent SNPs. We can see that the heterogeneous multitask-learning method encourages\nthe selection of common causal SNPs for the multiclass classi\ufb01cation task and the regression tasks.\n\n6 Conclusions\n\nIn this paper, we proposed a method for the recovery of the union support in heterogeneous multitask\nlearning, where the set of tasks consists of both regressions and classi\ufb01cations. 
In our experiments with simulated and asthma datasets, we demonstrated that using L1/L2 or L1/L\u221e regularizations in the joint regression-classification problem improves the performance for identifying the input variables that are commonly relevant to multiple tasks.\n\nThe sparse union support recovery presented in this paper is concerned with finding inputs that influence at least one task. In the real-world problem of association mapping, there is a clustering structure such as co-regulated genes, and it would be interesting to discover SNPs that are causal to at least one of the outputs within the subgroup rather than all of the outputs. In addition, SNPs in a region of a chromosome are often correlated with each other because of the non-random recombination process during inheritance, and this correlation structure, called linkage disequilibrium, has been actively investigated. A promising future direction would be to model this complex correlation pattern in both the input and output spaces within our framework.\n\nAcknowledgments EPX is supported by grants NSF DBI-0640543, NSF DBI-0546594, NSF IIS-0713379, and NIH 1R01GM087694, and by an Alfred P. Sloan Research Fellowship.\n\nReferences\n\n[1] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243\u2013272, 2008.\n\n[2] B. Bakker and T. Heskes. 
Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83\u201399, 2003.\n\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[4] R. Caruana. Multitask learning. Machine Learning, 28:41\u201375, 1997.\n\n[5] V. Emilsson, G. Thorleifsson, B. Zhang, A.S. Leonardson, F. Zink, J. Zhu, S. Carlson, A. Helgason, G.B. Walters, S. Gunnarsdottir, et al. Variations in DNA elucidate molecular networks that cause disease. Nature, 452(27):423\u201328, 2008.\n\n[6] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Technical Report 703, Department of Statistics, Stanford University, 2009.\n\n[7] S. Kim and E. P. Xing. Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genetics, 5(8):e1000587, 2009.\n\n[8] K. Koh, S. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8(8):1519\u20131555, 2007.\n\n[9] G. Obozinski, B. Taskar, and M. Jordan. Joint covariate selection for grouped classification. Technical Report 743, Department of Statistics, University of California, Berkeley, 2007.\n\n[10] G. Obozinski, M.J. Wainwright, and M.I. Jordan. High-dimensional union support recovery in multivariate regression. In Advances in Neural Information Processing Systems 21, 2008.\n\n[11] M. Schmidt, G. Fung, and R. Rosales. Fast optimization methods for l1 regularization: a comparative study and two new approaches. In Proceedings of the European Conference on Machine Learning, 2007.\n\n[12] The International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299\u20131320, 2005.\n\n[13] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267\u2013288, 1996.\n\n[14] K. Yu, V. Tresp, and A. Schwaighofer. 
Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.\n\n[15] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49\u201367, 2006.\n\n[16] J. Zhang, Z. Ghahramani, and Y. Yang. Flexible latent variable models for multi-task learning. Machine Learning, 73(3):221\u2013242, 2008.\n\n[17] P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Technical Report 703, Department of Statistics, University of California, Berkeley, 2008.\n\n[18] J. Zhu, B. Zhang, E.N. Smith, B. Drees, R.B. Brem, L. Kruglyak, R.E. Bumgarner, and E.E. Schadt. Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nature Genetics, 40:854\u201361, 2008.", "award": [], "sourceid": 1049, "authors": [{"given_name": "Xiaolin", "family_name": "Yang", "institution": null}, {"given_name": "Seyoung", "family_name": "Kim", "institution": null}, {"given_name": "Eric", "family_name": "Xing", "institution": null}]}