{"title": "Multitask Learning without Label Correspondences", "book": "Advances in Neural Information Processing Systems", "page_first": 1957, "page_last": 1965, "abstract": "We propose an algorithm to perform multitask learning where each task has potentially distinct label sets and label correspondences are not readily available. This is in contrast with existing methods which either assume that the label sets shared by different tasks are the same or that there exists a label mapping oracle. Our method directly maximizes the mutual information among the labels, and we show that the resulting objective function can be efficiently optimized using existing algorithms. Our proposed approach has a direct application for data integration with different label spaces for the purpose of classification, such as integrating Yahoo! and DMOZ web directories.", "full_text": "Multitask Learning without Label Correspondences\n\nNovi Quadrianto1, Alex Smola2, Tib\u00b4erio Caetano1, S.V.N. Vishwanathan3, James Petterson1\n\n1 SML-NICTA & RSISE-ANU, Canberra, ACT, Australia\n\n2 Yahoo! Research, Santa Clara, CA, USA\n\n3 Purdue University, West Lafayette, IN, USA\n\nAbstract\n\nWe propose an algorithm to perform multitask learning where each task has poten-\ntially distinct label sets and label correspondences are not readily available. This is\nin contrast with existing methods which either assume that the label sets shared by\ndifferent tasks are the same or that there exists a label mapping oracle. Our method\ndirectly maximizes the mutual information among the labels, and we show that the\nresulting objective function can be ef\ufb01ciently optimized using existing algorithms.\nOur proposed approach has a direct application for data integration with different\nlabel spaces, such as integrating Yahoo! 
and DMOZ web directories.

1 Introduction

In machine learning it is widely known that if several tasks are related, then learning them simultaneously can improve performance [1-4]. For instance, a personalized spam classifier trained with data from several different users is likely to be more accurate than one trained with data from a single user. If one views learning as the task of inferring a function f from the input space X to the output space Y, then multitask learning is the problem of inferring several functions f_i : X_i -> Y_i simultaneously. Traditionally, one either assumes that the label sets Y_i for all the tasks are the same (that is, Y_i = Y for all i), or that we have access to an oracle mapping function g_{i,j} : Y_i -> Y_j. However, as we argue below, in many natural settings these assumptions are not satisfied.

Our motivating example is the problem of learning to automatically categorize objects on the web into an ontology or directory. It is well established that many web-related objects such as web directories and RSS directories admit a (hierarchical) categorization, and web directories aim to do this in a semi-automated fashion. For instance, it is desirable, when building a categorizer for the Yahoo! directory1, to take into account other web directories such as DMOZ2. Although the tasks are clearly related, their label sets are not identical. For instance, some section headings and sub-headings may be named differently in the two directories. Furthermore, different editors may have made different decisions about the ontology depth and structure, leading to incompatibilities. To make matters worse, these ontologies evolve with time, and certain topic labels may die naturally due to lack of interest or expertise while other new topic labels may be added to the directory. Given the large label space, it is unrealistic to expect that a label mapping function is readily available. However, the two tasks are clearly related, and learning them simultaneously is likely to improve performance.

This paper presents a method to learn classifiers from a collection of related tasks or data sets, in which each task has its own label dictionary, without constructing an explicit label mapping among them. We formulate the problem as that of maximizing the mutual information among the label sets. We then show that this maximization problem yields an objective function which can be written as a difference of concave functions. By exploiting convex duality [5], we can solve the resulting optimization problem efficiently in the dual space using existing DC programming algorithms [6].

1 http://dir.yahoo.com/
2 http://www.dmoz.org/

Related Work As described earlier, our work is closely related to research efforts on multitask learning, where the problem of simultaneously learning multiple related tasks is addressed. Several papers have empirically and theoretically highlighted the benefits of multitask learning over single-task learning when the tasks are related. There are several approaches to defining task relatedness. The works of [2, 7, 8] consider the setting where the tasks to be learned jointly share a common subset of features. This can be achieved by adding a mixed-norm regularization term that favors a common sparsity profile in the features shared by all tasks. Task relatedness can also be modeled by learning functions that are close to each other in some sense [3, 9]. Crammer et al. [10] consider the setting where, in addition to multiple sources of data, estimates of the dissimilarities between these sources are also available.
There is also work on data integration via multitask learning where each data source has the same binary label space, whereas the attributes of the inputs can admit different orderings as well as be linearly transformed [11].

The remainder of the paper is organized as follows. We briefly review the maximum entropy estimation problem and its dual in Section 2. In Section 3 we introduce the novel multitask formulation in terms of a mutual information maximization criterion. Section 4 presents the algorithm that solves the optimization problem posed by the multitask formulation. We then present experimental results, including applications to the integration of news articles and web directories, in Section 5. Finally, Section 6 concludes the paper.

2 Maximum Entropy Duality for Conditional Distributions

Here we briefly summarize the well known duality relation between approximate conditional maximum entropy estimation and maximum a posteriori (MAP) estimation [5, 12]; we will exploit it in Section 4. Recall the definition of the Shannon entropy, H(y|x) := -\sum_y p(y|x) \log p(y|x), where p(y|x) is a conditional distribution on the space of labels \mathcal{Y}. Let x \in \mathcal{X} and assume the existence of a feature map \phi(x, y) : \mathcal{X} \times \mathcal{Y} \to \mathcal{H} into a Hilbert space \mathcal{H}. Given a data set (X, Y) := \{(x_1, y_1), \ldots, (x_m, y_m)\}, where X := \{x_1, \ldots, x_m\}, define

E_{y \sim p(y|X)}[\phi(X, y)] := \frac{1}{m} \sum_{i=1}^{m} E_{y \sim p(y|x_i)}[\phi(x_i, y)], \quad \text{and} \quad \mu = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i, y_i).   (1)

Lemma 1 ([5], Lemma 6) With the above notation we have

\min_{p(y|x)} \sum_{i=1}^{m} -H(y|x_i) \quad \text{s.t.} \quad \left\| E_{y \sim p(y|X)}[\phi(X, y)] - \mu \right\|_{\mathcal{H}} \le \epsilon \ \text{and} \ \sum_{y} p(y|x_i) = 1   (2a)

= \max_{\theta} \ \langle \theta, \mu \rangle_{\mathcal{H}} - \sum_{i=1}^{m} \log \sum_{y \in \mathcal{Y}} \exp(\langle \theta, \phi(x_i, y) \rangle) - \epsilon \|\theta\|_{\mathcal{H}}.   (2b)

Although we presented a version of the above theorem using Hilbert spaces, it can also be extended to Banach spaces. Choosing different Banach space norms recovers well known algorithms such as \ell_1- or \ell_2-regularized logistic regression. Also note that by enforcing the moment matching constraint exactly, that is, by setting \epsilon = 0, we recover the well-known duality between maximum (Shannon) entropy and maximum likelihood (ML) estimation.

3 Multitask Learning via Mutual Information

For the purpose of explaining our basic idea, we focus on the case in which we want to integrate two data sources such as the Yahoo! directory and DMOZ. Associated with each data source are labels Y = \{y_1, \ldots, y_c\} \subseteq \mathcal{Y} and observations X = \{x_1, \ldots, x_m\} \subseteq \mathcal{X} (resp. Y' = \{y'_1, \ldots, y'_{c'}\} \subseteq \mathcal{Y}' and X' = \{x'_1, \ldots, x'_{m'}\} \subseteq \mathcal{X}'). The observations are disjoint, but we assume that they are drawn from the same domain, i.e., \mathcal{X} = \mathcal{X}' (in our running example they are webpages). If we were interested in solving each categorization task independently, the maximum entropy estimator described in Section 2 could be readily employed [13]. Here we would like to learn the two tasks simultaneously in order to improve classification accuracy. Assuming that the labels are different yet correlated, we expect the joint distribution p(y, y') to display high mutual information between y and y'.
Recall that the mutual information between random variables y and y' is defined as I(y, y') = H(y) + H(y') - H(y, y'), and that this quantity is high when the two variables are mutually dependent. To illustrate this in our running example of integrating the Yahoo! and DMOZ web directories: we would expect a high mutual dependency between the section heading 'Computer & Internet' in the Yahoo! directory and 'Computers' in the DMOZ directory, although they are named slightly differently. Since the marginal distributions over the labels, p(y) and p(y'), are fixed, maximizing mutual information can then be viewed as minimizing the joint entropy

H(y, y') = -\sum_{y, y'} p(y, y') \log p(y, y').   (3)

This reasoning leads us to add the joint entropy as an additional term to the objective function of the multitask problem. If we define

\mu = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i, y_i) \quad \text{and} \quad \mu' = \frac{1}{m'} \sum_{i=1}^{m'} \phi'(x'_i, y'_i),   (4)

then we have the following objective function

\max_{p(y|x)} \ \sum_{i=1}^{m} H(y|x_i) + \sum_{i=1}^{m'} H(y'|x'_i) - \lambda H(y, y') \quad \text{for some } \lambda > 0   (5a)

\text{s.t.} \quad \left\| E_{y \sim p(y|X)}[\phi(X, y)] - \mu \right\| \le \epsilon \ \text{and} \ \sum_{y \in \mathcal{Y}} p(y|x_i) = 1   (5b)

\left\| E_{y' \sim p(y'|X')}[\phi'(X', y')] - \mu' \right\| \le \epsilon' \ \text{and} \ \sum_{y' \in \mathcal{Y}'} p(y'|x'_i) = 1.   (5c)

Intuitively, the above objective function tries to find a 'simple' distribution p which is consistent with the observed samples via the moment matching constraints while also taking task relatedness into account.
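As a concrete illustration of the identity I(y, y') = H(y) + H(y') - H(y, y'), the following sketch evaluates mutual information on toy joint label distributions (the probability tables here are made up purely for illustration, not taken from the paper's experiments):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a distribution given as an array."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]  # 0 log 0 is taken to be 0
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(y, y') = H(y) + H(y') - H(y, y') for a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=1)   # marginal over y
    p_yp = joint.sum(axis=0)  # marginal over y'
    return entropy(p_y) + entropy(p_yp) - entropy(joint)

# Perfectly corresponding labels: I(y, y') = H(y) = log 2 ~ 0.693.
aligned = [[0.5, 0.0], [0.0, 0.5]]
# Independent labels: I(y, y') = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(aligned), mutual_information(independent))
```

Labels that tend to co-occur (a near-diagonal joint table) give high mutual information, which is exactly what the objective above rewards.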
We can recover the single task maximum entropy estimator by removing the joint entropy term (by setting \lambda = 0), since the optimization problem (the objective function as well as the constraints) in (5) then decouples in terms of p(y|x) and p(y'|x'). There are two main challenges in solving (5):

- The joint entropy term H(y, y') is concave, hence the above objective is not concave in general (it is the difference of two concave functions). We therefore propose to solve this non-concave problem using DC programming [6], in particular the concave convex procedure (CCCP) [14, 15].
- The joint distribution between labels p(y, y') is unknown. We estimate this quantity (and therefore the joint entropy) from the observations x and x'. Further, we assume that y and y' are conditionally independent given an arbitrary input x \in \mathcal{X}, that is, p(y, y'|x) = p(y|x) p(y'|x). For instance, in our example, the annotations made by an editor at Yahoo! and an editor at DMOZ on a set of webpages are assumed conditionally independent given those webpages. This assumption essentially means that the labeling process depends entirely on the set of webpages, i.e., any other latent factors that might connect the two editors are ignored.

In the following section we discuss in further detail how to address these two challenges, and we derive the resulting optimization problem, which can be solved efficiently by existing convex solvers.

4 Optimization
The concave convex procedure (CCCP) works as follows: for a given function f(x) = g(x) - h(x), where g is concave and -h is convex, a lower bound can be found via

f(x) \ge g(x) - h(x_0) - \langle \partial h(x_0), x - x_0 \rangle.   (6)

This lower bound is concave and can be maximized effectively over a convex domain.
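The CCCP iteration just described can be sketched in one dimension; the functions g and h below are illustrative toy choices (not the paper's objective), picked only so that f = g - h is a difference of concave functions with a unique maximizer:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy DC objective f(x) = g(x) - h(x) with g concave and -h convex:
#   g(x) = -(x - 1)^2        (concave)
#   h(x) = -log(1 + e^x)     (concave, so -h is convex)
def f(x):
    return -(x - 1.0) ** 2 + math.log(1.0 + math.exp(x))

def cccp_step(x0):
    """Maximize the concave surrogate g(x) - h(x0) - h'(x0) * (x - x0).

    Here h'(x) = -sigmoid(x), so stationarity of the surrogate gives
    -2(x - 1) = -sigmoid(x0), i.e. x = 1 + sigmoid(x0) / 2.
    """
    return 1.0 + sigmoid(x0) / 2.0

x = 0.0
for _ in range(20):
    x = cccp_step(x)  # each step maximizes the current concave lower bound

# At convergence x is a stationary point: f'(x) = -2(x - 1) + sigmoid(x) = 0.
grad = -2.0 * (x - 1.0) + sigmoid(x)
print(x, grad)
```

Each step maximizes a concave lower bound that touches f at the current iterate, so the objective value never decreases, matching the convergence guarantee cited below.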
Subsequently one finds a new location x_0 and the entire procedure is repeated. This procedure is guaranteed to converge to a local optimum or saddle point [16].

Therefore, one potential approach to solving the optimization problem in (5) is to use successive linear lower bounds on H(y, y') and to solve the resulting decoupled problems in p(y|x) and p(y'|x') separately. We estimate the joint entropy term H(y, y') by its empirical quantity on x and x' with the conditional independence assumption (in the sequel, we make the dependency of p(y|x) on a parameter \theta explicit, and similarly the dependency of p(y'|x') on \theta'), that is

H(y, y'|X) = -\sum_{y, y'} \left[ \frac{1}{m} \sum_{i=1}^{m} p(y|x_i, \theta) p(y'|x_i, \theta') \right] \log \left[ \frac{1}{m} \sum_{j=1}^{m} p(y|x_j, \theta) p(y'|x_j, \theta') \right],   (7)

and similarly for H(y, y'|X'). Each iteration of CCCP approximates the convex part (the negative joint entropy) by its tangent, that is, \langle \partial h(x_0), x \rangle in (6).
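The empirical joint entropy estimate (7) is straightforward to compute from the two models' predictions. A sketch (the probability tables P and Q below are made-up conditional probabilities standing in for p(y|x_i, theta) and p(y'|x_i, theta')):

```python
import numpy as np

def empirical_joint_entropy(P, Q):
    """Estimate H(y, y' | X) as in (7): P[i, y] = p(y | x_i, theta) and
    Q[i, y'] = p(y' | x_i, theta'), with y and y' assumed conditionally
    independent given x_i."""
    m = P.shape[0]
    joint = P.T @ Q / m   # joint[y, y'] = (1/m) sum_i p(y|x_i) p(y'|x_i)
    mask = joint > 0      # 0 log 0 is taken to be 0
    return float(-np.sum(joint[mask] * np.log(joint[mask])))

# Made-up predictions for m = 4 inputs, |Y| = 2 and |Y'| = 3 labels.
P = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
Q = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8]])
print(empirical_joint_entropy(P, Q))
```

Because each row of P and Q sums to one, the averaged outer product is a valid joint distribution, and uniform predictions recover the maximal entropy log(|Y| |Y'|).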
Therefore, taking derivatives of the joint entropy with respect to p(y|x_i) and evaluating them at the parameters of iteration t - 1, denoted \theta_{t-1} and \theta'_{t-1}, yields

g_y(x_i) := -\partial_{p(y|x_i)} H(y, y'|X)   (8)

= \frac{1}{m} \sum_{y'} \left[ 1 + \log \frac{1}{m} \sum_{j=1}^{m} p(y|x_j, \theta_{t-1}) p(y'|x_j, \theta'_{t-1}) \right] p(y'|x_i, \theta'_{t-1}).   (9)

Define similarly g_y(x'_i), g_{y'}(x_i), and g_{y'}(x'_i) for the derivatives with respect to p(y|x'_i), p(y'|x_i) and p(y'|x'_i), respectively. This leads, by optimizing the lower bound in (6), to the following decoupled optimization problem in p(y|x_i), and an analogous problem in p(y'|x'_i):

\min_{p(y|x)} \ \sum_{i=1}^{m} \left[ -H(y|x_i) + \lambda \sum_{y} g_y(x_i) p(y|x_i) \right] + \sum_{i=1}^{m'} \left[ -H(y|x'_i) + \lambda' \sum_{y} g_y(x'_i) p(y|x'_i) \right]   (10a)

\text{subject to} \quad \left\| E_{y \sim p(y|X)}[\phi(X, y)] - \mu \right\| \le \epsilon.   (10b)

The above objective function is still in the form of maximum entropy estimation, with the linearization of the joint entropy quantities acting like additional evidence terms. Furthermore, we also impose an additional maximum entropy requirement on the 'off-set' observations p(y|x'_i), since after all we also want the 'simplicity' requirement on the distribution p at the inputs x'_i. We can of course weigh the requirement on the 'off-set' observations differently. While we have succeeded in reducing the non-concave objective function in (5) to a decoupled concave objective function in (10), it may be desirable to solve the problem in the dual space due to the difficulty of handling the constraint in (10b). The following lemma shows the dual of the objective function in (10).
The proof is given in the supplementary material.

Lemma 2 The corresponding Fenchel dual of (10) is

\min_{\theta} \ -\frac{1}{m} \sum_{i=1}^{m} \langle \theta, \phi(x_i, y_i) \rangle + \sum_{i=1}^{m} \log \sum_{y} \exp(\langle \theta, \phi(x_i, y) \rangle - \lambda g_y(x_i)) + \sum_{i=1}^{m'} \log \sum_{y} \exp(\langle \theta, \phi(x'_i, y) \rangle - \lambda' g_y(x'_i)) + \epsilon \|\theta\|_{\ell_2}.   (11)

The above dual problem still has the form of logistic regression, with the additional evidence terms from task relatedness appearing in the log-partition function. Several existing convex solvers can be used to solve the optimization problem in (11) efficiently. Refer to Algorithm 1 for pseudocode of our proposed method.

Initialization For each iteration of CCCP, the linearization of the joint entropy function requires the values of \theta and \theta' at the previous iteration (refer to (9)). At the beginning of the iterations, we start the algorithm with a uniform prior, i.e. we set p(y) = 1/|\mathcal{Y}| and p(y') = 1/|\mathcal{Y}'|.

Algorithm 1 Multitask Mutual Information
Input: Datasets (X, Y) and (X', Y') with \mathcal{Y} \ne \mathcal{Y}', number of iterations N
Output: \theta, \theta'
Initialize p(y) = 1/|\mathcal{Y}| and p(y') = 1/|\mathcal{Y}'|
for t = 1 to N do
    Solve the dual problem in (11) w.r.t. p(y|x, \theta) and obtain \theta_t
    Solve the dual problem in (11) w.r.t. p(y'|x', \theta') and obtain \theta'_t
end for
return \theta <- \theta_N, \theta' <- \theta'_N

5 Experiments
To assess the performance of our proposed multitask algorithm, we perform binary n-task (n in {3, 5, 7, 10}) experiments on the MNIST digits dataset and a multiclass 2-task experiment on the Reuters1-v2 dataset, plus an application to integrating the Yahoo! and DMOZ web directories. We describe these experiments in turn in the following sections.

5.1 MNIST
Datasets The MNIST data set3 consists of 28 x 28-pixel images of hand-written digits from 0 through 9. We use a small sample of the available training set to simulate the situation in which we only have a limited number of labeled examples, and test the performance on the entire available test set. In this experiment, we look at binary n-task (n in {3, 5, 7, 10}) problems. We consider digits {8, 9, 0}, {6, 7, 8, 9, 0}, {4, 5, 6, 7, 8, 9, 0} and {1, 2, 3, 4, 5, 6, 7, 8, 9, 0} for the 3-task, 5-task, 7-task and 10-task problems, respectively. To simulate distinct label dictionaries for each task, we consider the following setting: in the 3-task problem, the first task has binary labels {+1, -1}, where label +1 means digit 8 and label -1 means digits 9 and 0; in the second task, label +1 means digit 9 and label -1 means digits 8 and 0; lastly, in the third task, label +1 means digit 0 and label -1 means digits 8 and 9. A similar one-against-rest grouping is used for the 5-task, 7-task and 10-task problems. Each of the tasks has its own input x.

Algorithms We could not find in the multitask learning literature methods addressing the same problem as the one we study: learning multiple tasks when there is no correspondence between the output spaces.
Therefore we compared the performance of our multitask method against the baseline given by the maximum entropy estimator applied to each of the tasks independently. Note that we focus on the setting in which data sources have disjoint sets of covariate observations (vide Section 3), and thus a simple strategy of multilabel prediction with the union of the label sets corresponds to our baseline. For both our method and the baseline, we use a Gaussian kernel to define the implicit feature map on the inputs. The width of the kernel was set to the median distance between pairs of observations, as suggested in [17]. The regularization parameter was tuned for the single task estimator and the same value was used for the multitask estimator. The weight on the joint entropy term was set equal to 1.

Pairwise Label Correlation Section 3 describes the multitask objective function for the 2-task case. When the number of tasks to be learned jointly is greater than 2, we experiment with two approaches: in the first, we define the joint entropy term on the full joint distribution, that is, when we want to learn 3 different tasks with labels y, y' and y'' jointly, we define the joint entropy as H(y, y', y'') = -\sum_{y, y', y''} p(y, y', y'') \log p(y, y', y''). As a more computationally efficient alternative, we can consider the joint entropy on the pairwise distributions instead. We found that the performance of our method is quite similar in the two cases, and we report results only for the pairwise case.

Results The experiments are repeated 10 times and the results are summarized in Table 1. We find that, on average, jointly learning the multiple related tasks always improves the classification accuracy. When assessing the performance on each of the tasks, we notice that the advantage of learning jointly is particularly significant for those tasks with a smaller number of observations.

3 http://yann.lecun.com/exdb/mnist

Table 1: Performance assessment, Accuracy ± STD. m (m') denotes the number of training data points (number of test points). STL: single task learning; MTL: multi task learning; Upper Bound: multi class learning. Boldface indicates a significant difference between STL and MTL (one-sided paired Welch t-test with 99.95% confidence level).

Tasks   | m (m')       | STL          | MTL          | Upper Bound
8 \-8   | 15 (2963)    | 77.39±5.23   | 80.03±4.83   | 93.42±0.87
9 \-9   | 15 (2963)    | 91.12±5.94   | 91.96±5.42   | 95.99±0.75
0 \-0   | 120 (2963)   | 98.66±0.67   | 98.21±0.92   | 98.79±0.25
Average |              | 89.06        | 90.07        | 96.07
6 \-6   | 25 (4949)    | 81.79±10.18  | 83.86±9.51   | 96.37±1.06
7 \-7   | 25 (4949)    | 70.73±16.58  | 72.84±15.77  | 91.99±2.23
8 \-8   | 25 (4949)    | 62.52±10.15  | 66.77±9.43   | 92.05±1.76
9 \-9   | 25 (4949)    | 63.80±13.70  | 67.26±12.65  | 92.53±1.65
0 \-0   | 150 (4949)   | 97.35±1.33   | 96.60±1.64   | 97.59±0.62
Average |              | 75.84        | 77.47        | 94.10
4 \-4   | 70 (6823)    | 71.69±6.83   | 73.49±6.77   | 91.20±1.55
5 \-5   | 70 (6823)    | 67.55±4.70   | 70.10±4.61   | 89.30±0.34
6 \-6   | 70 (6823)    | 86.31±2.93   | 87.21±2.77   | 94.03±0.95
7 \-7   | 70 (6823)    | 83.34±3.54   | 84.02±3.69   | 91.94±0.90
8 \-8   | 70 (6823)    | 75.61±6.00   | 76.97±5.12   | 87.46±1.69
9 \-9   | 70 (6823)    | 63.69±11.42  | 65.74±10.15  | 86.89±1.79
0 \-0   | 210 (6823)   | 97.20±1.49   | 96.56±1.67   | 97.24±0.73
Average |              | 77.91        | 79.16        | 91.15
1 \-1   | 100 (10000)  | 96.59±2.11   | 96.80±1.91   | 96.89±0.59
2 \-2   | 100 (10000)  | 67.77±3.49   | 69.95±2.68   | 88.74±1.94
3 \-3   | 100 (10000)  | 72.59±5.90   | 74.18±5.54   | 87.59±2.95
4 \-4   | 100 (10000)  | 69.91±5.82   | 71.76±5.47   | 92.87±0.94
5 \-5   | 100 (10000)  | 53.78±2.78   | 57.26±2.72   | 85.71±1.38
6 \-6   | 100 (10000)  | 79.22±5.21   | 80.54±4.53   | 92.93±0.98
7 \-7   | 100 (10000)  | 76.57±10.2   | 77.18±9.43   | 89.83±1.24
8 \-8   | 100 (10000)  | 63.57±2.65   | 65.85±2.50   | 83.51±0.63
9 \-9   | 100 (10000)  | 63.28±6.69   | 65.38±6.09   | 84.94±1.45
0 \-0   | 300 (10000)  | 98.43±0.84   | 97.81±1.01   | 98.49±0.40
Average |              | 74.17        | 75.67        | 90.82

5.2 Ontology

News Ontologies In this experiment, we consider multiclass learning in a 2-task problem. We use the Reuters1-v2 news article dataset [18], which has been pre-processed4. In the pre-processing stage, the label hierarchy is reorganized by mapping the data set to the second level of the topic hierarchy. Documents that only have labels of the third or fourth level are mapped to their parent category at the second level. Documents that only have labels of the first level are not mapped onto any category. Lastly, any multi-labelled instances are removed. The second level of the hierarchy consists of 53 categories, and we perform experiments on the top 10 categories. TF-IDF features are used, and the dictionary size (feature dimension) is 47236. For this experiment, we use 12500 news articles to form one set of data and another 12500 news articles to form the second set. In the first set, we group the news articles having labels {1, 2}, {3, 4}, {5, 6}, {7, 8} and {9, 10} and re-label them as {1, 2, 3, 4, 5}. The second set of data also has 5 labels, but this time the labels are generated by the {1, 6}, {2, 7}, {3, 8}, {4, 9} and {5, 10} grouping.

4 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html

Table 2: Yahoo! Top Level Categorization Results. STL: single task learning accuracy; MTL: multi task learning accuracy; % Imp.: relative performance improvement. The highest relative improvement at Yahoo! is for the topic 'Computer & Internet', i.e. an increase in accuracy from 48.12% to 52.57%. Interestingly, DMOZ has a similar topic, called 'Computers', which achieves an accuracy of 75.72%.

Topic               | MTL/STL      | % Imp.
Arts                | 56.27/55.11  | (2.10)
Business & Economy  | 66.52/66.88  | (-0.53)
Computer & Internet | 52.57/48.12  | (9.25)
Education           | 62.48/63.02  | (-0.85)
Entertainment       | 63.30/61.37  | (3.14)
Government          | 24.44/22.88  | (6.82)
Health              | 85.42/85.27  | (1.76)
News & Media        | 15.23/14.83  | (1.03)
Recreation          | 68.81/67.00  | (2.70)
Reference           | 26.65/24.81  | (7.42)
Regional            | 62.85/61.86  | (1.60)
Science             | 78.58/79.75  | (-1.46)
Social Science      | 31.55/30.68  | (2.84)
Society & Culture   | 49.51/49.05  | (0.94)

Table 3: DMOZ Top Level Categorization Results. STL: single task learning accuracy; MTL: multi task learning accuracy; % Imp.: relative performance improvement. The improvement of multitask over single task learning on each topic is negligible for the DMOZ web directories. Arguably, this can be partly explained by DMOZ having a higher average topic categorization accuracy than Yahoo!, so there may be more knowledge to be shared from DMOZ to Yahoo! than vice versa.

Topic      | MTL/STL      | % Imp.
Arts       | 57.52/57.84  | (-0.5)
Business   | 54.02/53.05  | (1.83)
Computers  | 75.08/75.72  | (-0.8)
Games      | 78.58/78.58  | (0)
Health     | 82.34/82.55  | (-0.14)
Home       | 67.47/67.47  | (0)
News       | 61.70/62.01  | (-0.49)
Recreation | 58.04/58.25  | (-0.36)
Reference  | 67.42/67.42  | (0)
Regional   | 28.59/28.56  | (0.10)
Science    | 42.67/42.09  | (1.38)
Shopping   | 75.20/74.62  | (0.54)
Society    | 57.68/58.20  | (-0.89)
Sports     | 83.49/83.53  | (-0.05)
World      | 87.80/87.57  | (0.26)
We split the news articles in each set equally to form training and test sets. We run a maximum entropy estimator independently on the two sets, p(y|x, \theta) and p(y'|x', \theta'), achieving an accuracy of 92.59% on the first set and 91.53% on the second. We then learn the two sets of news articles jointly: on the first test set we achieve an accuracy of 93.81%, and on the second an accuracy of 93.31%. This experiment further emphasizes that it is possible to learn several related tasks simultaneously even though they have different label sets, and that it is beneficial to do so.

Web Ontologies We also perform an experiment on the data integration of the Yahoo! and DMOZ web directories. We consider the top level of the Yahoo! topic tree and sample web links listed in the directory. Similarly, we consider the top level of the DMOZ topic tree and retrieve sampled web links. We take the content of the first page of each web link as our input data. It is possible that the first page linked from the web directory contains mostly images (for the purpose of attracting visitors), thus we only consider those webpages that have enough text to be valid inputs. This gives us 19186 webpages for Yahoo! and 35270 for DMOZ. For the sake of getting enough text associated with each link, we could crawl many more pages associated with the link. However, we find that doing so is quite damaging, because as we crawl deeper the topic of the text changes rapidly. We use the standard bag-of-words representation with TF-IDF weighting as our features. The dictionary size (feature dimension) is 27075. We then use 2000 web pages from Yahoo! and 2000 pages from DMOZ as training sets and the remainder as test sets.
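The bag-of-words TF-IDF featurization used in these experiments can be sketched in a few lines of plain Python (a minimal illustration with a made-up toy corpus; the actual experiments use a 27075-word dictionary, and a standard text-processing library would normally be used instead):

```python
import math
from collections import Counter

def tfidf(corpus):
    """Bag-of-words TF-IDF: weight(t, d) = tf(t, d) * (log(N / df(t)) + 1),
    where df(t) is the number of documents containing term t."""
    docs = [doc.lower().split() for doc in corpus]
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}  # +1 keeps ubiquitous terms
    features = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        features.append([tf[t] * idf[t] for t in vocab])
    return vocab, features

pages = [
    "computers internet programming software",
    "health medicine fitness nutrition",
    "internet software web browsers",
]
vocab, X = tfidf(pages)
print(len(vocab), len(X))  # prints: 10 3
```

Terms concentrated in few pages receive higher weights, which is what makes TF-IDF features informative for topic categorization.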
Tables 2 and 3 summarize the experimental results.

From the experimental results on web directory integration, we observe the following:

- Similarly to the experiments on the MNIST digits and Reuters1-v2 news articles, multitask learning always helps on average, i.e., the average relative improvements are positive for both the Yahoo! and DMOZ web directories;
- The improvement of multitask over single task learning on each topic is more prominent for the Yahoo! web directories and is negligible for the DMOZ web directories (2.62% and 0.07% average relative improvement, respectively). Arguably, this can be partly explained by Yahoo! having a lower average topic categorization accuracy than DMOZ (60.22% versus 64.68%, respectively). It seems that there is much more knowledge to be shared from DMOZ to Yahoo!, increasing the latter's classification accuracies;
- Looking closely at the accuracy for each topic, the highest relative improvement at Yahoo! is for the topic 'Computer & Internet': an increase in accuracy from 48.12% to 52.57%. Interestingly, DMOZ has a similar topic, called 'Computers', which achieves an accuracy of 75.72%. The improvement might be partly because our proposed method is able to discover the implicit label correlations despite the two topics being named differently;
- Regarding the worst classified categories, we have 'News & Media' for Yahoo! and 'Regional' for DMOZ. This is intuitive, since those two topics can indeed cover a wide range of subjects. The easiest categories to classify are 'Health' for Yahoo! and 'World' for DMOZ.
This too is quite intuitive: the 'Health' category contains mostly specific jargon, and the 'World' category contains much language-specific webpage content.

6 Discussion and Conclusion

We presented a method to learn classifiers from a collection of related tasks or data sets, in which each task has its own label set. Our method works without the need for an explicit mapping between the label spaces of the different tasks. We formulate the problem as one of maximizing the mutual information among the label sets. Our experiments on binary n-task (n in {3, 5, 7, 10}) and multiclass 2-task problems revealed that, on average, jointly learning the multiple related tasks, albeit with different label sets, always improves the classification accuracy. We also provided experiments on a prototypical application of our method: classification in the Yahoo! and DMOZ web directories. Here we deliberately used small amounts of data, a common situation in commercial tagging and classification. These experiments show that the classification accuracy of Yahoo! significantly increased. Given that DMOZ classification was already 4.5% better prior to the application of our method, this indicates that the method was able to transfer classification accuracy from the DMOZ task to the Yahoo! task. Furthermore, the experiments seem to suggest that our proposed method is able to discover implicit label correlations despite the lack of label correspondences.

Although the experiments on web directory integration are encouraging, we have clearly only touched the surface of the possibilities to be explored. While we focused on categorization at the top level of the topic tree, it might be beneficial (and further highlight the usefulness of multitask learning, as observed in [2-4, 9]) to consider categorization at deeper levels (for example, the second level of the tree), where we have much fewer observations for each category.
In the extreme case, we might consider the labels as corresponding to a directed acyclic graph (DAG) and encode the feature map associated with the label hierarchy accordingly. One instance, considered in [19], is to use a feature map φ(y) ∈ R^k for the k nodes in the DAG (excluding the root node) and to associate with every label y the vector describing the path from the root node to y, ignoring the root node itself.

Furthermore, the application of data integration to objects which admit a hierarchical categorization goes beyond web-related objects. With our method, it is now also possible to learn classifiers from a collection of related gene-ontology graphs [20] or patent hierarchies [19].

Acknowledgments NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. N. Quadrianto is partly supported by a Microsoft Research Asia Fellowship.

References

[1] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.

[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[3] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning Gaussian processes from multiple tasks. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 1012–1019, New York, NY, USA, 2005. ACM.

[4] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.

[5] Y. Altun and A. J. Smola. Unifying divergence minimization and statistical inference via convex duality. In H. U. Simon and G. Lugosi, editors, Proc. Annual Conf. Computational Learning Theory, LNCS, pages 139–153. Springer, 2006.

[6] T. Pham Dinh and L. Hoai An. A D.C.
optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476–505, 1998.

[7] G. Obozinski, B. Taskar, and M. I. Jordan. Multi-task feature selection. Technical report, U.C. Berkeley, 2007.

[8] Rémi Flamary, Alain Rakotomamonjy, Gilles Gasso, and Stéphane Canu. SVM multi-task learning and non-convex sparsity measure. In The Learning Workshop, 2009.

[9] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.

[10] K. Crammer, M. Kearns, and J. Wortman. Learning from multiple sources. In NIPS 19, pages 321–328. MIT Press, 2007.

[11] Shai Ben-David, Johannes Gehrke, and Reba Schuller. A theoretical framework for learning from a pool of disparate data sources. In KDD '02: Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining, pages 443–449. ACM, 2002.

[12] M. Dudík and R. E. Schapire. Maximum entropy distribution estimation with generalized regularization. In Gábor Lugosi and Hans U. Simon, editors, Proc. Annual Conf. Computational Learning Theory. Springer Verlag, June 2006.

[13] Nadia Ghamrawi and Andrew McCallum. Collective multi-label classification. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 195–200, New York, NY, USA, 2005. ACM.

[14] A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915–936, 2003.

[15] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann. Kernel methods for missing variables. In R. G. Cowell and Z. Ghahramani, editors, Proceedings of the International Workshop on Artificial Intelligence and Statistics, pages 325–332, 2005.

[16] Bharath Sriperumbudur and Gert Lanckriet. On the convergence of the concave-convex procedure. In Y. Bengio, D. Schuurmans, J.
Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1759–1767. MIT Press, 2009.

[17] B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, Munich, 1997. Download: http://www.kernel-machines.org.

[18] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[19] Lijuan Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, pages 78–87, New York, NY, USA, 2004. ACM Press.

[20] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25:25–29, 2000.

[21] J. M. Borwein and Q. J. Zhu. Techniques of Variational Analysis. CMS Books in Mathematics. Canadian Mathematical Society, 2005.