{"title": "Semi-Supervised Multitask Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 937, "page_last": 944, "abstract": null, "full_text": "Semi-Supervised Multitask Learning\n\nQiuhua Liu, Xuejun Liao, and Lawrence Carin\nDepartment of Electrical and Computer Engineering\n\nDuke University\n\nDurham, NC 27708-0291, USA\n\nAbstract\n\nA semi-supervised multitask learning (MTL) framework is presented, in which\nM parameterized semi-supervised classi\ufb01ers, each associated with one of M par-\ntially labeled data manifolds, are learned jointly under the constraint of a soft-\nsharing prior imposed over the parameters of the classi\ufb01ers. The unlabeled data\nare utilized by basing classi\ufb01er learning on neighborhoods, induced by a Markov\nrandom walk over a graph representation of each manifold. Experimental results\non real data sets demonstrate that semi-supervised MTL yields signi\ufb01cant im-\nprovements in generalization performance over either semi-supervised single-task\nlearning (STL) or supervised MTL.\n\n1 Introduction\n\nSupervised learning has proven an effective technique for learning a classi\ufb01er when the quantity of\nlabeled data is large enough to represent a suf\ufb01cient sample from the true labeling function. Un-\nfortunately, a generous provision of labeled data is often not available since acquiring the label of\na datum is expensive in many applications. A classi\ufb01er supervised by a limited amount of labeled\ndata is known to generalize poorly even if it produces zero training errors. There has been much\nrecent work on improving the generalization of classi\ufb01ers based on using information sources be-\nyond the labeled data. These studies fall into two major categories: (i) semi-supervised learning\n[9, 12, 15, 10] and (ii) multitask learning (MTL) [3, 1, 13]. 
The former exploits the manifold information provided by the usually abundant unlabeled data, while the latter leverages information from related tasks.\nIn this paper we attempt to integrate the benefits offered by semi-supervised learning and MTL, by proposing semi-supervised multitask learning. The semi-supervised MTL framework consists of M semi-supervised classifiers coupled by a joint prior distribution over the parameters of all classifiers. Each classifier provides the solution for a partially labeled data classification task. The solutions for the M tasks are obtained simultaneously under the unified framework.\nExisting semi-supervised algorithms are often not directly amenable to MTL extensions. Transductive algorithms directly operate on labels. Since the label is a local property of the associated data point, information sharing must be performed at the level of data locations, instead of at the task level. The inductive algorithm in [10] employs a data-dependent prior to encode manifold information. Since the information transferred from related tasks is also often represented by a prior, the two priors will compete and need to be balanced; moreover, this precludes a Dirichlet process [6] or its variants from representing the sharing prior across tasks, because the base distribution of a Dirichlet process cannot be dependent on any particular manifold.\nWe develop a new semi-supervised formulation, which enjoys several nice properties that make the formulation immediately amenable to an MTL extension. First, the formulation has a parametric classifier built for each task, thus multitask learning can be performed efficiently at the task level, using the parameters of the classifiers.
Second, the formulation encodes the manifold information of each task inside the associated likelihood function, sparing the prior for exclusive use by the information from related tasks. Third, the formulation lends itself to a Dirichlet process, allowing the tasks to share information in a complex manner.\nThe new semi-supervised formulation is used as a key component of our semi-supervised MTL framework. In the MTL setting, we have M partially labeled data manifolds, each defining a classification task and involving design of a semi-supervised classifier. The M classifiers are designed simultaneously within a unified sharing structure. The key component of the sharing structure is a soft variant of the Dirichlet process (DP), which implements a soft-sharing prior over the parameters of all classifiers. The soft-DP retains the clustering property of the DP and yet does not require exact sharing of parameters, which increases flexibility and promotes robustness in information sharing.\n\n2 Parameterized Neighborhood-Based Classification\n\nThe new semi-supervised formulation, termed parameterized neighborhood-based classification (PNBC), represents the class probability of a data point by mixing over all data points in the neighborhood, which is formed via a Markov random walk over a graph representation of the manifold.\n\n2.1 Neighborhoods Induced by Markov Random Walk\n\nLet G = (X, W) be a weighted graph such that X = {x_1, x_2, ..., x_n} is a set of vertices that coincide with the data points in a finite data manifold, and W = [w_ij]_{n×n} is the affinity matrix with the (i, j)-th element w_ij indicating the immediate affinity between data points x_i and x_j. We follow [12, 15] to define w_ij = exp(−0.5 ‖x_i − x_j‖² / σ_i²), where ‖·‖ is the Euclidean norm and σ_i > 0.\nA Markov random walk on the graph G = (X, W) is characterized by a matrix of one-step transition probabilities A = [a_ij]_{n×n}, where a_ij is the probability of transiting from x_i to x_j via a single step and is given by a_ij = w_ij / Σ_{k=1}^n w_ik [4]. Let B = [b_ij]_{n×n} = A^t. Then the (i, j)-th element b_ij represents the probability of transiting from x_i to x_j in t steps.\nData point x_j is said to be a t-step neighbor of x_i if b_ij > 0. The t-step neighborhood of x_i, denoted as N_t(x_i), is defined by all t-step neighbors of x_i along with the associated t-step transition probabilities, i.e., N_t(x_i) = {(x_j, b_ij) : b_ij > 0, x_j ∈ X}. The appropriateness of a t-step neighborhood depends on the right choice of t. A rule for choosing t is given in [12], based on maximizing the margin of the associated classifier on both labeled and unlabeled data points.\nThe σ_i in specifying w_ij represents the step-size (distance traversed in a single step) for x_i to reach its immediate neighbors, and we have used a distinct σ for each data point. Location-dependent step-sizes allow one to account for possible heterogeneities in the data manifold: at locations with dense data distributions a small step-size is suitable, while at locations with sparse data distributions a large step-size is appropriate. A simple choice of heterogeneous σ is to let σ_i be related to the distance between x_i and close-by data points, where closeness is measured by Euclidean distance. Such a choice ensures each data point is immediately connected to some neighbors.\n\n2.2 Formulation of the PNBC Classifier\n\nLet p*(y_i|x_i, θ) be a base classifier parameterized by θ, which gives the probability of class label y_i of data point x_i, given x_i alone (which is a zero-step neighborhood of x_i).
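As a concrete illustration of the neighborhood construction in Section 2.1, the sketch below builds the affinity matrix W with per-point step-sizes, row-normalizes it into the one-step transition matrix A, and raises A to the t-th power to obtain B = A^t. The nearest-neighbor-distance-over-3 choice of σ_i mirrors the heuristic used later in the experiments; the function name and interface are our own.

```python
import numpy as np

def t_step_neighborhoods(X, t, scale=3.0):
    # X: (n, d) array of data points; t: number of random-walk steps.
    # sigma_i = distance from x_i to its nearest neighbor, divided by `scale`
    # (assumed heuristic, matching sigma_i = min_j ||x_i - x_j|| / 3).
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    sigma = np.min(np.where(np.eye(n, dtype=bool), np.inf, D), axis=1) / scale
    W = np.exp(-0.5 * D**2 / sigma[:, None]**2)   # w_ij = exp(-0.5 ||xi-xj||^2 / sigma_i^2)
    A = W / W.sum(axis=1, keepdims=True)          # one-step transition probabilities a_ij
    B = np.linalg.matrix_power(A, t)              # t-step transition probabilities b_ij
    return B
```

Because A is row-stochastic, every power B = A^t is again row-stochastic, so the b_ij over each neighborhood form valid mixing proportions.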
The base classifier can be implemented by any parameterized probabilistic classifier. For binary classification with y ∈ {−1, 1}, the base classifier can be chosen as logistic regression with parameters θ, which expresses the conditional class probability as\n\np*(y_i|x_i, θ) = [1 + exp(−y_i θ^T x_i)]^−1    (1)\n\nwhere a constant element 1 is assumed to be prefixed to each x (the prefixed x is still denoted as x for notational simplicity), and thus the first element in θ is a bias term.\nLet p(y_i|N_t(x_i), θ) denote a neighborhood-based classifier parameterized by θ, representing the probability of class label y_i for x_i, given the neighborhood of x_i. The PNBC classifier is defined as a mixture\n\np(y_i|N_t(x_i), θ) = Σ_{j=1}^n b_ij p*(y_i|x_j, θ)    (2)\n\nwhere the j-th component is the base classifier applied to (x_j, y_i) and the associated mixing proportion is defined by the probability of transiting from x_i to x_j in t steps. Since the magnitude of b_ij automatically determines the contribution of x_j to the mixture, we let the index j run over the entire X for notational simplicity.\nThe utility of unlabeled data in (2) is conspicuous: in order for x_i to be labeled y_i, each neighbor x_j must be labeled consistently with y_i, with the strength of consistency proportional to b_ij; in such a manner, y_i implicitly propagates over the neighborhood of x_i. By taking neighborhoods into account, it is possible to obtain an accurate estimate of θ based on a small amount of labeled data. The over-fitting problem associated with limited labeled data is ameliorated in the PNBC formulation, through enforcing consistent labeling over each neighborhood.\nLet L ⊆ {1, 2, ..., n} denote the index set of labeled data in X.
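A minimal sketch of the logistic base classifier (1) and the PNBC mixture (2), assuming B is the t-step transition matrix and X1 is the data array with a constant 1 prefixed to each point (these names are ours, for illustration only):

```python
import numpy as np

def base_prob(y, x, theta):
    # Logistic base classifier p*(y|x, theta), eq. (1); x includes a leading 1.
    return 1.0 / (1.0 + np.exp(-y * (theta @ x)))

def pnbc_prob(y, i, X1, B, theta):
    # PNBC class probability p(y | N_t(x_i), theta), eq. (2): a mixture of the
    # base classifier over the t-step neighborhood of x_i, weighted by b_ij.
    return sum(B[i, j] * base_prob(y, X1[j], theta) for j in range(X1.shape[0]))
```

Since each base-classifier component assigns complementary probabilities to y = +1 and y = −1, and the b_ij sum to one, the mixture remains a proper probability.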
Assuming the labels are conditionally independent, we write the neighborhood-conditioned likelihood function\n\np({y_i, i ∈ L}|{N_t(x_i) : i ∈ L}, θ) = Π_{i∈L} p(y_i|N_t(x_i), θ) = Π_{i∈L} Σ_{j=1}^n b_ij p*(y_i|x_j, θ)    (3)\n\n3 The Semi-Supervised MTL Framework\n\n3.1 The sharing prior\n\nSuppose we are given M tasks, defined by M partially labeled data sets\n\nD_m = {x^m_i : i = 1, 2, ..., n_m} ∪ {y^m_i : i ∈ L_m}\n\nfor m = 1, ..., M, where y^m_i is the class label of x^m_i and L_m ⊂ {1, 2, ..., n_m} is the index set of labeled data in task m. We consider M PNBC classifiers, parameterized by θ_m, m = 1, ..., M, with θ_m responsible for task m. The M classifiers are not independent but coupled by a prior joint distribution over their parameters\n\np(θ_1, ..., θ_M) = Π_{m=1}^M p(θ_m|θ_1, ..., θ_{m−1})    (4)\n\nwith the conditional distributions in the product defined by\n\np(θ_m|θ_1, ..., θ_{m−1}) = (1/(α + m − 1)) [α p(θ_m|Υ) + Σ_{l=1}^{m−1} N(θ_m; θ_l, η²I)]    (5)\n\nwhere α > 0, p(θ_m|Υ) is a base distribution parameterized by Υ, and N(·; θ_l, η²I) is a normal distribution with mean θ_l and covariance matrix η²I. As discussed below, the prior in (4) is linked to Dirichlet processes and thus is more general than a parametric prior, as used, for example, in [5].\nEach normal distribution represents the prior transferred from a previous task; it is the meta-knowledge indicating how the present task should be learned, based on the experience with a previous task.
It is through these normal distributions that information sharing between tasks is enforced. Taking into account the data likelihood, unrelated tasks cannot share, since they have dissimilar solutions and forcing them to share the same solution would decrease their respective likelihoods; whereas related tasks have close solutions, and sharing information helps them to find their solutions and improve their data likelihoods.\nThe base distribution represents the baseline prior, which is exclusively used when there are no previous tasks available, as is seen from (5) by setting m = 1. When there are m − 1 previous tasks, one uses the baseline prior with probability α/(α + m − 1), and uses the prior transferred from each of the m − 1 previous tasks with probability 1/(α + m − 1). The α balances the baseline prior and the priors imposed by previous tasks. The role of the baseline prior decreases as m increases, which is in agreement with our intuition, since the information from previous tasks increases with m.\nThe formulation in (5) is suggestive of the Pólya urn representation of a Dirichlet process (DP) [2]. The difference here is that we have used a normal distribution to replace the Dirac delta in Dirichlet processes. Since N(θ_m; θ_l, η²I) approaches the Dirac delta δ(θ_m − θ_l) as η² → 0, we recover the Dirichlet process in the limit case η² → 0.\nThe motivation behind the formulation in (5) is twofold. First, a normal distribution can be regarded as a soft version of the Dirac delta. While the Dirac delta requires two tasks to have exactly the same θ when sharing occurs, the soft delta only requires sharing tasks to have similar θ's. The soft sharing may therefore be more consistent with situations in practical applications.
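The generative reading of (5) is a Pólya-urn draw in which an exact copy of a previous task's parameters is replaced by a soft copy perturbed by N(0, η²I). A sketch, assuming a standard-normal base distribution as a placeholder (the paper's base distribution is parameterized by Υ):

```python
import numpy as np

def sample_soft_dp(M, d, alpha, eta, rng):
    # Draw theta_1, ..., theta_M from the soft-sharing prior of eqs. (4)-(5).
    # Base distribution assumed N(0, I) as a placeholder choice.
    thetas = []
    for m in range(M):  # m previous tasks exist at this point
        if rng.random() < alpha / (alpha + m):
            theta = rng.standard_normal(d)            # draw from the base distribution
        else:
            l = rng.integers(m)                       # pick a previous task uniformly
            theta = thetas[l] + eta * rng.standard_normal(d)  # soft copy N(theta_l, eta^2 I)
        thetas.append(theta)
    return np.array(thetas)
```

Setting eta = 0 recovers the exact-copy Pólya urn of the ordinary DP, so clusters of identical parameters emerge; eta > 0 softens the copies into clusters of similar parameters.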
Second, the normal distribution is analytically more appealing than the Dirac delta and allows simple maximum a posteriori (MAP) solutions. This is an attractive property considering that most classifiers do not have conjugate priors for their parameters and Bayesian learning cannot be performed exactly.\nUnder the sharing prior in (4), the current task is equally influenced by each previous task but is influenced unevenly by future tasks: a distant future task has less influence than a near future task. The ordering of the tasks imposed by (4) may in principle affect performance, although we have not found this to be an issue in the experimental results. Alternatively, one may obtain a sharing prior that does not depend on task ordering, by modifying (5) as\n\np(θ_m|θ_−m) = (1/(α + M − 1)) [α p(θ_m|Υ) + Σ_{l≠m} N(θ_m; θ_l, η²I)]    (6)\n\nwhere θ_−m denotes the set {θ_1, ..., θ_M} with θ_m removed. The prior joint distribution of {θ_1, ..., θ_M} associated with the full conditionals in (6) is not analytically available, and neither is the corresponding posterior joint distribution, which causes technical difficulties in performing MAP estimation.\n\n3.2 Maximum A Posteriori (MAP) Estimation\n\nAssuming that, given {θ_1, ..., θ_M}, the class labels of different tasks are conditionally independent, the joint likelihood function over all tasks can be written as\n\np({y^m_i, i ∈ L_m}^M_{m=1}|{N_t(x^m_i) : i ∈ L_m}^M_{m=1}, {θ_m}^M_{m=1}) = Π^M_{m=1} Π_{i∈L_m} Σ^{n_m}_{j=1} b^m_ij p*(y^m_i|x^m_j, θ_m)    (7)\n\nwhere the m-th term in the product is taken from (3), with the superscript m indicating the task index.
Note that the neighborhoods are built for each task independently of other tasks, thus a random walk is always restricted to the same task (the one where the starting data point belongs) and can never traverse multiple tasks. From (4), (5), and (7), one can write the logarithm of the joint posterior of {θ_1, ..., θ_M}, up to a constant translation that does not depend on {θ_1, ..., θ_M},\n\nl_MAP(θ_1, ..., θ_M) = ln p({θ_m}^M_{m=1}|{y^m_i, i ∈ L_m}^M_{m=1}, {N_t(x^m_i) : i ∈ L_m}^M_{m=1}) = Σ^M_{m=1} ln[α p(θ_m|Υ) + Σ^{m−1}_{l=1} N(θ_m; θ_l, η²I)] + Σ^M_{m=1} Σ_{i∈L_m} ln Σ^{n_m}_{j=1} b^m_ij p*(y^m_i|x^m_j, θ_m)    (8)\n\nWe seek the parameters {θ_1, ..., θ_M} that maximize the log-posterior, which is equivalent to simultaneously maximizing the prior in (4) and the likelihood function in (7). As seen from (5), the prior favors similar θ's across tasks (similar θ's increase the prior); however, sharing between unrelated tasks is discouraged, since each task requires a distinct θ to make its likelihood large. As a result, to make the prior and the likelihood large at the same time, one must let related tasks have similar θ's. Although any optimization technique can be applied to maximize the objective function (8), expectation maximization (EM) is particularly suitable, since the objective function involves summations under the logarithmic operation. To conserve space the algorithmic details are omitted here.\nUtilization of the manifold information and the information from related tasks has greatly reduced the hypothesis space.
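For concreteness, the objective (8) can be evaluated directly. The sketch below assumes each task is supplied as a tuple (B, X1, y, L) of t-step transition matrix, 1-prefixed data, labels in {−1, +1}, and labeled index set, with a N(0, υ²I) base prior; this interface and the parameter defaults are ours, not the paper's.

```python
import numpy as np

def gauss(theta, mu, var):
    # Density of N(mu, var * I) evaluated at theta.
    d = theta.size
    diff = theta - mu
    return np.exp(-0.5 * diff @ diff / var) / (2 * np.pi * var) ** (d / 2)

def map_objective(thetas, tasks, alpha=0.3, eta=1.0, ups=1.0):
    # Log-posterior (8), up to an additive constant.
    val = 0.0
    for m, (B, X1, y, L) in enumerate(tasks):
        # log soft-DP prior term: base prior plus soft deltas from previous tasks
        prior = alpha * gauss(thetas[m], np.zeros_like(thetas[m]), ups**2)
        prior += sum(gauss(thetas[m], thetas[l], eta**2) for l in range(m))
        val += np.log(prior)
        # log PNBC likelihood term over the labeled points of task m
        for i in L:
            p = sum(B[i, j] / (1 + np.exp(-y[i] * (thetas[m] @ X1[j])))
                    for j in range(X1.shape[0]))
            val += np.log(p)
    return val
```

A gradient-based or EM-style maximizer would then be run on this objective; the sketch only shows the evaluation.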
Therefore, point MAP estimation in semi-supervised MTL will not suffer as\nmuch from over\ufb01tting as in supervised STL. This argument will be supported by the experimental re-\nsults in Section 4.2, where semi-supervised MTL outperforms both supervised MTL and supervised\nSTL, although the former is based on MAP and the latter two are based on Bayesian learning.\nWith MAP estimation, one obtains the parameters of the base classi\ufb01er in (1) for each task, which\ncan be employed to predict the class label of any data point in the associated task, regardless of\nwhether the data point has been seen during training. In the special case when predictions are desired\nonly for the unlabeled data points seen during training (transductive learning), one can alternatively\nemploy the PNBC classi\ufb01er in (2) to perform the predictions.\n\n4 Experimental Results\n\nFirst we consider semi-supervised learning on a single task and establish the competitive perfor-\nmance of the PNBC in comparison with existing semi-supervised algorithms. Then we demonstrate\nthe performance improvements achieved by semi-supervised MTL, relative to semi-supervised STL\nand supervised MTL. Throughout this section, the base classi\ufb01er in (1) is logistic regression.\n\n\f4.1 Performance of the PNBC on a Single Task\n\nFigure 1: Transductive results of the PNBC. The horizontal axis is the size of XL.\n\nFigure 2: Inductive results of the PNBC on Ionosphere. The horizontal axis is the size of XU .\n\nThe PNBC is evaluated on three benchmark data sets \u2013 Pima Indians Diabetes Database (PIMA),\nWisconsin Diagnostic Breast Cancer (WDBC) data, and Johns Hopkins University Ionosphere\ndatabase (Ionosphere), which are taken from the UCI machine learning repository [11]. The evalu-\nation is performed in comparison to four existing semi-supervised learning algorithms, namely, the\ntransductive SVM [9], the algorithm of Szummer & Jaakkola [12], GRF [15], and Logistic GRF\n[10]. 
The performance is evaluated in terms of classification accuracy, defined as the ratio of the number of correctly classified data over the total number of data being tested.\nWe consider two testing modes: transductive and inductive. In the transductive mode, the test data are the unlabeled data that are used in training the semi-supervised algorithms; in the inductive mode, the test data are a set of holdout data unseen during training. We follow the same procedures as used in [10] to perform the experiments. Denote by X any of the three benchmark data sets and Y the associated set of class labels. In the transductive mode, we randomly sample X_L ⊂ X and assume the associated class labels Y_L are available; the semi-supervised algorithms are trained by X ∪ Y_L and tested on X ∖ X_L. In the inductive mode, we randomly sample two disjoint data subsets X_L ⊂ X and X_U ⊂ X, and assume the class labels Y_L associated with X_L are available; the semi-supervised algorithms are trained by X_L ∪ Y_L ∪ X_U and tested on 200 data points randomly sampled from X ∖ (X_L ∪ X_U).\nThe comparison results are summarized in Figures 1 and 2, where the results of the PNBC and the algorithm of Szummer & Jaakkola are calculated by us, and the results of the remaining algorithms are cited from [10]. The algorithm of Szummer & Jaakkola [12] and the PNBC use σ_i = min_{j≠i} ‖x_i − x_j‖/3 and t = 100; learning of the PNBC is based on MAP estimation. Each curve in the figures is a result averaged from T independent trials, with T = 20 for the transductive results and T = 50 for the inductive results. In the inductive case, the comparison is between the proposed algorithm and the Logistic GRF, as the others are transductive algorithms.\nFor the PNBC, we can either use the base classifier in (1) or the PNBC classifier in (2) to predict the labels of unlabeled data seen in training (the transductive mode).
In the inductive mode, however, the {b_ij} are not available for the test data (unseen in training) since they are not in the graph representation; therefore we can only employ the base classifier. In the legends of Figures 1 and 2, a suffix \u201cII\u201d to PNBC indicates that the PNBC classifier in (2) is employed in testing; when no suffix is attached, the base classifier is employed in testing.\nFigures 1 and 2 show that the PNBC outperforms all the competing algorithms in general, regardless of the number of labeled data points. The improvements are particularly significant on PIMA and Ionosphere. As indicated in Figure 1(c), employing manifold information in testing by using (2) can improve classification accuracy in the transductive learning case. The margin of improvement achieved by the PNBC in the inductive learning case is striking and encouraging: as indicated by the error bars in Figure 2, the PNBC significantly outperforms Logistic GRF in almost all individual trials.
Figure 2 also shows that the advantage of the PNBC becomes more conspicuous with a decreasing amount of labeled data considered during training.\n\n4.2 Performance of the Semi-Supervised MTL Algorithm\n\nWe compare the proposed semi-supervised MTL against: (a) semi-supervised single-task learning (STL), (b) supervised MTL, (c) supervised STL, and (d) supervised pooling; STL refers to designing M classifiers independently, each for the corresponding task, and pooling refers to designing a single classifier based on the data of all tasks. Since we have evaluated the PNBC in Section 4.1 and established its effectiveness, we will not repeat the evaluation here; we employ the PNBC as a representative semi-supervised learning algorithm in semi-supervised STL. To replicate the experiments in [13], we employ AUC as the performance measure, where AUC stands for the area under the receiver operating characteristic (ROC) curve [7].\nThe basic setup of the semi-supervised MTL algorithm is as follows. The tasks are ordered as they are when the data are provided to the experimenter (we have randomly permuted the tasks and found that the performance does not change much). A separate t-step neighborhood is employed to represent the manifold information (consisting of labeled and unlabeled data points) for each task, where the step-size at each data point is one third of the shortest distance to the remaining points and t is set to half the number of data points. The base prior is p(θ_m|Υ) = N(θ_m; 0, υ²I) and the soft delta is N(θ_m; θ_l, η²I), where υ = η = 1. The α balancing the base prior and the soft deltas is 0.3.
These settings represent the basic intuition of the experimenter; they have not been tuned in any way and therefore do not necessarily represent the best settings for the semi-supervised MTL algorithm.\n\nFigure 3: (a) Performance of the semi-supervised MTL algorithm on landmine detection, in comparison to the remaining five algorithms. (b) The Hinton diagram of between-task similarity when there are 140 labeled data in each task.\n\nLandmine Detection First we consider the remote sensing problem considered in [13], based on data collected from real landmines. In this problem, there are a total of 29 sets of data, collected from various landmine fields. Each data point is represented by a 9-dimensional feature vector extracted from radar images. The class label is binary (mine or false mine). The data are available at http://www.ee.duke.edu/~lcarin/LandmineData.zip.\nEach of the 29 data sets defines a task, in which we aim to find landmines with a minimum number of false alarms. To make the results comparable to those in [13], we follow the authors there and take data sets 1-10 and 16-24 to form 19 tasks. Of the 19 selected data sets, 1-10 are collected at foliated regions and 11-19 are collected at regions that are bare earth or desert. Therefore we expect two dominant clusters of tasks, corresponding to the two different types of ground surface conditions.\nTo replicate the experiments in [13], we perform 100 independent trials, in each of which we randomly select a subset of data for which labels are assumed available, train the semi-supervised MTL and semi-supervised STL classifiers, and test the classifiers on the remaining data.
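The AUC used as the performance measure throughout this section equals the probability that a randomly chosen positive is scored above a randomly chosen negative, and can be computed from pairwise comparisons; a small sketch:

```python
import numpy as np

def auc(labels, scores):
    # Area under the ROC curve via the Mann-Whitney U statistic:
    # fraction of (positive, negative) pairs ranked correctly, ties counting half.
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels != 1]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```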
The AUC averaged over the 19 tasks is presented in Figure 3(a), as a function of the number of labeled data, where each curve represents the mean calculated from the 100 independent trials and the error bars represent the corresponding standard deviations. The results of supervised STL, supervised MTL, and supervised pooling are cited from [13].\nSemi-supervised MTL clearly yields the best results up to 80 labeled data points; after that supervised MTL catches up, but semi-supervised MTL still outperforms the remaining three algorithms by significant margins. In this example semi-supervised MTL seems relatively insensitive to the amount of labeled data; this may be attributed to the doubly enhanced information provided by the data manifold plus the related tasks, which significantly augments the information available in the limited labeled data. The superiority of supervised pooling over supervised STL on this dataset suggests that there are significant benefits offered by sharing across the tasks, which partially explains why supervised MTL eventually catches up with semi-supervised MTL.\nWe plot in Figure 3(b) the Hinton diagram [8] of the between-task sharing matrix (an average over the 100 trials) found by the semi-supervised MTL when there are 140 labeled data in each task. The (m, l)-th element of the similarity matrix is equal to exp(−‖θ_m − θ_l‖²/2) (normalized such that the maximum element is one), which is represented by a square in the Hinton diagram, with a larger square indicating a larger value of the corresponding element. As seen from Figure 3(b), there is a dominant sharing among tasks 1-10 and another dominant sharing among tasks 11-19. Recall from the beginning of the section that data sets 1-10 are from foliated regions and data sets 11-19 are from regions that are bare earth or desert.
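The between-task similarity matrix just described can be computed as below; the factor of 2 inside the exponent is our assumption about the scaling, and the normalization divides by the maximum element:

```python
import numpy as np

def task_similarity(thetas):
    # thetas: (M, d) array of per-task parameters.
    # s_ml = exp(-||theta_m - theta_l||^2 / 2), rescaled so the largest
    # element is one (the /2 scaling is an assumption, not from the paper).
    diff = thetas[:, None, :] - thetas[None, :, :]
    S = np.exp(-0.5 * (diff ** 2).sum(axis=-1))
    return S / S.max()
```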
Therefore, the sharing is in agreement with the similarity between tasks.\n\nArt Images Retrieval We now consider the problem of art image retrieval [14, 13], in which we have a library of 642 art images and want to retrieve the images based on a user's preference. The preference of each user is available on a subset of images, therefore the objective is to learn the preference of each user based on a subset of training examples. Each image is represented by a vector of features and a user's rating is represented by a binary label (like or dislike). The users' preferences are collected in a web-based survey, which can be found at http://honolulu.dbs.informatik.uni-muenchen.de:8080/paintings/index.jsp.\nWe consider the same 69 users as considered in [13], who each rated more than 100 images. The preference prediction for each user is treated as a task, with the associated set of ground truth data defined by the images rated by the user. These 69 tasks are used in our experiment to evaluate the performance of semi-supervised MTL. Since two users may give different ratings to exactly the same image, pooling the tasks together can lead to multiple labels for the same data point. For this reason, we exclude supervised pooling and semi-supervised pooling in the performance comparison.\n\nFigure 4: Performance of the semi-supervised MTL algorithm on art image retrieval, in comparison to the remaining three algorithms.\n\nFollowing [13], we perform 50 independent trials, in each of which we randomly select a subset of images rated by each user, train the semi-supervised MTL and semi-supervised STL classifiers, and test the classifiers on the remaining images.
The AUC averaged over the 69 tasks is presented in Figure 4, as a function of the number of labeled data (rated images), where each curve represents the mean calculated from the 50 independent trials and the error bars represent the corresponding standard deviations. The results of supervised STL and supervised MTL are cited from [13].\nSemi-supervised MTL performs very well, improving upon the results of the three other algorithms by significant margins in almost all individual trials (as seen from the error bars). It is noteworthy that the performance improvement achieved by semi-supervised MTL over semi-supervised STL is larger than the corresponding improvement achieved by supervised MTL over supervised STL. The greater improvement demonstrates that unlabeled data can be more valuable when used along with multitask learning. The additional utility of unlabeled data can be attributed to its role in helping to find the appropriate sharing between tasks.\n\n5 Conclusions\n\nA framework has been proposed for performing semi-supervised multitask learning (MTL). Recognizing that existing semi-supervised algorithms are not conveniently extended to an MTL setting, we have introduced a new semi-supervised formulation to allow a direct MTL extension. We have proposed a soft sharing prior, which allows each task to robustly borrow information from related tasks and is amenable to simple point estimation based on maximum a posteriori. Experimental results have demonstrated the superiority of the new semi-supervised formulation as well as the additional performance improvement offered by semi-supervised MTL.
The superior performance of semi-supervised MTL on art image retrieval and landmine detection shows that manifold information and the information from related tasks could play positive and complementary roles in real applications, suggesting that significant benefits can be offered in practice by semi-supervised MTL.\n\nReferences\n\n[1] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, pages 83–99, 2003.\n[2] D. Blackwell and J. MacQueen. Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1:353–355, 1973.\n[3] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.\n[4] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.\n[5] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proc. 17th SIGKDD Conf. on Knowledge Discovery and Data Mining, 2004.\n[6] T. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1:209–230, 1973.\n[7] J. Hanley and B. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36, 1982.\n[8] G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. In J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 282–317. MIT Press, Cambridge, MA, 1986.\n[9] T. Joachims. Transductive inference for text classification using support vector machines. In Proc. 16th International Conf. on Machine Learning (ICML), pages 200–209. Morgan Kaufmann, San Francisco, CA, 1999.\n[10] B. Krishnapuram, D. Williams, Y. Xue, A. Hartemink, L. Carin, and M. Figueiredo. On semi-supervised classification. In Advances in Neural Information Processing Systems (NIPS), 2005.\n[11] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz.
UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.\n[12] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems (NIPS), 2002.\n[13] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research (JMLR), 8:35–63, 2007.\n[14] K. Yu, A. Schwaighofer, V. Tresp, W.-Y. Ma, and H. J. Zhang. Collaborative ensemble learning: Combining collaborative and content-based information filtering via hierarchical Bayes. In Proceedings of the 19th International Conference on Uncertainty in Artificial Intelligence (UAI 2003), 2003.\n[15] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In The Twentieth International Conference on Machine Learning (ICML), 2003.", "award": [], "sourceid": 1035, "authors": [{"given_name": "Qiuhua", "family_name": "Liu", "institution": null}, {"given_name": "Xuejun", "family_name": "Liao", "institution": null}, {"given_name": "Lawrence", "family_name": "Carin", "institution": null}]}