{"title": "Multi-task Gaussian Process Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 153, "page_last": 160, "abstract": "", "full_text": "Multi-task Gaussian Process Prediction\n\nEdwin V. Bonilla, Kian Ming A. Chai, Christopher K. I. Williams\n\nSchool of Informatics, University of Edinburgh, 5 Forrest Hill, Edinburgh EH1 2QL, UK\n\nedwin.bonilla@ed.ac.uk, K.M.A.Chai@sms.ed.ac.uk, c.k.i.williams@ed.ac.uk\n\nAbstract\n\nIn this paper we investigate multi-task learning in the context of Gaussian Pro-\ncesses (GP). We propose a model that learns a shared covariance function on\ninput-dependent features and a \u201cfree-form\u201d covariance matrix over tasks. This al-\nlows for good \ufb02exibility when modelling inter-task dependencies while avoiding\nthe need for large amounts of data for training. We show that under the assump-\ntion of noise-free observations and a block design, predictions for a given task\nonly depend on its target values and therefore a cancellation of inter-task trans-\nfer occurs. We evaluate the bene\ufb01ts of our model on two practical applications:\na compiler performance prediction problem and an exam score prediction task.\nAdditionally, we make use of GP approximations and properties of our model in\norder to provide scalability to large data sets.\n\n1 Introduction\n\nMulti-task learning is an area of active research in machine learning and has received a lot of at-\ntention over the past few years. A common set up is that there are multiple related tasks for which\nwe want to avoid tabula rasa learning by sharing information across the different tasks. The hope is\nthat by learning these tasks simultaneously one can improve performance over the \u201cno transfer\u201d case\n(i.e. when each task is learnt in isolation). 
However, as pointed out in [1] and supported empirically by [2], assuming relatedness in a set of tasks and simply learning them together can be detrimental. It is therefore important to have models that will generally benefit related tasks and will not hurt performance when these tasks are unrelated. We investigate this in the context of Gaussian Process (GP) prediction.\nWe propose a model that attempts to learn inter-task dependencies based solely on the task identities and the observed data for each task. This contrasts with the approaches in [3, 4], where task-descriptor features t were used in a parametric covariance function over different tasks\u2014such a function may be too constrained by both its parametric form and the task descriptors to model task similarities effectively. In addition, for many real-life scenarios task-descriptor features are either unavailable or difficult to define correctly. Hence we propose a model that learns a \u201cfree-form\u201d task-similarity matrix, which is used in conjunction with a parameterized covariance function over the input features x.\nFor scenarios where the number of input observations is small, multi-task learning augments the data set with a number of different tasks, so that model parameters can be estimated more confidently; this helps to minimize over-fitting. In our model, this is achieved by having a common covariance function over the features x of the input observations. This contrasts with the semiparametric latent factor model [5] where, with the same set of input observations, one has to estimate the parameters of several covariance functions belonging to different latent processes.\nFor our model we can show the interesting theoretical property that there is a cancellation of inter-task transfer in the specific case of noise-free observations and a block design. 
We have investigated both gradient-based and EM-based optimization of the marginal likelihood for learning the hyperparameters of the GP. Finally, we make use of GP approximations and properties of our model in order to scale our approach to large multi-task data sets, and evaluate the benefits of our model on two practical multi-task applications: a compiler performance prediction problem and an exam score prediction task.\nThe structure of the paper is as follows: in section 2 we outline our model for multi-task learning, and discuss some approximations to speed up computations in section 3. Related work is described in section 4. We describe our experimental setup in section 5 and give results in section 6.\n\n2 The Model\n\nGiven a set X of N distinct inputs x_1, ..., x_N we define the complete set of responses for M tasks as y = (y_{11}, ..., y_{N1}, y_{12}, ..., y_{N2}, ..., y_{1M}, ..., y_{NM})^T, where y_{il} is the response for the lth task on the ith input x_i. Let us also denote by Y the N \u00d7 M matrix such that y = vec Y.\nGiven a set of observations y^o, which is a subset of y, we want to predict some of the unobserved response-values y^u at some input locations for certain tasks.\nWe approach this problem by placing a GP prior over the latent functions {f_l} so that we directly induce correlations between tasks. Assuming that the GPs have zero mean we set\n\n\\langle f_l(x) f_k(x') \\rangle = K^f_{lk} \\, k^x(x, x'),    y_{il} \\sim \\mathcal{N}(f_l(x_i), \\sigma_l^2),    (1)\n\nwhere K^f is a positive semi-definite (PSD) matrix that specifies the inter-task similarities, k^x is a covariance function over inputs, and \\sigma_l^2 is the noise variance for the lth task. Below we focus on stationary covariance functions k^x; hence, to avoid redundancy in the parametrization, we further let k^x be only a correlation function (i.e. 
it is constrained to have unit variance), since the variance can be explained fully by K^f.\nThe important property of this model is that the joint Gaussian distribution over y is not block-diagonal wrt tasks, so that observations of one task can affect the predictions on another task. In [4, 3] this property also holds, but instead of specifying a general PSD matrix K^f, these authors set K^f_{lk} = k^f(t_l, t_k), where k^f(\\cdot, \\cdot) is a covariance function over the task-descriptor features t.\nOne popular setup for multi-task learning is to assume that tasks can be clustered, and that there are inter-task correlations between tasks in the same cluster. This can be easily modelled with a general task-similarity matrix K^f: if we assume that the tasks are ordered with respect to the clusters, then K^f will have a block-diagonal structure. Of course, as we are learning a \u201cfree-form\u201d K^f the ordering of the tasks is irrelevant in practice (and is only useful for explanatory purposes).\n\n2.1 Inference\n\nInference in our model can be done by using the standard GP formulae for the mean and variance of the predictive distribution with the covariance function given in equation (1). For example, the mean prediction on a new data-point x_* for task l is given by\n\n\\bar{f}_l(x_*) = (k^f_l \\otimes k^x_*)^T \\Sigma^{-1} y,    \\Sigma = K^f \\otimes K^x + D \\otimes I,    (2)\n\nwhere \\otimes denotes the Kronecker product, k^f_l selects the lth column of K^f, k^x_* is the vector of covariances between the test point x_* and the training points, K^x is the matrix of covariances between all pairs of training points, D is an M \u00d7 M diagonal matrix in which the (l, l)th element is \\sigma_l^2, and \\Sigma is an MN \u00d7 MN matrix.\nIn section 2.3 we show that when there is no noise in the data (i.e. 
D = 0), there will be no transfer between tasks.\n\n2.2 Learning Hyperparameters\n\nGiven the set of observations y^o, we wish to learn the parameters \\theta^x of k^x and the matrix K^f to maximize the marginal likelihood p(y^o | X, \\theta^x, K^f). One way to achieve this is to use the fact that y | X \\sim \\mathcal{N}(0, \\Sigma), so gradient-based methods can be readily applied to maximize the marginal likelihood. In order to guarantee positive-semidefiniteness of K^f, one possible parametrization is the Cholesky decomposition K^f = LL^T, where L is lower triangular. Computing the derivatives of the marginal likelihood with respect to L and \\theta^x is straightforward. A drawback of this approach is its computational cost, as it requires the inversion of a matrix of potential size MN \u00d7 MN (or solving an MN \u00d7 MN linear system) at each optimization step. Note, however, that one only needs to compute the Gram matrix and its inverse at the visible locations corresponding to y^o.\nAlternatively, it is possible to exploit the Kronecker product structure of the full covariance matrix as in [6], where an EM algorithm is proposed such that learning of \\theta^x and K^f in the M-step is decoupled. This has the advantage that closed-form updates for K^f and D can be obtained (see equation (5)), and that K^f is guaranteed to be positive semi-definite. The details of the EM algorithm are as follows: let f be the vector of function values corresponding to y, and similarly F wrt Y. Further, let y_{\\cdot l} denote the vector (y_{1l}, ..., y_{Nl})^T and similarly for f_{\\cdot l}. 
Given the missing data, which in this case is f, the complete-data log-likelihood is\n\nL_{comp} = -\\frac{N}{2} \\log |K^f| - \\frac{M}{2} \\log |K^x| - \\frac{1}{2} \\mathrm{tr}\\left[ (K^f)^{-1} F^T (K^x)^{-1} F \\right] - \\frac{N}{2} \\sum_{l=1}^M \\log \\sigma_l^2 - \\frac{1}{2} \\mathrm{tr}\\left[ (Y - F) D^{-1} (Y - F)^T \\right] - \\frac{MN}{2} \\log 2\\pi,    (3)\n\nfrom which we have the following updates:\n\n\\hat{\\theta}^x = \\arg\\min_{\\theta^x} \\left( N \\log \\left| \\langle F^T (K^x(\\theta^x))^{-1} F \\rangle \\right| + M \\log |K^x(\\theta^x)| \\right),    (4)\n\n\\hat{K}^f = N^{-1} \\langle F^T (K^x(\\hat{\\theta}^x))^{-1} F \\rangle,    \\hat{\\sigma}_l^2 = N^{-1} \\langle (y_{\\cdot l} - f_{\\cdot l})^T (y_{\\cdot l} - f_{\\cdot l}) \\rangle,    (5)\n\nwhere the expectations \\langle \\cdot \\rangle are taken with respect to p(f | y^o, \\theta^x, K^f), and \\hat{\\cdot} denotes the updated parameters. For clarity, let us consider the case where y^o = y, i.e. a block design. Then\n\np(f | y, \\theta^x, K^f) = \\mathcal{N}\\left( (K^f \\otimes K^x) \\Sigma^{-1} y, \\; (K^f \\otimes K^x) - (K^f \\otimes K^x) \\Sigma^{-1} (K^f \\otimes K^x) \\right).\n\nWe have seen that \\Sigma needs to be inverted (in time O(M^3 N^3)) both for making predictions and for learning the hyperparameters (when considering noisy observations). This can lead to computational problems if MN is large. In section 3 we give some approximations that can help speed up these computations.\n\n2.3 Noiseless observations and the cancellation of inter-task transfer\n\nOne particularly interesting case to consider is noise-free observations at the same locations for all tasks (i.e. a block design), so that y | X \\sim \\mathcal{N}(0, K^f \\otimes K^x). In this case maximizing the marginal likelihood p(y | X) wrt the parameters \\theta^x of k^x reduces to maximizing -M \\log |K^x| - N \\log |Y^T (K^x)^{-1} Y|, an expression that does not depend on K^f. After convergence we can obtain K^f as \\hat{K}^f = \\frac{1}{N} Y^T (K^x)^{-1} Y. The intuition behind this is as follows: the responses Y are correlated via K^f and K^x. We can learn K^f by first decorrelating Y with (K^x)^{-1}, so that only the correlation with respect to K^f is left; K^f is then simply the sample covariance of the decorrelated Y.\nUnfortunately, in this case there is effectively no transfer between the tasks (given the kernels). To see this, consider making predictions at a new location x_* for all tasks. Using the mixed-product property of Kronecker products, we have\n\n\\mathbf{f}(x_*) = (K^f \\otimes k^x_*)^T (K^f \\otimes K^x)^{-1} y    (6)\n= \\left( (K^f)^T \\otimes (k^x_*)^T \\right) \\left( (K^f)^{-1} \\otimes (K^x)^{-1} \\right) y    (7)\n= \\left[ \\left( K^f (K^f)^{-1} \\right) \\otimes \\left( (k^x_*)^T (K^x)^{-1} \\right) \\right] y    (8)\n= \\left( (k^x_*)^T (K^x)^{-1} y_{\\cdot 1}, \\; \\ldots, \\; (k^x_*)^T (K^x)^{-1} y_{\\cdot M} \\right)^T,    (9)\n\nand similarly for the covariances. Thus, in the noiseless case with a block design, the predictions for task l depend only on the targets y_{\\cdot l}. In other words, there is a cancellation of transfer. One can in fact generalize this result to show that the cancellation of transfer for task l still holds even if the observations on the other tasks are only sparsely observed at the locations X = (x_1, ..., x_N). After having derived this result we learned that it is known as autokrigeability in the geostatistics literature [7], and it is also related to the symmetric Markov property of covariance functions discussed in [8]. We emphasize that if the observations are noisy, or if there is not a block design, then this result on cancellation of transfer will not hold. 
This result can also be generalized to multidimensional tensor-product covariance functions and grids [9].\n\n3 Approximations to speed up computations\n\nThe issue of dealing with large N has been much studied in the GP literature; see [10, ch. 8] and [11] for overviews. In particular, one can use sparse approximations where only Q out of N data points are selected as inducing inputs [11]. Here, we use the Nystr\u00f6m approximation of K^x in the marginal likelihood, so that K^x \\approx \\tilde{K}^x \\stackrel{def}{=} K^x_{\\cdot I} (K^x_{II})^{-1} K^x_{I \\cdot}, where I indexes Q rows/columns of K^x. In fact, for the posterior at the training points this result is obtained from both the subset of regressors (SoR) and projected process (PP) approximations described in [10, ch. 8].\nSpecifying a full-rank K^f requires M(M + 1)/2 parameters, and for large M this would be a lot of parameters to estimate. One parametrization of K^f that reduces this problem is to use a PPCA model [12]: K^f \\approx \\tilde{K}^f \\stackrel{def}{=} U \\Lambda U^T + s^2 I_M, where U is an M \u00d7 P matrix of the P principal eigenvectors of K^f, \\Lambda is a P \u00d7 P diagonal matrix of the corresponding eigenvalues, and s^2 can be determined analytically from the eigenvalues of K^f (see [12] and references therein). For numerical stability, we may further use the incomplete Cholesky decomposition, setting U \\Lambda U^T = \\tilde{L} \\tilde{L}^T, where \\tilde{L} is an M \u00d7 P matrix. Below we consider the case s = 0, i.e. a rank-P approximation to K^f.\nApplying both approximations to get \\Sigma \\approx \\tilde{\\Sigma} \\stackrel{def}{=} \\tilde{K}^f \\otimes \\tilde{K}^x + D \\otimes I_N, we have, after using the Woodbury identity, \\tilde{\\Sigma}^{-1} = \\Delta^{-1} - \\Delta^{-1} B \\left[ I \\otimes K^x_{II} + B^T \\Delta^{-1} B \\right]^{-1} B^T \\Delta^{-1}, where B \\stackrel{def}{=} \\tilde{L} \\otimes K^x_{\\cdot I} and \\Delta \\stackrel{def}{=} D \\otimes I_N is a diagonal matrix. 
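As a concrete illustration of the Nyström step above (a sketch under our own naming, not the authors' code), the approximation \tilde{K}^x = K^x_{\cdot I} (K^x_{II})^{-1} K^x_{I \cdot} has rank at most Q and reproduces K^x exactly on the inducing rows and columns:

```python
import numpy as np

rng = np.random.default_rng(1)
N, Q = 40, 8
X = rng.uniform(-2, 2, (N, 1))

def k_se(A, B, ell=1.0):
    # squared-exponential correlation function
    return np.exp(-0.5 * (A[:, None, 0] - B[None, :, 0]) ** 2 / ell ** 2)

Kx = k_se(X, X)
idx = rng.choice(N, size=Q, replace=False)   # the index set I of inducing points
K_nI = Kx[:, idx]                            # K^x_{.I}
K_II = Kx[np.ix_(idx, idx)]                  # K^x_{II}

# Nystrom approximation of K^x (small jitter on K_II for stability)
Kx_tilde = K_nI @ np.linalg.solve(K_II + 1e-9 * np.eye(Q), K_nI.T)

assert np.linalg.matrix_rank(Kx_tilde, tol=1e-7) <= Q        # rank at most Q
assert np.allclose(Kx_tilde[np.ix_(idx, idx)], K_II, atol=1e-5)  # exact on I
```

The low rank is what makes the Woodbury inversion of \tilde{\Sigma} above cheap: solves against \tilde{\Sigma} reduce to diagonal scalings plus a PQ-dimensional inner system.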
As \\tilde{K}^f \\otimes \\tilde{K}^x has rank PQ, the computation of \\tilde{\\Sigma}^{-1} y takes O(MNP^2Q^2) time.\nFor the EM algorithm, the approximation \\tilde{K}^x poses a problem in (4) because for the rank-deficient matrix \\tilde{K}^x the log-determinant is negative infinity and the matrix inverse is undefined. We overcome this by considering \\tilde{K}^x = \\lim_{\\xi \\to 0} (K^x_{\\cdot I} (K^x_{II})^{-1} K^x_{I \\cdot} + \\xi^2 I), so that we solve an equivalent optimization problem in which the log-determinant is replaced by the well-defined \\log |K^x_{I \\cdot} K^x_{\\cdot I}| - \\log |K^x_{II}|, and the matrix inverse is replaced by the pseudo-inverse. With these approximations the computational complexity of hyperparameter learning can be reduced to O(MNP^2Q^2) per iteration for both the Cholesky and EM methods.\n\n4 Related work\n\nThere has been a lot of work in recent years on multi-task learning (or inductive transfer) using methods such as neural networks, Gaussian processes, Dirichlet processes and support vector machines; see e.g. [2, 13] for early references. The key issue concerns what properties or aspects should be shared across tasks. Within the GP literature, [14, 15, 16, 17, 18] give models where the covariance matrix of the full (noiseless) system is block diagonal, and each of the M blocks is induced from the same kernel function. Under these models each y_{\\cdot i} is conditionally independent, but inter-task tying takes place by sharing the kernel function across tasks. In contrast, in our model and in [5, 3, 4] the covariance is not block diagonal.\nThe semiparametric latent factor model (SLFM) of Teh et al. [5] involves P latent processes (where P \u2264 M), each of which has its own covariance function. The noiseless outputs are obtained by linear mixing of these processes with an M \u00d7 P matrix \\Phi. 
The covariance matrix of the system under this model has rank at most PN, so that when P < M the system corresponds to a degenerate GP. Our model is similar to [5] but simpler, in that all of the P latent processes share the same covariance function; this reduces the number of free parameters to be fitted and should help to minimize overfitting. With a common covariance function k^x, it turns out that K^f is equal to \\Phi \\Phi^T, so a K^f that is strictly positive definite corresponds to using P = M latent processes. Note that if P > M one can always find an M \u00d7 M matrix \\Phi' such that \\Phi' \\Phi'^T = \\Phi \\Phi^T. We note also that the approximation methods used in [5] are different to ours, being based on the subset of data (SoD) method with the informative vector machine (IVM) selection heuristic.\nIn the geostatistics literature, the prior model for f given in eq. (1) is known as the intrinsic correlation model [7], a specific case of co-kriging. A sum of such processes is known as the linear coregionalization model (LCM) [7], for which [6] gives an EM-based algorithm for parameter estimation. Our model for the observations corresponds to an LCM model with two processes: the process for f and the noise process. Note that SLFM can also be seen as an instance of the LCM model. To see this, let E_{pp} be a P \u00d7 P diagonal matrix with 1 at (p, p) and zeros elsewhere. Then we can write the covariance in SLFM as (\\Phi \\otimes I) \\left( \\sum_{p=1}^P E_{pp} \\otimes K^x_p \\right) (\\Phi \\otimes I)^T = \\sum_{p=1}^P (\\Phi E_{pp} \\Phi^T) \\otimes K^x_p, where \\Phi E_{pp} \\Phi^T is of rank 1.\nEvgeniou et al. [19] consider methods for inducing correlations between tasks based on a correlated prior over linear regression parameters. In fact this corresponds to a GP prior using the kernel k(x, x') = x^T A x' for some positive definite matrix A. In their experiments they use a restricted form of K^f with K^f_{lk} = (1 - \\lambda) + \\lambda M \\delta_{lk} (their eq. 25), i.e. a convex combination of a rank-1 matrix of ones and a multiple of the identity. Notice the similarity to the PPCA form of K^f given in section 3.\n\n5 Experiments\n\nWe evaluate our model on two different applications. The first application is a compiler performance prediction problem, where the goal is to predict the speed-up obtained in a given program (task) when applying a sequence of code transformations x. The second application is an exam score prediction problem, where the goal is to predict the exam score obtained by a student x belonging to a specific school (task). In the sequel, we will refer to the data related to the first problem as the compiler data and the data related to the second problem as the school data.\nWe are interested in assessing the benefits of our approach not only with respect to the no-transfer case but also with respect to the case when a parametric GP is used on the joint input-dependent and task-dependent space, as in [3]. To train the parametric model, note that the parameters of the covariance function over task descriptors k^f(t, t') can be tuned by maximizing the marginal likelihood, as in [3]. For the free-form K^f we initialize this (given k^x(\\cdot, \\cdot)) by using the noise-free expression \\hat{K}^f = \\frac{1}{N} Y^T (K^x)^{-1} Y given in section 2.3 (or the appropriate generalization when the design is not complete). For both applications we have used a squared-exponential (or Gaussian) covariance function k^x and a non-parametric form for K^f. Where relevant, the parametric covariance function k^f was also taken to be of squared-exponential form. Both k^x and k^f used an automatic relevance determination (ARD) parameterization, i.e. having a length scale for each feature dimension. 
All the length scales in k^x and k^f were initialized to 1, and all \\sigma_l^2 were constrained to be equal across tasks and initialized to 0.01.\n\n5.1 Description of the Data\n\nCompiler Data. This data set consists of 11 C programs for which an exhaustive set of 88214 sequences of code transformations have been applied and their corresponding speed-ups have been recorded. Each task is to predict the speed-up on a given program when applying a specific transformation sequence. The speed-up after applying a transformation sequence on a given program is defined as the ratio of the execution time of the original program (baseline) over the execution time of the transformed program. Each transformation sequence is described as a 13-dimensional vector x that records the absence/presence of one out of 13 single transformations. In [3] the task-descriptor features (for each program) are based on the speed-ups obtained on a pre-selected set of 8 transformation sequences, so-called \u201ccanonical responses\u201d. The reader is referred to [3, section 3] for a more detailed description of the data.\nSchool Data. This data set comes from the Inner London Education Authority (ILEA) and has been used to study the effectiveness of schools. It is publicly available under the name \u201cschool effectiveness\u201d at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-support/datasets.shtml. It consists of examination records from 139 secondary schools in years 1985, 1986 and 1987. It is a random 50% sample with 15362 students. This data has also been used in the context of multi-task learning by Bakker and Heskes [20] and Evgeniou et al. [19]. 
In [20] each task is defined as the prediction of the exam score of a student belonging to a specific school, based on four student-dependent features (year of the exam, gender, VR band and ethnic group) and four school-dependent features (percentage of students eligible for free school meals, percentage of students in VR band 1, school gender and school denomination). For comparison with [20, 19] we evaluate our model following the set-up described above; similarly, we have created dummy variables for the categorical features, giving a total of 19 student-dependent features and 8 school-dependent features. However, we note that school-descriptor features such as the percentage of students eligible for free school meals and the percentage of students in VR band 1 actually depend on the year the particular sample was taken.\nIt is important to emphasize that for both data sets task-descriptor features are available. However, as we have described throughout this paper, our approach learns task similarity directly, without the need for task-dependent features. Hence, we have neglected these features in the application of our free-form K^f method.\n\n6 Results\n\nFor the compiler data we have M = 11 tasks and we have used a Cholesky decomposition K^f = LL^T. For the school data we have M = 139 tasks and we have preferred a reduced-rank parameterization K^f \\approx \\tilde{K}^f = \\tilde{L} \\tilde{L}^T, with ranks 1, 2, 3 and 5. We have learnt the parameters of the models so as to maximize the marginal likelihood p(y^o | X, K^f, \\theta^x) using gradient-based search in MATLAB with Carl Rasmussen\u2019s minimize.m. 
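The objective being minimized is the negative log marginal likelihood, with K^f parameterized through its Cholesky factor. A minimal numpy sketch of that objective (our own code and naming, not the authors' MATLAB implementation; it assumes a full block design and a noise variance shared across tasks, as in the experiments above):

```python
import numpy as np

def neg_log_marglik(L_vec, Kx, Y, noise_var):
    """-log p(y | X) for the multi-task GP, with K^f = L L^T (L lower triangular).

    Y is N x M (full block design); noise_var is the shared noise variance.
    Sketch only: a practical version would exploit the Kronecker structure
    instead of forming the full MN x MN matrix.
    """
    N, M = Y.shape
    L = np.zeros((M, M))
    L[np.tril_indices(M)] = L_vec                 # unconstrained Cholesky parameters
    Kf = L @ L.T                                  # guaranteed PSD task matrix
    Sigma = np.kron(Kf, Kx) + noise_var * np.eye(M * N)   # K^f (x) K^x + D (x) I
    y = Y.reshape(-1, order="F")                  # y = vec(Y)
    C = np.linalg.cholesky(Sigma)
    alpha = np.linalg.solve(C.T, np.linalg.solve(C, y))
    return 0.5 * y @ alpha + np.log(np.diag(C)).sum() + 0.5 * M * N * np.log(2 * np.pi)
```

This scalar objective, together with the ARD length scales of k^x, can be handed to any gradient-based optimizer (e.g. scipy.optimize.minimize with numerical or hand-derived gradients), which is the role minimize.m plays above.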
In our experiments this method usually outperformed EM both in the quality of the solutions found and in the speed of convergence.\nCompiler Data: For this particular application, in a real-life scenario it is critical to achieve good performance with a low number of training data-points per task, given that each training data-point requires the compilation and execution of a (potentially) different version of a program. Therefore, although there are a total of 88214 training points per program, we have followed a set-up similar to [3] by considering N = 16, 32, 64 and 128 transformation sequences per program for training. All M = 11 programs (tasks) have been used for training, and predictions have been made at the (unobserved) remaining 88214 - N inputs. For comparison with [3], the mean absolute error (between the actual speed-ups of a program and the predictions) has been used as the measure of performance. Due to the variability of the results depending on training-set selection, we have considered 10 different replications.\nFigure 1 shows the mean absolute errors obtained on the compiler data for some of the tasks (top row and bottom left) and on average for all the tasks (bottom right). Sample task 1 (histogram) is an example where learning the tasks simultaneously brings major benefits over the no-transfer case. Here, multi-task GP (transfer free-form) reduces the mean absolute error by a factor of up to 6. Additionally, it is consistently (although only marginally) superior to the parametric approach. For sample task 2 (fir), our approach not only significantly outperforms the no-transfer case but also provides greater benefits than the parametric method (which for N = 64 and 128 is worse than no transfer). 
Sample task 3 (adpcm) is the only case out of all 11 tasks where our approach degrades performance, although it should be noted that all the methods perform similarly. Further analysis of the data indicates that learning on this task is hard, as there is a lot of variability that cannot be explained by the 1-out-of-13 encoding used for the input features. Finally, for all tasks on average (bottom right) our approach brings significant improvements over single-task learning and consistently outperforms the parametric method. For all tasks except one, our model provides better or roughly equal performance compared with the no-transfer case and the parametric model.\nSchool Data: For comparison with [20, 19] we have made 10 random splits of the data into training (75%) and test (25%) sets. Due to the categorical nature of the data there are a maximum of N = 202 different student-dependent feature vectors x. Given that there can be multiple observations of a target value for a given task at a specific input x, we have taken the mean of these observations and corrected the noise variances by dividing them by the corresponding number of observations. As in [19], the percentage of explained variance is used as the measure of performance. This measure can be seen as the percentage version of the well-known coefficient of determination r^2 between the actual target values and the predictions.\n\nFigure 1: Panels (a), (b) and (c) show the average mean absolute error on the compiler data as a function of the number of training points for specific tasks. no transfer stands for the use of a single GP for each task separately; transfer parametric is the use of a GP with a joint parametric (SE) covariance function as in [3]; and transfer free-form is multi-task GP with a \u201cfree-form\u201d covariance matrix over tasks. 
The error bars show \u00b1 one standard deviation taken over the 10 replications. Panel (d) shows the average MAE over all 11 tasks, and the error bars show the average of the standard deviations over all 11 tasks.\n\nThe results are shown in Table 1; note that larger figures are better. The parametric result given in the table was obtained from the school-descriptor features; in the cases where these features varied for a given school over the years, an average was taken. The results show that better results can be obtained by using multi-task learning than without. For the non-parametric K^f, we see that the rank-2 model gives the best performance. This performance is also comparable with the best (29.5%) found in [20]. We also note that our no-transfer result of 21.1% is much better than the baseline of 9.7% found in [20] using neural networks.\n\nno transfer: 21.05 (1.15) | parametric: 31.57 (1.61) | rank 1: 27.02 (2.03) | rank 2: 29.20 (1.60) | rank 3: 24.88 (1.62) | rank 5: 21.00 (2.42)\n\nTable 1: Percentage variance explained on the school dataset for various situations. The figures in brackets are standard deviations obtained from the ten replications.\n\nOn the school data the parametric approach for K^f slightly outperforms the non-parametric method, probably due to the large size of this matrix relative to the amount of data. One can also run the parametric approach creating a task for every unique school-features descriptor(1); this gives rise to 288 tasks rather than 139 schools, and a performance of 33.08% (\u00b11.57). Evgeniou et al. [19] use a linear predictor on all 8 features (i.e. they combine both student and school features into x) and then introduce inter-task correlations as described in section 4. 
This approach uses the same information as our 288-task case, and gives similar performance of around 34% (as shown in Figure 3 of [19]).\n\n(1) Recall from section 5.1 that the school features can vary over different years.\n\n7 Conclusion\n\nIn this paper we have described a method for multi-task learning based on a GP prior which has inter-task correlations specified by the task-similarity matrix K^f. We have shown that in a noise-free block design there is actually a cancellation of transfer in this model, but not in general. We have successfully applied the method to the compiler and school problems. An advantage of our method is that task-descriptor features are not required (c.f. [3, 4]). However, such features might be beneficial if we consider a setup where there are only few datapoints for a new task, and where the task-descriptor features convey useful information about the tasks.\n\nAcknowledgments\n\nCW thanks Dan Cornford for pointing out the prior work on autokrigeability. KMC thanks DSO NL for support. This work is supported under EPSRC grant GR/S71118/01, EU FP6 STREP MILEPOST IST-035307, and in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors\u2019 views.\n\nReferences\n[1] Jonathan Baxter. A Model of Inductive Bias Learning. JAIR, 12:149\u2013198, March 2000.\n[2] Rich Caruana. Multitask Learning. Machine Learning, 28(1):41\u201375, July 1997.\n[3] Edwin V. Bonilla, Felix V. Agakov, and Christopher K. I. 
Williams. Kernel Multi-task Learning using Task-specific Features. In Proceedings of the 11th AISTATS, March 2007.\n[4] Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, and Zhao Xu. Stochastic Relational Models for Discriminative Link Prediction. In NIPS 19, Cambridge, MA, 2007. MIT Press.\n[5] Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In Proceedings of the 10th AISTATS, pages 333\u2013340, January 2005.\n[6] Hao Zhang. Maximum-likelihood estimation for multivariate spatial linear coregionalization models. Environmetrics, 18(2):125\u2013139, 2007.\n[7] Hans Wackernagel. Multivariate Geostatistics: An Introduction with Applications. Springer-Verlag, Berlin, 2nd edition, 1998.\n[8] A. O\u2019Hagan. A Markov property for covariance structures. Statistics Research Report 98-13, Nottingham University, 1998.\n[9] C. K. I. Williams, K. M. A. Chai, and E. V. Bonilla. A note on noise-free Gaussian process prediction with separable covariance functions and grid designs. Technical report, University of Edinburgh, 2007.\n[10] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.\n[11] Joaquin Qui\u00f1onero-Candela, Carl Edward Rasmussen, and Christopher K. I. Williams. Approximation Methods for Gaussian Process Regression. In Large Scale Kernel Machines. MIT Press, 2007. To appear.\n[12] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611\u2013622, 1999.\n[13] S. Thrun. Is Learning the n-th Thing Any Easier Than Learning the First? In NIPS 8, 1996.\n[14] Thomas P. Minka and Rosalind W. Picard. Learning How to Learn is Learning with Point Sets. 1999.\n[15] Neil D. Lawrence and John C. Platt. Learning to learn with the Informative Vector Machine. 
In Proceedings of the 21st International Conference on Machine Learning, July 2004.\n[16] Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning Gaussian Processes from Multiple Tasks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.\n[17] Anton Schwaighofer, Volker Tresp, and Kai Yu. Learning Gaussian Process Kernels via Hierarchical Bayes. In NIPS 17, Cambridge, MA, 2005. MIT Press.\n[18] Shipeng Yu, Kai Yu, Volker Tresp, and Hans-Peter Kriegel. Collaborative Ordinal Regression. In Proceedings of the 23rd International Conference on Machine Learning, June 2006.\n[19] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning Multiple Tasks with Kernel Methods. Journal of Machine Learning Research, 6:615\u2013637, April 2005.\n[20] Bart Bakker and Tom Heskes. Task Clustering and Gating for Bayesian Multitask Learning. Journal of Machine Learning Research, 4:83\u201399, May 2003.\n", "award": [], "sourceid": 431, "authors": [{"given_name": "Edwin", "family_name": "Bonilla", "institution": null}, {"given_name": "Kian", "family_name": "Chai", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}]}