{"title": "Radial Basis Function Network for Multi-task Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 792, "page_last": 802, "abstract": "", "full_text": "Radial Basis Function Network for Multi-task Learning\n\nXuejun Liao\nDepartment of ECE\nDuke University\nDurham, NC 27708-0291, USA\nxjliao@ee.duke.edu\n\nLawrence Carin\nDepartment of ECE\nDuke University\nDurham, NC 27708-0291, USA\nlcarin@ee.duke.edu\n\nAbstract\n\nWe extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions.\n\n1 Introduction\n\nIn practical applications, one is frequently confronted with situations in which multiple tasks must be solved. Often these tasks are not independent, implying that what is learned from one task is transferable to another correlated task. By making use of this transferability, each task is made easier to solve. In machine learning, the concept of explicitly exploiting the transferability of expertise between tasks, by learning the tasks simultaneously under a unified representation, is formally referred to as “multi-task learning” [1].\n\nIn this paper we extend radial basis function (RBF) networks [4,5] to the scenario of multi-task learning and present the corresponding learning algorithms. Our primary interest is to learn the regression model of several data sets, where any given data set may be correlated with some other sets but not necessarily with all of them. 
The advantage of multi-task learning is usually manifested when the training set of each individual task is weak, i.e., it does not generalize well to the test data. Our algorithms aim to enhance, in a mutually beneficial way, the weak training sets of multiple tasks, by learning them simultaneously. Multi-task learning becomes superfluous when the data sets all come from the same generating distribution, since in that case we can simply take the union of them and treat the union as a single task. In the other extreme, when all the tasks are independent, there is no correlation to utilize and we learn each task separately.\n\nThe paper is organized as follows. We define the structure of the multi-task RBF network in Section 2 and present the supervised learning algorithm in Section 3. In Section 4 we show how to learn the network structure in an unsupervised manner, and based on this we demonstrate how to actively select the training data, with the goal of improving the generalization to test data. We perform experimental studies in Section 5 and conclude the paper in Section 6.\n\n2 Multi-Task Radial Basis Function Network\n\nFigure 1 schematizes the radial basis function (RBF) network structure customized to multi-task learning. The network consists of an input layer, a hidden layer, and an output layer. The input layer receives a data point x = [x1, · · · , xd]^T ∈ Rd and submits it to the hidden layer. Each node at the hidden layer has a localized activation φn(x) = φ(||x − cn||, σn), n = 1, · · · , N, where || · || denotes the vector norm and φn(·) is a radial basis function (RBF) localized around cn, with the degree of localization parameterized by σn. Choosing φ(z, σ) = exp(−z²/(2σ²)) gives the Gaussian RBF. The activations of all hidden nodes are weighted and sent to the output layer. 
Each output node represents a unique task and has its own hidden-to-output weights. The weighted activations of the hidden nodes are summed at each output node to produce the output for the associated task. Denoting wk = [w0k, w1k, · · · , wNk]^T as the weights connecting the hidden nodes to the k-th output node, the output for the k-th task, in response to input x, takes the form\n\nfk(x) = wk^T φ(x)    (1)\n\nwhere φ(x) = [φ0(x), φ1(x), . . . , φN(x)]^T is a column vector containing N + 1 basis functions, with φ0(x) ≡ 1 a dummy basis accounting for the bias in Figure 1.\n\n[Figure 1 here: schematic of the network, with the input layer (specified by the data dimensionality), the hidden layer of basis functions φ1(x), . . . , φN(x) plus the bias φ0 ≡ 1 (to be learned by the algorithms), and the output layer of K task nodes f1(x), . . . , fK(x) with hidden-to-output weights w1, . . . , wK.]\n\nFigure 1: A multi-task structure of the RBF network. Each of the output nodes represents a unique task. Each task has its own hidden-to-output weights but all the tasks share the same hidden nodes. The activation of hidden node n is characterized by a basis function φn(x) = φ(||x − cn||, σn). A typical choice of φ is φ(z, σ) = exp(−z²/(2σ²)), which gives the Gaussian RBF.\n\n3 Supervised Learning\n\nSuppose we have K tasks and the data set of the k-th task is Dk = {(x1k, y1k), · · · , (xJk,k, yJk,k)}, where yik is the target (desired output) of xik. By definition, a given data point xik is said to be supervised if the associated target yik is provided and unsupervised if yik is not provided. 
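As a concrete illustration of (1), the forward pass of the network in Figure 1 can be written as follows (a minimal NumPy sketch of ours; the function names are illustrative, not from the paper):

```python
import numpy as np

def phi_vector(x, centers, sigmas):
    """Basis vector phi(x) = [1, phi_1(x), ..., phi_N(x)]^T with Gaussian RBFs
    phi_n(x) = exp(-||x - c_n||^2 / (2 sigma_n^2)); the leading 1 is the dummy
    basis phi_0 that accounts for the bias."""
    d2 = np.sum((centers - x) ** 2, axis=1)  # squared distances ||x - c_n||^2
    return np.concatenate(([1.0], np.exp(-d2 / (2.0 * sigmas ** 2))))

def mtrbf_outputs(x, centers, sigmas, W):
    """f_k(x) = w_k^T phi(x) for all K tasks at once. W is (N+1, K), one weight
    column per task: the tasks share the hidden nodes but not the weights."""
    return W.T @ phi_vector(x, centers, sigmas)
```

With a single center placed at the query point, the hidden activation is exactly 1, so each task output reduces to the bias weight plus the corresponding hidden-to-output weight.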
The definition extends similarly to a set of data\n\nTable 1: Learning Algorithm of the Multi-Task RBF Network\n\nInput: {(x1k, y1k), · · · , (xJk,k, yJk,k)}k=1:K, φ(·, σ), σ, and ρ; Output: φ(·) and {wk}k=1:K.\n\n1. For m = 1 : K, n = 1 : Jm, k = 1 : K, i = 1 : Jk, compute φ̂nm,ik = φ(||xnm − xik||, σ);\n2. Let N = 0, φ(·) = 1, e0 = Σ_{k=1}^{K} [Σ_{i=1}^{Jk} yik² − (Jk + ρ)^{−1} (Σ_{i=1}^{Jk} yik)²]; for k = 1 : K, compute Ak = Jk + ρ and wk = (Jk + ρ)^{−1} Σ_{i=1}^{Jk} yik;\n3. For m = 1 : K, n = 1 : Jm: if φ̂nm is not marked as “deleted”, then for k = 1 : K compute ck = Σ_{i=1}^{Jk} φik φ̂nm,ik and qk = Σ_{i=1}^{Jk} (φ̂nm,ik)² + ρ − ck^T Ak^{−1} ck; if there exists k such that qk = 0, mark φ̂nm as “deleted”; else, compute δe(φ, φ̂nm) using (5);\n4. If {φ̂nm}n=1:Jm,m=1:K are all marked as “deleted”, go to 10;\n5. Let (n∗, m∗) = arg max over φ̂nm not marked as “deleted” of δe(φ, φ̂nm); mark φ̂n∗m∗ as “deleted”;\n6. Tune the RBF parameter: σN+1 = arg max over σ of δe(φ, φ(|| · − xn∗m∗||, σ));\n7. Let φN+1(·) = φ(|| · − xn∗m∗||, σN+1); update φ(·) ← [φ^T(·), φN+1(·)]^T;\n8. For k = 1 : K, compute Ak^new and wk^new respectively by (A-1) and (A-3) in the appendix; update Ak ← Ak^new, wk ← wk^new;\n9. Let eN+1 = eN − δe(φ, φN+1); if the sequence {en}n=0:(N+1) has converged, go to 10, else update N ← N + 1 and go back to 3;\n10. Exit and output φ(·) and {wk}k=1:K.\n\n
We are interested in learning the functions fk(x) for the K tasks, based on \u222aK\nThe learning is based on minimizing the squared error\n\nk=1Dk.\n\ne (\u03c6, w) =PK\n\nk=1nPJk\n\ni=1(cid:0)wT\n\nk \u03c6ik \u2212 yik(cid:1)2\n\n+ \u03c1 ||wk||2o\n\nwhere \u03c6ik = \u03c6(xik) for notational simplicity. The regularization terms \u03c1 ||wk||2, k =\n1, \u00b7 \u00b7 \u00b7 , K, are used to prevent singularity of the A matrices de\ufb01ned in (3), and \u03c1 is typically\nset to a small positive number. For \ufb01xed \u03c6\u2019s, the w\u2019s are solved by minimizing e(\u03c6, w)\nwith respect to w, yielding\n\nwk = A\u22121\n\ni=1yik\u03c6ik\n\nk PJk\n\nand Ak =PJk\n\ni=1\u03c6ik\u03c6T\n\nik + \u03c1 I,\n\nk = 1, \u00b7 \u00b7 \u00b7 , K (3)\n\nIn a multi-task RBF network, the input layer and output layer are respectively speci\ufb01ed by\nthe data dimensionality and the number of tasks. We now discuss how to determine the\nhidden layer (basis functions \u03c6). Substituting the solutions of the w\u2019s in (3) into (2) gives\n\ne(\u03c6) =PK\n\nk=1PJk\n\ni=1(cid:0)y2\n\nik \u2212 yikwT\n\nk \u03c6ik(cid:1)\n\nwhere e(\u03c6) is a function of \u03c6 only because w\u2019s are now functions of \u03c6 as given by\n(3). By minimizing e(\u03c6), we can determine \u03c6. Recalling that \u03c6ik is an abbreviation of\n\n\u03c6(xik) =(cid:2)1, \u03c61(xik), . . . , \u03c6N (xik)(cid:3)T , this amounts to determining N, the number of ba-\n\nsis functions, and the functional form of each basis function \u03c6n(\u00b7), n = 1, . . . , N. Consider\nthe candidate functions {\u03c6nm(x) = \u03c6(||x\u2212 xnm||, \u03c3) : n = 1, \u00b7 \u00b7 \u00b7 , Jm, m = 1, \u00b7 \u00b7 \u00b7 , K}.\nWe learn the RBF network structure by selecting \u03c6(\u00b7) from these candidate functions such\nthat e(\u03c6) in (4) is minimized. 
The following theorem tells us how to perform the selection in a sequential way; the proof is given in the Appendix.\n\nTheorem 1 Let φ(x) = [1, φ1(x), . . . , φN(x)]^T and let φN+1(x) be a single basis function. Assume the A matrices corresponding to φ and [φ^T, φN+1]^T are all non-degenerate. Then\n\nδe(φ, φN+1) = e(φ) − e([φ^T, φN+1]^T) = Σ_{k=1}^{K} (ck^T wk − Σ_{i=1}^{Jk} yik φN+1,ik)² qk^{−1}    (5)\n\nwhere φN+1,ik = φN+1(xik), wk and Ak are the same as in (3), and\n\nck = Σ_{i=1}^{Jk} φik φN+1,ik,  dk = Σ_{i=1}^{Jk} (φN+1,ik)² + ρ,  qk = dk − ck^T Ak^{−1} ck    (6)\n\nBy the conditions of the theorem, Ak^new is full rank and hence positive definite by construction. By (A-2) in the Appendix, qk^{−1} is a diagonal element of (Ak^new)^{−1}; therefore qk^{−1} is positive and, by (5), δe(φ, φN+1) > 0, which means adding φN+1 to φ generally makes the squared error decrease. The decrease δe(φ, φN+1) depends on φN+1. By sequentially selecting the basis functions that bring the maximum error reduction, we achieve the goal of minimizing e(φ). The details of the learning algorithm are summarized in Table 1.\n\n4 Active Learning\n\nIn the previous section, the data in Dk are supervised (provided with the targets). In this section, we assume the data in Dk are initially unsupervised (only x is available, without access to the associated y) and we select a subset from Dk to be supervised (targets acquired) such that the resulting network generalizes well to the remaining data in Dk. This approach is generally known as active learning [6]. We first learn the basis functions φ from the unsupervised data, and based on φ select the data to be supervised. 
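The error reduction (5)-(6) that drives the greedy selection in Table 1 can be sketched numerically as follows (an illustrative NumPy sketch of ours, not the paper's code):

```python
import numpy as np

def delta_e(Phi_list, y_list, u_list, rho=1e-6):
    """Error reduction delta_e(phi, phi_{N+1}) of eq. (5) for one candidate basis.
    Phi_list[k]: (J_k, N+1) current design matrix of task k; y_list[k]: targets;
    u_list[k]: (J_k,) values of the candidate basis on task k's inputs."""
    total = 0.0
    for Phi, y, u in zip(Phi_list, y_list, u_list):
        A = Phi.T @ Phi + rho * np.eye(Phi.shape[1])  # A_k of eq. (3)
        w = np.linalg.solve(A, Phi.T @ y)             # w_k of eq. (3)
        c = Phi.T @ u                                 # c_k of eq. (6)
        q = u @ u + rho - c @ np.linalg.solve(A, c)   # q_k = d_k - c_k^T A_k^{-1} c_k
        total += (c @ w - y @ u) ** 2 / q             # one term of eq. (5)
    return total
```

Ranking all candidate bases by this quantity and keeping the maximizer reproduces one pass of steps 3-5 in Table 1; because each term is a squared quantity divided by a positive q_k, the returned reduction is nonnegative, matching the discussion after Theorem 1.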
Both of these steps are based on the following theorem, the proof of which is given in the Appendix.\n\nTheorem 2 Let there be K tasks, and let the data set of the k-th task be Dk ∪ D̃k, where Dk = {(xik, yik)}_{i=1}^{Jk} and D̃k = {(xik, yik)}_{i=Jk+1}^{Jk+J̃k}. Let there be two multi-task RBF networks, whose output nodes are characterized by fk(·) and f∼k(·), respectively, for task k = 1, . . . , K. The two networks have the same given basis functions (hidden nodes) φ(·) = [1, φ1(·), · · · , φN(·)]^T, but different hidden-to-output weights. The weights of fk(·) are trained with Dk ∪ D̃k, while the weights of f∼k(·) are trained using D̃k. Then for k = 1, · · · , K, the squared errors committed on Dk by fk(·) and f∼k(·) are related by\n\n0 ≤ [det Γk]^{−1} ≤ λmax,k^{−1} ≤ [Σ_{i=1}^{Jk} (yik − f∼k(xik))²]^{−1} Σ_{i=1}^{Jk} (yik − fk(xik))² ≤ λmin,k^{−1} ≤ 1    (7)\n\nwhere Γk = [I + Φk^T (ρI + Φ̃k Φ̃k^T)^{−1} Φk]² with Φk = [φ(x1k), . . . , φ(xJk,k)] and Φ̃k = [φ(xJk+1,k), . . . , φ(xJk+J̃k,k)], and λmax,k and λmin,k are respectively the largest and smallest eigenvalues of Γk.\n\nSpecializing Theorem 2 to the case J̃k = 0, we have\n\nCorollary 1 Let there be K tasks, and let the data set of the k-th task be Dk = {(xik, yik)}_{i=1}^{Jk}. Let the RBF network, whose output nodes are characterized by fk(·) for task k = 1, . . . , K, have given basis functions (hidden nodes) φ(·) = [1, φ1(·), · · · , φN(·)]^T, and let the hidden-to-output weights of task k be trained with Dk. 
Then for k = 1, · · · , K, the squared error committed on Dk by fk(·) is bounded as 0 ≤ [det Γk]^{−1} ≤ λmax,k^{−1} ≤ [Σ_{i=1}^{Jk} yik²]^{−1} Σ_{i=1}^{Jk} (yik − fk(xik))² ≤ λmin,k^{−1} ≤ 1, where Γk = (I + ρ^{−1} Φk^T Φk)² with Φk = [φ(x1,k), . . . , φ(xJk,k)], and λmax,k and λmin,k are respectively the largest and smallest eigenvalues of Γk.\n\nIt is evident from the properties of the matrix determinant [7] and the definition of Φk that det Γk = [det(ρI + Φk Φk^T)]² [det(ρI)]^{−2} = [det(ρI + Σ_{i=1}^{Jk} φik φik^T)]² [det(ρI)]^{−2}. Using (3) we write succinctly det Γk = [det Ak]² [det(ρI)]^{−2}. We are interested in selecting the basis functions φ that minimize the error, before seeing the y’s. By Corollary 1 and the equation det Γk = [det Ak]² [det(ρI)]^{−2}, the squared error is lower bounded by Σ_{i=1}^{Jk} yik² [det(ρI)]² [det Ak]^{−2}. Instead of minimizing the error directly, we minimize its lower bound. As Σ_{i=1}^{Jk} yik² does not depend on φ, this amounts to selecting φ to minimize (det Ak)^{−2}. To minimize the errors for all tasks k = 1, · · · , K, we select φ to minimize Π_{k=1}^{K} (det Ak)^{−2}.\n\nThe selection proceeds in a sequential manner. Suppose we have selected basis functions φ = [1, φ1, · · · , φN]^T. The associated A matrices are Ak = Σ_{i=1}^{Jk} φik φik^T + ρ I(N+1)×(N+1), k = 1, · · · , K. Augmenting the basis functions to [φ^T, φN+1]^T, the A matrices change to Ak^new = Σ_{i=1}^{Jk} [φik^T, φN+1,ik]^T [φik^T, φN+1,ik] + ρ I(N+2)×(N+2). 
Using the determinant formula for block matrices [7], we get Π_{k=1}^{K} (det Ak^new)^{−2} = Π_{k=1}^{K} (qk det Ak)^{−2}, where qk is the same as in (6). As Ak does not depend on φN+1, the left-hand side is minimized by maximizing Π_{k=1}^{K} qk². The selection is easily implemented by making the following two minor modifications in Table 1: (a) in step 2, compute e0 = Σ_{k=1}^{K} ln (Jk + ρ)^{−2}; (b) in step 3, compute δe(φ, φ̂nm) = Σ_{k=1}^{K} ln qk². Employing the logarithm is for gaining additivity, and it does not affect the maximization.\n\nBased on the basis functions φ determined above, we proceed to selecting the data to be supervised and determining the hidden-to-output weights w from the supervised data, using the equations in (3). The selection of data is based on an iterative use of the following corollary, which is a specialization of Theorem 2 and was originally given in [8].\n\nCorollary 2 Let there be K tasks, and let the data set of the k-th task be Dk = {(xik, yik)}_{i=1}^{Jk}. Let there be two RBF networks, whose output nodes are characterized by fk(·) and f+k(·), respectively, for task k = 1, . . . , K. The two networks have the same given basis functions φ(·) = [1, φ1(·), · · · , φN(·)]^T, but different hidden-to-output weights. The weights of fk(·) are trained with Dk, while the weights of f+k(·) are trained using D+k = Dk ∪ {(xJk+1,k, yJk+1,k)}. 
Then for k = 1, · · · , K, the squared errors committed on (xJk+1,k, yJk+1,k) by fk(·) and f+k(·) are related by\n\n[f+k(xJk+1,k) − yJk+1,k]² = [γ(xJk+1,k)]^{−1} [fk(xJk+1,k) − yJk+1,k]²\n\nwhere γ(xJk+1,k) = [1 + φ^T(xJk+1,k) Ak^{−1} φ(xJk+1,k)]² ≥ 1 and Ak = ρI + Σ_{i=1}^{Jk} φ(xik) φ^T(xik) is the same as in (3).\n\nTwo observations are made from Corollary 2. First, if γ(xJk+1,k) ≈ 1, seeing yJk+1,k does not affect the error on xJk+1,k, indicating that Dk already contains sufficient information about (xJk+1,k, yJk+1,k). Second, if γ(xJk+1,k) ≫ 1, seeing yJk+1,k greatly decreases the error on xJk+1,k, indicating that xJk+1,k is significantly dissimilar (novel) to Dk and must be supervised to reduce the error. Based on Corollary 2, the selection proceeds sequentially. Suppose we have selected data Dk = {(xik, yik)}_{i=1}^{Jk}, from which we compute Ak. We select the next data point as xJk+1,k = arg max over i > Jk, k = 1, · · · , K of γ(xik) = [1 + φ^T(xik) Ak^{−1} φ(xik)]². After xJk+1,k is selected, Ak is updated and the next selection begins. As the iteration advances, γ decreases until it reaches convergence. We use (3) to compute w from the selected x’s and their associated targets y, completing the learning of the RBF network.\n\n5 Experimental Results\n\nIn this section we compare the multi-task RBF network against single-task RBF networks via experimental studies. We consider three types of RBF networks to learn K tasks, each with its own data set Dk. In the first, which we call the “one RBF network”, we let the K tasks share both the basis functions φ (hidden nodes) and the hidden-to-output weights w; thus we do not distinguish the K tasks and design a single RBF network to learn their union. 
The second is the multi-task RBF network, where the K tasks share the same φ but each has its own w. In the third, we have K independent networks, each designed for a single task.\n\nWe use a school data set from the Inner London Education Authority, consisting of examination records of 15362 students from 139 secondary schools. The data are available at http://multilevel.ioe.ac.uk/intro/datasets.html. This data set was originally used to study the effectiveness of schools and has recently been used to evaluate multi-task algorithms [2,3]. The goal is to predict the exam scores of the students based on 9 variables: year of exam (1985, 1986, or 1987), school code (1-139), FSM (percentage of students eligible for free school meals), VR1 band (percentage of students in school in VR band one), gender, VR band of student (3 categories), ethnic group of student (11 categories), school gender (male, female, or mixed), and school denomination (3 categories). We consider each school a task, leading to 139 tasks in total. The remaining 8 variables are used as inputs to the RBF network. Following [2,3], we converted each categorical variable to a number of binary variables, resulting in a total of 27 input variables, i.e., x ∈ R27. The exam score is the target to be predicted.\n\nThe three types of RBF networks defined above are designed as follows. The multi-task RBF network is implemented with the structure shown in Figure 1 and trained with the learning algorithm in Table 1. The “one RBF network” is implemented as a special case of Figure 1, with a single output node, and trained using the union of supervised data from all 139 schools. We design 139 independent RBF networks, each of which is implemented with a single output node and trained using the supervised data from a single school. 
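One step of the sequential data-selection rule of Section 4 (Corollary 2), which is later used to build the training sets for the active-learning comparison, can be sketched as follows (a minimal NumPy sketch of ours; names are illustrative):

```python
import numpy as np

def select_most_novel(A, Phi_cand):
    """Return the index of the candidate maximizing the novelty score
    gamma(x) = (1 + phi(x)^T A^{-1} phi(x))^2, plus the rank-one-updated
    A <- A + phi phi^T for the next iteration.
    A: (N+1, N+1) matrix of eq. (3); Phi_cand: (M, N+1), rows phi(x)^T."""
    Ainv = np.linalg.inv(A)
    # quadratic form phi^T A^{-1} phi for every candidate row at once
    gamma = (1.0 + np.einsum('ij,jk,ik->i', Phi_cand, Ainv, Phi_cand)) ** 2
    i = int(np.argmax(gamma))
    return i, A + np.outer(Phi_cand[i], Phi_cand[i])
```

Iterating this until the maximal gamma approaches 1 (so that the remaining candidates are no longer novel) yields the supervised subset; the targets of the selected points are then used in (3) to compute the weights.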
We use the Gaussian RBF φn(x) = exp(−||x − cn||²/(2σn²)), where the cn’s are selected from training data points and the σn’s are initialized as 20 and optimized as described in Table 1. The main role of the regularization parameter ρ is to prevent the A matrices from being singular, and it does not affect the results seriously. In the results reported here, ρ is set to 10^−6.\n\nFollowing [2,3], we randomly take 75% of the 15362 data points as training (supervised) data and the remaining 25% as test data. The generalization performance is measured by the squared error (fk(xik) − yik)² averaged over all test data xik of tasks k = 1, · · · , K. We made 10 independent trials to randomly split the data into training and test sets, and the squared error averaged over the test data of all 139 schools and the trials is shown in Table 2, for the three types of RBF networks.\n\nTable 2: Squared error averaged over the test data of all 139 schools and the 10 independent trials for randomly splitting the school data into training (75%) and testing (25%) sets.\n\nMulti-task RBF network: 109.89 ± 1.8167 | Independent RBF networks: 136.41 ± 7.0081 | One RBF network: 149.48 ± 2.8093\n\nTable 2 clearly shows that the multi-task RBF network outperforms the other two types of RBF networks by a considerable margin. The “one RBF network” ignores the differences between the tasks and the independent RBF networks ignore the tasks’ correlations; therefore they both perform inferiorly. The multi-task RBF network uses the shared hidden nodes (basis functions) to capture the common internal representation of the tasks, and meanwhile uses the independent hidden-to-output weights to learn the statistics specific to each task.\n\nWe now demonstrate the results of active learning. 
We use the method in Section 4 to actively split the data into training and test sets using a two-step procedure. First we learn the basis functions φ of the multi-task RBF network using all 15362 data points (unsupervised). Based on φ, we then select the data to be supervised and use them as training data to learn the hidden-to-output weights w. To make the results comparable, we use the same training data to learn the other two types of RBF networks (including learning their own φ and w). The networks are then tested on the remaining data.\n\nFigure 2 shows the results of active learning. Each curve is the squared error averaged over the test data of all 139 schools, as a function of the number of training data. It is clear that the multi-task RBF network maintains its superior performance all the way down to 5000 training data points, whereas the independent RBF networks have their performance degraded seriously as the training data diminish. This demonstrates the increasing advantage of multi-task learning as the number of training data decreases. The “one RBF network” also seems insensitive to the number of training data, but it ignores the inherent dissimilarity between the tasks, which makes its performance inferior.\n\n[Figure 2 here: squared error averaged over test data (y-axis, roughly 100 to 260) versus the number of training (supervised) data (x-axis, 5000 to 12000), with curves for the multi-task RBF network, the independent RBF networks, and the “one RBF network”.]\n\nFigure 2: Squared error averaged over the test data of all 139 schools, as a function of the number of training (supervised) data. 
The data are split into training and test sets via active learning.\n\n6 Conclusions\n\nWe have presented the structure and learning algorithms for multi-task learning with the radial basis function (RBF) network. By letting multiple tasks share the basis functions (hidden nodes) we impose a common internal representation on the correlated tasks. Exploiting the inter-task correlation yields a more compact network structure with enhanced generalization ability. Unsupervised learning of the network structure enables us to actively split the data into training and test sets. Since the data most novel relative to those already selected are selected next, what finally remain unselected (and are used for testing) are all similar to the selected data constituting the training set. This improves the generalization of the resulting network to the test data. These conclusions are substantiated via results on real multi-task data.\n\nReferences\n\n[1] R. Caruana (1997). Multitask learning. Machine Learning, 28:41-75.\n\n[2] B. Bakker and T. Heskes (2003). Task clustering and gating for Bayesian multitask learning. Journal of Machine Learning Research, 4:83-99.\n\n[3] T. Evgeniou, C. A. Micchelli, and M. Pontil (2005). Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637.\n\n[4] M. Powell (1987). Radial basis functions for multivariable interpolation: a review. In J. C. Mason and M. G. Cox, eds., Algorithms for Approximation, pp. 143-167.\n\n[5] S. Chen, C. F. N. Cowan, and P. M. Grant (1991). Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2(2):302-309.\n\n[6] D. A. Cohn, Z. Ghahramani, and M. I. Jordan (1995). Active learning with statistical models. Advances in Neural Information Processing Systems, 7:705-712.\n\n[7] V. Fedorov (1972). Theory of Optimal Experiments. Academic Press.\n\n[8] M. 
Stone (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111-147.\n\nAppendix\n\nProof of Theorem 1: Let φnew = [φ^T, φN+1]^T. By (3), the A matrices corresponding to φnew are\n\nAk^new = Σ_{i=1}^{Jk} [φik^T, φN+1,ik]^T [φik^T, φN+1,ik] + ρ I(N+2)×(N+2) = [Ak, ck; ck^T, dk]    (A-1)\n\nwhere ck and dk are as in (6). By the conditions of the theorem, the matrices Ak and Ak^new are all non-degenerate. Using the block matrix inversion formula [7] we get\n\n(Ak^new)^{−1} = [Ak^{−1} + Ak^{−1} ck qk^{−1} ck^T Ak^{−1}, −Ak^{−1} ck qk^{−1}; −qk^{−1} ck^T Ak^{−1}, qk^{−1}]    (A-2)\n\nwhere qk is as in (6). By (3), the weights wk^new corresponding to [φ^T, φN+1]^T are\n\nwk^new = (Ak^new)^{−1} [Σ_{i=1}^{Jk} yik φik; Σ_{i=1}^{Jk} yik φN+1,ik] = [wk + Ak^{−1} ck qk^{−1} gk; −qk^{−1} gk]    (A-3)\n\nwith gk = ck^T wk − Σ_{i=1}^{Jk} yik φN+1,ik. Hence, (φik^new)^T wk^new = φik^T wk + (φik^T Ak^{−1} ck − φN+1,ik) qk^{−1} gk, which is put into (4) to get e(φnew) = Σ_{k=1}^{K} Σ_{i=1}^{Jk} [yik² − yik (φik^new)^T wk^new] = e(φ) − Σ_{k=1}^{K} (ck^T wk − Σ_{i=1}^{Jk} yik φN+1,ik) gk qk^{−1} = e(φ) − Σ_{k=1}^{K} gk² qk^{−1}, where in arriving at the last equality we have used (3), (4), and the definition of gk. The theorem is proved. □\n\nProof of Theorem 2: The proof applies to k = 1, · · · , K. For any given k, define Φk = [φ(x1k), . . . , φ(xJk,k)], Φ̃k = [φ(xJk+1,k), . . . , φ(xJk+J̃k,k)], yk = [y1k, . . . , yJk,k]^T, ỹk = [yJk+1,k, . . . , yJk+J̃k,k]^T, fk = [fk(x1k), . . . , fk(xJk,k)]^T, f∼k = [f∼k(x1k), . . . , f∼k(xJk,k)]^T, and Ãk = ρI + Φ̃k Φ̃k^T. By (1), (3), and the conditions of the theorem, fk = Φk^T (Ãk + Φk Φk^T)^{−1} (Φk yk + Φ̃k ỹk) = (I + Φk^T Ãk^{−1} Φk)^{−1} Φk^T Ãk^{−1} (Φ̃k ỹk + Φk yk)  (a), where equation (a) is due to the Sherman-Morrison-Woodbury formula. Since f∼k = Φk^T Ãk^{−1} Φ̃k ỹk, we further obtain fk = (I + Φk^T Ãk^{−1} Φk)^{−1} (f∼k + Φk^T Ãk^{−1} Φk yk) = yk + (I + Φk^T Ãk^{−1} Φk)^{−1} (f∼k − yk)  (b), where equation (b) results by writing Φk^T Ãk^{−1} Φk yk = (I + Φk^T Ãk^{−1} Φk) yk − yk. Hence, fk − yk = (I + Φk^T Ãk^{−1} Φk)^{−1} (f∼k − yk), which gives\n\nΣ_{i=1}^{Jk} (yik − fk(xik))² = (fk − yk)^T (fk − yk) = (f∼k − yk)^T Γk^{−1} (f∼k − yk)    (A-4)\n\nwhere Γk = [I + Φk^T Ãk^{−1} Φk]² = [I + Φk^T (ρI + Φ̃k Φ̃k^T)^{−1} Φk]². By construction, Γk has all its eigenvalues no less than 1, i.e., Γk = Ek^T diag[λ1k, · · · , λJk,k] Ek with Ek^T Ek = I and λ1k, · · · , λJk,k ≥ 1, which makes the first, second, and last inequalities in (7) hold. Using this expansion of Γk in (A-4) we get\n\nΣ_{i=1}^{Jk} (fk(xik) − yik)² = (f∼k − yk)^T Ek^T diag[λ1k^{−1}, . . . , λJk,k^{−1}] Ek (f∼k − yk) ≤ (f∼k − yk)^T Ek^T [λmin,k^{−1} I] Ek (f∼k − yk) = λmin,k^{−1} Σ_{i=1}^{Jk} (f∼k(xik) − yik)²    (A-5)\n\nwhere the inequality results because λmin,k = min(λ1k, · · · , λJk,k). From (A-5) follows the fourth inequality in (7). The third inequality in (7) can be proven in a similar way. □\n", "award": [], "sourceid": 2907, "authors": [{"given_name": "Xuejun", "family_name": "Liao", "institution": null}, {"given_name": "Lawrence", "family_name": "Carin", "institution": null}]}