{"title": "Learning Model Bias", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 175, "abstract": null, "full_text": "Learning Model Bias \n\nJonathan Baxter \n\nDepartment of Computer Science \n\nRoyal Holloway College, University of London \n\njon@dcs.rhbnc.ac.uk \n\nAbstract \n\nIn this paper the problem of learning appropriate domain-specific \nbias is addressed. It is shown that this can be achieved by learning \nmany related tasks from the same domain, and a theorem is given \nbounding the number of tasks that must be learnt. A corollary of the \ntheorem is that if the tasks are known to possess a common internal \nrepresentation or preprocessing then the number of examples \nrequired per task for good generalisation when learning $n$ tasks \nsimultaneously scales like $O(a + b/n)$, where $O(a)$ is a bound on the \nminimum number of examples required to learn a single task, and \n$O(a + b)$ is a bound on the number of examples required to learn \neach task independently. An experiment providing strong qualitative \nsupport for the theoretical results is reported. \n\n1 Introduction \n\nIt has been argued (see [6]) that the main problem in machine learning is the biasing \nof a learner's hypothesis space sufficiently well to ensure good generalisation from \na small number of examples. Once suitable biases have been found the actual \nlearning task is relatively trivial. Existing methods of bias generally require the \ninput of a human expert in the form of heuristics, hints [1], domain knowledge, \netc. Such methods are clearly limited by the accuracy and reliability of the expert's \nknowledge and also by the extent to which that knowledge can be transferred to the \nlearner. Here I attempt to solve some of these problems by introducing a method \nfor automatically learning the bias. 
\n\nThe central idea is that in many learning problems the learner is typically embedded \nwithin an environment or domain of related learning tasks and that the \nbias appropriate for a single task is likely to be appropriate for other tasks within \nthe same environment. A simple example is the problem of handwritten character \nrecognition. A preprocessing stage that identifies and removes any (small) rotations, \ndilations and translations of an image of a character will be advantageous for \nrecognising all characters. If the set of all individual character recognition problems \nis viewed as an environment of learning tasks, this preprocessor represents a bias \nthat is appropriate to all tasks in the environment. It is likely that there are many \nother currently unknown biases that are also appropriate for this environment. We \nwould like to be able to learn these automatically. \n\nBias that is appropriate for all tasks must be learnt by sampling from many tasks. \nIf only a single task is learnt then the bias extracted is likely to be specific to that \ntask. For example, if a network is constructed as in figure 1 and the output nodes \nare simultaneously trained on many similar problems, then the hidden layers are \nmore likely to be useful in learning a novel problem of the same type than if only a \nsingle problem is learnt. In the rest of this paper I develop a general theory of bias \nlearning based upon the idea of learning multiple related tasks. The theory shows \nthat a learner's generalisation performance can be greatly improved by learning \nrelated tasks and that if sufficiently many tasks are learnt the learner's bias can be \nextracted and used to learn novel tasks. \n\nOther authors that have empirically investigated the idea of learning multiple related \ntasks include [5] and [8]. 
\n\n2 Learning Bias \n\nFor the sake of argument I consider learning problems that amount to minimizing \nthe mean squared error of a function $h$ over some training set $D$. A more general \nformulation based on statistical decision theory is given in [3]. Thus, it is assumed \nthat the learner receives a training set of (possibly noisy) input-output pairs \n$D = \\{(x_1, y_1), \\ldots, (x_m, y_m)\\}$, drawn according to a probability distribution $P$ on $X \\times Y$ \n($X$ being the input space and $Y$ being the output space) and searches through its \nhypothesis space $\\mathcal{H}$ for a function $h\\colon X \\to Y$ minimizing the empirical error, \n\n$$\\hat{E}(h, D) = \\frac{1}{m} \\sum_{i=1}^{m} (h(x_i) - y_i)^2. \\tag{1}$$ \n\nThe true error or generalisation error of $h$ is the expected error under $P$: \n\n$$E(h, P) = \\int_{X \\times Y} (h(x) - y)^2 \\, dP(x, y). \\tag{2}$$ \n\nThe hope of course is that an $h$ with a small empirical error on a large enough \ntraining set will also have a small true error, i.e. it will generalise well. \nI model the environment of the learner as a pair $(\\mathcal{P}, Q)$ where $\\mathcal{P} = \\{P\\}$ is a set of \nlearning tasks and $Q$ is a probability measure on $\\mathcal{P}$. The learner is now supplied \nnot with a single hypothesis space $\\mathcal{H}$ but with a hypothesis space family $\\mathbb{H} = \\{\\mathcal{H}\\}$. \nEach $\\mathcal{H} \\in \\mathbb{H}$ represents a different bias the learner has about the environment. For \nexample, one $\\mathcal{H}$ may contain functions that are very smooth, whereas another $\\mathcal{H}$ \nmight contain more wiggly functions. Which hypothesis space is best will depend \non the kinds of functions in the environment. To determine the best $\\mathcal{H} \\in \\mathbb{H}$ for \n$(\\mathcal{P}, Q)$, we provide the learner not with a single training set $D$ but with $n$ such \ntraining sets $D_1, \\ldots, D_n$. Each $D_i$ is generated by first sampling from $\\mathcal{P}$ according \nto $Q$ to give $P_i$ and then sampling $m$ times from $X \\times Y$ according to $P_i$ to give \n$D_i = \\{(x_{i1}, y_{i1}), \\ldots, (x_{im}, y_{im})\\}$. The learner searches for the hypothesis space \n$\\mathcal{H} \\in \\mathbb{H}$ with minimal empirical error on $D_1, \\ldots, D_n$, where this is defined by \n\n$$\\hat{E}^*(\\mathcal{H}, D_1, \\ldots, D_n) = \\frac{1}{n} \\sum_{i=1}^{n} \\inf_{h \\in \\mathcal{H}} \\hat{E}(h, D_i). \\tag{3}$$ \n\nFigure 1: Net for learning multiple tasks. Input $x_{ij}$ from training set $D_i$ is propagated \nforwards through the internal representation $f$ and then only through the \noutput network $g_i$. The error $[g_i(f(x_{ij})) - y_{ij}]^2$ is similarly backpropagated only \nthrough the output network $g_i$ and then $f$. Weight updates are performed after all \ntraining sets $D_1, \\ldots, D_n$ have been presented. \n\nThe hypothesis space $\\mathcal{H}$ with smallest empirical error is the one that is best able to \nlearn the $n$ data sets on average. \n\nThere are two ways of measuring the true error of a bias learner. The first is how \nwell it generalises on the $n$ tasks $P_1, \\ldots, P_n$ used to generate the training sets. \nAssuming that in the process of minimising (3) the learner generates $n$ functions \n$h_1, \\ldots, h_n \\in \\mathcal{H}$ with minimal empirical error on their respective training sets$^1$, the \nlearner's true error is measured by: \n\n$$E^n(h_1, \\ldots, h_n, P_1, \\ldots, P_n) = \\frac{1}{n} \\sum_{i=1}^{n} E(h_i, P_i). \\tag{4}$$ \n\nNote that in this case the learner's empirical error is given by \n$\\hat{E}^n(h_1, \\ldots, h_n, D_1, \\ldots, D_n) = \\frac{1}{n} \\sum_{i=1}^{n} \\hat{E}(h_i, D_i)$. The second way \nof measuring the generalisation error of a bias learner is to determine how good $\\mathcal{H}$ \nis for learning novel tasks drawn from the environment $(\\mathcal{P}, Q)$: \n\n$$E^*(\\mathcal{H}, Q) = \\int_{\\mathcal{P}} \\inf_{h \\in \\mathcal{H}} E(h, P) \\, dQ(P). \\tag{5}$$ \n\nA learner that has found an $\\mathcal{H}$ with a small value of (5) can be said to have learnt \nto learn the tasks in $\\mathcal{P}$ in general. To state the bounds ensuring these two types of \ngeneralisation a few more definitions must be introduced. \n\nDefinition 1 Let $\\mathbb{H} = \\{\\mathcal{H}\\}$ be a hypothesis space family. Let $\\mathbb{H}_\\sigma = \\{h \\in \\mathcal{H} : \\mathcal{H} \\in \\mathbb{H}\\}$. \nFor any $h\\colon X \\to Y$, define a map $h\\colon X \\times Y \\to [0,1]$ by $h(x, y) = (h(x) - y)^2$. \nNote the abuse of notation: $h$ stands for two different functions depending on its \nargument. 
Given a sequence of $n$ functions $\\vec{h} = (h_1, \\ldots, h_n)$ let $\\vec{h}\\colon (X \\times Y)^n \\to [0,1]$ \nbe the function $(x_1, y_1, \\ldots, x_n, y_n) \\mapsto \\frac{1}{n} \\sum_{i=1}^{n} h_i(x_i, y_i)$. Let $\\mathcal{H}^n$ be the set of all such \nfunctions where the $h_i$ are all chosen from $\\mathcal{H}$. Let $\\mathbb{H}^n = \\{\\mathcal{H}^n : \\mathcal{H} \\in \\mathbb{H}\\}$. For each \n$\\mathcal{H} \\in \\mathbb{H}$ define $\\mathcal{H}^*\\colon \\mathcal{P} \\to [0,1]$ by $\\mathcal{H}^*(P) = \\inf_{h \\in \\mathcal{H}} E(h, P)$ and let $\\mathbb{H}^* = \\{\\mathcal{H}^* : \\mathcal{H} \\in \\mathbb{H}\\}$. \n\n1 This assumes the infimum in (3) is attained. \n\nDefinition 2 Given a set of functions $\\mathcal{H}$ from any space $Z$ to $[0, 1]$, and any probability \nmeasure $P$ on $Z$, define the pseudo-metric $d_P$ on $\\mathcal{H}$ by \n\n$$d_P(h, h') = \\int_Z |h(z) - h'(z)| \\, dP(z).$$ \n\nDenote the size of the smallest $\\varepsilon$-cover of $(\\mathcal{H}, d_P)$ by $\\mathcal{N}(\\varepsilon, \\mathcal{H}, d_P)$. Define the $\\varepsilon$-capacity of $\\mathcal{H}$ \nby \n\n$$C(\\varepsilon, \\mathcal{H}) = \\sup_P \\mathcal{N}(\\varepsilon, \\mathcal{H}, d_P)$$ \n\nwhere the supremum is over all discrete probability measures $P$ on $Z$. \n\nDefinition 2 will be used to define the $\\varepsilon$-capacity of spaces such as $\\mathbb{H}^*$ and $[\\mathbb{H}^n]_\\sigma$, \nwhere from definition 1 the latter is $[\\mathbb{H}^n]_\\sigma = \\{\\vec{h} \\in \\mathcal{H}^n : \\mathcal{H} \\in \\mathbb{H}\\}$. \nThe following theorem bounds the number of tasks and examples per task required \nto ensure that the hypothesis space learnt by a bias learner will, with high probability, \ncontain good solutions to novel tasks in the same environment$^2$. \n\nTheorem 1 Let the $n$ training sets $D_1, \\ldots, D_n$ be generated by sampling $n$ times \nfrom the environment $\\mathcal{P}$ according to $Q$ to give $P_1, \\ldots, P_n$, and then sampling $m$ \ntimes from each $P_i$ to generate $D_i$. Let $\\mathbb{H} = \\{\\mathcal{H}\\}$ be a hypothesis space family and \nsuppose a learner chooses $\\mathcal{H} \\in \\mathbb{H}$ minimizing (3) on $D_1, \\ldots, D_n$. For all $\\varepsilon > 0$ and \n$0 < \\delta < 1$, if \n\n$n$ \n\nand $m$ \n\nthen \n\nThe bound on $m$ in theorem 1 is also the number of examples required per \ntask to ensure generalisation of the first kind mentioned above. That is, it is \nthe number of examples required in each data set $D_i$ to ensure good generalisation \non average across all $n$ tasks when using the hypothesis space family $\\mathbb{H}$. 
\nIf we let $m(\\mathbb{H}, n, \\varepsilon, \\delta)$ be the number of examples required per task to ensure that \n\n$$\\Pr\\{D_1, \\ldots, D_n : |\\hat{E}^n(h_1, \\ldots, h_n, D_1, \\ldots, D_n) - E^n(h_1, \\ldots, h_n, P_1, \\ldots, P_n)| > \\varepsilon\\} < \\delta,$$ \n\nwhere all $h_i \\in \\mathcal{H}$ for some fixed $\\mathcal{H} \\in \\mathbb{H}$, then \n\n$$G(\\mathbb{H}, n, \\varepsilon, \\delta) = \\frac{m(\\mathbb{H}, 1, \\varepsilon, \\delta)}{m(\\mathbb{H}, n, \\varepsilon, \\delta)}$$ \n\nrepresents the advantage in learning $n$ tasks as opposed to one task (the ordinary \nlearning scenario). Call $G(\\mathbb{H}, n, \\varepsilon, \\delta)$ the $n$-task gain of $\\mathbb{H}$. Using the fact [3] that \n\n$$C(\\varepsilon, \\mathbb{H}_\\sigma) \\le C(\\varepsilon, [\\mathbb{H}^n]_\\sigma) \\le C(\\varepsilon, \\mathbb{H}_\\sigma)^n,$$ \n\nand the formula for $m$ from theorem 1, we have \n\n$$1 \\le G(\\mathbb{H}, n, \\varepsilon, \\delta) \\le n.$$ \n\n2 The bounds in theorem 1 can be improved to $O(\\frac{1}{\\varepsilon})$ if all $\\mathcal{H} \\in \\mathbb{H}$ are convex and the \nerror is the squared loss [7]. \n\nThus, at least in the worst case analysis here, learning $n$ tasks in the same environment \ncan result in anything from no gain at all to an $n$-fold reduction in the number \nof examples required per task. In the next section a very intuitive analysis of the \nconditions leading to the extreme values of $G(\\mathbb{H}, n, \\varepsilon, \\delta)$ is given for the situation \nwhere an internal representation is being learnt for the environment. I will also say \nmore about the bound on the number of tasks ($n$) in theorem 1. \n\n3 Learning Internal Representations with Neural Networks \n\nIn figure 1 $n$ tasks are being learnt using a common representation $f$. In this \ncase $[\\mathbb{H}^n]_\\sigma$ is the set of all possible networks formed by choosing the weights in the \nrepresentation and output networks. $\\mathbb{H}_\\sigma$ is the same space with a single output node. \nIf the $n$ tasks were learnt independently (i.e. without a common representation) then \neach task would use its own copy of $\\mathbb{H}_\\sigma$, i.e. we wouldn't be forcing the tasks to all \nuse the same representation. \n\nLet $W_R$ be the total number of weights in the representation network and $W_O$ \nbe the number of weights in an individual output network. 
Suppose also that all \nthe nodes in each network are Lipschitz bounded$^3$. Then it can be shown [3] that \n$\\ln C(\\varepsilon, [\\mathbb{H}^n]_\\sigma) = O((W_O + \\frac{W_R}{n}) \\ln \\frac{1}{\\varepsilon})$ and $\\ln C(\\varepsilon, \\mathbb{H}^*) = O(W_R \\ln \\frac{1}{\\varepsilon})$. Substituting \nthese bounds into theorem 1 shows that to generalise well on average on $n$ tasks \nusing a common representation requires $m = O(\\frac{1}{\\varepsilon^2}[(W_O + \\frac{W_R}{n}) \\ln \\frac{1}{\\varepsilon} + \\frac{1}{n} \\ln \\frac{1}{\\delta}]) = O(a + \\frac{b}{n})$ \nexamples of each task. In addition, if $n = O(\\frac{1}{\\varepsilon^2} W_R \\ln \\frac{1}{\\varepsilon})$ then with high \nprobability the resulting representation will be good for learning novel tasks from \nthe same environment. Note that this bound is very large. However it results from a \nworst-case analysis and so is highly likely to be beaten in practice. This is certainly \nborne out by the experiment in the next section. \nThe learning gain $G(\\mathbb{H}, n, \\varepsilon)$ satisfies $G(\\mathbb{H}, n, \\varepsilon) \\approx \\frac{W_O + W_R}{W_O + W_R/n}$. Thus, if $W_R \\gg W_O$, \n$G \\approx n$, while if $W_O \\gg W_R$ then $G \\approx 1$. This is perfectly intuitive: when $W_O \\gg W_R$ \nthe representation network is hardly doing any work; most of the power of \nthe network is in the output networks and hence the tasks are effectively being \nlearnt independently. However, if $W_R \\gg W_O$ then the representation network \ndominates; there is very little extra learning to be done for the individual tasks \nonce the representation is known, and so each example from every task is providing \nfull information to the representation network. Hence the gain of $n$. \n\nNote that once a representation has been learnt the sampling burden for learning a \nnovel task will be reduced to $m = O(\\frac{1}{\\varepsilon^2}[W_O \\ln \\frac{1}{\\varepsilon} + \\ln \\frac{1}{\\delta}])$ because only the output \nnetwork has to be learnt. 
If this theory applies to human learning then the fact \nthat we are able to learn words, faces, characters, etc. with relatively few examples \n(a single example in the case of faces) indicates that our \"output networks\" are very \nsmall, and, given our large ignorance concerning an appropriate representation, the \nrepresentation network for learning in these domains would have to be large, so we \nwould expect to see an $n$-task gain of nearly $n$ for learning within these domains. \n\n3 A node $a\\colon \\mathbb{R}^p \\to \\mathbb{R}$ is Lipschitz bounded if there exists a constant $c$ such that \n$|a(x) - a(x')| < c\\|x - x'\\|$ for all $x, x' \\in \\mathbb{R}^p$. Note that this rules out threshold nodes, but sigmoid \nsquashing functions are okay as long as the weights are bounded. \n\n4 Experiment: Learning Symmetric Boolean Functions \n\nIn this section the results of an experiment are reported in which a neural network \nwas trained to learn symmetric$^4$ Boolean functions. The network was the same as \nthe one in figure 1 except that the output networks $g_i$ had no hidden layers. The \ninput space $X = \\{0, 1\\}^{10}$ was restricted to include only those inputs with between \none and four ones. The functions in the environment of the network consisted of all \npossible symmetric Boolean functions over the input space, except the trivial \n\"constant 0\" and \"constant 1\" functions. Training sets $D_1, \\ldots, D_n$ were generated by \nfirst choosing $n$ functions (with replacement) uniformly from the fourteen possible, \nand then choosing $m$ input vectors by choosing a random number between 1 and 4 \nand placing that many 1's at random in the input vector. The training sets were \nlearnt by minimising the empirical error (3) using the backpropagation algorithm \nas outlined in figure 1. Separate simulations were performed with $n$ ranging from \n1 to 21 in steps of four and $m$ ranging from 1 to 171 in steps of 10. 
Further details \nof the experimental procedure may be found in [3], chapter 4. \n\nOnce the network had successfully learnt the $n$ training sets its generalisation ability \nwas tested on all $n$ functions used to generate the training set. In this case the \ngeneralisation error (equation (4)) could be computed exactly by calculating the \nnetwork's output (for all $n$ functions) for each of the 385 input vectors. The generalisation \nerror as a function of $n$ and $m$ is plotted in figure 2 for two independent \nsets of simulations. Both simulations support the theoretical result that the number \nof examples $m$ required for good generalisation decreases with increasing $n$ (cf. theorem 1). \n\nFigure 2: Learning surfaces for two independent simulations. \n\nFor training sets $D_1, \\ldots, D_n$ that led to a generalisation error of less than \n0.01, the representation network $f$ was extracted and tested for its true error, where \nthis is defined as in equation (5) (the hypothesis space $\\mathcal{H}$ is the set of all networks \nformed by attaching any output network to the fixed representation network $f$). \nAlthough there is insufficient space to show the representation error here (see [3] \nfor the details), it was found that the representation error monotonically decreased \nwith the number of tasks learnt, verifying the theoretical conclusions. \n\nThe representation's output for all inputs is shown in figure 3 for sample sizes \n$(n, m) = (1, 131)$, $(5, 31)$ and $(13, 31)$. All outputs corresponding to inputs from \nthe same category (i.e. the same number of ones) are labelled with the same symbol. 
\nThe network in the $n = 1$ case generalised perfectly but the resulting representation \ndoes not capture the symmetry in the environment and also does not distinguish \nthe inputs with 2, 3 and 4 \"1's\" (because the function learnt didn't), showing that \nlearning a single function is not sufficient to learn an appropriate representation. \nBy $n = 5$ the representation's behaviour has improved (the inputs with differing \nnumbers of 1's are now well separated, but they are still spread around a lot) and \nby $n = 13$ it is perfect. As well as reducing the sampling burden for the $n$ tasks in \nthe training set, a representation learnt on sufficiently many tasks should be good \nfor learning novel tasks and should greatly reduce the number of examples required \nfor new tasks. This too was experimentally verified although there is insufficient \nspace to present the results here (see [3]). \n\nFigure 3: Plots of the output of a representation generated from the indicated $(n, m)$ \nsample: (1, 131), (5, 31) and (13, 31). \n\n4 A symmetric Boolean function is one that is invariant under interchange of its inputs, \nor equivalently, one that only depends on the number of \"1's\" in its input (e.g. parity). \n\n5 Conclusion \n\nI have introduced a formal model of bias learning and shown that (under mild \nrestrictions) a learner can sample sufficiently many times from sufficiently many \ntasks to learn bias that is appropriate for the entire environment. In addition, the \nnumber of examples required per task to learn $n$ tasks simultaneously was shown \nto be upper bounded by $O(a + b/n)$ for appropriate environments. See [2] for an \nanalysis of bias learning within an information-theoretic framework which leads to \nan exact $a + b/n$-type bound. \n\nReferences \n\n[1] Y. S. Abu-Mostafa. Learning from Hints in Neural Networks. 
Journal of Complexity, 6:192-198, 1989. \n\n[2] J. Baxter. A Bayesian Model of Bias Learning. Submitted to COLT 1996, 1995. \n\n[3] J. Baxter. Learning Internal Representations. PhD thesis, Department of Mathematics and Statistics, The Flinders University of South Australia, 1995. Draft copy in Neuroprose Archive under \"/pub/neuroprose/Thesis/baxter.thesis.ps.Z\". \n\n[4] J. Baxter. Learning Internal Representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, Santa Cruz, California, 1995. ACM Press. \n\n[5] R. Caruana. Learning Many Related Tasks at the Same Time with Backpropagation. In Advances in Neural Information Processing 5, 1993. \n\n[6] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Comput., 4:1-58, 1992. \n\n[7] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Sample Complexity of Agnostic Learning with Squared Loss. In preparation, 1995. \n\n[8] T. M. Mitchell and S. Thrun. Learning One More Thing. Technical Report CMU-CS-94-184, CMU, 1994. \n\n", "award": [], "sourceid": 1112, "authors": [{"given_name": "Jonathan", "family_name": "Baxter", "institution": null}]}