{"title": "Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations", "book": "Advances in Neural Information Processing Systems", "page_first": 985, "page_last": 992, "abstract": null, "full_text": "Transfer Learning using Kolmogorov Complexity:\n\nBasic Theory and Empirical Evaluations\n\nSylvian R. Ray\n\nDepartment of Computer Science\n\nM. M. Hassan Mahmud\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nUniversity of Illinois at Urbana-Champaign\n\nmmmahmud@uiuc.edu\n\nray@cs.uiuc.edu\n\nAbstract\n\nIn transfer learning we aim to solve new problems using fewer examples using\ninformation gained from solving related problems. Transfer learning has been\nsuccessful in practice, and extensive PAC analysis of these methods has been de-\nveloped. However it is not yet clear how to de\ufb01ne relatedness between tasks. This\nis considered as a major problem as it is conceptually troubling and it makes it\nunclear how much information to transfer and when and how to transfer it. In\nthis paper we propose to measure the amount of information one task contains\nabout another using conditional Kolmogorov complexity between the tasks. We\nshow how existing theory neatly solves the problem of measuring relatedness and\ntransferring the \u2018right\u2019 amount of information in sequential transfer learning in a\nBayesian setting. The theory also suggests that, in a very formal and precise sense,\nno other reasonable transfer method can do much better than our Kolmogorov\nComplexity theoretic transfer method, and that sequential transfer is always justi-\n\ufb01ed. 
We also develop a practical approximation to the method and use it to transfer information between 8 arbitrarily chosen databases from the UCI ML repository.\n\n1 Introduction\n\nThe goal of transfer learning [1] is to learn new tasks with fewer examples, given information gained from solving related tasks, with each task corresponding to the distribution/probability measure generating the samples for that task. The study of transfer is motivated by the fact that people use knowledge gained from previously solved, related problems to solve new problems more quickly. Transfer learning methods have been successful in practice: for instance, they have been used to recognize related parts of a visual scene in robot navigation tasks, to predict rewards in related regions in reinforcement learning based robot navigation problems, and to predict results of related medical tests for the same group of patients. Figure 1 shows a prototypical transfer method [1], and it illustrates some of the key ideas. The m tasks being learned are defined on the same input space, and are related by virtue of requiring the same common ‘high level features’ encoded in the hidden units. The tasks are learned in parallel – i.e. during training, the network is trained by alternating training samples from the different tasks, and the hope is that the common high level features will thereby be learned more quickly. Transfer can also be done sequentially, where information from tasks learned previously is used to speed up learning of new ones.\n\nDespite the practical successes, the key question of how one measures relatedness between tasks has, so far, eluded answer. Most current methods, including the deep PAC-theoretic analysis in [2], start by assuming that the tasks are related because they have a common near-optimal inductive bias (the common hidden units in the above example). 
As no explicit measure of relatedness is prescribed, it becomes difficult to answer questions such as how much information to transfer between tasks and when not to transfer information.\n\nFigure 1: A typical Transfer Learning Method.\n\nThere has been some work which attempts to solve these problems. [3] gives a more explicit measure of task relatedness, in which two tasks P and Q are said to be similar with respect to a given set of functions if the set contains an element f such that P(a) = Q(f(a)) for all events a. By assuming the existence of these functions, the authors are able to derive PAC sample complexity bounds for the error of each task (as opposed to the expected error, w.r.t. a distribution over the m tasks, in [2]). More interesting is the approach in [4], where the author derives PAC bounds in which the sample complexity is proportional to the joint Kolmogorov complexity [5] of the m hypotheses. So Kolmogorov complexity (see below) determines the relatedness between tasks. However, the bounds hold only for ≥ 8192 tasks (Theorem 3).\n\nIn this paper we approach the above idea from a Bayesian perspective and measure task relatedness using the conditional Kolmogorov complexity of the hypotheses. We describe the basics of the theory to show how it justifies this approach and neatly solves the problem of measuring task relatedness (details in [6; 7]). We then perform experiments to show the effectiveness of this method.\n\nLet us take a brief look at our approach. We assume that each hypothesis is represented by a program – for example, a decision tree is represented by a program that contains a data structure representing the tree, and the relevant code to compute the leaf node corresponding to a given input vector. The Kolmogorov complexity of a hypothesis h (or any other bit string) is now defined as the length of the shortest program that outputs h given no input. 
This is a measure of absolute information content of\nan individual object \u2013 in this case the hypothesis h. It can be shown that Kolmogorov complexity is\na sharper version of Information Theoretic entropy, which measures the amount of information in an\nensemble of objects with respect to a distribution over the ensemble. The conditional Kolmogorov\ncomplexity of hypothesis h given h\u2032, K(h|h\u2032), is de\ufb01ned as the length of the shortest program that\noutputs the program h given h\u2032 as input. K(h|h\u2032) measures the amount of constructive information\nh\u2032 contains about h \u2013 how much information h\u2032 contains for the purpose of constructing h. This\nis precisely what we wish to measure in transfer learning. Hence this becomes our measure of\nrelatedness for performing sequential transfer learning in the Bayesian setting.\n\nIn the Bayesian setting, any sequential transfer learning mechanism/algorithm is \u2018just\u2019 a conditional\nprior W (\u00b7|h\u2032) over the hypothesis/probability measure space, where h\u2032 is the task learned previously\n\u2013 i.e. the task we are trying to transfer information from. In this case, by setting the prior over the\nhypothesis space to be P (\u00b7|h\u2032) := 2\u2212K(\u00b7|h\u2032) we weight each candidate hypothesis by how related it\nis to previous tasks, and so we automatically transfer the right amount of information when learning\nthe new problem. We show that in a certain precise sense this prior is never much worse than\nany reasonable transfer learning prior, or any non-transfer prior. So, sequential transfer learning is\nalways justi\ufb01ed from a theoretical perspective. This result is quite unexpected as the current belief\nin the transfer learning community is that it should hurt to transfer from unrelated tasks. 
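Since K is computable only in the limit, a standard compressor can stand in for it in practice, in the spirit of the compression-based similarity of [8]. The sketch below is our own illustration, not the method of this paper: it approximates K(x|y) by C(yx) − C(y) using Python's zlib, and all strings and helper names are hypothetical.

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length: a crude, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(s, 9))

def cond_c(x: bytes, y: bytes) -> int:
    """Approximate K(x|y) by C(yx) - C(y): the extra bits needed to encode x
    once y is already available to the compressor."""
    return c(y + x) - c(y)

# A prior proportional to 2^{-cond_c(h, h_prev)} would then weight candidate
# hypotheses by how related they are to a previously learned one.
related = b"if petal_width < 0.8 then class = setosa else class = other"
variant = b"if petal_width < 1.0 then class = setosa else class = other"
unrelated = b"zq9!kx@7mvw#r2tl5bn0hy8cj4ue6sd1fg3ab5nm"

# cond_c(variant, related) should be small, cond_c(unrelated, related) large.
```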
Due to lack of space, we only briefly note that similar results hold for an appropriate interpretation of parallel transfer, and that, translated to the Bayesian setting, current practical transfer methods look like sequential transfer methods [6; 7]. Kolmogorov complexity is computable only in the limit (i.e. with infinite resources), and so, while ideal for investigating transfer in the limit, in practice we need to use an approximation of it (see [8] for a good example of this). In this paper we perform transfer in Bayesian decision trees by using a fairly simple approximation to the 2−K(·|·) prior.\n\nIn the rest of the paper we proceed as follows. In section 3 we define Kolmogorov complexity more precisely and state the relevant Bayesian convergence results needed for making the claims above. We then describe our Kolmogorov complexity based Bayesian transfer learning method. In section 4 we describe our approximation of the above using Bayesian decision trees, and then in section 5 we describe 12 transfer experiments using 8 standard databases from the UCI machine learning repository [9]. Our experiments are the most general that we know of, in the sense that we transfer between arbitrary databases with little or no semantic relationships. We note that this fact also makes it difficult to compare our method to other existing methods (see also section 6).\n\n2 Preliminaries\n\nWe consider Bayesian transfer learning for finite input spaces Ii and finite output spaces Oi. We assume finite hypothesis spaces Hi, where each h ∈ Hi is a conditional probability measure on Oi, conditioned on elements of Ii. So for y ∈ Oi and x ∈ Ii, h(y|x) gives the probability of the output being y given input x. 
Given Dn = {(x1, y1), (x2, y2), · · · , (xn, yn)} from Ii × Oi, the probability of Dn according to h ∈ Hi is given by:\n\nh(Dn) := ∏_{k=1}^{n} h(yk|xk)\n\nThe conditional probability of a new sample (xnew, ynew) ∈ Ii × Oi for any conditional probability measure µ (e.g. h ∈ Hi or MW in ( 3.2)) is given by:\n\nµ(ynew|xnew, Dn) := µ(Dn ∪ {(xnew, ynew)}) / µ(Dn)   (2.1)\n\nSo the learning problem is: given a training sample Dn, where each yk with (xk, yk) ∈ Dn is assumed to have been chosen according to an h ∈ Hi, learn h. The prediction problem is to predict the label of the new sample xnew using ( 2.1). The probabilities of the inputs x are not included above because they cancel out. This is merely the standard Bayesian setting, translated to a typical machine learning setting (e.g. [10]).\n\nWe use MCMC simulations on a computer to sample for our Bayesian learners, and so considering only finite spaces above is acceptable. However, the theory we present here holds for any hypothesis, input and output space that may be handled by a computer with infinite resources (see [11; 12] for more precise descriptions). Note that we are considering cross-domain transfer [13] as our standard setting (see section 6). We further assume that each h ∈ Hi is a program (therefore a bit string) for some Universal prefix Turing machine U. When it is clear that a particular symbol p denotes a program, we will write p(x) to denote U(p, x), i.e. running program p on input x.\n\n3 Transfer Learning using Kolmogorov Complexity\n\n3.1 Kolmogorov Complexity based Task Relatedness\n\nA program is a bit string, and a measure of the absolute constructive information that a bit string y contains about another bit string x is given by the conditional Kolmogorov complexity of x given y [5]. 
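(As a brief aside, the product likelihood and the prediction rule ( 2.1) above are straightforward to compute for finite spaces; the toy hypothesis h(y|x) and data below are our own illustration, not from the paper.)

```python
# Toy illustration of the product likelihood h(Dn) = prod_k h(yk|xk)
# and the predictive rule (2.1): mu(y|x, Dn) = mu(Dn + (x, y)) / mu(Dn).
def likelihood(h, data):
    p = 1.0
    for x, y in data:
        p *= h[x][y]  # h[x][y] plays the role of h(y|x)
    return p

def predictive(h, data, x_new, y_new):
    return likelihood(h, data + [(x_new, y_new)]) / likelihood(h, data)

h = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # a fixed conditional measure
D = [(0, 0), (1, 1), (0, 0)]
# For a single hypothesis the ratio collapses to h(y_new|x_new), which is
# also why the probabilities of the inputs cancel out of the Bayesian setup.
```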
Since our hypotheses are programs/bit strings, the amount of information that a hypothesis or program h′ contains about constructing another hypothesis h is also given by the same:\n\nDefinition 1. The conditional Kolmogorov complexity of h ∈ Hj given h′ ∈ Hi is defined as the length of the shortest program that, given the program h′ as input, outputs the program h:\n\nK(h|h′) := min_r {l(r) : r(h′) = h}\n\nWe will use a minimality property of K. Let f(x, y) be a computable function over pairs of bit strings. f is computable means that there is a program p such that p(x, y, n), n ∈ N, computes f(x, y) to accuracy ε < 2−n in finite time. Now assume that f(x, y) satisfies, for each y, ∑_x 2−f(x,y) ≤ 1. Then for a constant cf = K(f) + O(1), independent of x and y, but dependent on K(f), the length of the shortest program computing f, and on some small constant (O(1)) [5, Corollary 4.3.1]:\n\nK(x|y) ≤ f(x, y) + cf   (3.1)\n\n3.2 Bayesian Convergence Results\n\nA Bayes mixture MW over Hi is defined as follows:\n\nMW(Dn) := ∑_{h∈Hi} h(Dn)W(h)   with   ∑_{h∈Hi} W(h) ≤ 1   (3.2)\n\n(the inequality is sufficient for the convergence results). Now assume that the data has been generated by a hj ∈ Hi (this is standard for a Bayesian setting, but we will relax this constraint below). Then the following impressive result holds true for each (x, y) ∈ Ii × Oi:\n\n∑_{n=0}^{∞} ∑_{Dn} hj(Dn)[MW(y|x, Dn) − hj(y|x, Dn)]² ≤ − ln W(hj).   (3.3)\n\nSo for finite − ln W(hj), convergence is rapid; the expected number of n for which |MW(y|x, Dn) − hj(y|x, Dn)| > ε is ≤ − ln W(hj)/ε², and the probability that the number of ε deviations is > − ln W(hj)/ε²δ is < δ. This result was first proved in [14], and extended variously in [11; 12]. 
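A small numerical sketch of the mixture ( 3.2) and the error budget − ln W(hj) from ( 3.3); the two-hypothesis space and all numbers are our own illustration:

```python
import math

# Bayes mixture M_W(Dn) = sum_h h(Dn) W(h) over a tiny hypothesis space,
# together with the cumulative-error budget -ln W(h_true) from (3.3).
def seq_prob(h, data):
    p = 1.0
    for x, y in data:
        p *= h[x][y]
    return p

H = [
    {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},  # an informative hypothesis
    {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}},  # the uniform hypothesis
]
W = [0.75, 0.25]            # prior weights; sum <= 1 suffices for (3.3)
D = [(0, 0), (1, 1), (0, 0)]

mixture = sum(seq_prob(h, D) * w for h, w in zip(H, W))
budget = -math.log(W[0])    # if H[0] generated the data, the cumulative
                            # squared prediction error is at most -ln W(h_0)
```

A smaller prior weight W(h) means a larger error budget, which is exactly why a 2−K(h|t) prior concentrates the budget on hypotheses related to previously learned tasks.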
In essence these results hold as long as Hi can be enumerated and hj and W can be computed with infinite resources. These results also hold if hj ∉ Hi but there exists h′j ∈ Hi such that the nth order KL divergence between hj and h′j is bounded by k; in this case the error bound is − ln W(h′j) + k [11, section 2.5]. Now consider the Solomonoff-Levin prior 2−K(h): this has ( 3.3) error bound K(h) ln 2, and for any computable prior W(·), f(x, y) := − ln W(x)/ln 2 satisfies the conditions on f(x, y) in ( 3.1). So by ( 3.1), with y = the empty string, we get:\n\nK(h) ln 2 ≤ − ln W(h) + cW   (3.4)\n\nBy ( 3.3), this means that for all h ∈ Hi, the error bound for the 2−K(h) prior can be no more than a constant worse than the error bound for any other prior. Since reasonable priors have small K(W) (= O(1)), cW = O(1) and this prior is universally optimal [11, section 5.3].\n\n3.3 Bayesian Transfer Learning\n\nAssume we have previously observed/learned m − 1 tasks, with task tj ∈ Hij, and the mth task to be learned is in Him. Let t := (t1, t2, · · · , tm−1). In the Bayesian framework, a transfer learning scheme corresponds to a computable prior W(·|t) over the space Him with\n\n∑_{h∈Him} W(h|t) ≤ 1\n\nIn this case, by ( 3.3), the error bound of the transfer learning scheme MW (defined by the prior W) is − ln W(h|t). We define our transfer learning method MTL by choosing the prior 2−K(·|t):\n\nMTL(Dn) := ∑_{h∈Him} h(Dn)2−K(h|t).\n\nFor MTL the error bound is K(h|t) ln 2. By the minimality property ( 3.1), we get that\n\nK(h|t) ln 2 ≤ − ln W(h|t) + cW\n\nSo for a reasonable computable transfer learning scheme MW, cW = O(1), and for all h and t the error bound for MTL is no more than a constant worse than the error bound for MW – i.e. MTL is universally optimal [11, section 5.3]. 
Also note that in general K(x|y) ≤ K(x)¹. Therefore by ( 3.4) the transfer learning scheme MTL is also universally optimal over all non-transfer learning schemes – i.e. in the precise formal sense of the framework in this paper, sequential transfer learning is always justified. The results in this section, while novel, are not technically deep (see also [6] [12, section 6]). We should also note that the 2−K(h) prior is not universally optimal with respect to the transfer prior W(·|t), because the inequality ( 3.4) now holds only up to the constant cW(·|t), which depends on K(t). So this constant increases with an increasing number of tasks, which is very undesirable. Indeed, this is demonstrated in our experiments, where the base classifier used is an approximation to the 2−K(h) prior and the error of this prior is seen to be significantly higher than that of the transfer learning prior 2−K(h|t).\n\n4 Practical Approximation using Decision Trees\n\nSince K is computable only in the limit, to apply the above ideas in practical situations we need to approximate K and hence MTL. Furthermore, we also need to specify the spaces Hi, Oi, Ii and how to sample from the approximation of MTL. We address each issue in turn.\n\n¹Because arg K(x) (the shortest program for x), with a constant length modification, also outputs x given input y.\n\n4.1 Decision Trees\n\nWe will consider standard binary decision trees as our hypotheses. Each hypothesis space Hi consists of decision trees for Ii defined by the set fi of features. A tree h ∈ Hi is defined recursively:\n\nh := nroot\nnj := rj Cj ∅ ∅ | rj Cj njL ∅ | rj Cj ∅ njR | rj Cj njL njR\n\nC is a vector of size |Oi|, with component Ci giving the probability of the ith class. Each rule r is of the form f < v, where f ∈ fi and v is a value for f. 
The vector C is used during classification only when the corresponding node has one or more ∅ children. The size of each tree is N·c0, where N is the number of nodes and c0 is a constant denoting the size of each rule entry, the outgoing pointers, and C. Since c0 and the length of the program code p0 for computing the tree output are constants independent of the tree, we define the length of a tree as l(h) := N.\n\n4.2 Approximating K and the Prior 2−K(·|t)\n\nApproximation for a single previously learned tree: We will approximate K(·|·) using a function that is defined for a single previously learned tree as follows:\n\nCld(h|h′) := l(h) − d(h, h′)\n\nwhere d(h, h′) is the maximum number of overlapping nodes starting from the root nodes:\n\nd(h, h′) := d(nroot, n′root)\nd(n, n′) := 1 + d(nL, n′L) + d(nR, n′R)\nd(n, ∅) := 0\nd(∅, n′) := 0\n\nIn the single task case, the prior is just 2−l(h)/Zl (which is an approximation to the Solomonoff-Levin prior 2−K(·)), and in the transfer learning case the prior is 2−Cld(·|h′)/ZCld, where the Zs are normalization terms². In both cases, we can sample from the prior directly by growing the decision tree dynamically. Call a ∅ in h a hole. Then for 2−l(h), during the generation process we first generate an integer k with probability proportional to 2−k (easy to do using a pseudo random number generator). Then at each step we select a hole uniformly at random, create a node there (with two more holes), and generate the corresponding rule randomly. We do so until we get a tree with l(h) = k. In the transfer learning case, for the prior 2−Cld(·|h′) we first generate an integer k in the same way. 
Then we generate as above until we get a tree h with Cld(h|h′) = k. It can be seen with a little thought that these procedures sample from the respective priors.\n\nApproximation for multiple previously learned trees: We define Cld for multiple trees as an averaging of the contributions of each of the m − 1 previously learned trees:\n\nC^m_ld(hm|h1, h2, · · · , hm−1) := − log( (1/(m − 1)) ∑_{i=1}^{m−1} 2−Cld(hm|hi) )\n\nIn the transfer learning case, we need to sample according to 2−C^m_ld(·|·)/ZC^m_ld, which reduces to (1/[(m − 1)ZC^m_ld]) ∑_{i=1}^{m−1} 2−Cld(hm|hi). To sample from this, we can simply select an hi from the m − 1 trees at random and then sample from 2−Cld(·|hi) to get the new tree.\n\nThe transfer learning mixture: The approximation of the transfer learning mixture MTL is now:\n\nPTL(Dn) = ∑_{h∈Him} h(Dn)2−C^m_ld(h|t)/ZC^m_ld\n\nSo by ( 3.3), the error bound for PTL is given by C^m_ld(h|t) ln 2 + ln ZCld (the ln ZCld is a constant that is the same for all h ∈ Hi). So when using C^m_ld, universality is maintained, but only up to the degree that C^m_ld approximates K. In our experiments we used the prior 1.005−C instead of 2−C above to make larger trees more likely and hence speed up convergence of the MCMC sampling.\n\n²The Zs exist here because the Hs are finite, and in general because ki = N·c0 + l(p0) gives lengths of programs, which are known to satisfy ∑_i 2−ki ≤ 1.\n\nTable 1: Metropolis-Hastings Algorithm\n\n1. Let Dn be the training sample; select the current tree/state hcur using the proposal distribution q(hcur).\n\n2. 
For i = 1 to J do\n\n(a) Choose a candidate next state hprop according to the proposal distribution q(hprop).\n(b) Draw u uniformly at random from [0, 1] and set hcur := hprop if A(hprop, hcur) > u, where A is defined by\n\nA(h, h′) := min( 1, [h(Dn) 2−C^m_ld(h|t) q(h′)] / [h′(Dn) 2−C^m_ld(h′|t) q(h)] )\n\n4.3 Approximating PTL using Metropolis-Hastings\n\nAs in standard Bayesian MCMC methods, the idea is to draw N samples hmi from the posterior P(h|Dn, t), which is given by\n\nP(h|Dn, t) := h(Dn) 2−C^m_ld(h|t)/(ZC^m_ld P(Dn))\n\nThen we approximate PTL by\n\nP̂TL(y|x) := (1/N) ∑_{i=1}^{N} hmi(y|x)\n\nWe use the standard Metropolis-Hastings algorithm to sample from PTL (see [15] for a brief introduction and further references). The algorithm is given in table 1. The algorithm is first run for some J = T to get the Markov chain q × A to converge, and then, starting from the last hcur in that run, the algorithm is run again for J = N steps to get N samples for P̂TL. In our experiments we set T = 1000 and N = 50. We set q to our prior 2−C^m_ld(·|t)/ZC^m_ld, and hence the acceptance probability A reduces to min{1, h(Dn)/h′(Dn)}. Note that every time we generate a tree according to q, we set the C entries using the training sample Dn in the usual way.\n\n5 Experiments\n\nWe used 8 databases from the UCI machine learning repository [9] in our experiments (table 2). To show transfer of information we used 20% of the data for a task as the training sample, but also used as prior knowledge trees learned on another task using 80% of the data as the training sample. The reported error rates are on the testing sets and are averages over 10 runs. 
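The sampler of table 1 can be sketched generically; the toy state space and unnormalized target below are our own stand-ins for the trees and the weight h(Dn) 2−C^m_ld(h|t):

```python
import random

# Generic Metropolis-Hastings loop in the shape of table 1, with an
# independence proposal q (as when q is set to the prior itself).
random.seed(0)

states = [0, 1, 2, 3]
target = [0.1, 0.2, 0.3, 0.4]   # unnormalized posterior weights
q = [0.25, 0.25, 0.25, 0.25]    # independence proposal

def mh_samples(n_samples, burn_in=1000):
    cur = random.choice(states)
    out = []
    for i in range(burn_in + n_samples):
        prop = random.choice(states)
        # A = min(1, target(prop) q(cur) / (target(cur) q(prop)))
        a = min(1.0, (target[prop] * q[cur]) / (target[cur] * q[prop]))
        if random.random() < a:
            cur = prop
        if i >= burn_in:
            out.append(cur)
    return out

samples = mh_samples(20000)
# Empirical frequencies approach target / sum(target) = [0.1, 0.2, 0.3, 0.4].
```

Note that with q equal to the (unnormalized) target's prior factor, the acceptance ratio simplifies, just as A reduces to min{1, h(Dn)/h′(Dn)} in the paper.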
To the best of our knowledge our transfer experiments are the most general performed so far, in the sense that the databases information is transferred between have semantic relationships that are often tenuous.\n\nWe performed 3 sets of experiments. In the first set we learned each classifier using 80% of the data as the training sample and 20% as the testing sample (since ours is a Bayesian method, we did not use a validation sample-set). This set ensured that our base Bayesian classifier with the 2−l(h) prior is reasonably powerful, and that any improvement in performance in the transfer experiments (set 3) was due to transfer and not a deficiency in our base classifier. From a survey of the literature it seems the error rate for our classifier is always at least a couple of percentage points better than C4.5. As an example, for ecoli our classifier outperforms Adaboost and Random Forests in [16], but is a bit worse than these for German Credit.\n\nIn the second set of experiments we learned the databases that we are going to transfer to using 20% of the database as the training sample and 80% of the data as the testing sample. This was done to establish baseline performance for the transfer learning case. The third and final set of experiments were performed to do the actual transfer. In this case, first one task was learned using an 80/20 (80% training, 20% testing) data split, and then this was used to learn a 20/80 dataset. During transfer, the N trees from the sampling of the 80/20 task were all used in the prior 2−C^N_ld(·|t). The results are given in table 3. In our experiments, we transferred only to tasks that showed a significant drop in error rate with the 20/80 split. Surprisingly, the error of the other data sets did not change much.\n\nTable 2: Database summary. The last column gives the error and standard deviation for the 80/20 database split.\n\nData Set | No. of Samples | No. of Feats. | No. Classes | Error/S.D.\nEcoli | 336 | 7 | 8 | 9.8%, 3.48\nYeast | 1484 | 8 | 10 | 14.8%, 2.0\nMushroom | 8124 | 22 | 2 | 0.83%, 0.71\nAustralian Credit | 690 | 14 | 2 | 16.6%, 3.75\nGerman Credit | 1000 | 20 | 2 | 28.2%, 4.5\nHepatitis | 155 | 19 | 2 | 18.86%, 2.03\nBreast Cancer, Wisc. | 699 | 9 | 2 | 5.6%, 1.9\nHeart Disease, Cleve. | 303 | 14 | 5 | 23.0%, 2.56\n\nAs can be seen from comparing the tables, in most cases transfer of information improves the performance compared to the baseline no-transfer case. For ecoli, the transfer resulted in improvement to near 80/20 levels, while for australian the improvement was better than 80/20. While the error rates for mushroom and bc-wisc did not reach 80/20 levels, there was improvement. Interestingly, transfer learning did not hurt in a single case, which agrees with our theoretical results in the idealized setting.\n\nTable 3: Results of 12 transfer experiments. The Trans. To and Trans. From rows give the databases information is transferred to and from. The No-Transfer row gives the baseline 20/80 error rate and standard deviation. The Transfer row gives the error rate and standard deviation after transfer, and the final row PI gives the percentage improvement in performance due to transfer. With our admittedly inefficient code, each experiment took between 15 and 60 seconds on a 2.4 GHz laptop with 512 MB RAM.\n\nTrans. To | Yeast | Yeast | Yeast | Australian | Australian | Australian\nTrans. From | ecoli | Germ. | BC Wisc | Germ. | ecoli | hep.\nNo-Transfer | 20.6%, 3.8 | 20.6%, 3.8 | 20.6%, 3.8 | 23.2%, 2.4 | 23.2%, 2.4 | 23.2%, 2.4\nTransfer | 11.3%, 1.6 | 10.2%, 4.74 | 9.68%, 2.98 | 15.47%, 0.67 | 15.43%, 1.2 | 15.21%, 0.42\nPI | 45.1% | 49% | 53% | 33.0% | 33.5% | 34.4%\n\nTrans. To | ecoli | ecoli | ecoli | heart | heart | heart\nTrans. From | mushroom | BC Wisc. | Germ. | BC Wisc. | Aus. | ecoli\nNo-Transfer | 13.8%, 1.3 | 13.8%, 1.3 | 13.8%, 1.3 | 10.3%, 1.6 | 10.3%, 1.6 | 10.3%, 1.6\nTransfer | 4.6%, 0.17 | 4.64%, 0.21 | 3.89%, 1.02 | 8.3%, 0.93 | 8.1%, 1.22 | 7.8%, 2.03\nPI | 66.0% | 66.0% | 71.8% | 19.4% | 21.3% | 24.3%\n\n6 Discussion\n\nIn this paper we introduced a Kolmogorov complexity theoretic framework for transfer learning. The theory is universally optimal and elegant, and we showed its practical applicability by constructing approximations to it to transfer information across disparate domains in standard UCI machine learning databases. The full theoretical development can be found in [6; 7]. Directions for future empirical investigations are many. We did not consider transferring from multiple previous tasks, the effect of the size of source samples on transfer performance (using 70/30 etc. as the sources), or transfer in regression. Due to the general nature of our method, we can perform transfer experiments between any combination of databases in the UCI repository. We also wish to perform experiments using more powerful generalized similarity functions like the gzip compressor [8]³.\n\nWe also hope that it is clear that the Kolmogorov complexity based approach elegantly solves the problem of cross-domain transfer, where we transfer information between tasks that are defined over different input, output and distribution spaces. To the best of our knowledge, the first paper to address this was [13], and recent works include [17] and [18]. All these methods transfer information by finding structural similarity between the various networks/rules that form the hypotheses. This is, of course, a way to measure constructive similarity between the hypotheses, and hence an approximation to Kolmogorov complexity based similarity. So Kolmogorov complexity elegantly unifies these ideas. 
Additionally, the above methods, particularly the last two, are rather elaborate and are hypothesis space specific ([18] is even task specific). The theory of Kolmogorov complexity and its practical approximations, such as [8] and this paper, suggest that we can get good performance by just using generalized compressors, such as gzip etc., to measure similarity.\n\nAcknowledgments\n\nWe would like to thank Kiran Lakkaraju for their comments and Samarth Swarup for many fruitful discussions.\n\nReferences\n\n[1] Rich Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.\n\n[2] Jonathan Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, March 2000.\n\n[3] Shai Ben-David and Reba Schuller. Exploiting task relatedness for learning multiple tasks. In Proceedings of the 16th Annual Conference on Learning Theory, 2003.\n\n[4] Brendan Juba. Estimating relatedness via data compression. In Proceedings of the 23rd International Conference on Machine Learning, 2006.\n\n[5] Ming Li and Paul Vitanyi. An Introduction to Kolmogorov Complexity and its Applications. Springer-Verlag, New York, 2nd edition, 1997.\n\n[6] M. M. Hassan Mahmud. On universal transfer learning. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, 2007.\n\n[7] M. M. Hassan Mahmud. On universal transfer learning (under review). 2008.\n\n[8] R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2004.\n\n[9] D.J. Newman, S. Hettich, C.L. Blake, and C.J. Merz. UCI repository of ML databases, 1998.\n\n[10] Radford M. Neal. Bayesian methods for machine learning, NIPS tutorial, 2004.\n\n[11] Marcus Hutter. Optimality of Bayesian universal prediction for general loss and alphabet. Journal of Machine Learning Research, 4:971–1000, 2003.\n\n[12] Marcus Hutter. On universal prediction and Bayesian confirmation. 
Theoretical Computer Science (in press), 2007.\n\n[13] Samarth Swarup and Sylvian R. Ray. Cross domain knowledge transfer using structured representations. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006.\n\n[14] R. J. Solomonoff. Complexity-based induction systems: comparisons and convergence theorems. IEEE Transactions on Information Theory, 24(4):422–432, 1978.\n\n[15] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.\n\n[16] Leo Breiman. Random forests. Machine Learning, 45:5–32, 2001.\n\n[17] Lilyana Mihalkova, Tuyen Huynh, and Raymond Mooney. Mapping and revising Markov logic networks for transfer learning. In Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI), 2007.\n\n[18] Matthew Taylor and Peter Stone. Cross-domain transfer for reinforcement learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.\n\n³A flavor of this approach: if the standard compressor is gzip, then the function Cgzip(xy) will give the length of the string xy after compression by gzip. Cgzip(xy) − Cgzip(y) will be the conditional Cgzip(x|y). So Cgzip(h|h′) will give the relatedness between tasks.\n", "award": [], "sourceid": 327, "authors": [{"given_name": "M.", "family_name": "Mahmud", "institution": null}, {"given_name": "Sylvian", "family_name": "Ray", "institution": null}]}