{"title": "Transfer learning for text classification", "book": "Advances in Neural Information Processing Systems", "page_first": 299, "page_last": 306, "abstract": "", "full_text": "Transfer learning for text classi\ufb01cation\n\nComputer Science Department\n\nComputer Science Department\n\nChuong B. Do\n\nStanford University\nStanford, CA 94305\n\nAndrew Y. Ng\n\nStanford University\nStanford, CA 94305\n\nAbstract\n\nLinear text classi\ufb01cation algorithms work by computing an inner prod-\nuct between a test document vector and a parameter vector. In many such\nalgorithms, including naive Bayes and most TFIDF variants, the parame-\nters are determined by some simple, closed-form, function of training set\nstatistics; we call this mapping mapping from statistics to parameters, the\nparameter function. Much research in text classi\ufb01cation over the last few\ndecades has consisted of manual efforts to identify better parameter func-\ntions. In this paper, we propose an algorithm for automatically learning\nthis function from related classi\ufb01cation problems. The parameter func-\ntion found by our algorithm then de\ufb01nes a new learning algorithm for\ntext classi\ufb01cation, which we can apply to novel classi\ufb01cation tasks. We\n\ufb01nd that our learned classi\ufb01er outperforms existing methods on a variety\nof multiclass text classi\ufb01cation tasks.\n\narg maxk2f1;:::;KgPn\n\nIntroduction\n\n1\nIn the multiclass text classi\ufb01cation task, we are given a training set of documents, each\nlabeled as belonging to one of K disjoint classes, and a new unlabeled test document.\nUsing the training set as a guide, we must predict the most likely class for the test doc-\nument.\n\u201cBag-of-words\u201d linear text classi\ufb01ers represent a document as a vector x of\nword counts, and predict the class whose score (a linear function of x) is highest, i.e.,\ni=1 (cid:18)kixi. 
Choosing parameters {θ_ki} which give high classification accuracy on test data is thus the main challenge for linear text classification algorithms.\nIn this paper, we focus on linear text classification algorithms in which the parameters are pre-specified functions of training set statistics; that is, each θ_ki is a function θ_ki := g(u_ki) of some fixed statistics u_ki of the training set. Unlike discriminative learning methods, such as logistic regression [1] or support vector machines (SVMs) [2], which use numerical optimization to pick parameters, the learners we consider perform no optimization. Rather, in our technique, parameter learning involves tabulating statistics vectors {u_ki} and applying the closed-form function g to obtain parameters. We refer to g, this mapping from statistics to parameters, as the parameter function.\nMany common text classification methods\u2014including the multinomial and multivariate Bernoulli event models for naive Bayes [3], the vector space-based TFIDF classifier [4], and its probabilistic variant, PrTFIDF [5]\u2014belong to this class of algorithms. Here, picking a good text classifier from this class is equivalent to finding the right parameter function for the available statistics.\nIn practice, researchers often develop text classification algorithms by trial and error, guided by empirical testing on real-world classification tasks (cf. [6, 7]). Indeed, one could argue that much of the 30-year history of information retrieval has consisted of manually trying TFIDF formula variants (i.e., adjusting the parameter function g) to optimize performance [8]. 
Even though this heuristic process can often lead to good parameter functions, such a laborious task requires much human ingenuity, and risks failing to find algorithm variations not considered by the designer.\nIn this paper, we consider the task of automatically learning a parameter function g for text classification. Given a set of example text classification problems, we wish to \u201cmeta-learn\u201d a new learning algorithm (as specified by the parameter function g), which may then be applied to new classification problems. The meta-learning technique we propose, which leverages data from a variety of related classification tasks to obtain a good classifier for new tasks, is thus an instance of transfer learning; specifically, our framework automates the process of finding a good parameter function for text classifiers, replacing hours of hand-tweaking with a straightforward, globally-convergent, convex optimization problem.\nOur experiments demonstrate the effectiveness of learning classifier forms. In low training data classification tasks, the learning algorithm given by our automatically learned parameter function consistently outperforms human-designed parameter functions based on naive Bayes and TFIDF, as well as existing discriminative learning approaches.\n\n2 Preliminaries\n\nLet V = {w_1, ..., w_n} be a fixed vocabulary of words, and let X = Z^n and Y = {1, ..., K} be the input and output spaces for our classification problem. A labeled document is a pair (x, y) ∈ X × Y, where x is an n-dimensional vector with x_i indicating the number of occurrences of word w_i in the document, and y is the document's class label. A classification problem is a tuple ⟨D, S, (x_test, y_test)⟩, where D is a distribution over X × Y, S = {(x_i, y_i)}_{i=1}^M is a set of M training examples, (x_test, y_test) is a single test example, and all M + 1 examples are drawn iid from D. 
Given a training set S and a test input vector x_test, we must predict the value of the test class label y_test.\nIn linear classification algorithms, we evaluate the score f_k(x_test) := Σ_i θ_ki x_test,i for assigning x_test to each class k ∈ {1, ..., K} and pick the class y = arg max_k f_k(x_test) with the highest score. In our meta-learning setting, we define each θ_ki as the component-wise evaluation of the parameter function g on some vector of training set statistics u_ki:\n\n[θ_k1, θ_k2, ..., θ_kn]^T := [g(u_k1), g(u_k2), ..., g(u_kn)]^T. (1)\n\nHere, each u_ki ∈ R^q (k = 1, ..., K; i = 1, ..., n) is a vector whose components are computed from the training set S (we will provide specific examples later). Furthermore, g : R^q → R is the parameter function mapping from u_ki to its corresponding parameter θ_ki. To illustrate these definitions, we show that two specific cases of the naive Bayes and TFIDF classification methods belong to the class of algorithms described above.\n\nNaive Bayes: In the multinomial variant of the naive Bayes classification algorithm,^1 the score for assigning a document x to class k is\n\nf_k^NB(x) := log p̂(y = k) + Σ_{i=1}^n x_i log p̂(w_i | y = k). (2)\n\nThe first term, p̂(y = k), corresponds to a \u201cprior\u201d over document classes, and the second term, p̂(w_i | y = k), is the (smoothed) relative frequency of word\n\n^1 Despite naive Bayes' overly strong independence assumptions and thus its shortcomings as a probabilistic model for text documents, we can nonetheless view naive Bayes as simply an algorithm which makes predictions by computing certain functions of the training set. 
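As a concrete illustration of the framework in (1), the sketch below scores a document by applying a parameter function g component-wise to a tensor of statistics vectors. This is our own minimal example, not code from the paper; the array shapes, the toy statistics, and the choice of g are all illustrative assumptions.

```python
import numpy as np

def linear_scores(x, u, g):
    """Score each class k as f_k(x) = sum_i g(u[k, i]) * x[i], where
    u[k, i] is the statistics vector u_ki for (class k, word i) and
    g maps a statistics vector to a single parameter theta_ki."""
    K, n, _ = u.shape
    theta = np.array([[g(u[k, i]) for i in range(n)] for k in range(K)])
    return theta @ x  # one score per class

def predict(x, u, g):
    """Pick the class with the highest linear score."""
    return int(np.argmax(linear_scores(x, u, g)))

# Toy setting: 2 classes, 3 words, 1-dimensional statistics (q = 1).
u = np.array([[[0.5], [0.1], [0.4]],
              [[0.1], [0.8], [0.1]]])
g = lambda v: np.log(v[0])      # an arbitrary illustrative parameter function
x = np.array([0, 3, 1])         # word counts of a test document
print(predict(x, u, g))         # -> 1
```

Swapping in a different g (e.g. the naive Bayes or TFIDF forms below) changes the learning algorithm without touching the scoring code, which is exactly the degree of freedom the paper proposes to learn.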
This view has proved useful for analysis of naive Bayes even when none of its probabilistic assumptions hold [9]; here, we adopt this view, without attaching any particular probabilistic meaning to the empirical frequencies p̂(·) that happen to be computed by the algorithm.\n\nw_i in training documents of class k. For balanced training sets, the first term is irrelevant. Therefore, we have f_k^NB(x) = Σ_i θ_ki x_i where θ_ki = gNB(u_ki),\n\nu_ki := [u_ki1; u_ki2; u_ki3; u_ki4; u_ki5] = [number of times w_i appears in documents of class k; number of documents of class k containing w_i; total number of words in documents of class k; total number of documents of class k; total number of documents], (3)\n\nand\n\ngNB(u_ki) := log((u_ki1 + ε) / (u_ki3 + nε)), (4)\n\nwhere ε is a smoothing parameter. (ε = 1 gives Laplace smoothing.)\n\nTFIDF: In the unnormalized TFIDF classifier, the score for assigning x to class k is\n\nf_k^TFIDF(x) := Σ_{i=1}^n (x̄_{i|y=k} · log(1/p̂(x_i > 0))) (x_i · log(1/p̂(x_i > 0))), (5)\n\nwhere x̄_{i|y=k} (sometimes called the average term frequency of w_i) is the average ith component of all document vectors of class k, and p̂(x_i > 0) (sometimes called the document frequency of w_i) is the proportion of all documents containing w_i.^2 As before, we write f_k^TFIDF(x) = Σ_i θ_ki x_i with θ_ki = gTFIDF(u_ki). The statistics vector is again defined as in (3), but this time,\n\ngTFIDF(u_ki) := (u_ki1 / u_ki4) (log(u_ki5 / u_ki2))². (6)\n\nSpace constraints preclude a detailed discussion, but many other classification algorithms can similarly be expressed in this framework, using other definitions of the statistics vectors {u_ki}. 
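To make (3), (4), and (6) concrete, the sketch below tabulates the five-component statistics vector from a toy corpus and applies the two hand-designed parameter functions. This is our own illustration under the stated equations; the function names and the toy data are assumptions, not from the paper.

```python
import numpy as np

def statistics(docs, labels, k):
    """Statistics vector u_ki of equation (3), for every word i at once.
    docs: (num_docs, n) array of word counts; labels: class of each doc."""
    docs = np.asarray(docs)
    in_k = docs[np.asarray(labels) == k]
    n = docs.shape[1]
    u = np.zeros((n, 5))
    u[:, 0] = in_k.sum(axis=0)          # times w_i appears in class-k docs
    u[:, 1] = (in_k > 0).sum(axis=0)    # class-k docs containing w_i
    u[:, 2] = in_k.sum()                # total words in class-k docs
    u[:, 3] = in_k.shape[0]             # number of class-k docs
    u[:, 4] = docs.shape[0]             # total number of docs
    return u

def g_nb(u, eps=1.0):
    """Equation (4); eps = 1 gives Laplace smoothing."""
    n = u.shape[0]
    return np.log((u[:, 0] + eps) / (u[:, 2] + n * eps))

def g_tfidf(u):
    """Equation (6): average term frequency times squared IDF-style log."""
    return (u[:, 0] / u[:, 3]) * np.log(u[:, 4] / u[:, 1]) ** 2
```

For example, with `docs = [[2, 0], [1, 1], [0, 3]]` and `labels = [0, 0, 1]`, the statistics for class 0 and word 0 come out as [3, 2, 4, 2, 3], and each g maps such a row to one parameter θ_ki.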
These include most other variants of TFIDF based on different TF and IDF terms [7], PrTFIDF [5], and various heuristically modified versions of naive Bayes [6].\n\n3 Learning the parameter function\n\nIn the last section, we gave two examples of algorithms that obtain their parameters θ_ki by applying a function g to a statistics vector u_ki. In each case, the parameter function was hand-designed, either from probabilistic (in the case of naive Bayes [3]) or geometric (in the case of TFIDF [4]) considerations. We now consider the problem of automatically learning a parameter function from example classification tasks. In the sequel, we assume fixed statistics vectors {u_ki} and focus on finding an optimal parameter function g.\nIn the standard supervised learning setting, we are given a training set of examples sampled from some unknown distribution D, and our goal is to use the training set to make a prediction on a new test example also sampled from D. By using the training examples to understand the statistical regularities in D, we hope to predict y_test from x_test with low error. Analogously, the problem of meta-learning g is again a supervised learning task; here, however, the training \u201cexamples\u201d are now classification problems sampled from a distribution D over classification problems.^3 By seeing many instances of text classification problems\n\n^2 Note that (5) implicitly defines f_k^TFIDF(x) as a dot product of two vectors, each of whose components consist of a product of two terms. In the normalized TFIDF classifier, both vectors are normalized to unit length before computing the dot product, a modification that makes the algorithm more stable for documents of varying length. 
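Since the approach below builds on softmax regression, a minimal sketch of the class-probability model of equation (7) may help; this is standard softmax code written by us, with the max-subtraction trick as a numerical-stability assumption.

```python
import numpy as np

def softmax_probs(theta, x):
    """Class probabilities of equation (7): p(y = k | x) is proportional to
    exp(sum_i theta[k, i] * x[i]), normalized over the K classes."""
    scores = theta @ x
    scores = scores - scores.max()   # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```

With equal scores for every class the model returns the uniform distribution, and the probabilities always sum to one regardless of the scale of theta.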
This too can be represented within our framework by considering appropriately normalized statistics vectors.\n\n^3 Note that in our meta-learning problem, the output of our algorithm is a parameter function g mapping statistics to parameters. Our training data, however, do not explicitly indicate the best parameter function g* for each example classification problem. Effectively then, in the meta-learning task, the central problem is to fit g to some unseen g*, based on test examples in each training classification problem.\n\ndrawn from D, we hope to learn a parameter function g that exploits the statistical regularities in problems from D. Formally, let S̄ = {⟨D^(j), S^(j), (x^(j), y^(j))⟩}_{j=1}^m be a collection of m classification problems sampled iid from D. For a new, test classification problem ⟨D_test, S_test, (x_test, y_test)⟩ sampled independently from D, we desire that our learned g correctly classify x_test with high probability.\nTo achieve our goal, we first restrict our attention to parameter functions g that are linear in their inputs. Using the linearity assumption, we pose a convex optimization problem for finding a parameter function g that achieves small loss on test examples in the training collection. Finally, we generalize our method to the non-parametric setting via the \u201ckernel trick,\u201d thus allowing us to learn complex, highly non-linear functions of the input statistics.\n\n3.1 Softmax learning\n\nRecall that in softmax regression, the class probabilities p(y | x) are modeled as\n\np(y = k | x, {θ_ki}) := exp(Σ_i θ_ki x_i) / Σ_{k'} exp(Σ_i θ_k'i x_i),  k = 1, ..., K, (7)\n\nwhere the parameters {θ_ki} are learned from the training data S by maximizing the conditional log likelihood of the data. In this approach, a total of Kn parameters are trained jointly using numerical optimization. 
Here, we consider an alternative approach in which each of the Kn parameters is some function of the prespecified statistics vectors; in particular, θ_ki := g(u_ki). Our goal is to learn an appropriate g.\nTo pose our optimization problem, we start by learning the linear form g(u_ki) = β^T u_ki. Under this parameterization, the conditional likelihood of an example (x, y) is\n\np(y = k | x, β) = exp(Σ_i β^T u_ki x_i) / Σ_{k'} exp(Σ_i β^T u_k'i x_i),  k = 1, ..., K. (8)\n\nIn this setup, one natural approach for learning a linear function g is to maximize the (regularized) conditional log likelihood ℓ(β : S̄) for the entire collection S̄:\n\nℓ(β : S̄) := Σ_{j=1}^m log p(y^(j) | x^(j), β) − C||β||²\n= Σ_{j=1}^m log( exp(β^T Σ_i u_{y^(j)i}^(j) x_i^(j)) / Σ_k exp(β^T Σ_i u_ki^(j) x_i^(j)) ) − C||β||². (9)\n\nIn (9), the latter term corresponds to a Gaussian prior on the parameters β, which provides a means for controlling the complexity of the learned parameter function g. The maximization of (9) is similar to softmax regression training except that here, instead of optimizing over the parameters {θ_ki} directly, we optimize over the choice of β.\n\n3.2 Nonparametric function learning\n\nIn this section, we generalize the technique of the previous section to nonlinear g. By the Representer Theorem [10], there exists a maximizing solution to (9) for which the optimal parameter vector β* is a linear combination of training set statistics:\n\nβ* = Σ_{j=1}^m Σ_k α*_jk Σ_i u_ki^(j) x_i^(j). (10)\n\nFrom this, we reparameterize the original optimization over β in (9) as an equivalent optimization over training example weights {α_jk}. 
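A direct evaluation of the objective (9) may clarify the setup: each training "example" is a whole classification problem, reduced to a per-class score β · Σ_i u_ki x_i. The sketch below is ours; the data layout (one statistics tensor per problem) and names are illustrative assumptions, and a real implementation would also supply gradients to an optimizer such as L-BFGS.

```python
import numpy as np

def loglik(beta, problems, C):
    """Objective (9): regularized conditional log likelihood of a collection.
    Each problem is (u, x, y): u has shape (K, n, q) holding the statistics
    vectors u_ki, x has shape (n,) (the problem's test document), y is the
    true class. With g(u_ki) = beta . u_ki, class k's score is
    beta . sum_i u[k, i] * x[i]."""
    total = 0.0
    for u, x, y in problems:
        v = np.einsum('kiq,i->kq', u, x)      # sum_i u_ki * x_i, per class
        scores = v @ beta
        total += scores[y] - np.logaddexp.reduce(scores)
    return total - C * beta @ beta
```

At beta = 0 every class is equally likely, so each problem contributes -log K; moving beta toward statistics correlated with the correct class increases the objective, which is what the convex maximization exploits.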
For notational convenience, let\n\nK(j, j', k, k') := Σ_i Σ_{i'} x_i^(j) x_{i'}^(j') (u_ki^(j))^T u_{k'i'}^(j'). (11)\n\n[Figure 1: Distribution of unnormalized u_ki vectors in dmoz data (a) without and (b) with applying the log transformation in (15). In principle, one could alternatively use a feature vector representation using these frequencies directly, as in (a). However, applying the log transformation yields a feature space with fewer isolated points in R², as in (b). When using the Gaussian kernel, a feature space with few isolated points is important as the topology of the feature space establishes locality of influence for support vectors.]\n\nSubstituting (10) and (11) into (9), we obtain\n\nℓ({α_jk} : S̄) := Σ_{j'=1}^m log( exp(Σ_{j=1}^m Σ_k α_jk K(j, j', k, y^(j'))) / Σ_{k'} exp(Σ_{j=1}^m Σ_k α_jk K(j, j', k, k')) )\n− C Σ_{j=1}^m Σ_{j'=1}^m Σ_k Σ_{k'} α_jk α_{j'k'} K(j, j', k, k'). (12)\n\nNote that (12) is concave and differentiable, so we can train the model using any standard numerical gradient optimization procedure, such as conjugate gradient or L-BFGS [11]. The assumption that g is a linear function of u_ki, however, places a severe restriction on the class of learnable parameter functions. 
Noting that the statistics vectors appear only as an inner product in (11), we apply the \u201ckernel trick\u201d to obtain\n\nK(j, j', k, k') := Σ_i Σ_{i'} x_i^(j) x_{i'}^(j') K(u_ki^(j), u_{k'i'}^(j')), (13)\n\nwhere the kernel function K(u, v) = ⟨Φ(u), Φ(v)⟩ defines the inner product of some high-dimensional mapping Φ(·) of its inputs.^4 In particular, choosing a Gaussian (RBF) kernel, K(u, v) := exp(−γ||u − v||²), gives a non-parametric representation for g:\n\ng(u_ki) = β^T Φ(u_ki) = Σ_{j=1}^m Σ_k Σ_i α_jk x_i^(j) exp(−γ||u_ki^(j) − u_ki||²). (14)\n\nThus, g(u_ki) is a weighted combination of the values {α_jk x_i^(j)}, where the weights depend exponentially on the squared ℓ2-distance of u_ki to each of the statistics vectors {u_ki^(j)}. As a result, we can approximate any sufficiently smooth bounded function of u arbitrarily well, given sufficiently many training classification problems.\n\n4 Experiments\n\nTo validate our method, we evaluated its ability to learn parameter functions on a variety of email and webpage classification tasks in which the number of classes, K, was large (K = 10), and the number of training examples per class, m/K, was small (m/K = 2). We used the dmoz Open Directory Project hierarchy,^5 the 20 Newsgroups dataset,^6 the Reuters-21578 dataset,^7 and the Industry Sector dataset.^8\n\n^4 Note also that as a consequence of our kernelization, K itself can be considered a \u201ckernel\u201d between all statistics vectors from two entire documents.\n^5 http://www.dmoz.org\n^6 http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.tar.gz\n^7 http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz\n^8 http://www.cs.umass.edu/~mccallum/data/sector.tar.gz\n\nTable 1: Test set accuracy on dmoz categories. 
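Once the weights α_jk are learned, evaluating the nonparametric g of equation (14) is a direct sum over training problems, classes, and words. The sketch below is our own vectorized rendering; the array shapes are illustrative assumptions about how one might store the training statistics.

```python
import numpy as np

def g_rbf(u_query, alpha, x, u_train, gamma):
    """Equation (14): g(u) = sum_{j,k,i} alpha[j, k] * x[j, i]
                             * exp(-gamma * ||u_train[j, k, i] - u||^2).
    alpha: (m, K) learned weights; x: (m, n) test-document counts of the
    m training problems; u_train: (m, K, n, q) statistics vectors."""
    d2 = ((u_train - u_query) ** 2).sum(axis=-1)   # (m, K, n) squared distances
    w = np.exp(-gamma * d2)                        # Gaussian kernel weights
    return np.einsum('mk,mi,mki->', alpha, x, w)
```

Because the kernel weight decays exponentially with the distance from u_query to each stored statistics vector, only nearby training statistics influence the parameter value, which is the locality-of-influence property the Figure 1 discussion relies on.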
Columns 2-4 give the proportion of correct classifications using non-discriminative methods: the learned g, Naive Bayes, and TFIDF, respectively. Columns 5-7 give the corresponding values for the discriminative methods: softmax regression, 1-vs-all SVMs, and multiclass SVMs. The best accuracy in each row is shown in bold.\n\nCategory | g | gNB | gTFIDF | softmax | 1VA-SVM | MC-SVM\nArts | 0.421 | 0.296 | 0.286 | 0.352 | 0.203 | 0.367\nBusiness | 0.456 | 0.283 | 0.286 | 0.336 | 0.233 | 0.340\nComputers | 0.467 | 0.304 | 0.327 | 0.344 | 0.217 | 0.387\nGames | 0.411 | 0.288 | 0.240 | 0.279 | 0.240 | 0.330\nHealth | 0.479 | 0.282 | 0.337 | 0.382 | 0.213 | 0.337\nHome | 0.640 | 0.470 | 0.454 | 0.501 | 0.333 | 0.440\nKids and Teens | 0.252 | 0.205 | 0.142 | 0.202 | 0.173 | 0.167\nNews | 0.349 | 0.222 | 0.212 | 0.382 | 0.270 | 0.397\nRecreation | 0.663 | 0.487 | 0.529 | 0.477 | 0.353 | 0.590\nReference | 0.635 | 0.415 | 0.458 | 0.602 | 0.383 | 0.543\nRegional | 0.438 | 0.268 | 0.258 | 0.329 | 0.260 | 0.357\nScience | 0.363 | 0.256 | 0.246 | 0.353 | 0.223 | 0.340\nShopping | 0.612 | 0.456 | 0.556 | 0.483 | 0.373 | 0.550\nSociety | 0.435 | 0.308 | 0.285 | 0.379 | 0.213 | 0.377\nSports | 0.619 | 0.432 | 0.285 | 0.507 | 0.267 | 0.527\nWorld | 0.531 | 0.491 | 0.352 | 0.329 | 0.277 | 0.303\nAverage | 0.486 | 0.341 | 0.328 | 0.390 | 0.264 | 0.397\n\nThe dmoz project is a hierarchical collection of webpage links organized by subject matter. The top level of the hierarchy consists of 16 major categories, each of which contains several subcategories. To perform cross-validated testing, we obtained classification problems from each of the top-level categories by retrieving webpages from each of their respective subcategories. 
For the 20 Newsgroups, Reuters-21578, and Industry Sector datasets, we performed similar preprocessing.^9 Given a dataset of documents, we sampled 10-class, 2-training-examples-per-class classification problems by randomly selecting 10 different classes within the dataset, picking 2 training examples within each class, and choosing one test example from a randomly chosen class.\n\n4.1 Choice of features\n\nTheoretically, for the method described in this paper, any sufficiently rich set of features could be used to learn a parameter function for classification. For simplicity, we reduced the feature vector in (3) to the following two-dimensional representation,^10\n\nu_ki = [log(proportion of w_i among words from documents of class k); log(proportion of documents containing w_i)]. (15)\n\nNote that up to the log transformation, the components of u_ki correspond to the relative term frequency and document frequency of a word relative to class k (see Figure 1).\n\n4.2 Generalization performance\n\nWe tested our meta-learning algorithm on classification problems taken from each of the 16 top-level dmoz categories. For each top-level category, we built a collection of 300 classification problems from that category; results reported here are averages over these\n\n^9 For the Reuters data, we associated each article with its hand-annotated \u201ctopic\u201d label and discarded any articles with more than one topic annotation. For each dataset, we discarded all categories with fewer than 50 examples, and selected a 500-word vocabulary based on information gain.\n^10 Features were rescaled to have zero mean and unit variance over the training set.\n\nTable 2: Cross corpora classification accuracy, using classifiers trained on each of the four corpora. 
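The two-dimensional feature of equation (15) can be tabulated directly from word counts. The sketch below is our own; the add-one smoothing that keeps the logarithms finite for unseen words is an assumption of this illustration (the paper does not specify one), and the zero-mean/unit-variance rescaling of footnote 10 is omitted for brevity.

```python
import numpy as np

def features(docs, labels, k):
    """Equation (15) for every word, relative to class k:
    [log relative term frequency within class k,
     log document frequency over the whole training set].
    Add-one smoothing (an assumption here) keeps the logs finite."""
    docs = np.asarray(docs, dtype=float)
    in_k = docs[np.asarray(labels) == k]
    n = docs.shape[1]
    tf = (in_k.sum(axis=0) + 1.0) / (in_k.sum() + n)             # word share in class k
    df = ((docs > 0).sum(axis=0) + 1.0) / (docs.shape[0] + 1.0)  # docs containing w_i
    return np.stack([np.log(tf), np.log(df)], axis=1)            # shape (n, 2)
```

As Figure 1 suggests, working with these log-proportions rather than the raw frequencies spreads the statistics vectors out, which matters when a Gaussian kernel is placed over them.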
The best accuracy in each row is shown in bold.\n\nDataset | gdmoz | gnews | greut | gindu | gNB | gTFIDF | softmax | 1VA-SVM | MC-SVM\ndmoz | n/a | 0.471 | 0.475 | 0.473 | 0.365 | 0.352 | 0.412 | 0.283 | 0.381\n20 Newsgroups | 0.369 | n/a | 0.371 | 0.369 | 0.223 | 0.184 | 0.248 | 0.206 | 0.217\nReuters-21578 | 0.619 | 0.463 | n/a | 0.475 | 0.567 | 0.567 | 0.481 | 0.308 | 0.463\nIndustry Sector | 0.438 | 0.459 | 0.446 | n/a | 0.374 | 0.274 | 0.375 | 0.271 | 0.376\n\nproblems. To assess the accuracy of our meta-learning algorithm for a particular test category, we used the g learned from a set of 450 classification problems drawn from the other 15 top-level categories.^11 This ensured no overlap of training and testing data. In 15 out of 16 categories, the learned parameter function g outperforms naive Bayes and TFIDF in addition to the discriminative methods we tested (softmax regression, 1-vs-all SVMs [12], and multiclass SVMs [13]^12; see Table 1).^13\nNext, we assessed the ability of g to transfer across even more dissimilar corpora. Here, for each of the four corpora (dmoz, 20 Newsgroups, Reuters-21578, Industry Sector), we constructed independent training and testing datasets of 480 random classification problems. After training separate classifiers (gdmoz, gnews, greut, and gindu) using data from each of the four corpora, we tested the performance of each learned classifier on the remaining three corpora (see Table 2). Again, the learned parameter functions compare favorably to the other methods. Moreover, these tests show that a single parameter function may give an accurate classification algorithm for many different corpora, demonstrating the effectiveness of our approach for achieving transfer across related learning tasks.\n\n5 Discussion and Related Work\n\nIn this paper, we presented an algorithm based on softmax regression for learning a parameter function g from example classification problems. 
Once learned, g defines a new learning algorithm that can be applied to novel classification tasks.\nAnother approach for learning g is to modify the multiclass support vector machine formulation of Crammer and Singer [13] in a manner analogous to the modification of softmax regression in Section 3.1, giving the following quadratic program:\n\nminimize_{β ∈ R^n, ξ ∈ R^m}  (1/2)||β||² + C Σ_j ξ_j\nsubject to  β^T Σ_i x_i^(j) (u_{y^(j)i}^(j) − u_ki^(j)) ≥ I{k ≠ y^(j)} − ξ_j,  ∀k, ∀j.\n\nAs usual, taking the dual leads naturally to an SMO-like procedure for optimization. We implemented this method and found that the learned g, like in the softmax formulation, outperforms naive Bayes, TFIDF, and the other discriminative methods.\nThe techniques described in this paper give one approach for achieving inductive transfer in classifier design\u2014using labeled data from related example classification problems to solve a particular classification problem [16, 17]. Bennett et al. [18] also consider the issue of knowledge transfer in text classification in the context of ensemble classifiers, and propose a system for using related classification problems to learn the reliability of individual classifiers within the ensemble. Unlike their approach, which attempts to meta-learn properties\n\n^11 For each execution of the learning algorithm, the (C, γ) parameters were determined via grid search using a small holdout set of 160 classification problems. The same holdout set was used to select regularization parameters for the discriminative learning algorithms.\n^12 We used LIBSVM [14] to assess 1VA-SVMs and SVM-Light [15] for multiclass SVMs.\n^13 For larger values of m/K (e.g. 
m/K = 10), softmax and multiclass SVMs consistently outperform naive Bayes and TFIDF; nevertheless, the learned g achieves a performance on par with discriminative methods, despite being constrained to parameters which are explicit functions of training data statistics. This result is consistent with a previous study in which a heuristically hand-tuned version of naive Bayes attained near-SVM text classification performance for large datasets [6].\n\nof algorithms, our method uses meta-learning to construct a new classification algorithm. Though not directly applied to text classification, Teevan and Karger [19] consider the problem of automatically learning term distributions for use in information retrieval.\nFinally, Thrun and O'Sullivan [20] consider the task of classification in a mobile robot domain. In this work, the authors describe a task-clustering (TC) algorithm in which learning tasks are grouped via a nearest neighbors algorithm, as a means of facilitating knowledge transfer. A similar concept is implicit in the kernelized parameter function learned by our algorithm, where the Gaussian kernel facilitates transfer between similar statistics vectors.\n\nAcknowledgments\n\nWe thank David Vickrey and Pieter Abbeel for useful discussions, and the anonymous referees for helpful comments. CBD was supported by an NDSEG fellowship. This work was supported by DARPA under contract number FA8750-05-2-0249.\n\nReferences\n\n[1] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61\u201367, 1999.\n[2] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Machine Learning: ECML-98, pages 137\u2013142, 1998.\n[3] A. McCallum and K. Nigam. 
A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.\n[4] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513\u2013523, 1988.\n[5] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML-97, pages 143\u2013151, 1997.\n[6] J. D. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In ICML, pages 616\u2013623, 2003.\n[7] A. Moffat and J. Zobel. Exploring the similarity space. In ACM SIGIR Forum 32, 1998.\n[8] C. Manning and H. Sch\u00fctze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.\n[9] A. Ng and M. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In NIPS 14, 2002.\n[10] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33:82\u201395, 1971.\n[11] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.\n[12] R. Rifkin and A. Klautau. In defense of one-vs-all classification. J. Mach. Learn. Res., 5:101\u2013141, 2004.\n[13] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265\u2013292, 2001.\n[14] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n[15] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Machines. MIT Press, Cambridge, MA, 1998.\n[16] S. Thrun. Lifelong learning: A case study. CMU Technical Report CS-95-208, 1995.\n[17] R. Caruana. Multitask learning. Machine Learning, 28(1):41\u201375, 1997.\n[18] P. N. Bennett, S. T. Dumais, and E. 
Horvitz. Inductive transfer for text classification using generalized reliability indicators. In Proceedings of ICML Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003.\n[19] J. Teevan and D. R. Karger. Empirical development of an exponential probabilistic model for text retrieval: Using textual analysis to build a better model. In SIGIR '03, 2003.\n[20] S. Thrun and J. O'Sullivan. Discovering structure in multiple learning tasks: The TC algorithm. In International Conference on Machine Learning, pages 489\u2013497, 1996.\n", "award": [], "sourceid": 2843, "authors": [{"given_name": "Chuong", "family_name": "Do", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}