{"title": "Large Margin Taxonomy Embedding for Document Categorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1737, "page_last": 1744, "abstract": "Applications of multi-class classification, such as document categorization, often appear in cost-sensitive settings. Recent work has significantly improved the state of the art by moving beyond ``flat'' classification through incorporation of class hierarchies [Cai and Hoffman 04]. We present a novel algorithm that goes beyond hierarchical classification and estimates the latent semantic space that underlies the class hierarchy. In this space, each class is represented by a prototype and classification is done with the simple nearest neighbor rule. The optimization of the semantic space incorporates large margin constraints that ensure that for each instance the correct class prototype is closer than any other. We show that our optimization is convex and can be solved efficiently for large data sets. Experiments on the OHSUMED medical journal data base yield state-of-the-art results on topic categorization.", "full_text": "Large Margin Taxonomy Embedding with an\n\nApplication to Document Categorization\n\nKilian Weinberger\nYahoo! Research\n\nOlivier Chapelle\nYahoo! Research\n\nkilian@yahoo-inc.com\n\nchap@yahoo-inc.com\n\nAbstract\n\nApplications of multi-class classi\ufb01cation, such as document categorization, often\nappear in cost-sensitive settings. Recent work has signi\ufb01cantly improved the state\nof the art by moving beyond \u201c\ufb02at\u201d classi\ufb01cation through incorporation of class\nhierarchies [4]. We present a novel algorithm that goes beyond hierarchical clas-\nsi\ufb01cation and estimates the latent semantic space that underlies the class hierarchy.\nIn this space, each class is represented by a prototype and classi\ufb01cation is done\nwith the simple nearest neighbor rule. 
The optimization of the semantic space\nincorporates large margin constraints that ensure that for each instance the correct\nclass prototype is closer than any other. We show that our optimization is convex\nand can be solved ef\ufb01ciently for large data sets. Experiments on the OHSUMED\nmedical journal data base yield state-of-the-art results on topic categorization.\n\n1 Introduction\n\nMulti-class classi\ufb01cation is a problem that arises in many applications of machine learning. In many\ncases the cost of misclassi\ufb01cation varies strongly between classes. For example, in the context of\nobject recognition it may be signi\ufb01cantly worse to misclassify a male pedestrian as a traf\ufb01c light\nthan as a female pedestrian. Similarly, in the context of document categorization it seems more\nsevere to misclassify a medical journal on heart attack as a publication on athlete\u2019s foot than on\nCoronary artery disease. Although the scope of the proposed method is by no means limited to\ntext data and topic hierarchies, for improved clarity we will restrict ourselves to terminology from\ndocument categorization throughout this paper.\nThe most common approach to document categorization is to reduce the problem to a \u201c\ufb02at\u201d classi-\n\ufb01cation problem [13]. However, it is often the case that the topics are not just discrete classes, but\nare nodes in a complex taxonomy with rich inter-topic relationships. For example, web pages can be\ncategorized into the Yahoo! web taxonomy or medical journals can be categorized into the Medical\nSubject Headings (MeSH) taxonomy. Moving beyond \ufb02at classi\ufb01cation to settings that utilize these\nhierarchical representations of topics has signi\ufb01cantly pushed the state-of-the art [4, 15]. 
Additional\ninformation about inter-topic relationships can, for example, be leveraged through cost-sensitive\ndecision boundaries or knowledge sharing between documents from closely related classes.\nIn reality, however, the topic taxonomy is a crude approximation of topic relations, created by an\neditor with knowledge of the true underlying semantic space of topics. In this paper we propose a\nmethod that moves beyond hierarchical representations and aims to re-discover the continuous latent\nsemantic space underlying the topic taxonomy. Instead of regarding document categorization as\nclassification, we will think of it as a regression problem where new documents are mapped into this\nlatent semantic topic space. Very different from approaches like LSI or LDA [1, 7], our algorithm is\nentirely supervised and explicitly embeds the topic taxonomy and the documents into a single latent\nsemantic space with \u201csemantically meaningful\u201d Euclidean distances.\n\n[Figure 1 panels: topic taxonomy $\\mathcal{T}$ (e.g. stroke, arthritis, heart attack, pneumonia); low dimensional semantic space $\\mathcal{F}$ with class prototypes $\\vec{p}_\\alpha$ and embedded inputs $\\mathbf{W}\\vec{x}_i$; high dimensional input space $\\mathcal{X}$ with original inputs $\\vec{x}_i$.]\n\nFigure 1: A schematic layout of our taxem method (for Taxonomy Embedding). The classes are\nembedded as prototypes inside the semantic space. The input documents are mapped into the same\nspace, placed closest to their topic prototypes.\n\nIn this paper we derive a method to embed the taxonomy of topics into a latent semantic space in\nthe form of topic prototypes. A new document can be classified by first mapping it into this space and\nthen assigning the label of the closest prototype. A key contribution of our paper is the derivation\nof a convex problem that learns the regressor for the documents and the placement of the prototypes\nin a single optimization. 
In particular, it places the topic prototypes such that for each document\nthe prototype of the correct topic is closer than any other prototype by a large margin. We\nshow that this optimization is a special instance of semi-definite programs [2] that can be solved\nefficiently [16] for large data sets.\nOur paper is structured as follows: In section 2 we introduce necessary notation and a first version\nof the algorithm based on a two-step approach of first embedding the hierarchical taxonomy into a\nsemantic space and then regressing the input documents close to their respective topic prototypes.\nIn section 3 we extend our model to a single convex optimization with large margin constraints\nthat learns both steps jointly. We evaluate our method in section 4 and demonstrate\nstate-of-the-art results on eight different document categorization tasks from the OHSUMED\nmedical journal data set. Finally, we relate our method to previous work in section 5 and conclude in\nsection 6.\n\n2 Method\n\nWe assume that our input consists of documents, represented as a set of high dimensional sparse\nvectors $\\vec{x}_1, \\dots, \\vec{x}_n \\in \\mathcal{X}$ of dimensionality $d$. Typically, these could be binary bag of words\nindicators or tfidf scores.\nIn addition, the documents are accompanied by single topic labels\n$y_1, \\dots, y_n \\in \\{1, \\dots, c\\}$\nthat lie in some taxonomy $\\mathcal{T}$ with $c$ total topics. This taxonomy $\\mathcal{T}$ gives\nrise to some cost matrix $\\mathbf{C} \\in \\mathbb{R}^{c \\times c}$, where $C_{\\alpha\\beta} \\geq 0$ defines the cost of misclassifying an element\nof topic $\\alpha$ as $\\beta$ and $C_{\\alpha\\alpha} = 0$. Technically, we only require knowledge of the cost matrix $\\mathbf{C}$, which\ncould also be obtained from side-information independent of a topic taxonomy. In this paper we will\nnot focus on how $\\mathbf{C}$ is obtained. 
However, we would like to point out that a common way to infer a\ncost matrix from a taxonomy is to set $C_{\\alpha\\beta}$ to the length of the shortest path between nodes $\\alpha$ and $\\beta$,\nbut other approaches have also been studied [3].\nThroughout this paper we denote document indices as $i, j \\in \\{1, \\dots, n\\}$ and topic indices as\n$\\alpha, \\beta \\in \\{1, \\dots, c\\}$. Matrices are written in bold (e.g. $\\mathbf{C}$) and vectors have top arrows (e.g. $\\vec{x}_i$).\nFigure 1 illustrates our setup schematically. We would like to create a low dimensional semantic\nfeature space $\\mathcal{F}$ in which we represent each topic $\\alpha$ as a topic prototype $\\vec{p}_\\alpha \\in \\mathcal{F}$ and each document\n$\\vec{x}_i \\in \\mathcal{X}$ as a low dimensional vector $\\vec{z}_i \\in \\mathcal{F}$. Our goal is to discover a representation of the\ndata where distances reflect true underlying dissimilarities and proximity to prototypes indicates\ntopic membership. In other words, documents on the same or related topics should be close to the\nrespective topic prototypes, while documents on highly different topics should be well separated.\n\nThroughout this paper we will assume that $\\mathcal{F} = \\mathbb{R}^c$; however, our method can easily be adapted to\neven lower dimensional settings $\\mathcal{F} = \\mathbb{R}^r$ where $r < c$. As an essential part of our method is to\nembed the classes that are typically found in a taxonomy, we refer to our algorithm as taxem (short\nfor \u201ctaxonomy embedding\u201d).\n\nEmbedding topic prototypes\nThe first step of our algorithm is to embed the document taxonomy into a Euclidean vector\nspace. More formally, we derive topic prototypes $\\vec{p}_1, \\dots, \\vec{p}_c \\in \\mathcal{F}$ based on the cost matrix $\\mathbf{C}$,\nwhere $\\vec{p}_\\alpha$ is the prototype that represents topic $\\alpha$. To simplify notation, we define the matrix\n$\\mathbf{P} = [\\vec{p}_1, \\dots, \\vec{p}_c] \\in \\mathbb{R}^{c \\times c}$ whose columns consist of the topic prototypes.\n
There are many ways to derive the prototypes from the cost matrix $\\mathbf{C}$. By far the simplest method\nis to ignore the cost matrix $\\mathbf{C}$ entirely and let $\\mathbf{P}_I = \\mathbf{I}$, where $\\mathbf{I} \\in \\mathbb{R}^{c \\times c}$ denotes the identity matrix.\nThis results in a $c$ dimensional feature space, where the class-prototypes are all at distance $\\sqrt{2}$\nfrom each other at the corners of a $(c-1)$-dimensional simplex. We will refer to $\\mathbf{P}_I$ as the simplex\nprototypes.\nBetter results can be expected when the prototypes of similar topics are closer than those of\ndissimilar topics. We use the cost matrix $\\mathbf{C}$ as an estimate of dissimilarity and aim to place the prototypes\nsuch that the distance $\\|\\vec{p}_\\alpha - \\vec{p}_\\beta\\|_2^2$ reflects the cost specified in $C_{\\alpha\\beta}$. More formally, we set\n\n$$\\mathbf{P}_{mds} = \\operatorname{argmin}_{\\mathbf{P}} \\sum_{\\alpha,\\beta=1}^{c} \\left( \\|\\vec{p}_\\alpha - \\vec{p}_\\beta\\|_2^2 - C_{\\alpha\\beta} \\right)^2. \\tag{1}$$\n\nIf the cost matrix $\\mathbf{C}$ defines squared Euclidean distances (e.g. when the cost is obtained through the\nshortest path between nodes and then squared), we can solve eq. (1) with metric multi-dimensional\nscaling [5]. Let us denote $\\bar{\\mathbf{C}} = -\\frac{1}{2} \\mathbf{H}\\mathbf{C}\\mathbf{H}$, where the centering matrix $\\mathbf{H}$ is defined as $\\mathbf{H} = \\mathbf{I} - \\frac{1}{c} \\mathbf{1}\\mathbf{1}^\\top$,\nand let its eigenvector decomposition be $\\bar{\\mathbf{C}} = \\mathbf{V}\\mathbf{\\Lambda}\\mathbf{V}^\\top$. We obtain the solution by setting\n$\\mathbf{P}_{mds} = \\sqrt{\\mathbf{\\Lambda}}\\,\\mathbf{V}^\\top$, so that $\\mathbf{P}_{mds}^\\top \\mathbf{P}_{mds} = \\bar{\\mathbf{C}}$. We will refer to $\\mathbf{P}_{mds}$ as the mds prototypes.\u00b9\nBoth prototype embeddings $\\mathbf{P}_I$ and $\\mathbf{P}_{mds}$ are still independent of the input data $\\{\\vec{x}_i\\}$. Before we\ncan derive a more sophisticated method to place the prototypes with large margin constraints on the\ndocument vectors, we will briefly describe the mapping $\\mathbf{W} : \\mathcal{X} \\to \\mathcal{F}$ of the input documents into\nthe low dimensional feature space $\\mathcal{F}$.\n\nDocument regression\nAssume for now that we have found a suitable embedding $\\mathbf{P}$ of the class-prototypes. 
We need to\nfind an appropriate mapping $\\mathbf{W} : \\mathcal{X} \\to \\mathcal{F}$ that maps each input $\\vec{x}_i$ with label $y_i$ as close as possible\nto its topic prototype $\\vec{p}_{y_i}$. We can find such a linear transformation $\\vec{z}_i = \\mathbf{W}\\vec{x}_i$ by setting\n\n$$\\mathbf{W} = \\operatorname{argmin}_{\\mathbf{W}} \\sum_i \\|\\vec{p}_{y_i} - \\mathbf{W}\\vec{x}_i\\|^2 + \\lambda \\|\\mathbf{W}\\|_F^2. \\tag{2}$$\n\nHere, $\\lambda$ is the weight of the regularization of $\\mathbf{W}$, which is necessary to prevent potential overfitting\ndue to the high number of parameters in $\\mathbf{W}$. The minimization in eq. (2) is an instance of linear\nridge regression and has the closed form solution\n\n$$\\mathbf{W} = \\mathbf{P}\\mathbf{J}\\mathbf{X}^\\top (\\mathbf{X}\\mathbf{X}^\\top + \\lambda \\mathbf{I})^{-1}, \\tag{3}$$\n\nwhere $\\mathbf{X} = [\\vec{x}_1, \\dots, \\vec{x}_n]$ and $\\mathbf{J} \\in \\{0, 1\\}^{c \\times n}$, with $J_{\\alpha i} = 1$ if and only if $y_i = \\alpha$. Please note that\neq. (3) can be solved very accurately without the need to ever compute the $d \\times d$ matrix inverse\n$(\\mathbf{X}\\mathbf{X}^\\top + \\lambda \\mathbf{I})^{-1}$ explicitly, by solving with linear conjugate gradient for each row of $\\mathbf{W}$\nindependently.\n\nInference\nGiven an input vector $\\vec{x}_t$ we first map it into $\\mathcal{F}$ and estimate its label as the topic with the closest\nprototype $\\vec{p}_\\alpha$\n\n$$\\hat{y}_t = \\operatorname{argmin}_\\alpha \\|\\vec{p}_\\alpha - \\mathbf{W}\\vec{x}_t\\|^2. \\tag{4}$$\n\nFor a given set of labeled documents $(\\vec{x}_1, y_1), \\dots, (\\vec{x}_n, y_n)$ we measure the quality of our semantic\nspace with the averaged cost-sensitive misclassification loss,\n\n$$E = \\frac{1}{n} \\sum_{i=1}^{n} C_{y_i \\hat{y}_i}. \\tag{5}$$\n\n\u00b9 If $\\bar{\\mathbf{C}}$ does not contain Euclidean distances one can use the common approximation of forcing negative\neigenvalues in $\\mathbf{\\Lambda}$ to zero and thereby fall back onto the projection of $\\mathbf{C}$ onto the cone of positive semi-definite\nmatrices.\n\n
[Figure 2 panels: topic taxonomy $\\mathcal{T}$; high dimensional input space $\\mathcal{X}$; low dimensional semantic space $\\mathcal{F}$ with class prototypes, embedded inputs and large margin.]\n\nFigure 2: The schematic layout of the large-margin embedding of the taxonomy and the documents.\nAs a first step, we represent topic $\\alpha$ as the vector $\\vec{e}_\\alpha$ and document $\\vec{x}_i$ as $\\vec{x}'_i = \\mathbf{A}\\vec{x}_i$. We then learn the\nmatrix $\\mathbf{P}$ whose columns are the prototypes $\\vec{p}_\\alpha = \\mathbf{P}\\vec{e}_\\alpha$ and which defines the final transformation\nof the documents $\\vec{z}_i = \\mathbf{P}\\vec{x}'_i$. This final transformation is learned such that the correct prototype $\\vec{p}_{y_i}$\nis closer to $\\vec{z}_i$ than any other prototype $\\vec{p}_\\alpha$ by a large margin.\n\n3 Large Margin Prototypes\n\nSo far we have introduced a two-step approach: First, we find the prototypes $\\mathbf{P}$ based on the cost\nmatrix $\\mathbf{C}$, then we learn the mapping $\\vec{x} \\mapsto \\mathbf{W}\\vec{x}$ that maps each input closest to the prototype of its\nclass. However, learning the prototypes independently of the data $\\{\\vec{x}_i\\}$ is far from optimal for\nreducing the loss in (5). In this section we will create a joint optimization problem that places the\nprototypes $\\mathbf{P}$ and learns the mapping $\\mathbf{W}$ while minimizing an upper bound on (5).\n\nCombined learning\nIn our attempt to learn both mappings jointly, we are faced with a \u201cchicken and egg\u201d problem. We\nwant to map the input documents closest to their prototypes and at the same time place the prototypes\nwhere the documents of the respective topic are mapped to. Therefore our first task is to disentangle\nthis mutual dependency of $\\mathbf{W}$ and $\\mathbf{P}$. Let us define $\\mathbf{A}$ as the following matrix product:\n\n$$\\mathbf{A} = \\mathbf{J}\\mathbf{X}^\\top (\\mathbf{X}\\mathbf{X}^\\top + \\lambda \\mathbf{I})^{-1}. \\tag{6}$$\n\nIt follows immediately from eqs. (3) and (6) that $\\mathbf{W} = \\mathbf{P}\\mathbf{A}$. Note that eq. (6) is entirely independent\nof $\\mathbf{P}$ and can be pre-computed before the prototypes have been positioned. 
With this relation we\nhave reduced the problem of determining $\\mathbf{W}$ and $\\mathbf{P}$ to the single problem of determining $\\mathbf{P}$. Figure 2\nillustrates the new schematic layout of the algorithm.\nLet $\\vec{x}'_i = \\mathbf{A}\\vec{x}_i$ and let $\\vec{e}_\\alpha = [0, \\dots, 1, \\dots, 0]^\\top$ be the vector with all zeros and a single 1 in the $\\alpha$th\nposition. We can then rewrite both the topic prototypes $\\vec{p}_\\alpha$ and the low dimensional documents $\\vec{z}_i$\nas vectors within the range of $\\mathbf{P} : \\mathbb{R}^c \\to \\mathbb{R}^c$:\n\n$$\\vec{p}_\\alpha = \\mathbf{P}\\vec{e}_\\alpha, \\quad \\text{and} \\quad \\vec{z}_i = \\mathbf{P}\\vec{x}'_i. \\tag{7}$$\n\nOptimization\nIdeally we would like to learn $\\mathbf{P}$ to minimize (5) directly. However, this function is non-continuous\nand non-differentiable. For this reason we will derive a surrogate loss function that strictly\nbounds (5) from above.\n\nThe loss for a specific document $\\vec{x}_i$ is zero if its corresponding vector $\\vec{z}_i$ is closer to the correct\nprototype $\\vec{p}_{y_i}$ than to any other prototype $\\vec{p}_\\alpha$. For better generalization it would be preferable if\nprototype $\\vec{p}_{y_i}$ was in fact closer by a large margin. We can go even further and demand that\nprototypes that would incur a larger misclassification loss should be further separated than those\nwith a small cost. More explicitly, we will try to enforce a margin of $C_{y_i\\alpha}$. We can express this\ncondition as a set of \u201csoft\u201d inequality constraints, in terms of squared distances,\n\n$$\\forall i, \\alpha \\neq y_i: \\quad \\|\\mathbf{P}(\\vec{e}_{y_i} - \\vec{x}'_i)\\|_2^2 + C_{y_i\\alpha} \\leq \\|\\mathbf{P}(\\vec{e}_\\alpha - \\vec{x}'_i)\\|_2^2 + \\xi_{i\\alpha}, \\tag{8}$$\n\nwhere the slack variable $\\xi_{i\\alpha} \\geq 0$ absorbs the amount of violation of prototype $\\vec{p}_\\alpha$ into the margin of\n$\\vec{x}'_i$. 
Given this formulation, we create an upper bound on the loss function (5):\n\nTheorem 1 Given a prototype matrix $\\mathbf{P}$, the training error (5) is bounded above by $\\frac{1}{n} \\sum_{i,\\alpha} \\xi_{i\\alpha}$.\n\nProof: First, note that we can rewrite the assignment of the closest prototype (4) as\n$\\hat{y}_i = \\operatorname{argmin}_\\alpha \\|\\mathbf{P}(\\vec{e}_\\alpha - \\vec{x}'_i)\\|^2$. It follows that $\\|\\mathbf{P}(\\vec{e}_{y_i} - \\vec{x}'_i)\\|_2^2 - \\|\\mathbf{P}(\\vec{e}_{\\hat{y}_i} - \\vec{x}'_i)\\|_2^2 \\geq 0$ for all $i$ (with\nequality when $\\hat{y}_i = y_i$). We therefore obtain:\n\n$$\\xi_{i\\hat{y}_i} = \\|\\mathbf{P}(\\vec{e}_{y_i} - \\vec{x}'_i)\\|_2^2 + C_{y_i\\hat{y}_i} - \\|\\mathbf{P}(\\vec{e}_{\\hat{y}_i} - \\vec{x}'_i)\\|_2^2 \\geq C_{y_i\\hat{y}_i}. \\tag{9}$$\n\nThe result follows immediately from (9) and the fact that $\\xi_{i\\alpha} \\geq 0$:\n\n$$\\sum_{i,\\alpha} \\xi_{i\\alpha} \\geq \\sum_i \\xi_{i\\hat{y}_i} \\geq \\sum_i C_{y_i\\hat{y}_i}. \\tag{10}$$\n\nTheorem 1, together with the constraints in eq. (8), allows us to create an optimization problem that\nminimizes an upper bound on the average loss in eq. (5) with maximum-margin constraints:\n\n$$\\begin{aligned} \\text{Minimize}_{\\mathbf{P}} \\quad & \\sum_{i,\\alpha} \\xi_{i\\alpha} \\quad \\text{subject to:} \\\\ (1) \\quad & \\|\\mathbf{P}(\\vec{e}_{y_i} - \\vec{x}'_i)\\|_2^2 + C_{y_i\\alpha} \\leq \\|\\mathbf{P}(\\vec{e}_\\alpha - \\vec{x}'_i)\\|_2^2 + \\xi_{i\\alpha} \\\\ (2) \\quad & \\xi_{i\\alpha} \\geq 0 \\end{aligned} \\tag{11}$$\n\nNote that if we have a very large number of classes, it might be beneficial to choose $\\mathbf{P} \\in \\mathbb{R}^{r \\times c}$ with\n$r < c$. However, the convex formulation described in the next paragraph requires $\\mathbf{P}$ to be square.\n\nConvex formulation\nThe optimization in eq. 
(11) is not convex. The constraints of type (8) are quadratic with respect to\n$\\mathbf{P}$. Intuitively, any solution $\\mathbf{P}$ gives rise to infinitely many solutions as any rotation of $\\mathbf{P}$ results in\nthe same objective value and also satisfies all constraints. We can make (11) invariant to rotation by\ndefining $\\mathbf{Q} = \\mathbf{P}^\\top\\mathbf{P}$, and rewriting all distances in terms of $\\mathbf{Q}$,\n\n$$\\|\\mathbf{P}(\\vec{e}_\\alpha - \\vec{x}'_i)\\|_2^2 = (\\vec{e}_\\alpha - \\vec{x}'_i)^\\top \\mathbf{Q} (\\vec{e}_\\alpha - \\vec{x}'_i) = \\|\\vec{e}_\\alpha - \\vec{x}'_i\\|_{\\mathbf{Q}}^2. \\tag{12}$$\n\nNote that the distance formulation in eq. (12) is linear with respect to $\\mathbf{Q}$. As long as the matrix $\\mathbf{Q}$\nis positive semi-definite, we can re-decompose it into $\\mathbf{Q} = \\mathbf{P}^\\top\\mathbf{P}$. Hence, we enforce positive\nsemi-definiteness of $\\mathbf{Q}$ by adding the constraint $\\mathbf{Q} \\succeq 0$. We can now solve (11) in terms of $\\mathbf{Q}$ instead of\n$\\mathbf{P}$ with the large-margin constraints\n\n$$\\forall i, \\alpha \\neq y_i: \\quad \\|\\vec{e}_{y_i} - \\vec{x}'_i\\|_{\\mathbf{Q}}^2 + C_{y_i\\alpha} \\leq \\|\\vec{e}_\\alpha - \\vec{x}'_i\\|_{\\mathbf{Q}}^2 + \\xi_{i\\alpha}. \\tag{13}$$\n\nRegularization\nIf the size of the training data $n$ is small compared to the number of parameters $c^2$, we might run\ninto problems of overfitting to the training data set. To counter those effects, we add a regularization\nterm to the objective function.\nEven if the training data might differ from the test data, we know that the taxonomy does not change.\nIt is straightforward to verify that if the mapping $\\mathbf{A}$ were perfect, i.e. for all $i$ we have $\\vec{x}'_i = \\vec{e}_{y_i}$, $\\mathbf{P}_{mds}$\nsatisfies all constraints (8) as equalities with zero slack. This gives us confidence that the optimal\nsolution $\\mathbf{P}$ for the test data should not deviate too much from $\\mathbf{P}_{mds}$. 
We will therefore penalize\n$\\|\\mathbf{Q} - \\bar{\\mathbf{C}}\\|_F^2$, where $\\bar{\\mathbf{C}} = \\mathbf{P}_{mds}^\\top \\mathbf{P}_{mds}$ (as defined in section 2). The final convex optimization of\ntaxem with regularized objective becomes:\n\n$$\\begin{aligned} \\text{Minimize}_{\\mathbf{Q}} \\quad & (1 - \\mu) \\sum_{i,\\alpha} \\xi_{i\\alpha} + \\mu \\|\\mathbf{Q} - \\bar{\\mathbf{C}}\\|_F^2 \\quad \\text{subject to:} \\\\ (1) \\quad & \\|\\vec{e}_{y_i} - \\vec{x}'_i\\|_{\\mathbf{Q}}^2 + C_{y_i\\alpha} \\leq \\|\\vec{e}_\\alpha - \\vec{x}'_i\\|_{\\mathbf{Q}}^2 + \\xi_{i\\alpha} \\\\ (2) \\quad & \\xi_{i\\alpha} \\geq 0 \\\\ (3) \\quad & \\mathbf{Q} \\succeq 0 \\end{aligned} \\tag{14}$$\n\nThe constant $\\mu \\in [0, 1]$ regulates the impact of the regularization term. The optimization in (14) is\nan instance of a semidefinite program (SDP) [2]. Although SDPs can often be expensive to solve,\nthe optimization (14) falls into a special category\u00b2 and can be solved very efficiently with special\npurpose sub-gradient solvers even with millions of constraints [16]. Once the optimal solution $\\mathbf{Q}^*$\nis found, one can obtain the position of the prototypes with a simple SVD or Cholesky decomposition\n$\\mathbf{Q}^* = \\mathbf{P}^\\top\\mathbf{P}$ and consequently also obtain the mapping $\\mathbf{W}$ from $\\mathbf{W} = \\mathbf{P}\\mathbf{A}$.\n\nTop category   A     B     C     D     E     F     G     H\n# samples n    7544  4772  4858  2701  7300  1961  8694  8155\n# topics c     424   160   453   339   457   151   425   150\n# nodes        519   312   610   608   559   218   533   170\n\nTable 1: Statistics of the different OHSUMED problems. Note that not all nodes are populated and\nthat we pruned all strictly un-populated subtrees.\n\n4 Results\n\nWe evaluated our algorithm taxem on several classification problems derived from categorizing\npublications in the public OHSUMED medical journal data base into the Medical Subject Headings\n(MeSH) taxonomy.\n\nSetup and data set description\nWe used the OHSUMED 87 corpus [9], which consists of abstracts and titles of medical\npublications. 
Each of these entries has been assigned one or more categories in the MeSH taxonomy\u00b3. We\nused the 2001 version of these MeSH headings, resulting in about 20k categories organized in a\ntaxonomy. To preprocess the data we proceeded as follows: First, we discarded all database entries with\nempty abstracts, which left us with 36890 documents. We tokenized (after stop word removal and\nstemming) each abstract, and represented the corresponding bag of words as its $d = 60727$\ndimensional tfidf scores (normalized to unit length). We removed all topic categories that did not appear\nin the MeSH taxonomy (due to out-dated topic names). We further removed all subtrees of nodes\nthat were populated with one or fewer documents. The top categories in the OHSUMED data base are\n\u201corthogonal\u201d; for instance, the B top level category is about organisms while C is about diseases.\nWe thus created 8 independent classification problems out of the top categories A, B, C, D, E, F, G, H.\nFor each problem, we kept only the abstracts that were assigned exactly one category in that tree,\nmaking each problem single-label. The statistics of the different problems are summarized in\nTable 1. For each problem, we created a 70%/30% random split into training and test samples, ensuring\nhowever that each topic had at least one document in the training corpus.\n\nDocument Categorization\nThe classification results on the OHSUMED data set are summarized in Table 2. We set the\nregularization constants to be $\\lambda = 1$ for the regression and $\\mu = 0.1$ for the SDP. Preliminary experiments\non data set B showed that regularization was important but the exact settings of $\\mu$ and $\\lambda$ had\nno crucial impact. We derived the cost matrix $\\mathbf{C}$ from the tree hop-distance in all experiments. 
\n\n\u00b2 The solver described in [16] utilizes that many constraints are inactive and that the sub-gradient can be\nefficiently derived from previous gradient steps.\n\n\u00b3 See http://en.wikipedia.org/wiki/Medical_Subject_Headings\n\ndata  SVM 1/all  MCSVM  SVM cost  SVM tax  PI-taxem  Pmds-taxem  LM-taxem\nA     2.17       2.13   2.11      1.96     2.11      2.33        1.95\nB     1.50       1.38   1.64      1.52     1.57      1.99        1.39\nC     2.41       2.32   2.25      2.25     2.30      2.61        2.16\nD     3.10       2.76   2.92      2.82     2.82      3.05        2.66\nE     3.44       3.42   3.26      3.25     3.45      3.05        3.05\nF     2.59       2.65   2.66      2.69     2.63      2.77        2.51\nG     3.98       4.12   3.89      3.82     4.10      3.63        3.59\nH     2.42       2.48   2.40      2.32     2.45      2.24        2.17\nall   2.78       2.77   2.77      2.65     2.79      2.73        2.50\n\nTable 2: The cost-sensitive test error results on various OHSUMED classification data sets. The\nalgorithms are from left to right: one vs. all SVM, MCSVM [6], cost-sensitive MCSVM, Hierarchical\nSVM [4], simplex regression, mds regression, large-margin taxem. The best results (up to statistical\nsignificance) are highlighted in bold. The taxem algorithm obtains the lowest overall loss and the\nlowest individual loss on each data set except B.\n\nWe compared taxem against four commonly used algorithms for document categorization: 1. a\nlinear support vector machine (SVM) trained in one vs. all mode (SVM 1/all) [12], 2. the Crammer\nand Singer multi-class SVM formulation (MCSVM) [6], 3. the Cai and Hofmann SVM classifier\nwith cost-sensitive loss function (SVM cost) [4], 4. the Cai and Hofmann SVM formulation with\na cost-sensitive hierarchical loss function (SVM tax) [4]. All SVM classifiers were trained with\nregularization constant C = 1 (which worked best on problem B; this value is also commonly used\nin text classification when the documents have unit length). 
Further, we also evaluated the\ndifference between our large margin formulation (taxem) and the results with the simplex (PI-taxem)\nand mds (Pmds-taxem) prototypes. To check the significance of our results we applied a standard\nt-test at the 5% significance level. The best results up to statistical significance are highlighted in\nbold font. The final entry in Table 2 shows the average error over all test points in all data sets. Up\nto statistical significance, taxem obtains the lowest loss on all data sets and the lowest overall loss.\nIgnoring statistical significance, taxem has the lowest loss on all data sets except B. All algorithms\nhad comparable speed during test-time. The computation time required for solving eq. (6) and the\noptimization (14) was on the order of several minutes with our MATLAB implementation on a\nstandard Intel 1.8GHz Core 2 Duo processor (without parallelization efforts).\n\n5 Related Work\n\nIn recent years, several algorithms for document categorization have been proposed. Several authors\nproposed adaptations of support vector machines that incorporate the topic taxonomy through\ncost-sensitive loss re-weighting and classification at multiple nodes in the hierarchy [4, 8, 11]. Our\nalgorithm is based on a very different intuition. It differs from all these methods in that it learns\na low dimensional semantic representation of the documents and classifies by finding the nearest\nprototype.\nMost related to our work is probably the work by Karypis and Han [10]. Although their algorithm\nalso reduces the dimensionality with a linear projection, their low dimensional space is obtained\nthrough supervised clustering on the document data. In contrast, the semantic space of\ntaxem is obtained through a convex optimization with maximum margin constraints. 
Further, the\nlow dimensional representation of our method is explicitly constructed to give rise to meaningful\nEuclidean distances.\nThe optimization with large-margin constraints was partially inspired by recent work on large margin\ndistance metric learning for nearest neighbor classification [16]. However, our formulation is a much\nmore light-weight optimization problem with $O(cn)$ constraints instead of $O(n^2)$ as in [16]. The\noptimization problem in section 3 is also related to recent work on automated speech recognition\nthrough discriminative training of Gaussian mixture models [14].\n\n6 Conclusion\n\nIn this paper, we have presented a novel framework for classification with inter-class relationships\nbased on taxonomy embedding and supervised dimensionality reduction. We derived a single convex\noptimization problem that learns an embedding of the topic taxonomy as well as a linear mapping\nfrom the document space to the resulting low dimensional semantic space.\nAs future work we are planning to extend our algorithm to the more general setting of document\ncategorization with multiple topic memberships and multi-modal topic distributions. Further, we\nare keen to explore the implications of our proposed conversion of discrete topic taxonomies into\ncontinuous semantic spaces. This framework opens new interesting directions of research that go\nbeyond mere classification. A natural step is to consider the document matching problem (e.g.\nof web pages and advertisements) in the semantic space: a fast nearest neighbor search can be\nperformed in a joint low dimensional space without having to resort to classification altogether.\nAlthough this paper is presented in the context of document categorization, it is important to\nemphasize that our method is by no means limited to text data or class hierarchies. 
In fact, the proposed\nalgorithm can be applied in almost all multi-class settings with cost-sensitive loss functions (e.g.\nobject recognition in computer vision).\n\nReferences\n[1] D. Blei, A. Ng, M. Jordan, and J. Lafferty. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4-5):993\u20131022, 2003.\n\n[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[3] A. Budanitsky and G. Hirst. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, in the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.\n\n[4] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In ACM 13th Conference on Information and Knowledge Management, 2004.\n\n[5] T. Cox and M. Cox. Multidimensional Scaling. Chapman & Hall, London, 1994.\n\n[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265\u2013292, 2001.\n\n[7] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391\u2013407, 1990.\n\n[8] S. Dumais and H. Chen. Hierarchical classification of Web content. In Proceedings of SIGIR-00, 23rd ACM International Conference on Research and Development in Information Retrieval, pages 256\u2013263. ACM Press, New York, US, 2000.\n\n[9] W. Hersh, C. Buckley, T. J. Leone, and D. Hickam. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In SIGIR \u201994: Proceedings of the 17th annual international ACM conference on Research and development in information retrieval, pages 192\u2013201. Springer-Verlag New York, Inc., 1994.\n\n[10] G. Karypis and E.-H. Han. 
Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval & categorization. Technical Report 00-016, 2000.\n\n[11] T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations Newsletter, 7(1):36\u201343, 2005.\n\n[12] R. Rifkin and A. Klautau. In Defense of One-Vs-All Classification. The Journal of Machine Learning Research, 5:101\u2013141, 2004.\n\n[13] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1\u201347, 2002.\n\n[14] F. Sha and L. K. Saul. Large margin hidden Markov models for automatic speech recognition. In Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007. MIT Press.\n\n[15] A. Weigend, E. Wiener, and J. Pedersen. Exploiting Hierarchy in Text Categorization. Information Retrieval, 1(3):193\u2013216, 1999.\n\n[16] K. Q. Weinberger and L. K. Saul. Fast solvers and efficient implementations for distance metric learning. pages 1160\u20131167, 2008.\n", "award": [], "sourceid": 350, "authors": [{"given_name": "Kilian", "family_name": "Weinberger", "institution": null}, {"given_name": "Olivier", "family_name": "Chapelle", "institution": null}]}