{"title": "On Flat versus Hierarchical Classification in Large-Scale Taxonomies", "book": "Advances in Neural Information Processing Systems", "page_first": 1824, "page_last": 1832, "abstract": "We study in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies. To this end, we first propose a multiclass, hierarchical data dependent bound on the generalization error of classifiers deployed in large-scale taxonomies. This bound provides an explanation to several empirical results reported in the literature, related to the performance of flat and hierarchical classifiers. We then introduce another type of bounds targeting the approximation error of a family of classifiers, and derive from it features used in a meta-classifier to decide which nodes to prune (or flatten) in a large-scale taxonomy. We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies.", "full_text": "On Flat versus Hierarchical Classi\ufb01cation in\n\nLarge-Scale Taxonomies\n\nRohit Babbar, Ioannis Partalas, Eric Gaussier, Massih-Reza Amini\n\nUniversit\u00e9 Joseph Fourier, Laboratoire Informatique de Grenoble\n\nBP 53 - F-38041 Grenoble Cedex 9\n\nfirstname.lastname@imag.fr\n\nAbstract\n\nWe study in this paper \ufb02at and hierarchical classi\ufb01cation strategies in the context\nof large-scale taxonomies. To this end, we \ufb01rst propose a multiclass, hierarchi-\ncal data dependent bound on the generalization error of classi\ufb01ers deployed in\nlarge-scale taxonomies. This bound provides an explanation to several empirical\nresults reported in the literature, related to the performance of \ufb02at and hierarchical\nclassi\ufb01ers. We then introduce another type of bound targeting the approximation\nerror of a family of classi\ufb01ers, and derive from it features used in a meta-classi\ufb01er\nto decide which nodes to prune (or \ufb02atten) in a large-scale taxonomy. 
We finally illustrate the theoretical developments through several experiments conducted on two widely used taxonomies.

1 Introduction

Large-scale classification of textual and visual data into a large number of target classes has been the focus of several studies, from researchers and developers in industry and academia alike. The target classes in such large-scale scenarios typically have an inherent hierarchical structure, usually in the form of a rooted tree, as in Directory Mozilla1, or a directed acyclic graph, with a parent-child relationship. Various classification techniques have been proposed for deploying classifiers in such large-scale taxonomies, from flat (sometimes referred to as big-bang) approaches to fully hierarchical ones adopting a complete top-down strategy. Several attempts have also been made to develop new classification techniques that integrate, at least partly, the hierarchy into the objective function being optimized (as in [3, 5, 10, 11], among others). These techniques are however costly in practice, and most studies rely either on a flat classifier, or on a hierarchical one deployed on the original hierarchy or on a simplified version of it obtained by pruning some nodes (as in [15, 18])2.
Hierarchical models for large-scale classification however suffer from the fact that they have to make many decisions prior to reaching a final category. This intermediate decision making leads to the error propagation phenomenon, causing a decrease in accuracy. On the other hand, flat classifiers rely on a single decision involving all the final categories, a decision that is however difficult to make as it involves many, potentially unbalanced, categories.
It is thus very difficult to assess which strategy is best, and there is at present no consensus on which approach, flat or hierarchical, should be preferred for a particular category system.
In this paper, we address this problem and introduce new bounds on the generalization error of classifiers deployed in large-scale taxonomies. These bounds make explicit the trade-off that both flat and hierarchical classifiers encounter in large-scale taxonomies and provide an explanation to several empirical findings reported in previous studies.

1www.dmoz.org
2The study in [19] introduces a slightly different simplification, through an embedding of both categories and documents into a common space.

To our knowledge, this is the first time that such bounds are introduced and that an explanation of the behavior of flat and hierarchical classifiers is based on theoretical grounds. We also propose a well-founded way to select nodes that should be pruned so as to derive a taxonomy better suited to the classification problem. Contrary to [4], which reweighs the edges in a taxonomy through a cost-sensitive loss function to achieve this goal, we use here a simple pruning strategy that modifies the taxonomy in an explicit way.
The remainder of the paper is organized as follows: Section 2 introduces the notation used and presents the generalization error bounds for classification in large-scale taxonomies. It also presents the meta-classifier we designed to select those nodes that should be pruned in the original taxonomy. Section 3 illustrates these developments via experiments conducted on several taxonomies extracted from DMOZ and the International Patent Classification. The experimental results are in line with results reported in previous studies, as well as with our theoretical developments.
Finally, Section 4 concludes this study.

2 Generalization Error Analyses

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the input space and let $V$ be a finite set of class labels. We further assume that examples are pairs $(x, v)$ drawn according to a fixed but unknown distribution $D$ over $\mathcal{X} \times V$. In the case of hierarchical classification, the hierarchy of classes $H = (V, E)$ is defined in the form of a rooted tree, with a root $\bot$ and a parent relationship $\pi : V \setminus \{\bot\} \to V$, where $\pi(v)$ is the parent of node $v \in V \setminus \{\bot\}$, and $E$ denotes the set of edges with parent-to-child orientation. For each node $v \in V \setminus \{\bot\}$, we further define the set of its sisters $S(v) = \{v' \in V \setminus \{\bot\};\ v \neq v' \wedge \pi(v) = \pi(v')\}$ and its daughters $D(v) = \{v' \in V \setminus \{\bot\};\ \pi(v') = v\}$. The nodes at the intermediary levels of the hierarchy define general class labels, while the specialized nodes at the leaf level, denoted by $Y = \{y \in V : \nexists v \in V, (y, v) \in E\} \subset V$, constitute the set of target classes. Finally, for each class $y$ in $Y$, we define the set of its ancestors $\mathcal{P}(y)$ as

$\mathcal{P}(y) = \{v_1^y, \ldots, v_{k_y}^y\ ;\ v_1^y = \pi(y) \wedge \forall l \in \{1, \ldots, k_y - 1\},\ v_{l+1}^y = \pi(v_l^y) \wedge \pi(v_{k_y}^y) = \bot\}$

For classifying an example $x$, we consider a top-down classifier making decisions at each level of the hierarchy; this process, sometimes referred to as the Pachinko machine, selects the best class at each level of the hierarchy and iteratively proceeds down the hierarchy.
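The top-down scheme just described can be sketched in a few lines (a minimal illustration, not the authors' implementation; the toy hierarchy and scoring function f below are hypothetical):

```python
# Minimal sketch of top-down "Pachinko machine" classification: starting from
# the root, repeatedly pick the daughter node with the highest score f(x, v)
# until a leaf (target class) is reached.

def pachinko_predict(x, root, daughters, f):
    """daughters: dict mapping a node to the list of its daughter nodes
    (empty list for leaves); f(x, v): score of node v for example x."""
    v = root
    while daughters[v]:                       # descend until a leaf is reached
        v = max(daughters[v], key=lambda c: f(x, c))
    return v

# Toy hierarchy: root -> {a, b}, a -> {a1, a2}, b -> {b1}
daughters = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
             "a1": [], "a2": [], "b1": []}
scores = {"a": 0.7, "b": 0.3, "a1": 0.2, "a2": 0.9, "b1": 0.5}
f = lambda x, v: scores[v]
print(pachinko_predict(None, "root", daughters, f))  # -> a2
```

Note that each prediction requires only as many decisions as the depth of the chosen path, which is what makes the scheme attractive at large scale, at the cost of possible error propagation.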
In the case of flat classification, the hierarchy $H$ is ignored, $Y = V$, and the problem reduces to the classical supervised multiclass classification problem.

2.1 A hierarchical Rademacher data-dependent bound

Our main result is the following theorem, which provides a data-dependent bound on the generalization error of a top-down multiclass hierarchical classifier. We consider here kernel-based hypotheses, with $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a PDS kernel and $\Phi : \mathcal{X} \to \mathbb{H}$ its associated feature mapping function, defined as:

$F_B = \{f : (x, v) \in \mathcal{X} \times V \mapsto \langle \Phi(x), w_v \rangle \mid W = (w_1, \ldots, w_{|V|}),\ \|W\|_H \le B\}$

where $W = (w_1, \ldots, w_{|V|})$ is the matrix formed by the $|V|$ weight vectors defining the kernel-based hypotheses, $\langle \cdot, \cdot \rangle$ denotes the dot product, and $\|W\|_H = \big(\sum_{v \in V} \|w_v\|^2\big)^{1/2}$ is the $L_2^H$ group norm of $W$. We further define the following associated function class:

$G_{F_B} = \{g_f : (x, y) \in \mathcal{X} \times Y \mapsto \min_{v \in \mathcal{P}(y)} \big(f(x, v) - \max_{v' \in S(v)} f(x, v')\big) \mid f \in F_B\}$

For a given hypothesis $f \in F_B$, the sign of its associated function $g_f \in G_{F_B}$ directly defines a hierarchical classification rule for $f$, as the top-down classification scheme outlined before simply amounts to: assign $x$ to $y$ iff $g_f(x, y) > 0$. The learning problem we address is then to find a hypothesis $f$ from $F_B$ such that the generalization error of $g_f \in G_{F_B}$, $E(g_f) = \mathbb{E}_{(x,y) \sim D}\big[1_{g_f(x,y) \le 0}\big]$, is minimal ($1_{g_f(x,y) \le 0}$ is the 0/1 loss, equal to 1 if $g_f(x, y) \le 0$ and 0 otherwise).
The following theorem sheds light on the trade-off between flat versus hierarchical classification. The notion of function class capacity used here is the empirical Rademacher complexity [1].
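The margin-style quantity underlying $g_f$ can be computed from per-node scores as the smallest gap, along a root-to-leaf path, between each node's score and the best score among its sisters. The sketch below is an assumed reading (not the authors' code); it includes the leaf itself in the path, which matches the top-down rule that every node on the path must beat its sisters:

```python
# Sketch of the hierarchical margin: along the path from just below the root
# down to leaf y, take the minimum of score(v) - max score over sisters of v.
# A positive margin means the top-down classifier would reach y.

def hierarchical_margin(path, sisters, score):
    """path: nodes from just below the root down to the leaf y;
    sisters: dict node -> list of its sister nodes; score: dict node -> f(x, v)."""
    return min(score[v] - max(score[s] for s in sisters[v]) for v in path)

# Toy example: path root -> a -> a2, with sisters a:{b} and a2:{a1}
score = {"a": 0.75, "b": 0.25, "a1": 0.125, "a2": 0.875}
sisters = {"a": ["b"], "a2": ["a1"]}
m = hierarchical_margin(["a", "a2"], sisters, score)
print(m)  # -> 0.5 (the tighter of the two gaps 0.5 and 0.75)
```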
The proof of the theorem is given in the supplementary material.

Theorem 1 Let $S = ((x^{(i)}, y^{(i)}))_{i=1}^m$ be a dataset of $m$ examples drawn i.i.d. according to a probability distribution $D$ over $\mathcal{X} \times Y$, and let $A$ be a Lipschitz function with constant $L$ dominating the 0/1 loss; further let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel and let $\Phi : \mathcal{X} \to \mathbb{H}$ be the associated feature mapping function. Assume that there exists $R > 0$ such that $K(x, x) \le R^2$ for all $x \in \mathcal{X}$. Then, for all $1 > \delta > 0$, with probability at least $(1 - \delta)$ the following hierarchical multiclass classification generalization bound holds for all $g_f \in G_{F_B}$:

$E(g_f) \le \frac{1}{m} \sum_{i=1}^m A(g_f(x^{(i)}, y^{(i)})) + \frac{8BRL}{\sqrt{m}} \sum_{v \in V \setminus Y} |D(v)|(|D(v)| - 1) + 3\sqrt{\frac{\ln(2/\delta)}{2m}} \quad (1)$

where $|D(v)|$ denotes the number of daughters of node $v$.

For flat multiclass classification, we recover the bounds of [12] by considering a hierarchy containing a root node with as many daughters as there are categories. Note that the definition of functions in $G_{F_B}$ subsumes the definition of the margin function used for the flat multiclass classification problems in [12], and that the factor $8L$ in the complexity term of the bound, instead of $4$ in [12], is due to the fact that we are using an $L$-Lipschitz loss function dominating the 0/1 loss in the empirical Rademacher complexity.
Flat vs hierarchical classification on large-scale taxonomies. The generalization error is controlled in inequality (1) by a trade-off between the empirical error and the Rademacher complexity of the class of classifiers. The Rademacher complexity term favors hierarchical classifiers over flat ones, as any split of a set of categories of size $n$ into $k$ parts $n_1, \ldots, n_k$ ($\sum_{i=1}^k n_i = n$) is such that $\sum_{i=1}^k n_i^2 \le n^2$.
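To make this comparison concrete, the following sketch (with a hypothetical toy hierarchy) computes the node-wise complexity term $\sum_{v \in V \setminus Y} |D(v)|(|D(v)|-1)$ for a tree and for its flat counterpart:

```python
# Illustrative sketch: compare the Rademacher complexity term of inequality (1)
# for a hierarchy against its flat counterpart. For a flat classifier the term
# is |Y|(|Y|-1); for a tree it is the sum of |D(v)|(|D(v)|-1) over internal
# nodes v, which is smaller whenever the daughters split the categories.

def complexity_term(daughters):
    """daughters: dict node -> list of daughter nodes (leaves map to [])."""
    return sum(len(d) * (len(d) - 1) for d in daughters.values() if d)

# A balanced 2-level tree over 9 leaves: root with 3 children, each with 3 leaves.
tree = {"root": ["a", "b", "c"],
        "a": ["a1", "a2", "a3"], "b": ["b1", "b2", "b3"], "c": ["c1", "c2", "c3"],
        "a1": [], "a2": [], "a3": [], "b1": [], "b2": [], "b3": [],
        "c1": [], "c2": [], "c3": []}
flat = {"root": [f"y{i}" for i in range(9)]}           # one root, 9 daughters

print(complexity_term(tree), complexity_term(flat))    # -> 24 72
```

The resulting ratio (24/72 here) is the quantity reported as CR in the experiments of Section 3.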
On the other hand, the empirical error term is likely to favor \ufb02at classi\ufb01ers vs\nhierarchical ones, as the latter rely on a series of decisions (as many as the length of the path from\nthe root to the chosen category in Y) and are thus more likely to make mistakes. This fact is often\nreferred to as the propagation error problem in hierarchical classi\ufb01cation.\nOn the contrary, \ufb02at classi\ufb01ers rely on a single decision and are not prone to this problem (even\nthough the decision to be made is harder). When the classi\ufb01cation problem in Y is highly unbal-\nanced, then the decision that a \ufb02at classi\ufb01er has to make is dif\ufb01cult; hierarchical classi\ufb01ers still have\nto make several decisions, but the imbalance problem is less severe on each of them. So, in this\ncase, even though the empirical error of hierarchical classi\ufb01ers may be higher than the one of \ufb02at\nones, the difference can be counterbalanced by the Rademacher complexity term, and the bound in\nTheorem 1 suggests that hierarchical classi\ufb01ers should be preferred over \ufb02at ones.\nOn the other hand, when the data is well balanced, the Rademacher complexity term may not be\nsuf\ufb01cient to overcome the difference in empirical errors due to the propagation error in hierarchical\nclassi\ufb01ers; in this case, Theorem 1 suggests that \ufb02at classi\ufb01ers should be preferred to hierarchical\nones. These results have been empirically observed in different studies on classi\ufb01cation in large-\nscale taxonomies and are further discussed in Section 3.\nSimilarly, one way to improve the accuracy of classi\ufb01ers deployed in large-scale taxonomies is to\nmodify the taxonomy by pruning (sets of) nodes [18]. 
By doing so, one flattens part of the taxonomy and once again trades off the two terms in inequality (1): pruning nodes reduces the number of decisions made by the hierarchical classifier while maintaining a reasonable Rademacher complexity. Even though it can explain several empirical results obtained so far, the bound displayed in Theorem 1 does not provide a practical way to decide whether or not to prune a node, as doing so would involve the training of many classifiers, which is impractical with large-scale taxonomies. We thus turn towards another bound in the next section that will help us design a direct and simple strategy to prune nodes in a taxonomy.

2.2 Asymptotic approximation error bounds

We now propose an asymptotic approximation error bound for a multiclass logistic regression (MLR) classifier. We first consider the flat, multiclass case ($V = Y$), and then show how the bounds can be combined in a typical top-down cascade, leading to the identification of important features that control the variation of these bounds.
Considering a pivot class $y^\star \in Y$, a MLR classifier, with parameters $\beta = \{\beta_0^y, \beta_j^y;\ y \in Y \setminus \{y^\star\}, j \in \{1, \ldots, d\}\}$, models the class posterior probabilities via a linear function in $x = (x_j)_{j=1}^d$ (see for example [13] p.
96):

$P(y|x; \beta)_{y \neq y^\star} = \frac{\exp(\beta_0^y + \sum_{j=1}^d \beta_j^y x_j)}{1 + \sum_{y' \in Y, y' \neq y^\star} \exp(\beta_0^{y'} + \sum_{j=1}^d \beta_j^{y'} x_j)}, \qquad P(y^\star|x; \beta) = \frac{1}{1 + \sum_{y' \in Y, y' \neq y^\star} \exp(\beta_0^{y'} + \sum_{j=1}^d \beta_j^{y'} x_j)}$

The parameters $\beta$ are usually fit by maximum likelihood over a training set $S$ of size $m$ (the estimates being denoted by $\hat\beta_m$ in the following), and the decision rule for this classifier consists in choosing the class with the highest class posterior probability:

$h_m(x) = \operatorname{argmax}_{y \in Y} P(y|x, \hat\beta_m)$

The following lemma states to which extent the posterior probabilities with maximum likelihood estimates $\hat\beta_m$ may deviate from their asymptotic values, obtained with the maximum likelihood estimates when the training size $m$ tends to infinity (denoted by $\hat\beta_\infty$).

Lemma 1 Let $S$ be a training set of size $m$ and let $\hat\beta_m$ be the maximum likelihood estimates of the MLR classifier over $S$. Further, let $\hat\beta_\infty$ be the maximum likelihood estimates of the parameters of MLR when $m$ tends to infinity.
For all examples $x$, let $R > 0$ be the bound such that $\forall y \in Y \setminus \{y^\star\}$, $\exp(\beta_0^y + \sum_{j=1}^d \beta_j^y x_j) < \sqrt{R}$; then for all $1 > \delta > 0$, with probability at least $(1 - \delta)$ we have:

$\forall y \in Y, \quad \big|P(y|x, \hat\beta_m) - P(y|x, \hat\beta_\infty)\big| < d\sqrt{\frac{R|Y|\sigma_0}{\delta m}}$

where $\sigma_0 = \max_{j,y} \sigma_j^y$ and $(\sigma_j^y)_{y,j}$ represent the components of the inverse (diagonal) Fisher information matrix at $\hat\beta_\infty$.
Proof (sketch) Denoting the sets of parameters $\hat\beta_m = \{\hat\beta_j^y;\ j \in \{0, \ldots, d\}, y \in Y \setminus \{y^\star\}\}$ and $\hat\beta_\infty = \{\beta_j^y;\ j \in \{0, \ldots, d\}, y \in Y \setminus \{y^\star\}\}$, and using the independence assumption and the asymptotic normality of maximum likelihood estimates (see for example [17], p. 421), we have, for $0 \le j \le d$ and $\forall y \in Y \setminus \{y^\star\}$: $\sqrt{m}(\hat\beta_j^y - \beta_j^y) \sim N(0, \sigma_j^y)$, where the $(\sigma_j^y)_{y,j}$ represent the components of the inverse (diagonal) Fisher information matrix at $\hat\beta_\infty$. Let $\sigma_0 = \max_{j,y} \sigma_j^y$. Then, using Chebyshev's inequality, for $0 \le j \le d$ and $\forall y \in Y \setminus \{y^\star\}$ we have, with probability at least $1 - \sigma_0/\epsilon^2$, $|\hat\beta_j^y - \beta_j^y| < \epsilon/\sqrt{m}$. Further, $\forall x$ and $\forall y \in Y \setminus \{y^\star\}$, $\exp(\beta_0^y + \sum_{j=1}^d \beta_j^y x_j) < \sqrt{R}$; using a Taylor development of the functions $\exp(x + \epsilon)$ and $(1 + x + \epsilon x)^{-1}$ and the union bound, one obtains that, $\forall \epsilon > 0$ and $y \in Y$, with probability at least $1 - |Y|\sigma_0/\epsilon^2$: $\big|P(y|x, \hat\beta_m) - P(y|x, \hat\beta_\infty)\big| < d\frac{\sqrt{R}}{\sqrt{m}}\epsilon$. Setting $|Y|\sigma_0/\epsilon^2$ to $\delta$ and solving for $\epsilon$ gives the result. □
Lemma 1 suggests that the predicted and asymptotic posterior probabilities are close to each other, as the quantities they are based on are close to each other.
Thus, provided that the asymptotic posterior probabilities of the best two classes, for any given $x$, are not too close to each other, the generalization error of the MLR classifier and that of its asymptotic version should be similar. Theorem 2 below states such a relationship, using a function that measures the confusion between the best two classes for the asymptotic MLR classifier, defined as:

$h_\infty(x) = \operatorname{argmax}_{y \in Y} P(y|x, \hat\beta_\infty)$

For any given $x \in \mathcal{X}$, the confusion between the best two classes is defined as follows.

Definition 1 Let $f^1_\infty(x) = \max_{y \in Y} P(y|x, \hat\beta_\infty)$ be the best class posterior probability for $x$ by the asymptotic MLR classifier, and let $f^2_\infty(x) = \max_{y \in Y \setminus h_\infty(x)} P(y|x, \hat\beta_\infty)$ be the second best class posterior probability for $x$. We define the confusion of the asymptotic MLR classifier for a category set $Y$ as:

$G_Y(\tau) = P_{(x,y) \sim D}\big(|f^1_\infty(x) - f^2_\infty(x)| < 2\tau\big)$

for a given $\tau > 0$.

The following theorem states a relationship between the generalization error of a trained MLR classifier and its asymptotic version.

Theorem 2 For a multi-class classification problem in a $d$-dimensional feature space with a training set of size $m$, $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, $x^{(i)} \in \mathcal{X}$, $y^{(i)} \in Y$, sampled i.i.d.
from a probability distribution $D$, let $h_m$ and $h_\infty$ denote the multiclass logistic regression classifiers learned from a training set of finite size $m$ and its asymptotic version respectively, and let $E(h_m)$ and $E(h_\infty)$ be their generalization errors. Then, for all $1 > \delta > 0$, with probability at least $(1 - \delta)$ we have:

$E(h_m) \le E(h_\infty) + G_Y\left(d\sqrt{\frac{R|Y|\sigma_0}{\delta m}}\right) \quad (4)$

where $\sqrt{R}$ is a bound on the function $\exp(\beta_0^y + \sum_{j=1}^d \beta_j^y x_j)$, $\forall x \in \mathcal{X}$ and $\forall y \in Y$, and $\sigma_0$ is a constant.
Proof (sketch) The difference $E(h_m) - E(h_\infty)$ is bounded by the probability that the asymptotic MLR classifier $h_\infty$ correctly classifies an example $(x, y) \in \mathcal{X} \times Y$ randomly chosen from $D$, while $h_m$ misclassifies it. Using Lemma 1, for all $\delta \in (0, 1)$, $\forall x \in \mathcal{X}$, $\forall y \in Y$, with probability at least $1 - \delta$, we have:

$\big|P(y|x, \hat\beta_m) - P(y|x, \hat\beta_\infty)\big| < d\sqrt{\frac{R|Y|\sigma_0}{\delta m}}$

Thus, the decisions made by the trained MLR classifier and its asymptotic version on an example $(x, y)$ differ only if the distance between the two best predicted classes of the asymptotic classifier is less than two times the distance between the posterior probabilities obtained with $\hat\beta_m$ and $\hat\beta_\infty$ on that example; the probability of this event is exactly $G_Y\big(d\sqrt{R|Y|\sigma_0/(\delta m)}\big)$, which upper-bounds $E(h_m) - E(h_\infty)$. □
Note that the quantity $\sigma_0$ in Theorem 2 represents the largest value of the inverse (diagonal) Fisher information matrix ([17]).
It is thus the smallest value of the (diagonal) Fisher information matrix, and is related to the smallest amount of information one has on the estimation of each parameter $\hat\beta_j^y$. This smallest amount of information is in turn related to the length (in number of occurrences) of the longest (resp. shortest) class in $Y$, denoted respectively by $n_{max}$ and $n_{min}$: the smaller they are, the larger $\sigma_0$ is likely to be.

2.3 A learning based node pruning strategy

Let us now consider a hierarchy of classes and a top-down classifier making decisions at each level of the hierarchy. A node-based pruning strategy can be easily derived from the approximation bounds above. Indeed, any node $v$ in the hierarchy $H = (V, E)$ is associated with three category sets: its sister categories together with the node itself, $S'(v) = S(v) \cup \{v\}$, its daughter categories, $D(v)$, and the union of its sister and daughter categories, denoted $F(v) = S(v) \cup D(v)$. These three sets of categories are the ones involved before and after the pruning of node $v$.
(Figure: pruning node $v$ replaces, below the root $\bot$, the two category sets $S(v) \cup \{v\}$ and $D(v)$ by the single flattened set $F(v)$.)
Let us now denote by $h^{S'_v}_m$ the MLR classifier learned from the set of sister categories of node $v$ and the node itself, and by $h^{D_v}_m$ a MLR classifier learned from the set of daughter categories of node $v$ ($h^{S'_v}_\infty$ and $h^{D_v}_\infty$ respectively denote their asymptotic versions). The following theorem is a direct extension of Theorem 2 to this setting.

Theorem 3 With the notations defined above, for MLR classifiers, $\forall \epsilon > 0$, $v \in V \setminus Y$, one has, with probability at least $1 - \left(\frac{Rd^2|S'(v)|\sigma_0^{S'(v)}}{m_{S'(v)}\epsilon^2} + \frac{Rd^2|D(v)|\sigma_0^{D(v)}}{m_{D(v)}\epsilon^2}\right)$:

$E(h^{S'_v}_m) + E(h^{D_v}_m) \le E(h^{S'_v}_\infty) + E(h^{D_v}_\infty) + G_{S'(v)}(\epsilon) + G_{D(v)}(\epsilon)$

where $\{|Y^\ell|, m_{Y^\ell}, \sigma_0^{Y^\ell};\ Y^\ell \in \{S'(v), D(v)\}\}$ are constants related to the sets of categories $Y^\ell \in \{S'(v), D(v)\}$ and involved in the respective bounds stated in Theorem 2. Denoting by $h^{F_v}_m$ the MLR classifier trained on the set $F(v)$ and by $h^{F_v}_\infty$ its asymptotic version, Theorem 3 suggests that one should prune node $v$ if:

$G_{F(v)}(\epsilon) \le G_{S'(v)}(\epsilon) + G_{D(v)}(\epsilon) \quad \text{and} \quad \frac{|F(v)|\sigma_0^{F(v)}}{m_{F(v)}} \le \frac{|S'(v)|\sigma_0^{S'(v)}}{m_{S'(v)}} + \frac{|D(v)|\sigma_0^{D(v)}}{m_{D(v)}} \quad (5)$

Furthermore, the bounds obtained rely on the union bound and are thus not likely to be exploitable in practice. They nevertheless exhibit the factors that play an important role in assessing whether a particular trained classifier in the logistic regression family is close or not to its asymptotic version. Each node $v \in V$ can then be characterized by the factors in the set $\{|Y^\ell|, m_{Y^\ell}, n^{Y^\ell}_{max}, n^{Y^\ell}_{min}, G_{Y^\ell}(\cdot) \mid Y^\ell \in \{S'(v), D(v), F(v)\}\}$, which are involved in the estimation of inequalities (5) above. We propose to estimate the confusion term $G_{Y^\ell}(\cdot)$
with two simple quantities: the average cosine similarity of all the pairs of classes in $Y^\ell$, and the average symmetric Kullback-Leibler divergence between all the pairs in $Y^\ell$ of class-conditional multinomial distributions.
The procedure for collecting training data associates a positive (resp. negative) class to a node if the pruning of that node leads to a final performance increase (resp. decrease). A meta-classifier is then trained on these features using a training set from a selected class hierarchy. After the learning phase, the meta-classifier is applied to each node of a new hierarchy of classes so as to identify which nodes should be pruned. A simple strategy to adopt is then to prune nodes in sequence: starting from the root node, the algorithm checks which children of a given node $v$ should be pruned by creating the corresponding meta-instance and feeding it to the meta-classifier; the child that maximizes the probability of the positive class is then pruned; as the set of categories has changed, we recalculate which children of $v$ can be pruned, prune the best one (as above) and iterate this process until no more children of $v$ can be pruned; we then proceed to the children of $v$ and repeat the process.

3 Discussion

We start our discussion by presenting results on different hierarchical datasets with different characteristics using MLR and SVM classifiers. The datasets we used in these experiments are two large datasets extracted from the International Patent Classification (IPC) dataset3 and the publicly available DMOZ dataset from the second PASCAL large scale hierarchical text classification challenge (LSHTC2)4. Both datasets are multi-class; IPC is single-label and LSHTC2 multi-label, with an average of 1.02 categories per class. We created 4 datasets from LSHTC2 by randomly splitting the first-layer nodes (11 in total) of the original hierarchy into disjoint subsets.
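The sequential pruning procedure described at the end of Section 2.3 can be sketched as follows. This is a hypothetical sketch, not the authors' implementation; `meta_prob` stands in for the trained meta-classifier's positive-class probability on a node's meta-instance:

```python
# Hypothetical sketch of the sequential pruning strategy: at each node v,
# repeatedly prune the child that the meta-classifier deems most beneficial
# to prune (highest positive-class probability), until no child qualifies;
# then recurse into the remaining children.

def prune_hierarchy(v, daughters, meta_prob, threshold=0.5):
    """daughters: dict node -> list of children (mutated in place);
    meta_prob(node): assumed meta-classifier positive-class probability."""
    while True:
        # only internal nodes (nodes with daughters) are candidates for pruning
        candidates = [(meta_prob(c), c) for c in daughters[v] if daughters[c]]
        candidates = [(p, c) for p, c in candidates if p > threshold]
        if not candidates:
            break
        _, best = max(candidates)
        # pruning: the children of `best` are attached directly to v
        daughters[v].remove(best)
        daughters[v].extend(daughters.pop(best))
    for c in daughters[v]:            # proceed to the children of v
        if daughters.get(c):
            prune_hierarchy(c, daughters, meta_prob, threshold)

# Toy run: node "a" is predicted worth pruning, so its children move up.
daughters = {"root": ["a", "b"], "a": ["a1", "a2"], "b": [], "a1": [], "a2": []}
meta_prob = lambda n: 0.9 if n == "a" else 0.1
prune_hierarchy("root", daughters, meta_prob)
print(daughters["root"])  # -> ['b', 'a1', 'a2']
```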
The classes for the IPC and LSHTC2 datasets are organized in a hierarchy in which the documents are assigned to the leaf categories only. Table 1 presents the characteristics of the datasets.
CR denotes the complexity ratio between hierarchical and flat classification, given by the Rademacher complexity term in Theorem 1: $\big(\sum_{v \in V \setminus Y} |D(v)|(|D(v)| - 1)\big) / \big(|Y|(|Y| - 1)\big)$; the same constants $B$, $R$ and $L$ are used in the two cases. As one can note, this complexity ratio always goes in favor of the hierarchical strategy, although it is 2 to 10 times higher on the IPC dataset compared to LSHTC2-1,2,3,4,5. On the other hand, the ratio of empirical errors (last column of Table 1) obtained with top-down hierarchical classification over flat classification when using SVM with a linear kernel is this time higher than 1, suggesting the opposite conclusion. The error ratio is furthermore much larger on IPC than on LSHTC2-1,2,3,4,5.

3http://www.wipo.int/classifications/ipc/en/support/
4http://lshtc.iit.demokritos.gr/

Dataset    # Tr.   # Test  # Classes  # Feat.    Depth  CR     Error ratio
LSHTC2-1   25,310   6,441   1,789       145,859  6      0.008   1.24
LSHTC2-2   50,558  13,057   4,787       271,557  6      0.003   1.32
LSHTC2-3   38,725  10,102   3,956       145,354  6      0.004   2.65
LSHTC2-4   27,924   7,026   2,544       123,953  6      0.005   1.8
LSHTC2-5   68,367  17,561   7,212       192,259  6      0.002   2.12
IPC        46,324  28,926     451     1,123,497  4      0.02   12.27

Table 1: Datasets used in our experiments along with their properties: number of training examples, test examples, classes, size of the feature space, depth of the hierarchy, the complexity ratio of the hierarchical over the flat case ($\sum_{v \in V \setminus Y} |D(v)|(|D(v)| - 1)/|Y|(|Y| - 1)$), and the ratio of empirical errors of the hierarchical and flat models.
The comparison of the complexity and error ratios on all the datasets thus suggests that the flat classification strategy may be preferred on IPC, whereas the hierarchical one is more likely to be efficient on the LSHTC datasets. This is indeed the case, as shown below.
To test our simple node pruning strategy, we learned binary classifiers aiming at deciding whether to prune a node, based on the node features described in the previous section. The label associated with each node in this training set is defined as +1 if pruning the node increases the accuracy of the hierarchical classifier by at least 0.1, and -1 if pruning the node decreases the accuracy by more than 0.1. The threshold at 0.1 is used to avoid too much noise in the training set. The meta-classifier is then trained to learn a mapping from the vector representation of a node (based on the above features) to the labels {+1; -1}. We used the first two datasets of LSHTC2 to extract the training data, while LSHTC2-3, 4, 5 and IPC were employed for testing.
The procedure for collecting training data is repeated for the MLR and SVM classifiers, resulting in three meta-datasets of 119 (19 positive and 100 negative), 89 (34 positive and 55 negative) and 94 (32 positive and 62 negative) examples respectively. For the binary classifiers, we used AdaBoost with random forest as a base classifier, setting the number of trees to 20, 50 and 50 for the MLR and SVM classifiers respectively and leaving the other parameters at their default values. Several values have been tested for the number of trees ({10, 20, 50, 100 and 200}), the depth of the trees ({unrestricted, 5, 10, 15, 30, 60}), as well as the number of iterations in AdaBoost ({10, 20, 30}).
The final values were selected by cross-validation on the training set (LSHTC2-1 and LSHTC2-2) as the ones that maximized accuracy and minimized the false-positive rate, in order to prevent a degradation of accuracy.
We compare the fully flat classifier (FL) with the fully hierarchical (FH) top-down Pachinko machine, a random pruning (RN) and the proposed pruning method (PR). For the random pruning, we restrict the procedure to the first two levels and perform 4 random prunings (this is the average number of prunings performed by our approach). For each dataset we perform 5 independent runs of the random pruning and record the best performance. For MLR and SVM, we use the LibLinear library [8] and apply the L2-regularized versions, setting the penalty parameter C by cross-validation.
The results on LSHTC2-3,4,5 and IPC are reported in Table 2. On all LSHTC datasets, flat classification performs worse than fully hierarchical top-down classification, for all classifiers. These results are in line with the complexity and empirical error ratios for SVM estimated on the different collections and shown in Table 1, as well as with the results obtained in [14, 7] over the same type of taxonomies. Further, the work in [14] demonstrated that class hierarchies on the LSHTC datasets suffer from the rare-categories problem, i.e., 80% of the target categories in such hierarchies have fewer than 5 documents assigned to them.
As a result, flat methods on such datasets face unbalanced classification problems, which results in smaller error ratios; hierarchical classification should be preferred in this case.
On the other hand, for hierarchies such as the one of IPC, which are relatively well balanced and do not suffer from the rare-categories phenomenon, flat classification performs on par with or even better than hierarchical classification. This is in agreement with the conclusions obtained in recent studies, such as [2, 9, 16, 6], in which the datasets considered do not have rare categories and are better balanced.

      LSHTC2-3          LSHTC2-4          LSHTC2-5          IPC
      MLR      SVM      MLR      SVM      MLR      SVM      SVM     MLR
FL    0.528↓↓  0.535↓↓  0.497↓↓  0.501↓↓  0.542↓↓  0.547↓↓  0.546   0.446
RN    0.493↓↓  0.517↓↓  0.478↓↓  0.484↓↓  0.532↓↓  0.536↓   0.547↓  0.458↓↓
FH    0.484↓↓  0.498↓↓  0.473↓↓  0.476↓   0.526↓   0.527    0.552↓  0.465↓↓
PR    0.480    0.493    0.472    0.469    0.523    0.522    0.544   0.450

Table 2: Error results across all datasets. Bold typeface is used for the best results. Statistical significance (using the micro sign test (s-test) proposed in [20]) is denoted with ↓ for p-value < 0.05 and with ↓↓ for p-value < 0.01.

The proposed hierarchy pruning strategy aims to adapt the given taxonomy structure for better classification while maintaining the ancestor-descendant relationship between any given pair of nodes. As shown in Table 2, this simple learning-based pruning strategy leads to statistically significantly better results for all classifiers compared to both the original taxonomy and a randomly pruned one. A similar result is reported in [18] through the pruning of an entire layer of the hierarchy, which can be seen as a generalization, even though empirical in nature, of the pruning strategy retained here. Another interesting approach to modifying the original taxonomy is presented in [21].
In this study, three other elementary modification operations are considered, again with an increase in performance.

4 Conclusion

We have studied in this paper flat and hierarchical classification strategies in the context of large-scale taxonomies, through generalization error bounds for multiclass, hierarchical classifiers. The first theorem we have introduced provides an explanation for several empirical results related to the performance of such classifiers. We have also introduced a well-founded way to simplify a taxonomy by selectively pruning some of its nodes, through a meta-classifier. The features retained in this meta-classifier derive from the generalization error bounds we have proposed. The experimental results reported here (as well as in other papers) are in line with our theoretical developments and justify the pruning strategy adopted.
This is the first time, to our knowledge, that a data-dependent generalization error bound has been proposed for multiclass, hierarchical classifiers, and that a theoretical explanation has been provided for the performance of flat and hierarchical classification strategies in large-scale taxonomies. In particular, there is, up to now, no consensus on which classification scheme, flat or hierarchical, to use on a particular category system. One of our main conclusions is that top-down hierarchical classifiers are well suited to unbalanced, large-scale taxonomies, whereas flat ones should be preferred for well-balanced taxonomies.
Lastly, our theoretical development also suggests possibilities to grow a hierarchy of classes from a (large) set of categories, as has been done in several studies (e.g. [2]).
We plan to explore this in future work.

5 Acknowledgments

This work was supported in part by the ANR project Class-Y, the Mastodons project Gargantua, the LabEx PERSYVAL-Lab ANR-11-LABX-0025 and the European project BioASQ (grant agreement no. 318652).

References

[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[2] S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In Advances in Neural Information Processing Systems 23, pages 163–171, 2010.

[3] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM), pages 78–87. ACM, 2004.

[4] O. Dekel. Distribution-calibrated hierarchical classification. In Advances in Neural Information Processing Systems 22, pages 450–458, 2009.

[5] O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proceedings of the 21st International Conference on Machine Learning, pages 27–35, 2004.

[6] J. Deng, S. Satheesh, A. C. Berg, and F.-F. Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In Advances in Neural Information Processing Systems 24, pages 567–575, 2011.

[7] S. Dumais and H. Chen. Hierarchical classification of web content. In Proceedings of the 23rd Annual International ACM SIGIR Conference, pages 256–263, 2000.

[8] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[9] T. Gao and D. Koller. Discriminative learning of relaxed hierarchy for large-scale visual recognition.
In IEEE International Conference on Computer Vision (ICCV), pages 2072–2079, 2011.

[10] S. Gopal, Y. Yang, and A. Niculescu-Mizil. Regularization framework for large scale hierarchical classification. In Large Scale Hierarchical Classification, ECML/PKDD Discovery Challenge Workshop, 2012.

[11] S. Gopal, Y. Yang, B. Bai, and A. Niculescu-Mizil. Bayesian models for large-scale hierarchical classification. In Advances in Neural Information Processing Systems 25, 2012.

[12] Y. Guermeur. Sample complexity of classifiers taking values in R^q, application to multi-class SVMs. Communications in Statistics - Theory and Methods, 39, 2010.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer New York Inc., 2001.

[14] T.-Y. Liu, Y. Yang, H. Wan, H.-J. Zeng, Z. Chen, and W.-Y. Ma. Support vector machines classification with a very large-scale taxonomy. SIGKDD Explorations, 2005.

[15] H. Malik. Improving hierarchical SVMs by hierarchy flattening and lazy classification. In 1st Pascal Workshop on Large Scale Hierarchical Classification, 2009.

[16] F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In Computer Vision and Pattern Recognition, pages 3482–3489, 2012.

[17] M. Schervish. Theory of Statistics. Springer Series in Statistics. Springer New York Inc., 1995.

[18] X. Wang and B.-L. Lu. Flatten hierarchies for large-scale hierarchical text categorization. In 5th International Conference on Digital Information Management, pages 139–144, 2010.

[19] K. Q. Weinberger and O. Chapelle. Large margin taxonomy embedding for document categorization. In Advances in Neural Information Processing Systems 21, pages 1737–1744, 2008.

[20] Y. Yang and X. Liu. A re-examination of text categorization methods.
In Proceedings of the 22nd Annual International ACM SIGIR Conference, pages 42–49. ACM, 1999.

[21] J. Zhang, L. Tang, and H. Liu. Automatically adjusting content taxonomies for hierarchical classification. In Proceedings of the 4th Workshop on Text Mining, 2006.