Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*NicolĂ˛ Cesa-bianchi, Claudio Gentile, Andrea Tironi, Luca Zaniboni*

We study the problem of hierarchical classification when labels corre- sponding to partial and/or multiple paths in the underlying taxonomy are allowed. We introduce a new hierarchical loss function, the H-loss, im- plementing the simple intuition that additional mistakes in the subtree of a mistaken class should not be charged for. Based on a probabilistic data model introduced in earlier work, we derive the Bayes-optimal classifier for the H-loss. We then empirically compare two incremental approx- imations of the Bayes-optimal classifier with a flat SVM classifier and with classifiers obtained by using hierarchical versions of the Perceptron and SVM algorithms. The experiments show that our simplest incremen- tal approximation of the Bayes-optimal classifier performs, after just one training epoch, nearly as well as the hierarchical SVM classifier (which performs best). For the same incremental algorithm we also derive an H-loss bound showing, when data are generated by our probabilistic data model, exponentially fast convergence to the H-loss of the hierarchical classifier based on the true model parameters.

1 Introduction and basic definitions

We study the problem of classifying data in a given taxonomy of labels, where the tax- onomy is specified as a tree forest. We assume that every data instance is labelled with a (possibly empty) set of class labels called multilabel, with the only requirement that mul- tilabels including some node i in the taxonony must also include all ancestors of i. Thus, each multilabel corresponds to the union of one or more paths in the forest, where each path must start from a root but it can terminate on an internal node (rather than a leaf).

Learning algorithms for hierarchical classification have been investigated in, e.g., [8, 9, 10, 11, 12, 14, 15, 17, 20]. However, the scenario where labelling includes multiple and partial paths has received very little attention. The analysis in [5], which is mainly theoretical, shows in the multiple and partial path case a 0/1-loss bound for a hierarchical learning algorithm based on regularized least-squares estimates. In this work we extend [5] in several ways. First, we introduce a new hierarchical loss func- tion, the H-loss, which is better suited than the 0/1-loss to analyze hierarchical classification tasks, and we derive the corresponding Bayes-optimal classifier under the parametric data model introduced in [5]. Second, considering various loss functions, including the H-loss, we empirically compare the performance of the following three incremental kernel-based

```
This work was supported in part by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reflects the authors' views.
```

algorithms: 1) a hierarchical version of the classical Perceptron algorithm [16]; 2) an ap- proximation to the Bayes-optimal classifier; 3) a simplified variant of this approximation. Finally, we show that, assuming data are indeed generated according to the parametric model mentioned before, the H-loss of the algorithm in 3) converges to the H-loss of the classifier based on the true model parameters. Our incremental algorithms are based on training linear-threshold classifiers in each node of the taxonomy. A similar approach has been studied in [8], though their model does not consider multiple-path classifications as we do. Incremental algorithms are the main focus of this research, since we strongly believe that they are a key tool for coping with tasks where large quantities of data items are generated and the classification system needs to be frequently adjusted to keep up with new items. However, we found it useful to provide a reference point for our empirical results. Thus we have also included in our experiments the results achieved by nonincremental algorithms. In particular, we have chosen a flat and a hierarchical version of SVM [21, 7, 19], which are known to perform well on the textual datasets considered here. We assume data elements are encoded as real vectors x Rd which we call instances. A multilabel for an instance x is any subset of the set {1, . . . , N } of all labels/classes, including the empty set. We denote the multilabel associated with x by a vector y = (y1, . . . , yN ) {0, 1}N , where i belongs to the multilabel of x if and only if yi = 1. A taxonomy G is a forest whose trees are defined over the set of labels. A multilabel y {0, 1}N is said to respect a taxonomy G if and only if y is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf. See Figure 1. We assume the data-generating mechanism produces examples (x, y) such that y respects some fixed underlying taxonomy G with N nodes. The set of roots in G is denoted by root(G). We use par(i) to denote the unique parent of node i, anc(i) to denote the set of ancestors of i, and sub(i) to denote the set of nodes in the subtree rooted at i (including i). Finally, given a predicate over a set , we will use {} to denote both the subset of where is true and the indicator function of this subset.

Do not remove: This comment is monitored to verify that the site is working properly