Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, Eisaku Maeda
In this paper, we address the problem of statistical learning for multi- topic text categorization (MTC), whose goal is to choose all relevant top- ics (a label) from a given set of topics. The proposed algorithm, Max- imal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class cate- gorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experi- ments with multi-topic Web pages show that MML outperforms existing learning algorithms including Support Vector Machines.
1 Multi-topic Text Categorization (MTC)
This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics label, and say that a text is assigned a label, not a topic.
In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic’s relevance to the text. In this decomposition approach, the features specific to a topic, not a label, are regarded as important features. However, the approach may result in inefficient learning as we will explain in the following example.
Imagine an MTC problem of scientific papers where quantum computing papers are as- signed multi-topic label “quantum physics (QP) & computer science (CS)”. (QP and CS are topics in this example.) Since there are some words specific to quantum computing such as “qbit1”, one can say that efficient MTC learners should use such words to assign label QP & CS. However, the decomposition approach is likely to ignore these words since they are only specific to a small portion of the whole QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or CS.
1Qbit is a unit of quantum information, and frequently appears in real quantum computing litera-
tures, but rarely seen in other literatures.
Symbol x(∈ Rd) t1, t2, . . . , tl T L, λ(⊂ T ) L[j] Λ(= 2T ) {(xi, Li)}m
i=1
Meaning A document vector Topics The set of all topics A label The binary representation of L. 1 if tj ∈ L and 0 otherwise. The set of all possible labels Training samples