Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization

Part of Advances in Neural Information Processing Systems 12 (NIPS 1999)

Authors

Thomas Hofmann

Abstract

The project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function, known as the Fisher kernel, is derived. Our approach can be applied to unsupervised and supervised learning problems alike. In particular, this covers interesting cases where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.
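
As a rough illustration of the kind of similarity function described above, the following Python sketch computes a Fisher kernel for a deliberately simplified model: a single corpus-wide multinomial over terms, not the latent class decomposition that the paper learns. Assuming the square-root parameterization of the multinomial, under which the Fisher information is taken to be approximately the identity, the kernel between two documents reduces to the sum over terms of P^(w|d_i) P^(w|d_j) / P(w), where P^(w|d) is the empirical term distribution of a document and P(w) is the corpus-wide term probability. The function name `unigram_fisher_kernel`, the toy count matrix, and the unigram simplification itself are assumptions made here for illustration only.

```python
import numpy as np


def unigram_fisher_kernel(counts, eps=1e-12):
    """Fisher kernel for a single multinomial (unigram) model.

    Illustrative simplification only: this is NOT the latent class
    model of the paper. `counts` is a (documents x terms) matrix of
    term frequencies.
    """
    counts = np.asarray(counts, dtype=float)

    # Empirical term distribution P^(w|d) for each document (rows sum to 1).
    doc_totals = counts.sum(axis=1, keepdims=True)
    p_hat = counts / np.maximum(doc_totals, eps)

    # Maximum-likelihood estimate of the corpus-wide term probabilities P(w).
    theta = counts.sum(axis=0) / counts.sum()

    # Terms that never occur contribute nothing; drop them to avoid 0/0.
    seen = theta > 0

    # Score vectors in the square-root parameterization: P^(w|d) / sqrt(P(w)).
    features = p_hat[:, seen] / np.sqrt(theta[seen])

    # Kernel matrix: K[i, j] = sum_w P^(w|d_i) * P^(w|d_j) / P(w).
    return features @ features.T


# Toy term-document counts (hypothetical data): documents 0 and 2 are identical.
toy_counts = [
    [2, 0, 1],
    [0, 3, 1],
    [2, 0, 1],
]
K = unigram_fisher_kernel(toy_counts)
```

In this sketch, dividing by P(w) down-weights frequent terms, so rare shared terms contribute more to the similarity than common ones. Identical documents receive the same kernel value as a document with itself, while documents sharing only a few common terms receive a smaller one.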