Part of Advances in Neural Information Processing Systems 12 (NIPS 1999)
Thomas Hofmann
The project pursued in this paper is to develop from first information-geometric principles a general method for learning the similarity between text documents. Each individual document is modeled as a memoryless information source. Based on a latent class decomposition of the term-document matrix, a low-dimensional (curved) multinomial subfamily is learned. From this model a canonical similarity function - known as the Fisher kernel - is derived. Our approach can be applied to unsupervised and supervised learning problems alike. This in particular covers interesting cases where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of the proposed method.
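To make the pipeline concrete, the sketch below fits a latent class (aspect) decomposition of a toy term-document count matrix with EM and then scores document similarity with a Fisher-kernel-style function built from the fitted multinomial parameters. This is a minimal illustration under simplifying assumptions, not the paper's exact derivation or normalization; the function names, the tiny corpus, and the kernel's weighting are illustrative choices.

```python
# Sketch only: EM for an aspect model P(w|d) = sum_z P(z|d) P(w|z),
# followed by a Fisher-kernel-style similarity between documents.
import numpy as np

rng = np.random.default_rng(0)

def fit_aspect_model(counts, n_topics=2, n_iter=100):
    """EM for a latent class decomposition of a (docs x terms) count matrix."""
    n_docs, n_terms = counts.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(z|d)
    p_w_z = rng.dirichlet(np.ones(n_terms), size=n_topics)  # P(w|z)
    for _ in range(n_iter):
        # E-step: posterior responsibilities P(z|d,w)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (d, z, w)
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate multinomial parameters from expected counts
        weighted = counts[:, None, :] * resp                 # shape (d, z, w)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

def fisher_style_kernel(counts, p_z_d, p_w_z):
    """Similarity built from the fitted latent-class parameters.

    Two additive parts, following the general recipe of a Fisher kernel for
    a latent class model: overlap of the topic mixing proportions P(z|d),
    and overlap of empirical word distributions weighted by posterior topic
    agreement. The normalizations here are a simplification.
    """
    p_z = p_z_d.mean(axis=0)                                 # crude topic prior
    emp = counts / counts.sum(axis=1, keepdims=True)         # empirical P(w|d)
    # Part 1: topic-proportion overlap, weighted by 1/P(z)
    k_topics = (p_z_d / p_z) @ p_z_d.T
    # Part 2: word overlap, weighted by shared topic posteriors
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]
    post = joint / joint.sum(axis=1, keepdims=True)          # P(z|d,w)
    agree = np.einsum('izw,jzw->ijw', post / p_w_z[None], post)
    k_words = np.einsum('iw,jw,ijw->ij', emp, emp, agree)
    return k_topics + k_words

# Toy term-document counts: rows are documents, columns are vocabulary terms.
counts = np.array([[4, 3, 0, 0],
                   [3, 5, 1, 0],
                   [0, 1, 4, 3],
                   [0, 0, 3, 5]], dtype=float)
p_z_d, p_w_z = fit_aspect_model(counts)
print(np.round(fisher_style_kernel(counts, p_z_d, p_w_z), 2))
```

On this toy corpus the kernel assigns higher similarity to document pairs that share a latent topic (the first two and last two rows), which is the behavior the abstract describes: similarity is derived from the learned low-dimensional multinomial family rather than from raw term overlap alone.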