A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

Jian Zhang, Zoubin Ghahramani, Yiming Yang

Advances in Neural Information Processing Systems 17 (NIPS 2004)

In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior based on a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters from a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.
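
To make the ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It combines a Dirichlet-process-style prior over clusters (an existing cluster is weighted by its size, a new cluster by a concentration parameter) with a Dirichlet-multinomial marginal likelihood, and reports the posterior probability of "new cluster" as a novelty score. Several things here are assumptions made only for illustration: the names `novelty_scores`, `log_dirichlet_multinomial`, `alpha`, and `beta`; the symmetric base distribution, which stands in for the general English language model; the greedy hard assignment of each document to its most probable cluster; and the hand-set hyperparameters, which the paper instead estimates by empirical Bayes.

```python
import math
from collections import Counter


def log_dirichlet_multinomial(doc, pseudo):
    """Log marginal likelihood of the word counts in `doc` under a Dirichlet
    with pseudo-counts `pseudo`. The multinomial coefficient is omitted
    because it is identical for every candidate cluster."""
    n = sum(doc.values())
    total = sum(pseudo.values())
    result = math.lgamma(total) - math.lgamma(total + n)
    for word, count in doc.items():
        a = pseudo[word]
        result += math.lgamma(a + count) - math.lgamma(a)
    return result


def novelty_scores(docs, vocab, alpha=1.0, beta=0.1):
    """Process tokenized documents one at a time and return, for each,
    the posterior probability that it starts a new cluster.

    Assumptions (hypothetical, for illustration only):
      - every token in `docs` appears in `vocab`;
      - `beta` is a symmetric base pseudo-count, a stand-in for a
        general English language model;
      - `alpha` is the concentration parameter for new clusters;
      - each document is hard-assigned to its most probable cluster.
    """
    base = {w: beta for w in vocab}
    clusters = []  # each entry: [number of documents, Counter of word counts]
    scores = []
    for doc in docs:
        counts = Counter(doc)
        # Unnormalized log posterior for each existing cluster ...
        logs = []
        for size, word_counts in clusters:
            pseudo = {w: base[w] + word_counts[w] for w in vocab}
            logs.append(math.log(size) + log_dirichlet_multinomial(counts, pseudo))
        # ... and for a brand-new cluster drawn from the base distribution.
        logs.append(math.log(alpha) + log_dirichlet_multinomial(counts, base))

        # Normalize in a numerically stable way.
        top = max(logs)
        weights = [math.exp(x - top) for x in logs]
        z = sum(weights)
        probs = [w / z for w in weights]
        scores.append(probs[-1])

        # Greedy update: join the most probable cluster, or open a new one.
        k = max(range(len(probs)), key=probs.__getitem__)
        if k == len(clusters):
            clusters.append([1, counts])
        else:
            clusters[k][0] += 1
            clusters[k][1].update(counts)
    return scores
```

Under these assumptions, calling `novelty_scores` on a stream in which a document repeats the vocabulary of an earlier one yields a low score for the repeat, while a document with previously unseen words yields a score close to one. The first document always scores exactly one, since no cluster exists yet.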