Distributed Inference for Latent Dirichlet Allocation

Part of Advances in Neural Information Processing Systems 20 (NIPS 2007)



David Newman, Padhraic Smyth, Max Welling, Arthur Asuncion


We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or "topic" model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single-processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors, and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
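The first scheme described above (local Gibbs sampling with periodic count synchronization) can be illustrated with a minimal single-process simulation. This is a sketch, not the authors' implementation: the function names (`local_gibbs_pass`, `adlda`) are hypothetical, the P processors are simulated sequentially in one process, and symmetric Dirichlet hyperparameters alpha and beta are assumed.

```python
import numpy as np

def local_gibbs_pass(docs, z, ndk, nkw, nk, alpha, beta, K, V, rng):
    """One collapsed-Gibbs sweep over one processor's local documents.

    docs: list of word-id lists; z: parallel list of topic assignments.
    ndk: per-document topic counts; nkw, nk: this processor's stale
    copy of the global topic-word and topic counts.
    """
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            # remove the current assignment from the counts
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # full conditional over topics for word w in document d
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

def adlda(corpus, K=4, V=20, P=2, iters=10, alpha=0.1, beta=0.01, seed=0):
    """Approximate distributed LDA: partition docs across P simulated
    processors, sample locally against a stale global count copy, then
    merge each processor's count deltas at the synchronization step."""
    rng = np.random.default_rng(seed)
    parts = [corpus[p::P] for p in range(P)]   # split docs across processors
    nkw_global = np.zeros((K, V))
    state = []
    for docs in parts:                          # random initialization
        z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
        ndk = np.zeros((len(docs), K))
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                ndk[d, z[d][i]] += 1
                nkw_global[z[d][i], w] += 1
        state.append((docs, z, ndk))
    for _ in range(iters):
        deltas = []
        for docs, z, ndk in state:
            nkw = nkw_global.copy()             # stale local copy
            nk = nkw.sum(axis=1)
            local_gibbs_pass(docs, z, ndk, nkw, nk, alpha, beta, K, V, rng)
            deltas.append(nkw - nkw_global)     # this processor's changes
        for dlt in deltas:                      # periodic global update
            nkw_global += dlt
    return nkw_global
```

Because each processor samples against counts that are slightly stale between synchronizations, this is only an approximation to single-processor Gibbs sampling; the paper's second (hierarchical Bayesian) scheme removes that approximation at the cost of a more complex model.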