Sphere Embedding: An Application to Part-of-Speech Induction

Part of Advances in Neural Information Processing Systems 23 (NIPS 2010)

Bibtex Metadata Paper

Authors

Yariv Maron, Michael Lamar, Elie Bienenstock

Abstract

Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks.