Part of Advances in Neural Information Processing Systems 14 (NIPS 2001)
Dan Klein, Christopher D. Manning
Abstract

This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused on maximizing likelihood according to generative PCFG models. In contrast, we employ a simpler probabilistic model over trees based directly on constituent identity and linear context, and use an EM-like iterative procedure to induce structure. This method produces much higher quality analyses, giving the best published results on the ATIS dataset.

1 Overview
To enable a wide range of subsequent tasks, human language sentences are standardly given tree-structure analyses, wherein the nodes in a tree dominate contiguous spans of words called constituents, as in figure 1(a). Constituents are the linguistically coherent units in the sentence, and are usually labeled with a constituent category, such as noun phrase (NP) or verb phrase (VP). An aim of grammar induction systems is to figure out, given just the sentences in a corpus S, what tree structures correspond to them. In this sense, the grammar induction problem is an incomplete data problem, where the complete data is the corpus of trees T, but we only observe their yields S. This paper presents a new approach to this problem, which gains leverage by directly making use of constituent contexts.

It is an open problem whether entirely unsupervised methods can produce linguistically accurate parses of sentences. Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing, where one uses a treebank of fully parsed sentences to induce a model which parses unseen sentences [7, 3]. But there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise. Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser's supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised parser, this ordering is implicit in the supervisory information). Finally, while the presented system makes no claims to modeling human language acquisition, results on whether there is enough information in sentences to recover their structure are important data for linguistic theory, where it has standardly been assumed that the information in the data is deficient, and strong innate knowledge is required for language acquisition [4].
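To make the problem setup concrete, the following is a minimal Python sketch (not from the paper; all names and the toy tree are illustrative) of the notions used above: a tree represented as nested tuples, its yield (the observed sentence S), the contiguous spans its nodes dominate (the constituents), and the linear context flanking a span.

```python
# Illustrative sketch: trees as nested tuples, plus the yield / span /
# context notions from the problem setup above.

def yield_of(tree):
    """The observed sentence S for a tree: its leaf words, left to right."""
    if isinstance(tree, str):                # a leaf is just a word
        return [tree]
    _label, *children = tree
    words = []
    for child in children:
        words.extend(yield_of(child))
    return words

def constituents(tree, start=0):
    """All (label, start, end) spans; each node dominates a contiguous span."""
    if isinstance(tree, str):
        return [], start + 1                 # a word occupies one position
    label, *children = tree
    spans, pos = [], start
    for child in children:
        child_spans, pos = constituents(child, pos)
        spans.extend(child_spans)
    spans.append((label, start, pos))
    return spans, pos

def context(sentence, start, end):
    """Linear context of a span: the words immediately flanking it."""
    left = sentence[start - 1] if start > 0 else "<s>"
    right = sentence[end] if end < len(sentence) else "</s>"
    return (left, right)

tree = ("S", ("NP", "the", "cat"), ("VP", "saw", ("NP", "the", "dog")))
sentence = yield_of(tree)                    # ['the', 'cat', 'saw', 'the', 'dog']
spans, _ = constituents(tree)
for label, i, j in spans:
    print(label, sentence[i:j], context(sentence, i, j))
# e.g. ('VP', 2, 5) dominates 'saw the dog' with context ('cat', '</s>')
```

The (span, context) pairs printed here are the kind of local information the abstract refers to: the model scores trees based directly on constituent identity and linear context, rather than on the rule probabilities of a generative PCFG.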