Zach Solan, David Horn, Eytan Ruppin, Shimon Edelman
We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be “intermediate” for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
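The graph representation mentioned above, in which sentences are paths over word vertices, can be illustrated with a minimal sketch. This is not the implemented algorithm: it covers only the initial loading of sentences as paths and omits the statistical inference of significant patterns and equivalence classes. The function name `build_graph`, the `<BEGIN>` and `<END>` marker vertices, and the whitespace tokenization are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical marker vertices delimiting each sentence path (an assumption).
BEGIN, END = "<BEGIN>", "<END>"


def build_graph(sentences):
    """Load sentences as paths over word vertices.

    Returns a pair (paths, edges): each path is the list of vertices
    visited by one sentence, and edges maps a (source, target) vertex
    pair to the number of sentence paths that traverse it.
    """
    paths = []
    edges = defaultdict(int)
    for sentence in sentences:
        # Naive lowercase whitespace tokenization, for illustration only.
        path = [BEGIN] + sentence.lower().split() + [END]
        paths.append(path)
        # Count every consecutive vertex pair along the path.
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return paths, dict(edges)


corpus = ["the cat sat", "the dog sat"]
paths, edges = build_graph(corpus)
```

In this toy corpus both paths share the vertices `the` and `sat` and diverge only at `cat` versus `dog`. Shared segments of this kind are what a pattern-significance criterion would examine, although the criterion itself is not sketched here.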