{"title": "Natural Language Grammar Induction Using a Constituent-Context Model", "book": "Advances in Neural Information Processing Systems", "page_first": 35, "page_last": 42, "abstract": null, "full_text": "Natural Language Grammar Induction using a\nConstituent-Context Model\nDan Klein and Christopher D. Manning\n\nComputer Science Department\nStanford University\nStanford, CA 94305-9040\n\n{klein,\n\nmanning}@cs.stanford.edu\nAbstract\n\nThis paper presents a novel approach to the unsupervised learning of syn-\ntactic analyses of natural language text. Most previous work has focused\non maximizing likelihood according to generative PCFG models. In con-\ntrast, we employ a simpler probabilistic model over trees based directly\non constituent identity and linear context, and use an EM-like iterative\nprocedure to induce structure. This method produces much higher qual-\nity analyses, giving the best published results on the ATIS dataset.\n1 Overview\n\nTo enable a wide range of subsequent tasks, human language sentences are standardly given\ntree-structure analyses, wherein the nodes in a tree dominate contiguous spans of words\ncalled constituents, as in figure 1(a). Constituents are the linguistically coherent units in\nthe sentence, and are usually labeled with a constituent category, such as noun phrase (NP)\nor verb phrase (VP). An aim of grammar induction systems is to figure out, given just the\nsentences in a corpus S, what tree structures correspond to them. In this sense, the grammar\ninduction problem is an incomplete data problem, where the complete data is the corpus\nof trees T , but we only observe their yields S. This paper presents a new approach to this\nproblem, which gains leverage by directly making use of constituent contexts.\nIt is an open problem whether entirely unsupervised methods can produce linguistically\naccurate parses of sentences. 
Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing, where one uses a treebank of fully parsed sentences to induce a model which parses unseen sentences [7, 3]. But there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise. Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser's supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised parser, this ordering is implicit in the supervisory information). Finally, while the presented system makes no claims to modeling human language acquisition, results on whether there is enough information in sentences to recover their structure are important data for linguistic theory, where it has standardly been assumed that the information in the data is deficient, and strong innate knowledge is required for language acquisition [4].

[Figure 1(a): the parse tree (S (NP (NN Factory) (NNS payrolls)) (VP (VBD fell) (PP (IN in) (NN September)))). Figure 1(b): the constituents and contexts for each node, with the sentence boundary written as ◊:]

Node  Constituent        Context
S     NN NNS VBD IN NN   ◊ -- ◊
NP    NN NNS             ◊ -- VBD
VP    VBD IN NN          NNS -- ◊
PP    IN NN              VBD -- ◊
NN1   NN                 ◊ -- NNS
NNS   NNS                NN -- VBD
VBD   VBD                NNS -- IN
IN    IN                 VBD -- NN
NN2   NN                 IN -- ◊

Empty  Context
0      ◊ -- NN
1      NN -- NNS
2      NNS -- VBD
3      VBD -- IN
4      IN -- NN
5      NN -- ◊

Figure 1: Example parse tree with the constituents and contexts for each tree node.

2 Previous Approaches

One aspect of grammar induction where there has already been substantial success is the induction of parts-of-speech. 
Several different distributional clustering approaches have resulted in relatively high-quality clusterings, though the clusters' resemblance to classical parts-of-speech varies substantially [9, 15]. For the present work, we take the part-of-speech induction problem as solved and work with sequences of parts-of-speech rather than words. In some ways this makes the problem easier, such as by reducing sparsity, but in other ways it complicates the task (even supervised parsers perform relatively poorly with the actual words replaced by parts-of-speech).

Work attempting to induce tree structures has met with much less success. Most grammar induction work assumes that trees are generated by a symbolic or probabilistic context-free grammar (CFG or PCFG). These systems generally boil down to one of two types. Some fix the structure of the grammar in advance [12], often with an aim to incorporate linguistic constraints [2] or prior knowledge [13]. These systems typically then attempt to find the grammar production parameters θ which maximize the likelihood P(S|θ) using the inside-outside algorithm [1], which is an efficient (dynamic programming) instance of the EM algorithm [8] for PCFGs. Other systems (which have generally been more successful) incorporate a structural search as well, typically using a heuristic to propose candidate grammar modifications which minimize the joint encoding of data and grammar using an MDL criterion, which asserts that a good analysis is a short one, in that the joint encoding of the grammar and the data is compact [6, 16, 18, 17]. 
These approaches can also be seen as likelihood maximization where the objective function is the a posteriori likelihood of the grammar given the data, and the description length provides a structural prior.

The ``compact grammar'' aspect of MDL is close to some traditional linguistic argumentation which at times has argued for minimal grammars on grounds of analytical [10] or cognitive [5] economy. However, the primary weakness of MDL-based systems is not the objective function but the search procedures they employ. Such systems end up growing structures greedily, in a bottom-up fashion. Therefore, their induction quality is determined by how well they are able to heuristically predict what local intermediate structures will fit into good final global solutions.

A potential advantage of systems which fix the grammar and only perform parameter search is that they do compare complete grammars against each other, and are therefore able to detect which give rise to systematically compatible parses. However, although early work showed that small, artificial CFGs could be induced with the EM algorithm [12], studies with large natural language grammars have generally suggested that completely unsupervised EM over PCFGs is ineffective for grammar acquisition. For instance, Carroll and Charniak [2] describe experiments running the EM algorithm from random starting points, which produced widely varying learned grammars, almost all of extremely poor quality.¹

¹We duplicated one of their experiments, which used grammars restricted to rules of the form x → x y | y x, where there is one category x for each part-of-speech (such a restricted CFG is isomorphic to a dependency grammar). We began reestimation from a grammar with uniform rewrite

It is well-known that EM is only locally optimal, and one might think that the locality of the search procedure, not the objective function, is to blame. 
The truth is somewhere in between. There are linguistic reasons to distrust an ML objective function. It encourages the symbols and rules to align in ways which maximize the truth of the conditional independence assumptions embodied by the PCFG. The symbols and rules of a natural language grammar, on the other hand, represent syntactically and semantically coherent units, for which a host of linguistic arguments have been made [14]. None of these have anything to do with conditional independence; traditional linguistic constituency reflects only grammatical regularities and possibilities for expansion. There are expected to be strong connections across phrases (such as dependencies between verbs and their selected arguments). It could be that ML over PCFGs and linguistic criteria align, but in practice they do not always seem to. Experiments with both artificial [12] and real [13] data have shown that, starting from fixed, correct (or at least linguistically reasonable) structure, EM produces a grammar which has higher log-likelihood than the linguistically determined grammar, but lower parsing accuracy.

However, we additionally conjecture that EM over PCFGs fails to propagate contextual cues efficiently. The reason we expect an algorithm to converge on a good PCFG is that there seem to be coherent categories, like noun phrases, which occur in distinctive environments, like between the beginning of the sentence and the verb phrase. In the inside-outside algorithm, the product of inside and outside probabilities α_j(p, q)β_j(p, q) is the probability of generating the sentence with a j constituent spanning words p through q: the outside probability captures the environment, and the inside probability the coherent category. 
If we had a good idea of what VPs and NPs looked like, then if a novel NP appeared in an NP context, the outside probabilities should pressure the sequence to be parsed as an NP. However, what happens early in the EM procedure, when we have no real idea about the grammar parameters? With randomly-weighted, complete grammars over a symbol set X, we have observed that a frequent, short, noun phrase sequence often does get assigned to some category x early on. However, since there is not a clear overall structure learned, there is only very weak pressure for other NPs, even if they occur in the same positions, to also be assigned to x, and the reestimation process goes astray. To enable this kind of constituent-context pressure to be effective, we propose the model in the following section.

3 The Constituent-Context Model

We propose an alternate parametric family of models over trees which is better suited for grammar induction. Broadly speaking, inducing trees like the one shown in figure 1(a) can be broken into two tasks. One is deciding constituent identity: where the brackets should be placed. The second is deciding what to label the constituents. These tasks are certainly correlated and are usually solved jointly. However, the task of labeling chosen brackets is essentially the same as the part-of-speech induction problem, and the solutions cited above can be adapted to cluster constituents [6]. The task of deciding brackets is the harder one. For example, the sequence DT NN IN DT NN ([the man in the moon]) is virtually always a noun phrase when it is a constituent, but it is only a constituent 66% of the time, because the IN DT NN is often attached elsewhere ([we [sent a man] [to the moon]]). Figure 2(a) shows the 50 most frequent constituent sequences of three types, represented as points in the vector space of their contexts (see below), projected onto their first two principal components. The three clusters are relatively coherent, and it is not difficult to believe that a clustering algorithm could detect them in the unprojected space. Figure 2(b), however, shows 150 sequences which are parsed as constituents at least 50% of the time along with 150 which are not, again projected onto the first two components. This plot at least suggests that the constituent/non-constituent classification is less amenable to direct clustering. Thus, it is important that an induction system be able to detect constituents, either implicitly or explicitly. 

probabilities. Figure 4 shows that the resulting grammar (DEP-PCFG) is not as bad as conventional wisdom suggests. Carroll and Charniak are right to observe that the search space is riddled with pronounced local maxima, and EM does not do nearly so well when randomly initialized. The need for random seeding when using EM over PCFGs is twofold. For some grammars, such as one over a set X of nonterminals in which any rule x1 → x2 x3 with xi ∈ X is possible, seeding is needed to break symmetry. This is not the case for dependency grammars, where symmetry is broken by the yields (e.g., a sentence noun verb can only be covered by a noun or verb projection). The second reason is to start the search from a random region of the space. But unless one does many random restarts, the uniform starting condition is better than most extreme points in the space, and produces superior results.

[Figure 2: two scatter plots of sequences in the vector space of linear contexts, projected onto the first two principal components; axis values omitted.]

Figure 2: The most frequent examples of (a) different constituent labels and (b) constituents and non-constituents, in the vector space of linear contexts, projected onto the first two principal components. Clustering is effective for labeling, but not detecting constituents.

A variety of methods of constituent detection have been proposed [11, 6], usually based on information-theoretic properties of a sequence's distributional context. However, here we rely entirely on the following two simple assumptions: (i) constituents of a parse do not cross each other, and (ii) constituents occur in constituent contexts. The first property is self-evident from the nature of the parse trees. The second is an extremely weakened version of classic linguistic constituency tests [14].

Let α be a terminal sequence. Every occurrence of α will be in some linear context c(α) = ⟨x, y⟩, where x and y are the adjacent terminals or sentence boundaries. Then we can view any tree t over a sentence s as a collection of sequences and contexts, one of each for every node in the tree, plus one for each inter-terminal empty span, as in figure 1(b). Good trees will include nodes whose yields frequently occur as constituents and whose contexts frequently surround constituents. Formally, we use a conditional exponential model of the form:

P(t|s, θ) = exp( Σ_{⟨α,c⟩ ∈ t} (θ_α f_α + θ_c f_c) ) / Σ_{t′: yield(t′)=s} exp( Σ_{⟨α,c⟩ ∈ t′} (θ_α f_α + θ_c f_c) )

We have one feature f_α(t) for each sequence α, whose value on a tree t is the number of nodes in t with yield α, and one feature f_c(t) for each context c, representing the number of times c is the context of the yield of some node in the tree.² No joint features over c and α are used, and, unlike many other systems, there is no distinction between constituent types. We model only the conditional likelihood of the trees, P(T|S, θ), where θ = {θ_α, θ_c}.

We then use an iterative EM-style procedure to find a local maximum P(T|S, θ) of the completed data (trees) T, where P(T|S, θ) = Π_{t ∈ T, s=yield(t)} P(t|s, θ). We initialize θ such that each θ is zero and initialize T to any arbitrary set of trees. 
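The decomposition of a tree into per-node sequences and contexts, plus the inter-terminal empty spans, can be sketched as follows. This is our own minimal illustration, not the authors' code; the function name, the half-open span representation, and the boundary symbol ◊ are our choices.

```python
# Sketch: decompose a bracketing into (yield, context) pairs as in figure 1(b).
# BOUNDARY plays the role of the sentence-boundary symbol; names are our own.
BOUNDARY = "◊"

def span_features(tags, spans):
    """tags: list of POS tags; spans: set of (i, j) half-open spans (tree nodes).
    Returns one (yield, context) pair per node, plus one per empty span."""
    padded = [BOUNDARY] + tags + [BOUNDARY]
    feats = []
    # One pair per tree node: its yield and the adjacent terminals/boundaries.
    for i, j in spans:
        yield_ = " ".join(tags[i:j])
        context = (padded[i], padded[j + 1])
        feats.append((yield_, context))
    # One pair per inter-terminal empty span (yield is the empty sequence).
    for i in range(len(tags) + 1):
        feats.append(("", (padded[i], padded[i + 1])))
    return feats

# The non-terminal nodes of the tree in figure 1(a), "Factory payrolls fell in September".
feats = span_features(["NN", "NNS", "VBD", "IN", "NN"],
                      {(0, 5), (0, 2), (2, 5), (3, 5)})
```

For the NP span (0, 2), for example, this yields the pair ("NN NNS", (◊, VBD)), matching the NP row of figure 1(b).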
In alternating steps, we first fix the parameters θ and find the most probable single tree structure t′ for each sentence s according to P(t|s, θ), using a simple dynamic program. For any θ this produces the set of parses T′ which maximizes P(T|S, θ). Since T′ maximizes this quantity, if T is the former set of trees, P(T′|S, θ) ≥ P(T|S, θ). Second, we fix the trees and estimate new parameters θ′. The task of finding the parameters θ′ which maximize P(T|S, θ) is simply the well-studied task of fitting our exponential model to maximize the conditional likelihood of the fixed parses. Running, for example, a conjugate gradient (CG) ascent on θ will produce the desired θ′. If θ is the former parameters, then we will have P(T|S, θ′) ≥ P(T|S, θ). Therefore, each iteration will increase P(T|S, θ) until convergence.³ Note that our parsing model is not a generative model, and this procedure, though clearly related, is not exactly an instance of the EM algorithm. We merely guarantee that the conditional likelihood of the data completions is increasing. Furthermore, unlike in EM, where each iteration increases the marginal likelihood of the fixed observed data, our procedure increases the conditional likelihood of a changing complete data set, with the completions changing at every iteration as we reparse.

Several implementation details were important in making the system work well. First, tie-breaking was needed, most of all for the first round. Initially, the parameters are zero, and all parses are therefore equally likely. 

²So, for the tree in figure 1(a), P(t|s) ∝ exp(θ_{NN NNS} + θ_{VBD IN NN} + θ_{IN NN} + θ_⟨◊,VBD⟩ + θ_⟨NNS,◊⟩ + θ_⟨VBD,◊⟩ + θ_⟨◊,NNS⟩ + θ_⟨NN,VBD⟩ + θ_⟨NNS,IN⟩ + θ_⟨VBD,NN⟩ + θ_⟨IN,◊⟩).
To prevent bias, all ties were broken randomly. Second, like so many statistical NLP tasks, smoothing was vital. There are features in our model for arbitrarily long yields, and most yield types occurred only a few times. The most severe consequence of this sparsity was that initial parsing choices could easily become frozen. If a θ_α for some yield α was either ≫ 0 or ≪ 0, which was usually the case for rare yields, α would either be locked into always occurring or never occurring, respectively. Not only did we want to push the θ_α values close to zero, we also wanted to account for the fact that most spans are not constituents.⁴ Therefore, we expect the distribution of the θ_α to be skewed towards low values.⁵ A greater amount of smoothing was needed for the first few iterations, while much less was required in later iterations.

Finally, parameter estimation using a CG method was slow and difficult to smooth in the desired manner, and so we used the smoothed relative frequency estimates θ_α = count(f_α)/(count(α) + M) and θ_c = count(f_c)/(count(c) + N). These estimates ensured that the θ values were between 0 and 1, and gave the desired bias towards non-constituency. These estimates were fast and surprisingly effective, but do not guarantee non-decreasing conditional likelihood (though the conditional likelihood was increasing in practice).⁶

4 Results

In all experiments, we used hand-parsed sentences from the Penn Treebank. For training, we took the approximately 7500 sentences in the Wall Street Journal (WSJ) section which contained 10 words or fewer after the removal of punctuation. For testing, we evaluated the system by comparing the system's parses for those same sentences against the supervised parses in the treebank. 
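The smoothed relative frequency estimates described above admit a very short sketch. This is our own illustration; the function name and the toy counts are invented, and the smoothing constant plays the role of M (or N) from the text.

```python
from collections import Counter

def fit_theta(feature_counts, occurrence_counts, smoothing):
    """Smoothed relative frequencies: theta_x = count(f_x) / (count(x) + smoothing).
    feature_counts: times x occurred as the yield (or context) of a tree node;
    occurrence_counts: times x occurred anywhere in the sentences.
    Counter returns 0 for unseen keys, so items never parsed as nodes get theta = 0."""
    return {x: feature_counts[x] / (occurrence_counts[x] + smoothing)
            for x in occurrence_counts}

# Toy example (invented counts): "DT NN" was a node in 8 of its 10 occurrences,
# "IN DT" never was; smoothing constant assumed to be 2.
theta = fit_theta(Counter({"DT NN": 8}), Counter({"DT NN": 10, "IN DT": 5}), 2.0)
```

The denominator's smoothing term keeps every estimate strictly below 1 and shrinks rare items toward 0, giving the bias towards non-constituency that the text describes.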
We consider each parse as a set of constituent brackets, discarding all trivial brackets.⁷ We calculated the precision and recall of these brackets against the treebank parses in the obvious way.

³In practice, we stopped the system after 10 iterations, but final behavior was apparent after 4--8.

⁴In a sentence of length n, there are (n+1)(n+2)/2 total (possibly size zero) spans, but only 3n constituent spans: n−1 of size ≥ 2, n of size 1, and n+1 empty spans.

⁵Gaussian priors for the exponential model accomplish the former goal, but not the latter.

⁶The relative frequency estimators had a somewhat subtle positive effect. Empty spans have no effect on the model when using CG fitting, as all trees include the same empty spans. However, including their counts improved performance substantially when using relative frequency estimators. This is perhaps an indication that a generative version of this model would be advantageous.

⁷We discarded both brackets of length one and brackets spanning the entire sentence, since all of these are impossible to get incorrect, and hence ignored sentences of length ≤ 2 during testing.

[Figure 3: three alternate bracketings of the sentence "The screen was a sea of red" (trees not reproduced).]

Figure 3: Alternate parse trees for a sentence: (a) the Penn Treebank tree (deemed correct), (b) the one found by our system CCM, and (c) the one found by DEP-PCFG.

(a)
Method     UP    UR     F1    NP UR   PP UR   VP UR
LBRANCH    20.5  24.2   22.2  28.9    6.3     0.6
RANDOM     29.0  31.0   30.0  42.8    23.6    26.3
DEP-PCFG   39.5  42.3   40.9  69.7    44.1    22.8
RBRANCH    54.1  67.5   60.0  38.3    44.5    85.8
CCM        60.1  75.4   66.9  83.8    71.6    66.3
UBOUND     78.2  100.0  87.8  100.0   100.0   100.0

(b)
System     UP    UR    F1    CB
EMILE      51.6  16.8  25.4  0.84
ABL        43.6  35.6  39.2  2.12
CDC-40     53.4  34.6  42.0  1.46
RBRANCH    39.9  46.4  42.9  2.18
CCM        54.4  46.8  50.3  1.61

Figure 4: Comparative accuracy on WSJ sentences (a) and on the ATIS corpus (b). UR = unlabeled recall; UP = unlabeled precision; F1 = the harmonic mean of UR and UP; CB = crossing brackets. Separate recall values are shown for three major categories.

To situate the results of our system, figure 4(a) gives the values of several parsing strategies. CCM is our constituent-context model. DEP-PCFG is a dependency PCFG model [2] trained using the inside-outside algorithm. Figure 3 shows sample parses to give a feel for the parses the systems produce. We also tested several baselines. RANDOM parses randomly; this is an appropriate baseline for an unsupervised system. RBRANCH always chooses the right-branching chain, while LBRANCH always chooses the left-branching chain. RBRANCH is often used as a baseline for supervised systems, but exploits a systematic right-branching tendency of English. An unsupervised system has no a priori reason to prefer right chains to left chains, and LBRANCH is well worse than RANDOM. A system need not beat RBRANCH to claim partial success at grammar induction. Finally, we include an upper bound. All of the parsing strategies and systems mentioned here give fully binary-branching structures. Treebank trees, however, need not be fully binary-branching, and generally are not. As a result, there is an upper bound UBOUND on the precision and F1 scores achievable when structurally confined to binary trees.

Clearly, CCM is parsing much better than the RANDOM baseline and the DEP-PCFG induced grammar. Significantly, it also out-performs RBRANCH in both precision and recall and, to our knowledge, it is the first unsupervised system to do so. 
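Unlabeled bracket precision, recall, and F1 as reported in figure 4 can be computed along these lines. This is our own sketch, not the evaluation code used in the paper; the span representation and function names are ours, and trivial brackets (length one or spanning the whole sentence) are discarded as described in the text.

```python
def nontrivial(brackets, n):
    """Drop length-one brackets and the whole-sentence bracket of a length-n sentence,
    since these are impossible to get incorrect."""
    return {(i, j) for (i, j) in brackets if j - i > 1 and not (i == 0 and j == n)}

def bracket_prf(guess, gold, n):
    """Unlabeled bracket precision/recall/F1 of a guessed parse against a gold parse,
    each given as a set of (i, j) half-open spans."""
    g, t = nontrivial(guess, n), nontrivial(gold, n)
    hits = len(g & t)
    p = hits / len(g) if g else 0.0
    r = hits / len(t) if t else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy length-5 sentence: the guess gets two of three nontrivial gold brackets.
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
guess = {(0, 5), (0, 2), (2, 4), (3, 5)}
p, r, f1 = bracket_prf(guess, gold, 5)
```

Corpus-level scores would pool the hit and bracket counts over all sentences before dividing, rather than averaging per-sentence values.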
To facilitate comparison\nwith other recent systems, figure 4(b) gives results where we trained as before but used\n(all) the sentences from the distributionally different ATIS section of the treebank as a test\nset. For this experiment, precision and recall were calculated using the EVALB system of\nmeasuring precision and recall (as in [6, 17]) -- EVALB is a standard for parser evaluation,\nbut complex, and unsuited to evaluating unlabeled constituency. EMILE and ABL are lexical\nsystems described in [17]. The results for CDC-40, from [6], reflect training on much more\ndata (12M words). Our system is superior in terms of both precision and recall (and so F 1 ).\nThese figures are certainly not all that there is to say about an induced grammar; there are a\nnumber of issues in how to interpret the results of an unsupervised system when comparing\nwith treebank parses. Errors come in several kinds. First are innocent sins of commis-\nsion. Treebank trees are very flat; for example, there is no analysis of the inside of many\nshort noun phrases ([two hard drives] rather than [two [hard drives]]). 
Sequence   Example            CORRECT  FREQUENCY  ENTROPY  DEP-PCFG  CCM
DT NN      the man            1        2          2        1         1
NNP NNP    United States      2        1          --       2         2
CD CD      4 1/2              3        9          --       5         5
JJ NNS     daily yields       4        7          3        4         4
DT JJ NN   the top rank       5        --         --       7         6
DT NNS     the people         6        --         --       --        10
JJ NN      plastic furniture  7        3          7        3         3
CD NN      12 percent         8        --         --       --        9
IN NN      on Monday          9        --         9        --        --
IN DT NN   for the moment     10       --         --       --        --
NN NNS     fire trucks        11       --         6        --        8
NN NN      fire truck         22       8          10       --        7
TO VB      to go              26       --         1        6         --
DT JJ      ?the big           78       6          --       --        --
IN DT      *of the            90       4          --       10        --
PRP VBZ    ?he says           95       --         --       8         --
PRP VBP    ?they say          180      --         --       9         --
NNS VBP    ?people are        ≈350     --         4        --        --
NN VBZ     ?value is          ≈532     10         5        --        --
NN IN      *man from          ≈648     5          --       --        --
NNS VBD    ?people were       ≈648     --         8        --        --

Figure 5: Top non-trivial sequences by actual treebank constituent counts, linear frequency, scaled context entropy, and in DEP-PCFG and CCM learned models' parses.

Our system gives a (usually correct) analysis of the insides of such NPs, for which it is penalized on precision (though not recall or crossing brackets). Second are systematic alternate analyses. Our system tends to form modal verb groups and often attaches verbs first to pronoun subjects rather than to objects. As a result, many VPs are systematically incorrect, boosting crossing bracket scores and impacting VP recall. Finally, the treebank's grammar is sometimes an arbitrary, and even inconsistent, standard for an unsupervised learner: alternate analyses may be just as good.⁸ Notwithstanding this, we believe that the treebank parses have enough truth in them that parsing scores are a useful component of evaluation.

Ideally, we would like to inspect the quality of the grammar directly. Unfortunately, the grammar acquired by our system is implicit in the learned feature weights. These are not by themselves particularly interpretable, and not directly comparable to the grammars produced by other systems, except through their functional behavior. 
Any grammar which\nparses a corpus will have a distribution over which sequences tend to be analyzed as con-\nstituents. These distributions can give a good sense of what structures are and are not being\nlearned. Therefore, to supplement the parsing scores above, we examine these distributions.\nFigure 5 shows the top scoring constituents by several orderings. These lists do not say\nvery much about how long, complex, recursive constructions are being analyzed by a given\nsystem, but grammar induction systems are still at the level where major mistakes manifest\nthemselves in short, frequent sequences. CORRECT ranks sequences by how often they\noccur as constituents in the treebank parses. DEP-PCFG and CCM are the same, but use\ncounts from the DEP-PCFG and CCM parses. As a baseline, FREQUENCY lists sequences by\nhow often they occur anywhere in the sentence yields. Note that the sequence IN DT (e.g.,\n``of the'') is high on this list, and is a typical error of many early systems. Finally, ENTROPY\n\nis the heuristic proposed in [11] which ranks by context entropy. It is better in practice than\n\nFREQUENCY, but that isn't self-evident from this list. Clearly, the lists produced by the\n\nCCM system are closer to correct than the others. They look much like a censored version\nof the FREQUENCY list, where sequences which do not co-exist with higher-ranked ones\nhave been removed (e.g., IN DT often crosses DT NN). This observation may explain a good\npart of the success of this method.\nAnother explanation for the surprising success of the system is that it exploits a deep fact\nabout language. Most long constituents have some short, frequent equivalent, or proform,\n\nwhich occurs in similar contexts [14]. 
In the very common case where the proform is a single word, it is guaranteed constituency, which will be transmitted to longer sequences via shared contexts (categories like PP which have infrequent proforms are not learned well unless the empty sequence is in the model -- interestingly, the empty sequence appears to act as the proform for PPs, possibly due to the highly optional nature of many PPs).

⁸For example, transitive sentences are bracketed [subject [verb object]] (The president [executed the law]) while nominalizations are bracketed [[possessive noun] complement] ([The president's execution] of the law), an arbitrary inconsistency which is unlikely to be learned automatically.

5 Conclusions

We have presented an alternate probability model over trees which is based on simple assumptions about the nature of natural language structure. It is driven by the explicit transfer between sequences and their contexts, and exploits both the proform phenomenon and the fact that good constituents must tile in ways that systematically cover the corpus sentences without crossing. The model clearly has limits. Lacking recursive features, it essentially must analyze long, rare constructions using only contexts. However, despite, or perhaps because of, its simplicity, our model predicts bracketings very well, producing higher quality structural analyses than previous methods which employ the PCFG model family.

Acknowledgements. We thank John Lafferty, Fernando Pereira, Ben Taskar, and Sebastian Thrun for comments and discussion. This paper is based on work supported in part by the National Science Foundation under Grant No. IIS-0085896.

References

[1] James K. Baker. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the ASA, pages 547--550, 1979.

[2] Glenn Carroll and Eugene Charniak. 
Two experiments on learning probabilistic dependency grammars from corpora. In C. Weir, S. Abney, R. Grishman, and R. Weischedel, editors, Working Notes of the Workshop Statistically-Based NLP Techniques, pages 1--13. AAAI Press, 1992.

[3] Eugene Charniak. A maximum-entropy-inspired parser. In NAACL 1, pages 132--139, 2000.

[4] Noam Chomsky. Knowledge of Language. Praeger, New York, 1986.

[5] Noam Chomsky and Morris Halle. The Sound Pattern of English. Harper & Row, New York, 1968.

[6] Alexander Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. In The Fifth Conference on Natural Language Learning, 2001.

[7] Michael John Collins. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, pages 16--23, 1997.

[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1--38, 1977.

[9] Steven Finch and Nick Chater. Distributional bootstrapping: From word class to proto-sentence. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 301--306, Hillsdale, NJ, 1994. Lawrence Erlbaum.

[10] Zellig Harris. Methods in Structural Linguistics. University of Chicago Press, Chicago, 1951.

[11] Dan Klein and Christopher D. Manning. Distributional phrase structure induction. In The Fifth Conference on Natural Language Learning, 2001.

[12] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35--56, 1990.

[13] Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL 30, pages 128--135, 1992.

[14] Andrew Radford. Transformational Grammar. Cambridge University Press, Cambridge, 1988.

[15] Hinrich Schütze. Distributional part-of-speech tagging. In EACL 7, pages 141--148, 1995.

[16] Andreas Stolcke and Stephen M. Omohundro. 
Inducing probabilistic grammars by Bayesian model merging. In Grammatical Inference and Applications: Proceedings of the Second International Colloquium on Grammatical Inference. Springer Verlag, 1994.

[17] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. EMILE. Technical Report 2001.05, University of Leeds, 2001.

[18] J. G. Wolff. Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger, and M. D. S. Braine, editors, Categories and processes in language acquisition, pages 179--215. Lawrence Erlbaum, Hillsdale, NJ, 1988.
", "award": [], "sourceid": 1945, "authors": [{"given_name": "Dan", "family_name": "Klein", "institution": null}, {"given_name": "Christopher", "family_name": "Manning", "institution": null}]}