{"title": "Differential Use of Implicit Negative Evidence in Generative and Discriminative Language Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 754, "page_last": 762, "abstract": "A classic debate in cognitive science revolves around understanding how children learn complex linguistic rules, such as those governing  restrictions on verb alternations, without negative  evidence. Traditionally, formal learnability arguments have been used to claim that such learning is impossible without the aid of innate  language-specific knowledge. However, recently, researchers have shown  that statistical models are capable of learning complex rules from only  positive evidence. These two kinds of learnability analyses differ in their assumptions about the role of the distribution from which linguistic  input is generated.  The former analyses assume that learners seek to identify grammatical sentences in a way that is robust to the distribution  from which the sentences are generated, analogous to discriminative  approaches in machine learning. The latter assume that learners are trying  to estimate a generative model, with sentences being sampled from that  model. We show that these two learning approaches differ in their use of implicit negative evidence -- the absence of a sentence -- when learning  verb alternations, and demonstrate that human learners can produce results  consistent with the predictions of both approaches, depending on the  context in which the learning problem is presented.", "full_text": "Differential Use of Implicit Negative Evidence in\nGenerative and Discriminative Language Learning\n\nAnne S. Hsu\n\nThomas L. 
Grif\ufb01ths\n\nDepartment of Psychology\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\n{showen,tom griffiths}@berkeley.edu\n\nAbstract\n\nA classic debate in cognitive science revolves around understanding how children\nlearn complex linguistic rules, such as those governing restrictions on verb alter-\nnations, without negative evidence. Traditionally, formal learnability arguments\nhave been used to claim that such learning is impossible without the aid of innate\nlanguage-speci\ufb01c knowledge. However, recently, researchers have shown that sta-\ntistical models are capable of learning complex rules from only positive evidence.\nThese two kinds of learnability analyses differ in their assumptions about the dis-\ntribution from which linguistic input is generated. The former analyses assume\nthat learners seek to identify grammatical sentences in a way that is robust to the\ndistribution from which the sentences are generated, analogous to discriminative\napproaches in machine learning. The latter assume that learners are trying to esti-\nmate a generative model, with sentences being sampled from that model. We show\nthat these two learning approaches differ in their use of implicit negative evidence\n\u2013 the absence of a sentence \u2013 when learning verb alternations, and demonstrate\nthat human learners can produce results consistent with the predictions of both\napproaches, depending on how the learning problem is presented.\n\n1 Introduction\n\nLanguages have a complex structure, full of general rules with idiosyncratic exceptions. For ex-\nample, the causative alternation in English allows a class of verbs to take both the transitive form,\n\u201cI opened the door\u201d, and the intransitive form, \u201cThe door opened\u201d. With other verbs, alternations\nare restricted, and they are grammatical in only one form. 
For example, \u201cThe rabbit disappeared\u201d\nis grammatical whereas \u201cI disappeared the rabbit\u201d is ungrammatical. There is a great debate over\nhow children learn language, related to the infamous \u201cpoverty of the stimulus\u201d argument [1, 2, 3, 4].\nA central part of the debate arises from the fact that a child learns language mostly by hear-\ning adults speak grammatical sentences, known as positive evidence. Children are believed to learn\nlanguage mostly from positive evidence because research has found that children rarely receive in-\ndications from parents that a sentence is not grammatical, and they ignore these indications when\nthey do receive them. An explicit indication that a sentence is not grammatical is known as nega-\ntive evidence [5, 6, 7]. Yet speaking a language involves the generalization of linguistic\npatterns into novel combinations of phrases that have never been heard before. This presents the\nfollowing puzzle: How do children eventually learn that certain novel linguistic generalizations are\nnot allowed if they are not explicitly told? There have been two main lines of analysis addressing\nthis question. These analyses have taken two different perspectives on the basic task involved in\nlanguage learning, and have yielded quite different results.\n\nOne perspective is that language is acquired by learning rules for identifying grammatically ac-\nceptable and unacceptable sentences in a way that is robust to the actual distribution of observed\nsentences. From this perspective, Gold\u2019s theorem [8] asserts that languages with in\ufb01nite recursion,\nsuch as most human languages, are impossible to learn from positive evidence alone. In particu-\nlar, linguistic exceptions, such as the restrictions on verb alternations mentioned above, are cited\nas being impossible to learn empirically. 
More recent analyses yield similar results, while making\nweaker assumptions about the desired outcome of learning (for a review, see [9]). In light of this,\nit has been argued that child language learning abilities can only be explained by the presence of\ninnate knowledge speci\ufb01c to language [3, 4, 10].\n\nOn the other side of the debate, results indicating that relatively sophisticated linguistic representa-\ntions such as probabilistic context-free grammars can be learned from positive evidence have been\nobtained by viewing language acquisition as a process of forming a probabilistic model of the lin-\nguistic input, under the assumption that the observed data are sampled from this model [11, 12, 13].\nIn addition to these general theoretical results, statistical learning models have been shown to be\ncapable of learning exceptions in language from positive examples only in a variety of domains,\nincluding verb alternations [14, 15, 16, 17, 18, 19]. Furthermore, previous experimental work has\nshown that humans are capable of learning linguistic exceptions in an arti\ufb01cial language without\nnegative evidence [20], bearing out the predictions of some of these models.\n\nOne key difference between these two perspectives on learning is in the assumptions that they make\nabout how observed sentences are generated. In the former approach, the goal is to learn to iden-\ntify grammatical sentences without making assumptions about the distribution from which they are\ndrawn. In the latter approach, the goal is to learn a probability distribution over sentences, and the\nobserved sentences are assumed to be drawn from that distribution. This difference is analogous to\nthe distinction between discriminative and generative models in machine learning (e.g., [21]). The\nstronger distributional assumptions made in the generative approach result in a less robust learner,\nbut make it possible to learn linguistic exceptions without negative evidence. 
In particular, gener-\native models can exploit the \u201cimplicit negative evidence\u201d provided by the absence of a sentence:\nthe assumption that sentences are generated from the target probability distribution means that not\nobserving a sentence provides weak evidence that it does not belong to the language. In contrast,\ndiscriminative models that seek to learn a function for labelling sentences as grammatical or un-\ngrammatical are more robust to the distribution from which the sentences are drawn, but their weaker\nassumptions about this distribution mean that they are unable to exploit implicit negative evidence.\n\nIn this paper, we explore how these two different views of learning are related to human language\nacquisition. Here we focus on the task of learning an arti\ufb01cial language containing both alternating\nand non-alternating verbs. Our goal is to use modeling and human experiments to demonstrate that\nthe opposing conclusions from the two sides of the language acquisition debate can be explained by\na difference in learning approach. We compare the learning performance of a hierarchical Bayesian\nmodel [15], which takes a generative approach, with a logistic regression model, which takes a dis-\ncriminative approach. We show that without negative evidence, the generative model will judge a\nverb structure that is absent in the input to be ungrammatical, while the discriminative model will\njudge it to be grammatical. We then conduct an experiment designed to encourage human partici-\npants to adopt either a generative or discriminative language learning perspective. The experimental\nresults indicate that human learners behave in accordance with model predictions: absent verb struc-\ntures are rejected as ungrammatical under a generative learning perspective and accepted as gram-\nmatical under a discriminative one. 
Our modeling comparisons and experimental results contribute\nto the language acquisition debate in the following ways: First, our results lend credence to conclu-\nsions from both sides of the debate by showing that linguistic exceptions appear either unlearnable\nor learnable, depending on the learning perspective. Second, our results indicate that the opposing\nconclusions about learnability can indeed be attributed to whether one assumes a discriminative or\na generative learning perspective. Finally, because our generative learning condition is much more\nsimilar to actual child language learning, our results lend weight to the argument that children can\nlearn language empirically from positive input.\n\n2 Models of language learning: Generative and discriminative\n\nGenerative approaches seek to infer the probability distribution over sentences that characterizes the\nlanguage, while discriminative models seek to identify a function that indicates whether a sentence\nis grammatical. General results exist that characterize the learnability of languages from these two\nperspectives, but there are few direct comparisons of generative and discriminative approaches to the\nsame speci\ufb01c language learning situation. Here, we compare a simple generative and discriminative\nmodel\u2019s predictions of how implicit negative evidence is used to learn verb alternations.\n\n[Figure 1 diagram: hyperparameters \u03bb, \u00b5 at the top, then \u03b1, \u03b2, then per-verb parameters \u03b81 to \u03b84 with observed counts y1 to y4, each above a sample sequence of sentence structures S1 and S2]\n\nFigure 1: A hierarchical Bayesian model for learning verb alternations. Figure adapted from [15].\n\n2.1 Generative model: Hierarchical Bayes\n\nIn the generative model, the problem of learning verb alternations is formulated as follows. 
Assume\nwe have a set of m verbs, which can occur in up to k different sentence structures. Restricting\nourselves to positive examples for the moment, we observe a total of n sentences x1, . . . , xn. The ni\nsentences containing verb i can be summarized in a k-dimensional vector yi containing the verb\noccurrence frequency in each of the k sentence structures. For example, if we had three possible\nsentence structure types and verb i occurred in the \ufb01rst type two times, the second type four times\nand the third type zero times, yi would be [2, 4, 0] and ni would be 6.\nWe model these data using a hierarchical Bayesian model (HBM) originally introduced in [15], also\nknown to statisticians as a Dirichlet-Multinomial model [22]. In statistical notation the HBM is\n\n\u03b1 \u223c Exponential(\u03bb)\n\u03b2 \u223c Dirichlet(\u00b5)\n\u03b8i \u223c Dirichlet(\u03b1\u03b2)\nyi | ni \u223c Multinomial(\u03b8i)\n\nwhere yi is the data (i.e. the observed frequency of different grammatical sentence structures for\nverb i) given ni occurrences of that verb, as summarized above. \u03b8i captures the distribution over\nsentence structures associated with verb i, assuming that sentences are generated independently and\nstructure k is generated with probability \u03b8^i_k. The hyperparameters \u03b1 and \u03b2 represent generalizations\nabout the kinds of sentence structures that typically occur. More precisely, \u03b2 represents the distribu-\ntion of sentence structures across all verbs, with \u03b2k being the mean probability of sentence structure\nk, while \u03b1 represents the extent to which verbs tend to appear in only one sentence structure type.\nIn this model, the number of verbs and the number of possible sentence structures are both \ufb01xed.\nThe hyperparameters \u03b1 and \u03b2 are learned, and the prior on these hyperparameters is \ufb01xed by setting\n\u03bb = 1 and \u00b5 = 1 for all i. 
This prior asserts a weak expectation that \u03b1 and \u03b2 do\nnot take extreme values. The model is \ufb01t to the data by computing the posterior distribution\np(\u03b8i|yi) = \u222b p(\u03b8i|\u03b1, \u03b2, y) p(\u03b1, \u03b2|y) d\u03b1 d\u03b2. The posterior can be estimated using a Markov\nChain Monte Carlo (MCMC) algorithm. Following [15], we use Gaussian proposals on log(\u03b1), and\ndraw proposals for \u03b2 from a Dirichlet distribution with the current \u03b2 as its mean.\n\n2.2 Discriminative model: Logistic regression\n\nFor our discriminative model we use logistic regression. A logistic regression model can be used\nto learn a function that classi\ufb01es observations into two classes. In the context of language learning,\nthe observations are sentences and the classi\ufb01cation problem is deciding whether each sentence is\ngrammatical. As above, we observe n sentences, x1, . . . , xn, but now each sentence xj is associated\nwith a variable cj indicating whether the sentence is grammatical (cj = +1) or ungrammatical\n(cj = \u22121). Each sentence is associated with a feature vector f (xj) that uses dummy variables to\nencode the verb, the sentence structure, and the interaction of the two (i.e., each sentence\u2019s particular\nverb and sentence structure combination). With m verbs and k sentence structures, this results in\nm verb features, k sentence structure features, and mk interaction features, each of which takes the\nvalue 1 when it matches the sentence and 0 when it does not. 
For example, a sentence containing\nthe second of four verbs in the \ufb01rst of three sentence structures would be encoded with the binary\nfeature vector 0100100000100000000.\nThe logistic regression model learns which features of sentences are predictive of grammaticality.\nThis is done by de\ufb01ning the probability of grammaticality to be\n\np(cj = +1|xj, w, b) = 1/(1 + exp{\u2212w^T f(xj) \u2212 b})    (1)\n\nwhere w and b are the parameters of the model. w and b are estimated by maximizing the log\nlikelihood \u2211_{j=1}^{n} log p(cj|xj, w, b). Features for which the likelihood is uninformative (e.g. features\nthat are not observed) have weights that are set to zero.\n\n3 Testing the models on an arti\ufb01cial language\n\nTo examine the predictions that these two models make about the use of implicit negative evidence\nin learning verb alternations, we applied them to a simple arti\ufb01cial language based on that used in\n[20]. This language has four transitive verbs and three possible sentence structures. Three of the\nverbs only appear in one sentence structure (non-alternating), while one verb appears in two possible\nsentence structures (alternating). The language consisted of three-word sentences, each containing a\nsubject (N1), object (N2) and verb (V), with the order depending on the particular sentence structure.\n\n3.1 Vocabulary\n\nThe vocabulary was a subset of that used in [20]. There were three two-syllable nouns, each begin-\nning with a different consonant, referring to three cartoon animals: blergen (lion), nagid (elephant),\ntombat (giraffe). Noun referents are \ufb01xed across participants. The four one-syllable verbs were:\ngund, \ufb02ern, semz, and norg, corresponding to the four transitive actions: eclipse, push-to-side, ex-\nplode and jump on. While the identity of the nouns and verbs is irrelevant to the models, we de-\nveloped this language with the intent of also examining human learning, as described below. 
With\nhuman learners, the mapping of verbs to actions was randomly selected for each participant.\n\n3.2 Syntax and grammar\n\nIn our language of three-word sentences, a verb could appear in 3 different positions (as the 1st,\n2nd or 3rd word). We constrained the possible sentences such that the subject, N1, always appeared\nbefore the object, N2. This leaves us with three possible sentence structures, S1, S2, and S3, each of\nwhich corresponded to one of the following word orders: N1-N2-V, N1-V-N2 and V-N1-N2. In our\nexperiment, the mapping from sentence structure to word order was randomized among participants.\nFor example, S1 might correspond to N1-N2-V for one participant or it might correspond to V-N1-\nN2 for another participant. There was always one sentence structure, which we denote S3, that was\nnever grammatical for any of the verbs. For S1 and S2, grammaticality varied depending on the verb.\nWe designed our language to have 1 alternating verb and 3 non-alternating verbs. One of the three\nnon-alternating verbs was only grammatical in S2. The other two non-alternating verbs were only\ngrammatical in S1. For example, let\u2019s consider the situation where S1 is N1-V-N2, S2 is N1-N2-V\nand S3 is V-N1-N2. If \ufb02ern was an alternating verb, both nagid \ufb02ern tombat and nagid tombat \ufb02ern\nwould be allowed. If semz was non-alternating, and only allowed in S2, nagid tombat semz would be\ngrammatical and nagid semz tombat would be ungrammatical. In this example, \ufb02ern nagid tombat\nand semz nagid tombat are both ungrammatical. 
The language is summarized in Table 1.\n\n3.3 Modeling results\n\nThe generative hierarchical Bayesian model and the discriminative logistic regression model out-\nlined in the previous section were applied to a corpus of sentences generated from this language.\n\n      Sentence Structure\nVerb  S1      S2      S3\nV1    +(9)    +(9)    -(6)\nV2    -(3)    +(18)   -(3)\nV3    +(18)   -(3)    -(3)\nV4    +(18)   ?(0)    -(6)\n\nTable 1: Grammaticality of verbs. + and - indicate grammatical and ungrammatical respectively,\nwhile ? indicates that grammaticality is underdetermined by the data. The number in parentheses is\nthe frequency with which each sentence was presented to model and human learners in our experi-\nment. Verb V4 was never shown in sentence structure S2. Grammaticality predictions for sentences\ncontaining this verb were used to explore the interpretation of implicit negative evidence.\n\n[Figure 2: four panels, a) V1 (S1,S2), b) V2 (S2), c) V3 (S1), d) V4 (S1); y-axis: Grammaticality (0 to 1); x-axis: S1, S2, S3; legend: generative, discriminative]\n\nFigure 2: Predicted grammaticality judgments from generative and discriminative models. In paren-\ntheses next to the verb index in the title of each plot is the sentence structure(s) that were shown to\nbe grammatical for that verb in the training corpus.\n\nThe frequency of each verb and sentence structure combination is also shown in Table 1. We\nwere particularly interested in the predictions that the two models made about the grammaticality of\nverb V4 in sentence structure S2, since this combination of verb and sentence structure never occurs\nin the data. As a consequence, a generative learner receives implicit negative evidence that S2 is not\ngrammatical for V4, while a discriminative learner receives no information.\n\nWe trained the HBM on the grammatical instances of the sentences, using 10,000 iterations of\nMCMC. 
The results indicate that V1 is expected to occur in each of S1 and S2 50% of the time, while\nall other verbs are expected to occur 100% of the time in the one sentence structure for which they\nare grammatical, accurately re\ufb02ecting the distribution in our language input. Predictions for gram-\nmaticality are extracted from the HBM as follows: The ith verb is grammatical in sentence\nstructure k if the probability of sentence structure k, \u03b8^i_k, is greater than or equal to \u03b5 and ungram-\nmatical otherwise, where \u03b5 is a small number. Theoretically, \u03b5 should be set so that any sentence\nobserved once will be considered grammatical. Here, posterior values of \u03b8^i_k were highly peaked\nabout 0.5 for V1 in S1 and S2, and either 0 or 1 for other verb and sentence structure combinations,\nresulting in clear grammaticality predictions. These are shown in Figure 2. Critically, the model\npredicts that V4 in S2 is not grammatical.\n\nLogistic regression was performed using all sentences in our corpus, both grammatical and ungram-\nmatical. Predictions for grammaticality from the logistic regression model were read out directly\nfrom p(cj = +1|xj, w, b). The results are shown in Figure 2. While the model has not seen V4\nin S2, and has consequently not estimated a weight for the feature that uniquely identi\ufb01es this sen-\ntence, it has seen 27 grammatical and 3 ungrammatical instances of S2, and 18 grammatical and\n6 ungrammatical instances of V4, so it has learned positive weights for both of these features of\nsentences. As a consequence, it predicts that V4 in S2 is grammatical.\n\n4 Generative and discriminative learning in humans\n\nThe simulations above illustrate how generative and discriminative approaches to language learning\ndiffer in their treatment of implicit negative evidence. 
This raises the question of whether a similar\ndifference can be produced in human learners by changing the nature of the language learning task.\nWe conducted an experiment to explore whether this is the case.\n\nIn our experiment, participants learned the arti\ufb01cial language used to generate the model predictions\nin the previous section by watching computer animated scenes accompanied by spoken and written\nsentences describing each scene. Participants were also provided with information about whether the\nsentence was grammatical or ungrammatical. Participants were assigned to one of two conditions,\nwhich prompted either generative or discriminative learning. Participants in both conditions were\nexposed to exactly the same sentences and grammaticality information. The two conditions differed\nonly in how grammaticality information was presented.\n\n4.1 Participants\n\nA total of 22 participants were recruited from the community at the University of California, Berke-\nley.\n\n4.2 Stimuli\n\nAs summarized in Table 1, participants viewed each of the 4 verbs 24 times, 18 grammatical sen-\ntences and 6 ungrammatical sentences. The alternating verb was shown 9 times each in S1 and\nS2 and 6 times in S3. The non-alternating verbs were shown 18 times each in their respective\ngrammatical sentence structures and 3 times each in the 2 ungrammatical structures. Presentation\nof sentences was ordered as follows: Two chains of sentences were constructed, one grammatical\nand one ungrammatical. The grammatical chain consisted of 72 sentences (18 for each verb) and\nthe ungrammatical chain consisted of 24 sentences (6 for each verb). For each sentence chain, verbs\nwere presented cyclically and randomized within cycles. For the grammatical chain, V1 occurrences\nof S1 and S2 were cycled through in semi-random order (verbs V2-V4 appeared grammatically in\nonly one sentence construction). 
Similarly, for the ungrammatical chain, V2 and V3 cycled semi-\nrandomly through occurrences of S1 and S3, and S2 and S3, respectively (verbs V1 and V4 only\nappeared ungrammatically in S3). While participants were being trained on the language, presen-\ntation of one sentence from the ungrammatical chain was randomly interleaved within every three\npresentations of sentences from the grammatical chain. Subject-object noun pairs were randomized\nfor each verb across presentations. There were a total of 96 training sentences.\n\n4.3 Procedure\n\nParticipants in both conditions underwent pre-training trials to acquaint them with the vocabulary.\nDuring pre-training they saw each written word and heard it spoken, together with a picture of each\nnoun and a scene corresponding to each verb. All words were cycled\nthrough three times during pre-training. During the main experiment, all participants were told they\nwere to learn an arti\ufb01cial language. They all saw a series of sentences describing animated scenes\nwhere a subject noun performed an action on an object noun. All sentences were presented in both\nspoken and written form.\n\n4.3.1 Generative learning condition\n\nIn the generative learning condition, participants were told that they would listen to an adult speaker\nwho always spoke grammatical sentences and a child speaker who always spoke ungrammat-\nically. Cartoon pictures of either the adult or child speaker accompanied each scene. The child\nspeaker\u2019s voice was low-pass \ufb01ltered to create a believably child-like sound. 
We hypothesized that\nparticipants in this condition would behave similarly to a generative model: they would build a\nprobabilistic representation of the language from the grammatical sentences produced by the adult\nspeaker.\n\n[Figure 3: four panels, a) V1 (S1,S2), b) V2 (S2), c) V3 (S1), d) V4 (S1); y-axis: Proportion grammatical (0 to 1); x-axis: S1, S2, S3; legend: generative, discriminative]\n\nFigure 3: Human grammar judgments, showing proportion grammatical for each sentence structure.\n\n4.3.2 Discriminative learning condition\n\nIn the discriminative learning condition, participants were presented with spoken and written sen-\ntences describing each scene and asked to choose whether each of the presented sentences was\ngrammatical or not. They were assured that only relevant words were used and they only had to \ufb01g-\nure out if the verb occurred in a grammatical location. Participants then received feedback on their\nchoice. For example, if a participant answered that the sentence was grammatical, they would see\neither \u201cYes, you were correct. This sentence is grammatical!\u201d or \u201cSorry, you were incorrect. This\nsentence is ungrammatical!\u201d The main difference from the generative condition is that in the dis-\ncriminative condition, the presented sentences are assumed to be chosen at random, whereas in the\ngenerative learning condition, sentences from the adult speaker are assumed to have been sampled\nfrom the language distribution. 
We hypothesized that participants in the discriminative condition\nwould behave similarly to a discriminative model: they would use feedback about both grammatical\nand ungrammatical sentences to formulate rules about what made sentences grammatical.\n\n4.3.3 Testing\n\nAfter the language learning phase, participants in both conditions were subjected to a grammar test.\nIn this testing phase, participants were shown a series of written sentences and asked to rate each\nsentence as either grammatical or ungrammatical. Here, all sentences had blergen as the subject\nand nagid as the object. All verb-sentence structure combinations were shown twice. Additionally,\nthe verb V4 was shown an extra two times in S2 as this was the crucial generalization that we were\ntesting.\n\nParticipants also underwent a production test in which they were shown a scene and asked to type\nin a sentence describing that scene. Because we did not want this to be a memory test, we displayed\nthe relevant verb on the top of the screen. Pictures of all the nouns, with their respective names\nbelow, were also available on the bottom of the screen for reference. Four scenes were presented for\neach verb, using subject-object noun pairs that were cycled through at random. Verbs were also cycled\nthrough at random.\n\n4.4 Results\n\nOur results show that participants in both conditions were able to learn much of the grammatical\nstructure. However, there were signi\ufb01cant differences between the generative and discriminative\nconditions (see Figure 3). Most notably, the generative learners overwhelmingly judged verb V4 to\nbe ungrammatical in S2, while the majority of discriminative learners deemed V4 to be grammat-\nical in S2 (see Figure 3d). This difference between conditions was highly statistically signi\ufb01cant\nby a Pearson\u2019s \u03c72 test (\u03c72(1) = 7.28, p = 0.007). 
This difference aligned with the difference in\nthe predictions of the HBM (generative) model and the logistic regression (discriminative) model\ndiscussed earlier. Our results strongly suggest that participants in the generative condition were learning\nlanguage with a probabilistic perspective that allowed them to learn restrictions on verb alterna-\ntions by using implicit negative evidence, whereas participants in the discriminative condition made\nsampling assumptions that did not allow them to learn the alternation restriction.\n\n[Figure 4: four panels, one per verb; y-axis: Production (0 to 1); x-axis: S1, S2, S3, other; legend: generative, discriminative]\n\nFigure 4: Human production data, showing proportion of productions in each sentence structure.\n\nAnother difference we found between the two conditions was that discriminative learners were more\nwilling to consider verbs to be alternating (i.e., to allow those verbs to be grammatical in two sentence\nstructures). This is evidenced by the fact that participants in the generative condition rated occur-\nrences of V1 (the alternating verb) in S1 and S2 as grammatical only 68% and 72% of the time, respectively. This\nis because many participants judged V1 to be grammatical in either S1 or S2 and not both. On the\nother hand, participants in the discriminative condition rated occurrences of V1 in S1 and S2 gram-\nmatical 100% of the time (see Figure 3a). Pearson\u2019s \u03c72 tests for the difference between conditions\nfor grammaticality of V1 in S1 and S2 were marginally signi\ufb01cant, with \u03c72(1) = 4.16, p = 0.04\nand \u03c72(1) = 3.47, p = 0.06 respectively. From post-experiment questioning, we learned that many\nparticipants in the generative condition did not think verbs would occur in two possible sentence\nstructures. 
None of the participants in the discriminative condition were constrained by this as-\nsumption. Why the two conditions prompted signi\ufb01cantly different prior assumptions about the\nprevalence of verb alternations will be a question for future research, but is particularly interesting\nin the context of the HBM, which can learn a prior expressing similar constraints.\n\nProduction test results showed that participants tended to use verbs in the sentence structures that\nthey heard them in (see Figure 4). Notably, even though the majority of the learners in the discrim-\ninative condition rated verb V4 in S2 as grammatical, only 20% of the productions of V4 were in\nS2. This is in line with previous results showing that how often a sentence structure is produced\nis proportional to how often that structure is heard, and rarely heard structures are rarely produced,\neven if they are believed to be grammatical [20].\n\n5 Discussion\n\nWe have shown that arti\ufb01cial language learners may or may not learn restrictions on verb alterna-\ntions, depending on the learning context. Our simulations of generative and discriminative learners\nmade predictions about how these approaches deal with implicit negative evidence, and these pre-\ndictions were borne out in an experiment with human learners. Participants in both experimental\nconditions viewed exactly the same sentences and were told whether each sentence was grammatical\nor ungrammatical. What varied between conditions was the way the grammaticality information\nwas presented. In the discriminative condition, participants were given yes/no grammaticality feed-\nback on sentences presumed to be sampled at random. Because of the random sampling assumption,\nthe absence of a verb in a given sentence structure did not provide implicit negative evidence against\nthe grammaticality of that construction. 
In contrast, participants in the generative condition judged\nthe unseen verb-sentence structure to be ungrammatical. This is in line with the idea that they had\nsought to estimate a probability distribution over sentences, under the assumption that the sentences\nthey observed were drawn from that distribution.\n\nOur simulations and behavioral results begin to clarify the connection between theoretical analyses\nof language learnability and human behavior. In showing that people learn differently under different\nconstruals of the learning problem, we are able to examine how well normal language learning\ncorresponds to the learning behavior we see in these two cases. Participants in our generative condition\nheard sentences spoken by a grammatical speaker, similar to the way children learn by listening\nto adult speech. In post-experiment questioning, generative learners also stated that they ignored all\nnegative evidence from the ungrammatical child speaker, similar to the way children ignore negative\nevidence in real language acquisition. These observations support the idea that human language\nlearning is better characterized by the generative approach. Establishing this connection to the generative\napproach helps to identify the strengths and limitations of human language learning, leading\nto the expectation that human learners can use implicit negative evidence to identify their language,\nbut will not be as robust to variation in the distribution of observed sentences as a discriminative\nlearner might be.\n\nAcknowledgments. This work was supported by grant SES-0631518 from the National Science Foundation.\n\nReferences\n\n[1] C. L. Baker. Syntactic theory and the projection problem. Linguistic Inquiry, 10:533\u2013538, 1979.\n[2] C. L. Baker and J. J. McCarthy. The logical problem of language acquisition. MIT Press, 1981.\n[3] N. Chomsky. Aspects of the theory of syntax. MIT Press, 1965.\n[4] S. Pinker. 
Learnability and Cognition: The acquisition of argument structure. MIT Press, 1989.\n[5] M. Bowerman. The \u2018No Negative Evidence\u2019 Problem: How do children avoid constructing an overly\n\ngeneral grammar? In J. Hawkins, editor, Explaining Language Universals, pages 73\u2013101. Blackwell,\nNew York, 1988.\n\n[6] R. Brown and C. Hanlon. Derivational complexity and order of acquisition in child speech. Wiley, 1970.\n[7] G. F. Marcus. Negative evidence in language acquisition. Cognition, 46:53\u201385, 1993.\n[8] E. M. Gold. Language identification in the limit. Information and Control, 16:447\u2013474, 1967.\n[9] M. A. Nowak, N. L. Komarova, and P. Niyogi. Computational and evolutionary aspects of language.\n\nNature, 417:611\u2013617, 2002.\n\n[10] S. Crain and D. Lillo-Martin. An introduction to linguistic theory and language acquisition. Blackwell,\n\n1999.\n\n[11] D. Angluin. Identifying languages from stochastic examples. Technical Report YALEU/DCS/RR-614,\n\nYale University, Department of Computer Science, 1988.\n\n[12] J. J. Horning. A study of grammatical inference. PhD thesis, Stanford University, 1969.\n[13] N. Chater and P. Vitanyi. \u201cIdeal learning\u201d of natural language: Positive results about learning from\n\npositive evidence. Journal of Mathematical Psychology, 51:135\u2013163, 2007.\n\n[14] M. Dowman. Addressing the learnability of verb subcategorizations with Bayesian inference. In\n\nProceedings of the 22nd Annual Conference of the Cognitive Science Society, 2005.\n\n[15] C. Kemp, A. Perfors, and J. Tenenbaum. Learning overhypotheses with hierarchical Bayesian models.\n\nDevelopmental Science, 10:307\u2013321, 2007.\n\n[16] P. Langley and S. Stromsten. Learning context-free grammars with a simplicity bias. In Proceedings of\n\nthe 11th European Conference on Machine Learning, 2000.\n\n[17] L. Onnis, M. Roberts, and N. Chater. 
Simplicity: A cure for overgeneralizations in language acquisition?\n\nIn Proceedings of the 24th Annual Conference of the Cognitive Science Society, pages 720\u2013725, 2002.\n\n[18] A. Perfors, J. Tenenbaum, and T. Regier. Poverty of the stimulus: A rational approach? In Proceedings\n\nof the 28th Annual Conference of the Cognitive Science Society, pages 664\u2013668, 2006.\n\n[19] A. Stolcke. Bayesian learning of probabilistic language models. PhD thesis, UC Berkeley, 1994.\n[20] E. Wonnacott, E. Newport, and M. Tanenhaus. Acquiring and processing verb argument structure: Dis-\n\ntributional learning in a miniature language. Cognitive Psychology, 56:165\u2013209, 2008.\n\n[21] A. Y. Ng and M. Jordan. On discriminative vs. generative classi\ufb01ers: A comparison of logistic regression\n\nand naive Bayes. In Advances in Neural Information Processing Systems 17, 2001.\n\n[22] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman Hall, 2003.\n\n\f", "award": [], "sourceid": 639, "authors": [{"given_name": "Anne", "family_name": "Hsu", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}