{"title": "Why are some word orders more common than others? A uniform information density account", "book": "Advances in Neural Information Processing Systems", "page_first": 1585, "page_last": 1593, "abstract": "Languages vary widely in many ways, including their canonical word order. A basic aspect of the observed variation is the fact that some word orders are much more common than others. Although this regularity has been recognized for some time, it has not been well-explained. In this paper we offer an information-theoretic explanation for the observed word-order distribution across languages, based on the concept of Uniform Information Density (UID). We suggest that object-first languages are particularly disfavored because they are highly non-optimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work.", "full_text": "Why are some word orders more common than\nothers? A uniform information density account\n\nLuke Maurits, Amy Perfors & Daniel Navarro\n\nSchool of Psychology,\nUniversity of Adelaide,\n\n{luke.maurits, amy.perfors, daniel.navarro}@adelaide.edu.au\n\nAdelaide, South Australia, 5000\n\nAbstract\n\nLanguages vary widely in many ways, including their canonical word order. A\nbasic aspect of the observed variation is the fact that some word orders are much\nmore common than others. Although this regularity has been recognized for\nsome time, it has not been well-explained. In this paper we offer an information-\ntheoretic explanation for the observed word-order distribution across languages,\nbased on the concept of Uniform Information Density (UID). 
We suggest that object-first languages are particularly disfavored because they are highly non-optimal if the goal is to distribute information content approximately evenly throughout a sentence, and that the rest of the observed word-order distribution is at least partially explainable in terms of UID. We support our theoretical analysis with data from child-directed speech and experimental work.\n\n1 Introduction\n\nMany of the world\u2019s languages are sensitive to word order. In these languages, the order in which words are spoken conveys a great deal of the sentence\u2019s meaning. The classic English example is the distinction between \u201cdog bites man\u201d and \u201cman bites dog\u201d, which differ in terms of who is biting whom. The so-called \u201cbasic\u201d word order of a language is defined according to the order of three of the principal components of basic transitive sentences: subject (S), verb (V) and object (O). This results in six logically distinct word orders: SOV, SVO, VSO, VOS, OVS and OSV (e.g., English has SVO basic word order). Curiously, the world\u2019s order-sensitive languages make use of these six possibilities in an uneven fashion. According to a survey of 402 languages [17], the majority of languages are either SOV (44.78%) or SVO (41.79%). VSO (9.20%) is much less frequent but still significant, and very few languages make use of VOS (2.99%), OVS (1.24%) or OSV (0.00%) as their basic word order. Broadly speaking, the basic pattern appears to be (SOV, SVO) > VSO > (VOS, OVS) > OSV. This non-uniformity is a striking empirical finding that demands some explanation. 
Unfortunately, most of the explanations that have been offered are either proximate explanations that simply shift the question, or else are circular.\nOne of the most straightforward explanations is that the observed word order frequencies may be the consequence of genetically encoded biases toward particular orders, as part of the universal grammar hypothesis; this possibility is considered in [4]. However, this can be only a proximate explanation: why does our genetic endowment happen to bias us in the particular way that it does? And if there is nothing special about the observed distribution \u2013 if it is not an adaptation to the environment \u2013 why have thousands of years of adaptation and genetic drift not blurred it into something closer to uniformity?\nA similar objection can be made against the proposal that all languages which are alive today descend from a single common ancestor, and that this proto-language used SOV word order [8], explaining the observation that SOV is the most common word order today. If there is nothing special about SOV, why has random drift (this time in language evolution, not human genetic evolution) not more significantly changed the word order distribution from its ancient form? Furthermore, it is clear that ancient SOV languages must have changed into SVO languages much more frequently than into, say, VOS languages in order to arrive at the current state of affairs. Common descent from SOV cannot explain this by itself.\nAnother explanation seeks to derive word order frequencies as a consequence of more fundamental or general linguistic principles. Three such principles are presented in [17]: the \u201ctheme-first principle\u201d, \u201cverb-object bonding\u201d and the \u201canimate-first principle\u201d. 
These principles do an excellent job of explaining the observed word order frequencies; the frequency of each word order is proportional to the number of the principles which that word order permits to be realized (all three principles are realized in SOV and SVO, two are realized in VSO, one in VOS and OVS, and none in OSV). However, these principles are primarily motivated by the fact that a large body of cross-linguistic data is consistent with them. Without a deeper justification, they are, in essence, a useful recharacterization of the data; to offer them as explanations of patterns in that data is circular. In other words, it is not clear why these principles work.\nIn this paper we propose a novel explanation for the observed distribution of word orders across languages, based on uniform information density (UID). The UID hypothesis [13, 10] suggests that language producers unconsciously endeavor to keep the rate of information transmission as close to constant as possible when speaking. We use the term \u201cinformation\u201d here in its information-theoretic sense of reduction of entropy (uncertainty) of a random variable (where the random variable is the underlying meaning of an utterance). Conveying information via speech with a uniform information density represents an optimal solution to the computational problem of conveying information over a noisy channel in a short time with low probability of error. A listener\u2019s comprehension of an utterance is made more difficult if a syllable, word or clause which carries a lot of information is lost due to ambient noise or problems with articulation or perception. The most error-resistant strategy is therefore to convey minimal information with each unit of speech. 
Unfortunately, this leads to other problems \u2013 namely, that it will take excessive time to convey any meaningful quantity of information. The best trade-off between time efficiency and error resistance is to spread information content as equally as possible across units and have each unit carry as much information as it can without exceeding the threshold for error correctability (the channel capacity). Also, UID minimizes the difficulty involved in online sentence processing, assuming that the difficulty of processing a speech unit increases superlinearly with that unit\u2019s surprisal [13].\nThe UID hypothesis is supported by a range of empirical evidence. It suggests that speakers should attempt to slow down the rate at which information is conveyed when unexpected, high-entropy content is being discussed, and increase the rate when predictable, low-entropy content is being discussed. This prediction is supported by findings indicating that certain classes of words [1] and syllables [3] are spoken more slowly in unexpected contexts. In addition, analysis of corpus data suggests that the entropy of sentences taken out of context is higher for sentences further into a body of text [7, 12]. Furthermore, the use of both optional contractions (e.g., \u201cyou are\u201d vs. \u201cyou\u2019re\u201d) [2] and optional function words in relative clauses (e.g., \u201chow big is the house that you live in?\u201d vs. \u201chow big is the house you live in?\u201d) [14, 11] appears to be affected by information density considerations, with the shortened forms used less often when the upcoming content is unexpected.\nWe propose that the basic word order of a language influences the average uniformity of information density for sentences in that language, and that a preference for languages that are closer to the UID ideal can explain some of the structure in the observed distribution over basic word orders. 
The layout of the rest of the paper is as follows. In Section 2 we describe the underlying conceptual model and terminology using a simple illustrative example. In Section 3 we estimate the event distribution from corpora of child-directed speech, in Section 4 we estimate it experimentally, and in Section 5 we discuss the results.\n\n2 Development of hypothesis and illustrative examples\n\nThis work is based on a simple probabilistic model of language production. We assume that languages are grounded in a world, consisting of objects (elements of a set O) and actions (which are binary relations between objects, and elements of a set R, such that if r \u2208 R then r \u2282 O \u00d7 O). An event in the world is a triple consisting of a relation r and two objects o1, o2 and is written (o1, r, o2). Events in the world are generated probabilistically in a sequential fashion, as independent identically distributed draws from a probability distribution P over the set of events O \u00d7 R \u00d7 O. We assume that a language consists of nouns (each of which corresponds to a unique object) and verbs (each of which corresponds to a unique action). Utterances are generated from events by combining the three relevant words in one of the six possible orders. Each utterance is therefore three words long (there are no function words in the model). This defines a probabilistic generative model for three-word utterances.\nTo make this idea more concrete, we construct a simple toy world consisting of thirteen objects and two relations. Five of the objects represent individual people (ALICE, BOB, EVE, MALLORY, TRENT) and the other eight represent items which are either food (APPLE, BREAD, CAKE, RICE) or drink (COFFEE, COLA, JUICE, WATER). The two relations are EAT and DRINK, so that the events in this world represent particular people eating or drinking particular items (e.g. (ALICE, DRINK, COFFEE)). Impossible events (e.g., (COFFEE, DRINK, ALICE)) are given zero probability in the event distribution P. 
A diagrammatic representation of all the non-zero probabilities of P is available in the supplementary material, but the salient features of the example are as follows: each of the five people eat and drink equally often, and equally as often as each other; nobody drinks foods or eats drinks; and each person has their own particular idiosyncratic distribution over which foods they prefer to eat and which drinks they prefer to drink.\nWhat is the link between word order and information density in this toy world? Consider a listener who learns about events in this toy world by hearing three-word utterances (such as \u201cAlice eats apples\u201d or \u201cBob drinks coffee\u201d), one word at a time. Until they have heard all three words in the utterance, there will generally remain some degree of uncertainty about what the event is, with the uncertainty decreasing as each word is heard. Formally, the event underlying an utterance is a random variable, and the listener\u2019s uncertainty is represented by the entropy of that random variable.\nBefore any words are spoken, the observer\u2019s uncertainty is given by the entropy of the event distribution (which we refer to as the base entropy and denote H0):\n\nH0 = H(P) = \u2212 \u2211_{(o1,r,o2)} P(o1, r, o2) log P(o1, r, o2),   (1)\n\nwhere the sum is taken over all possible events in the world. After the first word, the observer\u2019s uncertainty about the event is reduced, and now corresponds to the entropy of one of the conditional distributions, P(o1, o2|r), P(r, o2|o1) or P(o1, r|o2), depending on whether the first word corresponds to the action (VSO or VOS word order), the person (SVO or SOV word order) or the food/drink (OVS or OSV word order). Similarly, after the second word, the uncertainty is the entropy of one of the conditional distributions P(o2|o1, r), P(o1|r, o2) or P(r|o1, o2), depending again on word order. 
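As a concrete sketch of these entropy computations, the snippet below builds a small invented event distribution (the triples and probabilities are made up for illustration; this is not the paper's actual toy world) and computes the base entropy H0 and the listener's remaining uncertainty once the first word has been heard:

```python
import math

# Hypothetical event distribution P over (subject, verb, object) triples.
P = {
    ("ALICE", "EAT", "APPLE"):   0.30,
    ("ALICE", "EAT", "BREAD"):   0.10,
    ("ALICE", "DRINK", "JUICE"): 0.20,
    ("BOB", "EAT", "BREAD"):     0.15,
    ("BOB", "DRINK", "WATER"):   0.25,
}

def entropy(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def entropy_after(P, slot, word):
    """Entropy of the event given that the word filling `slot`
    (0 = subject, 1 = verb, 2 = object) has been heard: the entropy
    of the renormalized conditional distribution."""
    cond = {e: p for e, p in P.items() if e[slot] == word}
    total = sum(cond.values())
    return entropy({e: p / total for e, p in cond.items()})

H0 = entropy(P)                           # base entropy, Equation (1)
H_after_S = entropy_after(P, 0, "ALICE")  # uncertainty once "Alice" is heard
H_after_V = entropy_after(P, 1, "EAT")    # uncertainty once "eats" is heard
print(H0, H_after_S, H_after_V)
```

Hearing any word can only reduce (never increase, in expectation) the listener's uncertainty, which is why each conditional entropy printed here is below H0.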
After the third word the event is uniquely determined and the entropy is zero. This means that for any particular event, the six different choices of word order each define a different monotonically decreasing sequence of intermediate entropies, with the first point in the sequence always being H0 and the final point always being zero. Equivalently, the different choices of word order result in different distributions of the total information content of a sentence amongst its constituent words. We call sequences of entropies (H0, H1, H2, 0) entropy trajectories, and sequences of information (I1 = H0 \u2212 H1, I2 = H1 \u2212 H2, I3 = H2) information profiles. Figure 1 shows the entropy trajectories and corresponding information profiles for the event (ALICE, EAT, APPLE) in our toy world, for three different word orders. The figure demonstrates the correspondence between trajectories and profiles, as well as the dependency of both on word order. Note that in the figure we have normalized entropies and informations, so that H0 = 1.\nIf we make the simplifying assumption that all words are of equal length1, the UID hypothesis suggests that the ideal shape of an entropy trajectory is a perfectly straight line from the initial base entropy to the eventual zero entropy, or, equivalently, that the ideal shape of an information profile is for each word to convey one third of the total information. Figure 1 demonstrates that some trajectories are better realizations of this ideal than others. For example, in our toy world the entropy trajectories for the word orders SOV, OSV and OVS (two of which are pictured in Figure 1) are perfectly horizontal at various points (equivalently, some words carry zero information) because\n\n1Obviously this is not true. 
However, in order for this simplifying assumption to skew our results, the length of nouns would need to vary systematically depending on the relative frequency with which the nouns were the subject and object of sentences, which is highly unlikely to be the case.\n\nFigure 1: The entropy trajectories and corresponding information profiles for the event (ALICE, EAT, APPLE) in our toy world, for three different word orders. Dotted lines indicate the ideal trajectory and profile according to the UID hypothesis. Observe that word orders in which the object precedes the verb have significant \u201ctroughs\u201d in their information profiles, making them far from ideal. This pattern arises because of the event structure in our toy world; our question is what word orders are optimal given real-world event structure.\n\nknowledge of the object in this world uniquely determines the verb (since foods are strictly eaten and drinks are strictly drunk). Thus, any word order that places O before V renders the verb entirely uninformative, in significant conflict with the UID hypothesis.\nTo formalize the intuitive notion of distance from the UID ideal we define the UID deviation score D(I) of any given information profile I = (I1, I2, I3). D(I) is given by the formula:\n\nD(I) = (3/4) \u2211_{i=1}^{3} |Ii/H0 \u2212 1/3|.   (2)\n\nIt is easy to verify that the UID ideal information profile, with I1 = I2 = I3, has a deviation score of zero, and the least-ideal profile, in which all information is conveyed by a single word, has a deviation score of 1.\nThe UID deviation score allows us, for each event in the model world, to produce both an ordering of the word orders from \u201cmost UID-like\u201d to \u201cleast UID-like\u201d, and a quantitative measure of the extent to which each word order approaches uniform information density. 
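Equation (2) is straightforward to implement; the sketch below (the function and variable names are ours, not from the paper) checks the two boundary cases just described:

```python
def uid_deviation(profile, H0):
    """UID deviation score from Equation (2):
    D(I) = (3/4) * sum_i |I_i/H0 - 1/3|.
    0 means perfectly uniform information density; 1 means all of the
    sentence's information is packed into a single word."""
    return 0.75 * sum(abs(I / H0 - 1.0 / 3.0) for I in profile)

H0 = 3.0  # total information content of the (hypothetical) sentence
print(uid_deviation([1.0, 1.0, 1.0], H0))  # ideal profile: each word 1/3
print(uid_deviation([3.0, 0.0, 0.0], H0))  # degenerate profile: one word only
```

The first call returns 0 and the second returns 1, matching the extremes noted in the text.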
We can straightforwardly calculate a mean deviation score for the entire model world, by summing the scores for each individual event, weighted by that event\u2019s probability according to the event distribution P. This lets us assess the extent to which each word order is UID-suited to a given world. For our toy world, the ordering of word orders from lowest to highest mean deviation score is: VSO, VOS, SVO, OVS, SOV, OSV.\nOf course, our toy world is a highly contrived example, and so there is no reason to expect it to produce the observed cross-linguistic distribution of word orders. This is because we constructed the artificial P distribution to be pedagogically useful, not to reflect the real-world distribution of events. The toy example is intended only as a demonstration of the core idea underlying our hypothesis: that different choices of word order map the same probabilistic structure of the world (P) onto different information profiles. Since these profiles have differing levels of information density uniformity, the UID hypothesis implies a preference ranking of word orders.\nWhat are the mean deviation scores when the event distribution P more accurately approximates reality? Does the preferred ranking of word orders implied by the UID hypothesis reflect the observed cross-linguistic distribution of word orders? We investigate these questions in the rest of the paper.\n\n3 Corpus analysis\n\nOur work above implies that a particular word ordering in a language is good to the extent that it produces minimal UID deviation scores for events in the world. Accordingly, it would be ideal to assess the optimality of a particular word ordering with respect to the true distribution over \u201cpsychologically meaningful\u201d events in the everyday environment. Although we do not have access to this distribution, we may be able to construct sensible approximations. 
One option is to assume that spontaneous speech is informative about event probabilities \u2013 that the probability with which speakers discuss an event is roughly proportional to the actual frequency or psychological importance of that event. Guided by this assumption, in this section we estimate P on the basis of child-directed speech corpora in two languages, English and Japanese.\n\nFigure 2: Distribution of information across words for the world instantiated from an English corpus\n\nFigure 3: Distribution of information across words for the world instantiated from a Japanese corpus\n\nWe use child-directed speech even though the UID hypothesis applies equally well to adult speakers for two reasons: because child-directed speech is more amenable to the particular analysis we provide (which requires relatively simple sentences), and because children learn their language\u2019s basic word order very quickly and accurately [9, 5], suggesting that any aspect of primary linguistic data relevant to word order learning must be present in simple child-directed speech.\nAs our source of English data, we take the \u201cAdam\u201d transcripts from the Brown corpus [5] in the CHILDES database [15]. From this data we extract all of the child-directed utterances involving a random subset of the singly transitive verbs in the corpus (a total of 544 utterances). The subjects and objects of these utterances define the set O and the verbs define the set R. In our analysis, we treat each utterance as a distinct event, setting the probability of an event in P to be proportional to the number of times the corresponding utterance occurs in the corpus. Thus the event distribution P is a measure of the probability that speakers of the language choose to discuss events (rather than their frequency in the real world). 
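The estimation step just described amounts to normalized counting; a minimal sketch, with invented utterance triples standing in for the actual corpus data:

```python
from collections import Counter

# Invented (subject, verb, object) utterance tokens standing in for the
# extracted corpus utterances -- not real CHILDES data.
utterances = [
    ("ADAM", "EAT", "COOKIE"),
    ("ADAM", "EAT", "COOKIE"),
    ("MOMMY", "READ", "BOOK"),
    ("ADAM", "DRINK", "MILK"),
]

# Each utterance is a distinct event; its probability under P is
# proportional to how often it occurs in the corpus.
counts = Counter(utterances)
total = sum(counts.values())
P = {event: n / total for event, n in counts.items()}
print(P[("ADAM", "EAT", "COOKIE")])  # 2 of 4 tokens -> 0.5
```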
For simplicity, we ignore adjectives, plurality, tense, and so forth: for instance, the utterances \u201cthe black cat sat on the mat\u201d and \u201cthe cats are sitting on the soft mat\u201d would both be mapped to the same event, (CAT, SIT, MAT). Utterances involving pronouns which were considered likely to refer to a wide range of objects across the corpus (such as \u201cit\u201d, \u201cthis\u201d, etc.) were discarded, while those involving pronouns which in the context of the discourse could be expected to refer to a small set of objects (such as \u201che\u201d or \u201cshe\u201d) were retained.\nFigure 2 shows the distribution of information amongst words (summarizing all of the model world\u2019s information profiles) for all six word orders according to the event distribution P derived from the \u201cAdam\u201d transcripts. The mean deviation scores for the six word orders are (from lowest to highest) VSO (0.38), SVO (0.41), VOS (0.48), SOV (0.64), OSV (0.78), OVS (0.79).\nTo guard against the possibility that these results are a by-product of the fact that English has basic word order SVO, we repeat the method discussed above using utterances involving singly transitive verbs taken from the \u201cAsato\u201d, \u201cNanami\u201d and \u201cTomito\u201d transcripts in the MiiPro corpus of the CHILDES database, which is in Japanese (basic order SOV). From these transcripts we retrieve 134 utterances. The distribution of information amongst words for the event distribution derived from the Japanese transcripts is shown in Figure 3. The mean deviation scores are SVO (0.66), VSO (0.71), SOV (0.72), VOS (0.72), OSV (0.82), OVS (0.83). 
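The mean deviation scores reported above combine the pieces introduced in Section 2: per-event entropy trajectories, information profiles, deviation scores, and a probability-weighted average. A self-contained sketch of that pipeline follows, over an invented distribution (so the printed numbers will not reproduce the corpus scores):

```python
import math

# Invented event distribution over (subject, verb, object) triples.
P = {
    ("ALICE", "EAT", "APPLE"):   0.30,
    ("ALICE", "EAT", "BREAD"):   0.10,
    ("ALICE", "DRINK", "JUICE"): 0.20,
    ("BOB", "EAT", "BREAD"):     0.15,
    ("BOB", "DRINK", "WATER"):   0.25,
}

# A word order is the sequence in which slots 0=S, 1=V, 2=O are revealed.
ORDERS = {"SVO": (0, 1, 2), "SOV": (0, 2, 1), "VSO": (1, 0, 2),
          "VOS": (1, 2, 0), "OVS": (2, 1, 0), "OSV": (2, 0, 1)}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def deviation(event, order, P):
    """UID deviation score D(I) for one event under one word order."""
    cond, H = dict(P), []
    H.append(entropy(cond))                  # H0: base entropy
    for slot in order:
        cond = {e: p for e, p in cond.items() if e[slot] == event[slot]}
        total = sum(cond.values())
        cond = {e: p / total for e, p in cond.items()}
        H.append(entropy(cond))              # trajectory H0, H1, H2, H3 = 0
    profile = [H[i] - H[i + 1] for i in range(3)]   # information profile
    return 0.75 * sum(abs(I / H[0] - 1.0 / 3.0) for I in profile)

# Probability-weighted mean deviation per word order, then rank.
mean_dev = {name: sum(p * deviation(e, order, P) for e, p in P.items())
            for name, order in ORDERS.items()}
for name in sorted(mean_dev, key=mean_dev.get):
    print(name, round(mean_dev[name], 3))
```

Swapping in an event distribution estimated from real utterance counts would yield the corpus-based rankings discussed in the text.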
This is not precisely the ranking recovered from the English corpus, but there are clear similarities, which we discuss later.\n\n4 Experiment\n\nIn the previous analyses, the event distribution P was estimated on the basis of linguistic input. While this is sensible in many respects, it blurs the distinction between the frequency of events in the world and the frequency with which speakers choose to discuss those events.\n\nTable 1: Objects and relations in our experiment\u2019s model world. Asterisks denote \u201cactor\u201d status.\n\nObjects: APPLE, BEAR*, BED, BELLY-BUTTON, BLANKET, BUNNY*, CAT*, CHAIR, CHEESE, COOKIE, COW*, CRACKER, CUP, DIAPER, DOOR, DUCK*, EAR, FISH*, FLOWER, FOOT*, HAIR, HAND*, HAT, HORSE*, KEY*, LIGHT, MILK, MOUTH*, NOSE*, OUTSIDE, PERSON*, PIG*, SPOON*, TV, TELEPHONE, TOE*, TOOTH*, TREE, WATER\nRelations: BITE, DRINK, EAT, HELP, HUG, KISS, OPEN, READ, SEE, SWING\n\nTable 2: Most and least probable completions of event frames according to experimentally determined event distribution P\n\nEvent frame | Most probable completion | Least probable completion\n(PERSON, EAT, _) | APPLE | DOOR\n(CAT, DRINK, _) | MILK | BED\n(PERSON, _, CAT) | HELP | EAT\n(_, EAT, FLOWER) | COW | TOOTH\n\nIn one version of the UID hypothesis, we would expect that word order would be optimal with respect to the latter, \u201cspeaker-weighted\u201d frequencies. We refer to this as the \u201cweak\u201d hypothesis since it only requires that a language be \u201cinternally\u201d consistent, insofar as the word order is expected to be optimal with respect to the topics spoken about. However, there is also a \u201cstrong\u201d version of the hypothesis, which states that the language must also be optimal with respect to the perceived frequencies of events in the external world. To test the strong version of the UID word order hypothesis, it is not valid to rely on corpus analysis. 
Accordingly, in this section we present the results of an experiment designed to measure people\u2019s perceptions regarding which events are most likely.\nOur experiment consists of three parts. In the first part we identify the objects O and relations R for the model world based on the first words learned by English-speaking children, on the assumption that those words would reflect the objects and relations that are highly salient. The MacArthur Communicative Development Inventory [6] provides a list of those words, along with norms for when they are learned. We identified all of the words that were either singly-transitive verbs or nouns that were potential subjects or objects for these verbs, yielding 324 nouns and 81 verbs. The only transformation we made to this list was to replace all nouns that referred to specific people (e.g., \u201cMommy\u201d or \u201cGrandpa\u201d) with a single noun \u201cPerson\u201d. In order to limit the total number of possible events to a number tractable for parts two and three of the experiment, we then identified the 40 objects and 10 relations2 uttered by the highest percentage of children below the age of 16 months; these comprise the sets O and R. The objects and relations are shown in Table 1.\nThe 40 objects and 10 relations in our world define a total of 16,000 events, but the overwhelming majority of the events in the world are physically impossible (e.g., (TV, DRINK, CAT)) and thus should receive a probability of 0. The goal of the second part of the experiment was to identify these impossible events. The first step was to identify the subset of objects capable of acting as actors, indicated with asterisks in Table 1. We set the probability of events whose subjects were non-actors to zero, leaving 6,800 events. 
To identify which of these events were still impossible, we had two participants3 judge the possibility or impossibility of each, obtaining two judgements for each event. When both judges agreed that an event was impossible, its probability was set to zero; if they disagreed, we solicited a third judgement and set the event probability to zero if the majority agreed that it was impossible. At the end of this process, a total of 2,536 events remained. Subsequent analysis revealed that many participants had interpreted the noun OUTSIDE as an adverb in events such as (BEAR, EAT, OUTSIDE), leading to events which should properly have been considered impossible being classed as possible; we therefore set all events involving the noun OUTSIDE which did not involve the verb SEE to also be impossible. This reduced the number of events to 2,352.\nIn the final part of the experiment, we derived a probability distribution over the remaining, possible events using the responses of participants to a large number of judgement tasks.\n\n2The ratio of 4 objects for every 1 relation was chosen to reflect the proportion of each reported in [6].\n3This experiment involved 11,839 binary decisions in the second part and 35,280 binary choices in the third part. In order to collect such a large quantity of data in a reasonable time period, we used Amazon.com\u2019s \u201cMechanical Turk\u201d web application to distribute the judgement tasks to a large international pool of participants, who completed the tasks using their web browsers in exchange for small payments of cash or Amazon.com store credit. A total of 8,956 participants contributed, presumably but not verifiably representing a broad range of nationalities, ages, levels of education, etc.\n\nFigure 4: Distribution of information across words for the world instantiated from the experimentally produced event distribution. 
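The paper defers the details of converting the pairwise judgement tasks into a distribution to its supplementary material. Purely as an illustration of one standard way such a conversion can be done (a Bradley-Terry model fitted by simple iterative updates; this is our assumption, not necessarily the authors' method), with invented win counts:

```python
def bradley_terry(events, wins, iters=500):
    """Fit Bradley-Terry strengths from pairwise comparison counts.
    wins[(a, b)] = number of times event a was judged more probable than b.
    Returns strengths normalized to sum to 1, usable as an estimate of P."""
    w = {e: 1.0 / len(events) for e in events}
    for _ in range(iters):
        new = {}
        for i in events:
            total_wins = sum(wins.get((i, j), 0) for j in events if j != i)
            denom = sum((wins.get((i, j), 0) + wins.get((j, i), 0))
                        / (w[i] + w[j]) for j in events if j != i)
            new[i] = total_wins / denom if denom > 0 else w[i]
        s = sum(new.values())
        w = {e: v / s for e, v in new.items()}
    return w

# Invented judgement data over three abstract events A, B, C.
events = ["A", "B", "C"]
wins = {("A", "B"): 3, ("B", "A"): 1, ("A", "C"): 4, ("C", "A"): 1,
        ("B", "C"): 2, ("C", "B"): 2}
P_est = bradley_terry(events, wins)
```

Under this model the fitted strength of an event rises with how consistently it wins its pairwise comparisons, so event A (seven wins, two losses) receives the largest probability here.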
In each task, participants were presented with a pair of events and asked to indicate which of the two events they considered more probable. Full details of this part of the experiment are available in the supplementary material. Table 2 shows the most and least probable completions of several event frames according to the distribution P produced by our experiment. The completions are in line with common sense, although some of the least probable completions are in fact physically impossible (e.g. (CAT, DRINK, BED)), suggesting that the filtering in part two was not quite perfect.\nWe now analyse the P distribution we have estimated. The distribution of information among words is shown in Figure 4 and the mean deviation scores are VSO (0.17), SVO (0.18), VOS (0.20), SOV (0.23), OVS (0.23), OSV (0.24).\n\n5 Discussion\n\nOn the basis of two corpora of child-directed speech, in different languages, and an experiment, we have derived three different event distributions which are assumed to represent the important features of the probabilistic structure of the physical world. From these different distributions we derive three different preferential rankings of word orders according to the UID hypothesis. From the English corpus, we get VSO > SVO > VOS > SOV > OSV > OVS; from the Japanese corpus, we get SVO > VSO > SOV = VOS > OSV > OVS; from the experiment, we get VSO > SVO > VOS > SOV = OVS > OSV. While these three rankings are not in perfect agreement, there is some degree of common structure. All three rankings are compatible with the partial ranking (SVO, VSO) > (SOV, VOS) > (OVS, OSV). 
How does this compare with the empirically observed ranking (SOV, SVO) > VSO > (VOS, OVS) > OSV?\nThe strongest empirical regularity regarding word order frequency \u2013 that object-first word orders are extremely rare \u2013 coincides with our most robust finding: object-first word orders lead to the least uniform information density in all three of our estimated event distributions. These orders together account for less than 2% of the world\u2019s word order-sensitive languages, and in all our models have deviation scores that are notably greater than the deviation scores of the other word orders. What is the reason for this effect? As the profiles in Figures 2, 3 and 4 indicate, object-first word orders deviate from uniformity because the first word (the object) carries a disproportionate amount of information. This seems to occur because many objects are predictive of very few subjects or verbs. For instance, hearing the object word \u201cwater\u201d implies only a few possibilities for verbs (e.g., \u201cdrink\u201d), which in turn restricts the subjects (e.g. to living things). By contrast, hearing the verb \u201cdrink\u201d implies many possibilities for objects (e.g., \u201cwater\u201d, \u201ccoffee\u201d, \u201ccola\u201d, \u201cjuice\u201d, etc.).\nThere are further points of agreement between the rankings produced by our analyses and the empirical data. All three of our estimated event distributions lead to word order rankings in which VSO is ranked more highly than VOS, which is in agreement with the data. 
In fact, in all of our rankings, SVO and VSO occupy the two highest positions (though their relative position varies), consistent with the fact that these word orders occupy the second and third highest positions in the empirical ranking respectively, and are two of the only three word orders which appear with any appreciable frequency.\nThe greatest apparent discrepancy between the rankings produced by our analyses and the empirical data is the fact that SOV word order, which occurs frequently in real languages, appears to be only moderately compatible with the UID hypothesis. One possible explanation for this is that some other factor besides UID-compatibility has influenced the distribution of word orders, and this factor may favour SOV sufficiently to lift it to the top or equal-top place in a combined ranking. Another possibility is to combine the idea we saw earlier of common descent from SOV with the idea that word order change away from SOV is influenced by the UID hypothesis. This explanation could also lift SOV word order to a higher position in the word order ranking.\nTo what extent are our rankings consistent with the theme-first (TFP), verb-object bonding (VOB) and animate-first (AFP) principles of [17], which perfectly explain the empirical ranking? The three orders that permit the greatest realization of the TFP and AFP principles are SOV, SVO, and VSO. We note that two of these orders, SVO and VSO, are consistently ranked highest in our results, and the third, SOV, is typically not too far behind. In fact, with the event distribution derived from the Japanese corpus, SOV is in equal third place with VOS. 
This suggests that perhaps the UID word order hypothesis is unable to provide a complete explanation of all of the word order rankings, but is able to provide a sensible justification for the TFP and/or AFP.
A full consideration of the effects of word order on information density should not limit itself only to the considerations made in this paper, and so our results here must be considered only preliminary. For instance, we have given no consideration to sentences involving intransitive verbs (SV sentences), sentences without an explicit subject (VO sentences), or sentences involving ditransitive verbs (SVO1O2 sentences). A word order optimal for one of these sentence classes may not be optimal for others, so the question of how to meaningfully combine the results of separate analyses becomes a central challenge in such an extended study. Furthermore, a number of other word order parameters beyond basic word order may have a significant effect on information density, such as whether a language uses prepositions or postpositions, or the relative position of nouns and adjectives or nouns and relative clauses. For instance, consider the order of nouns and adjectives. The utterance "I ate the..." can be completed by any edible object, but "I ate the red..." only by those objects which are both edible and red. Thus, adjectives which precede unexpected nouns can be used to "smooth out" what might otherwise be sudden spikes in information density. Adjectives which come after nouns cannot do this.
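The smoothing effect of a pre-nominal adjective can be illustrated with a toy calculation (the nouns and probabilities below are invented for illustration): the total information conveyed is identical either way, but placing the adjective first splits one large surprisal spike into two smaller steps.

```python
import math

# Hypothetical distribution over objects of "I ate the ..."; the rare
# noun "beetroot" is highly surprising on its own. All numbers are
# invented for illustration.
p_noun = {"bread": 0.6, "apple": 0.3, "beetroot": 0.1}
# Probability that each noun would be described as "red".
p_red_given_noun = {"bread": 0.0, "apple": 0.5, "beetroot": 1.0}

def bits(p):
    """Surprisal in bits of an outcome with probability p."""
    return -math.log2(p)

# Noun alone: one large spike for the unexpected noun (-log2 0.1 ~ 3.32 bits).
print("'beetroot' alone:", round(bits(p_noun["beetroot"]), 2), "bits")

# Adjective first: spend -log2 P(red) bits, then -log2 P(beetroot | red).
p_red = sum(p_noun[n] * p_red_given_noun[n] for n in p_noun)   # 0.25
p_beet_given_red = p_noun["beetroot"] * 1.0 / p_red            # 0.4
print("'red':", round(bits(p_red), 2), "bits")
print("'beetroot' | 'red':", round(bits(p_beet_given_red), 2), "bits")
```

The two-step path carries the same 3.32 bits in total, but its largest single-word spike is only 2 bits, which is exactly the sense in which a preceding adjective smooths information density.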
Several correlations and rules are known to exist between various word order parameters, and it is possible that these effects may also be explainable in terms of information density.
On the whole, while the word order rankings recovered from our analyses do not perfectly match the empirically observed ranking, they are in much better agreement with observation than one would expect if a preference for UID had played no role whatsoever. Furthermore, the particular pattern of what our rankings do and do not explain, and the ways our two rankings differ, are consistent with a weaker hypothesis that UID may be able to provide a principled cognitive explanation for the theme-first and/or animate-first principles of earlier work. It is possible that the discrepancies which do exist between our results and the empirical distribution could be explained by a combination of more and richer data and consideration of additional word order parameters. It is also the case that even if information-theoretic concerns have exerted a significant influence on language evolution, there is no reason to expect them to have been the only such influence: genetic and social factors as well as additional cognitive constraints may have played some role, so that the UID hypothesis alone need not explain all the observed regularity. Regardless, we have shown that information-theoretic principles can explain several aspects of the empirical distribution of word orders, and most robustly explain the most pronounced of these aspects: the nearly complete lack of object-first languages. Moreover, they do so on independently justified, general cognitive principles, and as such represent a significant advance in our understanding of word order.

6 Acknowledgements

DJN was supported by an Australian Research Fellowship (ARC grant DP-0773794).
Kirsty Maurits assisted significantly in the translation of utterances from the Japanese transcripts.

References

[1] Alan Bell, Daniel Jurafsky, Eric Fosler-Lussier, Cynthia Girand, Michelle Gregory, and Daniel Gildea. Effects of disfluencies, predictability, and utterance position on word form variation in English conversation. Journal of the Acoustical Society of America, 113(2), 2003.
[2] Austin F. Frank and T. Florian Jaeger. Speaking rationally: Uniform information density as an optimal strategy for language production. In Proceedings of the 30th Annual Meeting of the Cognitive Science Society, pages 933-938, 2008.
[3] M. Aylett and A. Turk. The smooth signal redundancy hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47:31-56, 2004.
[4] Ted Briscoe. Grammatical acquisition: Inductive bias and coevolution of language and the language acquisition device. Language, 76(2):245-296, 2000.
[5] R. Brown. A First Language. Harvard University Press, Cambridge, MA, 1973.
[6] Larry Fenson, Philip S. Dale, J. Steven Reznick, Elizabeth Bates, Donna J. Thal, and Stephen J. Pethick. Variability in early communicative development. Monographs of the Society for Research in Child Development, 59, 1994.
[7] D. Genzel and E. Charniak. Entropy rate constancy in text. In Proceedings of ACL, 2002.
[8] Talmy Givón. On Understanding Grammar. Academic Press, New York, NY, 1979.
[9] K. Hirsh-Pasek and R. Golinkoff. The Origins of Grammar: Evidence from Early Language Comprehension. MIT Press, Cambridge, MA, 1996.
[10] T. F. Jaeger. Redundancy and syntactic reduction in spontaneous speech. Unpublished doctoral dissertation, Stanford University, 2006.
[11] T. Florian Jaeger.
Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61:23-62, 2010.
[12] F. Keller. The entropy rate principle as a predictor of processing effort: An evaluation against eye-tracking data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 317-324, 2004.
[13] R. Levy. Probabilistic Models of Word Order and Syntactic Discontinuity. PhD thesis, Stanford University, 2005.
[14] R. Levy and T. F. Jaeger. Speakers optimize information density through syntactic reduction. In Advances in Neural Information Processing Systems, pages 849-856, 2007.
[15] B. MacWhinney. The CHILDES Project: Tools for Analyzing Talk. Lawrence Erlbaum Associates, Mahwah, NJ, 3rd edition, 2000.
[16] B. Miller, P. Hemmer, M. Steyvers, and M. D. Lee. The wisdom of crowds in rank ordering problems. In A. Howes, D. Peebles, and R. Cooper, editors, 9th International Conference on Cognitive Modeling, 2009.
[17] Russell S. Tomlin. Basic Word Order: Functional Principles. Croom Helm, 1986.