{"title": "The Early Word Catches the Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 52, "page_last": 58, "abstract": null, "full_text": "The Early Word Catches the Weights \n\nMark A. Smith \n\nGarrison W. Cottrell \n\nKaren L. Anderson \n\nDepartment of Computer Science \n\nUniversity of California at San Diego \n\nLa Jolla, CA 92093 \n\n{masmith,gary,kanders}@cs.ucsd.edu \n\nAbstract \n\nThe strong correlation between the frequency of words and their naming \nlatency has been well documented. However, as early as 1973, the Age \nof Acquisition (AoA) of a word was alleged to be the actual variable of \ninterest, but these studies seem to have been ignored in most of the lit(cid:173)\nerature. Recently, there has been a resurgence of interest in AoA. While \nsome studies have shown that frequency has no effect when AoA is con(cid:173)\ntrolled for, more recent studies have found independent contributions of \nfrequency and AoA. Connectionist models have repeatedly shown strong \neffects of frequency, but little attention has been paid to whether they \ncan also show AoA effects. Indeed, several researchers have explicitly \nclaimed that they cannot show AoA effects. In this work, we explore \nthese claims using a simple feed forward neural network. We find a sig(cid:173)\nnificant contribution of AoA to naming latency, as well as conditions \nunder which frequency provides an independent contribution. \n\n1 Background \n\nNaming latency is the time between the presentation of a picture or written word and the \nbeginning of the correct utterance of that word. It is undisputed that there are significant \ndifferences in the naming latency of many words, even when controlling word length, syl(cid:173)\nlabic complexity, and other structural variants. The cause of differences in naming latency \nhas been the subject of numerous studies. 
Earlier studies found that the frequency with which a word appears in spoken English is the best determinant of its naming latency (Oldfield & Wingfield, 1965). More recent psychological studies, however, show that the age at which a word is learned, or its Age of Acquisition (AoA), may be a better predictor of naming latency. Further, in many multiple regression analyses, frequency is not found to be significant when AoA is controlled for (Brown & Watson, 1987; Carroll & White, 1973; Morrison et al., 1992; Morrison & Ellis, 1995). These studies show that frequency and AoA are highly correlated (typically r = -.6), explaining the confound in the older frequency studies. However, still more recent studies question this finding and find that both AoA and frequency are significant and contribute independently to naming latency (Ellis & Morrison, 1998; Gerhand & Barry, 1998, 1999). \n\nMuch like their psychological counterparts, connectionist networks also show very strong frequency effects. However, the ability of a connectionist network to show AoA effects has been doubted (Gerhand & Barry, 1998; Morrison & Ellis, 1995). Most of these claims are based on the well-known fact that connectionist networks exhibit \"destructive interference\", in which later presented stimuli, in order to be learned, force early learned inputs to become less well represented, effectively increasing their associated errors. However, these effects only occur when training ceases on the early patterns. Continued training on all the patterns mitigates the effects of interference from later patterns. \n\nRecently, Ellis & Lambon-Ralph (in press) have shown that when pattern presentation is staged, with one set of patterns initially trained and a second set added into the training set later, strong AoA effects are found. 
They show that this result is due to a loss of plasticity in the network units, which tend to get out of the linear range with more training. While this result is not surprising, it is a good model of the fact that some words may not come into existence until late in life, such as \"email\" for baby boomers. However, they explicitly claim that it is important to stage the learning in this way, and offer no explanation of what happens during early word acquisition, when the surrounding vocabulary is relatively constant, or of why and when frequency and AoA show independent effects. \n\nIn this paper, we present an abstract feed-forward computational model of word acquisition that does not stage inputs. We use this model to examine the effects of frequency and AoA on sum squared error, the usual variable used to model reaction time. We find a consistent contribution of AoA to naming latency, as well as the conditions under which there is an independent contribution from frequency in some tasks. \n\n2 Experiment 1: Do networks show AoA effects? \n\nOur first goal was to show that AoA effects could be observed in a connectionist network using the simplest possible model. First, we need to define AoA in a network. We did this in such a way that staging the inputs was not necessary: we defined a threshold for the error, after which we would say a pattern has been \"acquired.\" The AoA is defined to be the epoch during which this threshold is crossed. Since the error for a particular pattern may occasionally go up again during online learning, we also measured the last epoch at which the pattern went below the threshold for the final time. We analyzed our networks using both definitions of acquisition (which we call first acquisition and final acquisition), and have found that the results vary little between them. In what follows, we use first acquisition for simplicity. 
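As a concrete sketch of how these two acquisition measures could be computed (our own illustration, not the authors' code; the array layout and helper name are assumptions based on the description above):

```python
import numpy as np

def acquisition_epochs(errors, threshold=2.0):
    """Compute first and final acquisition epochs per pattern.

    errors: array of shape [n_epochs, n_patterns] holding the sum
    squared error of each pattern at each epoch.  First acquisition is
    the first epoch the error drops below the threshold; final
    acquisition is the epoch after the last time the error was still
    above the threshold, i.e. when it went below for the final time.
    """
    below = errors < threshold
    first = below.argmax(axis=0)            # first epoch below threshold
    above = ~below
    last_above = (errors.shape[0] - 1) - above[::-1].argmax(axis=0)
    # Patterns never above threshold count as acquired at epoch 0.
    final = np.where(above.any(axis=0), last_above + 1, 0)
    return first, final
```

For patterns whose error dips below the threshold and later rises again during online learning, the two measures differ; per the text, results vary little between them.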
\n\n2.1 The Model \n\nThe simplest possible model is an autoencoder network. Using a network architecture of 20-15-20, we trained the network to autoencode 200 patterns of random bits (each bit had a 50% probability of being on). We initialized the weights randomly with a flat distribution of values between -0.1 and 0.1, and used a learning rate of 0.001 and momentum of 0.9. \n\nFor this experiment, we chose the AoA threshold to be 2, indicating an average squared error of .1 per input bit and yielding outputs much closer to the correct output than to any other. We calculated Euclidean distances between all outputs and patterns to verify that each input was mapped most closely to its correct output. Training on the entire corpus continued until 98% of all patterns fell below this threshold. \n\n2.2 Results \n\nAfter the network had learned the input corpus, we investigated the relationship between the epoch at which an input vector had been learned and the final sum squared error (equivalent, for us, to \"adult\" naming latency) for that input vector. These results are presented in Figure 1. \n\nFigure 1: Exp. 1. Final SSE vs. AoA. \n\nThe relationship between the age of acquisition of the input vector and its 
final sum squared error is clear: the earlier an input is learned, the lower its final error will be. A more formal analysis of this relationship yields a significant (p << .005) correlation coefficient of r = 0.749, averaged over 10 runs of the network. \n\nIn order to understand this relationship better, we divided the learned words into five percentile groups depending upon AoA. Figure 2 shows the average SSE for each group plotted over epoch number. The line with the least average SSE corresponds to the earliest acquired quintile, while the line with the highest average SSE corresponds to the last acquired quintile. From this graph we can see that the average SSE for earlier learned patterns stays below the errors for late learned patterns. This is true from the outset of learning, as well as when the error starts to decrease less rapidly as it asymptotically approaches some lowest error limit. We sloganize this result as \"the patterns that get to the weights first, win.\" \n\nFigure 2: Exp. 1. Average SSE over epochs by AoA percentile group. \n\n3 Experiment 2: Do AoA effects survive a frequency manipulation? \n\nHaving shown that AoA effects are present in connectionist networks, we wanted to investigate the interaction with frequency. We model the frequency distribution of inputs after the known English spoken word frequency distribution, in which very few words appear very often while a very large portion of words appear very seldom (Zipf's law). \n\nFigure 3: Exp. 2. Frequency distribution. \n\nFigure 4: Exp. 2. Final SSE vs. AoA. 
The frequency distribution we used (presentation probability = 0.05 + 0.95 * ((1 - input_number/num_inputs) + 0.05)^10) is presented in Figure 3 (a true version of Zipf's law still shows the result). Otherwise, all parameters are the same as in Exp. 1. \n\n3.1 Results \n\nResults are plotted in Figure 4. Here we again find a very strong and significant (p << 0.005) correlation between the age at which an input is learned and its naming latency. The correlation coefficient averaged over 10 runs is 0.668. This fits very well with known data. Figure 5 shows how the frequency of presentation of a given stimulus correlates with naming latency. \n\nFigure 5: Exp. 2. Frequency vs. SSE. \n\nFigure 6: Exp. 2. AoA vs. Frequency. \n\nWe find that the best fitting correlation is an exponential one, in which naming latency correlates most strongly with the log of the frequency. The correlation coefficient averaged over 10 runs is significant (p << 0.005) at -0.730. This is a slightly stronger correlation than is found in the literature. \n\nFinally, Figure 6 shows how frequency and AoA are related. Again, we find a significant (p < 0.005) correlation coefficient of -0.283 averaged over 10 runs. However, this is a much weaker correlation than is found in the literature. Performing a multiple regression with SSE as the dependent variable and AoA and log frequency as the two explanatory variables, we find that both AoA and log frequency contribute significantly (p << 0.005 for both variables) to the regression equation. 
Whereas AoA correlates with SSE at 0.668 and log frequency correlates with SSE at -0.730, the multiple correlation coefficient averaged over 10 runs is 0.794. AoA and log frequency each make independent contributions to naming latency. \n\nWe were encouraged that we found effects of both frequency and AoA on SSE in our model, but were surprised by the small size of the correlation between the two. The naming literature shows a strong correlation between AoA and frequency. However, pilot work with a smaller network showed no frequency effect, which was due to the autoencoding task in a network where the patterns filled 20% of the input space (200 random patterns in a 10-8-10 network, with 1024 patterns possible). This suggests that autoencoding is not an appropriate task for modeling naming, and would give rise to the low correlation between AoA and frequency. Indeed, English spellings and their corresponding sounds are certainly correlated, but not completely consistent, with many exceptional mappings. Spelling-sound consistency has been shown to have a significant effect on naming latency (Jared, McRae, & Seidenberg, 1990). Object naming, another task in which AoA effects are found, is a completely arbitrary mapping. Our third experiment looks at the effect that the consistency of our mapping task has on AoA and frequency effects. \n\n4 Experiment 3: Consistency effects \n\nOur model in this experiment is identical to the previous model except for two changes. First, to encode mappings with varying degrees of consistency, we needed to increase the number of hidden units to 50, resulting in a 20-50-20 architecture. Second, we found that some patterns would end up with one bit off, leading to a bimodal distribution of SSEs. We thus used cross-entropy error to ensure that all bits would be learned. 
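The motivation for the switch to cross-entropy can be seen in the output-unit gradients (a toy illustration of ours, not taken from the paper): with sigmoid outputs, the squared-error gradient carries a y(1 - y) factor that vanishes when a wrong bit saturates, while the cross-entropy gradient does not.

```python
def sse_delta(y, t):
    # Gradient of 0.5 * (y - t)^2 w.r.t. a sigmoid unit's net input:
    # the y * (1 - y) factor shrinks toward zero as the unit saturates.
    return (y - t) * y * (1 - y)

def xent_delta(y, t):
    # Gradient of cross-entropy w.r.t. the same net input simplifies
    # to the raw error, so a saturated wrong bit is still pushed hard.
    return y - t

# A confidently wrong output bit: target 1, sigmoid output near 0.
y, t = 0.01, 1.0
print(sse_delta(y, t))   # roughly -0.0098: learning on this bit stalls
print(xent_delta(y, t))  # -0.99: a strong error signal remains
```

This is consistent with the observation that a lone stuck bit produces a bimodal SSE distribution under squared error but gets corrected under cross-entropy.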
\n\nEleven levels of consistency were defined, from 100% consistent (autoencoding) to 0% consistent (a mapping from one random 20-bit vector to another random 20-bit vector). Note that in a 0% consistent mapping, since each bit has a 50% chance of being on, about 50% of the bits will be the same by chance. Thus an intermediate level of 50% consistency will have on average 75% of the corresponding bits equal. \n\nFigure: Correlation strength vs. mapping consistency (log frequency and RMSE). 
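One simple way to generate mappings at a graded consistency level, consistent with the agreement statistics quoted above, is for each target bit to copy the input bit with probability equal to the consistency level and to be random otherwise (this construction and its function name are our own sketch, not the paper's code):

```python
import numpy as np

def make_mapping(n_patterns=200, n_bits=20, consistency=0.5, seed=0):
    """Random binary input/target pairs at a given consistency level.

    Each target bit equals the corresponding input bit with probability
    `consistency` and is an independent random bit otherwise, so the
    expected fraction of agreeing bits is consistency + (1 - consistency)/2
    (e.g. 0.75 at 50% consistency, 0.5 at 0%).
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=(n_patterns, n_bits))
    noise = rng.integers(0, 2, size=(n_patterns, n_bits))
    copy = rng.random((n_patterns, n_bits)) < consistency
    y = np.where(copy, x, noise)
    return x, y
```

At consistency 1.0 this reduces to the autoencoding task of Experiments 1 and 2; at 0.0 it is a fully arbitrary mapping between random vectors.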