{"title": "Serial Order in Reading Aloud: Connectionist Models and Neighborhood Structure", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 65, "abstract": null, "full_text": "Local Dimensionality Reduction \n\nStefan Schaal 1,2,4 \nsschaal@usc.edu \nhttp://www-slab.usc.edu/sschaal \n\nSethu Vijayakumar 3,1 \nsethu@cs.titech.ac.jp \nhttp://ogawa-www.cs.titech.ac.jp/~sethu \n\nChristopher G. Atkeson 4 \ncga@cc.gatech.edu \nhttp://www.cc.gatech.edu/fac/Chris.Atkeson \n\n1 ERATO Kawato Dynamic Brain Project (JST), 2-2 Hikaridai, Seika-cho, Soraku-gun, 619-02 Kyoto \n\n2 Dept. of Comp. Science & Neuroscience, Univ. of South. California HNB-103, Los Angeles CA 90089-2520 \n\n3 Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo-152 \n\n4 College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, Atlanta, GA 30332-0280 \n\nAbstract \n\nIf globally high dimensional data has locally only low dimensional distributions, \nit is advantageous to perform a local dimensionality reduction before \nfurther processing the data. In this paper we examine several techniques for \nlocal dimensionality reduction in the context of locally weighted linear regression. \nAs possible candidates, we derive local versions of factor analysis \nregression, principal component regression, principal component regression \non joint distributions, and partial least squares regression. After outlining the \nstatistical bases of these methods, we perform Monte Carlo simulations to \nevaluate their robustness with respect to violations of their statistical assumptions. \nOne surprising outcome is that locally weighted partial least \nsquares regression offers the best average results, thus outperforming even \nfactor analysis, the theoretically most appealing of our candidate techniques. 
\n\n1 INTRODUCTION \nRegression tasks involve mapping an n-dimensional continuous input vector x ∈ ℝ^n onto \nan m-dimensional output vector y ∈ ℝ^m. They form a ubiquitous class of problems found \nin fields including process control, sensorimotor control, coordinate transformations, and \nvarious stages of information processing in biological nervous systems. This paper will \nfocus on spatially localized learning techniques, for example, kernel regression with \nGaussian weighting functions. Local learning offers advantages for real-time incremental \nlearning problems due to fast convergence, considerable robustness towards problems of \nnegative interference, and large tolerance in model selection (Atkeson, Moore, & Schaal, \n1997; Schaal & Atkeson, in press). Local learning is usually based on interpolating data \nfrom a local neighborhood around the query point. For high dimensional learning problems, \nhowever, it suffers from a bias/variance dilemma, caused by the nonintuitive fact \nthat \"... [in high dimensions] if neighborhoods are local, then they are almost surely \nempty, whereas if a neighborhood is not empty, then it is not local.\" (Scott, 1992, p.198). \nGlobal learning methods, such as sigmoidal feedforward networks, do not face this \nproblem as they do not employ neighborhood relations, although they require strong \nprior knowledge about the problem at hand in order to be successful. \nAssuming that local learning in high dimensions is hopeless, however, is not necessarily \nwarranted: being globally high dimensional does not imply that data remains high dimensional \nif viewed locally. 
For example, in the control of robot arms and biological \narms we have shown that for estimating the inverse dynamics of an arm, a globally 21-dimensional \nspace reduces on average to 4-6 dimensions locally (Vijayakumar & Schaal, \n1997). A local learning system that can robustly exploit such locally low dimensional \ndistributions should be able to avoid the curse of dimensionality. \nIn pursuit of the question of what, in the context of local regression, is the \"right\" \nmethod to perform local dimensionality reduction, this paper will derive and compare \nseveral candidate techniques under i) perfectly fulfilled statistical prerequisites (e.g., \nGaussian noise, Gaussian input distributions, perfectly linear data), and ii) less perfect \nconditions (e.g., non-Gaussian distributions, slightly quadratic data, incorrect guess of \nthe dimensionality of the true data distribution). We will focus on nonlinear function approximation \nwith locally weighted linear regression (LWR), as it allows us to adapt a variety \nof global linear dimensionality reduction techniques, and as LWR has found widespread \napplication in several local learning systems (Atkeson, Moore, & Schaal, 1997; \nJordan & Jacobs, 1994; Xu, Jordan, & Hinton, 1996). In particular, we will derive and \ninvestigate locally weighted principal component regression (LWPCR), locally weighted \njoint data principal component analysis (LWPCA), locally weighted factor analysis \n(LWFA), and locally weighted partial least squares (LWPLS). Section 2 will briefly outline \nthese methods and their theoretical foundations, while Section 3 will empirically \nevaluate the robustness of these methods using synthetic data sets that increasingly violate \nsome of the statistical assumptions of the techniques. 
\n\n2 METHODS OF DIMENSIONALITY REDUCTION \nWe assume that our regression data originate from a generating process with two sets of \nobservables, the \"inputs\" x̃ and the \"outputs\" ỹ. The characteristics of the process ensure \na functional relation ỹ = f(x̃). Both x̃ and ỹ are obtained through some measurement \ndevice that adds independent mean zero noise of different magnitude in each observable, \nsuch that x = x̃ + ε_x and y = ỹ + ε_y. For the sake of simplicity, we will only focus \non one-dimensional output data (m=1) and functions f that are either linear or \nslightly quadratic, as these cases are the most common in nonlinear function approximation \nwith locally linear models. Locality of the regression is ensured by weighting the error \nof each data point with a weight from a Gaussian kernel: \n\nw_i = exp(-0.5 (x_i - x_q)^T D (x_i - x_q)) \n\n(1) \n\nx_q denotes the query point, and D a positive semi-definite distance metric which determines \nthe size and shape of the neighborhood contributing to the regression (Atkeson et \nal., 1997). The parameters x_q and D can be determined in the framework of nonparametric \nstatistics (Schaal & Atkeson, in press) or parametric maximum likelihood estimations \n(Xu et al., 1995); for the present study they are determined manually since their origin is \nsecondary to the results of this paper. Without loss of generality, all our data sets will set \nx_q to the zero vector, compute the weights, and then translate the input data such that the \nlocally weighted mean, x̄ = Σ w_i x_i / Σ w_i, is zero. The output data is equally translated to \nbe mean zero. Mean zero data is necessary for most of the techniques considered below. 
The \n(translated) input data is summarized in the rows of the matrix X, the corresponding \n(translated) outputs are the elements of the vector y, and the corresponding weights are in \nthe diagonal matrix W. In some cases, we need the joint input and output data, denoted \nas Z = [X y]. \n\n2.1 FACTOR ANALYSIS (LWFA) \n\nFactor analysis (Everitt, 1984) is a technique of dimensionality reduction which is the \nmost appropriate given the generating process of our regression data. It assumes the observed \ndata z was produced by a mean zero independently distributed k-dimensional \nvector of factors v, transformed by the matrix U, and contaminated by mean zero independent \nnoise ε with diagonal covariance matrix Q: \n\nz = U v + ε, where z = [x^T, y]^T and ε = [ε_x^T, ε_y]^T \n\n(2) \n\nIf both v and ε are normally distributed, the parameters Q and U can be obtained iteratively \nby the Expectation-Maximization algorithm (EM) (Rubin & Thayer, 1982). For a \nlinear regression problem, one assumes that z was generated with U = [I, β]^T and v = x̃, \nwhere β denotes the vector of regression coefficients of the linear model y = β^T x, and I \nthe identity matrix. After calculating Q and U by EM in joint data space as formulated in \n(2), an estimate of β can be derived from the conditional probability p(y | x). As all \ndistributions are assumed to be normal, the expected value of y is the mean of this conditional \ndistribution. 
The locally weighted version (LWFA) of β can be obtained together \nwith an estimate of the factors v from the joint weighted covariance matrix Ψ of z and v \n(semicolons separate block rows): \n\nE{ [y; v] | x } = [β^T; B] x = Ψ_21 Ψ_11^{-1} x, \n\nwhere Ψ = Σ_i w_i [z_i^T, v_i^T]^T [z_i^T, v_i^T] / Σ_i w_i = [Q + U U^T, U; U^T, I] \n= [Ψ_11 (= n x n), Ψ_12 (= n x (m + k)); Ψ_21 (= (m + k) x n), Ψ_22 (= (m + k) x (m + k))] \n\n(3) \n\nHere E{.} denotes the expectation operator and B a matrix of coefficients involved in \nestimating the factors v. Note that unless the noise ε is zero, the estimated β is different \nfrom the true β as it tries to average out the noise in the data. \n\n2.2 JOINT-SPACE PRINCIPAL COMPONENT ANALYSIS (LWPCA) \nAn alternative way of determining the parameters β in a reduced space employs locally \nweighted principal component analysis (LWPCA) in the joint data space. By defining the \nlargest k+1 principal components of the weighted covariance matrix of Z as U: \n\nU = [eigenvectors(Σ w_i (z_i - z̄)(z_i - z̄)^T / Σ w_i)]_max(1:k+1) \n\n(4) \n\nand noting that the eigenvectors in U are unit length, the matrix inversion theorem (Horn \n& Johnson, 1994) provides a means to derive an efficient estimate of β: \n\nβ = U_x (U_y^T - U_y^T (U_y U_y^T - I)^{-1} U_y U_y^T), where U = [U_x (= n x k); U_y (= m x k)] \n\n(5) \n\nIn our one dimensional output case, U_y is just a (1 x k)-dimensional row vector and the \nevaluation of (5) does not require a matrix inversion anymore but rather a division. \nIf one assumes normal distributions in all variables as in LWFA, LWPCA is the special \ncase of LWFA where the noise covariance Q is spherical, i.e., the same magnitude of \nnoise in all observables. Under these circumstances, the subspaces spanned by U in both \nmethods will be the same. 
However, the regression coefficients of LWPCA will be different \nfrom those of LWFA unless the noise level is zero, as LWFA optimizes the coefficients \naccording to the noise in the data (Equation (3)). Thus, for normal distributions \nand a correct guess of k, LWPCA is always expected to perform worse than LWFA. \n\n2.3 PARTIAL LEAST SQUARES (LWPLS, LWPLS_1) \nPartial least squares (Wold, 1975; Frank & Friedman, 1993) recursively computes orthogonal \nprojections of the input data and performs single variable regressions along \nthese projections on the residuals of the previous iteration step. A locally weighted version \nof partial least squares (LWPLS) proceeds as shown in Equation (6) below. \nAs all single variable regressions are ordinary univariate least-squares minimizations, LWPLS \nmakes the same statistical assumption as ordinary linear regressions, i.e., that only output variables \nhave additive noise, but input variables are noiseless. The choice of the projections u, however, \nintroduces an element in LWPLS that is still statistically debated (Frank & Friedman, 1993), \nalthough, interestingly, there exists a strong similarity with the way projections are chosen in Cascade \nCorrelation (Fahlman & Lebiere, 1990). A peculiarity of LWPLS is that it also regresses the inputs \nof the previous step against the projected inputs s in order to ensure the orthogonality of all the \nprojections u. Since LWPLS chooses projections in a very powerful way, it can accomplish optimal \nfunction fits with only one single projection (i.e., k=1) for certain input distributions. We will \naddress this issue in our empirical evaluations by comparing k-step LWPLS with 1-step LWPLS, \nabbreviated LWPLS_1. \n\nFor Training: \nInitialize: D_0 = X, e_0 = y \nFor i = 1 to k: ... \n\nFor Lookup: \nInitialize: d_0 = x, y = 0 \nFor i = 1 to k: \ns_i = d_{i-1}^T u_i ... \n\n(6) \n\n2.4 PRINCIPAL COMPONENT REGRESSION (LWPCR) \nAlthough not optimal, a computationally efficient technique of dimensionality reduction \nfor linear regression is principal component regression (LWPCR) (Massy, 1965). The inputs \nare projected onto the largest k principal components of the weighted covariance \nmatrix of the input data by the matrix U: \n\nU = [eigenvectors(Σ w_i (x_i - x̄)(x_i - x̄)^T / Σ w_i)]_max(1:k) \n\n(7) \n\nThe regression coefficients β are thus calculated as: \n\nβ = (U^T X^T W X U)^{-1} U^T X^T W y \n\n(8) \n\nEquation (8) is inexpensive to evaluate since after projecting X with U, U^T X^T W X U becomes \na diagonal matrix that is easy to invert. LWPCR assumes that the inputs have additive \nspherical noise, which includes the zero noise case. As LWPCR does not take the output data \ninto account during dimensionality reduction, it risks clipping \ninput dimensions with low variance which nevertheless contribute importantly to \nthe regression output. However, from a statistical point of view, it is less likely that low \nvariance inputs have significant contribution in a linear regression, as the confidence \nbands of the regression coefficients increase inversely proportionally with the variance of \nthe associated input. If the input data has non-spherical noise, LWPCR is prone to focus \nthe regression on irrelevant projections. \n\n3 MONTE CARLO EVALUATIONS \nIn order to evaluate the candidate methods, data sets with 5 inputs and 1 output were randomly \ngenerated. 
Each data set consisted of 2,000 training points and 10,000 test points, \ndistributed either uniformly or nonuniformly in the unit hypercube. The outputs were \ngenerated by either a linear or quadratic function. Afterwards, the 5-dimensional input \nspace was projected into a 10-dimensional space by a randomly chosen distance preserving \nlinear transformation. Finally, Gaussian noise of various magnitudes was added \nto both the 10-dimensional inputs and one dimensional output. For the test sets, the additive \nnoise in the outputs was omitted. Each regression technique was localized by a \nGaussian kernel (Equation (1)) with a 10-dimensional distance metric D = 10*I (D was \nmanually chosen to ensure that the Gaussian kernel had sufficiently many data points and \nno \"data holes\" in the fringe areas of the kernel). The precise experimental conditions \nfollowed closely those suggested by Frank and Friedman (1993): \n\n\u2022 2 kinds of linear functions y = β_lin^T x for: \ni) β_lin = [1, 1, 1, 1, 1]^T, ii) β_lin = [1, 2, 3, 4, 5]^T \n\n\u2022 2 kinds of quadratic functions y = β_lin^T x + β_quad^T [x_1^2, x_2^2, x_3^2, x_4^2, x_5^2]^T for: \ni) β_lin = [1, 1, 1, 1, 1]^T and β_quad = 0.1 [1, 1, 1, 1, 1]^T, and ii) β_lin = [1, 2, 3, 4, 5]^T and β_quad = 0.1 [1, 4, 9, 16, 25]^T \n\n\u2022 3 kinds of noise conditions, each with 2 sub-conditions: \ni) only output noise: \na) low noise: local signal/noise ratio lsnr=20, and \nb) high noise: lsnr=2, \nii) equal noise in inputs and outputs: \na) low noise: ε_x,n = ε_y = N(0, 0.01^2), n ∈ [1, 2, ..., 10], and \nb) high noise: ε_x,n = ε_y = N(0, 0.1^2), n ∈ [1, 2, ..., 10], \niii) unequal noise in inputs and outputs: \na) low noise: ε_x,n = N(0, (0.01n)^2), n ∈ [1, 2, ..., 10] and lsnr=20, and \nb) high noise: ε_x,n = N(0, (0.01n)^2), n ∈ [1, 2, ..., 10] and lsnr=2, \n\n\u2022 2 kinds of input distributions: i) uniform in unit hyper cube, ii) uniform in unit hyper cube excluding data \npoints which activate a Gaussian weighting function (1) at c = [0.5, 0, 0, 0, 0]^T with D = 10*I more than \nw=0.2 (this forms a \"hyper kidney\" shaped distribution) \n\nEvery algorithm was run* 30 times on each of the 48 combinations of the conditions. \nAdditionally, the complete test was repeated for three further conditions varying the dimensionality \n(called factors in accordance with LWFA) that the algorithms assumed to \nbe the true dimensionality of the 10-dimensional data from k=4 to 6, i.e., too few, correct, \nand too many factors. The average results are summarized in Figure 1. \nFigure 1a,b,c show the summary results of the three factor conditions. Besides averaging \nover the 30 trials per condition, each mean of these charts also averages over the two input \ndistribution conditions and the linear and quadratic function condition, as these four \ncases are frequently observed violations of the statistical assumptions in nonlinear function \napproximation with locally linear models. In Figure 1b the number of factors equals \nthe underlying dimensionality of the problem, and all algorithms are essentially performing \nequally well. For perfectly Gaussian distributions in all random variables (not \nshown separately), LWFA's assumptions are perfectly fulfilled and it achieves the best \nresults, although LWPLS follows so closely as to be almost indistinguishable. For the \"unequal \nnoise condition\", the two PCA based techniques, LWPCA and LWPCR, perform the \nworst since, as expected, they choose suboptimal projections. However, when violating \nthe statistical assumptions, LWFA loses parts of its advantages, such that the summary \nresults become fairly balanced in Figure 1b. 
\nThe quality of function fitting changes significantly when violating the correct number of \nfactors, as illustrated in Figure 1a,c. For too few factors (Figure 1a), LWPCR performs \nworst because it randomly omits one of the principal components in the input data, without \nrespect to how important it is for the regression. The second worst is LWFA: according \nto its assumptions it believes that the signal it cannot model must be noise, leading \nto a degraded estimate of the data's subspace and, consequently, degraded regression \nresults. LWPLS has a clear lead in this test, closely followed by LWPCA and LWPLS_1. \nFor more factors than necessary (Figure 1c), it is now LWPCA which degrades. This \neffect is due to its extracting one very noise contaminated projection which strongly influences \nthe recovery of the regression parameters in Equation (4). All other algorithms \nperform almost equally well, with LWFA and LWPLS taking a small lead. \n\n* Except for LWFA, all methods can evaluate a data set in non-iterative calculations. LWFA was trained with EM for maximally \n1000 iterations or until the log-likelihood increased less than 1.e-10 in one iteration. \n\n[Figure 1 (bar charts, not reproduced): panels a) Regression Results with 4 Factors, c) Regression Results with 6 Factors, d) Summary Results; chart headers: Only Output Noise, Equal Noise in all Inputs and Outputs, Unequal Noise in all Inputs and Outputs; legend: LWFA, LWPCA, LWPCR, LWPLS, LWPLS_1] \n\nFigure 1: Average summary results of Monte Carlo experiments. Each chart is primarily \ndivided into the three major noise conditions, cf. headers in chart (a). In each noise condition, \nthere are four further subdivisions: i) coefficients of linear or quadratic model are \nequal with low added noise; ii) like i) with high added noise; iii) coefficients of linear or \nquadratic model are different with low noise added; iv) like iii) with high added noise. \nRefer to text and descriptions of Monte Carlo studies for further explanations. \n\n4 SUMMARY AND CONCLUSIONS \nFigure 1d summarizes all the Monte Carlo experiments in a final average plot. Except for \nLWPLS, every other technique showed at least one clear weakness in one of our \"robustness\" \ntests. It was particularly an incorrect number of factors which made these weaknesses \napparent. For high-dimensional regression problems, the local dimensionality, i.e., \nthe number of factors, is not a clearly defined number but rather a varying quantity, depending \non the way the generating process operates. Usually, this process does not need \nto generate locally low dimensional distributions; however, it often \"chooses\" to do so, \nfor instance, as human arm movements follow stereotypic patterns even though arbitrary \nones could be generated. Thus, local dimensionality reduction needs to find autonomously \nthe appropriate number of local factors. 
Locally weighted partial least squares turned out \nto be a surprisingly robust technique for this purpose, even outperforming the statistically \nappealing probabilistic factor analysis. As in principal component analysis, LWPLS's \nnumber of factors can easily be controlled just based on a variance-cutoff threshold in input \nspace (Frank & Friedman, 1993), while factor analysis usually requires expensive \ncross-validation techniques. Simple, variance-based control over the number of factors \ncan actually improve the results of LWPCA and LWPCR in practice, since, as shown in \nFigure 1a, LWPCR is more robust towards overestimating the number of factors, while \nLWPCA is more robust towards an underestimation. If one is interested in dynamically \ngrowing the number of factors while obtaining already good regression results with too \nfew factors, LWPCA and, especially, LWPLS seem to be appropriate; it should be \nnoted how well one factor LWPLS (LWPLS_1) already performed in Figure 1! \nIn conclusion, since locally weighted partial least squares was as robust as locally \nweighted factor analysis towards additive noise in both input and output data, and, \nmoreover, superior when mis-guessing the number of factors, it seems to be a most favorable \ntechnique for local dimensionality reduction for high dimensional regressions. \n\nAcknowledgments \nThe authors are grateful to Geoffrey Hinton for reminding them of partial least squares. This work was supported \nby the ATR Human Information Processing Research Laboratories. S. Schaal's support includes the \nGerman Research Association, the Alexander von Humboldt Foundation, and the German Scholarship Foundation. \nS. Vijayakumar was supported by the Japanese Ministry of Education, Science, and Culture (Monbusho). \nC. G. 
Atkeson acknowledges the Air Force Office of Scientific Research grant F49-6209410362 and a National \nScience Foundation Presidential Young Investigators Award. \n\nReferences \n\nAtkeson, C. G., Moore, A. W., & Schaal, S. (1997a). \"Locally weighted learning.\" Artificial Intelligence Review, 11, 1-5, pp.11-73. \nAtkeson, C. G., Moore, A. W., & Schaal, S. (1997c). \"Locally weighted learning for control.\" Artificial Intelligence Review, 11, 1-5, pp.75-113. \nBelsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley. \nEveritt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall. \nFahlman, S. E., & Lebiere, C. (1990). \"The cascade-correlation learning architecture.\" In: Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems II, pp.524-532. Morgan Kaufmann. \nFrank, I. E., & Friedman, J. H. (1993). \"A statistical view of some chemometric regression tools.\" Technometrics, 35, 2, pp.109-135. \nGeman, S., Bienenstock, E., & Doursat, R. (1992). \"Neural networks and the bias/variance dilemma.\" Neural Computation, 4, pp.1-58. \nHorn, R. A., & Johnson, C. R. (1994). Matrix analysis. Press Syndicate of the University of Cambridge. \nJordan, M. I., & Jacobs, R. (1994). \"Hierarchical mixtures of experts and the EM algorithm.\" Neural Computation, 6, 2, pp.181-214. \nMassy, W. F. (1965). \"Principle component regression in exploratory statistical research.\" Journal of the American Statistical Association, 60, pp.234-246. \nRubin, D. B., & Thayer, D. T. (1982). \"EM algorithms for ML factor analysis.\" Psychometrika, 47, 1, 69-76. \nSchaal, S., & Atkeson, C. G. (in press). \"Constructive incremental learning from only local information.\" Neural Computation. \nScott, D. W. (1992). Multivariate Density Estimation. New York: Wiley. \nVijayakumar, S., & Schaal, S. (1997). \"Local dimensionality reduction for locally weighted learning.\" In: International Conference on Computational Intelligence in Robotics and Automation, pp.220-225, Monterey, CA, July 10-11, 1997. \nWold, H. (1975). \"Soft modeling by latent variables: the nonlinear iterative partial least squares approach.\" In: Gani, J. (Ed.), Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett. Acad. Press. \nXu, L., Jordan, M. I., & Hinton, G. E. (1995). \"An alternative model for mixture of experts.\" In: Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems 7, pp.633-640. Cambridge, MA: MIT Press. \n\nSerial Order in Reading Aloud: \nConnectionist Models and Neighborhood Structure \n\nJeanne C. Milostan \nComputer Science & Engineering 0114 \nUniversity of California San Diego \nLa Jolla, CA 92093-0114 \n\nGarrison W. Cottrell \nComputer Science & Engineering 0114 \nUniversity of California San Diego \nLa Jolla, CA 92093-0114 \n\nAbstract \n\nDual-Route and Connectionist Single-Route models of reading have \nbeen at odds over claims as to the correct explanation of the reading \nprocess. Recent Dual-Route models predict that subjects should \nshow an increased naming latency for irregular words when the irregularity \nis earlier in the word (e.g. chef is slower than glow) - a \nprediction that has been confirmed in human experiments. Since \nthis would appear to be an effect of the left-to-right reading process, \nColtheart & Rastle (1994) claim that Single-Route parallel connectionist \nmodels cannot account for it. A refutation of this claim is \npresented here, consisting of network models which do show the \ninteraction, along with orthographic neighborhood statistics that \nexplain the effect. 
\n\n1 Introduction \n\nA major component of the task of learning to read is the development of a mapping \nfrom orthography to phonology. In a complete model of reading, message understanding \nmust play a role, but many psycholinguistic phenomena can be explained \nin the context of this simple mapping task. A difficulty in learning this mapping \nis that in a language such as English, the mapping is quasiregular (Plaut et al., \n1996); there are a wide range of exceptions to the general rules. As with nearly \nall psychological phenomena, more frequent stimuli are processed faster, leading to \nshorter naming latencies. The regularity of mapping interacts with this variable, \na robust finding that is well-explained by connectionist accounts (Seidenberg and \nMcClelland, 1989; Taraban and McClelland, 1987). \nIn this paper we consider a recent effect that seems difficult to account for in terms \nof the standard parallel network models. Coltheart & Rastle (1994) have shown \nthat the amount of delay experienced in naming an exception word is related to the \nphonemic position of the irregularity in pronunciation. Specifically, the earlier the \nexception occurs in the word, the longer the latency to the onset of pronouncing \nthe word. Table 1, adapted from (Coltheart and Rastle, 1994), shows the response \nlatencies to two-syllable words by normal subjects. There is a clear left-to-right \nranking of the latencies compared to controls in the last row of the Table. Coltheart \net al. claim this delay ranking cannot be achieved by standard connectionist models. \nThis paper shows this claim to be false, and shows that the origin of the effect lies in \na statistical regularity of English, related to the number of \"friends\" and \"enemies\" \nof the pronunciation within the word's neighborhood 1. \n\nFiller                         Position of Irregular Phoneme \n                               1      2      3      4      5 \nNonword    Irregular           554    542    530    529    537 \n           Regular Control     502    516    518    523    525 \n           Difference          52     26     12     6      12 \nException  Irregular           545    524    528    526    528 \n           Regular Control     500    503    503    515    524 \n           Difference          45     21     25     11     4 \n           Avg. Diff.          48.5   23.5   18.5   8.5    8 \n\nTable 1: Naming Latency vs. Irregularity Position \n\n2 Background \n\nComputational modeling of the reading task has been approached from a number \nof different perspectives. Advocates of a dual-route model of oral reading claim \nthat two separate routes, one lexical (a lexicon, often hypothesized to be an associative \nnetwork) and one rule-based, are required to account for certain phenomena \nin reaction times and nonword pronunciation seen in human subjects (Coltheart \net al., 1993). Connectionist modelers claim that the same phenomena can be captured \nin a single-route model which learns simply by exposure to a representative \ndataset (Seidenberg and McClelland, 1989). \nIn the Dual-Route Cascade model (DRC) (Coltheart et al., 1993), the lexical route \nis implemented as an Interactive Activation (McClelland and Rumelhart, 1981) \nsystem, while the non-lexical route is implemented by a set of grapheme-phoneme \ncorrespondence (GPC) rules learned from a dataset. Input at the letter identification \nlayer is activated in a left-to-right sequential fashion to simulate the reading \ndirection of English, and fed simultaneously to the two pathways in the model. 
\nActivation from both the GPC route and the lexicon route then begins to interact \nat the output (phoneme) level, starting with the phonemes at the beginning of the \nword. If the GPC and the lexicon agree on pronunciation, the correct phonemes \nwill be activated quickly. For words with irregular pronunciation, the lexicon and \nGPC routes will activate different phonemes: the GPC route will try to activate \nthe regular pronunciation while the lexical route will activate the irregular (correct) \npronunciation. Inhibitory links between alternate phoneme pronunciations will slow \ndown the rise in activation, causing words with inconsistencies to be pronounced \nmore slowly than regular words. This slowing will not occur, however, when an \nirregularity appears late in a word. This is because in the model the lexical node \nspreads activation to all of a word's phonemes as soon as it becomes active. If an \nirregularity is late in a word, the correct pronunciation will begin to be activated \nbefore the GPC route is able to vote against it. Hence late irregularities will not be \nas affected by conflicting information. This result is validated by simulations with \nthe one-syllable DRC model (Coltheart and Rastle, 1994). \n\n1 Friends are words with the same pronunciations for the ambiguous letter-to-sound \ncorrespondence; enemies are words with different pronunciations. \n\nSeveral connectionist systems have been developed to model the orthography to \nphonology process (Seidenberg and McClelland, 1989; Plaut et al., 1996). These \nconnectionist models provide evidence that the task, with accompanying phenomena, \ncan be learned through a single mechanism. In particular, Plaut et al. 
(henceforth PMSP) develop a recurrent network which duplicates the naming latencies \nappropriate to their data set, consisting of approximately 3000 one-syllable English \nwords (monosyllabic words with frequency greater than 1 in the Kucera & \nFrancis corpus (Kucera and Francis, 1967)). Naming latencies are computed based \non time-to-settle for the recurrent network, and based on MSE for a feed-forward \nmodel used in some simulations. In addition to duplicating frequency and regularity \ninteractions displayed in previous human studies, this model also performs appropriately \nin providing pronunciation of pronounceable nonwords. This provides an \nimprovement over, and a validation of, previous work with a strictly feed-forward \nnetwork (Seidenberg and McClelland, 1989). However, to date, no one has shown \nthat Coltheart's interaction between naming latency and irregularity position can be \naccounted for by such a model. Indeed, it is difficult to see how such a model could \naccount for such a phenomenon, as its explanation (at least in the DRC model) \nseems to require the serial, left-to-right nature of processing in the model, whereas \nnetworks such as PMSP present the word orthography all at once. In the following, \nwe fill this gap in the literature, and explain why a parallel, feed-forward model can \naccount for this result. \n\n3  Experiments & Results \n\n3.1  The Data \n\nPronunciations for approximately 100,000 English words were obtained through an \nelectronic dictionary developed by CMU (footnote 2). The provided format was not amenable \nto an automated method for distinguishing the number of syllables in the word. 
To \nobtain syllable counts, English two-syllable words were gathered from the Medical \nResearch Council (MRC) Psycholinguistic Database (Coltheart and Rastle, 1994), \nwhich is conveniently annotated with syllable counts and frequency (only those with \nKucera-Francis written frequency of one or greater were selected). Intersecting the \ntwo databases resulted in 5,924 two-syllable words. There is some noise in the data; \nZONED and AERIAL, for example, are in this database of purported two-syllable \nwords. Due to the size of the database and time limitations, we did not prune the \ndata of these errors, nor did we eliminate proper nouns or foreign words. Single-syllable \nwords with the same frequency criterion were also selected for comparison \nwith previous work. 3,284 unique single-syllable words were obtained, in contrast \nto 2,998 words used by PMSP. Noise similar to that in the two-syllable set exists \nin this database. Each word was represented using the orthography and phonology \nrepresentation scheme outlined by PMSP. \n\nFootnote 2: Available via ftp://ftp.cs.cmu.edu/project/fgdata/dict/ \n\nFigure 1: 1-syllable network latency differences & neighborhood statistics \n\n3.2  Methods \n\nFor the single syllable words, we used a network identical to the feed-forward network \nused by PMSP, i.e., a 105-100-61 network, and for the two syllable words, we \nsimply used the same architecture with each layer size doubled. We trained each \nnetwork for 300 epochs, using batch training with a cross entropy objective function, \nan initial learning rate of 0.001, momentum of 0.9 after the first 10 epochs, weight \ndecay of 0.0001, and delta-bar-delta learning rate adjustment. 
Training exemplars \nwere weighted by the log of frequency as found in the Kucera-Francis corpus. After \nthis training, the single syllable feed-forward networks averaged 98.6% correct \noutputs, using the same evaluation technique outlined in PMSP. Two syllable networks \nwere trained for 1700 epochs using online training, a learning rate of 0.05, \nmomentum of 0.9 after the first 10 epochs, and raw frequency weighting. The two \nsyllable network achieved 85% correct. Naming latency was equated with network \noutput MSE; for successful results, the error difference between the irregular words \nand associated control words should decrease with irregularity position. \n\n3.3  Results \n\nSingle Syllable Words  First, Coltheart's challenge that a single-route model \ncannot produce the latency effects was explored. The single-syllable network described \nabove was tested on the collection of single-syllable words identified as irregular \nby Taraban and McClelland (1987). In Coltheart and Rastle (1994), control \nwords are selected based on equal number of letters, same beginning phoneme, and \nKucera-Francis frequency between 1 and 20 (controls were not frequency matched). \nFor single syllable words used here, the control condition was modified to allow \nfrequency from 1 to 70, which is the range of the \"low frequency\" exception words \nin the Taraban & McClelland set. Controls were chosen by drawing randomly from \nthe words meeting the control criteria. \nEach test and control word input vector was presented to the network, and the \nMSE at the output layer (compared to the expected correct target) was calculated. \nFrom these values, the differences in MSE for target and matched control words \nwere calculated and are shown in Figure 1. 
Note that words with an irregularity \nin the first phoneme position have the largest difference from their control words, \nwith this (exception - regular control) difference decreasing as phoneme position \nincreases. Contrary to the claims of Dual-Route proponents, this network does show \nthe desired rank-ordering of MSE/latency. \n\nFigure 2: 2-syllable network latency differences & neighborhood statistics (horizontal axis: phoneme irregularity position) \n\nTwo Syllable Words  Testing of the two-syllable network is identical to that \nof the one-syllable network. The difference in MSE for each test word and its \ncorresponding control is calculated, averaging across all test pairs in the position \nset. Both test words and their controls are those found in Coltheart and Rastle \n(1994). The 2-syllable network appears to produce approximately the correct linear \ntrend in the naming MSE/latency (Figure 2), although the results displayed are not \nmonotonically decreasing with position. Note, however, that the results presented \nby Coltheart, when taken separately, also fail to exhibit this trend (Table 1). For \ncorrect analysis, several \"subject\" networks should be trained, with formal linear \ntrend analysis then performed with the resulting data. These further simulations \nare currently being undertaken. \n\n4  Why the network works: Neighborhood effects \n\nA possible explanation for these results relies on the fact that connectionist networks \ntend to extract statistical regularities in the data, and are affected by \nregularity-by-frequency interactions. 
In this case, we decided to explore the hypothesis that \nthe results could be explained by a neighborhood effect: perhaps the number of \n\"friends\" and \"enemies\" in the neighborhood (in a sense to be defined below) of \nthe exception word varies in English in a position-dependent way. If there are more \nenemies (different pronunciations) than friends (identical pronunciations) when the \nexception occurs at the beginning of a word than at the end, then one would expect \na network to reflect this statistical regularity in its output errors. In particular, one \nwould expect higher errors (and therefore longer latencies in naming) if the word \nhas a higher proportion of enemies in the neighborhood. \nTo test this hypothesis, we created some data search engines to collect word neighborhoods \nbased on various criteria. There is no consensus on the exact definition \nof the \"neighborhood\" of a word. There are some common measures, however, so \nwe explored several of these. Taraban & McClelland (1987) neighborhoods (T&M) \nare defined as words containing the same vowel grouping and final consonant cluster. \nThese neighborhoods therefore tend to consist of words that rhyme (MUST, \nDUST, TRUST). There is independent evidence that these word-body neighbors \nare psychologically relevant for word naming tasks (i.e., pronunciation) (Treiman \nand Chafetz, 1987). The neighborhood measure given by Coltheart (Coltheart and \nRastle, 1994), N, counts same-length words which differ by only one letter, taking \nstring position into account. Finally, edit-distance-1 (ED1) neighborhoods are \nthose words which can be generated from the target word by making one change \n(Peereman, 1995): either a letter substitution, insertion or deletion. 
This differs \nfrom the Coltheart N definition in that \"TRUST\" is in the ED1 neighborhood (but \nnot the N neighborhood) of \"RUST\", and provides a neighborhood measure which \nconsiders both pronunciation and spelling similarity. However, the N and the ED1 \nmeasures have not been shown to be psychologically real in terms of affecting naming \nlatency (Treiman and Chafetz, 1987). \nWe therefore extended T&M neighborhoods to multi-syllable words. Each vowel \ngroup is considered within the context of its rime, with each syllable considered \nseparately. Consonant neighborhoods consist of orthographic clusters which correspond \nto the same location in the word. This results in 4 consonant cluster \nlocations: first syllable onset, first syllable coda, second syllable onset, and second \nsyllable coda. Consonant cluster neighborhoods include the preceding vowel for \ncoda consonants, and the following vowel for onset consonants. \n\nThe notion of exception words is also not universally agreed upon. Precisely which \nwords are exceptions is a function of the working definition of pronunciation and \nregularity for the experiment at hand. Given a definition of neighborhood, then, \nexception words can be defined as those words which do not agree with the phonological \nmapping favored by the majority of items in that particular neighborhood. \nAlternatively, in cases assuming a set of rules for grapheme-phoneme correspondence, \nexception words are those which violate the rules which define the majority \nof pronunciations. For this investigation, single syllable exception words are those \ndefined as exceptions by the T&M neighborhood definition. For instance, PINT \nwould be considered an exception word compared to its neighbors MINT, TINT, \nHINT, etc. 
Coltheart, on the other hand, defines exception words to be those for \nwhich his GPC rules produce incorrect pronunciation. Since we are concerned with \naddressing Coltheart's claims, these 2-syllable exception words will also be used \nhere. \n\n4.1  Results \n\nSingle syllable words  For each phoneme position, we compare each word with \nirregularity at that position with its neighbors, counting the number of enemies \n(words with alternate pronunciation at the supposed irregularity) and friends (words \nwith pronunciation in agreement) that it has. The T&M neighborhood numbers \n(words containing the same vowel grouping and final consonant cluster) used in Figure \n1 are found in (Taraban and McClelland, 1987). For each word, we calculate its \n(enemy)/(friend+enemy) ratio; these ratios are then averaged over all the words in \nthe position set. The results using neighborhoods as defined in Taraban & McClelland \nclearly show the desired rank ordering of effect. First-position-irregularity \nwords have more \"enemies\" and fewer \"friends\" than third-position-irregularity \nwords, with the second-position words falling in the middle as desired. We suggest \nthat this statistical regularity in the data is what the above networks capture. \nHowever convincing these results may be, they do not fully address Coltheart's \ndata, which is for two syllable words of five phonemes or phoneme clusters, with \nirregularities at each of five possible positions. Also, due to the size of the T&M \ndata set, there are only 2 members in the position 1 set, and the single-syllable data \nonly goes up to phoneme position 3. The neighborhoods for the two-syllable data \nset were thus examined. 
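
The (enemy)/(friend+enemy) ratio and its average over each position set can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the search engines used for the reported results: the function names, the input layout, and the toy data are all hypothetical, and the sketch assumes the target word itself is not counted among its own neighbors and that words with no neighbors are left out of the average.

```python
from collections import defaultdict


def enemy_ratio(target_pron, neighbor_prons):
    # Enemies pronounce the ambiguous spelling differently from the target;
    # friends pronounce it the same way.
    enemies = sum(1 for p in neighbor_prons if p != target_pron)
    friends = len(neighbor_prons) - enemies
    total = friends + enemies
    # Assumption: a word with no neighbors (a loner) has no defined ratio.
    return enemies / total if total else None


def mean_ratio_by_position(words):
    # words: iterable of (irregularity_position, target_pron, neighbor_prons).
    ratios = defaultdict(list)
    for position, target_pron, neighbor_prons in words:
        ratio = enemy_ratio(target_pron, neighbor_prons)
        if ratio is not None:
            ratios[position].append(ratio)
    # Average the per-word ratios within each position set.
    return {pos: sum(rs) / len(rs) for pos, rs in sorted(ratios.items())}


# Toy data invented for illustration only; these are not values from the paper.
toy_words = [
    (1, 'A', ['B', 'B', 'B', 'A']),  # 3 enemies, 1 friend
    (1, 'A', ['B', 'A']),            # 1 enemy, 1 friend
    (3, 'A', ['A', 'A', 'A', 'B']),  # 1 enemy, 3 friends
    (3, 'A', []),                    # loner, skipped
]
print(mean_ratio_by_position(toy_words))
```

A higher averaged ratio for a position set means that words with an irregularity at that position receive less support from their neighbors, which is the quantity the rank ordering is read from.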
\n\nTwo syllable results  Recall that the two-syllable test words are those used in \nthe (Coltheart and Rastle, 1994) subject study, for which naming latency differences \nare shown in Table 1. Coltheart's 1-letter-different neighborhood definition \nis not very informative in this case, since by this criterion most of the target words \nprovided in (Coltheart and Rastle, 1994) are loners (i.e., have no neighbors at \nall). However, using a neighborhood based on T&M-2 recreates the desired ranking \n(Figure 2) as indicated by the ratio of hindering pronunciations to the total of the \nhelping and hindering pronunciations. As with the single syllable words, each test \nword is compared with its neighbor words and the (enemy)/(friend+enemy) ratio is \ncalculated. Averaging over the words in each position set, we again see that words \nwith early irregularities are at a support disadvantage compared to words with late \nirregularities. \n\n5  Summary \n\nDual-Route models claim the irregularity position effect can only be accounted \nfor by two-route models with left-to-right activation of phonemes, and interaction \nbetween GPC rules and the lexicon. The work presented in this paper refutes this \nclaim by presenting results from feed-forward connectionist networks which show the \nsame rank ordering of latency. Further, an analysis of orthographic neighborhoods \nshows why the networks can do this: the effect is based on a statistical interaction \nbetween friend/enemy support and position. Words with irregular orthographic-phonemic \ncorrespondence at word beginning have less support from their neighbors \nthan words with later irregularities; it is this difference which explains the latency \nresults. 
The resulting statistical regularity is then easily captured by connectionist \nnetworks exposed to representative data sets. \n\nReferences \n\nColtheart, M., Curtis, B., Atkins, P., and Haller, M. (1993). Models of reading \naloud: Dual-route and parallel-distributed-processing approaches. Psychological \nReview, 100(4):589-608. \n\nColtheart, M. and Rastle, K. (1994). Serial processing in reading aloud: Evidence \nfor dual route models of reading. Journal of Experimental Psychology: Human \nPerception and Performance, 20(6):1197-1211. \n\nKucera, H. and Francis, W. (1967). Computational Analysis of Present-Day American \nEnglish. Brown University Press, Providence, RI. \n\nMcClelland, J. and Rumelhart, D. (1981). An interactive activation model of context \neffects in letter perception: Part 1. An account of basic findings. Psychological \nReview, 88:375-407. \n\nPeereman, R. (1995). Naming regular and exception words: Further examination \nof the effect of phonological dissension among lexical neighbours. European \nJournal of Cognitive Psychology, 7(3):307-330. \n\nPlaut, D., McClelland, J., Seidenberg, M., and Patterson, K. (1996). Understanding \nnormal and impaired word reading: Computational principles in quasi-regular \ndomains. Psychological Review, 103(1):56-115. \n\nSeidenberg, M. and McClelland, J. (1989). A distributed, developmental model of \nword recognition and naming. Psychological Review, 96:523-568. \n\nTaraban, R. and McClelland, J. (1987). Conspiracy effects in word pronunciation. \nJournal of Memory and Language, 26:608-631. \n\nTreiman, R. and Chafetz, J. (1987). Are there onset- and rime-like units in printed \nwords? In Coltheart, M., editor, Attention and Performance XII: The Psychology \nof Reading. Erlbaum, Hillsdale, NJ. 
\n\n\f", "award": [], "sourceid": 1389, "authors": [{"given_name": "Jeanne", "family_name": "Milostan", "institution": null}, {"given_name": "Garrison", "family_name": "Cottrell", "institution": null}]}