{"title": "Learning with Noise and Regularizers in Multilayer Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 260, "page_last": 266, "abstract": null, "full_text": "Learning with Noise and Regularizers \n\n\u2022 In \n\nMultilayer Neural Networks \n\nDavid Saad \n\nDept.  of Compo  Sci.  & App.  Math. \n\nAston  University \n\nBirmingham B4  7ET,  UK \n\nD .Saad@aston.ac.uk \n\nAbstract \n\nSara A.  Solla \n\nAT &T Research  Labs \n\nHolmdel, NJ  07733,  USA \nsolla@research .at t .com \n\nWe  study  the  effect  of  noise  and  regularization  in  an  on-line \ngradient-descent  learning scenario for  a  general  two-layer student \nnetwork  with  an  arbitrary  number  of hidden  units.  Training ex(cid:173)\namples  are  randomly  drawn  input  vectors  labeled  by  a  two-layer \nteacher  network  with an arbitrary number of hidden  units; the ex(cid:173)\namples are corrupted  by Gaussian noise  affecting either the output \nor  the  model itself.  We  examine the  effect  of both  types  of noise \nand  that  of  weight-decay  regularization  on  the  dynamical evolu(cid:173)\ntion of the order parameters and the generalization error in various \nphases  of the learning process. \n\n1 \n\nIntroduction \n\nOne  of the  most  powerful  and  commonly  used  methods for  training  large  layered \nneural networks is that of on-line learning, whereby the internal network parameters \n{J} are modified after  the  presentation of each  training example so  as to minimize \nthe  corresponding  error.  The  goal  is  to  bring  the  map  fJ  implemented  by  the \nnetwork  as  close  as  possible  to a  desired  map j  that generates  the examples.  Here \nwe focus on the learning of continuous maps via gradient descent  on a differentiable \nerror function. \n\nRecent work  [1]-[4]  has provided a powerful tool for  the analysis of gradient-descent \nlearning in  a  very  general  learning scenario  [5]:  that of a  student network  with  N \ninput units, I<  hidden units, and a single linear output unit, trained to implement a \ncontinuous map from an  N-dimensional input space e onto a scalar (.  Examples of \nthe target task  j  are  in  the  form of input-output  pairs (e', (1').  The output labels \n(JAto independently  drawn inputs e'  are  provided  by  a  teacher network of similar \n\n\fLearning with Noise and Regularizers in Multilayer Neural Networks \n\n261 \n\narchitecture,  except  that its  number  M  of hidden  units  is  not  necessarily  equal  to \nK . \n\nHere  we  consider  the  possibility  of  a  noise  process  pI-'  that  corrupts  the  teacher \noutput.  Learning  from  corrupt  examples  is  a  realistic  and  frequently  encountered \nscenario.  Previous  analysis  of this  case  have  been  based  on  various  approaches: \nBayesian  [6],  equilibrium statistical physics  [7],  and  nonequilibrium techniques  for \nanalyzing  learning  dynamics  [8].  Here  we  adapt  our  previously  formulated  tech(cid:173)\nniques  [2]  to investigate  the effect  of different  noise  mechanisms on  the  dynamical \nevolution of the learning process  and  the  resulting generalization ability. \n\n2  The model \n\nWe  focus  on  a  soft  committee  machine  [1],  for  which  all  hidden-to-output  weights \nare  positive  and  of unit  strength.  
Consider the student network: hidden unit i receives information from input unit r through the weight J_ir, and its activation under presentation of an input pattern ξ = (ξ_1, ..., ξ_N) is x_i = J_i · ξ, with J_i = (J_i1, ..., J_iN) defined as the vector of incoming weights onto the i-th hidden unit. The output of the student network is σ(J, ξ) = Σ_{i=1}^{K} g(J_i · ξ), where g is the activation function of the hidden units, taken here to be the error function g(x) ≡ erf(x/√2), and J ≡ {J_i}_{1≤i≤K} is the set of input-to-hidden adaptive weights.

The components of the input vectors ξ^μ are uncorrelated random variables with zero mean and unit variance. Output labels ζ^μ are provided by a teacher network of similar architecture: hidden unit n in the teacher network receives input information through the weight vector B_n = (B_n1, ..., B_nN), and its activation under presentation of the input pattern ξ^μ is y_n^μ = B_n · ξ^μ. In the noiseless case the teacher output is given by ζ^μ = Σ_{n=1}^{M} g(B_n · ξ^μ). Here we concentrate on the architecturally matched case M = K, and consider two types of Gaussian noise: additive output noise that results in ζ^μ = ρ^μ + Σ_{n=1}^{M} g(B_n · ξ^μ), and model noise introduced as fluctuations in the activations y_n^μ of the hidden units, ζ^μ = Σ_{n=1}^{M} g(ρ_n^μ + B_n · ξ^μ). The random variables ρ^μ and ρ_n^μ are taken to be Gaussian with zero mean and variance σ².

The error made by a student with weights J on a given input ξ is given by the quadratic deviation

    ε(J, ξ) ≡ (1/2) [σ(J, ξ) − Σ_{n=1}^{M} g(B_n · ξ)]²,    (1)

measured with respect to the noiseless teacher (it is also possible to measure performance as deviations with respect to the actual output ζ provided by the noisy teacher). Performance on a typical input defines the generalization error ε_g(J) ≡ ⟨ε(J, ξ)⟩_ξ through an average over all possible input vectors ξ, to be performed implicitly through averages over the activations x = (x_1, ..., x_K) and y = (y_1, ..., y_K). These averages can be performed analytically [2] and result in a compact expression for ε_g in terms of order parameters: Q_ik ≡ J_i · J_k, R_in ≡ J_i · B_n, and T_nm ≡ B_n · B_m, which represent student-student, student-teacher, and teacher-teacher overlaps, respectively. The parameters T_nm are characteristic of the task to be learned and remain fixed during training, while the overlaps Q_ik among student hidden units and R_in between a student and a teacher hidden unit are determined by the student weights J and evolve during training.

A gradient descent rule on the error made with respect to the actual output provided by the noisy teacher results in J_i^{μ+1} = J_i^μ + (η/N) δ_i^μ ξ^μ for the update of the student weights, where the learning rate η has been scaled with the input size N, and δ_i^μ depends on the type of noise. The time evolution of the overlaps R_in and Q_ik can be written in terms of similar difference equations. We consider the large N limit, and introduce a normalized number of examples α = μ/N to be interpreted as a continuous time variable in the N → ∞ limit. The time evolution of R_in and Q_ik is thus described in terms of first-order differential equations.
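As a concrete illustration of this model and update rule (not part of the original analysis), the following is a minimal Python simulation sketch, assuming numpy and scipy are available; the input dimension, number of updates, and initialization details are illustrative choices rather than values prescribed by the text:

    import numpy as np
    from scipy.special import erf

    rng = np.random.default_rng(0)
    N, K, M = 100, 3, 3          # input dimension; student and teacher hidden units (M = K)
    eta, sigma2 = 0.2, 0.3       # learning rate and noise variance, as in Figure 1

    def g(x):                    # hidden-unit activation g(x) = erf(x / sqrt(2))
        return erf(x / np.sqrt(2.0))

    def g_prime(x):              # its derivative, sqrt(2/pi) exp(-x^2 / 2)
        return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x**2)

    # Isotropic teacher: random vectors with B_n . B_n ~ 1 and B_n . B_m ~ 0 for n != m.
    B = rng.standard_normal((M, N)) / np.sqrt(N)

    # Student initialization: random directions with norms Q_ii drawn uniformly from [0, 0.5].
    J = rng.standard_normal((K, N))
    J *= np.sqrt(rng.uniform(0.0, 0.5, K))[:, None] / np.linalg.norm(J, axis=1)[:, None]

    for step in range(6000 * N):                  # alpha = step / N is the time variable
        xi = rng.standard_normal(N)               # input: zero mean, unit variance
        zeta = g(B @ xi).sum() + np.sqrt(sigma2) * rng.standard_normal()  # output noise
        # model-noise variant (Section 4) instead:
        # zeta = g(B @ xi + np.sqrt(sigma2) * rng.standard_normal(M)).sum()
        x = J @ xi
        delta = (zeta - g(x).sum()) * g_prime(x)  # delta_i = [zeta - sigma(J, xi)] g'(x_i)
        J += (eta / N) * np.outer(delta, xi)      # on-line gradient-descent update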
3 Output noise

The resulting equations of motion for the student-teacher and student-student overlaps are given in this case by:

    dR_in/dα = η ⟨δ_i y_n⟩,
    dQ_ik/dα = η ⟨δ_i x_k⟩ + η ⟨δ_k x_i⟩ + η² ⟨δ_i δ_k⟩ + η² σ² ⟨g′(x_i) g′(x_k)⟩,    (2)

where δ_i is evaluated with respect to the noiseless teacher output and each term is to be averaged over all possible ways in which an example ξ could be chosen at a given time step. These averages have been performed using the techniques developed for the investigation of the noiseless case [2]; the only difference due to the presence of additive output noise is the need to evaluate the fourth term in the equation of motion for Q_ik, proportional to both η² and σ².

We focus on isotropic uncorrelated teacher vectors: T_nm = T δ_nm, and choose T = 1 in our numerical examples. The time evolution of the overlaps R_in and Q_ik follows from integrating the equations of motion (2) from initial conditions determined by a random initialization of the student vectors {J_i}_{1≤i≤K}. Random initial norms Q_ii for the student vectors are taken here from a uniform distribution in the [0, 0.5] interval. Overlaps Q_ik between independently chosen student vectors J_i and J_k, or R_in between J_i and an unknown teacher vector B_n, are small numbers of order 1/√N for N ≫ K, and taken here from a uniform distribution in the [0, 10⁻¹²] interval.

We show in Figures 1.a and 1.b the evolution of the overlaps for a noise variance σ² = 0.3 and learning rate η = 0.2. The example corresponds to M = K = 3. The qualitative behavior is similar to the one observed for M = K in the noiseless case extensively analyzed in [2]. A very short transient is followed by a long plateau characterized by a lack of differentiation among student vectors: all student vectors have the same norm Q_ii = Q, the overlap between any two different student vectors takes a unique value Q_ik = C for i ≠ k, and the overlap R_in between an arbitrary student vector i and a teacher vector n is independent of i (as student vectors are indistinguishable in this regime) and of n (as the teacher is isotropic), resulting in R_in = R. This phase is characterized by an unstable symmetric solution; the perturbation introduced through the nonsymmetric initialization of the norms Q_ii and overlaps R_in eventually takes over in a transition that signals the onset of specialization.

This process is driven by a breaking of the uniform symmetry of the matrix of student-teacher overlaps: each student vector acquires an increasingly dominant overlap R with a specific teacher vector which it begins to imitate, and a gradually decreasing secondary overlap S with the remaining teacher vectors. In the example of Figure 1.b the assignment corresponds to i = 1 → n = 1, i = 2 → n = 3, and i = 3 → n = 2. A relabeling of the student hidden units allows us to identify R with the diagonal elements and S with the off-diagonal elements of the matrix of student-teacher overlaps.
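In terms of the simulation sketch of Section 2 (again an illustration rather than part of the original analysis, reusing the J, B, K, and np names defined there), the order parameters and the symmetric-phase structure can be read off directly from the weight matrices:

    Q = J @ J.T                  # student-student overlaps Q_ik = J_i . J_k
    R = J @ B.T                  # student-teacher overlaps R_in = J_i . B_n
    T = B @ B.T                  # teacher-teacher overlaps, fixed during training

    off = ~np.eye(K, dtype=bool)
    Q_mean, C_mean = Q.diagonal().mean(), Q[off].mean()  # symmetric phase: Q_ii ~ Q, Q_ik ~ C
    R_spread = R.max(axis=1) - R.min(axis=1)             # ~0 in the symmetric phase
    assignment = R.argmax(axis=1)                        # student-to-teacher map after specialization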
Q\" \n\nI,f \n--=-~~--------j:/ \n\nr---\n\n(b) \n\n1.2-.-~'--------------, \n\n1.0 \n\n0.8 \n\nr:l 0.6 \n\n0.4 \n\nRII  - - RI2  -.---- RI3 \nR\"  ------ Ru  ... ---- R\" \nR\"  -.- - - R\"  -----.  R\" \n\n1.5 \n\n1.0 \n\n~ \n\nCI \n\n0.5 \n\n:--... \n\n0.0 \n\n~ \n\n0.2 \n\n0.0 \n\n0 \n\n2000 \n\n4000 \n\n6000 \n\no \n\n2000 \n\n4000 \n\n6000 \n\n(c) \n\n0.03 \n\n0.025 \n\n----------------------~ \n- -.- - - - -- - - - - - - - - .... \n\ntil 0.02 \nW 0.015 \n\n0.0\\ \n\n0.005 \n\n- - u~o 1 \n17.02 \n\nl \n\nl \n\n17.0  a \n\n, , , , , \\ \n\"  \\ ,  \\ \n\\ \\ ,  \\ \n\"  \\ , \\ \n\n\\ \n\nO.o-+==::::;:::===r~-.--'-=:::::::::=F====? \n3000 \n\n1500 \n\n2500 \n\n2000 \n\no \n\n500 \n\n1000 \n\n\\0,,:::.. __ - =::\"_-:.=.:\"-::..=.:\" \n\n(d) \n\nr--\n\n0.03-\n\n0.025-\n\ntil 0.02 \nW 0.015 \n\n0.01 \n\n0.005-\n\nZoo \nZos \n- - Z007 \n-------- Zos \n\n\\ \n~~:.:.:.:.:::---:.::::.:.:.:--------: --.: - ------::: --\n\n0.0..L-.-1.::..:.::::=:::::::::~~1~~~~~1~~;d \n\n4*10' \n\n6*10' \n\n8*103 \n\nFigure  1:  Dependence  of the overlaps  and  the  generalization  error  on  the  normal(cid:173)\nized  number of examples  a  for  a  three-node  student  learning corrupted  examples \ngenerated  by an isotropic three-node  teacher.  (a) student-student overlaps Qik  and \n(b)  student-teacher overlaps  Rin  for  172  = 0.3.  The generalization error is  shown in \n(c)  for  different  values  of the  noise  variance  172 ,  and  in  (d)  for  different  powers  of \nthe  polynomial learning rate  decay,  focusing  on a  > 0'0  ( asymptotic regime). \n\nAsymptotically the secondary overlaps S decay to zero,  while Rin  -+ -ICJii indicates \nfull  alignment for  Tnn  = L  As  specialization  proceeds,  the  student  weight  vectors \ngrow  in  length  and  become  increasingly  uncorrelated.  It is  interesting  to  observe \nthat  in  the  presence  of noise  the  student  vectors  grow  asymptotically longer  than \nthe teacher  vectors:  Qii  -+ Qoo  >  1,  and  acquire  a  small negative correlation  with \neach  other.  Another  detectable  difference  in  the  presence  of noise  is  a  larger  gap \nbetween  the  values  of  Q  and  C  in  the  symmetric  phase.  Larger  norms  for  the \nstudent  vectors  result  in  larger  generalization  errors:  as  shown  in  Figure  I.c,  the \ngeneralization error increases monotonically with increasing noise level,  both in  the \nsymmetric and  asymptotic regimes. \n\nFor an isotropic teacher,  the teacher-student  and student-student overlaps can thus \nbe fully characterized by four parameters:  Qik = QCik +C(I- Cik)  and R;n  = RCin + \nS(I-Cin).  In the symmetric phase the additional constraint R =  S reflects  the lack \nof differentiation  among student  vectors  and reduces  the  number of parameters to \nthree. \n\nThe  symmetric  phase  is  characterized  by  a  fixed  point  solution  to  the  equations \n\n\f264 \n\nD. Saad and S.  A. Solfa \n\nof motion  (2)  whose  coordinates  can  be  obtained  analytically  in  the  small  noise \napproximation:  R*  = I/JK(2K -1) + 1/  0'2  r8  , Q*  = 1/(2K -1) + 1/  0'2  q8  , and \nC* =  1/(2K -1) + 1/  0'2  C8, with  r 8, q8,  and  C8  given  by relatively simple functions \nof K.  The generalization error  in this regime is  given  by: \n\n*  K  (7r \n0'21/  (2K - 1 ?/2 \nfg  =  -;  6' - Ii arCSIn  2K  +  27r  (2K + 1)1/2  ; \n\n.  (  1  )) \n\n, \n\n(3) \n\nnote its increase  over  the  corresponding  noiseless value,  recovered  for  0'2  =  O. 
\nThe asymptotic phase is characterized  by a fixed  point solution with  R*  =j:.  S*.  The \ncoordinates  of the  asymptotic fixed  point  can  also  be  obtained  analytically in  the \nsmall noise  approximation:  R*  = 1 + 1/  0'2  r a ,  S*  = -1/  0'2  Sa,  Q*  = 1 + 1/  0'2  qa, \nand C*  =  -1/  0'2  Ca ,  with  r a,  Sa,  qa,  and  Ca  given  by  rational functions  of K  with \ncorrections of order  1/.  The asymptotic generalization error is  given by \n\n* \nf g  =  67r  1/  0'  .Ii  . \n\n2  T.' \n\n.J3 \n\n(4) \n\nExplicit expressions for  the coefficients  r 8, q8' C8 ,  r a, Sa, qa,  and  Ca  will  not  be  given \nhere for  lack of space;  suffice it to say that the fixed  point coordinates predicted on \nthe  basis  of the  small  noise  approximation are  found  to be  in excellent  agreement \nwith the values obtained from the numerical integration of the equations of motion \nfor  0'2  ~ 0.3. \n\nIt is  worth  noting  in  Figure  I.c  that  in  the  small  noise  regime  the  length  of the \nsymmetric plateau decreases  with  increasing  noise.  This effect  can  be investigated \nanalytically by linearizing the equations of motion around the symmetric fixed  point \nand identifying the positive eigenvalue responsible for the escape from the symmetric \nphase.  This  calculation  has  been  carried  out in  the  small noise  approximation, to \nobtain  A =  (2/7r)K(2K  - 1)-1/2(2K + 1)-3/2 + Au 0'21/,  where  Au  is  positive  and \nincreases  monotonically  with  K  for  K  >  1.  A  faster  escape  from  the  symmetric \nplateau  is  explained  by  this  increase  of the  positive  eigenvalue.  The  calculation \nis  valid  for  0'21/  ~ 1;  we  observe  experimentally  that  the  trend  is  reversed  as  0'2 \nincreases.  A  small  level  of noise  assists  in  the  process  of  differentiation  among \nstudent  vectors,  while  larger  levels  of noise  tend  to  keep  student  vectors  equally \nignorant about the task to be  learned. \n\nThe asymptotic value (4) for the generalization error indicates that learning at finite \n1/  will  result  in  asymptotically suboptimal performance  for  0'2  >  O.  A  monotonic \ndecrease ofthe learning rate is necessary to achieve optimal asymptotic performance \nwith f; = O.  Learning  at  small 1/  results  in  long  trapping  times in  the  symmetric \nphase;  we therefore suggest starting the training process with a relatively large value \nof 1/  and switching to a decaying learning rate at 0' =  0'0,  after specialization begins. \nWe  propose  1/  = 1/0  for  0'  ~ 0'0  and  1/  = 1/0/(0'  - O'oy  for  0'  >  0'0 .  Convergence \nto  the  asymptotic  solution  requires  z  ~ 1.  The  value  z  =  1  corresponds  to  the \nfastest  decay for  1/(0');  the question  of interest  is  to determine the value of z  which \nresults  in  fastest  decay  for  fg(O').  Results  shown  in  Figure  l.d for  0'  >  0'0  =  4000 \ncorrespond  to  M  = K  = 3,  1/0  = 0.7,  and  0'2  = 0.1.  Our numerical results  indicate \noptimal decay  of fg(O')  for  z  =  1/2.  A  rigorous justification of this  result  remains \nto  be found. \n\n4  Model  noise \n\nThe resulting equations of motion for the student-teacher and student-student over(cid:173)\nlaps can also be obtained analytically in this case;  they exhibit a  structure  remark-\n\n\fLearning with Noise and Regularizers in Multilayer Neural Networks \n\n265 \n\n0.06-\n\n, \n------- -- ---.-.---\u00b7,1 \n1-------.. 
4 Model noise

The resulting equations of motion for the student-teacher and student-student overlaps can also be obtained analytically in this case; they exhibit a structure remarkably similar to those for the noiseless case reported in [2], except for some changes in the relevant covariance matrices.

[Figure 2 about here.]

Figure 2: Left: the generalization error for different values of the noise variance σ²; training examples are corrupted by model noise. Right: γ̃_max as a function of K.

A numerical investigation of the dynamical evolution of the overlaps and generalization error reveals qualitative and quantitative differences with the case of additive output noise: 1) The sensitivity to noise is much higher for model noise than for output noise. 2) The application of independent noise to the individual teacher hidden units results in an effective anisotropic teacher and causes fluctuations in the symmetric phase; the various student hidden units acquire some degree of differentiation, and the symmetric phase can no longer be fully characterized by unique values of Q and C. 3) The noise level does not affect the length of the symmetric phase.

The effect of model noise on the generalization error is illustrated in Figure 2 for M = K = 3, η = 0.2, and various noise levels. The generalization error increases monotonically with increasing noise level, both in the symmetric and asymptotic regimes, but there is no modification in the length of the symmetric phase. The dynamical evolution of the overlaps, not shown here for the case of model noise, exhibits qualitative features quite similar to those discussed in the case of additive output noise: we observe again a noise-induced widening of the gap between Q and C in the symmetric phase, while the asymptotic phase exhibits an enhancement of the norm of the student vectors and a small degree of negative correlation between them.

Approximate analytic expressions based on a small-noise expansion have been obtained for the coordinates of the fixed-point solutions which describe the symmetric and asymptotic phases. In the case of model noise the expansions for the symmetric solution are independent of η and depend only on σ² and K. The coordinates of the asymptotic fixed point can be expressed as: R* = 1 + σ² r_a, S* = −σ² s_a, Q* = 1 + σ² q_a, and C* = −σ² c_a, with coefficients r_a, s_a, q_a, and c_a given by rational functions of K with corrections of order η. The important difference with the output-noise case is that the asymptotic fixed point is shifted from its noiseless position even for η = 0. It is therefore not possible to achieve optimal asymptotic performance even if a decaying learning rate is utilized. The asymptotic generalization error is given by

    ε_g* = (√3 / 12π) σ² K + η σ² K c_a(K, η).    (5)

Note that the asymptotic generalization error remains finite even as η → 0.

5 Regularizers

A method frequently used in real-world training scenarios to overcome the effects of noise and parameter redundancy (K > M) is the use of regularizers such as weight decay (for a review see [6]).

Weight-decay regularization is easily incorporated within the framework of on-line learning; it leads to a rule for the update of the student weights of the form J_i^{μ+1} = J_i^μ + (η/N) δ_i^μ ξ^μ − (γ/N) J_i^μ. The corresponding equations of motion for the dynamical evolution of the teacher-student and student-student overlaps can again be obtained analytically and integrated numerically from random initial conditions.
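In the simulation sketch of Section 2, this regularized update is a one-line change (the value of γ here is illustrative; per the analysis below, only γ < γ_max = η γ̃_max leaves the symmetric fixed point unstable):

    gamma = 0.01                                   # weight-decay strength (illustrative)
    # ... inside the training loop of the earlier sketch, replace the update with:
    J += (eta / N) * np.outer(delta, xi) - (gamma / N) * J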
\n\n5  Regularlzers \n\nA  method frequently  used  in real  world training scenarios to overcome the effects  of \nnoise and  parameter redundancy  (1<  > M)  is  the  use  of regularizers such  as  weight \ndecay  (for  a  review  see  [6]). \n\nWeight-decay regularization  is  easily  incorporated  within  the framework  of on-line \nlearning; it leads to a  rule for  the update of the student weights of the form Jf+l = \nJf + 11  6r  e - 1:r  Jf\u00b7  The  corresponding  equations of motion for  the  dynamical \nevolution of the teacher-student and student-student overlaps can again be obtained \nanalytically and integrated  numerically from random initial conditions. \n\nThe picture that emerges is basically similar to that described for the noisy case:  the \ndynamical evolution of the learning process goes  through the same stages,  although \nspecific  values  for  the order  parameters  and  generalization error  at  the  symmetric \nphase and in the asymptotic regime are changed as a consequence of the modification \nin  the dynamics. \n\nOur  numerical investigations have  revealed  no scenario,  either  when  training from \nnoisy  data  or  in  the  presence  of redundant  parameters,  where  weight  decay  im(cid:173)\nproves  the  system  performance  or  speeds  up  the  training  process.  This  lack  of \neffect  is  probably  a  generic  feature  of on-line  learning,  due  to  the  absence  of an \nadditive,  stationary  error  surface  defined  over  a  finite  and  fixed  training  set.  In \noff-line  (batch)  learning,  regularization leads to improved performance through the \nmodification of such  error  surface.  These  observations  are  consistent  with  the  ab(cid:173)\nsence  of 'overfitting'  phenomena in  on-line  learning.  One  of the effects  that  arises \nwhen weight-decay regularization is  introduced in on-line learning is  a prolongation \nof the symmetric phase,  due to a  decrease  in  the  positive eingenvalue that controls \nthe onset of specialization.  This positive eigenvalue,  which  signals the instability of \nthe  symmetric fixed  point,  decreases  monotonically  with  increasing  regularization \nstrength  'Y,  and  crosses  zero  at  'Ymax  =  TJ  7max.  The  dependence  of 7max  on  1<  is \nshown in  Figure 2;  for  'Y  > 'Ymax  the symmetric fixed  point is  stable and the system \nremains trapped  there for  ever. \nThe work  reported  here  focuses  on  an architecturally  matched scenario,  with M  = \n1<.  Over-realizable  cases  with  1<  >  M  show  a  rich  behavior  that  is  rather  less \namenable to generic analysis.  It will be of interest to examine the effects of different \ntypes  of noise  and  regularizers  in  this regime. \n\nAcknowledgement:  D.S.  acknowledges  support from  EPSRC grant  GRjLl9232. \n\nReferences \n[1]  M.  Biehl  and  H.  Schwarze,  J.  Phys.  A  28,  643  (1995). \n[2]  D.  Saad  and  S.A.  Solla,  Phys.  Rev.  E  52, 4225  (1995). \n[3]  D.  Saad  and  S.A.  Solla,  preprint  (1996). \n[4]  P.  Riegler  and  M.  Biehl,  J.  Phys.  A  28,  L507  (1995). \n[5]  G.  Cybenko,  Math.  Control Signals  and Systems 2,  303  (1989). \n[6]  C.M.  Bishop,  Neural networks for pattern recognition,  (Oxford  University  Press,  Ox(cid:173)\n\nford,  1995). \n\n[7]  T.L.H.  Watkin,  A.  Rau,  and  M.  Biehl,  Rev.  Mod.  Phys.  65, 499  (1993). \n[8]  K.R.  Muller,  M.  Finke,  N.  Murata,  K.  Schulten,  and  S.  Amari,  Neural  Computation \n\n8,  1085  (1996). 
\n\n\f", "award": [], "sourceid": 1243, "authors": [{"given_name": "David", "family_name": "Saad", "institution": null}, {"given_name": "Sara", "family_name": "Solla", "institution": null}]}