{"title": "On-line Learning from Finite Training Sets in Nonlinear Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 357, "page_last": 363, "abstract": null, "full_text": "Online learning from finite  training sets \n\nin nonlinear  networks \n\nPeter Sollich* \n\nDavid Barbert \n\nDepartment of Physics \nUniversity of Edinburgh \nEdinburgh ERg 3JZ,  U.K. \n\nP.Sollich~ed.ac.uk \n\nDepartment of Applied Mathematics \n\nAston University \n\nBirmingham B4 7ET, U.K. \nD.Barber~aston . ac.uk \n\nAbstract \n\nOnline  learning  is  one  of the  most  common  forms  of neural  net(cid:173)\nwork training.  We present an analysis of online learning from finite \ntraining sets for  non-linear networks  (namely,  soft-committee ma(cid:173)\nchines),  advancing the  theory to more  realistic learning scenarios. \nDynamical  equations  are  derived  for  an  appropriate  set  of order \nparameters;  these  are  exact  in  the  limiting  case  of  either  linear \nnetworks  or  infinite  training  sets.  Preliminary  comparisons  with \nsimulations suggest that the theory captures some  effects  of finite \ntraining sets, but may not yet account correctly for  the presence of \nlocal minima. \n\n1 \n\nINTRODUCTION \n\nThe analysis of online gradient descent learning, as  one of the most common forms \nof supervised learning, has recently stimulated a great deal of interest [1, 5, 7,  3].  In \nonline learning, the weights of a network ('student')  are updated immediately after \npresentation  of each  training  example  (input-output  pair)  in  order  to  reduce  the \nerror that the network makes on  that example.  One of the primary goals of online \nlearning analysis is  to track the resulting evolution of the generalization error - the \nerror that the student network makes on a novel test example, after a given number \nof example  presentations.  
In order to specify the learning problem, the training outputs are assumed to be generated by a teacher network of known architecture. Previous studies of on-line learning have often imposed somewhat restrictive and unrealistic assumptions about the learning framework: either that the size of the training set is infinite, or that the learning rate is small [1, 5, 4]. Finite training sets present a significant analytical difficulty, since successive weight updates are correlated, giving rise to highly non-trivial generalization dynamics.

For linear networks, the difficulties encountered with finite training sets and non-infinitesimal learning rates can be overcome by extending the standard set of descriptive ('order') parameters to include the effects of weight update correlations [7]. In the present work, we extend our analysis to nonlinear networks. The particular model we choose to study is the soft-committee machine, which is capable of representing a rich variety of input-output mappings. Its on-line learning dynamics has been studied comprehensively for infinite training sets [1, 5]. In order to carry out our analysis, we adapt tools originally developed in the statistical mechanics literature which have found application, for example, in the study of Hopfield network dynamics [2].

* Royal Society Dorothy Hodgkin Research Fellow
† Supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for Neural Networks

2 MODEL AND OUTLINE OF CALCULATION

For an N-dimensional input vector x, the output of the soft committee machine is given by

y(x) = Σ_{l=1..L} g(h_l)    (1)

where the nonlinear activation function g(h) = erf(h/√2) acts on the activations h_l = w_l^T x/√N (the factor 1/√N is for convenience only). 
This is a neural network with L hidden units, input-to-hidden weight vectors w_l, l = 1..L, and all hidden-to-output weights set to 1.

In on-line learning, the student weights are adapted on a sequence of presented examples to better approximate the teacher mapping. The training examples are drawn, with replacement, from a finite set, {(x^μ, y^μ), μ = 1..p}. This set remains fixed during training. Its size relative to the input dimension is denoted by α = p/N. We take the input vectors x^μ as samples from an N-dimensional Gaussian distribution with zero mean and unit variance. The training outputs y^μ are assumed to be generated by a teacher soft committee machine with hidden weight vectors w*_m, m = 1..M, with additive Gaussian noise corrupting its activations and output.

The discrepancy between the teacher and student on a particular training example (x, y), drawn from the training set, is given by the squared difference of their corresponding outputs,

E = ½ [Σ_l g(h_l) − y]² = ½ [Σ_l g(h_l) − Σ_m g(k_m + ξ_m) − ξ_0]²

where the student and teacher activations are, respectively,

h_l = (1/√N) w_l^T x,    k_m = (1/√N) (w*_m)^T x    (2)

and ξ_m, m = 1..M, and ξ_0 are noise variables corrupting the teacher activations and output, respectively.

Given a training example (x, y), the student weights are updated by a gradient descent step with learning rate η,

w_l' − w_l = −η ∇_{w_l} E = −(η/√N) x ∂_{h_l} E    (3)

The generalization error is defined to be the average error that the student makes on a test example selected at random (and uncorrelated with the training set), which we write as ε_g = ⟨E⟩. 
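As a concrete illustration (not part of the original paper), the model (1)-(3) can be sketched as a small simulation: a student soft committee machine trained by single-example gradient steps, with examples drawn with replacement from a fixed finite training set generated by a teacher. All sizes, the noise-free setting, the learning rate, and the random seed below are illustrative assumptions, far from the large-N regime the theory addresses:

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy sizes (the theory works in the large-N limit)
N, L, M = 20, 2, 2            # input dimension, student / teacher hidden units
alpha, eta = 4.0, 1.0         # training set size p = alpha * N, learning rate
p, steps = int(alpha * N), 2000

def g(h):
    """Activation g(h) = erf(h / sqrt(2)), applied elementwise."""
    return np.vectorize(math.erf)(h / math.sqrt(2.0))

def output(W, x):
    """Soft committee output: sum of g over hidden units, h_l = w_l.x / sqrt(N)."""
    return g(W @ x / math.sqrt(N)).sum()

W_teacher = rng.standard_normal((M, N))            # teacher weights
X = rng.standard_normal((p, N))                    # fixed finite training set
y = np.array([output(W_teacher, x) for x in X])    # noise-free teacher outputs

W = 0.1 * rng.standard_normal((L, N))              # student initialization

for _ in range(steps):
    idx = rng.integers(p)                          # draw with replacement
    x, target = X[idx], y[idx]
    h = W @ x / math.sqrt(N)                       # student activations
    err = g(h).sum() - target                      # (student - teacher) output
    gprime = math.sqrt(2.0 / math.pi) * np.exp(-h ** 2 / 2.0)   # g'(h)
    W -= (eta / math.sqrt(N)) * np.outer(err * gprime, x)       # update (3)

# Monte Carlo estimate of the generalization error eps_g = <E> on fresh inputs
X_test = rng.standard_normal((2000, N))
eps_g = 0.5 * np.mean([(output(W, x) - output(W_teacher, x)) ** 2
                       for x in X_test])
print("eps_g after training:", eps_g)
```

The paper's analysis replaces such direct simulation of the weights by closed dynamical equations for a small set of order parameters, which is the subject of the following sections.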
\nAlthough one could,  in principle,  model the student weight  dynamics directly,  this \nwill typically involve too many parameters, and we seek a more compact representa(cid:173)\ntion for  the evolution of the generalization error.  It is straightforward to show that \nthe generalization  error depends,  not  on  a  detailed  description  of all  the network \n\nweights,  but only on  the overlap  parameters Qll'  = ~ W r WI'  and  Rim  = ~ W r w':n \n[1,  5,  7].  In the case  of infinite  0, it  is  possible  to obtain a  closed  set  of equations \ngoverning the overlap parameters Q, R  [5].  For finite  training sets, however, this is \nno longer possible,  due to the correlations between successive weight  updates[7]. \nIn order to overcome this difficulty, we use a technique developed originally to study \nstatistical physics systems [2] .  Initially,  consider the dynamics of a general vector of \norder parameters, denoted by  0, which  are functions  of the network weights w.  If \nthe weight  updates  are described  by  a  transition  probability T(w -+  w'),  then  an \napproximate update equation for  0  is \n\n0' - 0  =  IfdW' (O(w') - O(w))  T(w -+ W')) \n\n\\ \n\nP(w)oc6(O(w)-O) \n\n(4) \n\nIntuitively,  the  integral  in  the  above  equation expresses  the  average  changel  of 0 \ncaused by a weight  update w  -+  w', starting from  (given)  initial weights w.  Since \nour aim is to develop a closed set of equations for the order parameter dynamics, we \nneed to remove the dependency  on  the initial weights  w.  The only information we \nhave regarding w  is  contained in  the chosen order parameters 0, and we  therefore \naverage the result  over  the  'subshell'  of all  w  which  correspond to these values  of \nthe order parameters.  This is  expressed as the 8-function constraint in equation(4). 
\nIt is  clear  that  if the  integral  in  (4)  depends  on  w  only  through  O(w),  then  the \naverage is  unnecessary and the resulting dynamical equations are exact.  This is  in \nfact the case for  0  -+  00 and 0  =  {Q, R}, the standard order parameters mentioned \nabove[5].  If this cannot be achieved, one should choose a set of order parameters to \nobtain approximate equations which  are as close as possible  to the exact solution. \nThe motivation for  our choice of order parameters is  based on the linear perceptron \ncase where, in addition to the standard parameters Q and R,  the overlaps projected \nonto eigenspaces of the training input correlation matrix A = ~ E:=l xl' (xl') T  are \nrequired2 .  We  therefore split the eigenvalues of A  into r  equal blocks  ('Y  =  1 ... r) \ncontaining  N'  =  N Ir eigenvalues  each,  ordering  the  eigenvalues  such  that  they \nincrease with 'Y.  We  then define  projectors p'Y  onto the corresponding eigenspaces \nand take as order parameters: \n\nR'Y \n\n_  1  Tp'Y \n\n1m  - N'w, \n\nwm  U'Y  - ~ Tp'Yb \n.. \n\nI.  - Nt W, \n\nII \n\n(5) \n\nwhere the b B  are linear combinations of the noise variables and training inputs, \n\n(6) \n\n1 Here  we  assume  that the system size  N  is  large  enough  that the mean values of the \n\nparameters alone describe the dynamics sufficiently well  (i. e.,  self-averaging  holds). \n\n2The order parameters actually used in our calculation for  the linear perceptron[7]  are \n\nLaplace transforms of these projected order parameters. \n\n\f360 \n\nP.  Sollich and D.  Barber \n\nAs  r  -+  00,  these order parameters become functionals  of a  continuous variable3 . \nThe  updates  for  the  order  parameters  (5)  due  to  the  weight  updates  (3)  can  be \nfound  by taking the scalar products of (3)  with either projected student or teacher \nweights, as appropriate.  
This then introduces the following activation 'components',

h^γ_l = (1/√N') w_l^T P^γ x,    k^γ_m = (1/√N') (w*_m)^T P^γ x    (7)

so that the student and teacher activations are h_l = (1/√r) Σ_γ h^γ_l and k_m = (1/√r) Σ_γ k^γ_m, respectively. For the linear perceptron, the chosen order parameters form a complete set: the dynamical equations close, without need for the average in (4).

For the nonlinear case, we now sketch the calculation of the order parameter update equations (4). Taken together, the integral over w' (a sum of p discrete terms in our case, one for each training example) and the subshell average in (4) define an average over the activations (2), their components (7), and the noise variables ξ_m, ξ_0. These variables turn out to be Gaussian distributed with zero mean, and therefore only their covariances need to be worked out. One finds that these are in fact given by the naive training set averages. For example,

(8)

where we have used P^γ A = a^γ P^γ with a^γ 'the' eigenvalue of A in the γ-th eigenspace; this is well defined for r → ∞ (see [6] for details of the eigenvalue spectrum). The correlations of the activations and noise variables explicitly appearing in the error in (3) are calculated similarly to give,

⟨h_l h_l'⟩ = (1/r) Σ_γ (a^γ/α) Q^γ_ll'
⟨h_l k_m⟩ = (1/r) Σ_γ (a^γ/α) R^γ_lm
⟨h_l ξ_s⟩ = (1/r) Σ_γ (1/α) U^γ_ls
⟨ξ_s ξ_s'⟩ = σ_s² δ_ss'    (9)

where the final equation defines the noise variances. The T^γ_mm' are projected overlaps between teacher weight vectors, T^γ_mm' = (1/N') (w*_m)^T P^γ w*_{m'}. We will assume that the teacher weights and training inputs are uncorrelated, so that T^γ_mm' is independent of γ. 
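The claim that these covariances reduce to naive training set averages can be checked numerically. The sketch below takes one consistent reading of the first covariance in (9) and verifies it in the extreme case r = N, where every 'block' holds a single eigenvalue (N' = 1), so that the block eigenvalue a^γ is exact rather than an approximation; all sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

N, p = 50, 100                  # input dimension, training set size
alpha = p / N

X = rng.standard_normal((p, N))
A = X.T @ X / N                 # correlation matrix, as in the text
a, V = np.linalg.eigh(A)        # one eigenvalue per 'block': the r = N, N' = 1 case

w1, w2 = rng.standard_normal((2, N))   # two student weight vectors (illustrative)

# Left-hand side: naive training set average of h1 * h2, with h = w.x / sqrt(N)
h1 = X @ w1 / np.sqrt(N)
h2 = X @ w2 / np.sqrt(N)
lhs = (h1 * h2).mean()

# Right-hand side: (1/r) sum_gamma (a^gamma / alpha) Q^gamma, where for N' = 1
# the projected overlap is Q^gamma = (w1 . v_gamma)(w2 . v_gamma)
Q_gamma = (V.T @ w1) * (V.T @ w2)
rhs = (a * Q_gamma / alpha).mean()

print(np.isclose(lhs, rhs))   # → True
```

For finite r < N the same identity holds only approximately, since a^γ then stands in for all N' eigenvalues of its block; the approximation sharpens as r grows, consistent with the r → ∞ limit taken in the theory.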
The required covariances of the 'component' activations are

⟨k^γ_m h_l⟩ = (a^γ/α) R^γ_lm        ⟨k^γ_m k_m'⟩ = (a^γ/α) T^γ_mm'        ⟨k^γ_m ξ_s⟩ = 0
⟨c^γ_s h_l⟩ = (a^γ/α) U^γ_ls        ⟨c^γ_s k_m'⟩ = 0                      ⟨c^γ_s ξ_s'⟩ = (a^γ/α) σ_s² δ_ss'
⟨h^γ_l h_l'⟩ = (a^γ/α) Q^γ_ll'      ⟨h^γ_l k_m'⟩ = (a^γ/α) R^γ_lm'        ⟨h^γ_l ξ_s⟩ = (1/α) U^γ_ls    (10)

³ Note that the limit r → ∞ is taken after the thermodynamic limit, i.e., r ≪ N. This ensures that the number of order parameters is always negligible compared to N (otherwise self-averaging would break down).

Figure 1: ε_g vs t for student and teacher with one hidden unit (L = M = 1); α = 2, 3, 4 from above, learning rate η = 1. Noise of equal variance was added to both activations and output: (a) σ_ξ² = σ_0² = 0.01, (b) σ_ξ² = σ_0² = 0.1. Simulations for N = 100 are shown by circles; standard errors are of the order of the symbol size. The bottom dashed lines show the infinite training set result for comparison. r = 10 was used for calculating the theoretical predictions; the curve marked "+" in (b), with r = 20 (and α = 2), shows that this is large enough to be effectively in the r → ∞ limit. 
\n\nUsing  equation  (3)  and the definitions  (7),  we  can now  write  down  the dynamical \nequations,  replacing the number of updates n by the continuous  variable  t = n/ N \nin the limit N  -+  00: \n\nOtRim \nOtU?s \nOtQIz, \n\n-\"I (k-:nOh,E) \n-\"I (c~oh,E) \n-\"I (h7 Oh\"  E) - \"I (h~ Oh, E) + \"12  a-y  (Oh,Eoh\"  E) \n\na \n\n(11) \n\nwhere the averages are over zero mean Gaussian variables, with covariances (9,10). \nUsing the explicit form  of the error E, we  have \n\noh,E =  g'(h,)  [L9(hl') - Lg(km + em) - eo] \n\nI' \n\nm \n\n(12) \n\nwhich,  together with the equations  (11)  completes the description of the dynamics. \nThe  Gaussian  averages  in  (11)  can  be  straightforwardly  evaluated  in  a  manner \nsimilar  to  the  infinite  training  set  case[5],  and  we  omit  the  rather  cumbersome \nexplicit form  of the resulting equations. \nWe  note  that,  in  contrast  to the  infinite  training set  case,  the student  activations \nhI  and  the  noise  variables  Cs and  es  are  now  correlated  through  equation  (10). \nIntuitively,  this  is  reasonable  as  the  weights  become  correlated,  during  training, \nwith the examples in the training set.  In calculating the generalization error, on the \nother hand, such correlations are absent, and one has the same result as for  infinite \ntraining  sets.  The  dynamical  equations  (11),  together  with  (9,10)  constitute  our \nmain result.  They are exact for  the limits of either a linear network (R, Q, T  -+ 0, \nso that g(x)  ex:  x)  or a  -+  00, and can be integrated numerically in a straightforward \nway.  In  principle, the limit r  -+  00 should be taken but, as  shown below,  relatively \nsmall values of r  can be taken in  practice. 
\n\n3  RESULTS  AND  DISCUSSION \n\nWe  now  discuss  the  main consequences of our result  (11),  comparing the resulting \npredictions for the generalization dynamics, fg(t), to the infinite training set theory \n\n\f362 \n\n0.25k \n\\  ... ---\n\n. \n0.15 \n\n0.05 \n\n1 \n, \n\n0.1 \n\nP.  Sollich and D.  Barber \n\n(a) \n\n0.4 ..----------~--~----, \n\n(b) \n\n,'--\n\n0.3 \n\n0.2 \n\n0.1 \n\n~ooooooooooooooooooo \nOL---~----------~----~ \no \n1W  t  200 \n\n100 \n\nW \n\n02  100000000000000000000000 \n\n______________ ~ \n\n---\n\nO~--~------~----~~~ \no \n40  t  50 \n\n30 \n\n20 \n\n10 \n\nFigure  2:  \u20acg  VS  t  for  two  hidden  units  (L  =  M  =  2).  Left:  a  = 0.5,  with a  = 00 \nshown  by dashed line  for  comparison;  no  noise.  Right:  a  =  4,  no  noise  (bottom) \nand noise on teacher activations and outputs of variance 0.1  (top).  Simulations for \nN  =  100 are shown by small circles;  standard errors are less  than the symbol size. \nLearning rate fJ  =  2 throughout. \n\nand  to  simulations.  Throughout,  the  teacher  overlap  matrix  is  set  to  Tij  =  c5ij \n(orthogonal teacher weight  vectors of length V'ii). \nIn figure(l),  we  study  the  accuracy  of our  method  as  a  function  of the  training \nset  size  for  a  nonlinear  network with  one hidden  unit  at two  different  noise  levels. \nThe  learning  rate  was  set  to  fJ  =  1  for  both  (a)  and  (b).  For  small  activation \nand  output  noise  (0'2  = 0.01),  figure(la) ,  there  is  good  agreement  with  the  sim(cid:173)\nulations  for  a  down  to  a  =  3,  below  which  the  theory  begins  to  underestimate \nthe  generalization  error,  compared  to  simulations.  Our  finite  a  theory,  however, \nis  still considerably  more  accurate than the infinite a  predictions.  For larger noise \n(0'2  =  0.1, figure(lb\u00bb, our theory provides a reasonable quantitative estimate of the \ngeneralization  dynamics  for  a  > 3.  
Below this value there is significant disagreement, although the qualitative behaviour of the dynamics is predicted quite well, including the overfitting phenomenon beyond t ≈ 10. The infinite α theory in this case is qualitatively incorrect.

In the two hidden unit case, figure 2, our theory captures the initial evolution of ε_g(t) very well, but diverges significantly from the simulations at larger t; nevertheless, it provides a considerable improvement on the infinite α theory. One reason for the discrepancy at large t is that the theory predicts that different student hidden units will always specialize to individual teacher hidden units for t → ∞, whatever the value of α. This leads to a decay of ε_g from a plateau value at intermediate times t. In the simulations, on the other hand, this specialization (or symmetry breaking) appears to be inhibited, or at least delayed until very large t. This can happen even for zero noise and α ≥ L, where the training data should contain enough information to force student and teacher weights to be equal asymptotically. The reason for this is not clear to us, and deserves further study. Our initial investigations, however, suggest that symmetry breaking may be strongly delayed due to the presence of saddle points in the training error surface with very 'shallow' unstable directions.

When our theory fails, which of its assumptions are violated? It is conceivable that multiple local minima in the training error surface could cause self-averaging to break down; however, we have found no evidence for this, see figure 3(a). On the other hand, the simulation results in figure 3(b) clearly show that the implicit assumption of Gaussian student activations, as discussed before equation (8), can be violated. 
\n\n\fOn-line Learning from Finite Training Sets in Nonlinear Networks \n\n(a) \n\n363 \n\n(b) \n\n/ \n\nVariance over training histories \n\n10\"'\" ' - - - - - - - - - - - - - - - '  \n\n102 \n\nN \n\nFigure 3:  (a)  Variance of fg(t = 20)  vs  input dimension  N  for  student and teacher \nwith  two hidden  units  (L  = M  = 2),  a  = 0.5,  'fJ  = 2,  and zero  noise.  The bottom \ncurve shows the variance due to different random choices of training examples from \na fixed  training set  ('training history'); the top curve also includes the variance due \nto different training sets.  Both are compatible with the liN decay expected if self(cid:173)\naveraging holds  (dotted line).  (b)  Distribution  (over training set)  of the activation \nhI of the first hidden unit of the student.  Histogram from simulations for N  = 1000, \nall other parameter values as in  (a). \n\nIn summary, the main theoretical contribution of this paper is the extension of online \nlearning  analysis  for  finite  training  sets  to  nonlinear networks.  Our  approximate \ntheory does not require the use of replicas and yields ordinary first  order differential \nequations for  the  time  evolution of a  set  of order parameters.  Its  central implicit \nassumption  (and  its  Achilles'  heel)  is  that  the  student  activations  are  Gaussian \ndistributed.  In comparison with simulations, we have found that it is more accurate \nthan the infinite training set analysis at predicting the generalization dynamics for \nfinite  training  sets,  both  qualitatively  and  also  quantitatively  for  small  learning \ntimes t.  Future work will have to show whether the theory can be extended to cope \nwith  non-Gaussian  student  activations  without  incurring the  technical  difficulties \nof dynamical replica theory [2],  and whether this will  help to capture the effects of \nlocal  minima and,  more generally,  'rough' training error surfaces. 
\nAcknowledgments:  We  would like to thank Ansgar West for helpful discussions. \n\nReferences \n\n[1]  M.  Biehl and H.  Schwarze.  Journal  of Physics  A,  28:643-656, 1995. \n[2]  A.  C.  C.  Coolen,  S.  N.  Laughton, and D.  Sherrington.  In  NIPS 8,  pp.  253-259, \nMIT Press, 1996;  S.N.  Laughton,  A.C.C.  Coolen,  and  D.  Sherrington.  Journal \nof Physics  A, 29:763-786, 1996. \n\n[3]  See for  example:  The dynamics of online learning.  Workshop at NIPS'95. \n[4]  T.  Heskes  and B.  Kappen.  Physical Review A, 44:2718-2762, 1994. \n[5]  D.  Saad and S.  A.  Solla  Physical Review E,  52:4225,  1995. \n[6]  P.  Sollich.  Journal  of Physics  A,  27:7771-7784, 1994. \n[7]  P.  Sollich and D.  Barber.  In NIPS 9,  pp.274-280, MIT Press, 1997;  Europhysics \n\nLetters, 38:477-482, 1997. \n\n\f", "award": [], "sourceid": 1390, "authors": [{"given_name": "Peter", "family_name": "Sollich", "institution": null}, {"given_name": "David", "family_name": "Barber", "institution": null}]}