{"title": "Dynamics of Training", "book": "Advances in Neural Information Processing Systems", "page_first": 141, "page_last": 147, "abstract": null, "full_text": "Dynamics of Training\n\nSiegfried B\u00f6s*\nLab for Information Representation\nRIKEN, Hirosawa 2-1, Wako-shi\nSaitama 351-01, Japan\n\nManfred Opper\nTheoretical Physics III\nUniversity of W\u00fcrzburg\n97074 W\u00fcrzburg, Germany\n\n* email: boes@zoo.riken.go.jp and opper@physik.uni-wuerzburg.de\n\nAbstract\n\nA new method to calculate the full training process of a neural network is introduced. No sophisticated methods like the replica trick are used. The results are directly related to the actual number of training steps. Some results are presented here, such as the maximal learning rate, an exact description of early stopping, and the necessary number of training steps. Further problems can be addressed with this approach.\n\n1  INTRODUCTION\n\nTraining guided by empirical risk minimization does not always minimize the expected risk. This phenomenon is called overfitting and is one of the major problems in neural network learning. In a previous work [B\u00f6s 1995] we developed an approximate description of the training process using statistical mechanics. To solve this problem exactly, we introduce a new description which depends directly on the actual training steps. As a first result we get analytical curves for the empirical risk and the expected risk as functions of the training time, like the ones shown in Fig. 1.\n\nTo make the method tractable we restrict ourselves to a quite simple neural network model, which nevertheless demonstrates some typical behavior of neural nets. The model is a single-layer perceptron, which has one $N$-dimensional layer of adjustable weights $W$ between input $x$ and output $z$. The outputs are linear, i.e.
\n\n$$z = h = \\frac{1}{\\sqrt{N}} \\sum_{i=1}^{N} W_i x_i . \\qquad (1)$$\n\nWe are interested in supervised learning, where examples $x_i^\\mu$ ($\\mu = 1, \\ldots, P$) are given for which the correct output $z_*$ is known. To define the task more clearly and to monitor the training process, we assume that the examples are provided by another network, the so-called teacher network. The teacher is not restricted to linear outputs; it can have a nonlinear output function $g_*(h_*)$.\n\nLearning by examples attempts to minimize the error averaged over all examples, i.e. $E_T := \\frac{1}{2} \\langle (z_*^\\mu - z^\\mu)^2 \\rangle_{\\{x^\\mu\\}}$, which is called training error or empirical risk. What we are actually interested in is a minimal error averaged over all possible inputs $x$, i.e. $E_G := \\frac{1}{2} \\langle (z_* - z)^2 \\rangle_{\\{x \\in \\mathrm{Input}\\}}$, called generalization error or expected risk.\n\nIt can be shown [see B\u00f6s 1995] that for random inputs, i.e. all components $x_i$ are independent and have zero mean and unit variance, the generalization error can be described by the order parameters $R$ and $Q$,\n\n$$E_G(t) = \\frac{1}{2} \\left[ G - 2 H R(t) + Q(t) \\right] \\qquad (2)$$\n\nwith the two parameters $G = \\langle [g_*(h)]^2 \\rangle_h$ and $H = \\langle g_*(h) \\, h \\rangle_h$. The order parameters are defined as\n\n$$R(t) = \\left\\langle \\frac{1}{N} \\sum_{i=1}^{N} W_i^* W_i(t) \\right\\rangle_{\\{W_i^*\\}} , \\qquad Q(t) = \\left\\langle \\frac{1}{N} \\sum_{i=1}^{N} [W_i(t)]^2 \\right\\rangle_{\\{W_i^*\\}} . \\qquad (3)$$\n\nAs a novelty in this paper we average the order parameters not, as is usual in statistical mechanics, over many example realizations $\\{x_i^\\mu\\}$, but over many teacher realizations $\\{W_i^*\\}$, where we use a spherical distribution. This corresponds to a Bayesian average over the unknown teacher. A study of the static properties of this model was done by Saad [1996]. Further comments about the averages can be found in the appendix.\n\nIn the next section we introduce our new method briefly.
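Readers who do not wish to go into technical details on a first reading can turn directly to the results (15) and (16). The remainder of the section can be read later, as a proof. In the third section results will be presented and discussed. Finally, we conclude the paper with a summary and a perspective on further problems.\n\n2  DYNAMICAL  APPROACH\n\nBasically we exploit the gradient descent learning rule, using the linear student, i.e. $g'(h) = 1$ and $z^\\mu = h^\\mu = \\frac{1}{\\sqrt{N}} W \\cdot x^\\mu$,\n\n$$W_i(t+1) = W_i(t) - \\frac{\\eta}{\\sqrt{N}} \\sum_{\\mu=1}^{P} \\left( h^\\mu(t) - z_*^\\mu \\right) x_i^\\mu . \\qquad (4)$$\n\nFor $P < N$, the weights are linear combinations of the example inputs $x_i^\\mu$, if $W_i(0) = 0$,\n\n$$W_i(t) = \\frac{1}{\\sqrt{N}} \\sum_{\\mu=1}^{P} \\sigma^\\mu(t) \\, x_i^\\mu . \\qquad (5)$$\n\nAfter some algebra a recursion for $\\sigma^\\mu(t)$ can be found, i.e.\n\n$$\\sigma^\\mu(t+1) = \\sum_{\\nu=1}^{P} \\left[ \\delta_{\\mu\\nu} - \\eta \\left( \\frac{1}{N} \\sum_{i=1}^{N} x_i^\\mu x_i^\\nu \\right) \\right] \\sigma^\\nu(t) + \\eta \\, z_*^\\mu , \\qquad (6)$$\n\nwhere the term in the round brackets defines the overlap matrix $C_{\\mu\\nu}$. From the geometric series we know the solution of this recursion, and therefore for the weights\n\n$$W_i(t) = \\frac{1}{\\sqrt{N}} \\sum_{\\mu,\\nu=1}^{P} z_*^\\mu \\left[ \\left( E - (E - \\eta C)^t \\right) C^{-1} \\right]_{\\mu\\nu} x_i^\\nu . \\qquad (7)$$\n\nThe solution (7), in which $E$ denotes the unit matrix, fulfills the initial conditions $W_i(0) = 0$ and $W_i(1) = \\frac{\\eta}{\\sqrt{N}} \\sum_{\\mu=1}^{P} z_*^\\mu x_i^\\mu$ (Hebbian), and yields after infinitely many time steps the so-called pseudo-inverse weights, i.e.\n\n$$W_i(\\infty) = \\frac{1}{\\sqrt{N}} \\sum_{\\mu,\\nu=1}^{P} z_*^\\mu \\, (C^{-1})_{\\mu\\nu} \\, x_i^\\nu . \\qquad (8)$$\n\nThis is valid as long as the examples are linearly independent, i.e. $P < N$. Remarks about the other case ($P > N$) will follow later.\n\nWith the expression (7) we can calculate the behavior of the order parameters for the whole training process. For $R(t)$ we get\n\n$$R(t) = \\frac{1}{N} \\sum_{\\mu,\\nu=1}^{P} \\left[ \\left( E - (E - \\eta C)^t \\right) C^{-1} \\right]_{\\mu\\nu} \\left\\langle z_*^\\mu \\left( \\frac{1}{\\sqrt{N}} \\sum_{i=1}^{N} W_i^* x_i^\\nu \\right) \\right\\rangle_{\\{W_i^*\\}} . \\qquad (9)$$\n\nFor the average we have used expression (21) from the appendix. Similarly we get for the other order parameter,\n\n$$Q(t) = \\frac{1}{N} \\sum_{\\mu,\\nu,\\rho,\\tau=1}^{P} \\left[ \\left( E - (E - \\eta C)^t \\right) C^{-1} \\right]_{\\mu\\nu} \\left[ \\left( E - (E - \\eta C)^t \\right) C^{-1} \\right]_{\\rho\\tau} \\times \\left\\langle z_*^\\mu z_*^\\rho \\left( \\frac{1}{N} \\sum_{i=1}^{N} x_i^\\nu x_i^\\tau \\right) \\right\\rangle_{\\{W_i^*\\}} . \\qquad (10)$$\n\nAgain we have applied an identity (20) from the appendix and we did some matrix algebra.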
Note that up to this point the order parameters were calculated without any assumption about the statistics of the inputs. The results hold even without the thermodynamic limit.\n\nThe trace can be calculated by an integration over the eigenvalues, thus we obtain integrals of the following form,\n\n$$\\frac{1}{P} \\sum_{\\mu=1}^{P} \\left[ (E - \\eta C)^l \\, C^m \\right]_{\\mu\\mu} = \\int_{\\xi_{\\min}}^{\\xi_{\\max}} d\\xi \\, \\rho(\\xi) \\, (1 - \\eta \\xi)^l \\, \\xi^m =: I_m^l(t, \\alpha, \\eta) , \\qquad (11)$$\n\nwith $l \\in \\{0, t, 2t\\}$, $m \\in \\{-1, 0, 1\\}$ and $\\alpha = P/N$.\n\nThese integrals can be calculated once we know the density of the eigenvalues $\\rho(\\xi)$. The determination of this density can be found in the recent literature: it was calculated by Opper [1989] using replicas, by Krogh [1992] using perturbation theory and by Sollich [1995] with matrix identities. We should note that the thermodynamic limit and the special assumptions about the inputs enter the calculation here. All authors found\n\n$$\\rho(\\xi) = \\frac{\\sqrt{(\\xi_{\\max} - \\xi)(\\xi - \\xi_{\\min})}}{2 \\pi \\alpha \\, \\xi} \\qquad (12)$$\n\nfor $\\alpha < 1$. The maximal and the minimal eigenvalues are $\\xi_{\\max,\\min} := (1 \\pm \\sqrt{\\alpha})^2$. So all that remains now is a numerical integration.\n\nSimilarly we can calculate the behavior of the training error from\n\n$$E_T(t) = \\left\\langle \\frac{1}{2P} \\sum_{\\mu=1}^{P} (z_*^\\mu - h^\\mu)^2 \\right\\rangle_{\\{W_i^*\\}} . \\qquad (13)$$\n\nFor the overdetermined case ($P > N$) we can find a recursion analogous to (6),\n\n$$W_i(t+1) = \\sum_{j=1}^{N} \\left[ \\delta_{ij} - \\eta \\left( \\frac{1}{N} \\sum_{\\mu=1}^{P} x_i^\\mu x_j^\\mu \\right) \\right] W_j(t) + \\frac{\\eta}{\\sqrt{N}} \\sum_{\\mu=1}^{P} z_*^\\mu x_i^\\mu . \\qquad (14)$$\n\nThe term in the round brackets now defines the matrix $B_{ij}$. The calculation is therefore quite similar to the one above, with the matrix $B$ playing the role of the matrix $C$. The density of the eigenvalues $\\rho(\\lambda)$ for the matrix $B$ is the one from above (12) multiplied by $\\alpha$.
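The integrals (11) are straightforward to evaluate numerically. The following Python sketch is an illustration only and is not the authors' code. It assumes that the eigenvalue density of C for alpha < 1 is the Marchenko-Pastur density sqrt((xi_max - xi)(xi - xi_min)) / (2 pi alpha xi), with the edges (1 +/- sqrt(alpha))^2 quoted above. The function names `integral_I` and `empirical_I`, and the substitution xi = 1 + alpha + 2 sqrt(alpha) cos(theta), are choices made here for illustration.

```python
import numpy as np


def integral_I(l, m, alpha, eta, n_theta=4001):
    '''Approximate I^l_m = int d(xi) rho(xi) (1 - eta*xi)^l xi^m for alpha < 1.

    Assumption of this sketch: rho is the Marchenko-Pastur density.
    The substitution xi = 1 + alpha + 2*sqrt(alpha)*cos(theta) removes the
    square-root edges and leaves a smooth integrand on [0, pi].
    '''
    theta = np.linspace(0.0, np.pi, n_theta)
    xi = 1.0 + alpha + 2.0 * np.sqrt(alpha) * np.cos(theta)
    # rho(xi) times |d xi / d theta| simplifies to 2 sin^2(theta) / (pi xi)
    weight = 2.0 * np.sin(theta) ** 2 / (np.pi * xi)
    f = weight * (1.0 - eta * xi) ** l * xi ** m
    d_theta = theta[1] - theta[0]
    return d_theta * (np.sum(f) - 0.5 * (f[0] + f[-1]))  # trapezoidal rule


def empirical_I(l, m, P, N, eta, seed=0):
    '''Same quantity from the eigenvalues of one random overlap matrix C.'''
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, N))   # P examples with N Gaussian components
    C = X @ X.T / N                   # overlap matrix, P x P
    xi = np.linalg.eigvalsh(C)
    return np.mean((1.0 - eta * xi) ** l * xi ** m)


if __name__ == '__main__':
    alpha, eta, t = 0.5, 0.1, 10
    for m in (-1, 0, 1):
        print(m, integral_I(t, m, alpha, eta), empirical_I(t, m, 200, 400, eta))
```

For l = 0 the integrals reduce to moments of the density, which gives a quick sanity check: the zeroth and first moments should both equal 1, and the inverse moment should equal 1 / (1 - alpha).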
\n\nAltogether, we find the following results in the case of $\\alpha < 1$,\n\n$$E_G(t, \\alpha, \\eta) = \\frac{G}{2} + \\frac{G - H^2}{2} \\, \\alpha \\left( \\frac{1}{1 - \\alpha} - 2 I_{-1}^{t} + I_{-1}^{2t} \\right) - \\frac{H^2}{2} \\, \\alpha \\left( 1 - I_0^{2t} \\right) , \\qquad (15)$$\n\n$$E_T(t, \\alpha, \\eta) = \\frac{G - H^2}{2} \\, I_0^{2t} + \\frac{H^2}{2} \\, I_1^{2t} , \\qquad (16)$$\n\nand in the case of $\\alpha > 1$,\n\n$$E_G(t, \\alpha, \\eta) = \\frac{G - H^2}{2} \\left( 1 + \\frac{1}{\\alpha - 1} - 2 I_{-1}^{t} + I_{-1}^{2t} \\right) + \\frac{H^2}{2} \\, I_0^{2t} ,$$\n\n$$E_T(t, \\alpha, \\eta) = \\frac{G - H^2}{2} \\left( 1 - \\frac{1}{\\alpha} + \\frac{I_0^{2t}}{\\alpha} \\right) + \\frac{H^2}{2} \\, \\frac{I_1^{2t}}{\\alpha} .$$\n\nIf $t \\to \\infty$ all the time-dependent integrals $I_k^{t}$ and $I_k^{2t}$ vanish. The remaining first two terms describe, in the limit $\\alpha \\to \\infty$, the optimal convergence rate of the errors. In the next section we discuss the implications of this result.\n\n3  RESULTS\n\nFirst we illustrate how well the theoretical results describe the training process. If we compare the theory with simulations, we find a very good correspondence, see Fig. 1.\n\nTrying other values for the learning rate we can see that there is a maximal learning rate. It is twice the inverse of the maximal eigenvalue of the matrix $B$, i.e.\n\n$$\\eta_{\\max} = \\frac{2}{\\xi_{\\max}} = \\frac{2}{(1 + \\sqrt{\\alpha})^2} . \\qquad (17)$$\n\nThis is consistent with a more general result, that the maximal learning rate is twice the inverse of the maximal eigenvalue of the Hessian. In the case of the linear perceptron the matrix $B$ is identical to the Hessian.\n\nAs our approach is directly related to the actual number of training steps we can examine how training time varies in different training scenarios. Training can be stopped if the training error reaches a certain minimal value, i.e. if $E_T(t) \\le E_T^{\\min} + \\epsilon$. Or, in cross-validated early stopping, we will terminate training if the generalization error starts to increase, i.e. if $E_G(t + 1) > E_G(t)$.
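The stability bound (17) can be illustrated with a small simulation. The sketch below is not taken from the paper. It assumes batch gradient descent in the form W(t+1) = W(t) - (eta / sqrt(N)) sum_mu (h_mu - z_mu) x_mu, which is consistent with the recursion (14), Gaussian inputs, and a tanh teacher with the parameters quoted for Fig. 1. The names `eta_max_theory`, `make_task`, `train` and `largest_eigenvalue` are hypothetical.

```python
import numpy as np


def eta_max_theory(alpha):
    '''Maximal learning rate 2 / (1 + sqrt(alpha))**2 from eq. (17).'''
    return 2.0 / (1.0 + np.sqrt(alpha)) ** 2


def make_task(P, N, gain=5.0, seed=0):
    '''Gaussian inputs and outputs of a tanh teacher (assumed setup).'''
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((P, N))
    w_teacher = rng.standard_normal(N)
    w_teacher *= np.sqrt(N) / np.linalg.norm(w_teacher)  # spherical teacher
    z = np.tanh(gain * (X @ w_teacher) / np.sqrt(N))
    return X, z


def train(X, z, eta, steps):
    '''Batch gradient descent for the linear student; returns E_T per step.'''
    P, N = X.shape
    W = np.zeros(N)                               # initial condition W(0) = 0
    errors = np.empty(steps + 1)
    for t in range(steps + 1):
        h = X @ W / np.sqrt(N)                    # student outputs
        errors[t] = 0.5 * np.mean((z - h) ** 2)   # training error E_T(t)
        W = W - eta / np.sqrt(N) * (X.T @ (h - z))
    return errors


def largest_eigenvalue(X):
    '''Largest eigenvalue of B = X^T X / N for this finite sample.'''
    N = X.shape[1]
    return np.linalg.eigvalsh(X.T @ X / N)[-1]


if __name__ == '__main__':
    N, alpha = 200, 1.5
    X, z = make_task(int(alpha * N), N)
    print('eta_max (theory):', eta_max_theory(alpha))
    print('eta_max (sample):', 2.0 / largest_eigenvalue(X))
    print('E_T after 150 steps at eta = 0.1:', train(X, z, 0.1, 150)[-1])
```

In this sketch a learning rate slightly below twice the inverse of the largest sample eigenvalue gives a training error that never increases, while a rate slightly above it makes the error grow without bound.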
\n\nFigure 1: Behavior of the generalization error $E_G$ (upper line) and the training error $E_T$ (lower line) during the training process. As the loading rate $\\alpha = P/N = 1.5$ is near the storage capacity ($\\alpha = 1$) of the net, overfitting occurs. The theory describes the results of the simulations very well. Parameters: learning rate $\\eta = 0.1$, system size $N = 200$, and $g_*(h) = \\tanh(\\gamma h)$ with gain $\\gamma = 5$.\n\nFig. 2 shows that in exhaustive training the training time diverges for $\\alpha$ near 1, in the region where overfitting also occurs. In the same region early stopping shows only a slight increase in training time.\n\nFurthermore, we can guess from Fig. 2 that asymptotically only a few training steps are necessary to fulfill the stopping criteria. This has to be specified more precisely. First we study the behavior of $E_G$ after only one training step, i.e. $t = 1$. Since we are interested in the limit of many examples ($\\alpha \\to \\infty$) we choose the learning rate as a fraction of the maximal learning rate (17), i.e. $\\eta = \\eta_0 / \\xi_{\\max}$. Then we can calculate the behavior of $E_G(t = 1, \\alpha, \\eta_0 / \\xi_{\\max})$ analytically. We find that only in the case of $\\eta_0 = 1$ can the generalization error reach its asymptotic minimum $E_{\\mathrm{inf}}$. The rate of the convergence is $\\alpha^{-1}$ as in the optimal case, but the prefactor is different.\n\nHowever, already for $t = 2$ we find\n\n$$\\epsilon_G(t = 2, \\alpha, \\eta = \\xi_{\\max}^{-1}) := E_G - E_{\\mathrm{inf}} = \\frac{G - H^2}{2} \\, \\frac{1}{\\alpha - 1} + O\\left( \\frac{1}{\\alpha^2} \\right) . \\qquad (18)$$\n\nIf $\\alpha$ is large, so that we can neglect the $\\alpha^{-2}$ term, then two batch training steps are already enough to get the optimal convergence rate.
These results are illustrated in Fig. 3.\n\n4  SUMMARY\n\nIn this paper we have calculated the behavior of the generalization error and the training error during the whole training process. The novel approach relates the errors directly to the actual number of training steps. It was shown how well this theory describes the training process. Several results have been presented, such as the maximal learning rate and the training time in different scenarios, like early stopping. If the learning rate is chosen appropriately, then only two batch training steps are necessary to reach the optimal convergence rate for sufficiently large $\\alpha$.\n\nFigure 2: Number of training steps $t$ necessary to fulfill certain stopping criteria, plotted against $P/N$ on logarithmic axes. The upper lines show the result if training is stopped when the training error is lower than $E_T^{\\min} + \\epsilon$, with $\\epsilon = 0.001$ (dotted line) and $\\epsilon = 0.01$ (dashed line). The solid line is the early stopping result, where training is stopped when the generalization error starts to increase, $E_G(t + 1) > E_G(t)$. Simulation results are indicated by marks. Parameters: learning rate $\\eta = 0.01$, system size $N = 200$, and $g_*(h) = \\tanh(\\gamma h)$ with gain $\\gamma = 5$.
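The two stopping rules compared in Fig. 2 can be written down directly. In the Python sketch below the helper names `steps_until_threshold` and `early_stopping_step` are hypothetical, and the error curves in the usage example are made-up numbers for illustration only, not results from the paper.

```python
def steps_until_threshold(e_train, e_min, eps):
    '''First step t with E_T(t) <= E_T_min + eps, or None if never reached.'''
    for t, err in enumerate(e_train):
        if err <= e_min + eps:
            return t
    return None


def early_stopping_step(e_gen):
    '''First step t with E_G(t + 1) > E_G(t), or None if E_G never rises.'''
    for t in range(len(e_gen) - 1):
        if e_gen[t + 1] > e_gen[t]:
            return t
    return None


if __name__ == '__main__':
    # Illustrative curves only (not data from the paper).
    e_train = [0.50, 0.20, 0.10, 0.05]
    e_gen = [0.50, 0.30, 0.20, 0.25, 0.40]
    print(steps_until_threshold(e_train, e_min=0.04, eps=0.02))
    print(early_stopping_step(e_gen))
```

Both helpers take precomputed error curves, so they can be applied equally to simulated errors or to the analytical expressions for the training and generalization error evaluated step by step.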
\n\nFurther problems, like the dynamical description of weight decay and the relation of the dynamical approach to the thermodynamic description of the training process [see B\u00f6s 1995], cannot be discussed here due to lack of space. These problems are examined in an extended version of this work [B\u00f6s and Opper 1996]. It would be very interesting if this method could be extended towards other, more realistic models.\n\nA  APPENDIX\n\nHere we add some identities which are necessary for the averages over the teacher weight distributions, eqs. (9) and (10). In the statistical mechanics approach one assumes that the distribution of the local fields $h$ is Gaussian. This becomes true if one averages over random inputs $x_i$ whose first two moments are zero and one, which is the usual approach [see B\u00f6s 1995 and references therein]. In principle it is also possible to average over many tasks, i.e. many teacher realizations $W^*$, which is done here. The Gaussian local fields $h_*^\\mu$ fulfill\n\n$$\\langle h_*^\\mu \\rangle = 0 , \\qquad \\langle h_*^\\mu h_*^\\nu \\rangle = C_{\\mu\\nu} . \\qquad (19)$$\n\nThis implies\n\n$$\\langle z_*^\\mu z_*^\\nu \\rangle_{\\{W_i^*\\}} = \\int_{-\\infty}^{\\infty} Dh_*^\\mu \\int_{-\\infty}^{\\infty} Dh_*^\\nu \\, g_*\\left( \\sqrt{1 - (C_{\\mu\\nu})^2} \\, h_*^\\mu + C_{\\mu\\nu} h_*^\\nu \\right) g_*(h_*^\\nu) \\approx \\delta_{\\mu\\nu} G + (C_{\\mu\\nu} - \\delta_{\\mu\\nu}) H^2 . \\qquad (20)$$\n\nIn the second identity we first calculated the diagonal term and for the non-diagonal term we made an expansion assuming small correlations. Similarly the following identity can be proved,\n\n$$\\langle z_*^\\mu h_*^\\nu \\rangle_{\\{W_i^*\\}} = \\delta_{\\mu\\nu} H + (C_{\\mu\\nu} - \\delta_{\\mu\\nu}) H . \\qquad (21)$$
\n\nFigure 3: Behavior of $\\epsilon_G = E_G - E_{\\mathrm{inf}}$ after $t$ training steps, plotted against $P/N$ on logarithmic axes (legend entries: exh., opt., t=1, t=2, t=3). Results for $t = 1$, 2 and 3 are given. For large enough $\\alpha$ it is possible to reach the optimal convergence (solid line) already after $t = 2$ training steps. If $t = 3$ the optimal result is reached even faster. Parameters: learning rate $\\eta = \\xi_{\\max}^{-1}$ and $g_*(h) = \\tanh(\\gamma h)$ with gain $\\gamma = 5$.\n\nAcknowledgment: We thank Shun-ichi Amari for many discussions and E. Helle and A. Stevenin-Barbier for proofreading and valuable comments.\n\nReferences\n\nB\u00f6s S. (1995), 'Avoiding overfitting by finite temperature learning and cross-validation', in Int. Conference on Artificial Neural Networks 95 (ICANN'95), edited by EC2 & Cie, Vol. 2, p. 111-116.\n\nB\u00f6s S., and Opper M. (1996), 'An exact description of early stopping and weight decay', submitted.\n\nKinzel W., and Opper M. (1995), 'Dynamics of learning', in Models of Neural Networks I, edited by E. Domany, J. L. van Hemmen and K. Schulten, Springer, p. 157-179.\n\nKrogh A. (1992), 'Learning with noise in a linear perceptron', J. Phys. A 25, p. 1135-1147.\n\nOpper M. (1989), 'Learning in neural networks: Solvable dynamics', Europhys. Lett. 8, p. 389-392.\n\nSaad D. (1996), 'General Gaussian priors for improved generalization', submitted to Neural Networks.\n\nSollich P. (1995), 'Learning in large linear perceptrons and why the thermodynamic limit is relevant to the real world', in NIPS 7, p. 207-214.\n", "award": [], "sourceid": 1225, "authors": [{"given_name": "Siegfried", "family_name": "B\u00f6s", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}]}