{"title": "The Learning Dynamics of a Universal Approximator", "book": "Advances in Neural Information Processing Systems", "page_first": 288, "page_last": 294, "abstract": null, "full_text": "The Learning Dynamics of a Universal Approximator\n\nAnsgar H. L. West (1,2), A.H.L.West@aston.ac.uk\nDavid Saad (1), D.Saad@aston.ac.uk\nIan T. Nabney (1), I.T.Nabney@aston.ac.uk\n\n(1) Neural Computing Research Group, University of Aston, Birmingham B4 7ET, U.K.\nhttp://www.ncrg.aston.ac.uk/\n\n(2) Department of Physics, University of Edinburgh, Edinburgh EH9 3JZ, U.K.\n\nAbstract\n\nThe learning properties of a universal approximator, a normalized committee machine with adjustable biases, are studied for on-line back-propagation learning. Within a statistical mechanics framework, numerical studies show that this model has features which do not exist in previously studied two-layer network models without adjustable biases, e.g., attractive suboptimal symmetric phases even for realizable cases and noiseless data.\n\n1 INTRODUCTION\n\nRecently there has been much interest in the theoretical breakthrough in understanding the on-line learning dynamics of multi-layer feedforward perceptrons (MLPs) using a statistical mechanics framework. In the seminal paper (Saad & Solla, 1995), a two-layer network with an arbitrary number of hidden units was studied, allowing insight into the learning behaviour of neural network models whose complexity is of the same order as those used in real world applications.\nThe model studied, a soft committee machine (Biehl & Schwarze, 1995), consists of a single hidden layer with adjustable input-hidden, but fixed hidden-output weights. 
\nThe average learning dynamics of these networks are studied in the thermodynamic limit of infinite input dimension in a student-teacher scenario, where a student network is presented serially with training examples $(\\boldsymbol{\\xi}^\\mu, \\zeta^\\mu)$ labelled by a teacher network of the same architecture but with a possibly different number of hidden units. The student updates its parameters on-line, i.e., after the presentation of each example, along the gradient of the squared error on that example, an algorithm usually referred to as back-propagation.\nAlthough the above model is already quite similar to real world networks, the approach suffers from several drawbacks. First, the analysis of the mean learning dynamics employs the thermodynamic limit of infinite input dimension, a problem which has been addressed in (Barber et al., 1996), where finite size effects have been studied and it was shown that the thermodynamic limit is relevant in most cases. Second, the hidden-output weights are kept fixed, a constraint which has been removed in (Riegler & Biehl, 1995), where it was shown that the learning dynamics are usually dominated by the input-hidden weights. Third, the biases of the hidden units were fixed to zero, a constraint which is actually more severe than fixing the hidden-output weights. We show in Appendix A that soft committee machines are universal approximators provided one allows for adjustable biases in the hidden layer.\nIn this paper, we therefore study the model of a normalized soft committee machine with variable biases following the framework set out in (Saad & Solla, 1995). 
We present numerical studies of a variety of learning scenarios which lead to remarkable effects not present for the model with fixed biases.\n\n2 DERIVATION OF THE DYNAMICAL EQUATIONS\n\nThe student network we consider is a normalized soft committee machine of $K$ hidden units with adjustable biases. Each hidden unit $i$ consists of a bias $\\theta_i$ and a weight vector $\\mathbf{W}_i$ which is connected to the $N$-dimensional inputs $\\boldsymbol{\\xi}$. All hidden units are connected to a linear output unit with arbitrary but fixed gain $\\gamma$ by couplings of fixed strength. The activation of any unit is normalized by the inverse square root of the number of weight connections into the unit, which allows all weights to be of $O(1)$ magnitude, independent of the input dimension or the number of hidden units. The implemented mapping is therefore $f_{\\mathbf{W}}(\\boldsymbol{\\xi}) = (\\gamma/\\sqrt{K}) \\sum_{i=1}^{K} g(u_i - \\theta_i)$, where $u_i = \\mathbf{W}_i \\cdot \\boldsymbol{\\xi}/\\sqrt{N}$ and $g(\\cdot)$ is a sigmoidal transfer function. The teacher network to be learned is of the same architecture except for a possible difference in the number of hidden units $M$ and is defined by the weight vectors $\\mathbf{B}_n$ and biases $\\rho_n$ ($n = 1, \\ldots, M$). Training examples are of the form $(\\boldsymbol{\\xi}^\\mu, \\zeta^\\mu)$, where the input vectors $\\boldsymbol{\\xi}^\\mu$ are drawn from the normal distribution and the outputs are $\\zeta^\\mu = (\\gamma/\\sqrt{M}) \\sum_{n=1}^{M} g(v_n^\\mu - \\rho_n)$, where $v_n^\\mu = \\mathbf{B}_n \\cdot \\boldsymbol{\\xi}^\\mu/\\sqrt{N}$.\nThe weights and biases are updated in response to the presentation of an example $(\\boldsymbol{\\xi}^\\mu, \\zeta^\\mu)$, along the gradient of the squared error measure $\\epsilon = \\frac{1}{2}[\\zeta^\\mu - f_{\\mathbf{W}}(\\boldsymbol{\\xi}^\\mu)]^2$,\n\n$$\\theta_i^{\\mu+1} - \\theta_i^\\mu = -\\frac{\\eta_\\theta}{N} \\delta_i^\\mu \\quad \\text{and} \\quad \\mathbf{W}_i^{\\mu+1} - \\mathbf{W}_i^\\mu = \\frac{\\eta_w}{\\sqrt{N}} \\delta_i^\\mu \\boldsymbol{\\xi}^\\mu \\qquad (1)$$\n\nwith $\\delta_i^\\mu \\equiv [\\zeta^\\mu - f_{\\mathbf{W}}(\\boldsymbol{\\xi}^\\mu)] g'(u_i^\\mu - \\theta_i)$. The two learning rates are $\\eta_w$ for the weights and $\\eta_\\theta$ for the biases. 
In order to analyse the mean learning dynamics resulting from the above update equations, we follow the statistical mechanics framework in (Saad & Solla, 1995). Here we will only outline the main ideas and concentrate on the results of the calculation.\nAs we are interested in the typical behaviour of our training algorithm we average over all possible instances of the examples $\\boldsymbol{\\xi}$. We rewrite the update equations (1) in $\\mathbf{W}_i$ as equations in the order parameters describing the overlaps between pairs of student nodes $Q_{ij} = \\mathbf{W}_i \\cdot \\mathbf{W}_j/N$, student and teacher nodes $R_{in} = \\mathbf{W}_i \\cdot \\mathbf{B}_n/N$, and teacher nodes $T_{nm} = \\mathbf{B}_n \\cdot \\mathbf{B}_m/N$. The generalization error $\\epsilon_g$, measuring the typical performance, can be expressed solely in these variables and the biases $\\theta_i$ and $\\rho_n$. The order parameters $Q_{ij}$, $R_{in}$ and the biases $\\theta_i$ are the dynamical variables. These quantities need to be self-averaging with respect to the randomness in the training data in the thermodynamic limit ($N \\to \\infty$), which enforces two necessary constraints on our calculation. First, the number of hidden units $K \\ll N$, whereas one needs $K \\sim O(N)$ for the universal approximation proof to hold. Second, one can show that the updates of the biases have to be of $O(1/N)$, i.e., the bias learning rate has to be scaled by $1/N$, in order to make the biases self-averaging quantities, a fact that is confirmed by simulations [see Fig. 1]. If we interpret the normalized example number $\\alpha = \\mu/N$ as a continuous time variable, the update equations for the order parameters and the biases become first order coupled differential equations\n\n$$\\frac{dQ_{ij}}{d\\alpha} = \\eta_w \\langle \\delta_i u_j + \\delta_j u_i \\rangle_{\\boldsymbol{\\xi}} + \\eta_w^2 \\langle \\delta_i \\delta_j \\rangle_{\\boldsymbol{\\xi}}, \\qquad \\frac{dR_{in}}{d\\alpha} = \\eta_w \\langle \\delta_i v_n \\rangle_{\\boldsymbol{\\xi}}, \\qquad \\text{and} \\qquad \\frac{d\\theta_i}{d\\alpha} = -\\eta_\\theta \\langle \\delta_i \\rangle_{\\boldsymbol{\\xi}}.$$ 
\n\n(2)\n\nChoosing $g(x) = \\mathrm{erf}(x/\\sqrt{2})$ as the sigmoidal transfer function, most integrations in Eqs. (2) can be performed analytically, except for single Gaussian integrals which remain for the $\\eta_w^2$ terms and the generalization error. The exact form of the resulting dynamical equations is quite complicated and will be presented elsewhere. Here we only remark that the gain $\\gamma$ of the linear output unit, which determines the output scale, merely rescales the learning rates with $\\gamma^2$ and can therefore be set to one without loss of generality. Due to the numerical integrations required, the differential equations can only be solved accurately in moderate times for smaller student networks (K ~ 5) but any teacher size $M$.\n\n3 ANALYSIS OF THE DYNAMICAL EQUATIONS\n\nThe dynamical evolution of the overlaps $Q_{ij}$, $R_{in}$ and the biases $\\theta_i$ follows from integrating the equations of motion (2) from initial conditions determined by the (random) initialization of the student weights $\\mathbf{W}_i$ and biases $\\theta_i$. For random initialization the resulting norms $Q_{ii}$ of the student vectors will be of order $O(1)$, while the overlaps $Q_{ij}$ between different student vectors, and the student-teacher overlaps $R_{in}$, will be only of order $O(1/\\sqrt{N})$. A random initialization of the weights and biases can therefore be simulated by initializing the norms $Q_{ii}$, the biases $\\theta_i$ and the normalized overlaps $\\hat{Q}_{ij} = Q_{ij}/\\sqrt{Q_{ii} Q_{jj}}$ and $\\hat{R}_{in} = R_{in}/\\sqrt{Q_{ii} T_{nn}}$ from uniform distributions in the $[0, 1]$, $[-1, 1]$, and $[-10^{-12}, 10^{-12}]$ intervals respectively.\nWe find that the results of the numerical integration are sensitive to these random initial values, which has not been the case to this extent for fixed biases. 
\nFurthermore, the dynamical behaviour can become very complex even for realizable cases ($K = M$) and networks with three or four hidden units. For the sake of simplicity, we will therefore restrict our presentation to networks with two hidden units ($K = M = 2$) and uncorrelated isotropic teachers, defined by $T_{nm} = \\delta_{nm}$, although larger networks and graded teacher scenarios were investigated extensively as well. We have further limited our scope by investigating a common learning rate ($\\eta_0 = \\eta_\\theta = \\eta_w$) for biases and weights. To study the effect of different weight initialization, we have fixed the initial values of the student-student overlaps $Q_{ij}$ and biases $\\theta_i$, as these can be manipulated freely in any learning scenario. Only the initial student-teacher overlaps $R_{in}$ are randomized as suggested above.\nIn Fig. 1 we compare the evolution of the overlaps, the biases and the generalization error for the soft committee machine with and without adjustable biases learning a similar realizable teacher task. The student denoted by * lacks biases, i.e., $\\theta_i = 0$, and learns to imitate an isotropic teacher with zero biases ($\\rho_n = 0$). The other student features adjustable biases, trained from an isotropic teacher with small biases ($\\rho_{1,2} = \\mp 0.1$). For both scenarios, the learning rate and the initial conditions were judiciously chosen to be $\\eta_0 = 2.0$, $Q_{11} = 0.1$, $Q_{22} = 0.2$, $R_{in} = Q_{12} = U[-10^{-12}, 10^{-12}]$ with $\\theta_1 = 0.0$ and $\\theta_2 = 0.5$ for the student with adjustable biases.\nIn both cases, the student weight vectors (Fig. 
1a) are drawn quickly from their initial values into a suboptimal symmetric phase, characterized by the lack of specialization of the student hidden units on a particular teacher hidden unit, as can be seen from the similar values of $R_{in}$ in Fig. 1b. This symmetry is broken almost immediately in the learning scenario with adjustable biases and the student converges quickly to the optimal solution, characterized by the evolution of the overlap matrices $Q$, $R$ and biases $\\theta_i$ (see Fig. 1c) to their optimal values $T$ and $\\rho_n$ (up to the permutation symmetry due to the arbitrary labeling of the student nodes). Likewise, the generalization error $\\epsilon_g$ decays to zero in Fig. 1d. The student with fixed biases is trapped for most of its training time in the symmetric phase before it eventually converges.\nExtensive simulations for input dimensions $N = 10 \\ldots 500$ confirm that the dynamic variables are self-averaging and show that variances decrease with $1/N$. The mean trajectories are in good agreement with the theoretical predictions even for very small input dimensions ($N = 10$) and are virtually indistinguishable for $N = 500$.\n\n[Figure 1: four panels plotted against $\\alpha$ from 0 to 700: (a) $Q_{ij}$, (b) $R_{in}$, (c) $\\theta_i$, (d) $\\epsilon_g$.]\n\nFigure 1: The dynamical evolution of the student-student overlaps $Q_{ij}$ (a), and the student-teacher overlaps $R_{in}$ (b) as a function of the normalized example number $\\alpha$ is compared for two student-teacher scenarios: one student (denoted by *) has fixed zero biases, the other has adjustable biases. The influence of the symmetry in the initialization of the biases on the dynamics is shown for the student biases $\\theta_i$ (c), and the generalization error $\\epsilon_g$ (d): $\\theta_1 = 0$ is kept for all runs, but the initial value of $\\theta_2$ varies and is given in brackets in the legends. Finite size simulations for input dimensions $N = 10 \\ldots 500$ show that the dynamical variables are self-averaging.\n\n[Figure 2: two panels: (a) $\\theta_i$ against $\\alpha$, (b) normalized convergence time against the initial value of $\\theta_2$.]\n\nFigure 2: (a) The dynamical evolution of the biases $\\theta_i$ for a student imitating an isotropic teacher with zero biases reveals symmetric dynamics for $\\theta_1$ and $\\theta_2$. The student was randomly initialized identically for the different runs, except for a change in the range of the random initialization of the biases ($U[-b, b]$), with the value of $b$ given in the legend. Above a critical value of $b$ the student remains stuck in a suboptimal phase. (b) The normalized convergence time $\\eta_0 \\alpha_c$ is shown as a function of the initialization of $\\theta_2$ for various learning rates $\\eta_0$ (see legend; $\\eta_0^2 = 0$ symbolizes the dynamics neglecting $\\eta_0^2$ terms).\n\nThe length of the symmetric phase for the isotropic teacher scenario is dominated by the learning rate (it is linearly dependent on $1/\\eta_0$ for small learning rates), but also exhibits a logarithmic dependence on the typical differences in the initial student-teacher overlaps $R_{in}$ (Biehl et al., 1996) which are typically of order $O(1/\\sqrt{N})$ and cannot be influenced in real scenarios without a priori knowledge. The initialization of the biases, however, can be controlled by the user and its influence on the learning dynamics is shown in Figs. 1c and 1d for the biases and the generalization error respectively. 
For initially identical biases ($\\theta_1 = \\theta_2 = 0$), the evolution of the order parameters and hence the generalization error is almost indistinguishable from the fixed biases case. A breaking of this symmetry leads to a decrease in the length of the symmetric phase that is linear in $\\log(|\\theta_1 - \\theta_2|)$ until it has all but disappeared. The dynamics are again slowed down for very large initialization of the biases (see Fig. 1d), where the biases have to travel a long way to their optimal values.\nThis suggests that for a given learning rate the biases have a dominant effect in the learning process and strongly break existent symmetries in weight space. This is arguably due to a steep minimum in the generalization error surface along the direction of the biases. To confirm this, we have studied a range of other learning scenarios including larger networks and non-isotropic teachers, e.g., graded teachers with $T_{nm} = n \\delta_{nm}$. Even when the norms of the teacher weight vectors are strongly graded, which also breaks the weight symmetry and reduces the symmetric phase significantly in the case of fixed biases, we have found that the biases usually have the stronger symmetry breaking effect: the trajectories of the biases never cross, provided that they were not initialized too symmetrically.\nThis would seem to promote initializing the biases of the student hidden units evenly across the input domain, which has been suggested previously on a heuristic basis (Nguyen & Widrow, 1990). However, this can lead to the student being stuck in a suboptimal configuration. In Fig. 2a, we show the dynamics of the student biases $\\theta_i$ when the teacher biases are symmetric ($\\rho_n = 0$). We find that the progress of the student is inversely related to the magnitude of the bias initialization and that the student finally fails to converge at all. 
It remains in a suboptimal phase characterized by biases of the same large magnitude but opposite sign and highly correlated weight vectors. In effect, the outputs of the two student nodes cancel out over most of the input domain. In Fig. 2b, the influence of the learning rate in combination with the bias initialization in determining convergence is illustrated. The convergence time $\\alpha_c$, defined as the example number at which the generalization error has decayed to a small value, here judiciously chosen to be $10^{-8}$, is shown as a function of the initial value of $\\theta_2$ for various learning rates $\\eta_0$. For convenience, we have normalized the convergence time with $1/\\eta_0$. The initialization of the other order parameters is identical to Fig. 1a. One finds that the convergence time diverges for all learning rates above a critical initial value of $\\theta_2$. For increasing learning rates, this transition becomes sharper and occurs at smaller $\\theta_2$, i.e., the dynamics become more sensitive to the bias initialization.\n\n4 SUMMARY AND DISCUSSION\n\nThis research has been motivated by recent progress in the theoretical study of on-line learning in realistic two-layer neural network models, namely the soft committee machine trained with back-propagation (Saad & Solla, 1995). The studies so far have excluded biases to the hidden layers, a constraint which has been removed in this paper, which makes the model a universal approximator. The dynamics of the extended model turn out to be very rich and more complex than those of the original model.\nIn this paper, we have concentrated on the effect of initialization of student weights and biases. 
We have further restricted our presentation for simplicity to realizable cases and small networks with two hidden units, although larger networks were studied for comparison. Even in these simple learning scenarios, we find surprising dynamical effects due to the adjustable biases. In the case where the teacher network exhibits distinct biases, asymmetric initial values of the student biases break the node symmetry in weight space effectively and can speed up the learning process considerably, suggesting that student biases should in practice be initially spread evenly across the input domain if there is no a priori knowledge of the function to be learned. For degenerate teacher biases, however, such a scheme can be counterproductive, as different initial student bias values slow down the learning dynamics and can even lead to the student being stuck in suboptimal fixed points, characterized by student biases being grouped symmetrically around the degenerate teacher biases and strong correlations between the associated weight vectors.\nIn fact, these attractive suboptimal fixed points exist even for non-degenerate teacher biases, but the range of initial conditions attracted to these suboptimal network configurations decreases in size. Furthermore, this domain is shifted to very large initial student biases as the difference in the values of the teacher biases is increased. We have found these effects also for larger network sizes, where the complexity of the dynamics and the number of attractive suboptimal fixed points with different internal symmetries increase. Although attractive suboptimal fixed points were also found in the original model (Biehl et al., 1996), their basins of attraction are in general very small and are therefore only of academic interest. 
\nHowever, our numerical work suggests that a simple rule of thumb to avoid being attracted to suboptimal fixed points is to always initialize the squared norm of a weight vector larger than the magnitude of the corresponding bias. This scheme will still support spreading of the biases across the main input domain in order to encourage node symmetry breaking. This is somewhat similar to previous findings (Nguyen & Widrow, 1990; Kim & Ra, 1991), the former suggesting spreading the biases across the input domain, the latter relating the minimal initial size of each weight with the learning rate. This work provides a more theoretical motivation for these results and also distinguishes between the different roles of biases and weights.\nIn this paper we have addressed mainly one important issue for theoreticians and practitioners alike: the initialization of the student network weights and biases. Other important issues, notably the question of optimal and maximal learning rates for different network sizes during convergence, will be reported elsewhere.\n\nA THEOREM\n\nLet $S_g$ denote the class of neural networks defined by sums of the form $\\sum_{i=1}^{K} n_i g(u_i - \\theta_i)$ where $K$ is arbitrary (representing an arbitrary number of hidden units), $\\theta_i \\in \\mathbb{R}$ and $n_i \\in \\mathbb{Z}$ (i.e. integer weights). Let $\\psi(x) \\equiv \\partial g(x)/\\partial x$ and let $\\Phi_\\psi$ denote the class of networks defined by sums of the form $\\sum_{i=1}^{K} w_i \\psi(u_i - \\theta_i)$ where $w_i \\in \\mathbb{R}$. If $g$ is continuously differentiable and if the class $\\Phi_\\psi$ are universal approximators, then $S_g$ is a class of universal approximators; that is, such functions are dense in the space of continuous functions with the $L_\\infty$ norm. 
\nAs a corollary, the normalized soft committee machine forms a class of universal approximators with both sigmoid and error transfer functions [since radial basis function networks are universal (Park & Sandberg, 1993) and we need consider only the one-dimensional input case as noted in the proof below]. Note that some restriction on $g$ is necessary: if $g$ is the step function, then with arbitrary hidden-output weights, the network is a universal approximator, while with fixed hidden-output weights it is not.\n\nA.1 Proof\n\nBy the arguments of (Hornik et al., 1990) which use the properties of trigonometric polynomials, it is sufficient to consider the case of one-dimensional input and output spaces. Let $I$ denote a compact interval in $\\mathbb{R}$ and let $f$ be a continuous function defined on $I$. Because $\\Phi_\\psi$ is universal, given any $\\epsilon > 0$ we can find weights $w_i$ and biases $\\theta_i$ such that\n\n$$\\left\\lVert f - \\sum_{i=1}^{K} w_i \\psi(u - \\theta_i) \\right\\rVert_\\infty < \\frac{\\epsilon}{2} \\qquad \\text{(i)}$$\n\nBecause the rationals are dense in the reals, without loss of generality we can assume that the weights $w_i \\in \\mathbb{Q}$. Since $\\psi(x)$ is continuous and $I$ is compact, the convergence of $[g(x + h) - g(x)]/h$ to $\\partial g(x)/\\partial x$ is uniform and hence for all $n > n(\\epsilon/(2K|w_i|))$ the following inequality holds:\n\n$$\\left\\lVert n \\left[ g(x + \\tfrac{1}{n}) - g(x) \\right] - \\psi(x) \\right\\rVert_\\infty < \\frac{\\epsilon}{2K|w_i|} \\qquad \\text{(ii)}$$\n\nAlso note that for suitable $n_i > n(\\epsilon/(2K|w_i|))$, $m_i = n_i w_i \\in \\mathbb{Z}$, as $w_i$ is a rational number. Thus, by the triangle inequality,\n\n$$\\left\\lVert \\sum_{i=1}^{K} m_i \\left[ g(u + \\tfrac{1}{n_i} - \\theta_i) - g(u - \\theta_i) \\right] - \\sum_{i=1}^{K} w_i \\psi(u - \\theta_i) \\right\\rVert_\\infty < \\frac{\\epsilon}{2} \\qquad \\text{(iii)}$$\n\nThe result now follows from equations (i) and (iii) and the triangle inequality.\n\nReferences\nBarber, D., Saad, D., & Sollich, P. 1996. Europhys. Lett., 34, 151-156.\nBiehl, M., & Schwarze, H. 1995. J. Phys. A, 28, 643-656.\nBiehl, M., Riegler, P., & W\u00f6hler, C. 1996. University of W\u00fcrzburg Preprint WUE-ITP-96-003. 
\nHornik, K., Stinchcombe, M., & White, H. 1990. Neural Networks, 3, 551-560.\nKim, Y. K., & Ra, J. B. 1991. Pages 2396-2401 of: International Joint Conference on Neural Networks 91.\nNguyen, D., & Widrow, B. 1990. Pages C21-C26 of: IJCNN International Conference on Neural Networks 90.\nPark, J., & Sandberg, I. W. 1993. Neural Computation, 5, 305-316.\nRiegler, P., & Biehl, M. 1995. J. Phys. A, 28, L507-L513.\nSaad, D., & Solla, S. A. 1995. Phys. Rev. E, 52, 4225-4243.\n", "award": [], "sourceid": 1256, "authors": [{"given_name": "Ansgar", "family_name": "West", "institution": null}, {"given_name": "David", "family_name": "Saad", "institution": null}, {"given_name": "Ian", "family_name": "Nabney", "institution": null}]}