{"title": "Size of Multilayer Networks for Exact Learning: Analytic Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 162, "page_last": 168, "abstract": null, "full_text": "Size of multilayer networks for exact learning: analytic approach\n\nAndr\u00e9 Elisseeff\nD\u00e9pt Math\u00e9matiques et Informatique\nEcole Normale Sup\u00e9rieure de Lyon\n46 all\u00e9e d'Italie\nF69364 Lyon cedex 07, FRANCE\n\nH\u00e9l\u00e8ne Paugam-Moisy\nLIP, URA 1398 CNRS\nEcole Normale Sup\u00e9rieure de Lyon\n46 all\u00e9e d'Italie\nF69364 Lyon cedex 07, FRANCE\n\nAbstract\n\nThis article presents a new result about the size of a multilayer neural network computing real outputs for exact learning of a finite set of real samples. The architecture of the network is feedforward, with one hidden layer and several outputs. Starting from a fixed training set, we consider the network as a function of its weights. We derive, for a wide family of transfer functions, a lower and an upper bound on the number of hidden units for exact learning, given the size of the dataset and the dimensions of the input and output spaces.\n\n1  RELATED WORKS\n\nThe context of our work is rather similar to the well-known results of Baum et al. [1, 2, 3, 5, 10], but we consider both real inputs and outputs, instead of the dichotomies usually addressed. We are interested in learning exactly all the examples of a fixed database, hence our work is different from stating that multilayer networks are universal approximators [6, 8, 9]. Since we consider real outputs and not only dichotomies, it is not straightforward to compare our results to the recent works about the VC-dimension of multilayer networks [11, 12, 13]. 
Our study is more closely related to several works of Sontag [14, 15], but with different hypotheses on the transfer functions of the units. Finally, our approach is based on geometrical considerations and is close to the model of Coetzee and Stonick [4].\n\nFirst we define the model of network and the notations and second we develop our analytic approach and prove the fundamental theorem. In the last section, we discuss our point of view and propose some practical consequences of the result.\n\n2  THE NETWORK AS A FUNCTION OF ITS WEIGHTS\n\nGeneral concepts on neural networks are presented in matrix and vector notations, in a geometrical perspective. All vectors are written in bold and considered as column vectors, whereas matrices are denoted with upper-case script.\n\n2.1  THE NETWORK ARCHITECTURE AND NOTATIONS\n\nConsider a multilayer network with $N_I$ input units, $N_H$ hidden units and $N_S$ output units. The inputs and outputs are real-valued. The hidden units compute a non-linear function $f$ which will be specified later on. The output units are assumed to be linear. A learning set of $N_p$ examples is given and fixed. For all $p \\in \\{1..N_p\\}$, the $p$th example is defined by its input vector $\\mathbf{d}_p \\in \\mathbb{R}^{N_I}$ and the corresponding desired output vector $\\mathbf{t}_p \\in \\mathbb{R}^{N_S}$. The learning set can be represented as an input matrix, with both row and column notations, as follows\n\n$$\\mathcal{D} = [\\mathbf{d}_1, \\ldots, \\mathbf{d}_{N_p}]^T = [\\boldsymbol{\\delta}_1, \\ldots, \\boldsymbol{\\delta}_{N_I}]$$\n\nSimilarly, the target matrix is $\\mathcal{T} = [\\mathbf{t}_1, \\ldots, \\mathbf{t}_{N_p}]^T$, with independent row vectors.\n\n2.2  THE NETWORK AS A FUNCTION $g$ OF ITS WEIGHTS\n\nFor all $h \\in \\{1..N_H\\}$, $\\mathbf{w}_h^1 = (w_{h1}^1, \\ldots, w_{hN_I}^1)^T \\in \\mathbb{R}^{N_I}$ is the vector of the weights between all the input units and the $h$th hidden unit. The input weight matrix $W^1$ is defined as $W^1 = [\\mathbf{w}_1^1, \\ldots, \\mathbf{w}_{N_H}^1]$. 
Similarly, a vector $\\mathbf{w}_s^2 = (w_{s1}^2, \\ldots, w_{sN_H}^2)^T \\in \\mathbb{R}^{N_H}$ represents the weights between all the hidden units and the $s$th output unit, for all $s \\in \\{1..N_S\\}$. Thus the output weight matrix $W^2$ is defined as $W^2 = [\\mathbf{w}_1^2, \\ldots, \\mathbf{w}_{N_S}^2]$. For an input matrix $\\mathcal{D}$, the network computes an output matrix\n\n$$\\mathcal{Z}(\\mathcal{D}) = [\\mathbf{z}(\\mathbf{d}_1), \\ldots, \\mathbf{z}(\\mathbf{d}_{N_p})]^T$$\n\nwhere each output vector $\\mathbf{z}(\\mathbf{d}_p)$ must be equal to the target $\\mathbf{t}_p$ for exact learning. The network computation can be detailed as follows, for all $s \\in \\{1..N_S\\}$\n\n$$z_s(\\mathbf{d}_p) = \\sum_{h=1}^{N_H} w_{sh}^2 \\, f\\Big(\\sum_{i=1}^{N_I} d_{pi} \\, w_{hi}^1\\Big) = \\sum_{h=1}^{N_H} w_{sh}^2 \\, f(\\mathbf{d}_p^T \\mathbf{w}_h^1)$$\n\nHence, for the whole learning set, the $s$th output component is\n\n$$\\begin{bmatrix} z_s(\\mathbf{d}_1) \\\\ \\vdots \\\\ z_s(\\mathbf{d}_{N_p}) \\end{bmatrix} = \\sum_{h=1}^{N_H} w_{sh}^2 \\begin{bmatrix} f(\\mathbf{d}_1^T \\mathbf{w}_h^1) \\\\ \\vdots \\\\ f(\\mathbf{d}_{N_p}^T \\mathbf{w}_h^1) \\end{bmatrix} = \\sum_{h=1}^{N_H} w_{sh}^2 \\, F(\\mathcal{D} \\mathbf{w}_h^1) \\qquad (1)$$\n\nIn equation (1), $F$ is a vector operator which transforms an $n$-vector $\\mathbf{v}$ into an $n$-vector $F(\\mathbf{v})$ according to the relation $[F(\\mathbf{v})]_i = f([\\mathbf{v}]_i)$, $i \\in \\{1..n\\}$. The same notation $F$ will be used for the matrix operator. Finally, the expression of the output matrix can be deduced from equation (1) as follows\n\n$$\\mathcal{Z}(\\mathcal{D}) = [F(\\mathcal{D} \\mathbf{w}_1^1), \\ldots, F(\\mathcal{D} \\mathbf{w}_{N_H}^1)] \\, [\\mathbf{w}_1^2, \\ldots, \\mathbf{w}_{N_S}^2] = F(\\mathcal{D} W^1) \\, W^2 \\qquad (2)$$\n\nFrom equation (2), the network output matrix appears as a simple function of the input matrix and the network weights. Unlike Coetzee and Stonick, we will consider that the input matrix $\\mathcal{D}$ is not a variable of the problem. Thus we express the network output matrix $\\mathcal{Z}(\\mathcal{D})$ as a function of its weights. Let $g$ be this function\n\n$$g : \\mathbb{R}^{N_I \\times N_H + N_H \\times N_S} \\to \\mathbb{R}^{N_p \\times N_S}, \\qquad W = (W^1, W^2) \\mapsto F(\\mathcal{D} W^1) \\, W^2$$\n\nThe $g$ function clearly depends on the input matrix and could have been denoted by $g_{\\mathcal{D}}$ but this index will be dropped for clarity. 
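As a minimal sketch of equation (2), the map g can be written in a few lines of plain Python. The helper names matmul and network_output, the toy sizes and the choice of math.tanh as transfer function are illustrative assumptions, not part of the original construction.

```python
import math

def matmul(a, b):
    # Plain-Python product of two matrices stored as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def network_output(d, w1, w2, f=math.tanh):
    # g(W1, W2) = F(D W1) W2: F applies f to every entry, the output layer is linear.
    hidden = [[f(v) for v in row] for row in matmul(d, w1)]
    return matmul(hidden, w2)

# Hypothetical toy sizes: Np = 3 examples, NI = 2 inputs, NH = 2 hidden units, NS = 1 output.
d = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # Np x NI, one example per row
w1 = [[0.5, -1.0], [2.0, 0.3]]             # NI x NH, column h holds the weights of hidden unit h
w2 = [[1.0], [-2.0]]                       # NH x NS, column s holds the weights of output unit s
z = network_output(d, w1, w2)              # Np x NS output matrix
```

Exact learning of a target matrix would mean that z equals that target entry by entry; the sketch only evaluates the map and does not search for such weights.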
\n\n3  FUNDAMENTAL RESULT\n\n3.1  PROPERTY OF FUNCTION $g$\n\nLearning is said to be exact on $\\mathcal{D}$ if and only if there exists a network such that its output matrix $\\mathcal{Z}(\\mathcal{D})$ is equal to the target matrix $\\mathcal{T}$. If $g$ is a diffeomorphic function from $\\mathbb{R}^{N_I N_H + N_H N_S}$ onto $\\mathbb{R}^{N_p N_S}$ then the network can learn any target in $\\mathbb{R}^{N_p N_S}$ exactly. We prove that it is sufficient for the network function $g$ to be a local diffeomorphism. Suppose there exist a set of weights $X$, an open subset $\\mathcal{U} \\subset \\mathbb{R}^{N_I N_H + N_H N_S}$ including $X$ and an open subset $\\mathcal{V} \\subset \\mathbb{R}^{N_p N_S}$ including $g(X)$ such that $g$ is diffeomorphic from $\\mathcal{U}$ to $\\mathcal{V}$. Since $\\mathcal{V}$ is an open neighborhood of $g(X)$, there exist a real $\\lambda$ and a point $\\mathbf{y}$ in $\\mathcal{V}$ such that $\\mathcal{T} = \\lambda(\\mathbf{y} - g(X))$: it suffices to take $\\lambda$ large enough for $g(X) + \\mathcal{T}/\\lambda$ to lie in $\\mathcal{V}$. Since $g$ is diffeomorphic from $\\mathcal{U}$ to $\\mathcal{V}$, there exists a set of weights $Y$ in $\\mathcal{U}$ such that $\\mathbf{y} = g(Y)$, hence $\\mathcal{T} = \\lambda(g(Y) - g(X))$. The output units of the network compute a linear transfer function, hence the linear combination of $g(X)$ and $g(Y)$ can be integrated in the output weights and a network with twice $N_I N_H + N_H N_S$ weights can learn $(\\mathcal{D}, \\mathcal{T})$ exactly (see Figure 1).\n\nFigure 1: A network for exact learning of a target $\\mathcal{T}$ (unique output for clarity)\n\nFor $g$ to be a local diffeomorphism, it is sufficient to find a set of weights $X$ such that the Jacobian of $g$ at $X$ is non-zero and to apply the theorem of local inversion. This analysis is developed in the next sections and requires some assumptions on the transfer function $f$ of the hidden units. A function which verifies such a hypothesis $\\mathcal{H}$ will be called an $\\mathcal{H}$-function and is defined below.\n\n3.2  DEFINITION AND THEOREM\n\nDefinition 1  Consider a function $f : \\mathbb{R} \\to \\mathbb{R}$ which is $C^1(\\mathbb{R})$ (i.e. 
with continuous derivative) and which has finite limits in $-\\infty$ and $+\\infty$. Such a function is called an $\\mathcal{H}$-function iff it verifies the following property\n\n$$(\\mathcal{H}) \\qquad \\forall a \\in \\mathbb{R}, \\ |a| > 1 : \\quad \\lim_{x \\to \\pm\\infty} \\left| \\frac{f'(ax)}{f'(x)} \\right| = 0$$\n\nFrom this hypothesis on the transfer function of all the hidden units, the fundamental result can be stated as follows\n\nTheorem 1  Exact learning of a set of $N_p$ examples, in general position, from $\\mathbb{R}^{N_I}$ to $\\mathbb{R}^{N_S}$, can be realized by a network with linear output units and a transfer function which is an $\\mathcal{H}$-function, if the size $N_H$ of its hidden layer verifies the following bounds\n\nLower Bound:  $N_H = \\left\\lceil \\frac{N_p N_S}{N_I + N_S} \\right\\rceil$ hidden units are necessary\n\nUpper Bound:  $N_H = 2 \\left\\lceil \\frac{N_p N_S}{N_I + N_S} \\right\\rceil$ hidden units are sufficient\n\nThe proof of the lower bound is straightforward, since a condition for $g$ to be diffeomorphic from $\\mathbb{R}^{N_I N_H + N_H N_S}$ onto $\\mathbb{R}^{N_p N_S}$ is the equality of its input and output space dimensions $N_I N_H + N_H N_S = N_p N_S$. Solving for $N_H$ gives $N_H = N_p N_S / (N_I + N_S)$, rounded up to an integer.\n\n3.3  SKETCH OF THE PROOF FOR THE UPPER BOUND\n\nThe $g$ function is an expression of the network as a function of its weights, for a given input matrix: $g(W^1, W^2) = F(\\mathcal{D} W^1) \\, W^2$ and $g$ can be decomposed according to its vectorial components on the learning set (which are themselves vectors of size $N_S$). For all $p \\in \\{1..N_p\\}$\n\n$$g_p(W^1, W^2) = \\Big[ \\sum_{h=1}^{N_H} w_{1h}^2 \\, f(\\mathbf{d}_p^T \\mathbf{w}_h^1), \\ \\ldots, \\ \\sum_{h=1}^{N_H} w_{N_S h}^2 \\, f(\\mathbf{d}_p^T \\mathbf{w}_h^1) \\Big]^T$$\n\nThe derivatives of $g$ w.r.t. the input weight matrix $W^1$ are, for all $i \\in \\{1..N_I\\}$, for all $h \\in \\{1..N_H\\}$\n\n$$\\frac{\\partial g_p}{\\partial w_{hi}^1} = \\big[ w_{1h}^2 \\, f'(\\mathbf{d}_p^T \\mathbf{w}_h^1) \\, d_{pi}, \\ \\ldots, \\ w_{N_S h}^2 \\, f'(\\mathbf{d}_p^T \\mathbf{w}_h^1) \\, d_{pi} \\big]^T$$\n\nFor the output weight matrix $W^2$, the derivatives of $g$ are, for all $h \\in \\{1..N_H\\}$, for all $s \\in \\{1..N_S\\}$\n\n$$\\frac{\\partial g_p}{\\partial w_{sh}^2} = \\big[ \\underbrace{0, \\ldots, 0}_{s-1}, \\ f(\\mathbf{d}_p^T \\mathbf{w}_h^1), \\ \\underbrace{0, \\ldots, 0}_{N_S - s} \\big]^T$$\n\nThe Jacobian matrix $M_J(g)$ of $g$, the size of which is $N_I N_H + N_H N_S$ columns and $N_S N_p$ rows, is thus composed of a block-diagonal part (derivatives w.r.t. 
$W^2$) and several other blocks (derivatives w.r.t. $W^1$). Hence the Jacobian $J(g)$ can be rewritten $J(g) = | J_1, J_2, \\ldots, J_{N_H} |$, after permutations of rows and columns, and using the Hadamard and Kronecker product notations, each $J_h$ being equal to\n\n$$J_h = \\Big[ F(\\mathcal{D} \\mathbf{w}_h^1) \\otimes I_{N_S}, \\ \\big[ F'(\\mathcal{D} \\mathbf{w}_h^1) \\odot \\boldsymbol{\\delta}_1, \\ \\ldots, \\ F'(\\mathcal{D} \\mathbf{w}_h^1) \\odot \\boldsymbol{\\delta}_{N_I} \\big] \\otimes [w_{1h}^2, \\ldots, w_{N_S h}^2]^T \\Big] \\qquad (3)$$\n\nwhere $I_{N_S}$ is the identity matrix in dimension $N_S$ and $\\boldsymbol{\\delta}_i$ denotes the $i$th column of $\\mathcal{D}$.\n\nOur purpose is to prove that there exists a point $X = (W^1, W^2)$ such that the Jacobian $J(g)$ is non-zero at $X$, i.e. such that the column vectors of the Jacobian matrix $M_J(g)$ are linearly independent at $X$. The proof can be divided in two steps. First we address the case of a single output unit. Afterwards, this proof can be used to extend the result to several output units. Since the complete development of both proofs requires a lot of calculations, we only present their sketches below. More details can be found in [7].\n\n3.3.1  Case of a single output unit\n\nThe proof is based on a linear arrangement of the projections of the column vectors of $J_h$ onto a subspace. This subspace is orthogonal to all the $J_i$ for $i < h$. We build a vector $\\mathbf{w}_h^1$ and a scalar $w_{1h}^2$ such that the projected column vectors are an independent family, hence they are independent with the $J_i$ for $i < h$. Such a construction is recursively applied until $h = N_H$. We then derive vectors $\\mathbf{w}_1^1, \\ldots, \\mathbf{w}_{N_H}^1$ and $\\mathbf{w}_1^2$ such that $J(g)$ is non-zero. The assumption on $\\mathcal{H}$-functions is essential for proving that the projected column vectors of $J_h$ are independent.\n\n3.3.2  Case of multiple output units\n\nIn order to extend the result from a single output to $N_S$ output units, the usual idea consists in considering as many subnetworks as the number of output units. 
From this point of view, the bound on the hidden units would be $N_H = 2 \\, \\frac{N_p N_S}{N_I + 1}$ which differs from the result stated in Theorem 1. A new direct proof can be developed (see [7]) and gives a better bound: the denominator is increased to $N_I + N_S$.\n\n4  DISCUSSION\n\nThe definition of an $\\mathcal{H}$-function includes both sigmoids and gaussian functions which are commonly used for multilayer perceptrons and RBF networks, but is not valid for threshold functions. Figure 2 shows the difference between a sigmoid, which is an $\\mathcal{H}$-function, and a saturation which is not an $\\mathcal{H}$-function. Figures 2(a) and 2(b) represent the span of the output space by the network when the weights are varying, i.e. the image of $g$. For clarity, the network is reduced to 1 hidden unit, 1 input unit, 1 output unit and 2 input patterns. For an $\\mathcal{H}$-function, a ball can be extracted from the output space $\\mathbb{R}^2$, onto which the $g$ function is a diffeomorphism. For the saturation, the image of $g$ is reduced to two lines, hence $g$ cannot be onto a ball of $\\mathbb{R}^2$. The assumption on the activation function is thus necessary to prove that the Jacobian is non-zero.\n\nOur bound on the number of hidden units is very similar to Baum's results for dichotomies and functions from real inputs to binary outputs [1]. Hence the present result can be seen as an extension of Baum's results to the case of real outputs, and for a wide family of transfer functions, different from the threshold functions addressed by Baum and Haussler in [2]. An early result on sigmoid networks has been stated by Sontag [14]: for a single output and at least two input units, the number of examples must be twice the number of hidden units. 
Our upper bound on the number of hidden units is strictly lower than that (as soon as the number of input units is more than two). A counterpart of considering real data is that our results bear little relation to the VC-dimension point of view.\n\nFigure 2: Positions of output vectors, for given data, when varying network weights. (a): A saturation function. (b): A sigmoid function.\n\n5  CONCLUSION\n\nIn this paper, we show that a number of hidden units $N_H = 2 \\lceil N_p N_S / (N_I + N_S) \\rceil$ is sufficient for a network of $\\mathcal{H}$-functions to exactly learn a given set of $N_p$ examples in general position. We now discuss some of the practical consequences of this result. According to this formula, the size of the hidden layer required for exact learning may grow very high if the size of the learning set is large. However, without a priori knowledge on the degree of redundancy in the learning set, exact learning is not the right goal in practical cases. Exact learning usually implies overfitting, especially if the examples are very noisy. Nevertheless, a sound approach could be to first reduce the dimension and the size of the learning set by feature extraction or data analysis as pre-processing. Afterwards, our theoretical result could be a valuable indication for scaling a network to perform exact learning on this representative learning set, with a good compromise between bias and variance. 
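The bounds of Theorem 1 are easy to evaluate numerically. The sketch below, in which the function name hidden_unit_bounds and the example sizes are hypothetical, computes the necessary and the sufficient number of hidden units with an integer ceiling.

```python
def hidden_unit_bounds(n_p, n_i, n_s):
    # Ceiling of Np * Ns / (NI + Ns), computed with integer arithmetic only.
    lower = -(-(n_p * n_s) // (n_i + n_s))
    # The sufficient size is twice the necessary one.
    return lower, 2 * lower

# Hypothetical sizes: 100 examples, 8 input units, 2 output units.
lower, upper = hidden_unit_bounds(n_p=100, n_i=8, n_s=2)
```

With these sizes the sketch gives 20 necessary and 40 sufficient hidden units.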
\n\nOur bound is more optimistic than the rule-of-thumb $N_p = 10w$ derived from the theory of PAC-learning. In our architecture, the number of weights is $w = 2 N_p N_S$. However the proof is not constructive enough to be derived as a learning algorithm, especially the existence of $g(Y)$ in the neighborhood of $g(X)$ where $g$ is a local diffeomorphism (cf. Figure 1). From this construction we can only conclude that $N_H = \\lceil N_p N_S / (N_I + N_S) \\rceil$ is necessary and $N_H = 2 \\lceil N_p N_S / (N_I + N_S) \\rceil$ is sufficient to realize exact learning of $N_p$ examples, from $\\mathbb{R}^{N_I}$ to $\\mathbb{R}^{N_S}$.\n\nThe opportunity of using multilayer networks as auto-associative networks and for data compression can be discussed in the light of this result. Assume that $N_S = N_I$ and the expression of the number of hidden units is reduced to $N_H = N_p$ or at least $N_H = N_p / 2$. Since $N_p \\geq N_I + N_S$, the number of hidden units must verify $N_H \\geq N_I$. Therefore, an architecture of \"diabolo\" network seems to be precluded for exact learning of auto-associations. A consequence may be that exact retrieval from data compression is hopeless by using internal representations of a hidden layer smaller than the data dimension.\n\nAcknowledgements\n\nThis work was supported by European Esprit III Project n\u00b0 8556, NeuroCOLT Working Group. We thank C.S. Poon and J.V. Shah for fruitful discussions.\n\nReferences\n\n[1] E. B. Baum. On the capabilities of multilayer perceptrons. J. of Complexity, 4:193-215, 1988.\n\n[2] E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1:151-160, 1989.\n\n[3] E. K. Blum and L. K. Li. Approximation theory and feedforward networks. Neural Networks, 4(4):511-516, 1991.\n\n[4] F. M. Coetzee and V. L. Stonick. 
Topology and geometry of single hidden layer network, least squares weight solutions. Neural Computation, 7:672-705, 1995.\n\n[5] M. Cosnard, P. Koiran, and H. Paugam-Moisy. Bounds on the number of units for computing arbitrary dichotomies by multilayer perceptrons. J. of Complexity, 10:57-63, 1994.\n\n[6] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control, Signal Systems, 2:303-314, October 1988.\n\n[7] A. Elisseeff and H. Paugam-Moisy. Size of multilayer networks for exact learning: analytic approach. Rapport de recherche 96-16, LIP, July 1996.\n\n[8] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183-192, 1989.\n\n[9] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989.\n\n[10] S.-C. Huang and Y.-F. Huang. Bounds on the number of hidden neurones in multilayer perceptrons. IEEE Trans. Neural Networks, 2:47-55, 1991.\n\n[11] M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of sigmoidal neural networks. In 27th ACM Symposium on Theory of Computing, pages 200-208, 1995.\n\n[12] P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. In Neural Information Processing Systems (NIPS*95), 1995. To appear.\n\n[13] W. Maass. Bounds for the computational power and learning complexity of analog neural networks. In 25th ACM Symposium on Theory of Computing, pages 335-344, 1993.\n\n[14] E. D. Sontag. Feedforward nets for interpolation and classification. J. Comp. Syst. Sci., 45:20-48, 1992.\n\n[15] E. D. Sontag. Shattering all sets of k points in \"general position\" requires (k-1)/2 parameters. 
Technical Report 96-01, Rutgers Center for Systems and Control (SYCON), February 1996.", "award": [], "sourceid": 1303, "authors": [{"given_name": "Andr\u00e9", "family_name": "Elisseeff", "institution": null}, {"given_name": "H\u00e9l\u00e8ne", "family_name": "Paugam-Moisy", "institution": null}]}