{"title": "On the Optimality of Incremental Neural Network Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 301, "abstract": null, "full_text": "On the optimality of incremental neural network \n\nalgorithms \n\nRon Meir* \n\nDepartment of Electrical Engineering \n\nTechnion, Haifa 32000, Israel \n\nVitaly Maiorovt \n\nDepartment of Mathematics \nTechnion, Haifa 32000, Israel \n\nrmeir@dumbo.technion.ac.il \n\nmaiorov@tx.technion.ac.il \n\nAbstract \n\nWe  study the approximation of functions by  two-layer feedforward neu(cid:173)\nral  networks,  focusing  on  incremental  algorithms  which  greedily  add \nunits,  estimating  single  unit  parameters  at  each  stage.  As  opposed  to \nstandard algorithms for fixed architectures, the optimization at each stage \nis  performed over a small  number of parameters, mitigating many of the \ndifficult numerical problems inherent in  high-dimensional non-linear op(cid:173)\ntimization.  We  establish  upper bounds on  the  error incurred by  the  al(cid:173)\ngorithm, when approximating functions from the  Sobolev class, thereby \nextending previous results  which only provided rates of convergence for \nfunctions in certain convex hulls of functional spaces. By comparing our \nresults to recently derived lower bounds,  we  show  that the greedy algo(cid:173)\nrithms  are  nearly  optimal.  Combined  with  estimation  error results  for \ngreedy algorithms, a strong case can be made for this type of approach. \n\n1  Introduction and background \n\nA  major problem  in  the  application  of neural  networks  to  real  world  problems is  the ex(cid:173)\ncessively long time required for training large networks of a fixed  architecture. Moreover, \ntheoretical results establish the intractability of such training in  the worst case  [9][4].  Ad(cid:173)\nditionally,  the problem of determining the  architecture and size of the network required to \nsolve  a certain  task  is  left open.  Due  to  these  problems,  several  authors  have  considered \nincremental algorithms  for  constructing the  network by  the  addition  of hidden units, and \nestimation of each unit's parameters incrementally.  These  approaches  possess  two desir(cid:173)\nable  attributes:  first,  the  optimization  is  done  step-wise,  so  that  only  a  small  number of \nparameters  need  to  be  optimized at  each  stage;  and  second,  the  structure  of the  network \n\n-This work  was supported in part by the a grant from the  Israel  Science Foundation \ntThe author was  partially supported by  the center for Absorption  in  Science, Ministry of Immi(cid:173)\n\ngrant Absorption,  State of Israel. \n\n\f296 \n\nR.  Meir and V Maiorov \n\nis  established concomitantly with the learning, rather than specifying it in advance.  How(cid:173)\never,  until  recently these algorithms have been rather heuristic in  nature, as  no  guaranteed \nperformance bounds had been established.  Note that  while there  has  been  a recent surge \nof interest  in  these  types  of algorithms,  they  in  fact  date back  to  work done  in  the  early \nseventies (see [3]  for a historical survey). \n\nThe first theoretical result establishing performance bounds for incremental approximations \nin  Hilbert space, was given by Jones  [8].  This work  was  later extended by Barron [2],  and \napplied to  neural network approximation of functions characterized by  certain  conditions \non  their Fourier coefficients.  The  work  of Barron  has  been  extended  in  two  main direc(cid:173)\ntions.  First, Lee et at.  [10]  have considered approximating general functions using Hilbert \nspace  techniques,  while  Donahue et al.  [7]  have  provided powerful  extensions of Jones' \nand  Barron's results  to general  Banach spaces.  One of the  most impressive results of the \nlatter work is the demonstration that iterative algorithms can, in many cases, achieve nearly \noptimal rates of convergence, when approximating convex hulls. \n\nWhile this paper is concerned mainly with issues of approximation, we  comment that it is \nhighly relevant to  the statistical  problem  of learning  from  data in  neural  networks.  First, \nLee et at.  [10]  give estimation error bounds  for algorithms performing incremental  opti(cid:173)\nmization  with  respect to  the  training  error.  Under certain  regularity  conditions,  they  are \nable to achieve rates of convergence comparable to those obtained by the much more com(cid:173)\nputationally  demanding algorithm  of empirical  error minimization.  Moreover,  it  is  well \nknown  that  upper bounds  on  the  approximation  error are  needed  in  order to  obtain  per(cid:173)\nformance  bounds,  both  for  parametric  and  nonparametric estimation,  where  the  latter  is \nachieved using  the  method of complexity regularization.  Finally,  as  pointed out by  Don(cid:173)\nahue et al.  [7],  lower bounds on  the approximation error are crucial  in  establishing  worst \ncase speed limitations for learning. \n\nThe main contribution of this paper is  as follows.  For functions belonging to  the Sobolev \nclass (see definition below), we establish, under appropriate conditions, near-optimal rates \nof convergence for the incremental approach, and obtain explicit bounds on the parameter \nvalues of the network.  The latter bounds are often crucial for establishing estimation error \nrates.  In  contrast  to  the  work  in  [10]  and  [7],  we  characterize  approximation  rates  for \nfunctions belonging to standard smoothness classes, such as  the Sobolev class.  The former \nwork establishes rates  of convergence with  respect to  the convex  hulls  of certain  subsets \nof functions,  which do not relate in  a any  simple way to  standard functional classes (such \nas  Lipschitz, Sobolev,  Holder, etc.).  As  far  as  we  are  aware,  the results reported here are \nthe  first  to  report on  such bounds for  incremental  neural  network procedures.  A detailed \nversion of this  work, complete with the detailed proofs, is available in  [13]. \n\n2  Problem statement \n\nWe  make use  of the  nomenclature and definitions  from  [7].  Let H  be  a Banach  space of \nfunctions  with norm  II  . II.  For concreteness we  assume henceforth that the  norm is  given \nby  the  Lq  norm,  1 < q  < 00, denoted by II  . Ilq. Let linn H  consist of all  sums of the form \nL~=l aigi , gi  E H  and arbitrary ai, and COn H  is  the set of such sums  with ai  E  [0,1]  and \nL~=l ai = 1.  The distances, measured in  the  Lq  norm, from  a function f  are given by \n\ndist(1innH,f) =  inf {l lh - fllq  :  hE linnH}, \ndist(conH , f) =  inf {l lh - fllq  :  hE conH}. \n\nThe linear span of H  is  given by  linH  = Un linn H,  while  the convex-hull of H  is  coH = \nUncon H. We  follow standard notation and denote closures of sets by a bar, e.g.  coH is  the \nclosure of the convex hull of H. In this work we focus on the special case where \n\nH  =  H1}  ~ {g  : g(x)  =  eCJ(aT x + b),  lei::;  1}, IICJ(\u00b7)llq  ::;  I}, \n\n(1) \n\n\fOn  the Optimality of Incremental Neural Network Algorithms \n\n297 \n\ncorresponding to the basic building blocks of multilayer neural  networks.  The restriction \n110-011  ::;  1 is not very demanding as many sigmoidal functions can be expressed as  a sum \nof functions of bounded norm. It should be obvious that linn 1l1)  corresponds to a two-layer \nneural  network  with  a  linear  output  unit  and  o--activation  functions  in  the  single  hidden \nlayer, while COn 1l1)  is equivalent to a restricted form of such a network, where restrictions \nare placed on the  hidden-to-output weights.  In  terms of the definitions introduced above, \nthe  by  now  well  known property of universal  function  approximation over compacta can \nbe stated as  lin1l  =  C(M), where C(M) is  the class of continuous real  valued functions \ndefined  over M , a  compact subset  of Rd .  A  necessary  and  sufficient condition  for  this \nhas  been  established  by  Leshno et af.  [11],  and  essentially  requires  that 0-(.)  be  locally \nintegrable and non-polynomial.  We comment that if 'T}  = 00 in  (l), and c is unrestricted in \nsign, then co1l=  =  lin1l=.  The distinction becomes important only if 'T}  < 00 ,  in  which \ncase co1l1)  C  lin1l1). \n\nFor the purpose of incremental approximation, it turns out to be useful to consider the con(cid:173)\nvex  hull  co1l,  rather than  the  usual  linear span, as  powerful  algorithms and performance \nbounds  can  be  developed  in  this  case.  In  this  context  several  authors  have  considered \nbounds for the approximation of a function 1 belonging to co1l by sequences of functions \nbelonging to  COn 1l .  However, it is  not clear in  general how well convex hulls of bounded \nfunctions  approximate  general  functions .  One contribution  of this  work  is  to  show  how \none may control the rate of growth of the bound 'T}  in (1), so that general functions, belong(cid:173)\ning  to  certain smoothness classes (e.g.  Sobolev), may be  well  approximated.  In  fact,  we \nshow that the incremental approximation scheme described below achieves nearly optimal \napproximation error for functions in the Sobolev space. \n\nFollowing Donahue et af.  [7],  we consider c:-greedy algorithms. Let E  =  (El '  E2, ... )  be a \npositive sequence,  and similarly  for  (ai, a 2, ... ),  0  < an  < 1.  A  sequence of functions \nhI, h2 ' ...  is E-greedy with respect to 1 if for n  =  0, 1, 2, .. . , \n\nIlhn+I  - Illq < inf {llanhn + (1  - an)g - Illq : 9  E 1l1)} + En , \n\n(2) \nwhere we set ho  = O.  For simplicity we set an = (n  -\nl)/n , although other schemes are \nalso  possible.  It should  be clear that at each stage n,  the  function  hn  belongs to  COn 1l1). \nObserve also that at each step,  the  infimum is  taken  with respect to  9  E  1l1)'  the  function \nhn  being  fixed.  In  terms  of neural  networks, this  implies  that the  optimization over each \nhidden unit parameters (a, b, c) is  performed independently of the others.  We  note in pass(cid:173)\ning,  that  while  this  greatly facilitates  the  optimization process in  practice,  no  theoretical \nguarantee can  be  made  as  to  the  convexity of the  single-node error function  (see  [1]  for \ncounter-examples).  The  variables  En  are  slack  variables,  allowing  the  extra  freedom  of \nonly approximate minimization. In this paper we do not optimize over an, but rather fix  a \nsequence in  advance, forfeiting  some generality at  the  price of a  simpler presentation.  In \nany event, the rates we obtain are unchanged by such a restriction. \n\nIn  the  sequel  we  consider E-greedy approximations of smooth functions  belonging to  the \nSobolev class of functions, \n\nW;  =  {I : m?x  WDk 1112  ::; \u00b71} , \n\nOS k S r \n\nwhere k = (k 1 ,  . \u2022\u2022  , k d ) ,  ki  2::  0  and Ikl  = ki  + ... k d .  Here Vk is  the  partial derivative \noperator of order k. All functions are defined over a compact domain K  C  Rd. \n\n3  Upper bound for the L2  norm \n\nFirst,  we  consider the  approximation of functions  from  WI  using  the  L2  norm.  In  dis(cid:173)\ntinction  with  other  Lq  norms,  there  exists  an  inner product in  this  case,  defined  through \n\n\f298 \n\nR.  Meir and V Maiorov \n\n(\".) =  II\u00b7II~. This simp1ification is essential to the proof in this case. \nWe  begin by  recalling  a  result  from  [12],  demonstrating that any  function  in  L2  may  be \nexactly expressed as a convex integral representation of the form \n\nf(x) = Q J h(x, O)w(O)dO, \n\n(3) \n\nwhere 0  < Q <  00 depends on  f, and w( 0)  is  a probability density  function  (pdf) with \nrespect to  the multi-dimensional variable O. Thus, we may write f(x)  = QEw{h(x, e)}, \nwhere Ew  denotes the  expectation operator with  respect to  the  pdf w .  Moreover,  it  was \nshown  in  [12],  using  the  Radon  and  wavelet transforms,  that  the  function  h(x, 0)  can be \ntaken to be a ridge function with 0 = (a, b, e)  and h(x, 0)  =  ea(aT x + b). \nIn  the  case  of neural  networks,  this  type  of convex  representation  was  first  exploited by \nBarron in  [2], assuming f  belongs to  a class of functions characterized by certain moment \nconditions  on  their Fourier transforms.  Later,  Delyon et al. \n[6]  and  Maiorov  and  Meir \n[12]  extended Barron's results to  the  case of wavelets  and  neural  networks,  respectively, \nobtaining rates of convergence for functions in the Sobolev class. \nThe basic  idea at this  point is  to generate an  approximation, hn(x), based on n  draws  of \nrandom variables en = {e l , e2, ... ,en}, ei  ,.....,  w(\u00b7), resulting in the random function \n\nhn(x; en) = - 2: h(x, ei ). \n\nQ  n \n\nn  i=l \n\n(4) \n\nThroughout the  paper we  conform  to  standard notation,  and  denote random  variables  by \nuppercase  letters,  as  in e,  and their realization by  lower case letters,  as  in  O.  Let wn  = \nTI~=l Wi  represent the  product pdf for {e l ,  ... ,en}.  Our first  result demonstrates that, \non the average, the above procedure leads to good approximation of functions belonging to \nW{. \n\nTheorem 3.1  Let K  C  Rd  be a  compact set.  Then/or any f  E  W{,  n  > 0 and c  > 0 \nthere exists a constant e > 0,  such that \n\nEw \"  Ilf - hn(x; en)112  :S  en- rld+E , \n\n(5) \n\nwhere Q < en(1/2- r l d)+,  and (x)+  =  max(O,x). \n\nThe  implication  of the  upper  bound  on  the  expected  value,  is  that  there  exists  a  set  of \nvalues o* ,n = {Or, ...  , O~}, for which the rate (5) can be achieved.  Moreover, as  long as \nthe functions h(x, Od  in (4) are bounded in the L2  norm, a bound on Q implies a bound on \nthe size of the function  h n  itself. \nProof sketch  The  proof proceeds  by  expressing  f  as  the  sum  of two  functions,  iI \nand  12.  The  function  iI is  the  best  approximation  to  f  from  the  class  of multi-variate \nsplines  of  degree  r.  From  [12]  we  know  that  there  exist  parameters  on  such  that \nIliI (.)  - h n {-, on)112  :S  en-rid.  Moreover,  using  the  results  of [5]  it  can  be  shown  that \n1112112 \n:S  en-rid.  Using  these  two  observations,  together  with  the  triangle  inequality \nIlf - h n l12 :S  IliI - hnl12 + 1112112,  yields the desired result. \nI \nNext,  we  show that given the  approximation rates attained in Theorem 3.1, the same rates \nmay be obtained using an c -greedy algorithm. Moreover, since in  [12]  we have established \nthe optimality of the upper bound (up to a logarithmic factor in n), we conclude that greedy \napproximations can  indeed yield near-optimal perfonnance, while at  the  same time  being \nmuch more attractive computationally.  In fact,  in this  section  we  use  a  weaker algorithm, \nwhich does not perform a full  minimization at each stage. \n\n\fOn the Optimality of Incremental Neural Network Algorithms \n\n299 \n\nIncremental algorithm:  (q = 2) Let an = 1 -\n\n1.  Let 0i be chosen to  satisfy \n\nlin, 6:n = 1 - an =  lin. \n\nIlf(x)  - Qh(x,Onll~ =  EWl  {llf(x)  - Qh(x,edIID\u00b7 \n\n2.  Assume that 0i , \u00b02, ...  ,O~-l have been generated.  Select O~ to obey \n\n2 \n\nf(x)  - ::~; L h(x,On - 6:nQh(x,O~) \n\nn-l \n\ni=l \n\n2 \n\nf(x)  - Qan  \"h(x, On  - 6:nQh(x, en) \n\nn-l \nn-lL..,; \ni=l \n\n22}  . \n\nDefine \n\nwhich  measures  the  error  incurred  at  the  n-th  stage  by  this  incremental  procedure.  The \nmain result of this section then follows. \n\nTheorem 3.2  For any f  E WI  and c  > 0,  the error of the incremental algorithm above is \nbounded as \n\nfor some finite constant c. \n\nProof sketch  The claim will  be established upon showing that \n\n(6) \n\nnamely,  the  error  incurred  by  the  incremental  procedure  is  identical  to  that  of the  non(cid:173)\nincremental one described preceding Theorem 3.1.  The result will  then follow upon using \nHolder's inequality and the upper bound (5) for the r.h.s.  of (6).  The remaining details are \nI \nstraightforward, but tedious, and can be found in  the full  paper [13]. \n\n4  Upper bound for general Lq norms \n\nHaving established rates of convergence for  incremental  approximation of WI  in the  L2 \nnorm,  we  move now to general  Lq  norms.  First,  note  that the proof of Theorem 3.2 relies \nheavily on  the existence on an  inner product.  This useful tool  is  no longer available in  the \ncase of general Banach spaces such as L q .  In  order to extend the results to the latter norm, \nwe  need to use more advanced ideas from the theory of the geometry of Banach spaces.  In \nparticular, we will make use of recent results from the work of Donahue et al.  [7].  Second, \nwe must keep in mind that the approximation of the Sobolev space WI  using the Lq  norm \nonly makes sense if the embedding condition rid> (1/2 -\nl/q)+ holds, since otherwise \nthe  Lq  norm may  be  infinite (the embedding condition guarantees its  finiteness;  see  [14] \nfor details). \n\nWe  first present the main result of this section, followed by a sketch of the proof.  The full \ndetails of the rather technical proof can be found in  [13].  Note that in this case we need to \nuse the greedy algorithm (2) rather than the algorithm of Section 3. \n\n\f300 \n\nR.  Meir and V Maiorov \n\nTheorem 4.1  Let the  embedding condition r / d  >  (1/2 - 1/ q) + hold for 1  <  q  <  00, \n0< r  < r*,  r*  =  ~ + (~- ~)+ andassumethatllh(-,O)llq:S  IforallO.  Thenforany \nJEW; andf. > 0 \n\nwhere \n\n,  = \n\n{ \n\n;I- (~-P \n!+% _qd \n~ \n\nq > 2, \nq :s  2, \n\n(7) \n\nc  =  c(r, d , K)  and hn(-, on) is  obtained via  the  incremental greedy algorithm  (2)  with \ncn  =  O. \n\nProof sketch  The  main  idea  in  the  proof of Theorem 4.1  is  a  two-part  approximation \nscheme.  First,  based on  [13],  we  show  that any  JEW; may  be  well  approximated by \nfunctions  in  the  convex  class  COn ('111/)  for  an  appropriate  value  of TJ  (see  Lemma 5.2  in \n[13]),  where R1/  is  defined  in  (1).  Then,  it  is  argued, making use  of results  from  [7]  (in \nparticular,  using Corollary 3.6) , that an  incremental greedy algorithm can be used to  ap(cid:173)\nproximate the closure of the class co( R 1/)  by the class COn (111/).  The proof is completed by \nusing the triangle inequality. The proof along the above lines is done for the case q > 2. In \nthe case q :s  2, a simple use of the Holder inequality in the form Ilfllq  ~ IKI 1/q- l/21IfI12, \nwhere IKI is  the volume of the region K, yields the desired result, which, given the lower \nI \nbounds in  [12], is nearly optimal. \n\n5  Discussion \n\nWe have presented a theoretical analysis of an increasingly popular approach to incremental \nlearning in  neural  networks . Extending previous results , we  have shown that near-optimal \nrates  of convergence may  be  obtained  for  approximating  functions  in  the  Sobolev  class \nW; . These results extend and clarify previous work dealing solely with the approximation \nof functions belonging to the closure of convex hulls of certain sets of functions.  Moreover, \nwe have given explicit bounds on the parameters used in the algorithm , and shown that the \nrestriction to COn 111/  is not too stringent. In the case q  ~ 2 the rates obtained are as good (up \nto logarithmic factors) to the rates obtained for general spline functions,  which are known \nto  be  optimal  for  approximating  Sobolev  spaces.  The  rates  obtained  in  the  case  q  >  2 \nare  sub-optimal  as  compared to  spline functions,  but can be  shown  to  be  provably  better \nthan any  linear approach.  In any event,  we  have shown that the  rates  obtained are equal, \nup  to  logarithmic  factors,  to  approximation  from  linn 111/'  when  the  size  of TJ  is  chosen \nappropriately, implying that positive input-to-output weights suffice for approximation. An \nopen problem remaining at this point is to demonstrate whether incremental algorithms for \nneural network construction can be shown to  be optimal for every value of q.  In  fact,  this \nis not even known at this stage for neural network approximation in general. \n\nReferences \n\n[1]  P.  Auer, M.  Herbster, and M. Warmuth.  Exponentially many local  minima for single \nneurons.  In  D.S.  Touretzky,  M .e. Mozer, and M.E.  Hasselmo, editors, Advances in \nNeural Information Processing Systems 8, pages 316-322. MIT Press,  1996. \n\n[2]  AR. Barron.  Universal approximation bound for superpositions of a sigmoidal func(cid:173)\n\ntion.  IEEE Trans.  In! Th.,  39:930-945, 1993. \n\n[3]  AR.  Barron  and  R.L.  Barron.  Statistical  learning  networks:  a  unifying  view. \n\nIn \nE.  Wegman , editor, Computing Science and Statistics:  Proceedings 20th Symposium \nInterface , pages 192-203, Washington D.e.,  1988. Amer. Statis. Assoc. \n\n\fOn  the Optimality of Incremental Neural Network Algorithms \n\n301 \n\n[4]  A.  Blum and R.  Rivest.  Training a 3-node neural net is  np-complete.  In  D.S. Touret(cid:173)\n\nzky,  editor, Advances in  Neural  Information  Processing  Systems  I, pages  494-50l. \nMorgan Kaufmann, 1989. \n\n[5]  c.  de  Boor  and  G.  Fix.  Spline  approximation  by  quasi-interpolation.  J.  Approx. \n\nTheory, 7:19-45,1973. \n\n[6]  B.  Delyon, A.  Juditsky, and A. Benveniste.  Accuracy nalysis for wavelet approxima(cid:173)\n\ntions.  IEEE Transaction on Neural Networks, 6:332-348, 1995. \n\n[7]  M.J. Donahue, L. Gurvits, C.  Darken, and E.  Sontag.  Rates of convex approximation \n\nin  non-hilbert spaces.  Constructive Approx. , 13: 187-220, 1997. \n\n[8]  L.  Jones.  A  simple  lemma on  greedy  approximation  in  Hilbert  space  and  conver(cid:173)\n\ngence rate for projection pursuit regression and neural network training.  Ann. Statis. , \n20:608-613, 1992. \n\n[9]  S.  Judd.  Neural Network Design and the Complexity of Learning.  MIT Press, Boston, \n\nUSA,1990. \n\n[10]  W.S . Lee, P.S.  Bartlett,  and  R.c. Williamson.  Efficient Agnostic  learning of neural \n\nnetworks with bounded fan-in . IEEE Trans. In!  Theory, 42(6):2118-2132, 1996. \n\n[11]  M.  Leshno, V.  Lin, A.  Pinkus,  and S. Schocken.  Multilayer Feedforward Networks \nwith  a Nonpolynomial Activation Function Can Approximate any Function.  Neural \nNetworks,  6:861-867,1993. \n\n[12]  V.E.  Maiorov and R. Meir.  On the near optimality of the stochastic approximation of \n\nsmooth functions by  neural  networks.  Technical Report CC-223, Technion, Depart(cid:173)\nment of Electrical Engineering, November  ]997.  Submitted to Advances in  Compu(cid:173)\ntational Mathematics. \n\n[13]  R.  Meir  and  V.  Maiorov.  On  the  optimality  of  neural  network  approxima-\nSubmitted  for  publication,  October  1998. \n\ntion  using  incremental  algorithms. \nftp://dumbo.technion.ac.il/pub/PAPERSlincrementa] .pdf. \n\n[14]  H.  Triebel.  Theory of Function Spaces.  Birkhauser, Basel,  1983. \n\n\f", "award": [], "sourceid": 1517, "authors": [{"given_name": "Ron", "family_name": "Meir", "institution": null}, {"given_name": "Vitaly", "family_name": "Maiorov", "institution": null}]}