{"title": "Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 190, "page_last": 196, "abstract": null, "full_text": "Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks\n\nPeter L. Bartlett\nDepartment of System Engineering\nAustralian National University\nCanberra, ACT 0200\nAustralia\nPeter.Bartlett@anu.edu.au\n\nVitaly Maiorov\nDepartment of Mathematics\nTechnion, Haifa 32000\nIsrael\n\nRon Meir\nDepartment of Electrical Engineering\nTechnion, Haifa 32000\nIsrael\nrmeir@dumbo.technion.ac.il\n\nAbstract\n\nWe compute upper and lower bounds on the VC dimension of feedforward networks of units with piecewise polynomial activation functions. We show that if the number of layers is fixed, then the VC dimension grows as W log W, where W is the number of parameters in the network. This result stands in opposition to the case where the number of layers is unbounded, in which case the VC dimension grows as W^2.\n\n1 MOTIVATION\n\nThe VC dimension is an important measure of the complexity of a class of binary-valued functions, since it characterizes the amount of data required for learning in the PAC setting (see [BEHW89, Vap82]). In this paper, we establish upper and lower bounds on the VC dimension of a specific class of multi-layered feedforward neural networks. Let F be the class of binary-valued functions computed by a feedforward neural network with W weights and k computational (non-input) units, each with a piecewise polynomial activation function. Goldberg and Jerrum [GJ95] have shown that VCdim(F) ≤ c_1(W^2 + Wk) = O(W^2), where c_1 is a constant. 
\nMoreover, Koiran and Sontag [KS97] have demonstrated such a network that has VCdim(F) ≥ c_2 W^2 = Ω(W^2), which would lead one to conclude that the bounds are in fact tight up to a constant. However, the proof used in [KS97] to establish the lower bound made use of the fact that the number of layers can grow with W. In practical applications, this number is often a small constant. Thus, the question remains as to whether it is possible to obtain a better bound in the realistic scenario where the number of layers is fixed.\n\nThe contribution of this work is the proof of upper and lower bounds on the VC dimension of piecewise polynomial nets. The upper bound behaves as O(WL^2 + WL log WL), where L is the number of layers. If L is fixed, this is O(W log W), which is superior to the previous best result, which behaves as O(W^2). Moreover, using ideas from [KS97] and [GJ95] we are able to derive a lower bound on the VC dimension which is Ω(WL) for L = O(W). Maass [Maa94] shows that three-layer networks with threshold activation functions and binary inputs have VC dimension Ω(W log W), and Sakurai [Sak93] shows that this is also true for two-layer networks with threshold activation functions and real inputs. It is easy to show that these results imply similar lower bounds if the threshold activation function is replaced by any piecewise polynomial activation function f that has bounded and distinct limits lim_{x→-∞} f(x) and lim_{x→∞} f(x). We thus conclude that if the number of layers L is fixed, the VC dimension of piecewise polynomial networks with L ≥ 2 layers and real inputs, and of piecewise polynomial networks with L ≥ 3 layers and binary inputs, grows as W log W. 
We note that for the piecewise polynomial networks considered in this work, it is easy to show that the VC dimension and pseudo-dimension are closely related (see e.g. [Vid96]), so that similar bounds (with different constants) hold for the pseudo-dimension. Independently, Sakurai has obtained similar upper bounds and improved lower bounds on the VC dimension of piecewise polynomial networks (see [Sak99]).\n\n2 UPPER BOUNDS\n\nWe begin the technical discussion with precise definitions of the VC-dimension and the class of networks considered in this work.\n\nDefinition 1 Let X be a set, and 𝒜 a system of subsets of X. A set S = {x_1, ..., x_n} is shattered by 𝒜 if, for every subset B ⊆ S, there exists a set A ∈ 𝒜 such that S ∩ A = B. The VC-dimension of 𝒜, denoted by VCdim(𝒜), is the largest integer n such that there exists a set of cardinality n that is shattered by 𝒜.\n\nIntuitively, the VC dimension measures the size, n, of the largest set of points for which all possible 2^n labelings may be achieved by sets A ∈ 𝒜. It is often convenient to talk about the VC dimension of classes of indicator functions F. In this case we simply identify the sets of points x ∈ X for which f(x) = 1 with the subsets in 𝒜, and use the notation VCdim(F).\n\nA feedforward multi-layer network is a directed acyclic graph that represents a parametrized real-valued function of d real inputs. Each node is called either an input unit or a computation unit. The computation units are arranged in L layers. Edges are allowed from input units to computation units. There can also be an edge from a computation unit to another computation unit, but only if the first unit is in a lower layer than the second. There is a single unit in the final layer, called the output unit. 
Each input unit has an associated real value, which is one of the components of the input vector x ∈ R^d. Each computation unit has an associated real value, called the unit's output value. Each edge has an associated real parameter, as does each computation unit. The output of a computation unit is given by σ(Σ_e w_e z_e + w_0), where the sum ranges over the set of edges leading to the unit, w_e is the parameter (weight) associated with edge e, z_e is the output value of the unit from which edge e emerges, w_0 is the parameter (bias) associated with the unit, and σ : R → R is called the activation function of the unit. The argument of σ is called the net input of the unit. We suppose that in each unit except the output unit, the activation function is a fixed piecewise polynomial function of the form\n\nσ(u) = φ_i(u) for u ∈ [t_{i-1}, t_i),\n\nfor i = 1, ..., p+1 (and set t_0 = -∞ and t_{p+1} = ∞), where each φ_i is a polynomial of degree no more than l. We say that σ has p break-points, and degree l. The activation function in the output unit is the identity function. Let k_i denote the number of computational units in layer i and suppose there is a total of W parameters (weights and biases) and k computational units (k = k_1 + k_2 + ... + k_{L-1} + 1). For input x and parameter vector a ∈ A = R^W, let f(x, a) denote the output of this network, and let F = {x ↦ f(x, a) : a ∈ R^W} denote the class of functions computed by such an architecture, as we vary the W parameters. We first discuss the computation of the VC dimension, and thus consider the class of functions sgn(F) = {x ↦ sgn(f(x, a)) : a ∈ R^W}. 
\n\nBefore giving the main theorem of this section, we present the following result, which is a slight improvement of a result due to Warren (see [ABar], Chapter 8).\n\nLemma 2.1 Suppose f_1(·), f_2(·), ..., f_m(·) are fixed polynomials of degree at most l in n ≤ m variables. Then the number of distinct sign vectors {sgn(f_1(a)), ..., sgn(f_m(a))} that can be generated by varying a ∈ R^n is at most 2(2eml/n)^n.\n\nWe then have our main result:\n\nTheorem 2.1 For any positive integers W, k ≤ W, L ≤ W, l, and p, consider a network with real inputs, up to W parameters, up to k computational units arranged in L layers, a single output unit with the identity activation function, and all other computation units with piecewise polynomial activation functions of degree l and with p break-points. Let F be the class of real-valued functions computed by this network. Then\n\nVCdim(sgn(F)) ≤ 2WL log(2eWLpk) + 2WL^2 log(l + 1) + 2L.\n\nSince L and k are O(W), for fixed l and p this implies that\n\nVCdim(sgn(F)) = O(WL log W + WL^2).\n\nBefore presenting the proof, we outline the main idea in the construction. For any fixed input x, the output of the network f(x, a) corresponds to a piecewise polynomial function in the parameters a, of degree no larger than (l + 1)^{L-1} (recall that the last layer is linear). Thus, the parameter domain A = R^W can be split into regions, in each of which the function f(x, ·) is polynomial. From Lemma 2.1, it is possible to obtain an upper bound on the number of sign assignments that can be attained by varying the parameters of a set of polynomials. The theorem will be established by combining this bound with a bound on the number of regions. 
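\n\nTo get a feel for the bound in Theorem 2.1, consider a rough numerical illustration; the values below are arbitrary choices made for this example and are not taken from the analysis. Take W = 1000, L = 3, k = 100, and l = p = 1 (as for the activation σ(u) = max(0, u), which has one break-point at 0 and pieces of degree at most 1), with logarithms to base 2:\n\n2WL log(2eWLpk) = 6000 log(600000 e) ≈ 6000 × 20.64 ≈ 1.24 × 10^5,\n2WL^2 log(l + 1) = 18000 × log 2 = 18000,\n2L = 6.\n\nThe bound is therefore about 1.4 × 10^5, and for this shallow network it is dominated by the WL log W term rather than the WL^2 term.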
\n\nPROOF OF THEOREM 2.1 For an arbitrary choice of m points x_1, x_2, ..., x_m, we wish to bound\n\nK = |{(sgn(f(x_1, a)), ..., sgn(f(x_m, a))) : a ∈ A}|.\n\nFix these m points, and consider a partition {S_1, S_2, ..., S_N} of the parameter domain A. Clearly\n\nK ≤ Σ_{i=1}^{N} |{(sgn(f(x_1, a)), ..., sgn(f(x_m, a))) : a ∈ S_i}|.\n\nWe choose the partition so that within each region S_i, f(x_1, ·), ..., f(x_m, ·) are all fixed polynomials of degree no more than (l + 1)^{L-1}. Then, by Lemma 2.1, each term in the sum above is no more than\n\n2 (2em(l + 1)^{L-1} / W)^W. (1)\n\nThe only remaining point is to construct the partition and determine an upper bound on its size. The partition is constructed recursively, using the following procedure. Let 𝒮_1 be a partition of A such that, for all S ∈ 𝒮_1, there are constants b_{h,i,j} ∈ {0, 1} for which\n\nsgn(p_{h,x_j}(a) - t_i) = b_{h,i,j} for all a ∈ S,\n\nwhere j ∈ {1, ..., m}, h ∈ {1, ..., k_1} and i ∈ {1, ..., p}. Here t_i are the break-points of the piecewise polynomial activation functions, and p_{h,x_j} is the affine function describing the net input to the h-th unit in the first layer, in response to x_j. That is,\n\np_{h,x_j}(a) = a_h · x_j + a_{h,0},\n\nwhere a_h ∈ R^d, a_{h,0} ∈ R are the weights of the h-th unit in the first layer. Note that the partition 𝒮_1 is determined solely by the parameters corresponding to the first hidden layer, as the input to this layer is unaffected by the other parameters. Clearly, for a ∈ S, the output of any first layer unit in response to an x_j is a fixed polynomial in a.\n\nNow, let W_1, ..., W_L be the number of variables used in computing the unit outputs up to layer 1, ..., L respectively (so W_L = W), and let k_1, ... 
, k_L be the number of computation units in layer 1, ..., L respectively (recall that k_L = 1). Then we can choose 𝒮_1 so that |𝒮_1| is no more than the number of sign assignments possible with m k_1 p affine functions in W_1 variables. Lemma 2.1 shows that |𝒮_1| ≤ 2 (2em k_1 p / W_1)^{W_1}.\n\nNow, we define 𝒮_n (for n > 1) as follows. Assume that for all S in 𝒮_{n-1} and all x_j, the net input of every unit in layer n in response to x_j is a fixed polynomial function of a ∈ S, of degree no more than (l + 1)^{n-1}. Let 𝒮_n be a partition of A that is a refinement of 𝒮_{n-1} (that is, for all S ∈ 𝒮_n, there is an S' ∈ 𝒮_{n-1} with S ⊆ S'), such that for all S ∈ 𝒮_n there are constants b_{h,i,j} ∈ {0, 1} such that\n\nsgn(p_{h,x_j}(a) - t_i) = b_{h,i,j} for all a ∈ S, (2)\n\nwhere p_{h,x_j} is the polynomial function describing the net input of the h-th unit in the n-th layer, in response to x_j, when a ∈ S. Since S ⊆ S' for some S' ∈ 𝒮_{n-1}, (2) implies that the output of each n-th layer unit in response to an x_j is a fixed polynomial in a of degree no more than l(l + 1)^{n-1}, for all a ∈ S.\n\nFinally, we can choose 𝒮_n such that, for all S' ∈ 𝒮_{n-1}, |{S ∈ 𝒮_n : S ⊆ S'}| is no more than the number of sign assignments of m k_n p polynomials in W_n variables of degree no more than (l + 1)^{n-1}, and by Lemma 2.1 this is no more than 2 (2em k_n p (l + 1)^{n-1} / W_n)^{W_n}. Notice also that the net input of every unit in layer n + 1 in response to x_j is a fixed polynomial function of a ∈ S ∈ 𝒮_n of degree no more than (l + 1)^n.\n\nProceeding in this way we get a partition 𝒮_{L-1} of A such that for S ∈ 𝒮_{L-1} the network output in response to any x_j is a fixed polynomial of a ∈ S of degree no more than l(l + 1)^{L-2}. 
Furthermore,\n\n|𝒮_{L-1}| ≤ 2 (2em k_1 p / W_1)^{W_1} ∏_{i=2}^{L-1} 2 (2em k_i p (l + 1)^{i-1} / W_i)^{W_i} ≤ ∏_{i=1}^{L-1} 2 (2em k_i p (l + 1)^{i-1} / W_i)^{W_i}.\n\nMultiplying by the bound (1) gives the result\n\nK ≤ ∏_{i=1}^{L} 2 (2em k_i p (l + 1)^{i-1} / W_i)^{W_i}.\n\nSince the points x_1, ..., x_m were chosen arbitrarily, this gives a bound on the maximal number of dichotomies induced by a ∈ A on m points. An upper bound on the VC-dimension is then obtained by computing the largest value of m for which this number is at least 2^m, yielding\n\nm ≤ L + Σ_{i=1}^{L} W_i log(2em p k_i (l + 1)^{i-1} / W_i) ≤ L [1 + (L - 1)W log(l + 1) + W log(2empk)],\n\nwhere all logarithms are to the base 2. We conclude (see for example [Vid96] Lemma 4.4) that\n\nVCdim(sgn(F)) ≤ 2L [(L - 1)W log(l + 1) + W log(2eWLpk) + 1].\n\nWe briefly mention the application of this result to the problem of learning a regression function E[Y|X = x], from n input/output pairs {(X_i, Y_i)}_{i=1}^{n}, drawn independently at random from an unknown distribution P(X, Y). In the case of quadratic loss, L(f) = E(Y - f(X))^2, one can show that there exist constants c_1 ≥ 1 and c_2 such that\n\nE L(f̂_n) ≤ s^2 + c_1 inf_{f∈F} L̃(f) + c_2 M Pdim(F) log n / n,\n\nwhere s^2 = E[Y - E[Y|X]]^2 is the noise variance, L̃(f) = E[(E[Y|X] - f(X))^2] is the approximation error of f, and f̂_n is a function from the class F that approximately minimizes the sample average of the quadratic loss. Making use of recently 
Making use of recently \nderived  bounds  [MM97]  on the approximation error,  inf JET i(f), which  are equal, \nup  to  logarithmic  factors,  to  those  obtained  for  networks  of  units  with  the  stan(cid:173)\ndard sigmoidal function  u{u) =  (1 + e-u)-l , and combining with the considerably \nlower pseudo-dimension bounds for  piecewise polynomial networks, we obtain much \nbetter error rates than are currently available for  sigmoid networks. \n\n3  LOWER BOUND \n\nWe  now  compute  a  lower  bound  on  the  VC  dimension  of  neural  networks  with \ncontinuous activation functions.  This result generalizes the lower  bound in  [KS97], \nsince it holds for  any  number of layers. \n\n\fAlmost Linear VC Dimension Bounds for Piecewise Polynomial Networks \n\n195 \n\nTheorem 3.1  Suppose  f  : R  -+  R  has  the following  properties: \n\n1.  limo-too  f(a) = 1  and limo-t-oo f(a)  = 0,  and \n2.  f  is  differentiable  at some point Xo  with  derivative  f'(xo)  =1=  O. \n\nThen  for  any  L  ~ 1  and W  ~ 10L - 14,  there  is  a  feedforward  network  with  the \nfollowing  properties:  The  network has L  layers  and W  parameters,  the  output unit \nis  a  linear unit,  all  other computation units have  activation function  f,  and the  set \nsgn(F)  of functions  computed by  the  network has \n\nVCdim(sgn(F\u00bb  ~ l ~ J l ~ J ' \n\nwhere  l u J is  the  largest  integer less  than  or equal  to  u. \n\nPROOF  As  in  [KS97],  the  proof follows  that  of Theorem  2.5  in  [GJ95],  but  we \nshow  how  the  functions  described  in  [GJ95]  can  be  computed  by  a  network,  and \nkeep  track  of the  number  of  parameters  and  layers  required.  
We first prove the lower bound for a network containing linear threshold units and linear units (with the identity activation function), and then show that all except the output unit can be replaced by units with activation function f, and the resulting network still shatters the same set. For further details of the proof, see the full paper [BMM98].\n\nFix positive integers M, N ∈ N. We now construct a set of MN points, which may be shattered by a network with O(N) weights and O(M) layers. Let {a_i}, i = 1, 2, ..., N denote a set of N parameters, where each a_i ∈ [0, 1) has an M-bit binary representation a_i = Σ_{j=1}^{M} 2^{-j} a_{i,j}, a_{i,j} ∈ {0, 1}, i.e. the M-bit base two representation of a_i is a_i = 0.a_{i,1} a_{i,2} ... a_{i,M}. We will consider inputs in B_N × B_M, where B_N = {e_i : 1 ≤ i ≤ N}, e_i ∈ {0, 1}^N has i-th bit 1 and all other bits 0, and B_M is defined similarly. We show how to extract the bits of the a_i, so that for input x = (e_l, e_m) the network outputs a_{l,m}. Since there are NM inputs of the form (e_l, e_m), and the a_{l,m} can take on all possible 2^{MN} values, the result will follow. There are three stages to the computation of a_{l,m}: (1) computing a_l, (2) extracting a_{l,k} from a_l, for every k, and (3) selecting a_{l,m} among the a_{l,k}.\n\nSuppose the network input is x = ((u_1, ..., u_N), (v_1, ..., v_M)) = (e_l, e_m). Using one linear unit we can compute Σ_{i=1}^{N} u_i a_i = a_l. This involves N + 1 parameters and one computation unit in one layer. In fact, we only need N parameters, but we need the extra parameter when we show that this linear unit can be replaced by a unit with activation function f.\n\nConsider the parameter c_k = 0.a_{l,k} ... a_{l,M}, that is, c_k = Σ_{j=k}^{M} 2^{k-1-j} a_{l,j} for k = 1, ..., M. 
Since c_k ≥ 1/2 iff a_{l,k} = 1, clearly sgn(c_k - 1/2) = a_{l,k} for all k. Also, c_1 = a_l and c_k = 2c_{k-1} - a_{l,k-1}. Thus, consider the recursion\n\nc_k = 2c_{k-1} - a_{l,k-1},\na_{l,k} = sgn(c_k - 1/2),\n\nwith initial conditions c_1 = a_l and a_{l,1} = sgn(a_l - 1/2). Clearly, we can compute a_{l,1}, ..., a_{l,M-1} and c_2, ..., c_{M-1} in another 2(M - 2) + 1 layers, using 5(M - 2) + 2 parameters in 2(M - 2) + 1 computational units.\n\nWe could compute a_{l,M} in the same way, but the following approach gives fewer layers. Set b = sgn(2c_{M-1} - a_{l,M-1} - Σ_{i=1}^{M-1} v_i). If m ≠ M then b = 0. If m = M then the input vector (v_1, ..., v_M) = e_M, and thus Σ_{i=1}^{M-1} v_i = 0, implying that b = sgn(c_M) = sgn(0.a_{l,M}) = a_{l,M}.\n\nIn order to conclude the proof, we need to show how the variables a_{l,m} may be recovered, depending on the inputs (v_1, v_2, ..., v_M). We then have a_{l,m} = b ∨ ⋁_{i=1}^{M-1} (a_{l,i} ∧ v_i). Since for boolean x and y, x ∧ y = sgn(x + y - 3/2), and ⋁_{i=1}^{M} x_i = sgn(Σ_{i=1}^{M} x_i - 1/2), we see that the computation of a_{l,m} involves an additional 5M parameters in M + 1 computational units, and adds another 2 layers.\n\nIn total, there are 2M layers and 10M + N - 7 parameters, and the network shatters a set of size NM. Clearly, we can add parameters and layers without affecting the function of the network. So for any L, W ∈ N, we can set M = ⌊L/2⌋ and N = W + 7 - 10M, which is at least ⌊W/2⌋ provided W ≥ 10L - 14. In that case, the VC-dimension is at least ⌊L/2⌋ ⌊W/2⌋.\n\nThe network just constructed uses linear threshold units and linear units. However, it is easy to show (see [KS97], Theorem 5) that each unit except the output unit can be replaced by a unit with activation function f so that the network still shatters the set of size MN. 
For linear units, the input and output weights are scaled so that the linear function can be approximated to sufficient accuracy by f in the neighborhood of the point x_0. For linear threshold units, the input weights are scaled so that the behavior of f at infinity accurately approximates a linear threshold function. □\n\nReferences\n\n[ABar] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999 (to appear).\n\n[BEHW89] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929-965, 1989.\n\n[BMM98] P. L. Bartlett, V. Maiorov, and R. Meir. Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10:2159-2173, 1998.\n\n[GJ95] P. W. Goldberg and M. R. Jerrum. Bounding the VC Dimension of Concept Classes Parameterized by Real Numbers. Machine Learning, 18:131-148, 1995.\n\n[KS97] P. Koiran and E. D. Sontag. Neural Networks with Quadratic VC Dimension. Journal of Computer and System Sciences, 54:190-198, 1997.\n\n[Maa94] W. Maass. Neural nets with superlinear VC-dimension. Neural Computation, 6(5):877-884, 1994.\n\n[MM97] V. Maiorov and R. Meir. On the Near Optimality of the Stochastic Approximation of Smooth Functions by Neural Networks. Submitted for publication, 1997.\n\n[Sak93] A. Sakurai. Tighter bounds on the VC-dimension of three-layer networks. In World Congress on Neural Networks, volume 3, pages 540-543, Hillsdale, NJ, 1993. Erlbaum.\n\n[Sak99] A. Sakurai. Tight bounds for the VC-dimension of piecewise polynomial networks. In Advances in Neural Information Processing Systems, volume 11. MIT Press, 1999.\n\n[Vap82] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.\n\n[Vid96] M. Vidyasagar. A Theory of Learning and Generalization. Springer-Verlag, New York, 1996.\n", "award": [], "sourceid": 1515, "authors": [{"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Vitaly", "family_name": "Maiorov", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}]}