{"title": "The Perceptron Algorithm Is Fast for Non-Malicious Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 676, "page_last": 685, "abstract": null, "full_text": "676 \n\nBaum \n\nThe Perceptron Algorithm Is Fast for \n\nNon-Malicious Distributions \n\nEric B. Baum \n\nNEC Research Institute \n4 Independence Way \nPrinceton, NJ  08540 \n\nAbstract:  Within the context of Valiant's protocol for learning, the Perceptron algorithm is shown to learn an arbitrary half-space in time $O(n^2/\\epsilon^3)$ if $D$, the probability distribution of examples, is taken uniform over the unit sphere $S^n$. Here $\\epsilon$ is the accuracy parameter. This is surprisingly fast, as \"standard\" approaches involve solution of a linear programming problem involving $O(n/\\epsilon)$ constraints in $n$ dimensions. A modification of Valiant's distribution independent protocol for learning is proposed in which the distribution and the function to be learned may be chosen by adversaries; however, these adversaries may not communicate. It is argued that this definition is more reasonable and applicable to real world learning than Valiant's. Under this definition, the Perceptron algorithm is shown to be a distribution independent learning algorithm. In an appendix we show that, for uniform distributions, some classes of infinite V-C dimension including convex sets and a class of nested differences of convex sets are learnable. \n\n\u00a71:  Introduction \n\nThe Perceptron algorithm was proved in the early 1960s [Rosenblatt, 1962] to converge and yield a half space separating any set of linearly separable classified examples. Interest in this algorithm waned in the 1970's after it was emphasized [Minsky and Papert, 1969] (1) that the class of problems solvable by a single half space was limited, and (2) that the Perceptron algorithm, although converging in finite time, did not converge in polynomial time. In the 1980's, however, it has become evident that there is no hope of providing a learning algorithm which can learn arbitrary functions in polynomial time, and much research has thus been restricted to algorithms which learn a function drawn from a particular class of functions. Moreover, learning theory has focused on protocols like that of [Valiant, 1984] where we seek to classify, not a fixed set of examples, but examples drawn from a probability distribution. This allows a natural notion of \"generalization\". There are very few classes which have yet been proven learnable in polynomial time, and one of these is the class of half spaces. Thus there is considerable theoretical interest now in studying the problem of learning a single half space, and so it is natural to reexamine the Perceptron algorithm within the formalism of Valiant. \n\nIn Valiant's protocol, a class of functions is called learnable if there is a learning algorithm which works in polynomial time independent of the distribution $D$ generating the examples. Under this definition the Perceptron learning algorithm is not a polynomial time learning algorithm. However, we will argue in section 2 that this definition is too restrictive. We will consider in section 3 the behavior of the Perceptron algorithm if $D$ is taken to be the uniform distribution on the unit sphere $S^n$. In this case, we will see that the Perceptron algorithm converges remarkably rapidly. 
Indeed we will give a time bound which is faster than any bound known to us for any algorithm solving this problem. Then, in section 4, we will present what we believe to be a more natural definition of distribution independent learning in this context, which we will call Nonmalicious distribution independent learning. We will see that the Perceptron algorithm is indeed a polynomial time nonmalicious distribution independent learning algorithm. In Appendix A, we sketch proofs that, if one restricts attention to the uniform distribution, some classes with infinite Vapnik-Chervonenkis dimension such as the class of convex sets and the class of nested differences of convex sets (which we define) are learnable. These results support our assertion that distribution independence is too much to ask for, and may also be of independent interest. \n\n\u00a72:  Distribution Independent Learning \n\nIn Valiant's protocol [Valiant, 1984], a class $F$ of Boolean functions on $\\mathbb{R}^n$ is called learnable if a learning algorithm $A$ exists which satisfies the following conditions. Pick some probability distribution $D$ on $\\mathbb{R}^n$. $A$ is allowed to call examples, which are pairs $(x, f(x))$, where $x$ is drawn according to the distribution $D$. $A$ is a valid learning algorithm for $F$ if for any probability distribution $D$ on $\\mathbb{R}^n$, for any $0 < \\delta, \\epsilon < 1$, for any $f \\in F$, $A$ calls examples and, with probability at least $1 - \\delta$, outputs in time bounded by a polynomial in $n$, $\\delta^{-1}$, and $\\epsilon^{-1}$ a hypothesis $g$ such that the probability that $f(x) \\neq g(x)$ is less than $\\epsilon$ for $x$ drawn according to $D$. \n\nThis protocol includes a natural formalization of 'generalization' as prediction. For more discussion see [Valiant, 1984]. 
The definition is restrictive in demanding that $A$ work for an arbitrary probability distribution $D$. This demand is suggested by results on uniform convergence of the empirical distribution to the actual distribution. In particular, if $F$ has Vapnik-Chervonenkis (V-C) dimension/1 $d$, then it has been proved [Blumer et al., 1987] that all $A$ needs to do to be a valid learning algorithm is to call $M_0(\\epsilon, \\delta, d) = \\max(\\frac{4}{\\epsilon} \\log \\frac{2}{\\delta}, \\frac{8d}{\\epsilon} \\log \\frac{13}{\\epsilon})$ examples and to find in polynomial time a function $g \\in F$ which correctly classifies these. \n\nThus, for example, it is simple to show that the class $H$ of half spaces is Valiant learnable [Blumer et al., 1987]. The V-C dimension of $H$ is $n + 1$. All we need to do to learn $H$ is to call $M_0(\\epsilon, \\delta, n + 1)$ examples and find a separating half space using Karmarkar's algorithm [Karmarkar, 1984]. Note that the Perceptron algorithm would not work here, since one can readily find distributions for which the Perceptron algorithm would be expected to take arbitrarily long times to find a separating half space. \n\n/1 We say a set $S \\subset \\mathbb{R}^n$ is shattered by a class $F$ of Boolean functions if $F$ induces all Boolean functions on $S$. The V-C dimension of $F$ is the cardinality of the largest set $S$ which $F$ shatters. \n\nNow, however, it seems from three points of view that the distribution independent definition is too strong. First, although the results of [Blumer et al., 1987] tell us we can gather enough information for learning in polynomial time, they say nothing about when we can actually find an algorithm $A$ which learns in polynomial time. So far, such algorithms have only been found in a few cases, and (see, e.g. [Baum, 1989a]) these cases may be argued to be trivial. 
\n\nSecond, a few classes of functions have been proved (modulo strong but plausible complexity theoretic hypotheses) unlearnable by construction of cryptographically secure subclasses. Thus for example [Kearns and Valiant, 1988] show that the class of feedforward networks of threshold gates of some constant depth, or of Boolean gates of logarithmic depth, is not learnable by construction of a cryptographically secure subclass. The relevance of such results to learning in the natural world is unclear to us. For example, these results do not rule out a learning algorithm that would learn almost any log depth net. We would thus prefer a less restrictive definition of learnability, so that if a class were proved unlearnable, it would provide a meaningful limit on pragmatic learning. \n\nThird, the results of [Blumer et al., 1987] imply that we can only expect to learn a class of functions $F$ if $F$ has finite V-C dimension. Thus we are in the position of assuming an enormous amount of information about the class of functions to be learned, namely that it be some specific class of finite V-C dimension, but nothing whatever about the distribution of examples. In the real world, by contrast, we are likely to know at least as much about the distribution $D$ as we know about the class of functions $F$. If we relax the distribution independence criterion, then it can be shown that classes of infinite Vapnik-Chervonenkis dimension are learnable. For example, for the uniform distribution, the class of convex sets and a class of nested differences of convex sets (both of which trivially have infinite V-C dimension) are shown to be learnable in Appendix A. 
\n\n\u00a73:  The Perceptron Algorithm and Uniform Distributions \n\nThe Perceptron algorithm yields, in finite time, a half-space $(w_H, \\theta_H)$ which correctly classifies any given set of linearly separable examples [Rosenblatt, 1962]. That is, given a set of classified examples $\\{x_\\pm^\\mu\\}$ such that, for some $(w_t, \\theta_t)$, $w_t \\cdot x_+^\\mu > \\theta_t$ and $w_t \\cdot x_-^\\mu < \\theta_t$ for all $\\mu$, the algorithm converges in finite time to output a $(w_H, \\theta_H)$ such that $w_H \\cdot x_+^\\mu \\geq \\theta_H$ and $w_H \\cdot x_-^\\mu < \\theta_H$. We will normalize so that $w_t \\cdot w_t = 1$. Note that $|w_t \\cdot x - \\theta_t|$ is the Euclidean distance from $x$ to the separating hyperplane $\\{y : w_t \\cdot y = \\theta_t\\}$. \n\nThe algorithm is the following. Start with some initial candidate $(w_0, \\theta_0)$, which we will take to be $(0, 0)$. Cycle through the examples. For each example, test whether that example is correctly classified. If so, proceed to the next example. If not, modify the candidate by \n\n$$(w_{k+1}, \\theta_{k+1}) = (w_k \\pm x_\\pm^\\mu, \\theta_k \\mp 1) \\qquad (1)$$ \n\nwhere the sign of the modification is determined by the classification of the misclassified example. \n\nIn this section we will apply the Perceptron algorithm to the problem of learning in the probabilistic context described in section 2, where however the distribution $D$ generating examples is uniform on the unit sphere $S^n$. Rather than have a fixed set of examples, we apply the algorithm in a slightly novel way: we call an example, perform a Perceptron update step, discard the example, and iterate until we converge to accuracy $\\epsilon$ (footnote /2). If we applied the Perceptron algorithm in the standard way, it seemingly would not converge as rapidly. We will return to this point at the end of this section. 
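The call, update, discard loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the author's implementation: the update is assumed to be w <- w +/- x, theta <- theta -/+ 1 (the form that the later analysis of N is consistent with), the target threshold is taken to be 0, and the helper names sample_sphere, dot, online_perceptron and estimate_error, the dimension, and the sample sizes are all hypothetical choices made for illustration. Here dim is the dimension of the ambient space in which the unit sphere sits.

```python
import math
import random


def sample_sphere(rng, dim):
    # Uniform point on the unit sphere in R^dim: normalize a Gaussian vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def online_perceptron(w_target, num_examples, rng):
    # Call an example, update only if it is misclassified, discard it, repeat.
    dim = len(w_target)
    w = [0.0] * dim  # initial candidate (w_0, theta_0) = (0, 0)
    theta = 0.0
    updates = 0
    for _ in range(num_examples):
        x = sample_sphere(rng, dim)
        label = 1.0 if dot(w_target, x) > 0.0 else -1.0  # target threshold assumed 0
        predicted = 1.0 if dot(w, x) >= theta else -1.0
        if predicted != label:
            # Assumed update rule: w <- w +/- x, theta <- theta -/+ 1.
            w = [wi + label * xi for wi, xi in zip(w, x)]
            theta -= label
            updates += 1
    return w, theta, updates


def estimate_error(w, theta, w_target, rng, trials=20000):
    # Monte Carlo estimate of the chance of misclassifying a fresh example.
    mistakes = 0
    for _ in range(trials):
        x = sample_sphere(rng, len(w_target))
        truth = dot(w_target, x) > 0.0
        guess = dot(w, x) >= theta
        if truth != guess:
            mistakes += 1
    return mistakes / trials


if __name__ == '__main__':
    rng = random.Random(0)
    w_target = sample_sphere(rng, 5)
    w, theta, updates = online_perceptron(w_target, 50000, rng)
    print(updates, estimate_error(w, theta, w_target, rng))
```

Running the script prints the number of updates made and an empirical error estimate for the final candidate. Only a minority of the examples called should trigger an update, which is the effect the sample-size argument later in this section accounts for.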
\n\nNow the number of updates the Perceptron algorithm must make to learn a given set of examples is well known to be $O(1/l^2)$, where $l$ is the minimum distance from an example to the classifying hyperplane (see e.g. [Minsky and Papert, 1969]). In order to learn to $\\epsilon$ accuracy in the sense of Valiant, we will observe that for the uniform distribution we do not need to correctly classify examples closer to the target separating hyperplane than $O(\\epsilon/\\sqrt{n})$. Thus we will prove that the Perceptron algorithm will converge (with probability $1 - \\delta$) after $O(n/\\epsilon^2)$ updates, which will occur after $O(n/\\epsilon^3)$ presentations of examples. \n\nIndeed take $\\theta_t = 0$ so the target hyperplane passes through the origin. Parallel hyperplanes a distance $\\kappa/2$ above and below the target hyperplane bound a band $B$ of probability measure \n\n$$P(\\kappa) = \\frac{A_{n-1}}{A_n} \\int_{-\\kappa/2}^{\\kappa/2} (\\sqrt{1 - z^2})^{n-2} dz \\qquad (2)$$ \n\n(for $n > 2$), where $A_n = 2\\pi^{(n+1)/2} / \\Gamma((n+1)/2)$ is the area of $S^n$. See figure 1. \n\nFigure 1: The target hyperplane intersects the sphere $S^n$ along its equator (if $\\theta_t = 0$) shown as the central line. Points in (say) the upper hemisphere are classified as positive examples and those in the lower as negative examples. The band $B$ is formed by intersecting the sphere with two planes parallel to the target hyperplane and a distance $\\kappa/2$ above and below it. \n\n/2 We say that our candidate half space has accuracy $\\epsilon$ when the probability that it misclassifies an example drawn from $D$ is no greater than $\\epsilon$. \n\nUsing the readily obtainable (e.g. by Stirling's formula) bound that $A_{n-1}/A_n < \\sqrt{n}$, and the fact that the integrand is nowhere greater than 1, we find that for $\\kappa = \\epsilon/(2\\sqrt{n})$, the band has measure less than $\\epsilon/2$. If $\\theta_t \\neq 0$, a band of width $\\kappa$ will have less measure than it would for $\\theta_t = 0$. 
We will thus continue to argue (without loss of generality) by assuming the worst case condition that $\\theta_t = 0$. \n\nSince $B$ has measure less than $\\epsilon/2$, if we have not yet converged to accuracy $\\epsilon$, there is no more than probability 1/2 that the next example on which we update will be in $B$. We will show that once we have made $m_0 = \\max(144 \\ln \\frac{2}{\\delta}, \\frac{48}{\\kappa^2})$ updates, we have converged unless more than 7/12 of the updates are in $B$. The probability of making this fraction of the updates in $B$, however, is less than $\\delta/2$ if the probability of each update lying in $B$ is not more than 1/2. We conclude with confidence $1 - \\delta/2$ that the probability our next update will be in $B$ is greater than 1/2 and thus that we have converged to $\\epsilon$-accuracy. \n\nIndeed, consider the change in the quantity \n\n$$N(\\alpha) = \\|\\alpha w_t - w_k\\|^2 + (\\alpha \\theta_t - \\theta_k)^2 \\qquad (3)$$ \n\nwhen we update: \n\n$$\\Delta N = \\mp 2\\alpha (w_t \\cdot x_\\pm - \\theta_t) \\pm 2 (w_k \\cdot x_\\pm - \\theta_k) + x^2 + 1 \\qquad (4)$$ \n\nNow note that $\\pm(w_k \\cdot x_\\pm - \\theta_k) < 0$ since $x$ was misclassified by $(w_k, \\theta_k)$ (else we would not update). Let $A = \\mp(w_t \\cdot x_\\pm - \\theta_t)$. If $x \\in B$, then $A < 0$. If $x \\notin B$, then $A \\leq -\\kappa/2$. Recalling $x^2 = 1$, we see that $\\Delta N < 2$ for $x \\in B$ and $\\Delta N < -\\alpha\\kappa + 2$ for $x \\notin B$. If we choose $\\alpha = 8/\\kappa$, we find that $\\Delta N \\leq -6$ for $x \\notin B$. Recall that, for $k = 0$, with $(w_0, \\theta_0) = (0, 0)$, we have $N = \\alpha^2 = 64/\\kappa^2$. Thus we see that if we have made $O$ updates on points outside $B$, and $I$ updates on points in $B$, $N < 0$ if $6O - 2I > 64/\\kappa^2$. But $N$ is positive semidefinite. Once we have made $48/\\kappa^2$ total updates, at least 7/12 of the updates must thus have been on examples in $B$. \n\nIf you assume that the probability of updates falling in $B$ is less than 1/2 (and thus that our hypothesis half space is not yet at $\\epsilon$-accuracy), then the probability that more than 7/12 of $m_0 = \\max(144 \\ln \\frac{2}{\\delta}, \\frac{48}{\\kappa^2})$ updates fall in $B$ is less than $\\delta/2$. 
\nTo see this define $LE(p, m, r)$ as the probability of having at most $r$ successes in $m$ independent Bernoulli trials with probability of success $p$ and recall [Angluin and Valiant, 1979], for $0 < \\beta < 1$, that \n\n$$LE(p, m, (1 - \\beta)mp) \\leq e^{-\\beta^2 mp/2} \\qquad (5)$$ \n\nApplying this formula with $m = m_0$, $p = 1/2$, $\\beta = 1/6$ shows the desired result. We conclude that the probability of making $m_0$ updates without converging to $\\epsilon$ accuracy is less than $\\delta/2$. \n\nHowever, as it approaches $1 - \\epsilon$ accuracy, the algorithm will only update on a fraction $\\epsilon$ of the examples. To get, with confidence $1 - \\delta/2$, $m_0$ updates, it suffices to call $M = 2m_0/\\epsilon$ examples. Thus we see that the Perceptron algorithm converges, with confidence $1 - \\delta$, after we have called \n\n$$M = \\frac{2}{\\epsilon} \\max(144 \\ln \\frac{2}{\\delta}, \\frac{48n}{\\epsilon^2}) \\qquad (6)$$ \n\nexamples. \n\nEach example could be processed in time of order 1 on a \"neuron\" which computes $w_k \\cdot x$ in time 1 and updates each of its \"synaptic weights\" in parallel. On a serial computer, however, processing each example will take time of order $n$, so that we have a time of order $O(n^2/\\epsilon^3)$ for convergence on a serial computer. \n\nThis is remarkably fast. The general learning procedure, described in section 2, is to call $M_0(\\epsilon, \\delta, n + 1)$ examples and find a separating halfspace, by some polynomial time algorithm for linear programming such as Karmarkar's algorithm. This linear programming problem thus contains $O(n/\\epsilon)$ constraints in $n$ dimensions. Even to write down the problem thus takes time $O(n^2/\\epsilon)$. The upper time bound to solve this given by [Karmarkar, 1984] is $O(n^{5.5} \\epsilon^{-2})$. 
For large $n$ the Perceptron algorithm is faster by a factor of $n^{3.5}$. Of course it is likely that Karmarkar's algorithm could be proved to work faster than $O(n^{5.5})$ for the particular distribution of examples of interest. If, however, Karmarkar's algorithm requires a number of iterations depending even logarithmically on $n$, it will scale worse (for large $n$) than the Perceptron algorithm./3 \n\n/3 We thank P. Vaidya for a discussion on this point. \n\nNotice also that if we simply called $M_0(\\epsilon, \\delta, n + 1)$ examples and used the Perceptron algorithm, in the traditional way, to find a linear separator for this set of examples, our time performance would not be nearly as good. In fact, equation 2 tells us that we would expect one of these examples to be a distance $O(\\epsilon/n^{1.5})$ from the target hyperplane, since we are calling $O(n/\\epsilon)$ examples and a band of width $O(\\epsilon/n^{1.5})$ has measure $O(\\epsilon/n)$. Thus this approach would take time $O(n^4/\\epsilon^3)$, or a factor of $n^2$ worse than the one we have proposed. \n\nAn alternative approach to learning using only $O(n/\\epsilon)$ examples would be to call $M_0(\\cdot, \\delta, n + 1)$ examples and apply the Perceptron algorithm to these until a fraction $1 - \\epsilon/2$ had been correctly classified. This would suffice to assure that the hypothesis half space so generated would (with confidence $1 - \\delta$) have error less than $\\epsilon$, as is seen from [Blumer et al., 1987, Theorem A3.3]. It is unclear to us what time performance this procedure would yield. \n\n\u00a74:  Non-Malicious Distribution Independent Learning \n\nNext we propose a modification of the distribution independence assumption, which we have argued is too strong to apply to real world learning. We begin with an informal description. We allow an adversary (adversary 1) to choose the 
function $f$ in the class $F$ to present to the learning algorithm $A$. We allow a second adversary (adversary 2) to choose the distribution $D$ arbitrarily. We demand that (with probability $1 - \\delta$) $A$ converge to produce an $\\epsilon$-accurate hypothesis $g$. Thus far we have not changed Valiant's definition. Our restriction is simply that before their choice of distribution and function, adversaries 1 and 2 are not allowed to exchange information. Thus they must work independently. This seems to us an entirely natural and reasonable restriction in the real world. \n\nNow if we pick any distribution and any hyperplane independently, it is highly unlikely that the probability measure will be concentrated close to the hyperplane. Thus we expect to see that under our restriction, the Perceptron algorithm is a distribution independent learning algorithm for $H$ and converges in time $O(n^2/(\\epsilon^3 \\delta^2))$ on a serial computer. \n\nIf adversary 1 and adversary 2 do not exchange information, the least we can expect is that they have no notion of a preferred direction on the sphere. Thus our informal demand that these two adversaries do not exchange information should imply, at least, that adversary 1 is equally likely to choose any $w_t$ (relative e.g. to whatever direction adversary 2 takes as his $z$ axis). This formalizes, sufficiently for our current purposes, the notion of Nonmalicious Distribution Independence. \n\nTheorem 1: Let $U$ be the uniform probability measure on $S^n$ and $D$ any other probability distribution on $S^n$. Let $R$ be any region on $S^n$ of $U$-measure $\\epsilon\\delta$ and let $z$ label some point in $R$. Choose a point $y$ on $S^n$ randomly according to $U$. Consider the region $R'$ formed by translating $R$ rigidly so that $z$ is mapped to $y$. 
\nThen the probability that the measure $D(R') > \\epsilon$ is less than $\\delta$. \n\nProof: Fix any point $z \\in S^n$. Now choose $y$ and thus $R'$. The probability $z \\in R'$ is $\\epsilon\\delta$. Thus in particular, if we choose a point $p$ according to $D$ and then choose $R'$, the probability that $p \\in R'$ is $\\epsilon\\delta$. \n\nNow assume that there is probability greater than $\\delta$ that $D(R') > \\epsilon$. Then we arrive immediately at a contradiction, since we discover that the probability that $p \\in R'$ is greater than $\\epsilon\\delta$. Q.E.D. \n\nCorollary 2: The Perceptron algorithm is a Non-malicious distribution independent learning algorithm for half spaces on the unit sphere which converges, with confidence $1 - \\delta$ to accuracy $1 - \\epsilon$ in time of order $O(n^2/(\\epsilon^3 \\delta^2))$ on a serial computer. \n\nProof sketch: Let $\\kappa' = \\epsilon\\delta/(2\\sqrt{n})$. Apply Theorem 1 to show that a band formed by hyperplanes a distance $\\kappa'/2$ on either side of the target hyperplane has probability less than $\\delta$ of having measure for examples greater than $\\epsilon/2$. Then apply the arguments of the last section, with $\\kappa'$ in place of $\\kappa$. Q.E.D. \n\nAppendix A:  Convex Sets Are Learnable for Uniform Distribution \n\nIn this appendix we sketch proofs that two classes of functions with infinite V-C dimension are learnable. These classes are the class of convex sets and a class of nested differences of convex sets which we define. These results support our conjecture that full distribution independence is too restrictive a criterion to ask for if we want our results to have interesting applications. We believe these results are also of independent interest. \n\nTheorem 3: The class $C$ of convex sets is learnable in time polynomial in $\\epsilon^{-1}$ and $\\delta^{-1}$ if the distribution of examples is uniform on the unit square in $d$ dimensions. 
\n\nRemarks: (1) $C$ is well known to have infinite V-C dimension. (2) So far as we know, $C$ is not learnable in time polynomial in $d$ as well. \n\nProof Sketch:/4 We work, for simplicity, in 2 dimensions. Our arguments can readily be extended to $d$ dimensions. \n\nThe learning algorithm is to call $M$ examples (where $M$ will be specified). The positive examples are by definition within the convex set to be learned. Let $M_+$ be the set of positive examples. We classify examples as negative if they are linearly separable from $M_+$, i.e. outside of $c_+$, the convex hull of $M_+$. \n\nClearly this approach will never misclassify a negative example, but may misclassify positive examples which are outside $c_+$ and inside $c_t$. To show $\\epsilon$-accuracy, we must choose $M$ large enough so that, with confidence $1 - \\delta$, the symmetric difference of the target set $c_t$ and $c_+$ has area less than $\\epsilon$. \n\nFigure 2: The boundary of the target concept $c_t$ is shown. The set $I_1$ of little squares intersecting the boundary of $c_t$ are hatched vertically. The set $I_2$ of squares just inside $I_1$ are hatched horizontally. The set $I_3$ of squares just inside $I_2$ are hatched diagonally. If we have an example in each square in $I_2$, the convex hull of these examples contains all points inside $c_t$ except possibly those in $I_1$, $I_2$, or $I_3$. \n\n/4 This proof is inspired by arguments presented in [Pollard, 1984], pp22-24. After this proof was completed, the author heard D. Haussler present related, unpublished results at the 1989 Snowbird meeting on Neural Computation. \n\nDivide the unit square into $k^2$ equal subsquares. (See figure 2.) Call the set of subsquares which the boundary of $c_t$ intersects $I_1$. 
It is easy to see that the cardinality of $I_1$ is no greater than $4k$. The set $I_2$ of subsquares just inside $I_1$ also has cardinality no greater than $4k$, and likewise for the set $I_3$ of subsquares just inside $I_2$. If we have an example in each of the squares in $I_2$, then $c_t$ and $c_+$ clearly have symmetric difference at most equal the area of $I_1 \\cup I_2 \\cup I_3 < 12k \\times k^{-2} = 12/k$. Thus take $k = 12/\\epsilon$. Now choose $M$ sufficiently large so that after $M$ trials there is less than $\\delta$ probability we have not got an example in each of the $4k$ squares in $I_2$. Thus we need $LE(k^{-2}, M, 4k) < \\delta$. Using equation 5, we see that M = 5f~oln8 will suffice. Q.E.D. \n\nActually, one can learn (for uniform distributions) a more complex class of functions formed out of nested convex regions. For any set $\\{c_1, c_2, \\ldots, c_l\\}$ of $l$ convex regions in $\\mathbb{R}^d$, let $R_1 = c_1$ and for $j = 2, \\ldots, l$ let $R_j = R_{j-1} \\cap c_j$. Then define a concept $f = R_1 - R_2 + R_3 - \\ldots R_l$. The class $C$ of concepts so formed we call nested convex sets. See figure 3. \n\nFigure 3: $c_1$ is the five sided region, $c_2$ is the triangular region, and $c_3$ is the square. The positive region $c_1 - c_2 \\cap c_1 + c_3 \\cap c_2 \\cap c_1$ is shaded. \n\nThis class can be learned by an iterative procedure which peels the onion. Call a sufficient number of examples. (One can easily see that a number polynomial in $l$, $\\epsilon$, and $\\delta$ but of course exponential in $d$ will suffice.) Let the set of examples so obtained be called $S$. Those negative examples which are linearly separable from all positive examples are in the outermost layer. Class these in set $S_1$. 
Those positive examples which are linearly separable from all negative examples in $S - S_1$ lie in the next layer; call this set of positive examples $S_2$. Those negative examples in $S - S_1$ linearly separable from all positive examples in $S - S_2$ lie in the next layer, $S_3$. In this way one builds up $l + 1$ sets of examples. (Some of these sets may be empty.) One can then apply the methods of Theorem 3 to build a classifying function from the outside in. If the innermost layer $S_{l+1}$ is (say) negative examples, then any future example is called negative if it is not linearly separable from $S_{l+1}$, or is linearly separable from $S_l$ and not linearly separable from $S_{l-1}$, or is linearly separable from $S_{l-2}$ but not linearly separable from $S_{l-3}$, etc. \n\nAcknowledgement: I would like to thank L.E. Baum for conversations and L.G. Valiant for comments on a draft. Portions of the work reported here were performed while the author was an employee of Princeton University and of the Jet Propulsion Laboratory, California Institute of Technology, and were supported by NSF grant DMR-8518163 and agencies of the US Department of Defence including the Innovative Science and Technology Office of the Strategic Defence Initiative Organization. \n\nReferences \n\nANGLUIN, D., VALIANT, L.G. (1979), Fast probabilistic algorithms for Hamiltonian circuits and matchings, J. of Computer and Systems Sciences, 18, pp 155-193. \nBAUM, E.B., (1989), On learning a union of half spaces, Journal of Complexity V5, N4. \nBLUMER, A., EHRENFEUCHT, A., HAUSSLER, D., and WARMUTH, M. (1987), Learnability and the Vapnik-Chervonenkis Dimension, U.C.S.C. tech. rep. UCSC-CRL-87-20, and J. ACM, to appear. 
\nKARMARKAR, N., (1984), A new polynomial time algorithm for linear programming, Combinatorica 4, pp373-395. \nKEARNS, M., and VALIANT, L., (1989), Cryptographic limitations on learning Boolean formulae and finite automata, Proc. 21st ACM Symp. on Theory of Computing, pp433-444. \nMINSKY, M., and PAPERT, S., (1969), Perceptrons, an Introduction to Computational Geometry, MIT Press, Cambridge MA. \nPOLLARD, D. (1984), Convergence of stochastic processes, New York: Springer-Verlag. \nROSENBLATT, F. (1962), Principles of Neurodynamics, Spartan Books, N.Y. \nVALIANT, L.G., (1984), A theory of the learnable, Comm. of ACM V27, N11, pp1134-1142. \n", "award": [], "sourceid": 226, "authors": [{"given_name": "Eric", "family_name": "Baum", "institution": null}]}