{"title": "Direct Optimization of Margins Improves Generalization in Combined Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 288, "page_last": 294, "abstract": null, "full_text": "Direct Optimization of Margins Improves Generalization in Combined Classifiers\n\nLlew Mason, Peter Bartlett, Jonathan Baxter\n\nDepartment of Systems Engineering\n\nAustralian National University, Canberra, ACT 0200, Australia\n\n{lmason, bartlett, jon}@syseng.anu.edu.au\n\nAbstract\n\nFigure: Cumulative training margin distributions for AdaBoost versus our \"Direct Optimization Of Margins\" (DOOM) algorithm. The dark curve is AdaBoost, the light curve is DOOM. DOOM sacrifices significant training error for improved test error (horizontal marks on the margin = 0 line). The horizontal axis is the margin, from -1 to 1.\n\n1 Introduction\n\nMany learning algorithms for pattern classification minimize some cost function of the training data, with the aim of minimizing error (the probability of misclassifying an example). One example of such a cost function is simply the classifier's error on the training data. Recent results have examined alternative cost functions that provide better error estimates in some cases. For example, results in [Bar98] show that the error of a sigmoid network classifier $f(\\cdot)$ is no more than the sample average of the cost function $\\mathrm{sgn}(\\theta - yf(x))$ (which takes value 1 when $yf(x)$ is no more than $\\theta$ and 0 otherwise) plus a complexity penalty term that scales as $\\|w\\|_1/\\theta$, where $(x, y) \\in X \\times \\{\\pm 1\\}$ is a labelled training example, and $\\|w\\|_1$ is the sum of the magnitudes of the output node weights. The quantity $yf(x)$ is the margin of the real-valued function $f$, and reflects the extent to which $f(x)$ agrees with the label $y \\in \\{\\pm 1\\}$. 
By minimizing squared error, neural network learning algorithms implicitly maximize margins, which may explain their good generalization performance.\n\nMore recently, Schapire et al. [SFBL98] have shown a similar result for convex combinations of classifiers, such as those produced by boosting algorithms. They show that, with high probability (at least $1 - \\delta$) over $m$ random examples, every convex combination of classifiers from some finite class $H$ has error satisfying\n\n$$\\Pr[yf(x) \\le 0] \\le E_S[\\mathrm{sgn}(\\theta - yf(x))] + O\\left(\\frac{1}{\\sqrt{m}}\\left(\\frac{\\log m \\, \\log|H|}{\\theta^2} + \\log(1/\\delta)\\right)^{1/2}\\right) \\tag{1}$$\n\nfor all $\\theta > 0$, where $E_S$ denotes the average over the sample $S$.\n\nOne way to think of these results is as a technique for adjusting the effective complexity of the function class by adjusting $\\theta$. Large values of $\\theta$ correspond to low complexity and small values to high complexity. If the learning algorithm were to optimize the parametrized cost function $E_S\\,\\mathrm{sgn}(\\theta - yf(x))$ for large values of $\\theta$, it would not be able to make fine distinctions between different functions in the class, and so the effective complexity of the class would be reduced. The second term in the error bounds (the regularization term involving the complexity parameter $\\theta$ and the size of the base hypothesis class $H$) would be correspondingly reduced. In both the neural network and boosting settings, the learning algorithms do not directly minimize these cost functions; we use different values of the complexity parameter in the cost functions only in explaining their generalization performance.\n\nIn this paper, we address the question: what are suitable cost functions for convex combinations of classifiers? 
In the next section, we give general conditions on parametrized families of cost functions that ensure that they can be used to give error bounds for convex combinations of classifiers. In the remainder of the paper, we investigate learning algorithms that choose the convex coefficients of a combined classifier by minimizing a suitable family of piecewise linear cost functions using gradient descent. Even when the base hypotheses are chosen by the AdaBoost algorithm, and we only use the new cost functions to adjust the convex coefficients, we obtained an improvement on the test error of AdaBoost in all but one of the UC Irvine data sets we used. Margin distribution plots show that in many cases the algorithm achieves these lower errors by sacrificing training error, in the interests of reducing the new cost function.\n\n2 Theory\n\nIn this section, we derive an error bound that generalizes the result for convex combinations of classifiers described in the previous section. The result involves a family of margin cost functions (functions mapping from the interval $[-1, 1]$ to $\\mathbb{R}^+$), indexed by an integer-valued complexity parameter $N$, which measures the resolution at which we examine the margins. The following definition gives conditions on the margin cost functions that relate the complexity $N$ to the amount by which the margin cost function is larger than the function $\\mathrm{sgn}(-yf(x))$. The particular form of this definition is not important. In particular, the functions $\\Psi_N$ are only used in the analysis in this section, and will not concern us later in the paper. 
\n\nDefinition 1 A family $\\{C_N : N \\in \\mathbb{N}\\}$ of margin cost functions is B-admissible for $B \\ge 0$ if for all $N \\in \\mathbb{N}$ there is an interval $Y \\subset \\mathbb{R}$ of length no more than $B$ and a function $\\Psi_N : [-1, 1] \\to Y$ that satisfies\n\n$$\\mathrm{sgn}(-\\alpha) \\le E_{Z \\sim Q_{N,\\alpha}}(\\Psi_N(Z)) \\le C_N(\\alpha)$$\n\nfor all $\\alpha \\in [-1, 1]$, where $E_{Z \\sim Q_{N,\\alpha}}(\\cdot)$ denotes the expectation when $Z$ is chosen randomly as $Z = (1/N)\\sum_{i=1}^{N} Z_i$ with $Z_i \\in \\{-1, 1\\}$ and $\\Pr(Z_i = 1) = (1 + \\alpha)/2$.\n\nAs an example, let $C_N(\\alpha) = \\mathrm{sgn}(\\theta - \\alpha) + c$, for $\\theta = 1/\\sqrt{N}$ and some constant $c$. This is a B-admissible family of margin cost functions, for suitably large $B$. (This is exhibited by the functions $\\Psi_N(\\alpha) = \\mathrm{sgn}(\\theta/2 - \\alpha) + c/2$; the proof involves Chernoff bounds.) Clearly, for larger values of $N$, the cost functions $C_N$ are closer to the threshold function $\\mathrm{sgn}(-\\alpha)$. Inequality (1) is implied by the following theorem. In this theorem, $\\mathrm{co}(H)$ is the set of convex combinations of functions from $H$. A similar proof gives the same result with $\\mathrm{VCdim}(H) \\ln m$ replacing $\\ln|H|$.\n\nTheorem 2 For any B-admissible family $\\{C_N : N \\in \\mathbb{N}\\}$ of margin cost functions, any finite hypothesis class $H$ and any distribution $P$ on $X \\times \\{-1, 1\\}$, with probability at least $1 - \\delta$ over a random sample $S$ of $m$ labelled examples chosen according to $P$, every $N$ and every $f$ in $\\mathrm{co}(H)$ satisfies\n\n$$\\Pr[yf(x) \\le 0] < E_S[C_N(yf(x))] + \\sqrt{\\frac{B^2}{2m}\\left(N \\ln|H| + \\ln(N(N+1)/\\delta)\\right)}.$$\n\nProof Fix $N$ and $f \\in \\mathrm{co}(H)$, and suppose that $f = \\sum_i \\alpha_i h_i$ for $h_i \\in H$. Define $\\mathrm{co}_N(H) = \\{(1/N)\\sum_{j=1}^{N} h_j : h_j \\in H\\}$, and notice that $|\\mathrm{co}_N(H)| \\le |H|^N$. As in the proof of (1) in [SFBL98], we show using the probabilistic method that there is a function $g$ in $\\mathrm{co}_N(H)$ that closely approximates $f$. 
Let $Q$ be the distribution on $\\mathrm{co}_N(H)$ corresponding to the average of $N$ independent draws from $\\{h_i\\}$ according to the distribution $\\{\\alpha_i\\}$, and let $Q_{N,\\alpha}$ be the distribution given in Definition 1. Then for any fixed pair $x, y$, when $g$ is chosen according to $Q$ the distribution of $yg(x)$ is $Q_{N,yf(x)}$. Now, fix the function $\\Psi_N$ implied by the B-admissibility condition. By the definition of B-admissibility,\n\n$$E_{g \\sim Q} E_P[\\Psi_N(yg(x))] = E_P E_{Z \\sim Q_{N,yf(x)}}[\\Psi_N(Z)] \\ge E_P\\,\\mathrm{sgn}(-yf(x)) = P[yf(x) \\le 0].$$\n\nSimilarly, $E_S[C_N(yf(x))] \\ge E_{g \\sim Q} E_S[\\Psi_N(yg(x))]$. Hence, if $\\Pr[yf(x) \\le 0] - E_S[C_N(yf(x))] \\ge \\epsilon_N$, then $E_{g \\sim Q}\\left[E_P[\\Psi_N(yg(x))] - E_S[\\Psi_N(yg(x))]\\right] \\ge \\epsilon_N$. Thus,\n\n$$\\begin{aligned} &\\Pr\\left[\\exists f \\in \\mathrm{co}(H) : \\Pr[yf(x) \\le 0] \\ge E_S[C_N(yf(x))] + \\epsilon_N\\right] \\\\ &\\quad \\le \\Pr\\left[\\exists g \\in \\mathrm{co}_N(H) : E_P[\\Psi_N(yg(x))] \\ge E_S[\\Psi_N(yg(x))] + \\epsilon_N\\right] \\\\ &\\quad \\le |H|^N \\exp(-2m\\epsilon_N^2/B^2), \\end{aligned}$$\n\nwhere the last inequality follows from the union bound and Hoeffding's inequality. Setting this probability to $\\delta_N = \\delta/(N(N+1))$, solving for $\\epsilon_N$, and summing over values of $N$ completes the proof, since $\\sum_{N \\in \\mathbb{N}} \\delta_N = \\delta$. $\\Box$\n\nFor the best bounds, we want $\\Psi_N$ to satisfy $E_{Z \\sim Q_{N,\\alpha}}[\\Psi_N(Z)] \\ge \\mathrm{sgn}(-\\alpha)$, but with the difference $E_{Z \\sim Q_{N,\\alpha}}[\\Psi_N(Z) - \\mathrm{sgn}(-\\alpha)]$ as small as possible for $\\alpha \\in [-1, 1]$. One approach would be to minimize the expectation of this difference, for $\\alpha$ chosen uniformly in $[-1, 1]$. However, this yields a non-monotone solution for $C_N(\\alpha)$. Figure 1a illustrates an example of a monotone B-admissible family; it shows the cost functions $C_N(\\alpha) = E_{Z \\sim Q_{N,\\alpha}} \\Psi_N(Z)$, for $N = 20$, 50 and 200, where $\\Psi_N(\\alpha) = \\mathrm{sgn}(\\sqrt{2 \\log N / N} - \\alpha) + 1/N$.\n\n3 Algorithm\n\nWe now consider how to select convex coefficients $w_1, \\ldots, w_T$ for a sequence of $\\{-1, 1\\}$ classifiers $h_1, \\ldots, h_T$ so that the combined classifier $f(x) = \\sum_{t=1}^{T} w_t h_t(x)$ has small error. In the experiments we used the hypotheses provided by AdaBoost. 
(The aim was to investigate how useful the error estimates provided by the cost functions of the previous section are.)\n\nIf we take Theorem 2 at face value and ignore log terms, the best error bound is obtained if the weights $w_1, \\ldots, w_T$ and the complexity $N$ are chosen to minimize $(1/m)\\sum_{i=1}^{m} C_N(y_i f(x_i)) + K\\sqrt{N/m}$, where $K$ is a constant and $\\{C_N\\}$ is a family of B-admissible cost functions. Although Theorem 2 provides an expression for the constant $K$, in practical problems this will almost certainly be an overestimate and so our penalty for even moderately complex models will be too great. To solve this problem, instead of optimizing the average cost of the margins plus a penalty term over all values of the parameter $\\theta$, we estimated the optimal value of $\\theta$ using a cross-validation set. That is, for fixed values of $\\theta$ in a discrete but fairly dense set we selected weights optimizing the average cost $(1/m)\\sum_{i=1}^{m} C_\\theta(y_i f(x_i))$ and then chose the solution with smallest error on an independent cross-validation set.\n\nFigure 1: (a) The margin cost functions $C_N(\\alpha)$, for $N = 20$, 50 and 200, compared to the function $\\mathrm{sgn}(-\\alpha)$. Larger values of $N$ correspond to closer approximations to $\\mathrm{sgn}(-\\alpha)$. (b) Piecewise linear upper bounds on the functions $C_N(\\alpha)$, and the function $\\mathrm{sgn}(-\\alpha)$.\n\nWe considered the use of the cost functions plotted in Figure 1a, but the existence of flat regions caused difficulties for gradient descent approaches. 
Instead we adopted a piecewise linear family of cost functions $C_\\theta$ that are linear in the intervals $[-1, 0]$, $[0, \\theta]$, and $[\\theta, 1]$, and pass through the points $(-1, 1.2)$, $(0, 1.1)$, $(\\theta, 0.1)$, and $(1, 0)$, for $\\theta \\in (0, 1)$. The numbers were chosen to ensure the $C_\\theta$ are upper bounds on the cost functions of Figure 1a (see Figure 1b). Note that $\\theta$ plays the role of a complexity parameter, except that in this case smaller values of $\\theta$ correspond to higher complexity classes.\n\nEven with the restriction to piecewise linear cost functions, the problem of optimizing $(1/m)\\sum_{i=1}^{m} C_\\theta(y_i f(x_i))$ is still hard. Fortunately, the nature of this cost function makes it possible to find successful heuristics (which is why we chose it). The algorithm we have devised to optimize the $C_\\theta$ family of cost functions is called Direct Optimization Of Margins (DOOM). (The pseudo-code of the algorithm is given in the full version [MBB98].) DOOM is basically a form of gradient descent, with two complications: it takes account of the fact that the cost function is not differentiable at 0 and $\\theta$, and it ensures that the weight vector lies on the unit ball in $\\ell_1$. In order to avoid problems with local minima we actually allow the weight vector to lie within the $\\ell_1$-ball throughout optimization rather than on the $\\ell_1$-ball. If the weight vector reaches the surface of the $\\ell_1$-ball and the update direction points out of the $\\ell_1$-ball, it is projected back to the surface of the $\\ell_1$-ball.\n\nObserve that the gradient of $(1/m)\\sum_{i=1}^{m} C_\\theta(y_i f(x_i))$ is a constant function of the weights $w = (w_1, \\ldots, w_T)$ provided no example $(x_i, y_i)$ \"crosses\" one of the discontinuities at 0 or $\\theta$ (i.e. provided the margin $y_i f(x_i)$ does not cross 0 or $\\theta$). 
Hence, the central operation of DOOM is to step in the negative gradient direction until an example's margin hits one of the discontinuities (projecting where necessary to ensure the weight vector lies within the $\\ell_1$-ball). At this point the gradient vector becomes multi-valued (generally two-valued but it can be more). Each of the possible gradient directions is then tested by taking a small step in that direction (a random subset of the gradient directions is chosen if there are too many of them). If none of the directions leads to a decrease in the cost, the examples whose margins lie on discontinuities of the cost function are added to a constraint set $E$. In subsequent iterations the same stepping procedure is followed, except that the step direction is modified to ensure that the examples in $E$ do not move (i.e. they remain on the discontinuity points of $C_\\theta$). That is, the weight vector moves within the subspace defined by the examples in $E$. If no progress is made in any iteration, the constraint set $E$ is reset to the empty set. If still no progress is made the procedure terminates.\n\n4 Experiments\n\nWe used the following two-class problems from the UC Irvine database [CBM98]: Cleveland Heart Disease, Credit Application, German, Glass, Ionosphere, King Rook vs King Pawn, Pima Indians Diabetes, Sonar, Tic-Tac-Toe, and Wisconsin Breast Cancer. For the sake of simplicity we did not consider multi-class problems. Each data set was randomly separated into train, test and validation sets, with the test and validation sets being equal in size. This was repeated 10 times independently and the results were averaged. 
\n\nEach experiment consisted of the following steps. First, AdaBoost was run on the training data to produce a sequence of base classifiers and their corresponding weights. In all of the experiments the base classifiers were axis-orthogonal hyperplanes (also known as decision stumps); this choice ensured that the complexity of the class of base classifiers was constant. Boosting was halted when adding a new classifier failed to decrease the error on the validation set. DOOM was then run on the classifiers produced by AdaBoost for a large range of $\\theta$ values and 1000 random initial weight vectors for each value of $\\theta$. The weight vector (and $\\theta$ value) with minimum misclassification on the validation set was chosen as the final solution.\n\nFigure 2: Relative improvement of DOOM over AdaBoost for all examined datasets. (Horizontal axis: AdaBoost test error (%).)\n\nIn some cases the training sets were reduced in size to make overfitting more likely, so that complexity regularization with DOOM could have an effect. (The details are given in the full version [MBB98].) In three of the datasets (Credit Application, Wisconsin Breast Cancer and Pima Indians Diabetes), AdaBoost gained no advantage from using more than a single classifier. In these datasets, the number of classifiers was chosen so that the validation error was reasonably stable.\n\nA comparison between the test errors generated by AdaBoost and DOOM is shown in Figure 2. In only one data set did DOOM produce a classifier which performed worse than AdaBoost in terms of test error; for most data sets DOOM's test error was a significant improvement over AdaBoost's.\n\nFigure 3 shows cumulative training margin distribution graphs for four of the datasets for both AdaBoost and DOOM (with optimal $\\theta$ chosen by cross-validation). For a given margin the value on the curve corresponds to the proportion of training examples with margin no more than this value. The test errors for both algorithms are also shown for comparison, as short horizontal lines on the vertical axis.\n\nFigure 3: Cumulative training margin distributions for four datasets (panels: Wisconsin Breast Cancer, Credit Application, Ionosphere, Sonar; horizontal axis: Margin). The dark curve is AdaBoost, the light curve is DOOM with $\\theta$ selected by cross-validation. The test errors for both algorithms are marked on the vertical axis at margin 0.\n\nThe margin distributions show that the value of the minimum training margin has no real impact on generalization performance. (See also [Bre97] and [GS98].) As can be seen in Figure 3 (Credit Application and Sonar data sets), the generalization performance of the combined classifier produced by DOOM can be as good as or better than that of the classifier produced by AdaBoost, despite having dramatically worse minimum training margin. Conversely, Figure 3 (Ionosphere data set) shows that improved generalization performance can be associated with an improved minimum margin.\n\nThe margin distributions also show that there is a balance to be found between training error and complexity (as measured by $\\theta$). DOOM is willing to sacrifice training error in order to reduce complexity and thereby obtain a better margin distribution. For instance, in Figure 3 (Sonar data set), DOOM's training error is over 20% while AdaBoost's is 0%, but DOOM's test error is 5% less than AdaBoost's. The reason for this success can be seen in Figure 4, which illustrates the changes in the cost function, training error, and test error as a function of $\\theta$. The optimal complexity for this data set is low (corresponding to a large optimal $\\theta$). In this case, a reduction in complexity is more important to generalization error than a reduction in training error.\n\n5 Conclusion\n\nIn this paper we have addressed the question: what are suitable cost functions for convex combinations of base hypotheses? For general families of cost functions that are functions of the margin of a sample, we proved (Theorem 2) that the error of a convex combination is no more than the sample average of the cost function plus a regularization term involving the complexity of the cost function and the size of the base hypothesis class. 
\n\nWe constructed a piecewise linear family of cost functions satisfying the conditions of Theorem 2 and presented a heuristic algorithm (DOOM) for optimizing the sample average of the cost.\n\nFigure 4: Sonar data set. Left: Plot of cost ($(1/m)\\sum_{i=1}^{m} C_\\theta(y_i f(x_i))$) against $\\theta$ for AdaBoost and DOOM. Right: Plot of training and test error against $\\theta$. (Legend: AdaBoost Train, AdaBoost Test, DOOM Train, DOOM Test.)\n\nWe ran experiments on several of the datasets in the UC Irvine database, in which AdaBoost was used to generate a set of base classifiers and then DOOM was used to find the optimal convex combination of those classifiers. In all but one case the convex combination generated by DOOM had lower test error than AdaBoost's combination. Margin distribution plots show that in many cases DOOM achieves these lower test errors by sacrificing training error, in the interests of reducing the new cost function. The margin plots also show that the size of the minimum margin is not relevant to generalization performance.\n\nAcknowledgments\n\nThanks to Yoav Freund, Wee Sun Lee and Rob Schapire for helpful comments and suggestions. This research was supported in part by a grant from the Australian Research Council. Jonathan Baxter was supported by an Australian Research Council Fellowship and Llew Mason was supported by an Australian Postgraduate Award.\n\nReferences\n\n[Bar98] P. L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998.\n\n[Bre97] L. Breiman. Prediction games and arcing algorithms. Technical Report 504, Department of Statistics, University of California, Berkeley, 1997.\n\n[CBM98] E. Keogh, C. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.\n\n[GS98] A. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 692-699, 1998.\n\n[MBB98] L. Mason, P. L. Bartlett, and J. Baxter. Improved generalization through explicit optimization of margins. Technical report, Department of Systems Engineering, Australian National University, 1998. (Available from http://syseng.anu.edu.au/lsg).\n\n[SFBL98] R. E. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics, (to appear), 1998.", "award": [], "sourceid": 1553, "authors": [{"given_name": "Llew", "family_name": "Mason", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Jonathan", "family_name": "Baxter", "institution": null}]}