{"title": "ARC-LH: A New Adaptive Resampling Algorithm for Improving ANN Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 522, "page_last": 528, "abstract": null, "full_text": "ARC-LH: A New Adaptive Resampling Algorithm for Improving ANN Classifiers\n\nFriedrich Leisch\n\nKurt Hornik\n\nFriedrich.Leisch@ci.tuwien.ac.at\n\nKurt.Hornik@ci.tuwien.ac.at\n\nInstitut f\u00fcr Statistik und Wahrscheinlichkeitstheorie\n\nTechnische Universit\u00e4t Wien\n\nA-1040 Wien, Austria\n\nAbstract\n\nWe introduce arc-lh, a new algorithm for improving ANN classifier performance, which measures the importance of patterns by aggregated network output errors. On several artificial benchmark problems, this algorithm compares favorably with other resample and combine techniques.\n\n1 Introduction\n\nThe training of artificial neural networks (ANNs) is usually a stochastic and unstable process. As the weights of the network are initialized at random and training patterns are presented in random order, ANNs trained on the same data will typically differ in weight values and performance. In addition, small changes in the training set can lead to two completely different trained networks with different performance, even if the nets had the same initial weights.\n\nRoughly speaking, ANNs have a low bias because of their approximation capabilities, but a rather high variance because of their instability. Recently, several resample and combine techniques for improving ANN performance have been proposed. In this paper we introduce a new arcing (\"adaptive resample and combine\") method called arc-lh. In contrast to the arc-fs method by Freund & Schapire (1995), which uses misclassification rates for adapting the resampling probabilities, arc-lh uses the aggregated network output error.  
The performance of arc-lh is compared with other techniques on several popular artificial benchmark problems.\n\n2 Bias-Variance Decomposition of 0-1 Loss\n\nConsider the task of classifying a random vector \\xi taking values in X into one of c classes G_1, ..., G_c, and let g(.) be a classification function mapping the input space onto the finite set {1, ..., c}.\n\nThe classification task is to find an optimal function g minimizing the risk\n\nR g = \\mathbb{E} L_g(\\xi) = \\int_X L_g(x) \\, dF(x)   (1)\n\nwhere F denotes the (typically unknown) distribution function of \\xi, and L is a loss function. In this paper, we consider 0-1 loss only, i.e., the loss is 1 for all misclassified patterns and zero otherwise.\n\nIt is well known that the optimal classifier, i.e., the classifier with minimum risk, is the Bayes classifier g* assigning to each input x the class with maximum posterior probability P(G_n | x). These posterior probabilities are typically unknown, hence the Bayes classifier cannot be used directly. Note that R g* = 0 for disjoint classes and R g* > 0 otherwise.\n\nLet X_N = {x_1, ..., x_N} be a set of independent input vectors for which the true class is known, available for training the classifier. Further, let g_{X_N}(.) denote a classifier trained using set X_N. The risk R g_{X_N} \u2265 R g* of classifier g_{X_N} is a random variable depending on the training sample X_N. In the case of ANN classifiers it also depends on the network training, i.e., even for fixed X_N the performance of a trained ANN is a random variable depending on the initialization of weights and the (often random) presentation of the patterns x_n during training. 
\n\nFollowing Breiman (1996a) we decompose the risk of a classifier into the (minimum possible) Bayes error, a systematic bias term of the model class and the variance of the classifier within its model class. We call a classifier model unbiased for input x if, over replications of all possible training sets X_N of size N, network initializations and pattern presentations, g picks the correct class more often than any other class. Let U = U(g) denote the set of all x \u2208 X where g is unbiased, and B = B(g) = X \\ U the set of all points where g is biased. The risk of classifier g can be decomposed as\n\nR g = R g* + Bias(g) + Var(g)   (2)\n\nwhere R g* is the risk of the Bayes classifier,\n\nBias(g) = R_B g - R_B g*\nVar(g) = R_U g - R_U g*\n\nand R_B and R_U denote the risk on set B and U, respectively, i.e., the integration in Equation 1 is over B or U instead of X, respectively.\n\nA simpler bias-variance decomposition has been proposed by Kong & Dietterich (1995):\n\nBias(g) = P{B}\nVar(g) = R g - Bias(g)\n\nThe size of the bias set is seen as the bias of the model (i.e., the error the model class \"typically\" makes). The variance is simply the difference between the actual risk and this bias term. This decomposition yields negative variance if the current classifier performs better than the average classifier.\n\nIn both decompositions, the bias gives the systematic risk of the model, whereas the variance measures how good the current realization is compared to the best possible realization of the model. Neural networks are very powerful but rather unstable approximators, hence their bias should be low, but the variance may be high.\n\n3 Resample and Combine\n\nSuppose we had k independent training sets X_{N_1}, ..., X_{N_k} and corresponding classifiers g_1, ... 
, g_k trained using these sets, respectively. We can then combine these single classifiers into a joint voting classifier g_k^v by assigning to each input x the class the majority of the g_j vote for. If the g_j have low bias, then g_k^v should have low bias, too. If the model is unbiased for an input x, then the variance of g_k^v vanishes as k \u2192 \u221e, and g^v = lim_{k \u2192 \u221e} g_k^v is optimal for x. Hence, by resampling training sets from the original training set and combining the resulting classifiers into a voting classifier it might be possible to reduce the high variance of unstable classification algorithms.\n\n[Figure: resample and combine scheme. Labels: training sets X_N, X_{N_1}, X_{N_2}, ..., X_{N_k}; ANN classifiers g_1, g_2, ..., g_k; resample, adapt, combine.]\n\n3.1 Bagging\n\nBreiman (1994, 1996a) introduced a procedure called bagging (\"bootstrap aggregating\") for tree classifiers that may also be used for ANNs. The bagging algorithm starts with a training set X_N of size N. Several bootstrap replicates X_N^1, ..., X_N^k are constructed and a neural network is trained on each. These networks are finally combined by majority voting. The bootstrap sets X_N^i consist of N patterns drawn with replacement from the original training set (see Efron & Tibshirani (1993) for more information on the bootstrap).\n\n3.2 Arcing\n\n3.2.1 Arcing Based on Misclassification Rates\n\nArcing, which is a more sophisticated version of bagging, was first introduced by Freund & Schapire (1995) and called boosting.  
The new training sets are not constructed by uniformly sampling from the empirical distribution of the training set X_N, but from a distribution over X_N that includes information about previous misclassifications.\n\nLet p_n^i denote the probability that pattern x_n is included into the i-th training set X_N^i and initialize with p_n^1 = 1/N. Freund and Schapire's arcing algorithm, called arc-fs as in Breiman (1996a), works as follows:\n\n1. Construct a pattern set X_N^i by sampling with replacement with probabilities p_n^i from X_N and train a classifier g_i using set X_N^i.\n\n2. Set d_n = 1 for all patterns that are misclassified by g_i and zero otherwise. With \\epsilon_i = \\sum_{n=1}^N p_n^i d_n and \\beta_i = (1 - \\epsilon_i)/\\epsilon_i update the probabilities by\n\np_n^{i+1} = \\frac{p_n^i \\beta_i^{d_n}}{\\sum_{m=1}^N p_m^i \\beta_i^{d_m}}\n\n3. Set i := i + 1 and repeat.\n\nAfter k steps, g_1, ..., g_k are combined with weighted voting where each g_j's vote has weight log \\beta_j. Breiman (1996a) and Quinlan (1996) compare bagging and arcing for CART and C4.5 classifiers, respectively. Both bagging and arc-fs are very effective in reducing the high variance component of tree classifiers, with adaptive resampling being a bit better than simple bagging.\n\n3.2.2 Arcing Based on Network Error\n\nIndependently from the arcing and bagging procedures described above, adaptive resampling has been introduced for active pattern selection in leave-k-out cross-validation CV/APS (Leisch & Jain, 1996; Leisch et al., 1995). Whereas arc-fs (or Breiman's arc-x4) uses only the information whether a pattern is misclassified or not, in CV/APS the fact that MLPs approximate the posterior probabilities of the classes (Kanaya & Miyake, 1991) is utilized, too.  
We introduce a simple new arcing method based on the main idea of CV/APS that the \"importance\" of a pattern for the learning process can be measured by the aggregated output error of an MLP for the pattern over several training runs.\n\nLet the classifier g be an ANN using 1-of-c coding, i.e., one output node per class; the target t(x) for each input x is one at the node corresponding to the class of x and zero at the remaining output nodes. Let e(x) = |t(x) - g(x)|^2 be the squared error of the network for input x. Patterns that repeatedly have high output errors are somewhat harder to learn for the network and therefore their resampling probabilities are increased proportionally to the error. Error-dependent resampling introduces a \"grey-scale\" of pattern importance as opposed to the \"black and white\" paradigm of misclassification-dependent resampling.\n\nAgain let p_n^i denote the probability that pattern x_n is included into the i-th training set X_N^i and initialize with p_n^1 = 1/N. Our new arcing algorithm, called arc-lh, works as follows:\n\n1. Construct a pattern set X_N^i by sampling with replacement with probabilities p_n^i from X_N and train a classifier g_i using set X_N^i.\n\n2. Add the network output error of each pattern to the resampling probabilities:\n\n3. Set i := i + 1 and repeat.\n\nAfter k steps, g_1, ..., g_k are combined by majority voting.\n\n3.3 Jittering\n\nIn our experiments, we also compare the above resample and combine methods with jittering, which resamples the training set by contaminating the inputs with artificial noise. No voting is done, but the size of the training set is increased by creating artificial inputs \"around\" the original inputs, see Koistinen & Holmstrom (1992). 
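

The two resampling schemes of Section 3.2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and since the arc-lh update formula is not reproduced in the text above, the sketch assumes that the squared output error is added to each probability and the result is renormalized to sum to one.

```python
import math
import random


def arc_fs_update(p, misclassified):
    # d_n = 1 for patterns misclassified by g_i, 0 otherwise.
    d = [1 if m else 0 for m in misclassified]
    eps = sum(pn * dn for pn, dn in zip(p, d))  # epsilon_i
    beta = (1.0 - eps) / eps                    # beta_i, requires eps > 0
    w = [pn * beta ** dn for pn, dn in zip(p, d)]
    total = sum(w)
    # New probabilities and the voting weight log(beta_i) of classifier g_i.
    return [wn / total for wn in w], math.log(beta)


def squared_error(target, output):
    # e(x) = |t(x) - g(x)|^2 for one pattern with 1-of-c coded target.
    return sum((t - o) ** 2 for t, o in zip(target, output))


def arc_lh_update(p, targets, outputs):
    e = [squared_error(t, o) for t, o in zip(targets, outputs)]
    # ASSUMPTION: add the error to each probability, then renormalize.
    w = [pn + en for pn, en in zip(p, e)]
    total = sum(w)
    return [wn / total for wn in w]


def resample(p, n, rng):
    # Step 1: draw n pattern indices with replacement using probabilities p.
    return rng.choices(range(len(p)), weights=p, k=n)
```

In both schemes the updated probabilities feed the next call to resample; only arc-fs also returns a voting weight, since arc-lh combines its classifiers by unweighted majority voting.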
\n\n4 Experiments\n\nWe demonstrate the effects of bagging and arcing on several well-known artificial benchmark problems. For all problems, i-h-c single hidden layer perceptrons (SHLPs) with i input, h hidden and c output nodes were used. The number of hidden nodes h was chosen such that the corresponding networks have reasonably low bias.\n\n2 Spirals with noise: 2-dimensional input, 2 classes. Inputs with uniform noise around two spirals. N = 300. R g* = 0%. 2-14-2 SHLP.\n\nContinuous XOR: 2-dimensional input, 2 classes. Uniform inputs on the 2-dimensional square -1 \u2264 x, y \u2264 1 classified into the two classes x * y \u2265 0 and x * y < 0. N = 300. R g* = 0%. 2-4-2 SHLP.\n\nRingnorm: 20-dimensional input, 2 classes. Class 1 is normal with mean zero and covariance 4 times the identity matrix. Class 2 is a unit normal with mean (a, a, ..., a), a = 2/\\sqrt{20}. N = 300. R g* = 1.2%. 20-4-2 SHLP.\n\nThe first two problems are standard benchmark problems (note however that we use a noisy variant of the standard spirals problem); the last one is, e.g., used in Breiman (1994, 1996a).\n\nAll experiments were replicated 50 times; in each bagging and arcing replication 10 classifiers were combined to build a voting classifier. Generalization errors were computed using Monte Carlo techniques on test sets of size 10000.\n\nTable 1 gives the average risk over the 50 replications for a standard single SHLP, an SHLP trained on a jittered training set, and for voting classifiers using ten votes constructed with bagging, arc-lh and arc-fs, respectively. The Bayes risk of the spiral and xor examples is zero, hence the risk of a network equals the sum of its bias and variance. The Bayes risk of the ringnorm example is 1.2%. 
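

Both decompositions of Section 2 can be estimated from the predictions of replicated classifiers on a common test set. The following Python sketch is only an illustration under stated assumptions: the function name and data layout are hypothetical, the Bayes risk is assumed to be zero (as in the spiral and xor problems), the bias set B is estimated by plurality over the replications, and the risk is averaged over replications.

```python
from collections import Counter


def decompose(predictions, labels):
    # predictions[r][n]: class predicted for test point n in replication r.
    # labels[n]: true class of test point n (ASSUMPTION: Bayes risk is 0).
    reps, n_points = len(predictions), len(labels)
    risk_b = 0.0   # accumulated error on the biased set B
    risk_u = 0.0   # accumulated error on the unbiased set U
    n_biased = 0
    for n, y in enumerate(labels):
        votes = Counter(run[n] for run in predictions)
        err = 1.0 - votes[y] / reps
        # Unbiased: the correct class is picked more often than any other.
        unbiased = all(votes[y] > c for k, c in votes.items() if k != y)
        if unbiased:
            risk_u += err
        else:
            risk_b += err
            n_biased += 1
    risk = (risk_b + risk_u) / n_points
    breiman = (risk_b / n_points, risk_u / n_points)  # (bias, variance)
    p_biased = n_biased / n_points                    # estimate of P{B}
    kong_dietterich = (p_biased, risk - p_biased)     # (bias, variance)
    return risk, breiman, kong_dietterich
```

With zero Bayes risk, bias and variance sum to the risk in both decompositions, which is the relation the spiral and xor results rely on.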
\n\n                      Breiman       Kong & Dietterich\n             R g  Bias(g)   Var(g)  Bias(g)   Var(g)\n2 Spirals\nstandard    7.75     0.32     7.43     0.82     6.93\njitter      6.53     0.26     6.27     0.52     6.02\nbagging     4.39     0.35     4.04     0.68     3.71\narc-fs      4.31     0.35     3.96     0.60     3.71\narc-lh      4.32     0.31     4.01     0.72     3.60\nXOR\nstandard    6.54     0.53     6.01     1.32     5.22\njitter      6.29     0.37     5.92     1.08     5.21\nbagging     3.69     0.59     3.09     1.22     2.47\narc-fs      3.73     0.58     3.15     1.12     2.61\narc-lh      3.58     0.50     3.08     1.20     2.38\nRingnorm\nstandard   18.64     9.19     8.26    13.84     4.80\njitter     18.56     9.03     8.34    13.72     4.84\nbagging    15.72     9.61     4.91    13.54     2.18\narc-fs     15.71     9.70     4.81    13.58     2.13\narc-lh     15.63     9.30     5.13    13.20     2.43\n\nTable 1: Bias-variance decompositions (all values in %).\n\nThe variance part was drastically reduced by the resample & combine methods, with only a negligible change in bias. Note the low bias in the spiral and xor problems. ANNs obviously can solve these classification tasks (one could create appropriate nets by hand), but of course training cannot find the exact boundaries between the classes. Averaging over several nets helps to overcome this problem. The bias in the ringnorm example is rather high, indicating that a change of network topology (bigger net, etc.) or training algorithm (learning rate, etc.) may lower the overall risk.\n\n5 Summary\n\nComparison of the resample and combine algorithms shows slight advantages for adaptive resampling, but no algorithm dominates the other two. Further improvements should be possible based on a better understanding of the theoretical properties of resample and combine techniques. These issues are currently being investigated.\n\nReferences\n\nBreiman, L. (1994). 
Bagging predictors. Tech. Rep. 421, Department of Statistics, University of California, Berkeley, California, USA.\n\nBreiman, L. (1996a). Bias, variance, and arcing classifiers. Tech. Rep. 460, Statistics Department, University of California, Berkeley, CA, USA.\n\nBreiman, L. (1996b). Stacked regressions. Machine Learning, 24, 49.\n\nDrucker, H. & Cortes, C. (1996). Boosting decision trees. In Touretzky, S., Mozer, M. C., & Hasselmo, M. E. (eds.), Advances in Neural Information Processing Systems, vol. 8. MIT Press.\n\nEfron, B. & Tibshirani, R. J. (1993). An introduction to the bootstrap. Monographs on Statistics and Applied Probability. New York: Chapman & Hall.\n\nFreund, Y. & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. Tech. rep., AT&T Bell Laboratories, 600 Mountain Ave, Murray Hill, NJ, USA.\n\nKanaya, F. & Miyake, S. (1991). Bayes statistical behavior and valid generalization of pattern classifying neural networks. IEEE Transactions on Neural Networks, 2(4), 471-475.\n\nKohavi, R. & Wolpert, D. H. (1996). Bias plus variance decomposition for zero-one loss. In Machine Learning: Proceedings of the 19th International Conference.\n\nKoistinen, P. & Holmstrom, L. (1992). Kernel regression and backpropagation training with noise. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (eds.), Advances in Neural Information Processing Systems, vol. 4, pp. 1033-1039. Morgan Kaufmann Publishers, Inc.\n\nKong, E. B. & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the 12th International Conference, pp. 313-321. Morgan-Kaufmann.\n\nLeisch, F. & Jain, L. C. (1996). 
Cross-validation with active pattern selection for neural network classifiers. Submitted to IEEE Transactions on Neural Networks, in review.\n\nLeisch, F., Jain, L. C., & Hornik, K. (1995). NN classifiers: Reducing the computational cost of cross-validation by active pattern selection. In Artificial Neural Networks and Expert Systems, vol. 2. Los Alamitos, CA, USA: IEEE Computer Society Press.\n\nQuinlan, J. R. (1996). Bagging, boosting and C4.5. University of Sydney, Australia.\n\nRipley, B. D. (1996). Pattern recognition and neural networks. Cambridge, UK: Cambridge University Press.\n\nTibshirani, R. (1996a). Bias, variance and prediction error for classification rules. University of Toronto, Canada.\n\nTibshirani, R. (1996b). A comparison of some error estimates for neural network models. Neural Computation, 8(1), 152-163.\n", "award": [], "sourceid": 1198, "authors": [{"given_name": "Friedrich", "family_name": "Leisch", "institution": null}, {"given_name": "Kurt", "family_name": "Hornik", "institution": null}]}