{"title": "Neural Network Ensembles, Cross Validation, and Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 231, "page_last": 238, "abstract": null, "full_text": "Neural Network Ensembles, Cross \nValidation, and Active Learning \n\nAnders Krogh\" \n\nNordita \n\nBlegdamsvej  17 \n\n2100  Copenhagen,  Denmark \n\nJesper Vedelsby \n\nElectronics Institute,  Building 349 \nTechnical University of Denmark \n\n2800  Lyngby,  Denmark \n\nAbstract \n\nLearning  of continuous  valued  functions  using  neural  network  en(cid:173)\nsembles  (committees) can give  improved accuracy,  reliable estima(cid:173)\ntion of the generalization error,  and active learning.  The  ambiguity \nis  defined as the variation of the output of ensemble members aver(cid:173)\naged  over  unlabeled  data, so  it quantifies the  disagreement  among \nthe networks.  It is  discussed  how  to use the ambiguity in combina(cid:173)\ntion with cross-validation to give a reliable estimate of the ensemble \ngeneralization error, and how this type of ensemble cross-validation \ncan sometimes improve performance.  It is  shown  how  to estimate \nthe optimal weights of the ensemble members using unlabeled  data. \nBy a generalization of query  by  committee, it is finally shown how \nthe ambiguity can be used to select new training data to be labeled \nin an active learning scheme. \n\n1 \n\nINTRODUCTION \n\nIt is well known that a combination of many different predictors can improve predic(cid:173)\ntions.  In the  neural networks  community \"ensembles\"  of neural networks  has  been \ninvestigated  by  several  authors,  see  for  instance  [1,  2,  3].  Most  often the networks \nin  the  ensemble  are  trained individually and  then  their  predictions  are  combined. \nThis combination is  usually done  by  majority (in  classification)  or  by  simple aver(cid:173)\naging (in regression),  but one can also use a weighted combination of the networks . \n\n.. Author to whom  correspondence  should  be addressed.  Email:  kroghlnordita. elk \n\n\f232 \n\nAnders Krogh,  Jesper  Vedelsby \n\nAt the workshop after the last  NIPS  conference  (December,  1993)  an entire session \nwas  devoted  to ensembles of neural networks  ( \"Putting it all together\",  chaired  by \nMichael Perrone) .  Many interesting papers were  given, and it showed that this area \nis  getting a  lot of attention. \n\nA  combination of the output of several  networks  (or other predictors)  is  only useful \nif they disagree on  some inputs.  Clearly,  there is  no  more information to be gained \nfrom  a  million  identical  networks  than  there  is  from  just  one  of them  (see  also \n[2]).  By  quantifying the  disagreement  in  the ensemble  it  turns  out  to  be  possible \nto  state  this  insight  rigorously  for  an  ensemble  used  for  approximation  of  real(cid:173)\nvalued functions  (regression).  The simple and beautiful expression  that relates the \ndisagreement  (called  the  ensemble  ambiguity)  and  the  generalization  error  is  the \nbasis for  this paper,  so we  will derive  it with no further  delay. \n\n2  THE BIAS-VARIANCE  TRADEOFF \n\nAssume the task is to learn a function J from RN to R for which you have a sample \nof p  examples,  (xiJ , yiJ),  where  yiJ  =  J(xiJ)  and  J.t  =  1, . . . ,p.  These  examples \nare  assumed  to  be  drawn  randomly  from  the  distribution  p(x) .  Anything  in  the \nfollowing is  easy  to generalize to several  output variables. 
The ensemble consists of $N$ networks, and the output of network $\alpha$ on input $x$ is called $V^\alpha(x)$. A weighted ensemble average is denoted by a bar, like

$$\bar{V}(x) = \sum_\alpha w_\alpha V^\alpha(x). \qquad (1)$$

This is the final output of the ensemble. We think of the weight $w_\alpha$ as our belief in network $\alpha$ and therefore constrain the weights to be positive and sum to one. The constraint on the sum is crucial for some of the following results.

The ambiguity on input $x$ of a single member of the ensemble is defined as $a^\alpha(x) = (V^\alpha(x) - \bar{V}(x))^2$. The ensemble ambiguity on input $x$ is

$$\bar{a}(x) = \sum_\alpha w_\alpha a^\alpha(x) = \sum_\alpha w_\alpha (V^\alpha(x) - \bar{V}(x))^2. \qquad (2)$$

It is simply the variance of the weighted ensemble around the weighted mean, and it measures the disagreement among the networks on input $x$. The quadratic errors of network $\alpha$ and of the ensemble are

$$\epsilon^\alpha(x) = (f(x) - V^\alpha(x))^2 \qquad (3)$$
$$e(x) = (f(x) - \bar{V}(x))^2 \qquad (4)$$

respectively. Adding and subtracting $f(x)$ in (2) yields

$$\bar{a}(x) = \sum_\alpha w_\alpha \epsilon^\alpha(x) - e(x) \qquad (5)$$

(after a little algebra, using that the weights sum to one). Calling the weighted average of the individual errors $\bar{\epsilon}(x) = \sum_\alpha w_\alpha \epsilon^\alpha(x)$, this becomes

$$e(x) = \bar{\epsilon}(x) - \bar{a}(x). \qquad (6)$$

All these formulas can be averaged over the input distribution. Averages over the input distribution will be denoted by capital letters, so

$$E^\alpha = \int dx\, p(x)\, \epsilon^\alpha(x) \qquad (7)$$
$$A^\alpha = \int dx\, p(x)\, a^\alpha(x) \qquad (8)$$
$$E = \int dx\, p(x)\, e(x). \qquad (9)$$

The first two of these are the generalization error and the ambiguity, respectively, for network $\alpha$, and $E$ is the generalization error for the ensemble. From (6) we then find for the ensemble generalization error

$$E = \bar{E} - \bar{A}. \qquad (10)$$

The first term on the right is the weighted average of the generalization errors of the individual networks ($\bar{E} = \sum_\alpha w_\alpha E^\alpha$), and the second is the weighted average of the ambiguities ($\bar{A} = \sum_\alpha w_\alpha A^\alpha$), which we refer to as the ensemble ambiguity.

The beauty of this equation is that it separates the generalization error into a term that depends on the generalization errors of the individual networks and another term that contains all correlations between the networks. Furthermore, the correlation term $\bar{A}$ can be estimated entirely from unlabeled data, i.e., no knowledge is required of the real function to be approximated. The term "unlabeled example" is borrowed from classification problems, and in this context it means an input $x$ for which the value of the target function $f(x)$ is unknown.

Equation (10) expresses the tradeoff between bias and variance in the ensemble, but in a different way than the common bias-variance relation [4], in which the averages are over possible training sets instead of ensemble averages. If the ensemble is strongly biased, the ambiguity will be small, because the networks implement very similar functions and thus agree on inputs even outside the training set. Therefore the generalization error will be essentially equal to the weighted average of the generalization errors of the individual networks. If, on the other hand, there is a large variance, the ambiguity is high, and in this case the generalization error will be smaller than the average generalization error. See also [5].
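Since (10) is an exact identity, it can be checked numerically. The following is a minimal sketch, where a few arbitrary stand-in regressors play the role of trained networks and the target is assumed known so that all three terms can be evaluated; none of these particular choices come from the paper.

```python
import numpy as np

# Hypothetical stand-ins for trained networks V^alpha(x); any set of
# distinct regressors would do for this illustration.
members = [np.sin, np.cos, np.tanh]
w = np.array([0.5, 0.3, 0.2])            # positive weights summing to one

f = lambda x: np.sin(x)                  # target function (assumed known here)
x = np.random.uniform(-2, 2, 10000)      # sample from the input distribution p(x)

V = np.stack([m(x) for m in members])    # V[alpha, mu], member outputs
V_bar = w @ V                            # ensemble output, eq. (1)

eps = (f(x) - V) ** 2                    # individual quadratic errors, eq. (3)
amb = (V - V_bar) ** 2                   # individual ambiguities a^alpha(x)
E_bar = w @ eps.mean(axis=1)             # weighted average generalization error
A_bar = w @ amb.mean(axis=1)             # ensemble ambiguity, eq. (8) averaged
E = ((f(x) - V_bar) ** 2).mean()         # ensemble generalization error, eq. (9)

print(E, E_bar - A_bar)                  # agree up to rounding, eq. (10)
```

Note that `A_bar` uses only the member outputs: the term that pulls the ensemble error below the average member error is computable without a single label.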
From equation (10) one can immediately see that the generalization error of the ensemble is always smaller than or equal to the (weighted) average of the individual generalization errors, $E \le \bar{E}$. In particular, for uniform weights,

$$E \le \frac{1}{N} \sum_\alpha E^\alpha, \qquad (11)$$

which has been noted by several authors, see e.g. [3].

3 THE CROSS-VALIDATION ENSEMBLE

From (10) it is obvious that increasing the ambiguity (while not increasing individual generalization errors) will improve the overall generalization. We want the networks to disagree! How can we increase the ambiguity of the ensemble? One way is to use different types of approximators, like a mixture of neural networks of different topologies or a mixture of completely different types of approximators. Another obvious way is to train the networks on different training sets. Furthermore, to be able to estimate the first term in (10) it would be desirable to have some kind of cross-validation. This suggests the following strategy.

Figure 1: An ensemble of five networks was trained to approximate the square wave target function f(x). The final ensemble output (solid smooth curve) and the outputs of the individual networks (dotted curves) are shown. Also the square root of the ambiguity is shown (dash-dot line). For training, 200 random examples were used, but each network had a cross-validation set of size 40, so they were each trained on 160 examples.

Choose a number $K \le p$. For each network in the ensemble hold out $K$ examples for testing, where the $N$ test sets should have minimal overlap, i.e., the $N$ training sets should be as different as possible. If, for instance, $K \le p/N$, it is possible to choose the $K$ test sets with no overlap. This enables us to estimate the generalization error $E^\alpha$ of the individual members of the ensemble, and at the same time make sure that the ambiguity increases. When holding out examples, the generalization errors of the individual members of the ensemble, $E^\alpha$, will increase, but the conjecture is that for a good choice of the size of the ensemble ($N$) and the test set size ($K$), the ambiguity will increase more, and thus one will get a decrease in overall generalization error.

This conjecture has been tested experimentally on a simple square wave function of one variable, shown in Figure 1. Five identical feed-forward networks with one hidden layer of 20 units were trained independently by back-propagation using 200 random examples. For each network a cross-validation set of K examples was held out for testing as described above. The "true" generalization error and the ambiguity were estimated from a set of 1000 random inputs. The weights were uniform, $w_\alpha = 1/5$ (non-uniform weights are addressed later). A code sketch of the held-out-set scheme is given below.
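The data-splitting scheme just described is straightforward to implement. The sketch below is a minimal illustration, assuming $p$ labeled examples and disjoint test sets of size $K \le p/N$; the `train` and `test_error` routines in the usage comment are hypothetical stand-ins for back-propagation training and quadratic-error estimation, not routines from the paper.

```python
import numpy as np

def cv_ensemble_splits(p, n_members, K, seed=None):
    """Assign each ensemble member a held-out test set of size K.

    Assumes K <= p // n_members, so the test sets can be disjoint and
    the training sets maximally different (the strategy of Section 3).
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(p)
    splits = []
    for alpha in range(n_members):
        test_idx = perm[alpha * K:(alpha + 1) * K]   # disjoint block of size K
        train_idx = np.setdiff1d(perm, test_idx)     # the remaining p - K examples
        splits.append((train_idx, test_idx))
    return splits

# Usage sketch (train / test_error are hypothetical stand-ins):
#
# splits = cv_ensemble_splits(p=200, n_members=5, K=40)
# members = [train(X[tr], y[tr]) for tr, _ in splits]
# E_alpha = [test_error(net, X[te], y[te])
#            for net, (_, te) in zip(members, splits)]
```

Each member then sees a different 160-example training set (in the Figure 1 setup), and its held-out block doubles as the cross-validation set used to estimate $E^\alpha$.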
Figure 2: The solid line shows the generalization error for uniform weights as a function of K, where K is the size of the cross-validation sets. The dotted line is the error estimated from equation (10). The dashed line is for the optimal weights estimated by the use of the generalization errors for the individual networks estimated from the cross-validation sets, as described in the text. The bottom solid line is the generalization error one would obtain if the individual generalization errors were known exactly (the best possible weights).

In Figure 2, average results over 12 independent runs are shown for some values of K (top solid line). First, one should note that the generalization error is the same for a cross-validation set of size 40 as for size 0, although not lower, so it supports the conjecture only in a weaker form. However, we have done many experiments, and depending on the experimental setup the curve can take on almost any form; sometimes the error is larger at zero than at 40, or vice versa. In the experiments shown, only ensembles with at least four converging networks out of five were used. If all the ensembles were kept, the error would have been significantly higher at K = 0 than for K > 0, because in about half of the runs none of the networks in the ensemble converged -- something that seldom happened when a cross-validation set was used. Thus it is still unclear under which circumstances one can expect a drop in generalization error when using cross-validation in this fashion.

The dotted line in Figure 2 is the error estimated from equation (10), using the cross-validation sets for each of the networks to estimate $E^\alpha$, and one notices a good agreement.

4 OPTIMAL WEIGHTS

The weights $w_\alpha$ can be estimated as described in, e.g., [3]. We suggest instead to use unlabeled data and estimate them in such a way that they minimize the generalization error given in (10).

There is no analytical solution for the weights, but something can be said about the minimum point of the generalization error. Calculating the derivative of $E$ as given in (10), subject to the constraints on the weights, and setting it equal to zero shows that

$$E^\alpha - A^\alpha = E \quad \text{or} \quad w_\alpha = 0. \qquad (12)$$

(The calculation is not shown because of space limitations, but it is easy to do.) That is, $E^\alpha - A^\alpha$ has to be the same for all networks with non-zero weight. Notice that $A^\alpha$ depends on the weights through the ensemble average of the outputs. It shows that the optimal weights have to be chosen such that each network contributes exactly $w_\alpha E$ to the generalization error. Note, however, that a member of the ensemble can have such a poor generalization error, or be so correlated with the rest of the ensemble, that it is optimal to set its weight to zero.
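As a concrete illustration of such a constrained minimization, the sketch below estimates the weights by numerically minimizing the estimate of (10) over the simplex. It assumes cross-validation estimates `E_alpha` and member predictions on a pool of unlabeled inputs are available; the use of SciPy's SLSQP routine is our choice for the sketch, not a method prescribed by the paper.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_weights(E_alpha, V_unlabeled):
    """Minimize the estimate of eq. (10), E = sum_a w_a E^a - A(w),
    over positive weights summing to one.

    E_alpha:     (N,) cross-validation estimates of individual errors
    V_unlabeled: (N, M) member outputs on M unlabeled inputs
    """
    N = len(E_alpha)

    def ensemble_error(w):
        V_bar = w @ V_unlabeled                        # eq. (1) on unlabeled data
        A_bar = w @ ((V_unlabeled - V_bar) ** 2).mean(axis=1)  # ensemble ambiguity
        return w @ E_alpha - A_bar                     # estimate of eq. (10)

    res = minimize(
        ensemble_error,
        x0=np.full(N, 1.0 / N),                        # start from uniform weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * N,                       # w_a >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
    )
    return res.x
```

Because the objective is a convex quadratic in $w$ (see equation (14) below), a local minimizer of this kind finds the global optimum.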
As the sketch above illustrates, the weights can be "learned" from unlabeled examples, e.g. by gradient descent minimization of the estimate of the generalization error (10). A more efficient approach to finding the optimal weights is to turn it into a quadratic optimization problem. That problem is non-trivial only because of the constraints on the weights ($\sum_\alpha w_\alpha = 1$ and $w_\alpha \ge 0$). Define the correlation matrix,

$$C_{\alpha\beta} = \int dx\, p(x)\, V^\alpha(x) V^\beta(x). \qquad (13)$$

Then, using that the weights sum to one, equation (10) can be rewritten as

$$E = \sum_\alpha w_\alpha E^\alpha + \sum_{\alpha\beta} w_\alpha C_{\alpha\beta} w_\beta - \sum_\alpha w_\alpha C_{\alpha\alpha}. \qquad (14)$$

Having estimates of $E^\alpha$ and $C_{\alpha\beta}$, the optimal weights can be found by linear programming or other optimization techniques. Just like the ambiguity, the correlation matrix can be estimated from unlabeled data to any accuracy needed (provided that the input distribution $p$ is known).

In Figure 2 the results from an experiment with weight optimization are shown. The dashed curve shows the generalization error when the weights are optimized as described above, using the estimates of $E^\alpha$ from the cross-validation (on $K$ examples). The lowest solid curve is for the idealized case, when it is assumed that the errors $E^\alpha$ are known exactly, so it shows the lowest possible error. The performance improvement is quite convincing when the cross-validation estimates are used.

It is important to notice that any estimate of the generalization error of the individual networks can be used in equation (14). If one is certain that the individual networks do not overfit, one might even use the training errors as estimates for $E^\alpha$ (see [3]). It is also possible to use some kind of regularization in (14) if the cross-validation sets are small.

5 ACTIVE LEARNING

In some neural network applications it is very time consuming and/or expensive to acquire training data, e.g., if a complicated measurement is required to find the value of the target function for a certain input. Therefore it is desirable to use only examples with maximal information about the function. Methods where the learner points out good examples are often called active learning.

We propose a query-based active learning scheme that applies to ensembles of networks with continuous-valued output. It is essentially a generalization of query by committee [6, 7], which was developed for classification problems. Our basic assumption is that those patterns in the input space yielding the largest error are the points we would benefit the most from including in the training set.

Since the generalization error is always non-negative, we see from (6) that the weighted average of the individual network errors is always larger than or equal to the ensemble ambiguity,

$$\bar{\epsilon}(x) \ge \bar{a}(x). \qquad (15)$$
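Anticipating the discussion below, this bound is simple to exploit in practice: evaluate the ensemble ambiguity on a pool of unlabeled candidate inputs and query the most ambiguous one. The following is a minimal sketch of that selection step, mirroring the experiment described later in this section (800 random candidates on a fixed interval); the `train` and `label` helpers in the usage comment are hypothetical stand-ins.

```python
import numpy as np

def select_query(members, w, rng, n_candidates=800, low=-2.0, high=2.0):
    """Pick the candidate input with the largest ensemble ambiguity, eq. (2).

    members: callables V^alpha mapping inputs -> outputs (trained networks)
    w:       (N,) ensemble weights, positive and summing to one
    """
    x = rng.uniform(low, high, n_candidates)    # unlabeled candidate pool
    V = np.stack([m(x) for m in members])       # V[alpha, candidate]
    V_bar = w @ V                               # ensemble output, eq. (1)
    a_bar = w @ ((V - V_bar) ** 2)              # ambiguity of each candidate
    return x[np.argmax(a_bar)]                  # query the most ambiguous input

# Active-learning loop sketch (train / label are hypothetical stand-ins):
# for _ in range(n_queries):
#     x_new = select_query(members, w, np.random.default_rng())
#     data.append((x_new, label(x_new)))
#     members = [train(data) for _ in members]
```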
Equation (15) tells us that the ambiguity is a lower bound for the weighted average of the squared error. An input pattern that yields a large ambiguity will always have a large average error. On the other hand, a low ambiguity does not necessarily imply a low error. If the individual networks are trained to a low training error on the same set of examples, then both the error and the ambiguity are low on the training points. This ensures that a pattern yielding a large ambiguity cannot be in the close neighborhood of a training example. The ambiguity will to some extent follow the fluctuations in the error. Since the ambiguity is calculated from unlabeled examples, the input space can be scanned for these areas in any level of detail. These ideas are well illustrated in Figure 1, where the correlation between error and ambiguity is quite strong, although not perfect.

Figure 3: In both plots the full line shows the average generalization error for active learning, and the dashed line for passive learning, as a function of the number of training examples (from 0 to 50). The dots in the left plot show the results of the individual experiments contributing to the mean for the active learning. The dots in the right plot show the same for passive learning.

The results of an experiment with the active learning scheme are shown in Figure 3. An ensemble of 5 networks was trained to approximate the square-wave function shown in Figure 1, but in this experiment the function was restricted to the interval from -2 to 2. The curves show the final generalization error of the ensemble in a passive (dashed line) and an active learning test (solid line). For each training set size, 2 x 40 independent tests were made, all starting with the same initial training set of a single example. Examples were generated and added one at a time. In the passive test, examples were generated at random, and in the active one, each example was selected as the input that gave the largest ambiguity out of 800 random ones. Figure 3 also shows the distribution of the individual results of the active and passive learning tests. Not only do we obtain significantly better generalization by active learning, there is also less scatter in the results. It seems to be easier for the ensemble to learn from the actively generated set.

6 CONCLUSION

The central idea in this paper was to show that there is a lot to be gained from using unlabeled data when training ensembles. Although we dealt with neural networks, all the theory holds for any other type of method used as the individual members of the ensemble.

It was shown that, apart from getting the individual members of the ensemble to generalize well, it is important for generalization that the individuals disagree as much as possible, and we discussed one method to make even identical networks disagree. This was done by training the individuals on different training sets, holding out some examples for each individual during training.
This had the added advantage that these examples could be used for testing, and thereby one could obtain good estimates of the generalization error.

It was discussed how to find the optimal weights for the individuals of the ensemble. For our simple test problem the weights found improved the performance of the ensemble significantly.

Finally, a method for active learning was described, which was based on the method of query by committee developed for classification problems. The idea is that if the ensemble disagrees strongly on an input, it would be good to find the label for that input and include it in the training set for the ensemble. It was shown that active learning improves the learning curve considerably for a simple test problem.

Acknowledgements

We would like to thank Peter Salamon for numerous discussions and for his implementation of linear programming for optimization of the weights. We also thank Lars Kai Hansen for many discussions and great insights, and David Wolpert for valuable comments.

References

[1] L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1001, Oct. 1990.

[2] D. H. Wolpert. Stacked generalization. Neural Networks, 5(2):241-259, 1992.

[3] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for neural networks. In R. J. Mammone, editor, Neural Networks for Speech and Image Processing. Chapman-Hall, 1993.

[4] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, Jan. 1992.

[5] R. Meir. Bias, variance and the combination of estimators; the case of linear least squares. Preprint (in Neuroprose), Technion, Haifa, Israel, 1994.

[6] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Workshop on Computational Learning Theory, pages 287-294, San Mateo, CA, 1992. Morgan Kaufmann.

[7] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Information, prediction, and query by committee. In Advances in Neural Information Processing Systems, volume 5, San Mateo, CA, 1993. Morgan Kaufmann.