{"title": "Competition Among Networks Improves Committee Performance", "book": "Advances in Neural Information Processing Systems", "page_first": 592, "page_last": 598, "abstract": null, "full_text": "Competition Among Networks Improves Committee Performance \n\nPaul W. Munro \nDepartment of Information Science and Telecommunications \nUniversity of Pittsburgh \nPittsburgh PA 15260 \nmunro@sis.pitt.edu \n\nBambang Parmanto \nDepartment of Health Information Management \nUniversity of Pittsburgh \nPittsburgh PA 15260 \nparmanto+@pitt.edu \n\nABSTRACT \n\nThe separation of generalization error into two types, bias and variance (Geman, Bienenstock, Doursat, 1992), leads to the notion of error reduction by averaging over a \"committee\" of classifiers (Perrone, 1993). Committee performance degrades both with the average error of the constituent classifiers and with the degree to which the misclassifications are correlated across the committee. Here, a method for reducing correlations is introduced that uses a winner-take-all procedure similar to competitive learning to drive the individual networks to different minima in weight space with respect to the training set, such that correlations in generalization performance will be reduced, thereby reducing committee error. \n\n1 INTRODUCTION \n\nThe problem of constructing a predictor can generally be viewed as finding the right combination of bias and variance (Geman, Bienenstock, Doursat, 1992) to reduce the expected error. Since a neural network predictor inherently has an excessive number of parameters, reducing the prediction error is usually done by reducing variance. Methods for reducing neural network complexity can be viewed as a regularization technique to reduce this variance. Examples of such methods are Optimal Brain Damage (Le Cun et 
\nal., 1990), weight decay (Chauvin, 1989), and early stopping (Morgan & Bourlard, 1990). \n\nThe idea of combining several predictors to form a single, better predictor (Bates & Granger, 1969) has been applied using neural networks in recent years (Wolpert, 1992; Perrone, 1993; Hashem, 1994). \n\n2 REDUCING MISCLASSIFICATION CORRELATION \n\nSince committee errors occur when too many individual predictors are in error, committee performance improves as the correlation of network misclassifications decreases. Error correlations can be handled by using a weighted sum to generate a committee prediction; the weights can be estimated by using ordinary least squares (OLS) estimators (Hashem, 1994) or by using Lagrange multipliers (Perrone, 1993). \n\nAnother approach (Parmanto et al., 1994) is to reduce error correlation directly by attempting to drive the networks to different minima in weight space, which will presumably have different generalization syndromes, or patterns of error with respect to a test set (or, better yet, the entire stimulus space). \n\n2.1 Data Manipulations \n\nTraining the networks using nonidentical data has been shown to improve committee performance, both when the data sets are from mutually exclusive continuous regions (e.g., Jacobs et al., 1991) and when the training subsets are arbitrarily chosen (Breiman, 1992; Parmanto, Munro, and Doyle, 1996). Networks tend to converge to different weight states because the error surface itself depends on the training set; hence changing the data changes the error surface. \n\n2.2 Auxiliary tasks \n\nAnother way to influence the networks to disagree is to introduce a second output unit with a different task for each network in the committee. 
Thus, each network has two outputs: a primary unit trained to predict the class of the input, and a secondary unit with some other task that is different from the tasks assigned to the secondary units of the other committee members. The success of this approach rests on the assumption that the decorrelation of the network errors will more than compensate for any degradation of performance induced on the primary task by the auxiliary task. The presence of a hidden layer in each network guarantees that the two output response functions share some weight parameters (i.e., the input-hidden weights), and so the learning of the secondary task influences the function learned by the primary output unit. \n\nParmanto et al. (1994) achieved significant decorrelation and improved performance on a variety of tasks using one of the input variables as the training signal for the secondary unit. Interestingly, the secondary task does not necessarily degrade performance on the primary task. Our studies, as well as those of Caruana (1995), show that extra tasks can speed learning and improve generalization performance in an individual network. On the other hand, certain auxiliary tasks interfere with the primary task. We have found, however, that even when individual performance is degraded, committee performance is nevertheless enhanced (relative to a committee of single-output networks) due to the magnitude of error decorrelation. \n\n3 THE COMPETITIVE COMMITTEE \n\nAn alternative to using a stationary task per se, such as replicating an input variable or projecting onto principal components (as was done in Parmanto et al., 1994), is to use a signal that depends on the other networks, in such a manner that the functions computed by the secondary units are negatively correlated after training. 
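The shared-parameter coupling described in Section 2.2 can be made concrete with a short sketch (our own construction, not code from the paper; all variable names are hypothetical): a primary output P and a secondary output S share the input-to-hidden weights, so gradient steps driven only by the secondary task still change the function computed by the primary unit.

```python
import numpy as np

# Sketch of a two-output network in which the primary unit P and the
# secondary unit S share the input-to-hidden weights W.  Training the
# secondary task alone therefore moves W, and with it the primary output.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 2, 4
W  = rng.normal(scale=0.5, size=(n_hid, n_in))  # shared input->hidden weights
vP = rng.normal(scale=0.5, size=n_hid)          # hidden->primary weights
vS = rng.normal(scale=0.5, size=n_hid)          # hidden->secondary weights

def forward(x):
    h = sigmoid(W @ x)
    return sigmoid(vP @ h), sigmoid(vS @ h), h

x = np.array([0.3, -0.7])
p_before, _, _ = forward(x)
W0 = W.copy()

# A few gradient-descent steps on the SECONDARY task only (target yS = 1):
for _ in range(20):
    p, s, h = forward(x)
    dS = (1.0 - s) * s * (1.0 - s)   # (target - output) times sigmoid derivative
    dh = dS * vS * h * (1.0 - h)     # error backpropagated through shared layer
    vS += 0.5 * dS * h
    W  += 0.5 * np.outer(dh, x)      # the shared weights change...

p_after, _, _ = forward(x)
# ...so P's response to x changes even though vP was never updated.
print(p_before, p_after)
```

Because W is shared, decorrelating pressure applied at the secondary unit necessarily reshapes the hidden representation that the primary unit reads from; this is the coupling the committee exploits.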
This notion is reminiscent of competitive learning (Rumelhart and Zipser, 1986); that is, the functions computed by the secondary units will partition the stimulus space. \n\nThus, a Competitive Committee Machine (CCM) is defined as a committee of neural network classifiers, each with two output units: a primary unit trained according to the classification task, and a secondary unit participating in a competitive process with the secondary units of the other networks in the committee; let the outputs of network i be denoted P_i and S_i, respectively (see Figure 1). The network weights are modified according to the following variant of the backpropagation procedure. \n\nWhen data item α from the training set is presented to the committee during training, with input vector x^α and known output classification value y^α (binary), the networks each process x^α simultaneously, and the P and S output units of each network respond. Each P-unit receives the identical training signal, y^α, that corresponds to the input item; the training signal to the S-units is zero for all networks except the network with the greatest S-unit response among the committee: the maximum S_i among the networks receives a training signal of 1, and the others receive a training signal of 0. The resulting error terms are \n\nδP_i = y^α - P_i and δS_i = T_i - S_i, where T_i = 1 if S_i = max_j S_j and T_i = 0 otherwise; \n\nδP and δS are the errors attributed to the primary and secondary units, respectively, used to adjust the network weights with backpropagation¹. During the course of training, the S-unit's response is explicitly trained to become sensitive to a unique region (relative to the other networks' S-units) of the stimulus space. 
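The winner-take-all assignment of S-unit training signals described above can be sketched as follows (a minimal illustration of our own; the function name is hypothetical):

```python
import numpy as np

# CCM secondary training signal: across the committee, only the network
# whose S-unit gives the largest response to the current input receives
# a target of 1; every other network's S-unit receives a target of 0.
def s_unit_targets(s_responses):
    targets = np.zeros_like(s_responses)
    targets[np.argmax(s_responses)] = 1.0
    return targets

# S-unit responses of a hypothetical 5-network committee to one input
s = np.array([0.20, 0.70, 0.40, 0.10, 0.60])
t = s_unit_targets(s)   # -> [0., 1., 0., 0., 0.]
delta_S = t - s         # error attributed to each secondary unit
```

The primary-unit error is formed analogously from the common classification target (δP_i = y - P_i); as noted in the Discussion, the secondary error is additionally down-weighted (by a factor of 0.1 in the simulations reported here) before backpropagation.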
This training signal is different from typical \"tasks\" that are used to train neural networks in that it is not a static function of the input; instead, since it depends on the other networks in the committee, it has a dynamic quality. \n\n¹For notational convenience, the derivative factor sometimes included in the definition of δ is not included in this description of δP and δS. \n\n4 RESULTS \n\nSome experiments have been run using the sine wave classification task (Figure 2) of Geman, Bienenstock, and Doursat (1992). \n\nComparisons of CCM performance versus the baseline performance of a committee with a simple average, over a range of architectures (as indicated by the number of hidden units), are favorable (Figure 3). Also, note that the improvement is primarily attributable to decreased correlation, since the average individual performance is not significantly affected. \n\nVisualization of the response of the individual networks to the entire stimulus space gives a complete picture of how the networks generalize and shows the effect of the competition (Figure 4). For this particular data set, the classes are easily separated in the central region (note that all the networks do well here). But at the edges, there is much more variance in the networks trained with competitive secondary units (Figure 5). \n\n5 DISCUSSION \n\nCaruana (1995) has demonstrated significant improvement on \"target\" classification tasks in individual networks by adding one or more supplementary output units trained to compute tasks related to the target task. The additional output unit added to each network 
in the CCM merges a variant of Rumelhart and Zipser's (1986) competitive learning procedure with backpropagation, to form a novel hybrid of a supervised training technique with an unsupervised method. The training signal delivered to the secondary unit under CCM is more direct than an arbitrary task, in that it is defined explicitly in terms of dissociating response properties. \n\nFigure 1: A Competitive Committee Machine. Each of the K networks receives the same input (the input variables are presented simultaneously to all networks) and produces two outputs, P and S. The P responses of all the networks are compared to a common training signal to compute an error value for backpropagation (dark dashed arrows); the P responses are combined (by vote or by sum) to determine the committee response. The S-unit responses are compared with each other, with the \"winner\" (highest response) receiving a training signal of 1, and the others receiving a training signal of 0. Thus the training signal for network i is computed by comparing all S-unit responses, and then fed back to the S-units, hence the two-way arrows (gray). \n\nNote that the training signals for the S-units differ from the P-unit training signals in two important respects: \n1. Not static: The signal depends on the S-unit responses from the other networks and hence changes during the course of training. \n2. Not uniform: It is not constant across the committee (whereas the P-unit training signal is). \n\nFigure 2. A classification task. Training data (bottom) is sampled from a classification task defined by a sinusoid (top) corrupted by noise (middle). \n\n
Figure 3. Performance of CCM. Four panels plot committee performance, individual performance, correlation, and percent improvement against the number of hidden units, for baseline and CCM committees. Committees of 5 networks were trained with competitive learning (CCM) and without (baseline). Each data point is an average over 5 simulations with different initial weights. \n\nPanels of Figure 4: Network #1 (Error: 10.20%), Network #2 (Error: 15.59%), Network #3 (Error: 15.25%), Network #4 (Error: 15.54%), Network #5 (Error: 12.65%), Committee Output (Thresholded) (Error: 11.64%). \n\nFigure 4. Generalization plots for a committee. The level of gray indicates the response for each network of a committee trained without competition. The panel on the lower right shows the (thresholded) committee output. The average pairwise correlation of the committee is 0.91. \n\nPanels of Figure 5: Network #1 (Error: 10.21%), Network #2 (Error: 9.83%), Network #3 (Error: 16.83%), Network #4 (Error: 14.88%). \n\nFigure 5. Generalization plots for a CCM committee. Comparison with Figure 4 shows much more variance among the committee at the edges. Note that the committee performs much better near the right and left ends of the stimulus space than does any individual network. 
This committee had an error rate of 8.11% (cf. 11.64% in the baseline case). \n\nThe weighting of δS relative to δP is an important consideration; in the simulations above, the signal from the secondary unit was arbitrarily multiplied by a factor of 0.1. While we have not yet examined this systematically, it is assumed that this factor will modulate the tradeoff between degradation of the primary task and reduction of error correlation. \n\nReferences \n\nBates, J.M., and Granger, C.W. (1969) \"The combination of forecasts,\" Operational Research Quarterly, 20(4), 451-468. \n\nBreiman, L. (1992) \"Stacked Regressions,\" TR 367, Dept. of Statistics, Univ. of Cal. Berkeley. \n\nCaruana, R. (1995) \"Learning many related tasks at the same time with backpropagation,\" In: D. S. Touretzky, ed., Advances in Neural Information Processing Systems 7, Morgan Kaufmann. \n\nChauvin, Y. (1989) \"A backpropagation algorithm with optimal use of hidden units,\" In: D. Touretzky, ed., Advances in Neural Information Processing Systems 1, Denver, 1988, Morgan Kaufmann. \n\nGeman, S., Bienenstock, E., and Doursat, R. (1992) \"Neural networks and the bias/variance dilemma,\" Neural Computation 4, 1-58. \n\nHashem, S. (1994) Optimal Linear Combinations of Neural Networks, PhD Thesis, Purdue University. \n\nJacobs, R.A., Jordan, M.I., Nowlan, S.J., and Hinton, G.E. (1991) \"Adaptive mixtures of local experts,\" Neural Computation, 3, 79-87. \n\nLe Cun, Y., Denker, J., and Solla, S. (1990) \"Optimal Brain Damage,\" In: D. Touretzky, ed., Advances in Neural Information Processing Systems 2, San Mateo: Morgan Kaufmann, 598-605. \n\nMorgan, N. and Bourlard, H. (1990) \"Generalization and parameter estimation in feedforward nets: some experiments,\" In: D. Touretzky, ed., 
Advances in Neural Information Processing Systems 2, San Mateo: Morgan Kaufmann. \n\nParmanto, B., Munro, P.W., Doyle, H.R., Doria, C., Aldrighetti, L., Marino, I.R., Mitchel, S., and Fung, J.J. (1994) \"Neural network classifier for hepatoma detection,\" Proceedings of the World Congress on Neural Networks. \n\nParmanto, B., Munro, P.W., and Doyle, H.R. (1996) \"Improving committee diagnosis with resampling techniques,\" In: D. S. Touretzky, M. C. Mozer, M. E. Hasselmo, eds., Advances in Neural Information Processing Systems 8, MIT Press: Cambridge, MA. \n\nPerrone, M.P. (1993) \"Improving Regression Estimation: Averaging Methods for Variance Reduction with Extension to General Convex Measure Optimization,\" PhD Thesis, Department of Physics, Brown University. \n\nRumelhart, D.E. and Zipser, D. (1986) \"Feature discovery by competitive learning,\" In: Rumelhart, D.E. and McClelland, J.L., eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA. \n\nWolpert, D. (1992) \"Stacked generalization,\" Neural Networks, 5, 241-259. \n", "award": [], "sourceid": 1196, "authors": [{"given_name": "Paul", "family_name": "Munro", "institution": null}, {"given_name": "Bambang", "family_name": "Parmanto", "institution": null}]}