{"title": "Putting It All Together: Methods for Combining Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1188, "page_last": 1189, "abstract": "", "full_text": "Putting It All Together: Methods for Combining Neural Networks\n\nMichael P. Perrone\n\nInstitute for Brain and Neural Systems\n\nBrown University\n\nProvidence, RI\n\nmpp@cns.brown.edu\n\nThe past several years have seen a tremendous growth in the complexity of the recognition, estimation and control tasks expected of neural networks. In solving these tasks, one is faced with a large variety of learning algorithms and a vast selection of possible network architectures. After all the training, how does one know which is the best network? This decision is further complicated by the fact that standard techniques can be severely limited by problems such as over-fitting, data sparsity and local optima. The usual solution to these problems is a winner-take-all cross-validatory model selection. However, recent experimental and theoretical work indicates that we can improve performance by considering methods for combining neural networks.\n\nThis workshop examined current neural network optimization methods based on combining estimates and task decomposition, including Boosting, Competing Experts, Ensemble Averaging, Metropolis algorithms, Stacked Generalization and Stacked Regression. The issues covered included Bayesian considerations, the role of complexity, the role of cross-validation, incorporation of a priori knowledge, error orthogonality, task decomposition, network selection techniques, over-fitting, data sparsity and local optima. Highlights of each talk are given below. 
\nTo obtain the workshop proceedings, please contact the author or Norma Caccia (norma_caccia@brown.edu) and ask for IBNS ONR technical report #69.\n\nM. Perrone (Brown University, \"Averaging Methods: Theoretical Issues and Real World Examples\") presented weighted averaging schemes [7], discussed their theoretical foundation [6], and showed that averaging can improve performance whenever the cost function is (positive or negative) convex which includes Mean Square Error, a general class of Lp-norm cost functions, Maximum Likelihood Estimation, Maximum Entropy, Maximum Mutual Information, the Kullback-Leibler Information (Cross Entropy), Penalized Maximum Likelihood Estimation and Smoothing Splines [6]. Averaging was shown to improve performance on the NIST OCR data, a human face recognition task and a time series prediction task [5].\n\nJ. Friedman (Stanford, \"A New Approach to Multiple Outputs Using Stacking\") presented a detailed analysis of a method for averaging estimators and noted simulations showed that averaging with a positivity constraint was better than cross-validation estimator selection [1].\n\nS. Nowlan (Synaptics, \"Competing Experts\") emphasized the distinctions between static and dynamic algorithms and between averaged and stacked algorithms; and presented results of the mixture of experts algorithm [3] on a vowel recognition task and a hand tracking task.\n\nH. Drucker (AT&T, \"Boosting Compared to Other Ensemble Methods\") reviewed the boosting algorithm [2] and showed how it can improve performance for OCR data.\n\nJ. Moody (OGI, \"Predicting the U.S. 
Index of Industrial Production\") showed that neural networks make better predictions for the US IP index than standard models [4] and that averaging these estimates improves prediction performance further.\n\nW. Buntine (NASA Ames Research Center, \"Averaging and Probabilistic Networks: Automating the Process\") discussed placing combination techniques within the Bayesian framework.\n\nD. Wolpert (Santa Fe Institute, \"Inferring a Function vs. Inferring an Inference Algorithm\") argued that theory cannot, in general, identify the optimal network; so one must make assumptions in order to improve performance.\n\nH. Thodberg (Danish Meat Research Institute, \"Error Bars on Predictions from Deviations among Committee Members (within Bayesian Backprop)\") raised the provocative (and contentious) point that Bayesian arguments support averaging while Occam's Razor (seemingly?) does not.\n\nS. Hashem (Purdue University, \"Merits of Combining Neural Networks: Potential Benefits and Risks\") emphasized the importance of dealing with collinearity when using averaging methods.\n\nReferences\n\n[1] Leo Breiman. Stacked regression. Technical Report TR-367, Department of Statistics, University of California, Berkeley, August 1992.\n\n[2] Harris Drucker, Robert Schapire, and Patrice Simard. Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, [To appear].\n\n[3] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(2), 1991.\n\n[4] U. Levin, T. Leen, and J. Moody. Fast pruning using principal components. In Steven J. Hanson, Jack D. Cowan, and C. Lee Giles, editors, Advances in Neural Information Processing Systems 6. Morgan Kaufmann, 1994.\n\n[5] M. P. Perrone. 
Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization. PhD thesis, Brown University, Institute for Brain and Neural Systems; Dr. Leon N Cooper, Thesis Supervisor, May 1993.\n\n[6] M. P. Perrone. General averaging results for convex optimization. In Proceedings of the 1993 Connectionist Models Summer School, pages 364-371, Hillsdale, NJ, 1994. Erlbaum Associates.\n\n[7] M. P. Perrone and L. N Cooper. When networks disagree: Ensemble method for neural networks. In Artificial Neural Networks for Speech and Vision. Chapman-Hall, 1993. Chapter 10.", "award": [], "sourceid": 744, "authors": [{"given_name": "Michael", "family_name": "Perrone", "institution": null}]}