{"title": "Combining Neural Network Regression Estimates with Regularized Linear Weights", "book": "Advances in Neural Information Processing Systems", "page_first": 564, "page_last": 570, "abstract": "", "full_text": "Combining Neural Network Regression Estimates with Regularized Linear Weights \n\nChristopher J. Merz and Michael J. Pazzani \n\nDept. of Information and Computer Science \n\nUniversity of California, Irvine, CA 92717-3425 U.S.A. \n\n{cmerz,pazzani}@ics.uci.edu \n\nCategory: Algorithms and Architectures. \n\nAbstract \n\nWhen combining a set of learned models to form an improved estimator, the issue of redundancy or multicollinearity in the set of models must be addressed. A progression of existing approaches and their limitations with respect to redundancy is discussed. A new approach, PCR*, based on principal components regression is proposed to address these limitations. An evaluation of the new approach on a collection of domains reveals that: 1) PCR* was the most robust combination method as the redundancy of the learned models increased, 2) redundancy could be handled without eliminating any of the learned models, and 3) the principal components of the learned models provided a continuum of \"regularized\" weights from which PCR* could choose. \n\n1 INTRODUCTION \n\nCombining a set of learned models¹ to improve classification and regression estimates has been an area of much research in machine learning and neural networks [Wolpert, 1992, Merz, 1995, Perrone and Cooper, 1992, Breiman, 1992, Meir, 1995, Leblanc and Tibshirani, 1993, Krogh and Vedelsby, 1995, Tresp, 1995, Chan and Stolfo, 1995]. The challenge of this problem is to decide which models to rely on for prediction and how much weight to give each. 
\n\n¹ A learned model may be anything from a decision/regression tree to a neural network. \n\nThe goal of combining learned models is to obtain a more accurate prediction than can be obtained from any single source alone. One major issue in combining a set of learned models is redundancy. Redundancy refers to the amount of agreement or linear dependence between models when making a set of predictions. The more the set agrees, the more redundancy is present. In statistical terms, this is referred to as the multicollinearity problem. \n\nThe focus of this paper is to explore and evaluate the properties of existing methods for combining regression estimates (Section 2), and to motivate the need for more advanced methods which deal with multicollinearity in the set of learned models (Section 3). In particular, a method based on principal components regression (PCR, [Draper and Smith, 1981]) is described and evaluated empirically, demonstrating that it is a robust and efficient method for finding a set of combining weights with low prediction error (Section 4). Finally, Section 5 draws some conclusions. \n\n2 MOTIVATION \n\nThe problem of combining a set of learned models is defined using the terminology of [Perrone and Cooper, 1992]. Suppose two sets of data are given: a training set $\\mathcal{D}_{Train} = (x_m, y_m)$ and a test set $\\mathcal{D}_{Test} = (x_l, y_l)$. Now suppose $\\mathcal{D}_{Train}$ is used to build a set of functions, $\\mathcal{F} = \\hat{f}_i(x)$, each element of which approximates $f(x)$. The goal is to find the best approximation of $f(x)$ using $\\mathcal{F}$. 
\nTo date, most approaches to this problem limit the space of approximations of $f(x)$ to linear combinations of the elements of $\\mathcal{F}$, i.e., \n\n$$\\hat{f}(x) = \\sum_{i=1}^{N} \\alpha_i \\hat{f}_i(x)$$ \n\nwhere $\\alpha_i$ is the coefficient or weight of $\\hat{f}_i(x)$. \n\nThe focus of this paper is to evaluate and address the limitations of these approaches. To do so, a brief summary of these approaches is now provided, progressing from simpler to more complex methods and pointing out their limitations along the way. \n\nThe simplest method for combining the members of $\\mathcal{F}$ is to take the unweighted average (i.e., $\\alpha_i = 1/N$). Perrone and Cooper refer to this as the Basic Ensemble Method (BEM), written as \n\n$$\\hat{f}_{BEM} = \\frac{1}{N} \\sum_{i=1}^{N} \\hat{f}_i(x)$$ \n\nThis equation can also be written in terms of the misfit function for each $\\hat{f}_i(x)$. These functions describe the deviations of the elements of $\\mathcal{F}$ from the true solution and are written as \n\n$$m_i(x) = f(x) - \\hat{f}_i(x).$$ \n\nThus, \n\n$$\\hat{f}_{BEM} = f(x) - \\frac{1}{N} \\sum_{i=1}^{N} m_i(x).$$ \n\nPerrone and Cooper show that as long as the $m_i(x)$ are mutually independent with zero mean, the error in estimating $f(x)$ can be made arbitrarily small by increasing the population size of $\\mathcal{F}$. Since these assumptions break down in practice, they developed a more general approach which finds the \"optimal\"² weights while allowing the $m_i(x)$'s to be correlated and have non-zero means. This Generalized Ensemble Method (GEM) is written as \n\n$$\\hat{f}_{GEM} = \\sum_{i=1}^{N} \\alpha_i \\hat{f}_i(x) = f(x) - \\sum_{i=1}^{N} \\alpha_i m_i(x)$$ \n\nwhere \n\n$$\\alpha_i = \\frac{\\sum_{j=1}^{N} C^{-1}_{ij}}{\\sum_{k=1}^{N} \\sum_{j=1}^{N} C^{-1}_{kj}}.$$ \n\n$C$ is the symmetric sample covariance matrix for the misfit function and the goal is to minimize $\\sum_{i,j} \\alpha_i \\alpha_j C_{ij}$. Note that the misfit functions are calculated on the training data and $f(x)$ is not required.  
The main disadvantage of this approach is that it involves taking the inverse of $C$, which can be \"unstable\". That is, redundancy in the members of $\\mathcal{F}$ leads to linear dependence in the rows and columns of $C$, which in turn leads to unreliable estimates of $C^{-1}$. \n\nTo circumvent this sensitivity to redundancy, Perrone and Cooper propose a method for discarding member(s) of $\\mathcal{F}$ when the strength of its agreement with another member exceeds a certain threshold. Unfortunately, this approach only checks for linear dependence (or redundancy) between pairs $\\hat{f}_i(x)$ and $\\hat{f}_j(x)$ for $i \\neq j$. In fact, $\\hat{f}_i(x)$ could be a linear combination of several other members of $\\mathcal{F}$ and the instability problem would be manifest. Also, depending on how high the threshold is set, a member of $\\mathcal{F}$ could be discarded while still having some degree of uniqueness and utility. An ideal method for weighting the members of $\\mathcal{F}$ would neither discard any models nor suffer when there is redundancy in the model set. \n\nThe next approach reviewed is linear regression (LR)³, which also finds the \"optimal\" weights for the $\\hat{f}_i(x)$ with respect to the training data. In fact, GEM and LR are both considered \"optimal\" because they are closely related in that GEM is a form of linear regression with the added constraint that $\\sum_{i=1}^{N} \\alpha_i = 1$. The weights for LR are found as follows⁴: \n\n$$\\hat{f}_{LR} = \\sum_{i=1}^{N} \\alpha_i \\hat{f}_i(x)$$ \n\nwhere \n\n$$\\alpha = (\\hat{f}^T \\hat{f})^{-1} \\hat{f}^T y,$$ \n\nwith $\\hat{f}$ the matrix of the models' predictions on the training data and $y$ the vector of training targets $y_m$. \n\nLike GEM, LR and LRC are subject to the multicollinearity problem because finding the $\\alpha_i$'s involves taking the inverse of a matrix. That is, if the $\\hat{f}$ matrix is composed of $\\hat{f}_i(x)$ which strongly agree with other members of $\\mathcal{F}$, some linear dependence will be present. \n\n² Optimal here refers to weights which minimize mean square error for the training data. \n\n³ Actually, it is a form of linear regression without the intercept term.  
The more general form, denoted by LRC, would be formulated the same way but with an additional member, $\\hat{f}_0$, which always predicts 1. According to [Leblanc and Tibshirani, 1993], having the extra constant term will not be necessary (i.e., it will equal zero) because in practice, $E[\\hat{f}_i(x)] = E[f(x)]$. \n\n⁴ Note that the constraint, $\\sum_{i=1}^{N} \\alpha_i = 1$, for GEM is a form of regularization [Leblanc and Tibshirani, 1993]. The purpose of regularizing the weights is to provide an estimate which is less biased by the training sample. Thus, one would not expect GEM and LR to produce identical weights. \n\nGiven the limitations of these methods, the goal of this research was to find a method which finds weights for the learned models with low prediction error without discarding any of the original models, and without being subject to the multicollinearity problem. \n\n3 METHODS FOR HANDLING MULTICOLLINEARITY \n\nIn the abovementioned methods, multicollinearity leads to inflation of the variance of the estimated weights, $\\alpha$. Consequently, the weights obtained from fitting the model to a particular sample may be far from their true values. To circumvent this problem, approaches have been developed which: 1) constrain the estimated regression coefficients so as to improve prediction performance (i.e., ridge regression, RIDGE [Montgomery and Friedman 1993], and principal components regression), 2) search for the coefficients via gradient descent procedures (i.e., Widrow-Hoff learning, GD and EG± [Kivinen and Warmuth, 1994]), or 3) build models which make decorrelated errors by adjusting the bias of the learning algorithm [Opitz and Shavlik, 1995] or the data which it sees [Meir, 1995].  
The third approach ameliorates, but does not solve, the problem because redundancy is an inherent part of the task of combining estimators. \n\nThe focus of this paper is on the first approach. Leblanc and Tibshirani [Leblanc and Tibshirani, 1993] have proposed several ways of constraining or regularizing the weights to help produce estimators with lower prediction error: \n\n1. Shrink $\\alpha$ towards $(1/K, 1/K, \\ldots, 1/K)^T$ where $K$ is the number of learned models. \n\n2. $\\sum_{i=1}^{K} \\alpha_i = 1$ \n\n3. $\\alpha_i \\geq 0, i = 1, 2, \\ldots, K$ \n\nBreiman [Breiman, 1992] provides an intuitive justification for these constraints by pointing out that the more strongly they are satisfied, the more interpolative the weighting scheme is. In the extreme case, a uniformly weighted set of learned models is likely to produce a prediction between the maximum and minimum predicted values of the learned models. Without these constraints, there is no guarantee that the resulting predictor will stay near that range and generalization may be poor. \n\nThe next subsection describes a variant of principal components regression and explains how it provides a continuum of regularized weights for the original learned models. \n\n3.1 PRINCIPAL COMPONENTS REGRESSION \n\nWhen dealing with the above mentioned multicollinearity problem, principal components regression [Draper and Smith, 1981] may be used to summarize and extract the \"relevant\" information from the learned models. The main idea of PCR is to map the original learned models to a set of (independent) principal components in which each component is a linear combination of the original learned models, and then to build a regression equation using the best subset of the principal components to predict $f(x)$. 
\nThe advantage of this representation is that the components are sorted according to how much information (or variance) from the original learned models they account for. Given this representation, the goal is to choose the number of principal components to include in the final regression by retaining the first $k$ which meet a preselected stopping criterion. The basic approach is summarized as follows: \n\n1. Do a principal components analysis (PCA) on the covariance matrix of the learned models' predictions on the training data (i.e., do a PCA on the covariance matrix of $M$, where $M_{i,j}$ is the $j$-th model's response for the $i$-th training example) to produce a set of principal components, $PC = \\{PC_1, \\ldots, PC_N\\}$. \n\n2. Use a stopping criterion to decide on $k$, the number of principal components to use. \n\n3. Do a least squares regression on the selected components (i.e., include $PC_i$ for $i \\leq k$). \n\n4. Derive the weights, $\\alpha_i$, for the original learned models by expanding \n\n$$\\hat{f}_{PCR*} = \\beta_1 PC_1 + \\cdots + \\beta_k PC_k$$ \n\naccording to \n\n$$PC_i = \\gamma_{i,0} \\hat{f}_0 + \\cdots + \\gamma_{i,N} \\hat{f}_N,$$ \n\nand simplifying for the coefficients of $\\hat{f}_i$. Note that $\\gamma_{i,j}$ is the $j$-th coefficient of the $i$-th principal component. \n\nThe second step is very important because choosing too few or too many principal components may result in underfitting or overfitting, respectively. Ten-fold cross-validation is used to select $k$ here. \n\nExamining the spectrum of ($N$) weight sets derived in step four reveals that PCR* provides a continuum of weight sets spanning from highly constrained (i.e., weights generated from PCR$_1$ satisfy all three regularization constraints) to completely unconstrained (i.e., PCR$_N$ is equivalent to unconstrained linear regression).  
To see that the weights, $\\alpha$, derived from PCR$_1$ are (nearly) uniform, recall that the first principal component accounts for where the learned models agree. Because the learned models are all fairly accurate, they agree quite often, so their first principal component weights, $\\gamma_{1,*}$, will be similar. The $\\gamma$-weights are in turn multiplied by a constant when PCR$_1$ is regressed upon. Thus, the resulting $\\alpha_i$'s will be fairly uniform. The later principal components serve as refinements to those already included, producing less constrained weight sets until finally PCR$_N$ is included, resulting in an unconstrained estimator much like LR, LRC and GEM. \n\n4 EXPERIMENTAL RESULTS \n\nThe set of learned models, $\\mathcal{F}$, was generated using Backpropagation [Rumelhart, 1986]. For each dataset, a network topology was developed which gave good performance. The collection of networks built differed only in their initial weights⁵. \n\n⁵ There was no extreme effort to produce networks with more decorrelated errors. Even with such networks, the issue of extreme multicollinearity would still exist because $E[\\hat{f}_i(x)] = E[\\hat{f}_j(x)]$ for all $i$ and $j$. \n\nThree data sets were chosen: cpu and housing (from the UCI repository), and bodyfat (from the Statistics Library at Carnegie Mellon University). Due to space limitations, the data sets reported on were chosen because they were representative of the basic trends found in a larger collection of datasets. The combining methods evaluated consist of all the methods discussed in Sections 2 and 3, as well as PCR$_1$ and PCR$_N$ (to demonstrate PCR*'s most and least regularized weight sets, 
respectively). The more computationally intense procedures based on stacking and bootstrapping proposed by [Leblanc and Tibshirani, 1993, Breiman, 1992] were not evaluated here because they required many more models (i.e., neural networks) to be generated for each of the elements of $\\mathcal{F}$. \n\nThere were 20 trials run for each of the datasets. On each trial the data was randomly divided into 70% training data and 30% test data. These trials were rerun for varying sizes of $\\mathcal{F}$ (i.e., 10 and 50, respectively). As more models are included, the linear dependence amongst them goes up, showing how well the multicollinearity problem is handled⁶. Table 1 shows the average residual errors for each of the methods on the three data sets. Each row is a particular method and each column is the size of $\\mathcal{F}$ for a given data set. Bold-faced entries indicate methods which were not significantly different from the method with the lowest error (via two-tailed paired t-tests with $p \\leq 0.05$). \n\nTable 1: Results \nData \nN \nBEM \nGEM \nLR \nRIDGE \nGD \nEGPM \nPCR1 \nPCRN \nPCR* \n\nbodyfat \n50 \n10 \n1.03 \n1.04 \n1.02  0.86 \n3.09 \n1.02 \n1.02  0.826 \n1.03 \n1.03 \n1.04 \n1.02  0.848 \n0.786 \n0.99 \n\n1.04 \n1.07 \n1.05 \n\ncpu \n10 \n38.57 \n46.59 \n44.9 \n44.8 \n38.9 \n38.4 \n39.0 \n44.8 \n40.3 \n\n50 \n38.62 \n227.54 \n238.0 \n191.0 \n38.8 \n38.0 \n39.0 \n249.9 \n40.8 \n\nhousing \n50 \n10 \n2.77 \n2.79 \n2.57 \n2.72 \n2.72 \n6.44 \n2.55 \n2.72 \n2.77 \n2.79 \n2.77 \n2.75 \n2.78 \n2.76 \n2.72  2.57 \n2.56 \n2.70 \n\nPCR* is the only approach which is among the leaders for all three data sets.  
For the bodyfat and housing data sets the weights produced by BEM, PCR$_1$, GD, and EG± tended to be too constrained, while the weights for LR tended to be too unconstrained for the larger collection of models. The less constrained weights of GEM, LR, RIDGE, and PCR$_N$ severely harmed performance in the cpu domain where uniform weighting performed better. \n\nThe biggest demonstration of PCR*'s robustness is its ability to gravitate towards the more constrained weights produced by the earlier principal components when appropriate (i.e., in the cpu dataset). Similarly, it uses the less constrained principal components closer to PCR$_N$ when it is preferable, as in the bodyfat and housing domains. \n\n5 CONCLUSION \n\nThis investigation suggests that the principal components of a set of learned models can be useful when combining the models to form an improved estimator. It was demonstrated that the principal components provide a continuum of weight sets from highly regularized to unconstrained. An algorithm, PCR*, was developed which attempts to automatically select the subset of these components which provides the lowest prediction error. Experiments on a collection of domains demonstrated PCR*'s ability to robustly handle redundancy in the set of learned models. Future work will be to improve upon PCR* and expand it to the classification task. \n\n⁶ This is verified by observing the eigenvalues of the principal components and values in the covariance matrix of the models in $\\mathcal{F}$. \n\nReferences \n\n[Breiman et al., 1984] Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth. \n\n[Breiman, 1992] Breiman, L. (1992). Stacked Regression. 
Dept. of Statistics, Berkeley, TR No. 367. \n\n[Chan and Stolfo, 1995] Chan, P.K., Stolfo, S.J. (1995). A Comparative Evaluation of Voting and Meta-Learning on Partitioned Data. Proceedings of the Twelfth International Machine Learning Conference (90-98). San Mateo, CA: Morgan Kaufmann. \n\n[Draper and Smith, 1981] Draper, N.R., Smith, H. (1981). Applied Regression Analysis. New York, NY: John Wiley and Sons. \n\n[Kivinen and Warmuth, 1994] Kivinen, J., and Warmuth, M. (1994). Exponentiated Gradient Descent Versus Gradient Descent for Linear Predictors. Dept. of Computer Science, UC-Santa Cruz, TR No. ucsc-crl-94-16. \n\n[Krogh and Vedelsby, 1995] Krogh, A., and Vedelsby, J. (1995). Neural Network Ensembles, Cross Validation, and Active Learning. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann. \n\n[Hansen and Salamon, 1990] Hansen, L.K., and Salamon, P. (1990). Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12 (993-1001). \n\n[Leblanc and Tibshirani, 1993] Leblanc, M., Tibshirani, R. (1993). Combining estimates in regression and classification. Dept. of Statistics, University of Toronto, TR. \n\n[Meir, 1995] Meir, R. (1995). Bias, variance and the combination of estimators. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann. \n\n[Merz, 1995] Merz, C.J. (1995). Dynamical Selection of Learning Algorithms. In Fisher, C. and Lenz, H. (Eds.) Learning from Data: Artificial Intelligence and Statistics, 5. Springer Verlag. \n\n[Montgomery and Friedman 1993] Montgomery, D.C., and Friedman, D.J. (1993). Prediction Using Regression Models with Multicollinear Predictor Variables. IIE Transactions, vol. 25, no. 3, 73-85. 
\n\n[Opitz and Shavlik, 1995] Opitz, D.W., Shavlik, J.W. (1996). Generating Accurate and Diverse Members of a Neural-Network Ensemble. Advances in Neural Information Processing Systems 8. Touretzky, D.S., Mozer, M.C., and Hasselmo, M.E., eds. Cambridge, MA: MIT Press. \n\n[Perrone and Cooper, 1992] Perrone, M.P., Cooper, L.N. (1993). When Networks Disagree: Ensemble Methods for Hybrid Neural Networks. Neural Networks for Speech and Image Processing, edited by Mammone, R.J. New York: Chapman and Hall. \n\n[Rumelhart, 1986] Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning Internal Representations by Error Propagation. Parallel Distributed Processing, 1, 318-362. Cambridge, MA: MIT Press. \n\n[Tresp, 1995] Tresp, V., Taniguchi, M. (1995). Combining Estimators Using Non-Constant Weighting Functions. In Advances in Neural Information Processing Systems 7. San Mateo, CA: Morgan Kaufmann. \n\n[Wolpert, 1992] Wolpert, D.H. (1992). Stacked Generalization. Neural Networks, 5, 241-259.", "award": [], "sourceid": 1201, "authors": [{"given_name": "Christopher", "family_name": "Merz", "institution": null}, {"given_name": "Michael", "family_name": "Pazzani", "institution": null}]}