{"title": "Optimal Brain Surgeon: Extensions and performance comparisons", "book": "Advances in Neural Information Processing Systems", "page_first": 263, "page_last": 270, "abstract": "", "full_text": "Optimal Brain Surgeon: Extensions and performance comparisons\n\nBabak Hassibi*, David G. Stork, Gregory Wolff, Takahiro Watanabe\n\nRicoh California Research Center\n2882 Sand Hill Road Suite 115\nMenlo Park, CA 94025-7022\n\nand\n\n* Department of Electrical Engineering\n105B Durand Hall\nStanford University\nStanford, CA 94305-4055\n\nAbstract\n\nWe extend Optimal Brain Surgeon (OBS), a second-order method for pruning networks, to allow for general error measures, and explore a reduced computational and storage implementation via a dominant eigenspace decomposition. Simulations on nonlinear, noisy pattern classification problems reveal that OBS does lead to improved generalization, and performs favorably in comparison with Optimal Brain Damage (OBD). We find that the required retraining steps in OBD may lead to inferior generalization, a result that can be interpreted as due to injecting noise back into the system. A common technique is to stop training of a large network at the minimum validation error. We found that the test error could be reduced even further by means of OBS (but not OBD) pruning. Our results justify the $t \\to o$ approximation used in OBS and indicate why retraining in a highly pruned network may lead to inferior performance.\n\n1 INTRODUCTION\n\nThe fundamental theory of generalization favors simplicity. For a given level of performance on observed data, models with fewer parameters can be expected to perform better on test data. 
In practice, we find that neural networks with fewer weights typically generalize better than large networks with the same training error. To this end, LeCun, Denker and Solla's (1990) Optimal Brain Damage method (OBD) sought to delete weights while keeping the training error as small as possible. Hassibi and Stork (1993) extended OBD to include the off-diagonal terms in the network's Hessian, which were shown to be significant and important for pruning in classical and benchmark problems.\n\nOBD and Optimal Brain Surgeon (OBS) share the same basic approach of training a network to a (local) minimum in error at weight $\\mathbf{w}^*$, and then pruning a weight that leads to the smallest increase in the training error. The predicted functional increase in the error for a change in the full weight vector $\\delta\\mathbf{w}$ is:\n\n$$\\delta E = \\underbrace{\\left(\\frac{\\partial E}{\\partial \\mathbf{w}}\\right)^T \\cdot \\delta\\mathbf{w}}_{\\approx 0} + \\frac{1}{2}\\, \\delta\\mathbf{w}^T \\cdot \\underbrace{\\frac{\\partial^2 E}{\\partial \\mathbf{w}^2}}_{\\equiv \\mathbf{H}} \\cdot \\delta\\mathbf{w} + \\underbrace{O(\\|\\delta\\mathbf{w}\\|^3)}_{\\approx 0}, \\qquad (1)$$\n\nwhere $\\mathbf{H}$ is the Hessian matrix. The first term vanishes because we are at a local minimum in error; we ignore third- and higher-order terms (Gorodkin et al., 1993). Hassibi and Stork (1993) first showed that the general solution for minimizing this function given the constraint of deleting one weight was:\n\n$$\\delta\\mathbf{w} = -\\frac{w_q}{[\\mathbf{H}^{-1}]_{qq}}\\, \\mathbf{H}^{-1} \\cdot \\mathbf{e}_q \\quad \\text{and} \\quad L_q = \\frac{1}{2}\\, \\frac{w_q^2}{[\\mathbf{H}^{-1}]_{qq}}. \\qquad (2)$$\n\nHere, $\\mathbf{e}_q$ is the unit vector along the $q$th direction in weight space and $L_q$ is the saliency of weight $q$, an estimate of the increase in training error if weight $q$ is pruned and the other weights updated by the left equation in Eq. 2. 
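

The single-weight pruning step of Hassibi and Stork (1993) can be sketched in a few lines. This is only an illustration under stated assumptions: the inverse Hessian is taken as already computed and stored as a dense nested list, and the function name obs_prune_step and the toy numbers are hypothetical, not taken from the paper. The saliency used is L_q = w_q^2 / (2 [H^-1]_qq) and the weight change is delta_w = -(w_q / [H^-1]_qq) H^-1 e_q.

```python
def obs_prune_step(w, h_inv):
    # One OBS step: pick the weight with the smallest saliency,
    # delete it, and adjust all remaining weights.
    # w is a list of weights; h_inv is the inverse Hessian (assumed given).
    n = len(w)
    # Saliency L_q = w_q^2 / (2 [H^-1]_qq) for every weight q.
    saliency = [0.5 * w[q] ** 2 / h_inv[q][q] for q in range(n)]
    q = min(range(n), key=lambda i: saliency[i])
    # delta_w = -(w_q / [H^-1]_qq) * (column q of H^-1).
    scale = w[q] / h_inv[q][q]
    w_new = [w[i] - scale * h_inv[i][q] for i in range(n)]
    w_new[q] = 0.0  # force an exact zero, removing floating-point residue
    return q, saliency[q], w_new


# Toy two-weight example (hypothetical numbers): H = [[2, 0.5], [0.5, 1]].
det = 2.0 * 1.0 - 0.5 * 0.5
H_INV = [[1.0 / det, -0.5 / det], [-0.5 / det, 2.0 / det]]
q, L_q, w_new = obs_prune_step([1.0, 0.2], H_INV)
```

In this toy case the second weight has the smaller saliency, so it is deleted and the first weight is shifted slightly to compensate, which is the off-diagonal effect that a purely diagonal approximation would miss.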
\n\n2 GENERAL ERROR MEASURES AND FISHER'S METHOD OF SCORING\n\nIn this section we show that the recursive procedure for computing the inverse Hessian for sum squared errors presented in Hassibi and Stork (1993) generalizes to any twice differentiable distance norm and that the key approximation based on Fisher's method of scoring is still valid.\n\nConsider an arbitrary twice differentiable distance norm $d(t, o)$ where $t$ is the desired output (teaching vector) and $o = F(\\mathbf{w}, \\mathbf{in})$ the actual output. Given a weight vector $\\mathbf{w}$, $F$ maps the input vector $\\mathbf{in}$ to the output; the total error over $P$ patterns is $E = \\frac{1}{P} \\sum_{k=1}^{P} d(t^{[k]}, o^{[k]})$. It is straightforward to show that for a single output unit network the Hessian is:\n\n$$\\mathbf{H} = \\frac{1}{P} \\sum_{k=1}^{P} \\left[ \\frac{\\partial F(\\mathbf{w}, \\mathbf{in}^{[k]})}{\\partial \\mathbf{w}} \\cdot \\frac{\\partial^2 d(t^{[k]}, o^{[k]})}{\\partial o^2} \\cdot \\frac{\\partial F^T(\\mathbf{w}, \\mathbf{in}^{[k]})}{\\partial \\mathbf{w}} + \\frac{\\partial d(t^{[k]}, o^{[k]})}{\\partial o} \\cdot \\frac{\\partial^2 F(\\mathbf{w}, \\mathbf{in}^{[k]})}{\\partial \\mathbf{w}^2} \\right]. \\qquad (3)$$\n\nThe second term is of order $O(\\|t - o\\|)$; using Fisher's method of scoring (Seber & Wild, 1989), we set this term to zero. Thus our Hessian reduces to:\n\n$$\\mathbf{H} = \\frac{1}{P} \\sum_{k=1}^{P} \\frac{\\partial F(\\mathbf{w}, \\mathbf{in}^{[k]})}{\\partial \\mathbf{w}} \\cdot \\frac{\\partial^2 d(t^{[k]}, o^{[k]})}{\\partial o^2} \\cdot \\frac{\\partial F^T(\\mathbf{w}, \\mathbf{in}^{[k]})}{\\partial \\mathbf{w}}. \\qquad (4)$$\n\nWe define $\\mathbf{X}_k \\equiv \\partial F(\\mathbf{w}, \\mathbf{in}^{[k]}) / \\partial \\mathbf{w}$ and $a_k \\equiv \\partial^2 d(t^{[k]}, o^{[k]}) / \\partial o^2$, and following the logic of Hassibi and Stork (1993) we can easily show that the recursion for computing the inverse Hessian becomes:\n\n$$\\mathbf{H}_{k+1}^{-1} = \\mathbf{H}_k^{-1} - \\frac{\\mathbf{H}_k^{-1} \\cdot \\mathbf{X}_{k+1} \\cdot \\mathbf{X}_{k+1}^T \\cdot \\mathbf{H}_k^{-1}}{\\frac{P}{a_{k+1}} + \\mathbf{X}_{k+1}^T \\cdot \\mathbf{H}_k^{-1} \\cdot \\mathbf{X}_{k+1}}, \\quad \\mathbf{H}_0^{-1} = \\alpha^{-1} \\mathbf{I} \\quad \\text{and} \\quad \\mathbf{H}_P^{-1} = \\mathbf{H}^{-1}, \\qquad (5)$$\n\nwhere $\\alpha$ is a small parameter, effectively a weight decay constant. Note how different error measures $d(t, o)$ scale the gradient vectors $\\mathbf{X}_k$ forming the Hessian (Eq. 4). 
For the squared error $d(t, o) = (t - o)^2$, $a_k$ is constant ($a_k = 1$ up to a constant factor), and all gradient vectors are weighted equally. The cross entropy or Kullback-Leibler distance,\n\n$$d(t, o) = o \\log\\frac{o}{t} + (1 - o) \\log\\frac{1 - o}{1 - t}, \\qquad 0 \\le o, t \\le 1, \\qquad (6)$$\n\nyields $a_k = \\frac{1}{o^{[k]} (1 - o^{[k]})}$. Hence if $o^{[k]}$ is close to zero or one, $\\mathbf{X}_k$ is given a large weight in the Hessian; conversely, the smallest value of $a_k$ occurs when $o^{[k]} = 1/2$. This is desirable and makes great intuitive sense, since in the cross entropy norm the value of $o^{[k]}$ is interpreted as the probability that the $k$th input pattern belongs to a particular class, and therefore we give large weight to those $\\mathbf{X}_k$ whose class we are most certain of and small weight to those of which we are least certain.\n\n3 EIGENSPACE DECOMPOSITION\n\nAlthough OBS has been shown to be a powerful method for small and intermediate sized networks (Hassibi, Stork and Wolff (1993) applied OBS successfully to NETtalk), its use in larger problems is difficult because of large storage and computation requirements. For a network of $n$ weights, simply storing the Hessian requires $O(n^2/2)$ elements and $O(Pn^2)$ computations are needed for each pruning step. Reducing this computational burden requires some type of approximation. Since OBS uses the inverse of the Hessian, any approximation to OBS will at some level reduce to an approximation of $\\mathbf{H}$. For instance OBD uses a diagonal approximation; magnitude-based methods use an isotropic approximation; and dividing the network into subsets (e.g., hidden-to-output and input-to-hidden) corresponds to the less-restrictive block diagonal approximation. In what follows we explore the dominant eigenspace decomposition of the inverse Hessian as our approximation. 
It should be remembered that all these are subsets of the full OBS approach.\n\n3.1 Theory\n\nThe dominant eigendecomposition is the best low-rank approximation of a matrix (in an induced 2-norm sense). Since the largest eigenvalues of $\\mathbf{H}^{-1}$ correspond to the smallest eigenvalues of $\\mathbf{H}$, this method will, roughly speaking, be pruning weights in the approximate nullspace of $\\mathbf{H}$. Dealing with a low rank approximation of $\\mathbf{H}^{-1}$ will drastically reduce the storage and computational requirements.\n\nConsider the eigendecomposition of $\\mathbf{H}$:\n\n$$\\mathbf{H} = \\begin{pmatrix} \\mathbf{U}_S & \\mathbf{U}_N \\end{pmatrix} \\begin{pmatrix} \\Sigma_S & 0 \\\\ 0 & \\Sigma_N \\end{pmatrix} \\begin{pmatrix} \\mathbf{U}_S^* \\\\ \\mathbf{U}_N^* \\end{pmatrix}, \\qquad (7)$$\n\nwhere $\\Sigma_S$ contains the largest eigenvalues of $\\mathbf{H}$ and $\\Sigma_N$ the smallest ones. (We use the subscripts $S$ and $N$ to loosely connote signal and noise.) The dimension of the noise subspace is typically $m \\ll n$. $\\mathbf{U}_S$ and $\\mathbf{U}_N$ are $n \\times (n - m)$ and $n \\times m$ matrices that span the dominant eigenspaces of $\\mathbf{H}$ and $\\mathbf{H}^{-1}$, and $*$ denotes matrix transpose and complex conjugation. If, as suggested above, we restrict the weight prunings to lie in $\\mathbf{U}_N$, we obtain the following saliency and full weight change when removing the $q$th weight:\n\n$$\\bar{L}_q = \\frac{1}{2}\\, \\frac{w_q^2}{\\mathbf{e}_q^T \\cdot \\mathbf{U}_N \\cdot \\Sigma_N^{-1} \\cdot \\mathbf{U}_N^* \\cdot \\mathbf{e}_q} \\qquad (8)$$\n\n$$\\overline{\\delta\\mathbf{w}} = -\\frac{w_q}{\\mathbf{e}_q^T \\cdot \\mathbf{U}_N \\cdot \\Sigma_N^{-1} \\cdot \\mathbf{U}_N^* \\cdot \\mathbf{e}_q}\\, \\mathbf{U}_N \\Sigma_N^{-1} \\mathbf{U}_N^* \\mathbf{e}_q, \\qquad (9)$$\n\nwhere we have used 'bars' to indicate that these are approximations to Eq. 2. Note now that we need only to store $\\Sigma_N$ and $\\mathbf{U}_N$, which have roughly $nm$ elements. Likewise the computation required to estimate $\\Sigma_N$ and $\\mathbf{U}_N$ is $O(Pnm)$.\n\nThe bound on $\\bar{L}_q$ is:\n\n$$L_q \\le \\bar{L}_q \\le L_q + \\frac{2 L_q \\bar{L}_q}{w_q^2} \\cdot \\frac{1}{\\underline{\\sigma}^{(S)}}, \\qquad (10)$$\n\nwhere $\\underline{\\sigma}^{(S)}$ is the smallest eigenvalue of $\\Sigma_S$. Moreover if $\\underline{\\sigma}^{(S)}$ is large enough so that $\\underline{\\sigma}^{(S)} > [\\mathbf{H}^{-1}]_{qq}$ we have the following simpler form:\n\n(11)\n\nIn either case Eqs. 10 and 11 indicate that the larger $\\underline{\\sigma}^{(S)}$ is, the tighter the bounds are. 
Thus if the subspace dimension $m$ is such that the eigenvalues associated with $\\mathbf{U}_S$ are large, then we will have a good approximation.\n\nLeCun, Simard and Pearlmutter (1993) have suggested a method that can be used to estimate the smallest eigenvectors of the Hessian. However, for OBS (as we shall see) it is best to use the Hessian with the $t \\to o$ approximation, and their method is not appropriate.\n\n3.2 Simulations\n\nWe pruned networks trained on the three Monk's problems (Thrun et al., 1991) using the full OBS and a 5-dimensional eigenspace version of OBS, using the validation error rate as the stopping criterion. (We chose a 5-dimensional subspace, because this reduced the computational complexity by an order of magnitude.) The Table shows the number of weights obtained. It is clear that this eigenspace decomposition was not particularly successful. It appears as though the off-diagonal terms in $\\mathbf{H}$ beyond those in the eigenspace are important, and their omission leads to bad pruning. However, this warrants further study.\n\n         unpruned   OBS   5-d eigenspace\nMonk1       58       14         28\nMonk2       39       16         27\nMonk3       39        4         11\n\n4 OBS/OBD COMPARISON\n\nGeneral criteria for comparing pruning methods do not exist. Since such methods amount to assuming a particular prior distribution over the parameters, the empirical results usually tell us more about the problem space than about the methods themselves. However, for two methods, such as OBS and OBD, which utilize the same cost function and differ only in their approximations, empirical comparisons can be informative. 
Hence, we have applied both OBS and OBD to several problems, including an artificially generated statistical classification task and a real-world copier voltage control problem. As we show below, the OBS algorithm usually results in better generalization performance.\n\n4.1 MULTIPLE GAUSSIAN PRIORS\n\nWe created a two-category classification problem with a five-dimensional input space. Category A consisted of two Gaussian distributions with mean vectors $\\mu_{A1} = (1, 1, 0, 1, .5)$ and $\\mu_{A2} = (0, 0, 1, 0, .5)$ and covariances $\\Sigma_{A1} = \\mathrm{Diag}[0.99, 1.0, 0.88, 0.70, 0.95]$ and $\\Sigma_{A2} = \\mathrm{Diag}[1.28, 0.60, 0.52, 0.93, 0.93]$, while category B had means $\\mu_{B1} = (0, 1, 0, 0, .5)$ and $\\mu_{B2} = (1, 0, 1, 1, .5)$ and covariances $\\Sigma_{B1} = \\mathrm{Diag}[0.84, 0.68, 1.28, 1.02, 0.89]$ and $\\Sigma_{B2} = \\mathrm{Diag}[0.52, 1.25, 1.09, 0.64, 1.13]$. The networks were feedforward with 5 input units, 9 hidden units, and a single output unit (64 weights total). The training and the test sets consisted of 1000 patterns each, randomly chosen from the equi-probable categories. The problem was a difficult one: even with the somewhat large number of weights it was not possible to obtain less than 0.15 squared error per training pattern. We trained the networks to a local error minimum and then applied OBD (with retraining after each pruning step using backpropagation) as well as OBS.\n\nFigure 1 (left) shows the training errors for the network as a function of the number of remaining weights during pruning by OBS and by OBD. As more weights are pruned the training errors for both OBS and OBD typically increase. 
Comparing the two graphs for the first pruned weights, the training errors for OBD and OBS are roughly equal, after which the training error of OBS is less until the 24th weight is removed. The reason OBD training is initially slightly better is that the network was not at an exact local minimum; indeed in the first few stages the training error for OBD actually becomes less than its original value. (Training exhaustively to the true local minimum took prohibitively long.) In contrast, due to the $t \\to o$ approximation OBS tries to keep the network response close to where it was, even if that isn't the minimum $\\mathbf{w}^*$. We think it plausible that if the network were at an exact local minimum OBS would have had virtually identical performance.\n\n[Figure 1, two panels titled Train (left) and Test (right): error E versus number of weights (30 to 65) for OBS and OBD.]\n\nFigure 1: OBS and OBD training error on a sum of Gaussians prior pattern classification task as a function of the number of weights in the network. (Pruning proceeds right to left.) OBS pruning employed $\\alpha = 10^{-6}$ (cf., Eq. 5); OBD employed 60 retraining epochs after each pruning.\n\nSince OBD is using retraining, the only reason why OBS can outperform it after the first steps is that OBD has removed an incorrect weight, due to its diagonal approximation. (The reason OBS behaves poorly after removing 24 weights, a radically pruned net, may be that the second-order approximation breaks down at this point.) 
We can see that the minimum on test error occurs before this breakdown, meaning that the failed approximation (Fig. 2) does not affect our choice of the optimal network, at least for this problem.\n\nThe most important and interesting result is the test error for these pruned networks (Figure 1, right). The test error for OBD does not show any consistent behaviour, other than the fact that on the average it generally goes up. This is contrary to what one would expect of a pruning algorithm. It seems that the retraining phase works against the pruning process, by tending to reinforce overfitting and to reinject the training set noise. For OBS, however, the test error consistently decreases until a minimum is reached after removing 22 weights, because the $t \\to o$ approximation avoids reinjecting the training set noise.\n\n4.2 OBS/OBD PRUNING AND \"STOPPED\" NETWORKS\n\nA popular method of avoiding overfitting is to stop training a large net when the validation error reaches a minimum. In order to explore whether pruning could improve the performance on such a \"stopped\" network (i.e., not at $\\mathbf{w}^*$), we monitored the test error for the above problem and recorded the weights for which a minimum on the test set occurred. We then applied OBS and OBD to this network.\n\n[Figure 2: test error E (0.194 to 0.204) versus number of weights (35 to 60) for OBS and OBD.]\n\nFigure 2: A 64-weight network was trained to minimum validation error on the Gaussian problem (not $\\mathbf{w}^*$) and then pruned by OBD and by OBS. The test error on the resulting network is shown. (Pruning proceeds from right to left.) 
Note especially that even though the network is far from $\\mathbf{w}^*$, OBS leads to lower test error over a wide range of prunings, even though OBD employs retraining.\n\nThe results shown in Figure 2 indicate that with OBS we were able to reduce the test error, and this reached a minimum after removing 17 weights. OBD was not able to consistently reduce the test error.\n\nThis last result and those from Fig. 2 have important consequences. There are no universal stopping criteria based on theory (for the reasons mentioned above), but it is a typical practice to use validation error as such a criterion. As can be seen in Figure 2, the test error (which we here consider a validation error) consistently decreases to a unique minimum for pruning by OBS. For the network pruned (and continuously retrained) by OBD, there is no such structure in the validation curves. There seems to be no reliable clue that would permit the user to know when to stop pruning.\n\n4.3 COPIER CONTROL APPLICATION\n\nThe quality of an image produced by a copier is dependent upon a wide variety of factors: time since last copy, time since last toner cartridge installed, temperature, humidity, overall graylevel of the source document, etc. These factors interact in a highly non-linear fashion, so that mathematical modelling of their interrelationships is difficult. Morita et al. (1992) used backpropagation to train an 8-4-8 network (65 weights) on real-world data, and managed to achieve an RMS voltage error of 0.0124 on a critical control plate. We pruned their network with OBD with retraining as well as with OBS. When the network was pruned by OBD with retraining, the test error continually increased (erratically) such that at 34 remaining weights, the RMS error was 0.023. 
When we also pruned the original net by OBS, the test error gradually decreased such that at the same number of weights the test error was 0.012, significantly lower than that of the net pruned by OBD.\n\n5 CONCLUSIONS\n\nWe compared pruning by OBS and by OBD with retraining on a difficult non-linear statistical pattern recognition problem and found that OBS led to lower generalization error. We also considered the widely used technique of training large nets to minimum validation error. To our surprise, we found that subsequent pruning by OBS lowered generalization error, thereby demonstrating that such networks still have overfitting problems. We have found that the dominant eigenspace approach to OBS leads to poor performance. Our simulations support the claim that the $t \\to o$ approximation used in OBS avoids reinjecting training set noise into the network. In contrast, including such $t - o$ terms in OBS reinjects training set noise and degrades generalization performance, as does retraining in OBD.\n\nAcknowledgements\n\nThanks to T. Kailath for support of B.H. through grants AFOSR 91-0060 and DAAL03-91-C-0010. Address reprint requests to Dr. Stork: stork@crc.ricoh.com.\n\nReferences\n\nJ. Gorodkin, L. K. Hansen, A. Krogh, C. Svarer and O. Winther. (1993) A quantitative study of pruning by Optimal Brain Damage. International Journal of Neural Systems 4(2) 159-169.\n\nB. Hassibi & D. G. Stork. (1993) Second order derivatives for network pruning: Optimal Brain Surgeon. In S. J. Hanson, J. D. Cowan and C. L. Giles (eds.), Advances in Neural Information Processing Systems 5, 164-171. San Mateo, CA: Morgan Kaufmann.\n\nB. Hassibi, D. G. Stork & G. Wolff. 
(1993) Optimal Brain Surgeon and general network pruning. Proceedings of ICNN 93, San Francisco 1 IEEE Press. 293-299.\n\nY. LeCun, J. Denker & S. Solla. (1990) Optimal Brain Damage. In D. Touretzky (ed.), Advances in Neural Information Processing Systems 2, 598-605. San Mateo, CA: Morgan Kaufmann.\n\nY. LeCun, P. Simard & B. Pearlmutter. (1993) Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In S. J. Hanson, J. D. Cowan & C. L. Giles (eds.), Advances in Neural Information Processing Systems 5, 156-163. San Mateo, CA: Morgan Kaufmann.\n\nT. Morita, M. Kanaya, T. Inagaki, H. Murayama & S. Kato. (1992) Photo-copier image density control using neural network and fuzzy theory. Second International Workshop on Industrial Fuzzy Control & Intelligent Systems December 2-4, College Station, TX, 10.\n\nS. Thrun and 23 co-authors. (1991) The Monk's Problems - A performance comparison of different learning algorithms. CMU-CS-91-197 Carnegie-Mellon University Dept. of Computer Science Technical Report.\n", "award": [], "sourceid": 749, "authors": [{"given_name": "Babak", "family_name": "Hassibi", "institution": null}, {"given_name": "David", "family_name": "Stork", "institution": null}, {"given_name": "Gregory", "family_name": "Wolff", "institution": null}]}