{"title": "Fast Network Pruning and Feature Extraction by using the Unit-OBS Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 655, "page_last": 661, "abstract": null, "full_text": "Fast Network Pruning and Feature Extraction Using the Unit-OBS Algorithm\n\nAchim Stahlberger and Martin Riedmiller\n\nInstitut f\u00fcr Logik, Komplexit\u00e4t und Deduktionssysteme\n\nUniversit\u00e4t Karlsruhe, 76128 Karlsruhe, Germany\n\nemail: stahlb@ira.uka.de, riedml@ira.uka.de\n\nAbstract\n\nThe algorithm described in this article is based on the OBS algorithm by Hassibi, Stork and Wolff ([1] and [2]). The main disadvantage of OBS is its high complexity: OBS needs to calculate the inverse Hessian to delete only one weight, and therefore needs much time to prune a big net. Because calculating the inverse Hessian takes most of the time in the OBS algorithm, a better algorithm should use this matrix to remove more than one weight. The algorithm described in this article, called Unit-OBS, is a method to overcome this disadvantage. It needs to calculate the inverse Hessian only once to remove one whole unit, thus drastically reducing the time to prune big nets. A further advantage of Unit-OBS is that it can be used to do a feature extraction on the input data. This can be helpful for the understanding of unknown problems.\n\n1 Introduction\n\nThis article is based on the technical report [3] about speeding up the OBS algorithm. The main target of this work was to reduce the high complexity $O(n^2 p)$ of the OBS algorithm in order to use it for big nets in a reasonable time. Two \"exact\" algorithms were developed which lead to exactly the same results as OBS but use less time. The first, with time $O(n^{1.8} p)$, makes use of Strassen's fast matrix multiplication algorithm. The second algorithm uses algebraic transformations to speed up the calculation and needs time $O(n p^2)$. This algorithm is faster than OBS in the special case of $p < n$.\n\nTo get a much higher speedup than these exact algorithms can achieve, an improved OBS algorithm was developed which drastically reduces the runtime needed to prune a big network. The basic idea is to use the inverse Hessian to remove a group of weights instead of only one, because the calculation of this matrix takes the most time in the OBS algorithm. This idea leads to an algorithm called Unit-OBS that is able to remove whole units.\n\nUnit-OBS has two main advantages: First, it is a fast algorithm to prune big nets, because whole units are removed in every step instead of slowly pruning weight by weight. Second, it can be used to do a feature extraction on the input data by removing unimportant input units. This is helpful for the understanding of unknown problems.\n\n2 Optimal Brain Surgeon\n\nThis section gives a short summary of the OBS algorithm described by Hassibi, Stork and Wolff in [1] and [2]. As they showed, the increase in error (when changing the weights by $\\Delta w$) is\n\n$$\\Delta E = \\frac{1}{2} \\Delta w^T H \\Delta w \\qquad (1)$$\n\nwhere $H$ is the Hessian matrix. The goal is to eliminate weight $w_q$ and minimize the increase in error given by Eq. 1. Eliminating $w_q$ can be expressed by $w_q + \\Delta w_q = 0$, which is equivalent to $(w + \\Delta w)^T e_q = 0$, where $e_q$ is the unit vector corresponding to weight $w_q$ ($w^T e_q = w_q$). Solving this constrained extremum problem using Lagrange's method leads to the solution\n\n$$\\Delta E = \\frac{w_q^2}{2 H^{-1}_{qq}} \\qquad (2)$$\n\n$$\\Delta w = -\\frac{w_q}{H^{-1}_{qq}} H^{-1} e_q \\qquad (3)$$\n\n$H^{-1}_{qq}$ denotes the element $(q, q)$ of matrix $H^{-1}$. 
For every weight $w_q$ the minimal increase in error $\\Delta E(w_q)$ is calculated, and the weight which leads to the overall minimum is removed while all other weights are adapted according to Eq. 3. Hassibi, Stork and Wolff also showed how to calculate $H^{-1}$ in time $O(n^2 p)$, where $n$ is the number of weights and $p$ the number of patterns.\n\nThe main disadvantage of the OBS algorithm is that it needs time $O(n^2 p)$ to remove only one weight, thus needing much time to prune big nets. The basic idea to soften this disadvantage is to use $H^{-1}$ to remove more than one weight. This generalized OBS algorithm is described in the next section.\n\n3 Generalized OBS (G-OBS)\n\nThis section presents a generalized OBS algorithm (G-OBS) which can be used to delete $m$ weights in one step with minimal increase in error. As in the OBS algorithm, the increase in error is given by $\\Delta E = \\frac{1}{2} \\Delta w^T H \\Delta w$. But the condition $w_q + \\Delta w_q = 0$ is replaced by the generalized condition\n\n$$(w + \\Delta w)^T M = 0, \\qquad M = (e_{q_1}\\ e_{q_2}\\ \\ldots\\ e_{q_m}) \\qquad (4)$$\n\nwhere $M$ is the selection matrix (selecting the weights to be removed) and $q_1, q_2, \\ldots, q_m$ are the indices of the weights that will be removed. Solving this constrained extremum problem using Lagrange's method leads to the solution\n\n$$\\Delta E = \\frac{1}{2} w^T M (M^T H^{-1} M)^{-1} M^T w \\qquad (5)$$\n\n$$\\Delta w = -H^{-1} M (M^T H^{-1} M)^{-1} M^T w \\qquad (6)$$\n\nChoosing $M = e_q$, Eq. 5 and 6 reduce to Eq. 2 and 3. This shows that OBS is (as expected) a special case of G-OBS. The problem of calculating $H^{-1}$ was already solved by Hassibi, Stork and Wolff ([1] and [2]).\n\n4 Analysis of G-OBS\n\nHassibi, Stork and Wolff ([1] and [2]) showed that the time to calculate $H^{-1}$ is in $O(n^2 p)$. The calculation of $\\Delta E$ according to Eq. 
5 needs time $O(m^3)$\u2020, where $m$ is the number of weights to be removed. The calculation of $\\Delta w$ (Eq. 6) needs time $O(nm + m^3)$.\n\n\u2020 $M$ is a matrix of special type and thus the calculation of $(M^T H^{-1} M)$ needs only $O(m^2)$ operations.\n\nThe problem with this solution is that it is not known which weights should be deleted, and thus $\\Delta E$ has to be calculated for all possible combinations to find the global minimum in error increase. Choosing $m$ weights out of $n$ can be done in $\\binom{n}{m}$ possible ways. This takes time $\\binom{n}{m} O(m^3)$ to find the minimum. Therefore the total runtime of the generalized OBS algorithm to remove $m$ weights (with minimal increase in error) is\n\n$$T_{\\text{G-OBS}} = O\\left(n^2 p + \\binom{n}{m} m^3 + nm\\right)$$\n\nThe problem is that for $m > 3$ the term $\\binom{n}{m} m^3$ dominates and $T_{\\text{G-OBS}}$ is at least of order $n^4$. In other words, G-OBS can be used only to remove a maximum of three weights in one step. But this means little advantage over OBS.\n\nTo overcome this problem, the set of possible combinations has to be restricted to a small subset of combinations that seem to be \"good\" combinations. This reduces the term $\\binom{n}{m} m^3$ to a reasonable amount. One way to do this is to let a good combination consist of all outgoing connections of a unit. This reduces the number of combinations to the number of units. The basic idea behind this subset is: if all outgoing connections of a unit can be removed, then the whole unit can be deleted because it cannot influence the net output anymore. Therefore choosing this subset leads to an algorithm called Unit-OBS that is able to remove whole units without the need to recalculate $H^{-1}$.\n\n5 Special Case of G-OBS: Unit-OBS\n\nWith the results of the last sections we can now describe an algorithm called Unit-OBS to remove whole units.\n\n1. Train a network to minimum error.
\n\n2. Compute $H^{-1}$.\n\n3. For each unit $u$:\n\n(a) Compute the indices $q_1, q_2, \\ldots, q_{m(u)}$ of the outgoing connections of unit $u$, where $m(u)$ is the number of outgoing connections of unit $u$.\n\n(b) $M := (e_{q_1}\\ e_{q_2}\\ \\ldots\\ e_{q_{m(u)}})$\n\n(c) $\\Delta E(u) := \\frac{1}{2} w^T M (M^T H^{-1} M)^{-1} M^T w$\n\n4. Find the unit $u_0$ that gives the smallest increase in error $\\Delta E(u_0)$.\n\n5. $M := M(u_0)$ (refer to steps 3(a) and 3(b))\n\n6. $\\Delta w := -H^{-1} M (M^T H^{-1} M)^{-1} M^T w$\n\n7. Remove unit $u_0$ and use $\\Delta w$ to update all weights.\n\n8. Repeat steps 2 to 7 until a stopping criterion is reached.\n\nFollowing the analysis of G-OBS, the time to remove one unit is\n\n$$T_{\\text{Unit-OBS}} = O(n^2 p + u m^3) \\qquad (7)$$\n\nwhere $u$ is the number of units in the network and $m$ is the maximum number of outgoing connections. If $m$ is much smaller than $n$ we can neglect the term $u m^3$, and the main problem is to calculate $H^{-1}$. Therefore, if $m$ is small, we can say that Unit-OBS needs the same time to remove a whole unit as OBS needs to remove a single weight. The speedup when removing units with an average of $s$ outgoing connections should then be $s$.\n\n6 Simulation results\n\n6.1 The MONK-1 benchmark\n\nUnit-OBS was applied to the MONK's problems because the underlying logical rules are well known and it is easy to say which input units are important to the problem and which input units can be removed. The simulations showed that in no case did Unit-OBS remove a wrong unit and that it is able to remove all unimportant input units.\n\nFigure 1 shows a MONK-1 net pruned with Unit-OBS. This net is the minimal network that can be found by Unit-OBS. Table 1 shows the speedup of Unit-OBS compared to OBS in finding an equal-size network for the MONK-1 problem.\n\nThe network shown in Fig. 
1 is only minimal in the number of units but not minimal with respect to the number of weights. Hassibi, Stork and Wolff ([1] and [2]) found a network with only 14 weights by applying OBS (Fig. 3). In the framework of Unit-OBS, OBS can be used to do further pruning on the network after all possible units have been pruned. The advantage lies in the fact that the time-consuming OBS algorithm is now applied to a much smaller network (22 weights instead of 58). The result of this combination of Unit-OBS and OBS is a network with only 14 weights (Fig. 2), which has 100% accuracy like the minimal net found by OBS (see Table 1).\n\nFigure 1: MONK-1 net pruned with Unit-OBS, 22 weights. All unimportant units are removed and this net needs fewer units than the minimal network found by OBS.\n\nFigure 2: Minimal network (14 weights) for the MONK-1 problem found by the combination of Unit-OBS with OBS. The logical rule for the MONK-1 problem is more evident in this network than in the minimal network found by OBS (cf. Fig. 3).\n\nFigure 3: Minimal network (14 weights) for the MONK-1 problem found by OBS (see [1] and [2]).\n\nalgorithm        # weights   topology   speedup\u2021   perf. train   perf. test\nno pruning       58          17-3-1     -          100%          100%\nOBS              14          6-3-1      1.0        100%          100%\nUnit-OBS         22          5-3-1      2.8        100%          100%\nUnit-OBS + OBS   14          5-3-1      2.6        100%          100%\n\nTable 1: The MONK-1 problem\n\nFor the initial MONK-1 network the maximum number of outgoing connections ($m$ in Eq. 7) is 3, and this is much smaller than the number of weights. The average number of outgoing connections of the removed units is 3 and therefore we expect a speedup by a factor of 3 (compare Table 1).\n\nBy comparing the two minimal nets found by Unit-OBS/OBS (Fig. 2) and OBS (Fig. 3), it can be seen that the underlying logical rule (out = 1 $\\Leftrightarrow$ Attribute_1 = Attribute_2 or Attribute_5 = 1) is more evident in the network found by Unit-OBS/OBS. The other advantage of Unit-OBS is that it needs only 38% of the time OBS needs to find this minimal network. This advantage makes it possible to apply Unit-OBS to big nets for which OBS is not useful because of its long computation time.\n\n6.2 The Thyroid Benchmark\n\nThe following describes the application of pruning to a medical classification problem. The task is to classify measured data values of patients into three categories. The output of the three-layered feedforward network therefore consists of three neurons indicating the corresponding class. The input consists of 21 signals, both continuous and binary.\n\nThe task was first described in [4]. The results obtained there are shown in the first row of Table 2. The initially used network has 21 input neurons, 10 hidden and 3 output neurons, which are fully connected using shortcut connections.\n\nWhen applying OBS to prune the network weights, more than 90% of the weights can be pruned. However, over 8 hours of CPU time on a SPARC workstation are needed to do so (row 2 in Table 2). The solution finally found by OBS uses only 8 of the original 21 input features. 
The pruned network shows a slightly improved classification rate on the test set.\n\nUnit-OBS finds a solution with 41 weights in only 76 minutes of CPU time. In comparison to the original OBS algorithm, Unit-OBS is about 8 times as fast when deleting the same number of weights. Another important fact can be seen from the result: the Unit-OBS network considers only 7 of the original 21 inputs, 1 fewer than the weight-focused OBS algorithm. The number of hidden units is reduced to 2 units, 5 units fewer than the OBS network uses.\n\nWhen looking further for an absolute minimum in the number of weights used, the Unit-OBS network can be additionally pruned using OBS. This finally leads to an optimized network with only 24 weights. The classification performance of this very small network is 98.5%, which is even slightly better than that obtained by the much bigger initial net.\n\nalgorithm        # weights   topology   speedup   cpu-time   perf. test\nno pruning       316         21-10-3    -         -          98.4%\nOBS              28          8-7-3      1.0       511 min.   98.5%\nUnit-OBS         41          7-2-3      7.8       76 min.    98.4%\nUnit-OBS + OBS   24          7-2-3      -         137 min.   98.5%\n\nTable 2: The thyroid benchmark\n\n\u2021 Speedup in Tables 1 and 2 is relative to OBS deleting the same number of weights.\n\n7 Conclusion\n\nThe article describes an improvement of the OBS algorithm introduced in [1], called Generalized OBS (G-OBS). The underlying idea is to exploit second-order information to delete multiple weights at once. The aim of reducing the number of different weight groups leads to the formulation of the Unit-OBS algorithm, which considers the outgoing weights of one unit as a group of candidate weights: when all the weights of a unit can be deleted, the unit itself can be pruned. 
The new Unit-OBS algorithm has two major advantages: First, it considerably accelerates pruning by a speedup factor which lies in the range of the average number of outgoing weights of each unit. Second, deleting complete units is especially interesting for determining the input features which really contribute to the computation of the output. This information can be used to get more insight into the underlying problem structure, e.g. to facilitate the process of rule extraction.\n\nReferences\n\n[1] B. Hassibi, D. G. Stork: Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1993, pages 164-171.\n\n[2] B. Hassibi, D. G. Stork, G. J. Wolff: Optimal Brain Surgeon and General Network Pruning. IEEE International Conference on Neural Networks, 1993, Volume 1, pages 293-299.\n\n[3] A. Stahlberger: OBS - Verbesserungen und neue Ans\u00e4tze. Diplomarbeit, Universit\u00e4t Karlsruhe, Institut f\u00fcr Logik, Komplexit\u00e4t und Deduktionssysteme, 1996.\n\n[4] W. Schiffmann, M. Joost, R. Werner: Optimization of the Backpropagation Algorithm for Training Multilayer Perceptrons. Technical Report, University of Koblenz, Institute of Physics, 1993.\n", "award": [], "sourceid": 1233, "authors": [{"given_name": "Achim", "family_name": "Stahlberger", "institution": null}, {"given_name": "Martin", "family_name": "Riedmiller", "institution": null}]}