{"title": "Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy", "book": "Advances in Neural Information Processing Systems", "page_first": 208, "page_last": 215, "abstract": null, "full_text": "Bayesian Backprop in Action: Pruning, Committees, Error Bars and an Application to Spectroscopy\n\nHans Henrik Thodberg\nDanish Meat Research Institute\nMaglegaardsvej 2, DK-4000 Roskilde\nthodberg@nn.meatre.dk\n\nAbstract\n\nMacKay's Bayesian framework for backpropagation is conceptually appealing as well as practical. It automatically adjusts the weight decay parameters during training, and computes the evidence for each trained network. The evidence is proportional to our belief in the model. The networks with highest evidence turn out to generalise well. In this paper, the framework is extended to pruned nets, leading to an Ockham Factor for \"tuning the architecture to the data\". A committee of networks, selected by their high evidence, is a natural Bayesian construction. The evidence of a committee is computed. The framework is illustrated on real-world data from a near infrared spectrometer used to determine the fat content in minced meat. Error bars are computed, including the contribution from the dissent of the committee members.\n\n1 THE OCKHAM FACTOR\n\nWilliam of Ockham's (1285-1349) principle of economy in explanations can be formulated as follows:\n\nIf several theories account for a phenomenon we should prefer the simplest which describes the data sufficiently well.\n\nThe principle states that a model has two virtues: simplicity and goodness of fit. But what is the meaning of \"sufficiently well\", i.e. what is the optimal trade-off between the two virtues? 
With Bayesian model comparison we can deduce this trade-off.\n\nWe express our belief in a model as its probability given the data, and use Bayes' formula:\n\n$$P(H \\mid D) = \\frac{P(D \\mid H)\\,P(H)}{P(D)} \\qquad (1)$$\n\nWe assume that the prior belief P(H) is the same for all models, so we can compare models by comparing P(D|H), which is called the evidence for H and acts as a quality measure in model comparison.\n\nAssume that the model has a single tunable parameter w with a prior range $\\Delta w_{prior}$, so that $P(w \\mid H) = 1/\\Delta w_{prior}$. The most probable (or maximum posterior) value $w_{MP}$ of the parameter w is given by the maximum of\n\n$$P(w \\mid D, H) = \\frac{P(D \\mid w, H)\\,P(w \\mid H)}{P(D \\mid H)} \\qquad (2)$$\n\nThe width of this distribution is denoted $\\Delta w_{posterior}$. The evidence P(D|H) is obtained by integrating over the posterior w distribution and approximating the integral:\n\n$$P(D \\mid H) = \\int P(D \\mid w, H)\\,P(w \\mid H)\\,dw \\qquad (3)$$\n\n$$P(D \\mid H) \\simeq P(D \\mid w_{MP}, H)\\,\\frac{\\Delta w_{posterior}}{\\Delta w_{prior}} \\qquad (4)$$\n\n$$\\text{Evidence} \\simeq \\text{Likelihood} \\times \\text{Ockham Factor} \\qquad (5)$$\n\nThe evidence for the model is the product of two factors:\n\n\u2022 The best fit likelihood, i.e. the probability of the data given the model and the tuned parameters. It measures how well the tuned model fits the data.\n\n\u2022 The integrated probability of the tuned model parameters with their uncertainties, i.e. the collapse of the available parameter space when the data is taken into account. This factor is small when the model has many parameters or when some parameters must be tuned very accurately to fit the data. It is called the Ockham Factor since it is large when the model is simple.\n\nBy optimizing the modelling through the evidence framework we can avoid the overfitting problem as well as the equally important \"underfitting\" problem. 
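The trade-off in equations (4) and (5) can be illustrated with a minimal Python sketch. It assumes a single parameter with a flat prior. The two models and all numbers below are hypothetical, chosen only to show how a narrow posterior relative to a wide prior lowers the evidence even when the fit is slightly better.

```python
import math

def log_evidence(log_best_fit_likelihood, posterior_width, prior_width):
    # Equations (4)-(5) in log form:
    # log evidence = log best-fit likelihood + log Ockham factor.
    if not 0.0 < posterior_width <= prior_width:
        raise ValueError('expected 0 < posterior_width <= prior_width')
    log_ockham_factor = math.log(posterior_width / prior_width)
    return log_best_fit_likelihood + log_ockham_factor

# Hypothetical numbers, not taken from the paper.
# The simple model fits slightly worse but keeps half of its prior range.
simple_model = log_evidence(-12.0, posterior_width=0.5, prior_width=1.0)
# The flexible model fits slightly better but must be tuned very accurately.
flexible_model = log_evidence(-11.5, posterior_width=0.01, prior_width=10.0)

preferred = 'simple' if simple_model > flexible_model else 'flexible'
```

With these illustrative numbers the Ockham factor outweighs the small gain in likelihood, so the simple model has the higher evidence.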
\n\n2 THE FOUR LEVELS OF INFERENCE\n\nIn 1991-92 MacKay presented a comprehensive and detailed framework for combining backpropagation neural networks with Bayesian statistics (MacKay, 1992). He outlined four levels of inference which apply for instance to a regression problem where we have a training set and want to make predictions for new data:\n\nLevel 1 Make predictions including error bars for new input data.\nLevel 2 Estimate the weight parameters and their uncertainties.\nLevel 3 Estimate the scale parameters (the weight decay parameters and the noise scale parameter) and their uncertainties.\nLevel 4 Select the network architecture and for that architecture select one of the w-minima. Optionally select a committee to reflect the uncertainty on this level.\n\nLevel 1 is the typical goal in an application. But to make predictions we have to do some modelling, so at level 2 we pick a net and some weight decay parameters and train the net for a while. But the weight decay parameters were picked rather arbitrarily, so on level 3 we set them to their inferred maximum posterior (MP) value. We alternate between level 2 and 3 until the network has converged. This is still not the end, because the network architecture was also picked rather arbitrarily. Hence level 2 and 3 are repeated for other architectures and the evidences of these are computed on level 4. (Pruning makes level 4 more complicated, see section 6.)\n\nWhen we make inference on each of these levels, there are uncertainties which are described by the posterior distributions of the parameters which are inferred. The uncertainty on level 2 is described by the Hessian (the second derivative of the net cost function with respect to the weights). 
The uncertainty on level 3 is negligible if the number of weight decay parameters is small compared to the number of weights. The uncertainty on level 4 is described by the committee of networks with highest evidence within some margin (discussed below).\n\nThe uncertainties are used for two purposes. Firstly they give rise to error bars on the predictions on level 1. And secondly the posterior uncertainty divided by the prior uncertainty (the Ockham Factor) enters the evidence.\n\nMacKay's approach differs in two respects from other Bayesian approaches to neural nets:\n\n\u2022 It assumes the Gaussian approximation to the posterior weight distribution. In contrast, the Monte Carlo approach of (Neal, 1992) does not suffer from this limitation.\n\n\u2022 It determines maximum posterior values of the weight decay parameters, rather than integrating them out as done in (Buntine and Weigend, 1991).\n\nIt is difficult to justify these choices in general. The Gaussian approximation is believed to be good when there are at least 3 training examples per weight (MacKay, 1992). The use of MP weight decay parameters is the superior method when there are ill-defined parameters, as there usually are in neural networks, where some weights are typically poorly defined by the data (MacKay, 1993).\n\n3 BAYESIAN NEURAL NETWORKS\n\nThe training set D consists of N cases of the form (x, t). We model t as a function of x, $t = y(x) + \\nu$, where $\\nu$ is Gaussian noise and y(x) is computed by a neural network H with weights w. The noise scale is a free parameter $\\beta = 1/\\sigma_\\nu^2$. The probability of the data (the likelihood) is\n\n$$P(D \\mid w, \\beta, H) \\propto \\exp(-\\beta E_D) \\qquad (6)$$\n\n$$E_D \\equiv \\frac{1}{2} \\sum (y - t)^2 \\qquad (7)$$\n\nwhere the sum extends over the N cases. 
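Equations (6) and (7) can be written out as a small Python sketch. The function names are illustrative. The normalising constant in log_likelihood is the standard one for independent Gaussian noise with variance 1/beta, which the text leaves implicit behind the proportionality sign.

```python
import math

def data_error(predictions, targets):
    # Equation (7): half the summed squared residuals over the N cases.
    return 0.5 * sum((y - t) ** 2 for y, t in zip(predictions, targets))

def log_likelihood(predictions, targets, beta):
    # Log of equation (6) including the Gaussian normalisation,
    # assuming independent noise with variance 1 / beta.
    n = len(targets)
    normalisation = 0.5 * n * math.log(beta / (2.0 * math.pi))
    return normalisation - beta * data_error(predictions, targets)

# Hypothetical network outputs and targets, for illustration only.
example_predictions = [1.0, 2.0, 4.0]
example_targets = [1.0, 3.0, 2.0]
example_error = data_error(example_predictions, example_targets)
```

For a fixed noise scale beta, maximising this log likelihood over the weights is the same as minimising the data error.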
\n\nIn Bayesian modelling we must specify the prior distribution of the model parameters. The model contains k adjustable parameters w, called weights, which are in general split into several groups, for instance one per layer of the net. Here we consider the case with all weights in one group. The general case is described in (MacKay, 1992) and in more detail in (Thodberg, 1993). The prior of the weights w is\n\n$$p(w \\mid \\beta, \\alpha, H) \\propto \\exp(-\\beta \\alpha E_W) \\qquad (8)$$\n\n$$E_W \\equiv \\frac{1}{2} \\sum w^2 \\qquad (9)$$\n\n$\\beta$ and $\\alpha$ are called the scales of the model and are free parameters determined by the data.\n\nThe most probable values of the weights given the data, some values of the scales (to be determined later) and the model, are given by the maximum of\n\n$$P(w \\mid D, \\beta, \\alpha, H) = \\frac{P(D \\mid w, \\beta, \\alpha, H)\\,p(w \\mid \\beta, \\alpha, H)}{p(D \\mid \\beta, \\alpha, H)} \\qquad (10)$$\n\n$$\\propto \\exp(-\\beta C), \\qquad C = E_D + \\alpha E_W \\qquad (11)$$\n\nSo the maximum posterior weights according to the probabilistic interpretation are identical to the weights obtained by minimising the familiar cost function C with weight decay parameter $\\alpha$. This is the well-known Bayesian account for weight decay.\n\n4 MACKAY'S FORMULAE\n\nThe single most useful result of MacKay's analysis is a simple formula for the MP value of the weight decay parameter\n\n$$\\alpha_{MP} = \\frac{E_D}{E_W}\\,\\frac{\\gamma}{N - \\gamma} \\qquad (12)$$\n\nwhere $\\gamma$ is the number of well-determined parameters, which can be approximated by the actual number of parameters k, or computed more accurately from the eigenvalues $\\lambda_i$ of the Hessian $\\nabla\\nabla E_D$:\n\n$$\\gamma = \\sum_{i=1}^{k} \\frac{\\lambda_i}{\\lambda_i + \\alpha_{MP}} \\qquad (13)$$\n\nThe MP value of the noise scale is $\\beta_{MP} = N/(2C)$.\n\nThe evidence for a neural network H is, as in section 1, obtained by integration over the posterior distribution of the inferred parameters, which gives rise to the Ockham Factors:\n\n$$Ev(H) = \\log P(D \\mid H) = -\\frac{N - \\gamma}{2} - \\frac{N}{2} \\log \\frac{4 \\pi C}{N}$$\n$$\\qquad + \\log Ock(w) + \\log Ock(\\beta) + \\log Ock(\\alpha) \\qquad (14)$$\n\n$$\\log Ock(w) = \\frac{1}{2} \\sum_{i=1}^{k} \\log \\frac{\\alpha_{MP}}{\\alpha_{MP} + \\lambda_i} - \\frac{\\gamma}{2} + \\log h! + h \\log 2 \\qquad (15)$$\n\n$$Ock(\\beta) = \\frac{\\sqrt{4\\pi/(N - \\gamma)}}{\\log \\Omega}, \\qquad Ock(\\alpha) = \\frac{\\sqrt{4\\pi/\\gamma}}{\\log \\Omega} \\qquad (16)$$\n\nThe first line in (14) is the log likelihood. The Ockham Factor for the weights Ock(w) is small when the eigenvalues $\\lambda_i$ of the Hessian are large, corresponding to well-determined weights. $\\Omega$ is the prior range of the scales and is set (subjectively) to $10^3$.\n\nThe expression (15) (valid for a network with a single hidden layer) contains a symmetry factor $h!\\,2^h$. This is because the posterior volume must include all w configurations which are equivalent to the particular one. The hidden units can be permuted, giving a factor h! more posterior volume. And the sign of the weights to and from every hidden unit can be changed, giving $2^h$ times more posterior volume.\n\n5 COMMITTEES\n\nFor a given data set we usually train several networks with different numbers of hidden units and different initial weights. Several of these networks have evidence near or at the maximal value, but the networks differ in their predictions. The different solutions are interpreted as components of the posterior distribution and the correct Bayesian answer is obtained by averaging the predictions over the solutions, weighted by their posterior probabilities, i.e. their evidences. However, the evidence is not accurately determined, primarily due to the Gaussian approximation. This means that instead of weighting with Ev(H) we should use the weight $\\exp(\\log Ev / \\Delta(\\log Ev))$, where $\\Delta(\\log Ev)$ is the total uncertainty in the evaluation of log Ev. 
As an approximation to this, we define the committee as the models with evidence larger than $\\log Ev_{max} - \\Delta \\log Ev$, where $Ev_{max}$ is the largest evidence obtained, and all members enter with the same weight.\n\nTo compute the evidence $Ev(\\mathcal{C})$ of the committee, we assume for simplicity that all networks in the committee $\\mathcal{C}$ share the same architecture. Let $N_C$ be the number of truly different solutions in the committee. Of course, we count symmetric realisations only once. The posterior volume, i.e. the Ockham Factor for the weights, is now $N_C$ times larger. This renders the committee more probable; it has a larger evidence:\n\n$$\\log Ev(\\mathcal{C}) = \\log N_C + \\overline{\\log Ev(H)} \\qquad (17)$$\n\nwhere $\\overline{\\log Ev(H)}$ denotes the average log evidence of the members. Since the evidence is correlated with the generalisation error, we expect the committee to generalise better than the committee members.\n\n6 PRUNING\n\nWe now extend the Bayesian framework to networks which are pruned to adjust the architecture to the particular problem. This extends the fourth level of inference.\n\nAt first sight, the factor h! in the Ockham Factor for the weights in a sparsely connected network appears to be lost, since the network is (in general) not symmetric with respect to permutations of the hidden units. However, the symmetry reappears because for every sparsely connected network with tuned weights there are h! other equivalent network architectures obtained by permuting the hidden units. So the factor h! remains. If this argument is not found compelling, it can be viewed as an assumption.\n\nIf the data are used to select the architecture, which is the case in pruning designed to minimise the cost function, an additional Ockham Factor must be included. 
With one output unit, only the input-to-hidden layer is sparsely connected, so consider only these connections. Attach a binary pruning parameter to each of the m potential connections. A sparsely connected architecture is described by the values of the pruning parameters. The prior probability of a connection to be present is described by a hyperparameter $\\phi$ which is determined from the data, i.e. it is set to the fraction of connections remaining after pruning (notice the analogy between $\\phi$ and a weight decay parameter). A non-pruned connection gives an Ockham Factor $\\phi$ and a pruned one $1 - \\phi$, assuming the data to be certain about the architecture. The Ockham Factor for the pruning parameters is therefore\n\n$$\\log Ock(pruning) = m \\left( \\phi_{MP} \\log \\phi_{MP} + (1 - \\phi_{MP}) \\log(1 - \\phi_{MP}) \\right) \\qquad (18)$$\n\nThe tuning of the meta-parameter to the data gives an Ockham factor $Ock(\\phi) \\approx \\sqrt{2/m}$, which is rather negligible.\n\nFrom a minimum description length perspective (18) reflects the extra information needed to describe the topology of a pruned net relative to a fully connected net. It acts like a barrier towards pruning. Pruning is favoured only if the negative contribution log Ock(pruning) is compensated by an increase in for instance log Ock(w).\n\n7 APPLICATION TO SPECTROSCOPY\n\nBayesian Backprop is used in a real-life application from the meat industry. The data were recorded by a Tecator near-infrared spectrometer which measures the spectrum of light transmitted through samples of minced pork meat. The absorbance spectrum has 100 channels in the region 850-1050 nm. We want to calibrate the spectrometer to determine the fat content. The first 10 principal components of the spectra are used as input to a neural network. 
\n\nThree weight decay parameters are used: one for the weights and biases of the hidden layer, one for the connections from the hidden to the output layer, and one for the direct connections from the inputs to the output as well as the output bias.\n\nThe relation between test error and log evidence is shown in figure 1. The test error is given as standard error of prediction (SEP), i.e. the root mean square error. The 12 networks with 3 hidden units and evidence larger than -270 are selected for a committee. The committee average gives 6% lower SEP than the members do on average, and 21% lower SEP than a non-Bayesian analysis using early stopping (see Thodberg, 1993).\n\n[Figure 1 plot: test error against log evidence (horizontal axis from -320 to -260) for networks with 1, 2, 3, 4, 6 and 8 hidden units.]\n\nFigure 1: The test error as a function of the log evidence for networks trained on the spectroscopic data. High evidence implies low test error.\n\nPruning is applied to the networks with 6 hidden units. The evidence decreases slightly, i.e. Ock(pruning) dominates. Also the SEP is slightly worse. So the evidence correctly suggests that pruning is not useful for this problem. 
[1]\n\n[1] For artificial data generated by a sparsely connected network the evidence correctly points to pruned nets as better models (see Thodberg, 1993).\n\nThe Bayesian error bars are illustrated for the spectroscopic data in figure 2. We study the model predictions on the line through input space defined by the second principal component axis, i.e. the second input is varied while all other inputs are zero. The total prediction variance for a new datum x is\n\n$$\\sigma_{total}^2(x) = \\sigma_\\nu^2 + \\sigma_{wu}^2(x) + \\sigma_{cu}^2(x) \\qquad (19)$$\n\nwhere $\\sigma_{wu}$ comes from the weight uncertainties (level 2) and $\\sigma_{cu}$ from the committee dissent (level 4).\n\n[Figure 2 plot: fat content against the second principal component (horizontal axis from -4 to 4), with curves labelled Committee Prediction, Total Uncertainty, 10 x Weight Uncertainty, 10 x Committee Uncertainty and 10 x Random Noise.]\n\nFigure 2: Prediction of the fat content as a function of the second principal component $p_2$ of the NIR spectrum. 95% of the training data have $|p_2| < 2$. The total error bars are indicated by a \"1 sigma\" band with the dotted lines. The total standard errors $\\sigma_{total}(x)$ and the standard errors of its contributions ($\\sigma_\\nu$, $\\sigma_{wu}(x)$ and $\\sigma_{cu}(x)$) are shown separately, multiplied by a factor of 10.\n\nReferences\n\nW.L. Buntine and A.S. Weigend, \"Bayesian Back-Propagation\", Complex Systems 5 (1991) 603-643.\n\nR.M. Neal, \"Bayesian Learning via Stochastic Dynamics\", Neural Information Processing Systems, Vol. 5, ed. C.L. Giles, S.J. Hanson and J.D. Cowan (Morgan Kaufmann, San Mateo, 1993).\n\nD.J.C. MacKay, \"A Practical Bayesian Framework for Backpropagation Networks\", Neural Computation 4 (1992) 448-472.\n\nD.J.C. MacKay, paper on Bayesian hyperparameters, in preparation 1993.\n\nH.H. Thodberg, \"A Review of Bayesian Backprop with an Application to Near Infrared Spectroscopy\" and \"A Bayesian Approach to Pruning of Neural Networks\", submitted to IEEE Transactions on Neural Networks 1993 (in /pub/neuroprose/thodberg.ace-of-bayes*.ps.Z on archive.cis.ohio-state.edu).", "award": [], "sourceid": 720, "authors": [{"given_name": "Hans", "family_name": "Thodberg", "institution": null}]}