{"title": "Bayesian Model Comparison by Monte Carlo Chaining", "book": "Advances in Neural Information Processing Systems", "page_first": 333, "page_last": 339, "abstract": null, "full_text": "Bayesian Model Comparison by Monte Carlo Chaining\n\nDavid Barber\nD.Barber@aston.ac.uk\n\nChristopher M. Bishop\nC.M.Bishop@aston.ac.uk\n\nNeural Computing Research Group\nAston University, Birmingham, B4 7ET, U.K.\nhttp://www.ncrg.aston.ac.uk/\n\nAbstract\n\nThe techniques of Bayesian inference have been applied with great success to many problems in neural computing, including evaluation of regression functions, determination of error bars on predictions, and the treatment of hyper-parameters. However, the problem of model comparison is a much more challenging one for which current techniques have significant limitations. In this paper we show how an extended form of Markov chain Monte Carlo, called chaining, is able to provide effective estimates of the relative probabilities of different models. We present results from the robot arm problem and compare them with the corresponding results obtained using the standard Gaussian approximation framework.\n\n1 Bayesian Model Comparison\n\nIn a Bayesian treatment of statistical inference, our state of knowledge of the values of the parameters w in a model M is described in terms of a probability distribution function. Initially this is chosen to be some prior distribution $p(w \\mid M)$, which can be combined with a likelihood function $p(D \\mid w, M)$ using Bayes' theorem to give a posterior distribution $p(w \\mid D, M)$ in the form\n\n$$p(w \\mid D, M) = \\frac{p(D \\mid w, M)\\, p(w \\mid M)}{p(D \\mid M)} \\tag{1}$$\n\nwhere D is the data set. Predictions of the model are obtained by performing integrations weighted by the posterior distribution.
\n\nThe comparison of different models $M_i$ is based on their relative probabilities, which can be expressed, again using Bayes' theorem, in terms of prior probabilities $P(M_i)$ to give\n\n$$\\frac{P(M_i \\mid D)}{P(M_j \\mid D)} = \\frac{p(D \\mid M_i)\\, P(M_i)}{p(D \\mid M_j)\\, P(M_j)} \\tag{2}$$\n\nand so requires that we be able to evaluate the model evidence $p(D \\mid M_i)$, which corresponds to the denominator in (1). The relative probabilities of different models can be used to select the single most probable model, or to form a committee of models, weighted by their probabilities.\n\nIt is convenient to write the numerator of (1) in the form $\\exp\\{-E(w)\\}$, where E(w) is an error function. Normalization of the posterior distribution then requires that\n\n$$p(D \\mid M) = \\int \\exp\\{-E(w)\\}\\, dw. \\tag{3}$$\n\nGenerally, it is straightforward to evaluate E(w) for a given value of w, although it is extremely difficult to evaluate the corresponding model evidence using (3), since the posterior distribution is typically very small except in narrow regions of the high-dimensional parameter space, which are unknown a priori. Standard numerical integration techniques are therefore inapplicable.\n\nOne approach is based on a local Gaussian approximation around a mode of the posterior (MacKay, 1992). Unfortunately, this approximation is expected to be accurate only when the number of data points is large in relation to the number of parameters in the model. In fact it is for relatively complex models, or problems for which data is scarce, that Bayesian methods have the most to offer. Indeed, Neal (1996) has argued that, from a Bayesian perspective, there is no reason to limit the number of parameters in a model, other than for computational reasons. 
We therefore consider an approach to the evaluation of model evidence which overcomes the limitations of the Gaussian framework. For additional techniques and references to Bayesian model comparison, see Gilks et al. (1995) and Kass and Raftery (1995).\n\n2 Chaining\n\nSuppose we have a simple model $M_0$ for which we can evaluate the evidence analytically, and for which we can easily generate a sample $w^l$ (where $l = 1, \\ldots, L$) from the corresponding distribution $p(w \\mid D, M_0)$. Then the evidence for some other model M can be expressed in the form\n\n$$\\frac{p(D \\mid M)}{p(D \\mid M_0)} = \\int \\exp\\{-E(w) + E_0(w)\\}\\, p(w \\mid D, M_0)\\, dw \\simeq \\frac{1}{L} \\sum_{l=1}^{L} \\exp\\{-E(w^l) + E_0(w^l)\\}. \\tag{4}$$\n\nUnfortunately, the Monte Carlo approximation in (4) will be poor if the two error functions are significantly different, since the exponential is dominated by regions where E is relatively small, for which there will be few samples unless $E_0$ is also small in those regions. A simple Monte Carlo approach will therefore yield poor results. This problem is equivalent to the evaluation of free energies in statistical physics, which is known to be a challenging problem, and where a number of approaches have been developed (Neal, 1993).\n\nHere we discuss one such approach to this problem based on a chain of K successive models $M_i$ which interpolate between $M_0$ and M, so that the required evidence can be written as\n\n$$p(D \\mid M) = p(D \\mid M_0)\\, \\frac{p(D \\mid M_1)}{p(D \\mid M_0)}\\, \\frac{p(D \\mid M_2)}{p(D \\mid M_1)} \\cdots \\frac{p(D \\mid M)}{p(D \\mid M_K)}. \\tag{5}$$\n\nEach of the ratios in (5) can be evaluated using (4). 
The goal is to devise a chain of models such that each successive pair of models has probability distributions which are reasonably close, so that each of the ratios in (5) can be evaluated accurately, while keeping the total number of links in the chain fairly small to limit the computational costs.\n\nWe have chosen the technique of hybrid Monte Carlo (Duane et al., 1987; Neal, 1993) to sample from the various distributions, since this has been shown to be effective for sampling from the complex distributions arising with neural network models (Neal, 1996). This involves augmenting the parameters w with a set of fictitious 'momentum' variables and introducing Hamiltonian equations of motion, which are then integrated using the leapfrog method. At the end of each trajectory the new parameter vector is accepted with a probability governed by the Metropolis criterion, and the momenta are replaced using Gibbs sampling. As a check on our software implementation of chaining, we have evaluated the evidence for a mixture of two non-isotropic Gaussian distributions, and obtained a result which was within 10% of the analytical solution.\n\n3 Application to Neural Networks\n\nWe now consider the application of the chaining method to regression problems involving neural network models. The network corresponds to a function y(x, w), and the data set consists of N pairs of input vectors $x_n$ and corresponding targets $t_n$, where $n = 1, \\ldots, N$. Assuming Gaussian noise on the target data, the likelihood function takes the form\n\n$$p(D \\mid w, M) = \\left(\\frac{\\beta}{2\\pi}\\right)^{N/2} \\exp\\left\\{-\\frac{\\beta}{2} \\sum_{n=1}^{N} \\|y(x_n; w) - t_n\\|^2\\right\\} \\tag{6}$$\n\nwhere $\\beta$ is a hyper-parameter representing the inverse of the noise variance. We consider networks with a single hidden layer of 'tanh' units, and linear output units. 
Following Neal (1996) we use a diagonal Gaussian prior in which the weights are divided into groups $w_k$, where $k = 1, \\ldots, 4$, corresponding to input-to-hidden weights, hidden-unit biases, hidden-to-output weights, and output biases. Each group is governed by a separate 'precision' hyper-parameter $\\alpha_k$, so that the prior takes the form\n\n$$p(w \\mid \\{\\alpha_k\\}) = \\frac{1}{Z_w} \\exp\\left\\{-\\frac{1}{2} \\sum_k \\alpha_k w_k^T w_k\\right\\} \\tag{7}$$\n\nwhere $Z_w$ is the normalization coefficient. The hyper-parameters $\\{\\alpha_k\\}$ and $\\beta$ are themselves each governed by hyper-priors given by Gamma distributions of the form\n\n$$p(\\alpha) \\propto \\alpha^{s/2 - 1} \\exp(-\\alpha s / 2\\omega) \\tag{8}$$\n\nin which the mean $\\omega$ and variance $2\\omega^2/s$ are chosen to give very broad hyper-priors, reflecting our limited prior knowledge of the values of the hyper-parameters. We use the hybrid Monte Carlo algorithm to sample from the joint distribution of parameters and hyper-parameters. For the evaluation of evidence ratios, however, we consider only the parameter samples, and perform the integrals over hyper-parameters analytically, using the fact that the gamma distribution is conjugate to the Gaussian.\n\nIn order to apply chaining to this problem, we choose the prior as our reference distribution, and then define a set of intermediate distributions based on a parameter $\\lambda$ which governs the effective contribution from the data term, so that\n\n$$E(\\lambda, w) = \\lambda \\phi(w) + E_0(w) \\tag{9}$$\n\nwhere $\\phi(w)$ arises from the likelihood term (6) while $E_0(w)$ corresponds to the prior (7). We select a set of 18 values of $\\lambda$ which interpolate between the reference distribution ($\\lambda = 0$) and the desired model distribution ($\\lambda = 1$). The evidence for the prior alone is easily evaluated analytically.
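\n\nTo make the link between (4), (5) and (9) concrete, here is a short worked step; it is a sketch that takes (9) at face value and sets aside the analytic integration over hyper-parameters. The labels $\\lambda_{i-1} < \\lambda_i$ for neighbouring values and $w^l$ for the L samples drawn at $\\lambda_{i-1}$ are notation assumed for this illustration. Substituting $E(\\lambda_i, w)$ for E and $E(\\lambda_{i-1}, w)$ for $E_0$ in (4), the prior term cancels and each ratio in the chain depends only on the data term:\n\n$$\\frac{p(D \\mid M_i)}{p(D \\mid M_{i-1})} \\simeq \\frac{1}{L} \\sum_{l=1}^{L} \\exp\\{-(\\lambda_i - \\lambda_{i-1})\\, \\phi(w^l)\\}$$\n\nClosely spaced values of $\\lambda$ keep the exponent small, which is why a finer chain gives better-behaved averages; the log evidence is then $\\ln p(D \\mid M_0)$ plus the sum of the logarithms of these ratios.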
\n\n4 Gaussian Approximation\n\nAs a comparison against the method of chaining, we consider the framework of MacKay (1992) based on a local Gaussian approximation to the posterior distribution. This approach makes use of the evidence approximation in which the integration over hyper-parameters is approximated by setting them to specific values which are themselves determined by maximizing their evidence functions.\n\nThis leads to a hierarchical treatment as follows. At the lowest level, the maximum w of the posterior distribution over weights is found for fixed values of the hyper-parameters by minimizing the error function. Periodically the hyper-parameters are re-estimated by evidence maximization, where the evidence is obtained analytically using the Gaussian approximation. This gives the following re-estimation formulae\n\n$$\\alpha_k = \\frac{\\gamma_k}{\\|w_k\\|^2}, \\qquad \\frac{1}{\\beta} = \\frac{1}{N - \\gamma} \\sum_{n=1}^{N} \\|y(x_n; w) - t_n\\|^2 \\tag{10}$$\n\nwhere $\\gamma_k = W_k - \\alpha_k \\mathrm{Tr}_k(A^{-1})$, $W_k$ is the total number of parameters in group k, $A = \\nabla\\nabla E(w)$, $\\gamma = \\sum_k \\gamma_k$, and $\\mathrm{Tr}_k(\\cdot)$ denotes the trace over the kth group of parameters. The weights are updated in an inner loop by minimizing the error function using a conjugate gradient optimizer, while the hyper-parameters are periodically re-estimated using (10) [1].\n\n[1] Note that we are assuming that the hyper-priors (8) are sufficiently broad that they have no effect on the location of the evidence maximum and can therefore be neglected.\n\nOnce training is complete, the model evidence is evaluated by making a Gaussian approximation around the converged values of the hyper-parameters, and integrating over this distribution analytically. This gives the model log evidence as\n\n$$\\ln p(D \\mid M) = -E(w) - \\frac{1}{2} \\ln |A| + \\frac{1}{2} \\sum_k W_k \\ln \\alpha_k + \\frac{N}{2} \\ln \\beta + \\ln h! + 2 \\ln h + \\frac{1}{2} \\sum_k \\ln\\left(\\frac{2}{\\gamma_k}\\right) + \\frac{1}{2} \\ln\\left(\\frac{2}{N - \\gamma}\\right). \\tag{11}$$
\n\nHere h is the number of hidden units, and the terms $\\ln h! + 2 \\ln h$ take account of the many equivalent modes of the posterior distribution arising from sign-flip and hidden unit interchange symmetries in the network model. A derivation of these results can be found in Bishop (1995; pages 434-436).\n\nThe result (11) corresponds to a single mode of the distribution. If we initialize the weight optimization algorithm with different random values we can find distinct solutions. In order to compute an overall evidence for the particular network model with a given number of hidden units, we make the assumption that we have found all of the distinct modes of the posterior distribution precisely once each, and then sum the evidences to arrive at the total model evidence. This neglects the possibility that some of the solutions found are related by symmetry transformations (and therefore already taken into account) or that we have missed important modes. While some attempt could be made to detect degenerate solutions, it will be difficult to do much better than the above within the framework of the Gaussian approximation.\n\n5 Results: Robot Arm Problem\n\nAs an illustration of the evaluation of model evidence for a larger-scale problem we consider the modelling of the forward kinematics for a two-link robot arm in a two-dimensional space, as introduced by MacKay (1992). 
This problem was chosen as MacKay reports good results in using the Gaussian approximation framework to evaluate the evidences, and so it provides a good opportunity for comparison with the chaining approach. The task is to learn the mapping $(x_1, x_2) \\to (y_1, y_2)$ given by\n\n$$y_1 = 2.0 \\cos(x_1) + 1.3 \\cos(x_1 + x_2), \\qquad y_2 = 2.0 \\sin(x_1) + 1.3 \\sin(x_1 + x_2)$$\n\nwhere the data set consists of 200 input-output pairs with outputs corrupted by zero mean Gaussian noise with standard deviation $\\sigma = 0.05$. We have used the original training data of MacKay, but generated our own test set of 1000 points using the same prescription. The evidence is evaluated using both chaining and the Gaussian approximation, for networks with various numbers of hidden units.\n\nIn the chaining method, the particular forms of the gamma priors for the precision variables are as follows: for the input-to-hidden weights and hidden-unit biases, $\\omega = 1$, $s = 0.2$; for the hidden-to-output weights, $\\omega = h$, $s = 0.2$; for the output biases, $\\omega = 0.2$, $s = 1$. The noise level hyper-parameters were $\\omega = 400$, $s = 0.2$. These settings follow closely those used by Neal (1996) for the same problem. The hidden-to-output precision scaling was chosen by Neal such that the limit of an infinite number of hidden units is well defined and corresponds to a Gaussian process prior. For each evidence ratio in the chain, the first 100 samples from the hybrid Monte Carlo run, obtained with a trajectory length of 50 leapfrog iterations, are omitted to give the algorithm a chance to reach the equilibrium distribution. The next 600 samples are obtained using a trajectory length of 300 and are used to evaluate the evidence ratio.
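\n\nAs a rough worked count of the sampling effort these settings imply (a sketch, assuming one run per evidence ratio for the 18 ratios in the chain, each with exactly the settings above), the number of leapfrog iterations in one complete chain is\n\n$$18 \\times (100 \\times 50 + 600 \\times 300) = 18 \\times 185000 = 3330000$$\n\nof which $18 \\times 100 \\times 50 = 90000$ belong to the discarded initial samples. Each leapfrog iteration needs a gradient of the error function, so the cost is dominated by the long trajectories used for the retained samples.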
\n\nIn Figure 1 (a) we show the error values of the sampling stage for 24 hidden units, where we see that the errors are largely uncorrelated, as required for effective Monte Carlo sampling. In Figure 1 (b), we plot the values of $\\ln\\{p(D \\mid M_i)/p(D \\mid M_{i-1})\\}$ against $\\lambda_i$, $i = 1, \\ldots, 18$. Note that there is a large change in the evidence ratios at the beginning of the chain, where we sample close to the reference distribution. For this reason, we choose the $\\lambda_i$ to be dense close to $\\lambda = 0$. We are currently researching more principled approaches to selecting the partitioning. Figure 2 (a) shows the log model evidence against the number of hidden units. Note that the chaining approach is computationally expensive: for h = 24, a complete chain takes 48 hours in a Matlab implementation running on a Silicon Graphics Challenge L.\n\nFigure 1: (a) Error $E(\\lambda = 0.6, w)$ for h = 24, plotted for 600 successive Monte Carlo samples. (b) Values of the ratio $\\ln\\{p(D \\mid M_i)/p(D \\mid M_{i-1})\\}$ for $i = 1, \\ldots, 18$ for h = 24.\n\nWe see that there is no decline in the evidence as the number of hidden units grows. Correspondingly, in Figure 2 (b), we see that the test error performance does not degrade as the number of hidden units increases. This indicates that there is no over-fitting with increasing model complexity, in accordance with Bayesian expectations.\n\nThe corresponding results from the Gaussian approximation approach are shown in Figure 3. 
We see that there is a characteristic 'Occam hill' whereby the evidence shows a peak at around h = 12, with a strong decrease for smaller values of h and a slower decrease for larger values. The corresponding test set errors similarly show a minimum at around h = 12, indicating that the Gaussian approximation is becoming increasingly inaccurate for more complex models.\n\nFigure 2: (a) Plot of $\\ln p(D \\mid M)$ for different numbers of hidden units. (b) Test error against the number of hidden units. Here the theoretical minimum value is 1.0. For h = 64 the test error is 1.11.\n\nFigure 3: (a) Plot of the model evidence for the robot arm problem versus the number of hidden units, using the Gaussian approximation framework. This clearly shows the characteristic 'Occam hill' shape. Note that the evidence is computed up to an additive constant, and so the origin of the vertical axis has no significance. (b) Corresponding plot of the test set error versus the number of hidden units. Individual points correspond to particular modes of the posterior weight distribution, while the line shows the mean test set error for each value of h.\n\n6 Discussion\n\nWe have seen that the use of chaining allows the effective evaluation of model evidences for neural networks using Monte Carlo techniques. In particular, we find that there is no peak in the model evidence, or the corresponding test set error, as the number of hidden units is increased, and so there is no indication of over-fitting. This is in accord with the expectation that model complexity should not be limited by the size of the data set, and is in marked contrast to the conventional maximum likelihood viewpoint. It is also consistent with the result that, in the limit of an infinite number of hidden units, the prior over network weights leads to a well-defined Gaussian prior over functions (Williams, 1997).\n\nAn important advantage of being able to make accurate evaluations of the model evidence is the ability to compare quite distinct kinds of model, for example radial basis function networks and multi-layer perceptrons. This can be done either by chaining both models back to a common reference model, or by evaluating normalized model evidences explicitly.
\n\nAcknowledgements\n\nWe would like to thank Chris Williams and Alastair Bruce for a number of useful discussions. This work was supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for Neural Networks.\n\nReferences\n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.\nDuane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195 (2), 216-222.\nGilks, W. R., S. Richardson, and D. J. Spiegelhalter (1995). Markov Chain Monte Carlo in Practice. Chapman and Hall.\nKass, R. E. and A. E. Raftery (1995). Bayes factors. J. Am. Statist. Ass. 90, 773-795.\nMacKay, D. J. C. (1992). 
A practical Bayesian framework for back-propagation networks. Neural Computation 4 (3), 448-472.\nNeal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, Canada.\nNeal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture Notes in Statistics 118.\nWilliams, C. K. I. (1997). Computing with infinite networks. This volume.\n", "award": [], "sourceid": 1272, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Christopher", "family_name": "Bishop", "institution": null}]}