{"title": "Ensemble Learning for Multi-Layer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 401, "abstract": "", "full_text": "Ensemble Learning for Multi-Layer Networks\n\nDavid Barber*\n\nChristopher M. Bishop\u2020\n\nNeural Computing Research Group\n\nDepartment of Applied Mathematics and Computer Science\n\nAston University, Birmingham B4 7ET, U.K.\n\nhttp://www.ncrg.aston.ac.uk/\n\nAbstract\n\nBayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.\n\n1 Introduction\n\nBayesian techniques have been successfully applied to neural networks in the context of both regression and classification problems (MacKay 1992; Neal 1996). 
In contrast to the maximum likelihood approach which finds only a single estimate for the regression parameters, the Bayesian approach yields a distribution of weight parameters, $p(w|D)$, conditional on the training data $D$, and predictions are expressed in terms of expectations with respect to the posterior distribution (Bishop 1995). However, the corresponding integrals over weight space are analytically intractable. One well-established procedure for approximating these integrals, known as Laplace's method, is to approximate the posterior distribution by a Gaussian, centred at a mode of $p(w|D)$, in which the covariance of the Gaussian is determined by the local curvature of the posterior distribution (MacKay 1995). The required integrations can then be performed analytically. More recent approaches involve Markov chain Monte Carlo simulations to generate samples from the posterior (Neal 1996). However, such techniques can be computationally expensive, and they also suffer from the lack of a suitable convergence criterion.\n\n*Present address: SNN, University of Nijmegen, Geert Grooteplein 21, Nijmegen, The Netherlands. http://www.mbfys.kun.nl/snn/ email: davidb@mbfys.kun.nl\n\n\u2020Present address: Microsoft Research Limited, St George House, Cambridge CB2 3NH, UK. http://www.research.microsoft.com email: cmbishop@microsoft.com\n\nA third approach, called ensemble learning, was introduced by Hinton and van Camp (1993) and again involves finding a simple, analytically tractable, approximation to the true posterior distribution. Unlike Laplace's method, however, the approximating distribution is fitted globally, rather than locally, by minimizing a Kullback-Leibler divergence. 
Hinton and van Camp (1993) showed that, in the case of a Gaussian approximating distribution with a diagonal covariance, a deterministic learning algorithm could be derived. Although the approximating distribution is no longer constrained to coincide with a mode of the posterior, the assumption of a diagonal covariance prevents the model from capturing the (often very strong) posterior correlations between the parameters. MacKay (1995) suggested a modification to the algorithm by including linear preprocessing of the inputs to achieve a somewhat richer class of approximating distributions, although this was not implemented. In this paper we show that the ensemble learning approach can be extended to allow a Gaussian approximating distribution with a general covariance matrix, while still leading to a tractable algorithm.\n\n1.1 The Network Model\n\nWe consider a two-layer feed-forward network having a single output whose value is given by\n\n$$f(x, w) = \\sum_{i=1}^{H} v_i \\, \\sigma(u_i^T x) \\tag{1}$$\n\nwhere $w$ is a $k$-dimensional vector representing all of the adaptive parameters in the model, $x$ is the input vector, $\\{u_i\\}$, $i = 1, \\ldots, H$ are the input-to-hidden weights, and $\\{v_i\\}$, $i = 1, \\ldots, H$ are the hidden-to-output weights. The extension to multiple outputs is straightforward. For reasons of analytic tractability, we choose the sigmoidal hidden-unit activation function $\\sigma(a)$ to be given by the error function\n\n$$\\sigma(a) = \\sqrt{\\frac{2}{\\pi}} \\int_0^a \\exp(-s^2/2) \\, ds \\tag{2}$$\n\nwhich (when appropriately scaled) is quantitatively very similar to the standard logistic sigmoid. Hidden unit biases are accounted for by appending the input vector with a node that is always unity. 
In the current implementation there are no output biases (and the output data is shifted to give zero mean), although the formalism is easily extended to include adaptive output biases (Barber and Bishop 1997). The data set consists of $N$ pairs of input vectors and corresponding target output values $D = \\{x^\\mu, t^\\mu\\}$, $\\mu = 1, \\ldots, N$. We make the standard assumption of Gaussian noise on the target values, with variance $\\beta^{-1}$. The likelihood of the training data is then proportional to $\\exp(-\\beta E_D)$, where the training error $E_D$ is\n\n$$E_D(w) = \\frac{1}{2} \\sum_{\\mu=1}^{N} \\left( f(x^\\mu, w) - t^\\mu \\right)^2. \\tag{3}$$\n\nThe prior distribution over weights is chosen to be a Gaussian of the form\n\n$$p(w) \\propto \\exp(-E_W(w)) \\tag{4}$$\n\nwhere $E_W(w) = \\frac{1}{2} w^T A w$, and $A$ is a matrix of hyperparameters. The treatment of $\\beta$ and $A$ is dealt with in Section 2.1. From Bayes' theorem, the posterior distribution over weights can then be written\n\n$$p(w|D) = \\frac{1}{Z} \\exp(-\\beta E_D(w) - E_W(w)) \\tag{5}$$\n\nwhere $Z$ is a normalizing constant. Network predictions on a novel example are given by the posterior average of the network output\n\n$$\\langle f(x) \\rangle = \\int f(x, w) \\, p(w|D) \\, dw. \\tag{6}$$\n\nThis represents an integration over a high-dimensional space, weighted by a posterior distribution $p(w|D)$ which is exponentially small except in narrow regions whose locations are unknown a priori. The accurate evaluation of such integrals is thus very difficult.\n\n2 Ensemble Learning\n\nIntegrals of the form (6) may be tackled by approximating $p(w|D)$ by a simpler distribution $Q(w)$. In this paper we choose this approximating distribution to be a Gaussian with mean $\\bar{w}$ and covariance $C$. 
We determine the values of $\\bar{w}$ and $C$ by minimizing the Kullback-Leibler divergence between the network posterior and approximating Gaussian, given by\n\n$$F[Q] = \\int Q(w) \\ln \\left\\{ \\frac{Q(w)}{p(w|D)} \\right\\} dw \\tag{7}$$\n\n$$= \\int Q(w) \\ln Q(w) \\, dw - \\int Q(w) \\ln p(w|D) \\, dw. \\tag{8}$$\n\nThe first term in (8) is the negative entropy of a Gaussian distribution, and is easily evaluated to give $-\\frac{1}{2} \\ln \\det(C) + \\mathrm{const}$.\n\nFrom (5) we see that the posterior dependent term in (8) contains two parts that depend on the prior and likelihood\n\n$$\\int Q(w) E_W(w) \\, dw + \\beta \\int Q(w) E_D(w) \\, dw. \\tag{9}$$\n\nNote that the normalization coefficient $Z^{-1}$ in (5) gives rise to a constant additive term in the KL divergence and so can be neglected. The prior term $E_W(w)$ is quadratic in $w$, and integrates to give $\\frac{1}{2} \\mathrm{Tr}(CA) + \\frac{1}{2} \\bar{w}^T A \\bar{w}$. This leaves the data dependent term in (9) which we write as\n\n$$L = \\beta \\int Q(w) E_D(w) \\, dw = \\frac{\\beta}{2} \\sum_{\\mu=1}^{N} l(x^\\mu, t^\\mu) \\tag{10}$$\n\nwhere\n\n$$l(x, t) = \\int Q(w) (f(x, w))^2 \\, dw - 2t \\int Q(w) f(x, w) \\, dw + t^2. \\tag{11}$$\n\nFor clarity, we concentrate only on the first term in (11), as the calculation of the term linear in $f(x, w)$ is similar, though simpler. Writing the Gaussian integral over $Q$ as an average, $\\langle \\cdot \\rangle$, the first term of (11) becomes\n\n$$\\langle (f(x, w))^2 \\rangle = \\sum_{i,j=1}^{H} \\langle v_i v_j \\, \\sigma(u_i^T x) \\, \\sigma(u_j^T x) \\rangle. \\tag{12}$$\n\nTo simplify the notation, we denote the set of input-to-hidden weights $(u_1, \\ldots, u_H)$ by $u$ and the set of hidden-to-output weights $(v_1, \\ldots, v_H)$ by $v$. Similarly, we partition the covariance matrix $C$ into blocks $C_{uu}$, $C_{vu}$, $C_{vv}$, with $C_{vu} = C_{uv}^T$. 
As the components of $v$ do not enter the non-linear sigmoid functions, we can directly integrate over $v$, so that each term in the summation (12) gives\n\n$$\\left\\langle \\left( \\Omega_{ij} + (u - \\bar{u})^T \\Psi_{ij} (u - \\bar{u}) + \\boldsymbol{\\Omega}_{ij}^T (u - \\bar{u}) \\right) \\sigma(u^T x_i) \\, \\sigma(u^T x_j) \\right\\rangle \\tag{13}$$\n\nwhere\n\n$$\\Omega_{ij} = \\left( C_{vv} - C_{vu} C_{uu}^{-1} C_{uv} \\right)_{ij} + \\bar{v}_i \\bar{v}_j \\tag{14}$$\n\n$$\\Psi_{ij} = C_{uu}^{-1} C_{u,v=i} C_{v=j,u} C_{uu}^{-1} \\tag{15}$$\n\n$$\\boldsymbol{\\Omega}_{ij} = 2 C_{uu}^{-1} C_{u,v=j} \\bar{v}_i. \\tag{16}$$\n\nAlthough the remaining integration in (13) over $u$ is not analytically tractable, we can make use of the following result to reduce it to a one-dimensional integration\n\n$$\\left\\langle \\sigma(z \\cdot a + a_0) \\, \\sigma(z \\cdot b + b_0) \\right\\rangle_z = \\left\\langle \\sigma(z|a| + a_0) \\, \\sigma\\left( \\frac{z \\, a^T b + b_0 |a|}{\\sqrt{|a|^2 (1 + |b|^2) - (a^T b)^2}} \\right) \\right\\rangle_z \\tag{17}$$\n\nwhere $a$ and $b$ are vectors and $a_0$, $b_0$ are scalar offsets. The average on the left of (17) is over an isotropic multi-dimensional Gaussian, $p(z) \\propto \\exp(-z^T z/2)$, while the average on the right is over the one-dimensional Gaussian $p(z) \\propto \\exp(-z^2/2)$. This result follows from the fact that the vector $z$ only occurs through the scalar product with $a$ and $b$, and so we can choose a coordinate system in which the first two components of $z$ lie in the plane spanned by $a$ and $b$. All orthogonal components do not appear elsewhere in the integrand, and therefore integrate to unity.\n\nThe integral we desire, (13), is only a little more complicated than (17) and can be evaluated by first transforming the coordinate system to an isotropic basis $z$, and then differentiating with respect to elements of the covariance matrix to 'pull down' the required linear and quadratic terms in the $u$-independent pre-factor of (13). These derivatives can then be reduced to a form which requires only the numerical evaluation of (17). 
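As a sanity check on the reduction in (17), the following sketch compares both sides numerically for two-dimensional vectors a and b. It is an illustrative check rather than the authors' implementation: the midpoint-rule integration (used here in place of Gaussian quadrature), the function names, and the example values of a, b, a0 and b0 are all assumptions made for this example.

```python
import math


def sigma(a):
    # Activation of eq. (2): sqrt(2/pi) * int_0^a exp(-s^2/2) ds = erf(a / sqrt(2)).
    return math.erf(a / math.sqrt(2.0))


def gauss_nodes(n=200, lim=8.0):
    # Midpoint-rule nodes and weights approximating an average over N(0, 1).
    h = 2.0 * lim / n
    xs = [-lim + (i + 0.5) * h for i in range(n)]
    ws = [h * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) for x in xs]
    return xs, ws


def lhs_average(a, b, a0, b0):
    # Left side of (17): average over an isotropic 2-D Gaussian in z.
    xs, ws = gauss_nodes()
    total = 0.0
    for z1, w1 in zip(xs, ws):
        for z2, w2 in zip(xs, ws):
            za = z1 * a[0] + z2 * a[1]
            zb = z1 * b[0] + z2 * b[1]
            total += w1 * w2 * sigma(za + a0) * sigma(zb + b0)
    return total


def rhs_average(a, b, a0, b0):
    # Right side of (17): a single one-dimensional Gaussian average.
    norm_a = math.hypot(a[0], a[1])
    ab = a[0] * b[0] + a[1] * b[1]
    bb = b[0] * b[0] + b[1] * b[1]
    denom = math.sqrt(norm_a * norm_a * (1.0 + bb) - ab * ab)
    xs, ws = gauss_nodes()
    return sum(w * sigma(z * norm_a + a0) * sigma((z * ab + b0 * norm_a) / denom)
               for z, w in zip(xs, ws))


# Hypothetical example values, chosen only for illustration.
a, b, a0, b0 = (1.0, 0.5), (-0.3, 0.8), 0.2, -0.4
lhs = lhs_average(a, b, a0, b0)
rhs = rhs_average(a, b, a0, b0)
print(lhs, rhs)
```

The two printed values should agree to within the integration error, which illustrates why only a one-dimensional numerical integration is needed once the component orthogonal to a has been integrated out analytically.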
We have therefore succeeded in reducing the calculation of the KL divergence to analytic terms together with a single one-dimensional numerical integration of the form (17), which we compute using Gaussian quadrature\u00b9.\n\n\u00b9 Although (17) appears to depend on 4 parameters, it can be expressed in terms of 3 independent parameters. An alternative to performing quadrature during training would therefore be to compute a 3-dimensional look-up table in advance.\n\nSimilar techniques can be used to evaluate the derivatives of the KL divergence with respect to the mean and covariance matrix (Barber and Bishop 1997). Together with the KL divergence, these derivatives are then used in a scaled conjugate gradient optimizer to find the parameters $\\bar{w}$ and $C$ that represent the best Gaussian fit.\n\n[Figure 1 panels: Posterior; Laplace fit; Minimum KLD fit; Minimum KL fit]\n\nFigure 1: Laplace and minimum Kullback-Leibler Gaussian fits to the posterior. The Laplace method underestimates the local posterior mass by basing the covariance matrix on the mode alone, and has KL value 41. The minimum Kullback-Leibler Gaussian fit with a diagonal covariance matrix (KLD) gives a KL value of 4.6, while the minimum Kullback-Leibler Gaussian with full covariance matrix achieves a value of 3.9.\n\nThe number of parameters in the covariance matrix scales quadratically with the number of weight parameters. We therefore have also implemented a version with a constrained covariance matrix\n\n$$C = \\mathrm{diag}(d_1^2, \\ldots, d_k^2) + \\sum_{i=1}^{s} s_i s_i^T \\tag{18}$$\n\nwhich is the form of covariance used in factor analysis (Bishop 1997). 
This reduces the number of free parameters in the covariance matrix from $k(k+1)/2$ to $k(s+1)$ (representing $k(s+1) - s(s-1)/2$ independent degrees of freedom) which is now linear in $k$. Thus, the number of parameters can be controlled by changing $s$ and, unlike a diagonal covariance matrix, this model can still capture the strongest of the posterior correlations. The value of $s$ should be as large as possible, subject only to computational cost limitations. There is no 'over-fitting' as $s$ is increased since more flexible distributions $Q(w)$ simply better approximate the true posterior.\n\nWe illustrate the optimization of the KL divergence using a toy problem involving the posterior distribution for a two-parameter regression problem. Figure 1 shows the true posterior together with approximations obtained from Laplace's method, ensemble learning with a diagonal covariance Gaussian, and ensemble learning using an unconstrained Gaussian.\n\n2.1 Hyperparameter Adaptation\n\nSo far, we have treated the hyperparameters as fixed. We now extend the ensemble learning formalism to include hyperparameters within the Bayesian framework. For simplicity, we consider a standard isotropic prior covariance matrix of the form $A = \\alpha I$, and introduce hyperpriors given by Gamma distributions\n\n$$\\ln p(\\alpha) = \\ln \\left\\{ \\alpha^{a-1} \\exp\\left( -\\frac{\\alpha}{b} \\right) \\right\\} + \\mathrm{const} \\tag{19}$$\n\n$$\\ln p(\\beta) = \\ln \\left\\{ \\beta^{c-1} \\exp\\left( -\\frac{\\beta}{d} \\right) \\right\\} + \\mathrm{const} \\tag{20}$$\n\nwhere $a, b, c, d$ are constants. 
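The joint posterior distribution of the weights and hyperparameters is given by\n\n$$p(w, \\alpha, \\beta | D) \\propto p(D | w, \\beta) \\, p(w | \\alpha) \\, p(\\alpha) \\, p(\\beta) \\tag{21}$$\n\nin which\n\n$$\\ln p(D | w, \\beta) = -\\beta E_D + \\frac{N}{2} \\ln \\beta + \\mathrm{const} \\tag{22}$$\n\n$$\\ln p(w | \\alpha) = -\\frac{\\alpha}{2} |w|^2 + \\frac{k}{2} \\ln \\alpha + \\mathrm{const}. \\tag{23}$$\n\nWe follow MacKay (1995) by modelling the joint posterior $p(w, \\alpha, \\beta | D)$ by a factorized approximating distribution of the form\n\n$$Q(w) R(\\alpha) S(\\beta) \\tag{24}$$\n\nwhere $Q(w)$ is a Gaussian distribution as before, and the functional forms of $R$ and $S$ are left unspecified. We then minimize the KL divergence\n\n$$F[Q, R, S] = \\int Q(w) R(\\alpha) S(\\beta) \\ln \\left\\{ \\frac{Q(w) R(\\alpha) S(\\beta)}{p(w, \\alpha, \\beta | D)} \\right\\} dw \\, d\\alpha \\, d\\beta. \\tag{25}$$\n\nConsider first the dependence of (25) on $Q(w)$\n\n$$F[Q] = -\\int Q(w) R(\\alpha) S(\\beta) \\left\\{ -\\beta E_D(w) - \\frac{\\alpha}{2} |w|^2 - \\ln Q(w) \\right\\} dw \\, d\\alpha \\, d\\beta + \\mathrm{const} \\tag{26}$$\n\n$$= -\\int Q(w) \\left\\{ -\\bar{\\beta} E_D(w) - \\frac{\\bar{\\alpha}}{2} |w|^2 - \\ln Q(w) \\right\\} dw + \\mathrm{const} \\tag{27}$$\n\nwhere $\\bar{\\alpha} = \\int R(\\alpha) \\alpha \\, d\\alpha$ and $\\bar{\\beta} = \\int S(\\beta) \\beta \\, d\\beta$. We see that (27) has the form of (8), except that the fixed hyperparameters are now replaced with their average values. To calculate these averages, consider the dependence of the functional $F$ on $R(\\alpha)$\n\n$$F[R] = -\\int Q(w) R(\\alpha) S(\\beta) \\left\\{ -\\frac{\\alpha}{2} |w|^2 + \\frac{k}{2} \\ln \\alpha + (a - 1) \\ln \\alpha - \\frac{\\alpha}{b} - \\ln R(\\alpha) \\right\\} dw \\, d\\alpha \\, d\\beta + \\mathrm{const}$$\n\n$$= -\\int R(\\alpha) \\left\\{ -\\frac{\\alpha}{s} + (r - 1) \\ln \\alpha - \\ln R(\\alpha) \\right\\} d\\alpha + \\mathrm{const} \\tag{28}$$\n\nwhere $r = \\frac{k}{2} + a$ and $\\frac{1}{s} = \\frac{1}{2} |\\bar{w}|^2 + \\frac{1}{2} \\mathrm{Tr}\\, C + \\frac{1}{b}$. We recognise (28) as the Kullback-Leibler divergence between $R(\\alpha)$ and a Gamma distribution. Thus the optimum $R(\\alpha)$ is also Gamma distributed\n\n$$R(\\alpha) \\propto \\alpha^{r-1} \\exp\\left( -\\frac{\\alpha}{s} \\right). \\tag{29}$$\n\nWe therefore obtain $\\bar{\\alpha} = rs$.\n\nA similar procedure for $S(\\beta)$ gives $\\bar{\\beta} = uv$, where $u = \\frac{N}{2} + c$ and $\\frac{1}{v} = \\langle E_D \\rangle + \\frac{1}{d}$, in which $\\langle E_D \\rangle$ has already been calculated during the optimization of $Q(w)$.\n\nThis defines an iterative procedure in which we start by initializing the hyperparameters (using the mean of the hyperprior distributions) and then alternately optimize the KL divergence over $Q(w)$ and re-estimate $\\bar{\\alpha}$ and $\\bar{\\beta}$.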
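The re-estimation step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names and the summary statistics of Q(w) used below (posterior mean, covariance trace, expected training error) are hypothetical and not taken from the paper's experiments. Only the update rules (mean of alpha equal to r times s, mean of beta equal to u times v) and the hyperprior constants quoted in the Results section come from the text.

```python
def reestimate_alpha(w_mean, trace_c, a, b):
    # r = k/2 + a and 1/s = |w_mean|^2 / 2 + Tr(C) / 2 + 1/b; the Gamma mean is r * s.
    k = len(w_mean)
    r = 0.5 * k + a
    inv_s = 0.5 * sum(w * w for w in w_mean) + 0.5 * trace_c + 1.0 / b
    return r / inv_s


def reestimate_beta(expected_error, n_data, c, d):
    # u = N/2 + c and 1/v = E[E_D] + 1/d; the Gamma mean is u * v.
    u = 0.5 * n_data + c
    inv_v = expected_error + 1.0 / d
    return u / inv_v


# Hyperprior constants quoted in the Results section.
a, b, c, d = 0.25, 400.0, 0.05, 2000.0

# Initialization: the means of the Gamma hyperpriors, a * b and c * d.
alpha_init, beta_init = a * b, c * d

# Hypothetical summary statistics of a fitted Q(w), for illustration only.
w_mean = [0.5, -1.0, 2.0, 0.1]   # posterior mean, so k = 4
trace_c = 0.4                    # trace of the covariance matrix C
expected_error = 3.2             # expectation of E_D under Q(w)
n_data = 128                     # number of training examples

alpha_bar = reestimate_alpha(w_mean, trace_c, a, b)
beta_bar = reestimate_beta(expected_error, n_data, c, d)
print(alpha_init, beta_init, alpha_bar, beta_bar)
```

In a full run these two calls would sit inside a loop that re-optimizes Q(w) with the current averaged hyperparameters before each re-estimation.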
\n\n3 Results and Discussion\n\nAs a preliminary test of our method on a standard benchmark problem, we applied the minimum KL procedure to the Boston Housing dataset. This is a one-dimensional regression problem, with 13 inputs, in which the data for 128 training examples was obtained from the DELVE archive\u00b2. We trained a network of four hidden units, with covariance matrix given by (18) with $s = 1$, and specified broad hyperpriors on $\\alpha$ and $\\beta$ ($a = 0.25$, $b = 400$, $c = 0.05$, and $d = 2000$). Predictions are made by evaluating the integral in (6). This integration can be done analytically as a consequence of the form of the sigmoid function given in (2).\n\nWe compared the performance of the KL method against the Laplace framework of MacKay (1995), which also treats hyperparameters through a re-estimation procedure. We also evaluated the performance of the ensemble method using a diagonal covariance matrix. Our results are summarized in Table 1.\n\nMethod                 Test Error\nEnsemble (s = 1)       0.22\nEnsemble (diagonal)    0.28\nLaplace                0.33\n\nTable 1: Comparison of ensemble learning with Laplace's method. The test error is defined to be the mean squared error over the test set of 378 examples.\n\nAcknowledgements\n\nWe would like to thank Chris Williams for helpful discussions. Supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory for Neural Networks.\n\nReferences\n\nBarber, D. and C. M. Bishop (1997). On computing the KL divergence for Bayesian neural networks. Technical report, Neural Computing Research Group, Aston University, Birmingham, U.K.\n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.\n\nBishop, C. M. 
(1997). Latent variables, mixture distributions and topographic mappings. Technical report, Aston University. To appear in Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models, Erice.\n\nHinton, G. E. and D. van Camp (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5-13.\n\nMacKay, D. J. C. (1992). A practical Bayesian framework for back-propagation networks. Neural Computation 4(3), 448-472.\n\nMacKay, D. J. C. (1995). Developments in probabilistic modelling with neural networks - ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks, Nijmegen, Netherlands, 14-15 September 1995, Berlin, pp. 191-198. Springer.\n\nMacKay, D. J. C. (1995). Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems 6(3), 469-505.\n\nNeal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture Notes in Statistics 118.\n\n\u00b2 See http://www.cs.utoronto.ca/~delve/\n\n", "award": [], "sourceid": 1480, "authors": [{"given_name": "David", "family_name": "Barber", "institution": null}, {"given_name": "Christopher", "family_name": "Bishop", "institution": null}]}