{"title": "Regression with Input-Dependent Noise: A Bayesian Treatment", "book": "Advances in Neural Information Processing Systems", "page_first": 347, "page_last": 353, "abstract": null, "full_text": "Regression with Input-Dependent Noise:\nA Bayesian Treatment\n\nChristopher M. Bishop\nC.M.Bishop@aston.ac.uk\n\nCazhaow S. Qazaz\nqazazcs@aston.ac.uk\n\nNeural Computing Research Group\nAston University, Birmingham, B4 7ET, U.K.\nhttp://www.ncrg.aston.ac.uk/\n\nAbstract\n\nIn most treatments of the regression problem it is assumed that\nthe distribution of target data can be described by a deterministic\nfunction of the inputs, together with additive Gaussian noise having\nconstant variance. The use of maximum likelihood to train such\nmodels then corresponds to the minimization of a sum-of-squares\nerror function. In many applications a more realistic model would\nallow the noise variance itself to depend on the input variables.\nHowever, the use of maximum likelihood to train such models would\ngive highly biased results. In this paper we show how a Bayesian\ntreatment can allow for an input-dependent variance while overcoming\nthe bias of maximum likelihood.\n\n1 Introduction\n\nIn regression problems it is important not only to predict the output variables but\nalso to have some estimate of the error bars associated with those predictions. An\nimportant contribution to the error bars arises from the intrinsic noise on the data.\nIn most conventional treatments of regression, it is assumed that the noise can be\nmodelled by a Gaussian distribution with a constant variance. However, in many\napplications it will be more realistic to allow the noise variance itself to depend on\nthe input variables. 
A general framework for modelling the conditional probability\ndensity function of the target data, given the input vector, has been introduced in\nthe form of mixture density networks by Bishop (1994, 1995). This uses a feed-forward\nnetwork to set the parameters of a mixture kernel distribution, following\nJacobs et al. (1991). The special case of a single isotropic Gaussian kernel function\nwas discussed by Nix and Weigend (1995), and its generalization to allow for an\narbitrary covariance matrix was given by Williams (1996).\n\nThese approaches, however, are all based on the use of maximum likelihood, which\ncan lead to the noise variance being systematically under-estimated. Here we adopt\nan approximate hierarchical Bayesian treatment (MacKay, 1991) to find the most\nprobable interpolant and most probable input-dependent noise variance. We compare\nour results with maximum likelihood and show how this Bayesian approach\nleads to a significantly reduced bias.\n\nIn order to gain some insight into the limitations of the maximum likelihood approach,\nand to see how these limitations can be overcome in a Bayesian treatment, it\nis useful to consider first a much simpler problem involving a single random variable\n(Bishop, 1995). Suppose that a variable $z$ is known to have a Gaussian distribution,\nbut with unknown mean $\\mu$ and unknown variance $\\sigma^2$. Given a sample $D \\equiv \\{z_n\\}$\ndrawn from that distribution, where $n = 1, \\ldots, N$, our goal is to infer values for the\nmean and variance. The likelihood function is given by\n\n$$p(D|\\mu, \\sigma^2) = \\frac{1}{(2\\pi\\sigma^2)^{N/2}} \\exp\\left\\{ -\\frac{1}{2\\sigma^2} \\sum_{n=1}^{N} (z_n - \\mu)^2 \\right\\}. \\tag{1}$$\n\nA non-Bayesian approach to finding the mean and variance is to maximize the\nlikelihood jointly over $\\mu$ 
and $\\sigma^2$, corresponding to the intuitive idea of finding the\nparameter values which are most likely to have given rise to the observed data set.\nThis yields the standard result\n\n$$\\hat{\\sigma}^2 = \\frac{1}{N} \\sum_{n=1}^{N} (z_n - \\hat{\\mu})^2 \\tag{2}$$\n\nwhere $\\hat{\\mu}$ is the maximum likelihood estimate of the mean. It is well known that the\nestimate $\\hat{\\sigma}^2$ for the variance given in (2) is biased since\nthe expectation of this estimate is not equal to the true value\n\n$$\\mathcal{E}[\\hat{\\sigma}^2] = \\frac{N-1}{N} \\sigma_0^2 \\tag{3}$$\n\nwhere $\\sigma_0^2$ is the true variance of the distribution which generated the data, and\n$\\mathcal{E}[\\cdot]$ denotes an average over data sets of size $N$. For large $N$ this effect is small.\nHowever, in the case of regression problems there is generally a much larger number\nof degrees of freedom in relation to the number of available data points, in which\ncase the effect of this bias can be very substantial.\n\nThe problem of bias can be regarded as a symptom of the maximum likelihood\napproach. Because the mean $\\hat{\\mu}$ has been estimated from the data, it has fitted some\nof the noise on the data and this leads to an under-estimate of the variance. If the\ntrue mean is used in the expression for $\\hat{\\sigma}^2$ in (2) instead of the maximum likelihood\nexpression, then the estimate is unbiased.\n\nBy adopting a Bayesian viewpoint this bias can be removed. The marginal likelihood\nof $\\sigma^2$ should be computed by integrating over the mean $\\mu$. Assuming a 'flat' prior\n$p(\\mu)$ we obtain\n\n$$p(D|\\sigma^2) = \\int p(D|\\mu, \\sigma^2)\\, p(\\mu)\\, d\\mu \\tag{4}$$\n\n$$\\propto \\frac{1}{\\sigma^{N-1}} \\exp\\left\\{ -\\frac{1}{2\\sigma^2} \\sum_{n=1}^{N} (z_n - \\hat{\\mu})^2 \\right\\}. \\tag{5}$$\n\nMaximizing (5) with respect to $\\sigma^2$ then gives\n\n$$\\tilde{\\sigma}^2 = \\frac{1}{N-1} \\sum_{n=1}^{N} (z_n - \\hat{\\mu})^2 \\tag{6}$$\n\nwhich is unbiased.\n\nThis result is illustrated in Figure 1 which shows contours of $p(D|\\mu, \\sigma^2)$ together\nwith the marginal likelihood $p(D|\\sigma^2)$ and the conditional likelihood $p(D|\\hat{\\mu}, \\sigma^2)$\nevaluated at $\\mu = \\hat{\\mu}$. 
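
The bias described in (3), and its removal when the sum of squares is divided by N - 1 as in (6), can be checked numerically. The following is a minimal sketch, assuming zero-mean, unit-variance Gaussian data and N = 4 points per data set (the setting of Figure 1). The function names, the number of simulated data sets and the random seed are illustrative choices, not taken from the paper.

```python
import random


def variance_estimates(sample):
    # Return (maximum likelihood estimate, marginal likelihood estimate)
    # of the variance: the sum of squared deviations from the sample mean,
    # divided by N and by N - 1 respectively.
    n = len(sample)
    mean_hat = sum(sample) / n
    sum_sq = sum((z - mean_hat) ** 2 for z in sample)
    return sum_sq / n, sum_sq / (n - 1)


def average_estimates(n_points=4, n_sets=20000, true_sigma=1.0, seed=0):
    # Average both estimators over many independent data sets of size n_points.
    rng = random.Random(seed)
    ml_total = 0.0
    marginal_total = 0.0
    for _ in range(n_sets):
        sample = [rng.gauss(0.0, true_sigma) for _ in range(n_points)]
        ml, marginal = variance_estimates(sample)
        ml_total += ml
        marginal_total += marginal
    return ml_total / n_sets, marginal_total / n_sets


ml_avg, marginal_avg = average_estimates()
# Expect ml_avg near (N - 1) / N = 0.75 and marginal_avg near 1.0.
print(ml_avg, marginal_avg)
```

With only 4 points per data set the maximum likelihood average should fall roughly 25 percent below the true variance, while the N - 1 estimate should average close to it.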
\n\n[Figure 1 plots omitted. Horizontal axes: mean (left panel), likelihood (right panel); vertical axis: variance.]\n\nFigure 1: The left hand plot shows contours of the likelihood function $p(D|\\mu, \\sigma^2)$ given\nby (1) for 4 data points drawn from a Gaussian distribution having zero mean and unit\nvariance. The right hand plot shows the marginal likelihood function $p(D|\\sigma^2)$ (dashed\ncurve) and the conditional likelihood function $p(D|\\hat{\\mu}, \\sigma^2)$ (solid curve). It can be seen that\nthe skewed contours result in a value of $\\hat{\\sigma}^2$, which maximizes $p(D|\\hat{\\mu}, \\sigma^2)$, that is smaller\nthan the value $\\tilde{\\sigma}^2$ which maximizes $p(D|\\sigma^2)$.\n\n2 Bayesian Regression\n\nConsider a regression problem involving the prediction of a noisy variable $t$ given\nthe value of a vector $\\mathbf{x}$ of input variables [1]. Our goal is to predict both a regression\nfunction and an input-dependent noise variance. We shall therefore consider two\nnetworks. The first network takes the input vector $\\mathbf{x}$ and generates an output\n$y(\\mathbf{x}; \\mathbf{w})$ which represents the regression function, and is governed by a vector of\nweight parameters $\\mathbf{w}$. The second network also takes the input vector $\\mathbf{x}$, and\ngenerates an output function $\\beta(\\mathbf{x}; \\mathbf{u})$ representing the inverse variance of the noise\ndistribution, and is governed by a vector of weight parameters $\\mathbf{u}$. The conditional\ndistribution of target data, given the input vector, is then modelled by a normal\ndistribution $p(t|\\mathbf{x}, \\mathbf{w}, \\mathbf{u}) = \\mathcal{N}(t|y, \\beta^{-1})$. From this we obtain the likelihood function\n\n$$p(D|\\mathbf{w}, \\mathbf{u}) = \\frac{1}{Z_D} \\exp\\left( -\\sum_{n=1}^{N} \\beta_n E_n \\right) \\tag{7}$$\n\nwhere $\\beta_n = \\beta(\\mathbf{x}_n; \\mathbf{u})$,\n\n$$Z_D = \\prod_{n=1}^{N} \\left( \\frac{2\\pi}{\\beta_n} \\right)^{1/2}, \\qquad E_n = \\frac{1}{2} \\left( y(\\mathbf{x}_n; \\mathbf{w}) - t_n \\right)^2 \\tag{8}$$\n\nand $D \\equiv \\{\\mathbf{x}_n, t_n\\}$ is the data set.\n\n[1] For simplicity we consider a single output variable. The extension of this work to\nmultiple outputs is straightforward.\n
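
As a minimal sketch of the noise model described above, the negative log of the likelihood for a Gaussian with mean y and inverse variance beta can be evaluated per data point and summed. The function name negative_log_likelihood and the use of plain Python lists for targets, predicted means and inverse variances are illustrative assumptions, not part of the original paper.

```python
import math


def negative_log_likelihood(targets, means, betas):
    # Sum over data points of beta_n * E_n + 0.5 * ln(2 * pi / beta_n),
    # where E_n = 0.5 * (y_n - t_n)^2 and beta_n is the inverse noise variance
    # predicted for the n-th input.
    total = 0.0
    for t, y, beta in zip(targets, means, betas):
        error = 0.5 * (y - t) ** 2
        total += beta * error + 0.5 * math.log(2.0 * math.pi / beta)
    return total


# Example: three targets, a constant prediction, and input-dependent precision.
print(negative_log_likelihood([0.1, -0.2, 0.4], [0.0, 0.0, 0.0], [4.0, 1.0, 0.25]))
```

When a prediction coincides exactly with its target, the error term vanishes and this quantity keeps decreasing as beta grows, so an unconstrained joint fit is free to drive the estimated noise variance at that point towards zero.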
\nSome simplification of the subsequent analysis is obtained by taking the regression\nfunction, and $\\ln \\beta$, to be given by linear combinations of fixed basis functions, as in\nMacKay (1995), so that\n\n$$y(\\mathbf{x}; \\mathbf{w}) = \\mathbf{w}^{\\mathrm{T}} \\boldsymbol{\\phi}(\\mathbf{x}), \\qquad \\beta(\\mathbf{x}; \\mathbf{u}) = \\exp\\left( \\mathbf{u}^{\\mathrm{T}} \\boldsymbol{\\psi}(\\mathbf{x}) \\right) \\tag{9}$$\n\nwhere we choose one basis function in each network to be a constant, $\\phi_0 = \\psi_0 = 1$, so\nthat the corresponding weights $w_0$ and $u_0$ represent bias parameters.\n\nThe maximum likelihood procedure chooses values for $\\mathbf{w}$ and $\\mathbf{u}$ by finding a joint\nmaximum of the likelihood over both. As we have already indicated, this will give a biased result\nsince the regression function inevitably fits part of the noise on the data, leading\nto an over-estimate of $\\beta(\\mathbf{x})$. In extreme cases, where the regression curve passes\nexactly through a data point, the corresponding estimate of $\\beta$ can go to infinity,\ncorresponding to an estimated noise variance of zero.\n\nThe solution to this problem has already been indicated in Section 1 and was first\nsuggested in this context by MacKay (1991, Chapter 6). In order to obtain an\nunbiased estimate of $\\beta(\\mathbf{x})$ we must find the marginal distribution of $\\beta$, or equivalently\nof $\\mathbf{u}$, in which we have integrated out the dependence on $\\mathbf{w}$. This leads to a\nhierarchical Bayesian analysis.\n\nWe begin by defining priors over the parameters $\\mathbf{w}$ and $\\mathbf{u}$. Here we consider isotropic\nGaussian priors of the form\n\n$$p(\\mathbf{w}|\\alpha_w) = \\mathcal{N}(\\mathbf{w}|\\mathbf{0}, \\alpha_w^{-1} \\mathbf{I}) \\tag{10}$$\n\n$$p(\\mathbf{u}|\\alpha_u) = \\mathcal{N}(\\mathbf{u}|\\mathbf{0}, \\alpha_u^{-1} \\mathbf{I}) \\tag{11}$$\n\nwhere $\\alpha_w$ and $\\alpha_u$ are hyper-parameters. At the first stage of the hierarchy, we\nassume that $\\mathbf{u}$ is fixed to its most probable value $\\mathbf{u}_{\\mathrm{MP}}$, which will be determined\nshortly. 
The most probable value of $\\mathbf{w}$, denoted by $\\mathbf{w}_{\\mathrm{MP}}$, is then found by maximizing\nthe posterior distribution [2]\n\n$$p(\\mathbf{w}|D, \\mathbf{u}_{\\mathrm{MP}}, \\alpha_w) = \\frac{p(D|\\mathbf{w}, \\mathbf{u}_{\\mathrm{MP}})\\, p(\\mathbf{w}|\\alpha_w)}{p(D|\\mathbf{u}_{\\mathrm{MP}}, \\alpha_w)} \\tag{12}$$\n\nwhere the denominator in (12) is given by\n\n$$p(D|\\mathbf{u}_{\\mathrm{MP}}, \\alpha_w) = \\int p(D|\\mathbf{w}, \\mathbf{u}_{\\mathrm{MP}})\\, p(\\mathbf{w}|\\alpha_w)\\, d\\mathbf{w}. \\tag{13}$$\n\nTaking the negative log of (12), and dropping constant terms, we see that $\\mathbf{w}_{\\mathrm{MP}}$ is\nobtained by minimizing\n\n$$S(\\mathbf{w}) = \\sum_{n=1}^{N} \\beta_n E_n + \\frac{\\alpha_w}{2} \\|\\mathbf{w}\\|^2 \\tag{14}$$\n\nwhere we have used (7) and (10). For the particular choice of model (9) this minimization\nrepresents a linear problem which is easily solved (for a given $\\mathbf{u}$) by\nstandard matrix techniques.\n\nAt the next level of the hierarchy, we find $\\mathbf{u}_{\\mathrm{MP}}$ by maximizing the marginal posterior\ndistribution\n\n$$p(\\mathbf{u}|D, \\alpha_w, \\alpha_u) = \\frac{p(D|\\mathbf{u}, \\alpha_w)\\, p(\\mathbf{u}|\\alpha_u)}{p(D|\\alpha_w, \\alpha_u)}. \\tag{15}$$\n\nThe term $p(D|\\mathbf{u}, \\alpha_w)$ is just the denominator from (12) and is found by integrating\nover $\\mathbf{w}$ as in (13). For the model (9) and prior (10) this integral is Gaussian and\ncan be performed analytically without approximation. Again taking logarithms and\ndiscarding constants, we have to minimize\n\n$$M(\\mathbf{u}) = \\sum_{n=1}^{N} \\beta_n E_n + \\frac{\\alpha_u}{2} \\|\\mathbf{u}\\|^2 - \\frac{1}{2} \\sum_{n=1}^{N} \\ln \\beta_n + \\frac{1}{2} \\ln |\\mathbf{A}| \\tag{16}$$\n\nwhere $|\\mathbf{A}|$ denotes the determinant of the Hessian matrix $\\mathbf{A}$ given by\n\n$$\\mathbf{A} = \\sum_{n=1}^{N} \\beta_n \\boldsymbol{\\phi}(\\mathbf{x}_n) \\boldsymbol{\\phi}(\\mathbf{x}_n)^{\\mathrm{T}} + \\alpha_w \\mathbf{I} \\tag{17}$$\n\nand $\\mathbf{I}$ is the unit matrix. The function $M(\\mathbf{u})$ in (16) can be minimized using\nstandard non-linear optimization algorithms. We use scaled conjugate gradients, in\nwhich the necessary derivatives of $\\ln |\\mathbf{A}|$ are easily found in terms of the eigenvalues\nof $\\mathbf{A}$.\n\nIn summary, the algorithm requires an outer loop in which the most probable value\n$\\mathbf{u}_{\\mathrm{MP}}$ is found by non-linear minimization of (16), using the scaled conjugate\ngradient algorithm. 
Each time the optimization code requires a value for $M(\\mathbf{u})$ or\nits gradient, for a new value of $\\mathbf{u}$, the optimum value for $\\mathbf{w}_{\\mathrm{MP}}$ must be found by\nminimizing (14). In effect, $\\mathbf{w}$ is evolving on a fast time-scale, and $\\mathbf{u}$ on a slow\ntime-scale. The corresponding maximum (penalized) likelihood approach consists of a\njoint non-linear optimization over $\\mathbf{u}$ and $\\mathbf{w}$ of the posterior distribution $p(\\mathbf{w}, \\mathbf{u}|D)$\nobtained from (7), (10) and (11). Finally, the hyperparameters are given fixed values\n$\\alpha_w = \\alpha_u = 0.1$ as this allows the maximum likelihood and Bayesian approaches\nto be treated on an equal footing.\n\n[2] Note that the result will be dependent on the choice of parametrization since the\nmaximum of a distribution is not invariant under a change of variable.\n\n3 Results and Discussion\n\nAs an illustration of this algorithm, we consider a toy problem involving one input\nand one output, with a noise variance which has an $x^2$ dependence on the input\nvariable. Since the estimated quantities are noisy, due to the finite data set, we\nconsider an averaging procedure as follows. We generate 100 independent data sets\neach consisting of 10 data points. The model is trained on each of the data sets in\nturn and then tested on the remaining 99 data sets. Both the $y(\\mathbf{x}; \\mathbf{w})$ and $\\beta(\\mathbf{x}; \\mathbf{u})$\nnetworks have 4 Gaussian basis functions (plus a bias) with width parameters chosen\nto equal the spacing of the centres.\n\nResults are shown in Figure 2. It is clear that the maximum likelihood results are\nbiased and that the noise variance is systematically underestimated. By contrast,\nthe Bayesian results show an improved estimate of the noise variance. This is borne\nout by evaluating the log likelihood for the test data under the corresponding predictive\ndistributions. The Bayesian approach gives a log likelihood per data point,\naveraged over the 100 runs, of -1.38. Due to the over-fitting problem, maximum\nlikelihood occasionally gives extremely large negative values for the log likelihood\n(when $\\beta$ has been estimated to be very large, corresponding to a regression curve\nwhich passes close to an individual data point). Even omitting these extreme values,\nthe maximum likelihood still gives an average log likelihood per data point of\n-17.1 which is substantially smaller than the Bayesian result.\n\n[Figure 2 plots omitted. Panel titles: 'Maximum likelihood' and 'Bayesian'. Axes: output against x (left hand plots), noise variance against x (right hand plots).]\n\nFigure 2: The left hand plots show the sinusoidal function (dashed curve) from which\nthe data were generated, together with the regression function averaged over 100 training\nsets. The right hand plots show the true noise variance (dashed curve) together with the\nestimated noise variance, again averaged over 100 data sets.\n\nWe are currently exploring the use of Markov chain Monte Carlo methods (Neal,\n1993) to perform the integrations required by the Bayesian analysis numerically,\nwithout the need to introduce the Gaussian approximation or the evidence framework.\nRecently, MacKay (1995) has proposed an alternative technique based on\nGibbs sampling. 
It will be interesting to compare these various approaches.\n\nAcknowledgements: This work was supported by EPSRC grant GR/K51792,\nValidation and Verification of Neural Network Systems.\n\nReferences\n\nBishop, C. M. (1994). Mixture density networks. Technical Report\nNCRG/94/001, Neural Computing Research Group, Aston University,\nBirmingham, UK.\n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford\nUniversity Press.\n\nJacobs, R. A., M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991). Adaptive\nmixtures of local experts. Neural Computation 3 (1), 79-87.\n\nMacKay, D. J. C. (1991). Bayesian Methods for Adaptive Models. Ph.D. thesis,\nCalifornia Institute of Technology.\n\nMacKay, D. J. C. (1995). Probabilistic networks: new models and new methods.\nIn F. Fogelman-Soulie and P. Gallinari (Eds.), Proceedings ICANN'95\nInternational Conference on Artificial Neural Networks, pp. 331-337. Paris: EC2\n& Cie.\n\nNeal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo\nmethods. Technical Report CRG-TR-93-1, Department of Computer Science,\nUniversity of Toronto, Canada.\n\nNix, A. D. and A. S. Weigend (1995). Learning local error bars for nonlinear\nregression. In G. Tesauro, D. S. Touretzky, and T. K. Leen (Eds.), Advances in\nNeural Information Processing Systems, Volume 7, pp. 489-496. Cambridge,\nMA: MIT Press.\n\nWilliams, P. M. (1996). Using neural networks to model conditional multivariate\ndensities. Neural Computation 8 (4), 843-854.\n", "award": [], "sourceid": 1191, "authors": [{"given_name": "Christopher", "family_name": "Bishop", "institution": null}, {"given_name": "Cazhaow", "family_name": "Quazaz", "institution": null}]}