{"title": "Computing with Infinite Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 295, "page_last": 301, "abstract": null, "full_text": "Computing with infinite networks\n\nChristopher K. I. Williams\n\nNeural Computing Research Group\n\nDepartment of Computer Science and Applied Mathematics\n\nAston University, Birmingham B4 7ET, UK\n\nc.k.i.williams@aston.ac.uk\n\nAbstract\n\nFor neural networks with a wide class of weight-priors, it can be shown that in the limit of an infinite number of hidden units the prior over functions tends to a Gaussian process. In this paper analytic forms are derived for the covariance function of the Gaussian processes corresponding to networks with sigmoidal and Gaussian hidden units. This allows predictions to be made efficiently using networks with an infinite number of hidden units, and shows that, somewhat paradoxically, it may be easier to compute with infinite networks than finite ones.\n\n1 Introduction\n\nTo someone training a neural network by maximizing the likelihood of a finite amount of data it makes no sense to use a network with an infinite number of hidden units; the network will \"overfit\" the data and so will be expected to generalize poorly. However, the idea of selecting the network size depending on the amount of training data makes little sense to a Bayesian; a model should be chosen that reflects the understanding of the problem, and then application of Bayes' theorem allows inference to be carried out (at least in theory) after the data is observed.\n\nIn the Bayesian treatment of neural networks, a question immediately arises as to how many hidden units are believed to be appropriate for a task. 
Neal (1996) has argued compellingly that for real-world problems, there is no reason to believe that neural network models should be limited to nets containing only a \"small\" number of hidden units. He has shown that it is sensible to consider a limit where the number of hidden units in a net tends to infinity, and that good predictions can be obtained from such models using the Bayesian machinery. He has also shown that for fixed hyperparameters, a large class of neural network models will converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units.\n\nNeal's argument is an existence proof: it states that an infinite neural net will converge to a Gaussian process, but does not give the covariance function needed to actually specify the particular Gaussian process. In this paper I show that for certain weight priors and transfer functions in the neural network model, the covariance function which describes the behaviour of the corresponding Gaussian process can be calculated analytically. This allows predictions to be made using neural networks with an infinite number of hidden units in time $O(n^3)$, where $n$ is the number of training examples (see footnote 1). The only alternative currently available is to use Markov Chain Monte Carlo (MCMC) methods (e.g. Neal, 1996) for networks with a large (but finite) number of hidden units. However, this is likely to be computationally expensive, and we note possible concerns over the time needed for the Markov chain to reach equilibrium. 
The availability of an analytic form for the covariance function also facilitates the comparison of the properties of neural networks with an infinite number of hidden units with those of other Gaussian process priors that may be considered.\n\nThe Gaussian process analysis applies for fixed hyperparameters $\\theta$. If it were desired to make predictions based on a hyperprior $P(\\theta)$ then the necessary $\\theta$-space integration could be achieved by MCMC methods. The great advantage of integrating out the weights analytically is that it dramatically reduces the dimensionality of the MCMC integrals, and thus improves their speed of convergence.\n\n1.1 From priors on weights to priors on functions\n\nBayesian neural networks are usually specified in a hierarchical manner, so that the weights $\\mathbf{w}$ are regarded as being drawn from a distribution $P(\\mathbf{w}|\\theta)$. For example, the weights might be drawn from a zero-mean Gaussian distribution, where $\\theta$ specifies the variance of groups of weights. A full description of the prior is given by specifying $P(\\theta)$ as well as $P(\\mathbf{w}|\\theta)$. The hyperprior can be integrated out to give $P(\\mathbf{w}) = \\int P(\\mathbf{w}|\\theta) P(\\theta) \\, d\\theta$, but in our case it will be advantageous not to do this as it introduces weight correlations which prevent convergence to a Gaussian process.\n\nIn the Bayesian view of neural networks, predictions for the output value $y_*$ corresponding to a new input value $\\mathbf{x}_*$ are made by integrating over the posterior in weight space. Let $D = ((\\mathbf{x}_1, t_1), (\\mathbf{x}_2, t_2), \\ldots, (\\mathbf{x}_n, t_n))$ denote the $n$ training data pairs, $\\mathbf{t} = (t_1, \\ldots, t_n)^T$ and $f_*(\\mathbf{w})$ denote the mapping carried out by the network on input $\\mathbf{x}_*$ given weights $\\mathbf{w}$. $P(\\mathbf{w}|\\mathbf{t}, \\theta)$ is the weight posterior given the training data (see footnote 2). Then the predictive distribution for $y_*$ 
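given the training data and hyperparameters $\\theta$ is\n\n$$P(y_*|\\mathbf{t}, \\theta) = \\int \\delta(y_* - f_*(\\mathbf{w})) \\, P(\\mathbf{w}|\\mathbf{t}, \\theta) \\, d\\mathbf{w} \\qquad (1)$$\n\nWe will now show how this can also be viewed as making the prediction using priors over functions rather than weights. Let $\\mathbf{f}(\\mathbf{w})$ denote the vector of outputs corresponding to inputs $(\\mathbf{x}_1, \\ldots, \\mathbf{x}_n)$ given weights $\\mathbf{w}$. Then, using Bayes' theorem we have $P(\\mathbf{w}|\\mathbf{t}, \\theta) = P(\\mathbf{t}|\\mathbf{w}) P(\\mathbf{w}|\\theta) / P(\\mathbf{t}|\\theta)$, and $P(\\mathbf{t}|\\mathbf{w}) = \\int P(\\mathbf{t}|\\mathbf{y}) \\, \\delta(\\mathbf{y} - \\mathbf{f}(\\mathbf{w})) \\, d\\mathbf{y}$. Hence equation 1 can be rewritten as\n\n$$P(y_*|\\mathbf{t}, \\theta) = \\frac{1}{P(\\mathbf{t}|\\theta)} \\int \\int P(\\mathbf{t}|\\mathbf{y}) \\, \\delta(y_* - f_*(\\mathbf{w})) \\, \\delta(\\mathbf{y} - \\mathbf{f}(\\mathbf{w})) \\, P(\\mathbf{w}|\\theta) \\, d\\mathbf{w} \\, d\\mathbf{y} \\qquad (2)$$\n\nHowever, the prior over $(y_*, y_1, \\ldots, y_n)$ is given by $P(y_*, \\mathbf{y}|\\theta) = P(y_*|\\mathbf{y}, \\theta) P(\\mathbf{y}|\\theta) = \\int \\delta(y_* - f_*(\\mathbf{w})) \\, \\delta(\\mathbf{y} - \\mathbf{f}(\\mathbf{w})) \\, P(\\mathbf{w}|\\theta) \\, d\\mathbf{w}$ and thus the predictive distribution can be written as\n\n$$P(y_*|\\mathbf{t}, \\theta) = \\frac{1}{P(\\mathbf{t}|\\theta)} \\int P(\\mathbf{t}|\\mathbf{y}) P(y_*|\\mathbf{y}, \\theta) P(\\mathbf{y}|\\theta) \\, d\\mathbf{y} = \\int P(y_*|\\mathbf{y}, \\theta) P(\\mathbf{y}|\\mathbf{t}, \\theta) \\, d\\mathbf{y} \\qquad (3)$$\n\nHence in a Bayesian view it is the prior over function values $P(y_*, \\mathbf{y}|\\theta)$ which is important; specifying this prior by using weight distributions is one valid way to achieve this goal. In general we can use the weight space or function space view, whichever is more convenient, and for infinite neural networks the function space view is more useful.\n\nFootnote 1: For large $n$, various approximations to the exact solution which avoid the inversion of an $n \\times n$ matrix are available.\n\nFootnote 2: For notational convenience we suppress the $\\mathbf{x}$-dependence of the posterior.\n\n2 Gaussian processes\n\nA stochastic process is a collection of random variables $\\{Y(\\mathbf{x}) \\mid \\mathbf{x} \\in X\\}$ indexed by a set $X$. In our case $X$ will be $\\mathbb{R}^d$, where $d$ is the number of inputs. The stochastic process is specified by giving the probability distribution for every finite subset of variables $Y(\\mathbf{x}_1), \\ldots, Y(\\mathbf{x}_k)$ in a consistent manner. 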
A Gaussian process (GP) is a stochastic process which can be fully specified by its mean function $\\mu(\\mathbf{x}) = E[Y(\\mathbf{x})]$ and its covariance function $C(\\mathbf{x}, \\mathbf{x}') = E[(Y(\\mathbf{x}) - \\mu(\\mathbf{x}))(Y(\\mathbf{x}') - \\mu(\\mathbf{x}'))]$; any finite set of $Y$-variables will have a joint multivariate Gaussian distribution. For a multidimensional input space a Gaussian process may also be called a Gaussian random field.\n\nBelow we consider Gaussian processes which have $\\mu(\\mathbf{x}) = 0$, as is the case for the neural network priors discussed in section 3. A non-zero $\\mu(\\mathbf{x})$ can be incorporated into the framework at the expense of a little extra complexity.\n\nA widely used class of covariance functions is the stationary covariance functions, whereby $C(\\mathbf{x}, \\mathbf{x}') = C(\\mathbf{x} - \\mathbf{x}')$. These are related to the spectral density (or power spectrum) of the process by the Wiener-Khinchine theorem, and are particularly amenable to Fourier analysis as the eigenfunctions of a stationary covariance kernel are $\\exp(i \\mathbf{k} \\cdot \\mathbf{x})$. Many commonly used covariance functions are also isotropic, so that $C(\\mathbf{h}) = C(h)$ where $\\mathbf{h} = \\mathbf{x} - \\mathbf{x}'$ and $h = |\\mathbf{h}|$. For example $C(h) = \\exp(-(h/\\sigma)^{\\nu})$ is a valid covariance function for all $d$ and for $0 < \\nu \\le 2$. Note that in this case $\\sigma$ sets the correlation length-scale of the random field, although other covariance functions (e.g. those corresponding to power-law spectral densities) may have no preferred length scale.\n\n2.1 Prediction with Gaussian processes\n\nThe model for the observed data is that it was generated from the prior stochastic process, and that independent Gaussian noise (of variance $\\sigma_\\nu^2$) was then added. Given a prior covariance function $C_P(\\mathbf{x}_i, \\mathbf{x}_j)$, a noise process $C_N(\\mathbf{x}_i, \\mathbf{x}_j) = \\sigma_\\nu^2 \\delta_{ij}$ (i.e. independent noise of variance $\\sigma_\\nu^2$ at each data point) and the training data, the prediction for the distribution of $y_*$ 
corresponding to a test point $\\mathbf{x}_*$ is obtained simply by applying equation 3. As the prior and noise model are both Gaussian the integral can be done analytically and $P(y_*|\\mathbf{t}, \\theta)$ is Gaussian with mean and variance\n\n$$\\hat{y}(\\mathbf{x}_*) = \\mathbf{k}_P^T(\\mathbf{x}_*) (K_P + K_N)^{-1} \\mathbf{t} \\qquad (4)$$\n\n$$\\sigma_{\\hat{y}}^2(\\mathbf{x}_*) = C_P(\\mathbf{x}_*, \\mathbf{x}_*) - \\mathbf{k}_P^T(\\mathbf{x}_*) (K_P + K_N)^{-1} \\mathbf{k}_P(\\mathbf{x}_*) \\qquad (5)$$\n\nwhere $[K_\\alpha]_{ij} = C_\\alpha(\\mathbf{x}_i, \\mathbf{x}_j)$ for $\\alpha = P, N$ and $\\mathbf{k}_P(\\mathbf{x}_*) = (C_P(\\mathbf{x}_*, \\mathbf{x}_1), \\ldots, C_P(\\mathbf{x}_*, \\mathbf{x}_n))^T$. $\\sigma_{\\hat{y}}^2(\\mathbf{x}_*)$ gives the \"error bars\" of the prediction.\n\nEquations 4 and 5 are the analogue for spatial processes of Wiener-Kolmogorov prediction theory. They have appeared in a wide variety of contexts including geostatistics where the method is known as \"kriging\" (Journel and Huijbregts, 1978; Cressie, 1993), multidimensional spline smoothing (Wahba, 1990), in the derivation of radial basis function neural networks (Poggio and Girosi, 1990) and in the work of Whittle (1963).\n\n3 Covariance functions for Neural Networks\n\nConsider a network which takes an input $\\mathbf{x}$, has one hidden layer with $H$ units and then linearly combines the outputs of the hidden units with a bias to obtain $f(\\mathbf{x})$. The mapping can be written\n\n$$f(\\mathbf{x}) = b + \\sum_{j=1}^{H} v_j h(\\mathbf{x}; \\mathbf{u}_j) \\qquad (6)$$\n\nwhere $h(\\mathbf{x}; \\mathbf{u})$ is the hidden unit transfer function (which we shall assume is bounded) which depends on the input-to-hidden weights $\\mathbf{u}$. This architecture is important because it has been shown by Hornik (1993) that networks with one hidden layer are universal approximators as the number of hidden units tends to infinity, for a wide class of transfer functions (but excluding polynomials). Let $b$ and the $v$'s have independent zero-mean distributions of variance $\\sigma_b^2$ and $\\sigma_v^2$ respectively, and let the weights $\\mathbf{u}_j$ for each hidden unit be independently and identically distributed. 
Denoting all weights by $\\mathbf{w}$, we obtain (following Neal, 1996)\n\n$$E_{\\mathbf{w}}[f(\\mathbf{x})] = 0 \\qquad (7)$$\n\n$$E_{\\mathbf{w}}[f(\\mathbf{x}) f(\\mathbf{x}')] = \\sigma_b^2 + \\sum_j \\sigma_v^2 E_{\\mathbf{u}}[h_j(\\mathbf{x}; \\mathbf{u}) h_j(\\mathbf{x}'; \\mathbf{u})] \\qquad (8)$$\n\n$$= \\sigma_b^2 + H \\sigma_v^2 E_{\\mathbf{u}}[h(\\mathbf{x}; \\mathbf{u}) h(\\mathbf{x}'; \\mathbf{u})] \\qquad (9)$$\n\nwhere equation 9 follows because all of the hidden units are identically distributed. The final term in equation 9 becomes $\\omega^2 E_{\\mathbf{u}}[h(\\mathbf{x}; \\mathbf{u}) h(\\mathbf{x}'; \\mathbf{u})]$ by letting $\\sigma_v^2$ scale as $\\omega^2 / H$.\n\nAs the transfer function is bounded, all moments of the distribution will be bounded and hence the Central Limit Theorem can be applied, showing that the stochastic process will become a Gaussian process in the limit as $H \\to \\infty$.\n\nBy evaluating $E_{\\mathbf{u}}[h(\\mathbf{x}) h(\\mathbf{x}')]$ for all $\\mathbf{x}$ and $\\mathbf{x}'$ in the training and testing sets we can obtain the covariance function needed to describe the neural network as a Gaussian process. These expectations are, of course, integrals over the relevant probability distributions of the biases and input weights. In the following sections two specific choices for the transfer functions are considered, (1) a sigmoidal function and (2) a Gaussian. Gaussian weight priors are used in both cases.\n\nIt is interesting to note why this analysis cannot be taken a stage further to integrate out any hyperparameters as well. For example, the variance $\\sigma_v^2$ of the $v$ weights might be drawn from an inverse Gamma distribution. In this case the distribution $P(\\mathbf{v}) = \\int P(\\mathbf{v}|\\sigma_v^2) P(\\sigma_v^2) \\, d\\sigma_v^2$ is no longer the product of the marginal distributions for each $v$ weight (in fact it will be a multivariate t-distribution). A similar analysis can be applied to the $\\mathbf{u}$ weights with a hyperprior. The effect is to make the hidden units non-independent, so that the Central Limit Theorem can no longer be applied. 
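The scaling argument behind equations 6 to 9 can be illustrated numerically. The following is a minimal sketch rather than part of the derivation: it assumes a scalar input, an erf transfer function, and independent N(0, sigma_u^2) priors on the bias and input weight of each hidden unit, and every function and parameter name is hypothetical. It draws finite networks from the prior with sigma_v^2 = omega^2 / H and compares the sample average of f(x) f(x') with sigma_b^2 + omega^2 E_u[h(x; u) h(x'; u)], where the expectation is itself estimated by sampling.

```python
import math
import random


def hidden(x, u0, u1):
    # Bounded transfer function h(x; u) = erf(u0 + u1 * x) for a scalar input x.
    return math.erf(u0 + u1 * x)


def sample_outputs(x, xp, H, sigma_b, omega, sigma_u, rng):
    # Draw one H-unit network from the prior and return (f(x), f(x')).
    sigma_v = omega / math.sqrt(H)  # sigma_v^2 = omega^2 / H
    b = rng.gauss(0.0, sigma_b)
    fx, fxp = b, b
    for _ in range(H):
        v = rng.gauss(0.0, sigma_v)
        u0 = rng.gauss(0.0, sigma_u)
        u1 = rng.gauss(0.0, sigma_u)
        fx += v * hidden(x, u0, u1)
        fxp += v * hidden(xp, u0, u1)
    return fx, fxp


def empirical_covariance(x, xp, H, n_draws,
                         sigma_b=1.0, omega=1.0, sigma_u=1.0, seed=0):
    # Sample average of f(x) f(x') over networks drawn from the prior.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        fx, fxp = sample_outputs(x, xp, H, sigma_b, omega, sigma_u, rng)
        total += fx * fxp
    return total / n_draws


def predicted_covariance(x, xp, n_draws,
                         sigma_b=1.0, omega=1.0, sigma_u=1.0, seed=1):
    # Right-hand side of equation 9 with sigma_v^2 = omega^2 / H, so that
    # the factor H cancels; E_u[h h'] is estimated by Monte Carlo.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        u0 = rng.gauss(0.0, sigma_u)
        u1 = rng.gauss(0.0, sigma_u)
        total += hidden(x, u0, u1) * hidden(xp, u0, u1)
    return sigma_b ** 2 + omega ** 2 * total / n_draws
```

Under these assumptions the two quantities should agree to within Monte Carlo error for any H, because equation 9 holds for every finite H; only the Gaussianity of the joint distribution of the outputs depends on H being large.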
\n\n3.1 Sigmoidal transfer function\n\nA sigmoidal transfer function is a very common choice in neural networks research; nets with this architecture are usually called multi-layer perceptrons.\n\nBelow we consider the transfer function $h(\\mathbf{x}; \\mathbf{u}) = \\Phi(u_0 + \\sum_{j=1}^{d} u_j x_j)$, where $\\Phi(z) = \\frac{2}{\\sqrt{\\pi}} \\int_0^z e^{-t^2} \\, dt$ is the error function, closely related to the cumulative distribution function for the Gaussian distribution. Appropriately scaled, the graph of this function is very similar to the tanh function which is more commonly used in the neural networks literature.\n\nIn calculating $V(\\mathbf{x}, \\mathbf{x}') \\stackrel{\\mathrm{def}}{=} E_{\\mathbf{u}}[h(\\mathbf{x}; \\mathbf{u}) h(\\mathbf{x}'; \\mathbf{u})]$ we make the usual assumptions (e.g. MacKay, 1992) that $\\mathbf{u}$ is drawn from a zero-mean Gaussian distribution with covariance matrix $\\Sigma$, i.e. $\\mathbf{u} \\sim N(0, \\Sigma)$. Let $\\tilde{\\mathbf{x}} = (1, x_1, \\ldots, x_d)$ be an augmented input vector whose first entry corresponds to the bias. Then $V_{\\mathrm{erf}}(\\mathbf{x}, \\mathbf{x}')$ can be written as\n\n$$V_{\\mathrm{erf}}(\\mathbf{x}, \\mathbf{x}') = \\frac{1}{(2\\pi)^{(d+1)/2} |\\Sigma|^{1/2}} \\int \\Phi(\\mathbf{u}^T \\tilde{\\mathbf{x}}) \\, \\Phi(\\mathbf{u}^T \\tilde{\\mathbf{x}}') \\exp\\left(-\\tfrac{1}{2} \\mathbf{u}^T \\Sigma^{-1} \\mathbf{u}\\right) d\\mathbf{u} \\qquad (10)$$\n\nThis integral can be evaluated analytically (see footnote 3) to give\n\n$$V_{\\mathrm{erf}}(\\mathbf{x}, \\mathbf{x}') = \\frac{2}{\\pi} \\sin^{-1} \\frac{2 \\tilde{\\mathbf{x}}^T \\Sigma \\tilde{\\mathbf{x}}'}{\\sqrt{(1 + 2 \\tilde{\\mathbf{x}}^T \\Sigma \\tilde{\\mathbf{x}})(1 + 2 \\tilde{\\mathbf{x}}'^T \\Sigma \\tilde{\\mathbf{x}}')}} \\qquad (11)$$\n\nWe observe that this covariance function is not stationary, which makes sense as the distributions for the weights are centered about zero, and hence translational symmetry is not present.\n\nConsider a diagonal weight prior so that $\\Sigma = \\mathrm{diag}(\\sigma_0^2, \\sigma_I^2, \\ldots, \\sigma_I^2)$, so that the inputs $i = 1, \\ldots, d$ have a different weight variance to the bias $\\sigma_0^2$. Then for $|\\mathbf{x}|^2, |\\mathbf{x}'|^2 \\gg (1 + 2\\sigma_0^2) / 2\\sigma_I^2$, we find that $V_{\\mathrm{erf}}(\\mathbf{x}, \\mathbf{x}') \\approx 1 - 2\\theta/\\pi$, where $\\theta$ is the angle between $\\mathbf{x}$ and $\\mathbf{x}'$. 
Again this makes sense intuitively; if the model is made up of a large number of sigmoidal functions in random directions (in $\\mathbf{x}$ space), then we would expect points that lie diametrically opposite (i.e. at $\\mathbf{x}$ and $-\\mathbf{x}$) to be anti-correlated, because they will lie in the $+1$ and $-1$ regions of the sigmoid function for most directions.\n\nFootnote 3: Introduce a dummy parameter $\\lambda$ to make the first term in the integrand $\\Phi(\\lambda \\mathbf{u}^T \\tilde{\\mathbf{x}})$. Differentiate the integral with respect to $\\lambda$ and then use integration by parts. Finally recognize that $dV_{\\mathrm{erf}}/d\\lambda$ is of the form $(1 - \\theta^2)^{-1/2} \\, d\\theta/d\\lambda$ and hence obtain the $\\sin^{-1}$ form of the result, and evaluate it at $\\lambda = 1$.\n\n3.2 Gaussian transfer function\n\nOne other very common transfer function used in neural networks research is the Gaussian, so that $h(\\mathbf{x}; \\mathbf{u}) = \\exp[-(\\mathbf{x} - \\mathbf{u})^T (\\mathbf{x} - \\mathbf{u}) / 2\\sigma_g^2]$, where $\\sigma_g^2$ is the width parameter of the Gaussian. Gaussian basis functions are often used in Radial Basis Function (RBF) networks (e.g. Poggio and Girosi, 1990).\n\nFor a Gaussian prior over the distribution of $\\mathbf{u}$ so that $\\mathbf{u} \\sim N(0, \\sigma_u^2 I)$,\n\n$$V_G(\\mathbf{x}, \\mathbf{x}') = \\frac{1}{(2\\pi\\sigma_u^2)^{d/2}} \\int \\exp\\left(-\\frac{(\\mathbf{x} - \\mathbf{u})^T (\\mathbf{x} - \\mathbf{u})}{2\\sigma_g^2}\\right) \\exp\\left(-\\frac{(\\mathbf{x}' - \\mathbf{u})^T (\\mathbf{x}' - \\mathbf{u})}{2\\sigma_g^2}\\right) \\exp\\left(-\\frac{\\mathbf{u}^T \\mathbf{u}}{2\\sigma_u^2}\\right) d\\mathbf{u} \\qquad (12)$$\n\nBy completing the square and integrating out $\\mathbf{u}$ we obtain\n\n$$V_G(\\mathbf{x}, \\mathbf{x}') = \\left(\\frac{\\sigma_e}{\\sigma_u}\\right)^d \\exp\\left\\{-\\frac{\\mathbf{x}^T \\mathbf{x}}{2\\sigma_m^2}\\right\\} \\exp\\left\\{-\\frac{(\\mathbf{x} - \\mathbf{x}')^T (\\mathbf{x} - \\mathbf{x}')}{2\\sigma_s^2}\\right\\} \\exp\\left\\{-\\frac{\\mathbf{x}'^T \\mathbf{x}'}{2\\sigma_m^2}\\right\\} \\qquad (13)$$\n\nwhere $1/\\sigma_e^2 = 2/\\sigma_g^2 + 1/\\sigma_u^2$, $\\sigma_s^2 = 2\\sigma_g^2 + \\sigma_g^4/\\sigma_u^2$ and $\\sigma_m^2 = 2\\sigma_u^2 + \\sigma_g^2$. This formula can be generalized by allowing covariance matrices $\\Sigma_b$ and $\\Sigma_u$ in place of $\\sigma_g^2 I$ and $\\sigma_u^2 I$; rescaling each input variable $x_i$ independently is a simple example.\n\nAgain this is a non-stationary covariance function, although it is interesting to note that if $\\sigma_u^2 \\to \\infty$ (while scaling $\\omega^2$ appropriately) we find that $V_G(\\mathbf{x}, \\mathbf{x}') \\propto \\exp\\{-(\\mathbf{x} - \\mathbf{x}')^T (\\mathbf{x} - \\mathbf{x}') / 4\\sigma_g^2\\}$ (see footnote 4). For a finite value of $\\sigma_u^2$, $V_G(\\mathbf{x}, \\mathbf{x}')$ is a stationary covariance function \"modulated\" by the Gaussian decay function $\\exp(-\\mathbf{x}^T \\mathbf{x} / 2\\sigma_m^2) \\exp(-\\mathbf{x}'^T \\mathbf{x}' / 2\\sigma_m^2)$. Clearly if $\\sigma_m^2$ is much larger than the largest distance in $\\mathbf{x}$-space then the predictions made with $V_G$ and a Gaussian process with only the stationary part of $V_G$ will be very similar.\n\nIt is also possible to view the infinite network with Gaussian transfer functions as an example of a shot-noise process based on an inhomogeneous Poisson process (see Parzen (1962) \u00a74.5 for details). Points are generated from an inhomogeneous Poisson process with the rate function $\\propto \\exp(-\\mathbf{x}^T \\mathbf{x} / 2\\sigma_u^2)$, and Gaussian kernels of height $v$ are centered on each of the points, where $v$ is chosen iid from a distribution with mean zero and variance $\\sigma_v^2$.\n\n3.3 Comparing covariance functions\n\nThe priors over functions specified by sigmoidal and Gaussian neural networks differ from covariance functions that are usually employed in the literature, e.g. splines (Wahba, 1990). How might we characterize the different covariance functions and compare the kinds of priors that they imply?\n\nThe complex exponential $\\exp(i \\mathbf{k} \\cdot \\mathbf{x})$ is an eigenfunction of a stationary and isotropic covariance function, and hence the spectral density (or power spectrum) $S(k)$ ($k = |\\mathbf{k}|$) nicely characterizes the corresponding stochastic process. Roughly speaking the spectral density describes the \"power\" at a given spatial frequency $k$; for example, splines have $S(k) \\propto k^{-\\beta}$. 
The decay of $S(k)$ as $k$ increases is essential, as it provides a smoothing or damping out of high frequencies. Unfortunately non-stationary processes cannot be analyzed in exactly this fashion because the complex exponentials are not (in general) eigenfunctions of a non-stationary kernel. Instead, we must consider the eigenfunctions defined by $\\int C(\\mathbf{x}, \\mathbf{x}') \\phi(\\mathbf{x}') \\, d\\mathbf{x}' = \\lambda \\phi(\\mathbf{x})$. However, it may be possible to get some feel for the effect of a non-stationary covariance function by looking at the diagonal elements in its $2d$-dimensional Fourier transform, which correspond to the entries in the power spectrum for stationary covariance functions.\n\n3.4 Convergence of finite network priors to GPs\n\nFrom general Central Limit Theorem results one would expect a rate of convergence of $H^{-1/2}$ towards a Gaussian process prior. How many units will be required in practice would seem to depend on the particular values of the weight-variance parameters. For example, for Gaussian transfer functions, $\\sigma_m$ defines the radius over which we expect the process to be significantly different from zero. If this radius is increased (while keeping the variance of the basis functions $\\sigma_g^2$ fixed) then naturally one would expect to need more hidden units in order to achieve the same level of approximation as before. Similar comments can be made for the sigmoidal case, depending on $(1 + 2\\sigma_0^2) / 2\\sigma_I^2$.\n\nI have conducted some experiments for the sigmoidal transfer function, comparing the predictive performance of a finite neural network with one input unit to the equivalent Gaussian process on data generated from the GP. 
The finite network simulations were carried out using a slightly modified version of Neal's MCMC Bayesian neural networks code (Neal, 1996) and the inputs were drawn from a $N(0, 1)$ distribution. The hyperparameter settings were $\\sigma_I = 10.0$, $\\sigma_0 = 2.0$, $\\sigma_v = 1.189$ and $\\sigma_b = 1.0$. Roughly speaking the results are that hundreds of hidden units are required before similar performance is achieved by the two methods, although there is considerable variability depending on the particular sample drawn from the prior; sometimes 10 hidden units appear sufficient for good agreement.\n\nFootnote 4: Note that this would require $\\omega^2 \\to \\infty$ and hence the Central Limit Theorem would no longer hold, i.e. the process would be non-Gaussian.\n\n4 Discussion\n\nThe work described above shows how to calculate the covariance function for sigmoidal and Gaussian basis function networks. It is probable that similar techniques will allow covariance functions to be derived analytically for networks with other kinds of basis functions as well; these may turn out to be similar in form to covariance functions already used in the Gaussian process literature.\n\nIn the derivations above the hyperparameters $\\theta$ were fixed. However, in a real data analysis problem it would be unlikely that appropriate values of these parameters would be known. Given a prior distribution $P(\\theta)$ predictions should be made by integrating over the posterior distribution $P(\\theta|\\mathbf{t}) \\propto P(\\theta) P(\\mathbf{t}|\\theta)$, where $P(\\mathbf{t}|\\theta)$ is the likelihood of the training data $\\mathbf{t}$ under the model; $P(\\mathbf{t}|\\theta)$ is easily computed for a Gaussian process. 
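The remark that P(t|theta) is easily computed can be made concrete with a short sketch. It is illustrative only and rests on several assumptions not spelled out in the text: scalar inputs, the prior covariance sigma_b^2 + omega^2 V_erf built from equations 9 and 11 with Sigma = diag(sigma_0^2, sigma_I^2), added noise of variance sigma_nu^2, and the standard log density of a zero-mean multivariate Gaussian. All names are hypothetical, and a plain Cholesky factorization is written out only to keep the example self-contained.

```python
import math


def v_erf(x, xp, sigma0, sigmaI):
    # Equation 11 for scalar inputs, with augmented vectors (1, x) and (1, x').
    s0, s1 = sigma0 ** 2, sigmaI ** 2
    cross = 2.0 * (s0 + s1 * x * xp)
    nx = 1.0 + 2.0 * (s0 + s1 * x * x)
    nxp = 1.0 + 2.0 * (s0 + s1 * xp * xp)
    return (2.0 / math.pi) * math.asin(cross / math.sqrt(nx * nxp))


def covariance_matrix(xs, theta):
    # K_P + K_N for hyperparameters (sigma0, sigmaI, omega, sigma_b, sigma_nu).
    sigma0, sigmaI, omega, sigma_b, sigma_nu = theta
    n = len(xs)
    return [[sigma_b ** 2 + omega ** 2 * v_erf(xs[i], xs[j], sigma0, sigmaI)
             + (sigma_nu ** 2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def cholesky(A):
    # Lower-triangular L with L L^T = A, for symmetric positive definite A.
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    # Solve (L L^T) a = b by forward then backward substitution.
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (y[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a


def log_marginal_likelihood(xs, ts, theta):
    # log P(t | theta) for t ~ N(0, K_P + K_N).
    n = len(xs)
    L = cholesky(covariance_matrix(xs, theta))
    alpha = solve_cholesky(L, ts)
    quad = sum(t * a for t, a in zip(ts, alpha))
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * logdet - 0.5 * n * math.log(2.0 * math.pi)
```

The vector returned by solve_cholesky for the training targets is the quantity (K_P + K_N)^{-1} t that also appears in equation 4, so the same factorization could in principle be reused for prediction at each value of theta visited by a sampler.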
The prediction $\\hat{y}(\\mathbf{x})$ for test input $\\mathbf{x}$ is then given by\n\n$$\\hat{y}(\\mathbf{x}) = \\int \\hat{y}_\\theta(\\mathbf{x}) P(\\theta|D) \\, d\\theta \\qquad (14)$$\n\nwhere $\\hat{y}_\\theta(\\mathbf{x})$ is the predicted mean (as given by equation 4) for a particular value of $\\theta$. This integration is not tractable analytically but Markov Chain Monte Carlo methods such as Hybrid Monte Carlo can be used to approximate it. This strategy was used in Williams and Rasmussen (1996), but for stationary covariance functions, not ones derived from neural networks; it would be interesting to compare results.\n\nAcknowledgements\n\nI thank David Saad and David Barber for help in obtaining the result in equation 11, and Chris Bishop, Peter Dayan, Ian Nabney, Radford Neal, David Saad and Huaiyu Zhu for comments on an earlier draft of the paper. This work was partially supported by EPSRC grant GR/J75425, \"Novel Developments in Learning Theory for Neural Networks\".\n\nReferences\n\nCressie, N. A. C. (1993). Statistics for Spatial Data. Wiley.\nHornik, K. (1993). Some new results on neural network approximation. Neural Networks 6(8), 1069-1072.\nJournel, A. G. and C. J. Huijbregts (1978). Mining Geostatistics. Academic Press.\nMacKay, D. J. C. (1992). A Practical Bayesian Framework for Backpropagation Networks. Neural Computation 4(3), 448-472.\nNeal, R. M. (1996). Bayesian Learning for Neural Networks. Springer. Lecture Notes in Statistics 118.\nParzen, E. (1962). Stochastic Processes. Holden-Day.\nPoggio, T. and F. Girosi (1990). Networks for approximation and learning. Proceedings of IEEE 78, 1481-1497.\nWahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics. CBMS-NSF Regional Conference series in applied mathematics.\nWhittle, P. (1963). Prediction and regulation by linear least-square methods. 
English Universities Press.\nWilliams, C. K. I. and C. E. Rasmussen (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8, pp. 514-520. MIT Press.\n", "award": [], "sourceid": 1197, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}]}