{"title": "Robust Full Bayesian Methods for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 379, "page_last": 385, "abstract": null, "full_text": "Robust Full Bayesian Methods for Neural Networks \n\nChristophe Andrieu* (Cambridge University Engineering Department, Cambridge CB2 1PZ, England; ca226@eng.cam.ac.uk) \nJoao F.G. de Freitas (UC Berkeley, Computer Science, 387 Soda Hall, Berkeley, CA 94720-1776, USA; jfgf@cs.berkeley.edu) \nArnaud Doucet (Cambridge University Engineering Department, Cambridge CB2 1PZ, England; ad2@eng.cam.ac.uk) \n\nAbstract \n\nIn this paper, we propose a full Bayesian model for neural networks. This model treats the model dimension (number of neurons), model parameters, regularisation parameters and noise parameters as random variables that need to be estimated. We then propose a reversible jump Markov chain Monte Carlo (MCMC) method to perform the necessary computations. We find that the results are not only better than the previously reported ones, but also appear to be robust with respect to the prior specification. Moreover, we present a geometric convergence theorem for the algorithm. \n\n1 Introduction \n\nIn the early nineties, Buntine and Weigend (1991) and Mackay (1992) showed that a principled Bayesian learning approach to neural networks can lead to many improvements [1,2]. In particular, Mackay showed that by approximating the distributions of the weights with Gaussians and adopting smoothing priors, it is possible to obtain estimates of the weights and output variances and to automatically set the regularisation coefficients. Neal (1996) cast the net much further by introducing advanced Bayesian simulation methods, specifically the hybrid Monte Carlo method, into the analysis of neural networks [3]. 
Bayesian sequential Monte Carlo methods have also been shown to provide good training results, especially in time-varying scenarios [4]. More recently, Rios Insua and Muller (1998) and Holmes and Mallick (1998) have addressed the issue of selecting the number of hidden neurons with growing and pruning algorithms from a Bayesian perspective [5,6]. In particular, they apply the reversible jump Markov chain Monte Carlo (MCMC) algorithm of Green [7] to feed-forward sigmoidal networks and radial basis function (RBF) networks to obtain joint estimates of the number of neurons and weights. \n\nWe also apply the reversible jump MCMC simulation algorithm to RBF networks so as to compute the joint posterior distribution of the radial basis parameters and the number of basis functions. However, we advance this area of research in two important directions. Firstly, we propose a full hierarchical prior for RBF networks. That is, we adopt a full Bayesian model, which accounts for model order uncertainty and regularisation, and show that the results appear to be robust with respect to the prior specification. Secondly, we present a geometric convergence theorem for the algorithm. The complexity of the problem does not allow for a comprehensive discussion in this short paper. We have, therefore, focused on describing our objectives, the Bayesian model, convergence theorem and results. Readers are encouraged to consult our technical report for further results and implementation details [8]. \n\n* Authorship based on alphabetical order. \n\n2 Problem statement \n\nMany physical processes may be described by the following nonlinear, multivariate input-output mapping: \n\ny_t = f(x_t) + n_t    (1) \n\nwhere x_t \\in \\mathbb{R}^d corresponds to a group of input variables, y_t \\in \\mathbb{R}^c to the target variables, n_t \\in \\mathbb{R}^c to an unknown noise process and t = {1, 2, ...} is an index variable over the data. 
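As a minimal, self-contained illustration of the mapping in (1), synthetic data of this form can be generated as follows; the particular f, the dimensions and the noise level below are our own illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

d, c, N = 2, 2, 200  # input dim d, output dim c, number of observations N

def f(x):
    # A made-up nonlinear mapping, for illustration only; in the learning
    # problem the true f is unknown and must be approximated.
    return np.column_stack((np.sin(x[:, 0]) + x[:, 1],
                            np.cos(x[:, 0] * x[:, 1])))

x = rng.uniform(-1.0, 1.0, size=(N, d))
noise = 0.05 * rng.standard_normal((N, c))  # n_t: zero-mean white Gaussian
y = f(x) + noise                            # y_t = f(x_t) + n_t, as in eq. (1)
```

The pairs (x, y) then play the role of the observation set O in the learning problem described next.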
In this context, the learning problem involves computing an approximation to the function f and estimating the characteristics of the noise process given a set of N input-output observations: O = {x_1, x_2, ..., x_N, y_1, y_2, ..., y_N}. Typical examples include regression, where y_{1:N,1:c} is continuous; classification, where y corresponds to a group of classes; and nonlinear dynamical system identification, where the inputs and targets correspond to several delayed versions of the signals under consideration. \n\nWe adopt the approximation scheme of Holmes and Mallick (1998), consisting of a mixture of k RBFs and a linear regression term. Yet, the work can be easily extended to other regression models. More precisely, our model M is: \n\nM_0: y_t = b + \\beta' x_t + n_t, k = 0 \nM_k: y_t = \\sum_{j=1}^{k} a_j \\phi(||x_t - \\mu_j||) + b + \\beta' x_t + n_t, k >= 1    (2) \n\nwhere ||.|| denotes a distance metric (usually Euclidean or Mahalanobis), \\mu_j \\in \\mathbb{R}^d denotes the j-th RBF centre for a model with k RBFs, a_j \\in \\mathbb{R}^c the j-th RBF amplitude and b \\in \\mathbb{R}^c and \\beta \\in \\mathbb{R}^d x \\mathbb{R}^c the linear regression parameters. The noise sequence n_t \\in \\mathbb{R}^c is assumed to be zero-mean white Gaussian. It is important to mention that although we have not explicitly indicated the dependency of b, \\beta and n_t on k, these parameters are indeed affected by the value of k. For convenience, we express our approximation model in vector-matrix form: y = D(\\mu_{1:k}, x) \\alpha_{1:m} + n, where D(\\mu_{1:k}, x) is the design matrix whose rows contain a constant term, the inputs and the k basis responses, and \\alpha_{1:m} collects the coefficients b, \\beta and a_{1:k} for each of the c outputs. \n\nThe variance of this hyper-prior with \\alpha_{\\delta^2} = 2 is infinite. We apply the same method to \\Lambda by setting an uninformative conjugate prior [9]: \\Lambda ~ Ga(1/2 + c_1, c_2) (c_i << 1, i = 1, 2). \n\n3.1 Estimation and inference aims \n\nThe Bayesian inference of k, \\theta and \\psi is based on the joint posterior distribution p(k, \\theta, \\psi | x, y) obtained from Bayes' theorem. 
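The approximation scheme above can be sketched in code. The following minimal Python sketch builds the design matrix D(mu_{1:k}, x) implied by the vector-matrix form and evaluates the model output; the cubic basis default (the choice used later in the demonstration) and all function names are our own illustrative conventions:

```python
import numpy as np

def design_matrix(mu, x, phi=lambda r: r ** 3):
    """D(mu_{1:k}, x): one row per input, columns [1, x, phi(||x - mu_j||), j=1..k].

    mu is a (k, d) array of RBF centres (k may be 0, giving the linear model M_0);
    phi defaults to a cubic basis function.
    """
    ones = np.ones((x.shape[0], 1))
    if mu.shape[0] == 0:
        return np.hstack([ones, x])  # M_0: constant and linear terms only
    # Pairwise Euclidean distances ||x_t - mu_j||, shape (N, k).
    r = np.linalg.norm(x[:, None, :] - mu[None, :, :], axis=2)
    return np.hstack([ones, x, phi(r)])

def predict(alpha, mu, x):
    # y_hat = D(mu_{1:k}, x) @ alpha: b + beta' x + sum_j a_j phi(||x - mu_j||).
    return design_matrix(mu, x) @ alpha
```

With d = 1 and a single centre at the origin, the design matrix has columns [1, x, ||x||^3], matching equation (2) term by term.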
Our aim is to estimate this joint distribution from which, by standard probability marginalisation and transformation techniques, one can "theoretically" obtain all posterior features of interest. We propose here to use the reversible jump MCMC method to perform the necessary computations; see [8] for details. MCMC techniques were introduced in the mid 1950's in statistical physics and started appearing in the fields of applied statistics, signal processing and neural networks in the 1980's and 1990's [3,5,6,10,11]. The key idea is to build an ergodic Markov chain (k^{(i)}, \\theta^{(i)}, \\psi^{(i)})_{i \\in \\mathbb{N}} whose equilibrium distribution is the desired posterior distribution. Under weak additional assumptions, the P >> 1 samples generated by the Markov chain are asymptotically distributed according to the posterior distribution and thus allow easy evaluation of all posterior features of interest. For example: \n\n\\hat{p}(k = j | x, y) = (1/P) \\sum_{i=1}^{P} I_{\\{j\\}}(k^{(i)}) and \\hat{E}(\\theta | k = j, x, y) = \\sum_{i=1}^{P} \\theta^{(i)} I_{\\{j\\}}(k^{(i)}) / \\sum_{i=1}^{P} I_{\\{j\\}}(k^{(i)})    (4) \n\nIn addition, we can obtain predictions, such as: \n\n\\hat{E}(y_{N+1} | x_{1:N+1}, y_{1:N}) = (1/P) \\sum_{i=1}^{P} D(\\mu_{1:k}^{(i)}, x_{N+1}) \\alpha_{1:m}^{(i)}    (5) \n\n3.2 Integration of the nuisance parameters \n\nAccording to Bayes' theorem, we can obtain the posterior distribution as follows: \n\np(k, \\theta, \\psi | x, y) \\propto p(y | k, \\theta, \\psi, x) p(k, \\theta, \\psi) \n\nIn our case, we can integrate with respect to \\alpha_{1:m} (Gaussian distribution) and with respect to \\sigma^2 (inverse Gamma distribution) to obtain the following expression for the posterior: \n\np(k, \\mu_{1:k}, \\Lambda, \\delta^2 | x, y) \\propto [I_{\\Omega}(k, \\mu_{1:k}) / \\varsigma^k] [\\prod_{i=1}^{c} (\\delta_i^2)^{-m/2} |M_{i,k}|^{1/2} (\\gamma_0 + y'_{1:N,i} P_{i,k} y_{1:N,i})^{-(N+\\nu_0)/2}] [\\prod_{i=1}^{c} (\\delta_i^2)^{-(\\alpha_{\\delta^2}+1)} \\exp(-\\beta_{\\delta^2} / \\delta_i^2)] [\\Lambda^{(c_1 - 1/2)} \\exp(-c_2 \\Lambda)] (\\Lambda^k / k!) / \\sum_{j=0}^{k_{max}} \\Lambda^j / j!    (6) 
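The sample-based estimators in (4) reduce to simple averages over the stored chain. A minimal sketch (the function name and data layout are our own):

```python
import numpy as np

def posterior_features(k_samples, theta_samples, j):
    """Monte Carlo estimates from P stored MCMC samples, as in eq. (4):
    p_hat(k = j | x, y) and E_hat(theta | k = j, x, y)."""
    k_samples = np.asarray(k_samples)
    theta_samples = np.asarray(theta_samples, dtype=float)
    mask = k_samples == j                # I_{j}(k^(i)) for each sample i
    p_j = mask.mean()                    # (1/P) * sum_i I_{j}(k^(i))
    # Conditional mean: average theta^(i) over the visits to model order j.
    theta_j = theta_samples[mask].mean(axis=0) if mask.any() else None
    return p_j, theta_j
```

The predictive mean in (5) is computed the same way: evaluate the model at x_{N+1} for each stored sample and average the outputs.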
It is worth noticing that the posterior distribution is highly non-linear in the RBF centres \\mu_{1:k} and that an expression of p(k | x, y) cannot be obtained in closed form. \n\n4 Geometric convergence theorem \n\nIt is easy to prove that the reversible jump MCMC algorithm applied to our model converges, that is, that the Markov chain (k^{(i)}, \\mu_{1:k}^{(i)}, \\Lambda^{(i)}, \\delta^{2(i)})_{i \\in \\mathbb{N}} is ergodic. We present here a stronger result, namely that (k^{(i)}, \\mu_{1:k}^{(i)}, \\Lambda^{(i)}, \\delta^{2(i)})_{i \\in \\mathbb{N}} converges to the required posterior distribution at a geometric rate: \n\nTheorem 1 Let (k^{(i)}, \\mu_{1:k}^{(i)}, \\Lambda^{(i)}, \\delta^{2(i)})_{i \\in \\mathbb{N}} be the Markov chain whose transition kernel has been described in Section 3. This Markov chain converges to the probability distribution p(k, \\mu_{1:k}, \\Lambda, \\delta^2 | x, y). Furthermore this convergence occurs at a geometric rate, that is, for almost every initial point (k^{(0)}, \\mu_{1:k}^{(0)}, \\Lambda^{(0)}, \\delta^{2(0)}) \\in \\Omega x \\Psi there exists a function of the initial states C_0 > 0 and a constant \\rho \\in [0, 1) such that \n\n||p^{(i)}(k, \\mu_{1:k}, \\Lambda, \\delta^2) - p(k, \\mu_{1:k}, \\Lambda, \\delta^2 | x, y)||_{TV} <= C_0 \\rho^{\\lfloor i / k_{max} \\rfloor}    (7) \n\nwhere p^{(i)}(k, \\mu_{1:k}, \\Lambda, \\delta^2) is the distribution of (k^{(i)}, \\mu_{1:k}^{(i)}, \\Lambda^{(i)}, \\delta^{2(i)}) and ||.||_{TV} is the total variation norm [11]. Proof. See [8]. \n\nCorollary 1 If for each iteration i one samples the nuisance parameters (\\alpha_{1:m}, \\sigma^2), then the distribution of the series (k^{(i)}, \\alpha_{1:m}^{(i)}, \\mu_{1:k}^{(i)}, \\sigma^{2(i)}, \\Lambda^{(i)}, \\delta^{2(i)})_{i \\in \\mathbb{N}} converges geometrically towards p(k, \\alpha_{1:m}, \\mu_{1:k}, \\sigma^2, \\Lambda, \\delta^2 | x, y) at the same rate \\rho. \n\n5 Demonstration: robot arm data \n\nThis data set is often used as a benchmark to compare learning algorithms^3. It involves implementing a model to map the joint angles of a robot arm (x_1, x_2) to the position of the end of the arm (y_1, y_2). 
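The geometric decay of total variation distance asserted by the theorem can be visualised on a toy example. The two-state chain below is our own construction, not the paper's sampler; it merely shows the TV distance to the stationary distribution shrinking by a constant factor per iteration:

```python
import numpy as np

# Toy two-state Markov chain; TV distance to its stationary distribution
# decays geometrically, with rate given by the second eigenvalue of P.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])   # stationary distribution: pi @ P == pi

p_i = np.array([1.0, 0.0])              # an arbitrary initial distribution
tv = []
for _ in range(50):
    p_i = p_i @ P                       # one step of the chain
    tv.append(0.5 * np.abs(p_i - pi).sum())  # total variation distance

# For this chain tv[i+1] / tv[i] is constant: rho = 0.7, the second
# eigenvalue of P, mirroring the C_0 * rho^i bound of the theorem.
```

The reversible jump sampler's rate rho is of course not available in closed form; the point is only the shape of the bound.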
The data were generated from the following model: \n\ny_1 = 2.0 cos(x_1) + 1.3 cos(x_1 + x_2) + \\epsilon_1 \ny_2 = 2.0 sin(x_1) + 1.3 sin(x_1 + x_2) + \\epsilon_2 \n\nwhere \\epsilon_i ~ N(0, \\sigma^2), \\sigma = 0.05. We use the first 200 observations of the data set to train our models and the last 200 observations to test them. In the simulations, we chose to use cubic basis functions. Figure 1 shows the 3D plots of the training data and the contours of the training and test data. The contour plots also include the typical approximations that were obtained using the algorithm. We chose uninformative priors for all the parameters and hyper-parameters (Table 1). To demonstrate the robustness of our algorithm, we chose different values for \\beta_{\\delta^2} (the only critical hyper-parameter, as it quantifies the mean of the spread \\delta of \\alpha_k). The obtained mean square errors and probabilities for \\delta_1, \\delta_2, \\sigma_{1,k}^2, \\sigma_{2,k}^2 and k, shown in Figure 2, clearly indicate that our algorithm is robust with respect to prior specification. Our mean square errors are of the same magnitude as the ones reported by other researchers [2,3,5,6], and slightly better (not by more than 10%). Moreover, our algorithm leads to more parsimonious models than the ones previously reported. \n\n^3 The robot arm data set can be found in David Mackay's home page: http://wol.ra.phy.cam.ac.uk/mackay/ 
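Synthetic data of this form can be generated as follows (a sketch; the input sampling ranges are our assumption rather than the benchmark's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(1)

# Robot-arm-style data: joint angles (x1, x2) -> arm tip position (y1, y2).
# The uniform input ranges below are illustrative assumptions only.
N = 400
x1 = rng.uniform(-2.0, 2.0, size=N)
x2 = rng.uniform(0.0, 3.0, size=N)

eps = 0.05 * rng.standard_normal((N, 2))  # eps_i ~ N(0, 0.05^2)
y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + eps[:, 0]
y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + eps[:, 1]

# As in the paper: first 200 observations for training, last 200 for testing.
train, test = slice(0, 200), slice(200, 400)
```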
Figure 1: The top plots show the training data surfaces corresponding to each coordinate of the robot arm's position. The middle and bottom plots show the training and validation data [- -] and the respective RBF network mappings [-]. \n\nTable 1: Simulation parameters and mean square test errors. \n\n\\alpha_{\\delta^2}  \\beta_{\\delta^2}  \\nu_0  \\gamma_0  c_1     c_2     MS error \n2       0.1     0   0   0.0001  0.0001  0.00505 \n2       10      0   0   0.0001  0.0001  0.00503 \n2       100     0   0   0.0001  0.0001  0.00502 \n\n6 Conclusions \n\nWe presented a general methodology for estimating, jointly, the noise variance, parameters and number of parameters of an RBF model. In adopting a Bayesian model and the reversible jump MCMC algorithm to perform the necessary integrations, we demonstrated that the method is very accurate. Contrary to previously reported results, our experiments indicate that our model is robust with respect to the specification of the prior. In addition, we obtained more parsimonious RBF networks and better approximation errors than the ones previously reported in the literature. There are many avenues for further research. These include estimating the type of basis functions, performing input variable selection, considering other noise models and extending the framework to sequential scenarios. A possible solution to the first problem can be formulated using the reversible jump MCMC framework. Variable selection schemes can also be implemented via the reversible jump MCMC algorithm. We are presently working on a sequential version of the algorithm that allows us to perform model selection in non-stationary environments. \n\nReferences \n\n[1] Buntine, W.L. 
& Weigend, A.S. (1991) Bayesian back-propagation. Complex Systems 5:603-643. \n\nFigure 2: Histograms of smoothness constraints (\\delta_1 and \\delta_2), noise variances (\\sigma_{1,k}^2 and \\sigma_{2,k}^2) and model order (k) for the robot arm data using 3 different values for \\beta_{\\delta^2}. The plots confirm that the algorithm is robust to the setting of \\beta_{\\delta^2}. \n\n[2] Mackay, D.J.C. (1992) A practical Bayesian framework for backpropagation networks. Neural Computation 4:448-472. \n\n[3] Neal, R.M. (1996) Bayesian Learning for Neural Networks. New York: Lecture Notes in Statistics No. 118, Springer-Verlag. \n\n[4] de Freitas, J.F.G., Niranjan, M., Gee, A.H. & Doucet, A. (1999) Sequential Monte Carlo methods to train neural network models. To appear in Neural Computation. \n\n[5] Rios Insua, D. & Müller, P. (1998) Feedforward neural networks for nonparametric regression. Technical report 98-02, Institute of Statistics and Decision Sciences, Duke University, http://www.stat.duke.edu. \n\n[6] Holmes, C.C. & Mallick, B.K. (1998) Bayesian radial basis functions of variable dimension. Neural Computation 10:1217-1233. \n\n[7] Green, P.J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711-732. \n\n[8] Andrieu, C., de Freitas, J.F.G. & Doucet, A. (1999) Robust full Bayesian learning for neural networks. Technical report CUED/F-INFENG/TR 343. 
Cambridge University, http://svr-www.eng.cam.ac.uk/. \n\n[9] Bernardo, J.M. & Smith, A.F.M. (1994) Bayesian Theory. Chichester: Wiley Series in Applied Probability and Statistics. \n\n[10] Besag, J., Green, P.J., Higdon, D. & Mengersen, K. (1995) Bayesian computation and stochastic systems. Statistical Science 10:3-66. \n\n[11] Tierney, L. (1994) Markov chains for exploring posterior distributions. The Annals of Statistics 22(4):1701-1762. ", "award": [], "sourceid": 1741, "authors": [{"given_name": "Christophe", "family_name": "Andrieu", "institution": null}, {"given_name": "Jo\u00e3o", "family_name": "de Freitas", "institution": null}, {"given_name": "Arnaud", "family_name": "Doucet", "institution": null}]}