The attempt to find a single "optimal" weight vector in conventional network training can lead to overfitting and poor generalization. Bayesian methods avoid this, without the need for a validation set, by averaging the outputs of many networks with weights sampled from the posterior distribution given the training data. This sample can be obtained by simulating a stochastic dynamical system that has the posterior as its stationary distribution.
1 CONVENTIONAL AND BAYESIAN LEARNING
I view neural networks as probabilistic models, and learning as statistical inference. Conventional network learning finds a single "optimal" set of network parameter values, corresponding to maximum likelihood or maximum penalized likelihood inference. Bayesian inference instead integrates the predictions of the network over all possible values of the network parameters, weighting each parameter set by its posterior probability in light of the training data.
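The contrast between the two approaches can be illustrated with a minimal sketch. It is not the sampling method developed in this paper: it assumes a hypothetical one-weight "network" (a single tanh unit), a Gaussian prior with standard deviation `prior_sd`, and Gaussian noise of standard deviation `sigma`, and it replaces the integral over weights with a sum over a uniform grid, which is only feasible in very low dimensions. All names and numerical values are illustrative assumptions.

```python
import numpy as np

def f(x, w):
    # Hypothetical one-weight "network": a single tanh unit (an assumption).
    return np.tanh(w * x)

def log_posterior(w_grid, x, y, sigma=0.1, prior_sd=5.0):
    # Unnormalised log posterior on a grid of weights.
    # The Gaussian prior and Gaussian noise model are both assumed here.
    log_prior = -0.5 * (w_grid / prior_sd) ** 2
    resid = y[None, :] - f(x[None, :], w_grid[:, None])
    log_lik = -0.5 * np.sum((resid / sigma) ** 2, axis=1)
    return log_prior + log_lik

def predictions(x_train, y_train, x_new, w_grid):
    log_post = log_posterior(w_grid, x_train, y_train)
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    post /= post.sum()                        # normalised weights on the grid
    # Conventional learning: one "optimal" weight (maximum penalized likelihood).
    w_best = w_grid[np.argmax(post)]
    single = f(x_new, w_best)
    # Bayesian learning: prediction averaged over all weights, each one
    # weighted by its posterior probability.
    averaged = np.sum(post * f(x_new, w_grid))
    return single, averaged, post

rng = np.random.default_rng(0)
x_train = np.array([-1.0, -0.5, 0.5, 1.0])
y_train = f(x_train, 2.0) + 0.1 * rng.normal(size=x_train.size)
w_grid = np.linspace(-10.0, 10.0, 2001)

single, averaged, post = predictions(x_train, y_train, 3.0, w_grid)
print("single-weight prediction:", single)
print("posterior-averaged prediction:", averaged)
```

With a real network the weight space has far too many dimensions for a grid, which is why the averaged prediction has to be estimated from a sample of weight vectors instead.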
1.1 NEURAL NETWORKS AS PROBABILISTIC MODELS
Consider a network taking a vector of real-valued inputs, x, and producing a vector of real-valued outputs, ŷ, perhaps computed using hidden units. Such a network architecture corresponds to a function, f, with ŷ = f(x, w), where w is a vector of connection weights. If we assume the observed outputs, y, are equal to ŷ plus Gaussian noise of standard deviation σ, the network defines the conditional probability