{"title": "Practical Confidence and Prediction Intervals", "book": "Advances in Neural Information Processing Systems", "page_first": 176, "page_last": 182, "abstract": null, "full_text": "Practical confidence and prediction intervals \n\nTom Heskes \n\nRWCP Novel Functions SNN Laboratory, University of Nijmegen \nGeert Grooteplein 21, 6525 EZ Nijmegen, The Netherlands \ntom@mbfys.kun.nl \n\nAbstract \n\nWe propose a new method to compute prediction intervals. Especially for small data sets, the width of a prediction interval depends not only on the variance of the target distribution, but also on the accuracy of our estimator of the mean of the target, i.e., on the width of the confidence interval. The confidence interval follows from the variation in an ensemble of neural networks, each of them trained and stopped on bootstrap replicates of the original data set. A second improvement is the use of the residuals on validation patterns instead of on training patterns for estimating the variance of the target distribution. As illustrated on a synthetic example, our method is better than existing methods with regard to extrapolation and interpolation in data regimes with a limited amount of data, and yields prediction intervals whose actual confidence levels are closer to the desired confidence levels. \n\n1 STATISTICAL INTERVALS \n\nIn this paper we consider feedforward neural networks for regression tasks: estimating an underlying mathematical function between input and output variables based on a finite number of data points possibly corrupted by noise. We are given a set of P_data pairs {x^μ, t^μ} which are assumed to be generated according to \n\nt(x) = f(x) + ξ(x) ,   (1) \n\nwhere ξ(x) denotes noise with zero mean. 
RWCP: Real World Computing Partnership; SNN: Foundation for Neural Networks. \n\nStraightforwardly trained on such a regression task, the output o(x) of a network given a new input vector x can be interpreted as an estimate of the regression f(x), i.e., of the mean of the target distribution given input x. Sometimes this is all we are interested in: a reliable estimate of the regression f(x). In many applications, however, it is important to quantify the accuracy of our statements. For regression problems we can distinguish two different aspects: the accuracy of our estimate of the true regression, and the accuracy of our estimate with respect to the observed output. Confidence intervals deal with the first aspect, i.e., consider the distribution of the quantity f(x) - o(x); prediction intervals deal with the latter, i.e., treat the quantity t(x) - o(x). We see from \n\nt(x) - o(x) = [f(x) - o(x)] + ξ(x) ,   (2) \n\nthat a prediction interval necessarily encloses the corresponding confidence interval. In [7] a method somewhat similar to ours is introduced to estimate both the mean and the variance of the target probability distribution. It is based on the assumption that there is a sufficiently large data set, i.e., that there is no risk of overfitting and that the neural network finds the correct regression. In practical applications with limited data sets such assumptions are too strict. In this paper we propose a new method which estimates the inaccuracy of the estimator through bootstrap resampling and corrects for the tendency to overfit by considering the residuals on validation patterns rather than those on training patterns. \n\n2 BOOTSTRAPPING AND EARLY STOPPING \n\nBootstrapping [3] is based on the idea that the available data set is nothing but a particular realization of some unknown probability distribution. 
Instead of sampling over the \"true\" probability distribution, which is obviously impossible, one defines an empirical distribution. With so-called naive bootstrapping the empirical distribution is a sum of delta peaks on the available data points, each with probability content 1/P_data. A bootstrap sample is a collection of P_data patterns drawn with replacement from this empirical probability distribution. This bootstrap sample is nothing but our training set, and all patterns that do not occur in the training set are by definition part of the validation set. For large P_data, the probability that a pattern becomes part of the validation set is (1 - 1/P_data)^P_data ≈ 1/e ≈ 0.37. \n\nWhen training a neural network on a particular bootstrap sample, the weights are adjusted in order to minimize the error on the training data. Training is stopped when the error on the validation data starts to increase. This so-called early stopping procedure is a popular strategy to prevent overfitting in neural networks and can be viewed as an alternative to regularization techniques such as weight decay. In this context bootstrapping is just a procedure to generate subdivisions in training and validation sets similar to k-fold cross-validation or subsampling. \n\nOn each of the n_run bootstrap replicates we train and stop a single neural network. The output of network i on input vector x^μ is written o_i(x^μ) ≡ o_i^μ. As \"the\" estimate of our ensemble of networks for the regression f(x) we take the average output¹ \n\nm(x) ≡ (1/n_run) Σ_{i=1}^{n_run} o_i(x) . \n\n¹This is a so-called \"bagged\" estimator [2]. In [5] it is shown that a proper balancing of the network outputs can yield even better results. \n\n
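The resampling scheme just described (naive bootstrap samples, with the out-of-bag patterns serving as validation set for early stopping) can be sketched in a few lines of numpy; the function name and the toy sizes below are our own illustration, not from the paper:

```python
import numpy as np

def make_bootstrap_splits(p_data, n_run, rng):
    # One replicate = p_data indices drawn with replacement (the training set);
    # every pattern never drawn forms the validation ('out-of-bag') set.
    splits = []
    for _ in range(n_run):
        train = rng.integers(0, p_data, size=p_data)
        in_train = np.zeros(p_data, dtype=bool)
        in_train[train] = True
        val = np.flatnonzero(~in_train)
        splits.append((train, val))
    return splits

rng = np.random.default_rng(0)
splits = make_bootstrap_splits(p_data=1000, n_run=200, rng=rng)
# For large p_data about (1 - 1/p)^p ~ 1/e ~ 37% of patterns are out-of-bag.
oob_frac = np.mean([len(val) for _, val in splits]) / 1000
```

Each `(train, val)` pair would then drive one training run: minimize the error on `train`, stop when the error on `val` starts to increase.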
3 CONFIDENCE INTERVALS \n\nConfidence intervals provide a way to quantify our confidence in the estimate m(x) of the regression f(x), i.e., we have to consider the probability distribution P(f(x)|m(x)) that the true regression is f(x) given our estimate m(x). Our line of reasoning goes as follows (see also [8]). \n\nWe assume that our ensemble of neural networks yields a more or less unbiased estimate for f(x), i.e., that the distribution P(f(x)|m(x)) is centered around m(x). The truth is that neural networks are biased estimators. For example, neural networks trained on a finite number of examples will always have a tendency (as almost any other model) to oversmooth a sharp peak in the data. This introduces a bias which, to arrive at asymptotically correct confidence intervals, should be taken into account. However, if it were possible to compute such a bias correction, one should apply it in the first place to arrive at a better estimator. Our working hypothesis here is that the bias component of the confidence intervals is negligible in comparison with the variance component. \n\nThere do exist methods that claim to give confidence intervals that are \"second-order correct\", i.e., accurate up to and including terms of order 1/P_data^{3/2} (see e.g. the discussion after [3]). Since we do not know how to handle the bias component anyway, such precise confidence intervals, which require a tremendous number of bootstrap samples, are too ambitious for our purposes. First-order correct intervals, accurate up to and including terms of order 1/P_data, are always symmetric and can be derived by assuming a Gaussian distribution P(f(x)|m(x)). \n\nThe variance of this distribution can be estimated from the variance in the outputs of the n_run networks: \n\nσ²(x) = [1/(n_run - 1)] Σ_{i=1}^{n_run} [o_i(x) - m(x)]² .   (3) \n\nThis is the crux of the bootstrap method (see e.g. [3]). 
Since the distribution P(f(x)|m(x)) is a Gaussian, so is the \"inverse\" distribution P(m(x)|f(x)), i.e., the probability to find the estimate m(x) by randomly drawing data sets consisting of P_data data points according to the prescription (1). Not knowing the true distribution of inputs and corresponding targets², the best we can do is to define the empirical distribution as explained before and estimate P(m(x)|f(x)) from the distribution P(o(x)|m(x)). This then yields the estimate (3). \n\nSo, following this bootstrap procedure we arrive at the confidence intervals \n\nm(x) - c_confidence σ(x) ≤ f(x) ≤ m(x) + c_confidence σ(x) , \n\nwhere c_confidence depends on the desired confidence level 1 - α. The factors c_confidence can be taken from a table with the percentage points of the Student's t-distribution with number of degrees of freedom equal to the number of bootstrap runs n_run. A more direct alternative is to choose c_confidence such that for no more than 100α% of all n_run × P_data network predictions |o_i^μ - m^μ| ≥ c_confidence σ^μ. \n\n²In this paper we assume that both the inputs and the outputs are stochastic. For the case of deterministic input variables other bootstrapping techniques (see e.g. [4]) are more appropriate, since the statistical intervals resulting from naive bootstrapping may be too conservative. \n\n4 PREDICTION INTERVALS \n\nConfidence intervals deal with the accuracy of our prediction of the regression, i.e., of the mean of the target probability distribution. Prediction intervals consider the accuracy with which we can predict the targets themselves, i.e., they are based on estimates of the distribution P(t(x)|m(x)). We propose the following method. The two noise components f(x) - m(x) and ξ(x) in (2) are independent. The variance of the first component has been estimated in our bootstrap procedure to arrive at confidence intervals. 
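In code, the bagged mean, the ensemble variance (3), and the resulting confidence interval might look as follows; this is a minimal numpy sketch with toy numbers of our own, and `ensemble_stats` is an illustrative name, not from the paper:

```python
import numpy as np

def ensemble_stats(outputs):
    # outputs: shape (n_run, n_points); row i holds the predictions o_i(x).
    n_run = outputs.shape[0]
    m = outputs.mean(axis=0)  # bagged estimate m(x)
    # Eq. (3): spread of the ensemble around its mean, 1/(n_run - 1) normalization.
    sigma = np.sqrt(((outputs - m) ** 2).sum(axis=0) / (n_run - 1))
    return m, sigma

# Toy ensemble of 4 networks evaluated at 2 input points.
outputs = np.array([[0.0, 1.0],
                    [2.0, 1.0],
                    [0.0, 3.0],
                    [2.0, 3.0]])
m, sigma = ensemble_stats(outputs)
c_conf = 1.96  # e.g. taken from a Student's t-table for the desired level 1 - alpha
lo, hi = m - c_conf * sigma, m + c_conf * sigma
```

The alternative calibration mentioned above amounts to growing `c_conf` until at most a fraction alpha of the individual predictions fall outside `m ± c_conf * sigma`.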
The remaining task is to estimate the noise inherent to the regression problem. We assume that this noise is more or less Gaussian, such that it again suffices to compute its variance, which may however depend on the input x. In mathematical symbols, \n\nχ²(x) ≡ ⟨ξ²(x)⟩ . \n\nOf course, we are interested in prediction intervals for new points x for which we do not know the targets t. Suppose that we had left aside a set of test patterns {x^μ, t^μ} that we had never used for training nor for validating our neural networks. Then we could try and estimate a model χ²(x) to fit the remaining residuals \n\nr²(x^μ) ≡ [t^μ - m(x^μ)]² - σ²(x^μ) ,   (4) \n\nusing minus the loglikelihood as the error measure: \n\nE = Σ_μ { log χ²(x^μ) + r²(x^μ)/χ²(x^μ) } .   (5) \n\nOf course, leaving out these test patterns is a waste of data, and luckily our bootstrap procedure offers an alternative. Each pattern is in about 37% of all bootstrap runs not part of the training set. Let us write q_i^μ = 1 if pattern μ is in the validation set of run i, and q_i^μ = 0 otherwise. If we, for each pattern μ, use the average \n\nm_validation(x^μ) = Σ_{i=1}^{n_run} q_i^μ o_i^μ / Σ_{i=1}^{n_run} q_i^μ \n\ninstead of the average m(x^μ), we get as close to an unbiased estimate for the residual on independent test patterns as we can, without wasting any training data. So, summarizing, we suggest to find a function χ(x) that minimizes the error (5), yet not by leaving out test patterns, which would be a waste of data, nor by straightforwardly using the training data, which would underestimate the error, but by exploiting the information about the residuals on the validation patterns. Once we have found the function χ(x), we can compute for any x both the mean m(x) and the deviation s(x), with s²(x) = σ²(x) + χ²(x), which are combined in the prediction interval \n\nm(x) - c_prediction s(x) ≤ t(x) ≤ m(x) + c_prediction s(x) . 
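The out-of-bag average and the residual target (4) reduce to a few array operations; the sketch below uses tiny hand-made arrays of our own for illustration, and assumes every pattern lands in at least one validation set:

```python
import numpy as np

def validation_average(outputs, q):
    # outputs: (n_run, p_data) network outputs o_i^mu.
    # q: (n_run, p_data) indicator, q[i, mu] = 1 iff pattern mu is in the
    # validation set of run i; average each pattern over those runs only.
    counts = q.sum(axis=0)  # assumed > 0 for every pattern
    return (q * outputs).sum(axis=0) / counts

# Toy example: 3 runs, 2 patterns.
o = np.array([[1.0, 4.0],
              [3.0, 6.0],
              [5.0, 8.0]])
q = np.array([[1, 0],
              [1, 1],
              [0, 1]])
m_val = validation_average(o, q)  # pattern 0 averages runs 0,1; pattern 1 runs 1,2
t = np.array([2.5, 6.0])
sigma2 = np.array([0.1, 0.1])
# Eq. (4): validation residual minus the estimator variance from the
# confidence-interval step; this is the target the noise model chi2(x) fits.
r2 = (t - m_val) ** 2 - sigma2
```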
\n\nAgain, the factor Cprediction can be found in a Student's t-table or chosen such that \nfor no more than 100a% of all Pdata patterns ItJJ - mvalidation (xJJ) I ~ Cprediction s( xJJ) . \nThe function X2(x) may be modelled by a separate neural network, similar to the \nmethod proposed in [7] with an exponential instead of a linear transfer function for \nthe output unit to ensure that the variance is always positive. \n\n\f180 \n\n1.5 .-....----r----.-----,~-....___, \n\n(a) \n\n1 \n\n\"5 \n.e-\n:::J o \n\n0.5 \na \n-0.5 \n\n-1 \n\n-1.5 L--\"\"'--_--'-__ _'_____--'-__ '---' \n\n-1 \n\n-0.5 \n\n0.5 \n\na \ninput \n\n0.2 ..--....---.-----.--~---....--n \n\n0.15 \n\n\\ \n\n.s= \n~ 0.1 \n\n(c) I \nI \nI \nI \n\n0.05 _ _ . __ . _ . _ . _ . _ . _ . I \n\n\\ \n\\ - - ' \" \n\n\\ \n\n-\n\n-1 \n\n-0.5 \n\na \ninput \n\n0.5 \n\nT. Heskes \n\n(b) \n\no \n\n-1 \n\n-0.5 \n\na \ninput \n\n0.5 \n\n1 \n\n(d) \n\n0.06 \n\n0.05 \n\nQ) 0.04 \n(.) \nc \n.~0.03 \n..... \nto \n> 0.02 \n\n0.01 \na \n\n1.5 \n\n1 \n\n-1 \n\n_1.5L-----~-~--~-~.....J \n\n-1 \n\n-0.5 \n\n0.5 \n\n1 \n\na \ninput \n\nFigure 1: Prediction intervals for a synthetic problem. (a) Training set (crosses), \ntrue regression (solid line), and network prediction (dashed line). (b) Validation \nresiduals (crosses), training residuals (circles), true variance (solid line), estimated \nvariance based on validation residuals (dashed line) and based on training residuals \n(dash-dotted line). (c) Width of standard error bars for the more advanced method \n(dashed line), the simpler procedure (dash-dotted line) and what it should be (solid \nline). (d) Prediction intervals (solid line) , network prediction (dashed line), and \n1000 test points (dots) . \n\n5 \n\nILLUSTRATION \n\nWe consider a synthetic problem similar to the one used in [7]. With this example \nwe will demonstrate the desirability to incorporate the inaccuracy of the regression \nestimator in the prediction intervals. 
Inputs x are drawn from the interval [-1, 1] with probability density p(x) = |x|, i.e., more examples are drawn at the boundary than in the middle. Targets t are generated according to \n\nt = sin(πx) cos(5πx/4) + ξ(x) with ⟨ξ²(x)⟩ = 0.005 + 0.005 [1 + sin(πx)]² . \n\nThe regression is the solid line in Figure 1(a), the variance of the target distribution the solid line in Figure 1(b). Following this prescription we obtain a training set of P_data = 50 data points [the crosses in Figure 1(a)] on which we train an ensemble of n_run = 25 networks, each having 8 hidden units with tanh transfer function and one linear output unit. The average network output m(x) is the dashed line in Figure 1(a) and (d). In the following we compare two methods to arrive at prediction intervals: the more advanced method described in Section 4, i.e., taking into account the uncertainty of the estimator and correcting for the tendency to overfit on the training data, and a simpler procedure similar to [7] which disregards both effects. \n\nWe compute the (squared) \"validation residuals\" (m_validation(x^μ) - t^μ)² [crosses in Figure 1(b)], based on runs in which pattern μ was part of the validation set, and the \"training residuals\" (m_train(x^μ) - t^μ)² (circles), based on runs in which pattern μ was part of the training set. The validation residuals are most of the time somewhat larger than the training residuals. \n\nFor our more advanced method we subtract the uncertainty of our model from the validation residuals as in (4). The other procedure simply keeps the training residuals to estimate the variance of the target distribution. It is obvious that the distribution of residuals in Figure 1(b) does not allow for a complex model. Here we take a feedforward network with one hidden unit: \n\nχ²(x) = exp[v₀ + v₁ tanh(v₂ x + v₃)] . \n\nThe parameters {v₀, v₁, v₂, v₃} are found through minimization of the error (5). 
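To build intuition for the error measure (5): for a constant noise model χ²(x) = c, the minimizer of (5) is exactly the mean residual c = mean(r²), since d/dc Σ(log c + r²/c) = n/c - Σr²/c² = 0. The numpy sketch below verifies this on a grid; it is our own illustration of the loss, not the paper's one-hidden-unit fitting procedure:

```python
import numpy as np

def loglik_error(chi2, r2):
    # Eq. (5): minus the Gaussian loglikelihood (up to constants) of the
    # residual targets r2 under input-dependent noise variance chi2.
    return np.sum(np.log(chi2) + r2 / chi2)

rng = np.random.default_rng(1)
r2 = rng.uniform(0.05, 0.2, size=50)          # synthetic residual targets
grid = np.linspace(0.01, 0.5, 2000)           # candidate constant variances c
losses = [loglik_error(np.full_like(r2, c), r2) for c in grid]
c_best = grid[int(np.argmin(losses))]         # should sit at mean(r2)
```

Replacing the constant `c` by the one-hidden-unit network with an exponential output unit keeps χ² positive by construction while letting the variance depend on x.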
\nBoth for the advanced method (dashed line) and for the simpler procedure (dash-dotted line) the variance of the target distribution is estimated to be a step function. The former, being based on the validation residuals minus the uncertainty of the estimator, is slightly more conservative than the latter, being based on the training residuals. Both estimates are pretty far from the truth (solid line), especially for 0 < x < 0.5, yet considering such a limited amount of noisy residuals we can hardly expect anything better. \n\nFigure 1(c) considers the width of standard error bars, i.e., of prediction intervals for error level α ≈ 0.32. For the simpler procedure the width of the prediction interval [dash-dotted line in Figure 1(c)] follows directly from the estimate of the variance of the target distribution. Our more advanced method adds the uncertainty of the estimator to arrive at the dashed line. The correct width of the prediction interval, i.e., the width that would include 68% of all targets for a particular input, is given by the solid line. The prediction intervals obtained through the more advanced procedure are displayed in Figure 1(d) together with a set of 1000 test points visualizing the probability distribution of inputs and corresponding targets. \n\nThe method proposed in Section 4 has several advantages. The prediction intervals of the advanced method include 65% of the test points in Figure 1(d), pretty close to the desired confidence level of 68%. The simpler procedure is too liberal, with an actual confidence level of only 58%. This difference is mainly due to the use of validation residuals instead of training residuals. Incorporation of the uncertainty of the estimator is important in regions of input space with just a few training data. In this example the density of training data affects both extrapolation and interpolation. 
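The coverage comparison above (65% vs. 58% against a desired 68%) is just the empirical fraction of test targets falling inside their intervals. A minimal sketch of that check, using exact Gaussian targets of our own rather than the paper's networks:

```python
import numpy as np

def coverage(t, m, s, c):
    # Fraction of targets t inside the prediction interval m +/- c * s.
    return np.mean(np.abs(t - m) <= c * s)

# With Gaussian targets and the exact mean and spread, c = 1 (one standard
# error bar, alpha ~ 0.32) should cover about 68% of the targets.
rng = np.random.default_rng(2)
m = np.zeros(100_000)
s = np.ones(100_000)
t = rng.normal(m, s)
cov = coverage(t, m, s, c=1.0)
```

In the experiment, `m` and `s` would be the ensemble mean and the estimated deviation s(x) at the 1000 test inputs.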
For |x| > 1 the prediction intervals obtained with the advanced method become wider and wider, whereas those obtained through the simpler procedure remain more or less constant. The bump in the prediction interval (dashed line) near the origin is a result of the relatively large variance in the network predictions in this region. It shows that our method also incorporates the effect that the density of training data has on the accuracy of interpolation. \n\n6 CONCLUSION AND DISCUSSION \n\nWe have presented a novel method to compute prediction intervals for applications with a limited amount of data. The uncertainty of the estimator itself has been taken into account by the computation of the confidence intervals. This explains the qualitative improvement over existing methods in regimes with a low density of training data. Use of the residuals on validation instead of on training patterns yields prediction intervals with better coverage. The price we have to pay is in computation time: we have to train an ensemble of networks on about 20 to 50 different bootstrap replicates [3, 8]. There are other good reasons for resampling: averaging over networks improves the generalization performance, and early stopping is a natural strategy to prevent overfitting. It would be interesting to see how our \"frequentist\" method compares with Bayesian alternatives (see e.g. [1, 6]). \n\nPrediction intervals can also be used for the detection of outliers. With regard to the training set it is straightforward to point out the targets that are not enclosed by a prediction interval of error level, say, α = 0.05. A wide prediction interval for a new test pattern indicates that this test pattern lies in a region of input space with a low density of training data, making any prediction completely unreliable. 
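The outlier check described above is a one-liner once m(x) and s(x) are available; the threshold and toy numbers below are our own illustration:

```python
import numpy as np

def flag_outliers(t, m, s, c_prediction):
    # A target outside the prediction interval m +/- c_prediction * s is
    # flagged as a potential outlier (c_prediction set for, say, alpha = 0.05).
    return np.abs(t - m) > c_prediction * s

t = np.array([0.1, 5.0, -0.2, 0.3])   # observed targets
m = np.zeros(4)                        # ensemble predictions m(x)
s = np.full(4, 0.25)                   # estimated deviations s(x)
outliers = flag_outliers(t, m, s, c_prediction=1.96)
```

Only the second target, 20 deviations away from its prediction, is flagged.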
\n\nA weak point in our method is the assumption of unbiased ness in the computation of \nthe confidence intervals. This assumption makes the confidence intervals in general \ntoo liberal. However, as discussed in [8], such bootstrap methods tend to perform \nbetter than other alternatives based on the computation of the Hessian matrix, \npartly because they incorporate the variability due to the random initialization. \nFurthermore, when we model the prediction interval as a function of the input x we \nwill, to some extent, repair this deficiency. But still, incorporating even a somewhat \ninaccurate confidence interval ensures that we can never severely overestimate our \naccuracy in regions of input space where we have never been before. \n\nReferences \n\n[1] C. Bishop and C. Qazaz. Regression with input-dependent noise: a Bayesian \n\ntreatment. These proceedings, 1997. \n\n[2] L. Breiman. Bagging predictors. Machine Learning, 24: 123-140, 1996. \n\n[3] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, \n\nLondon, 1993. \n\n[4] W. HardIe. Applied Nonparametric Regression. Cambridge University Press, \n\n1991. \n\n[5] T. Heskes. Balancing between bagging and bumping. These proceedings, 1997. \n\n[6] D. MacKay. A practical Bayesian framework for backpropagation. Neural Com(cid:173)\n\nputation, 4:448-472, 1992. \n\n[7] D. Nix and A. Weigend. Estimating the mean and variance of the target prob(cid:173)\nability distribution. In Proceedings of the [JCNN '94, pages 55-60. IEEE, 1994. \n\n[8] R. Tibshirani. A comparison of some error estimates for neural network models. \n\nNeural Computation, 8:152-163,1996. \n\n\f", "award": [], "sourceid": 1306, "authors": [{"given_name": "Tom", "family_name": "Heskes", "institution": null}]}