{"title": "Learning Local Error Bars for Nonlinear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 489, "page_last": 496, "abstract": null, "full_text": "Learning Local Error Bars \nfor Nonlinear Regression \n\nDavid A.Nix \n\nDepartment of Computer Science \nand Institute of Cognitive Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \n\ndnix@cs.colorado.edu \n\nAndreas S. Weigend \n\nDepartment of Computer Science \nand Institute of Cognitive Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nandreas@cs.colorado.edu\u00b7 \n\nAbstract \n\nWe present a new method for obtaining local error bars for nonlinear \nregression, i.e., estimates of the confidence in predicted values that de(cid:173)\npend on the input. We approach this problem by applying a maximum(cid:173)\nlikelihood framework to an assumed distribution of errors. We demon(cid:173)\nstrate our method first on computer-generated data with locally varying, \nnormally distributed target noise. We then apply it to laser data from the \nSanta Fe Time Series Competition where the underlying system noise is \nknown quantization error and the error bars give local estimates of model \nmisspecification. In both cases, the method also provides a weighted(cid:173)\nregression effect that improves generalization performance. \n\n1 Learning Local Error Bars Using a Maximum Likelihood \n\nFramework: Motivation, Concept, and Mechanics \n\nFeed-forward artificial neural networks used for nonlinear regression can be interpreted as \npredicting the mean of the target distribution as a function of (conditioned on) the input \npattern (e.g., Buntine & Weigend, 1991; Bishop, 1994), typically using one linear output unit \nper output variable. If parameterized, this conditional target distribution (CID) may also be \n\n\u00b7http://www.cs.colorado.edu/~andreas/Home.html. 
\n\nThis paper is available with figures in colors as ftp://ftp.cs.colorado.edu/pub/ \nTime-Series/MyPapers/nix.weigenCLnips7.ps.Z . \n\n\f490 \n\nDavid A. Nix, Andreas S. Weigend \n\nviewed as an error model (Rumelhart et al., 1995). Here, we present a simple method that \nprovides higher-order information about the cm than simply the mean. Such additional \ninformation could come from attempting to estimate the entire cm with connectionist \nmethods (e.g., \"Mixture Density Networks,\" Bishop, 1994; \"fractional binning, \"Srivastava \n& Weigend, 1994) or with non-connectionist methods such as a Monte Carlo on a hidden \nMarkov model (Fraser & Dimitriadis, 1994). While non-parametric estimates of the shape \nof a C1D require large quantities of data, our less data-hungry method (Weigend & Nix, \n1994) assumes a specific parameterized form of the C1D (e.g., Gaussian) and gives us the \nvalue of the error bar (e.g., the width of the Gaussian) by finding those parameters which \nmaximize the likelihood that the target data was generated by a particular network model. \nIn this paper we derive the specific update rules for the Gaussian case. We would like to \nemphasize, however, that any parameterized unimodal distribution can be used for the em \nin the method presented here. \n\nj------------, \nI \nI------T-------, \n, \nI A \nI A2,. ) \no y(x) 0 cr IX \n/\\ i \n'. \n\nI \nI \n\n\\ \n\nO'OOh k : \nI \nI \nI \nI \nI \nI \nI \n\n: \nl \n,-----------_. \n\nFigure 1: Architecture of the network for estimating error bars using an auxiliary output unit. All \nweight layers have full connectivity. This architecture allows the conditional variance ~2 -unit access \nto both information in the input pattern itself and in the hidden unit representation formed while \nlearning the conditional mean, y(x). 
\n\nWe model the desired observed target value d as d(x) = y(x) + n(x), where y(x) is the \nunderlying function we wish to approximate and n(x) is noise drawn from the assumed \ncm. Just as the conditional mean of this cm, y(x), is a function of the input, the \nvariance (j2 of the em, the noise level, may also vary as a function of the input x \n(noise heterogeneity). Therefore, not only do we want the network to learn a function \ny(x) that estimates the conditional mean y(x) of the cm, but we also want it to learn a \nfunction a-2(x) that estimates the conditional variance (j2(x). We simply add an auxiliary \noutput unit, the a-2-unit, to compute our estimate of (j2(x). Since (j2(x) must be positive, \nwe choose an exponential activation function to naturally impose this bound: a-2 (x) = \nexp [Lk Wq2khk (x) + ,8], where,8 is the offset (or \"bias\"), and Wq2k is the weight between \nhidden unit k and the a-2-unit. The particular connectivity of our architecture (Figure 1), \nin which the a-2-unit has a hidden layer of its own that receives connections from both the \ny-unit's hidden layer and the input pattern itself, allows great flexibility in learning a-2 (x). \nIn contrast, if the a-2-unit has no hidden layer of its own, the a-2-unit is constrained to \napproximate (j2 (x) using only the exponential of a linear combination of basis functions \n(hidden units) already tailored to represent y(x) (since learning the conditional variance \na-2(x) before learning the conditional mean y(x) is troublesome at best). Such limited \nconnectivity can be too constraining on the functional forms for a-2( x) and, in our experience, \n\nI The case of a single Gaussian to represent a unimodal distribution can also been generalized to a \n\nmixture of several Gaussians that allows the modeling of multimodal distributions (Bishop, 1994). \n\n\fLearning Local Error Bars for Nonlinear Regression \n\n491 \n\nproduce inferior results. 
This is a significant difference compared to Bishop's (1994) Gaussian mixture approach, in which all output units are directly connected to one set of hidden units. The other extreme would be not to share any hidden units at all, i.e., to employ two completely separate sets of hidden units, one for the ŷ(x)-unit, the other for the σ̂²(x)-unit. This is the right thing to do if there is indeed no overlap in the mapping from the inputs to y and from the inputs to σ². The two examples discussed in this paper are between these two extremes; this justifies the mixed architecture we use. Further discussion of shared vs. separate hidden units for the second example of the laser data is given by Kazlas & Weigend (1995, this volume).

For one of our network outputs, the ŷ-unit, the target is easily available: it is simply given by d. But what is the target for the σ̂²-unit? By maximizing the likelihood of our network model N given the data, P(N|x, d), a target is \"invented\" as follows. Applying Bayes' rule and assuming statistical independence of the errors, we equivalently do gradient descent in the negative log likelihood of the targets d given the inputs and the network model, summed over all patterns i (see Rumelhart et al., 1995): C = -Σ_i ln P(d_i|x_i, N). Traditionally, the resulting form of this cost function involves only the estimate ŷ(x_i) of the conditional mean; the variance of the CID is assumed to be constant for all x_i, and the constant terms drop out after differentiation. In contrast, we allow the conditional variance to depend on x and explicitly keep these terms in C, approximating the conditional variance for x_i by σ̂²(x_i). Given any network architecture and any parametric form for the CID (i.e., any error model), the appropriate weight-update equations for gradient descent learning can be straightforwardly derived.
\nAssuming normally distributed errors around y(x) corresponds to a em density function \nof P(dilxj) = [27rcr2(Xi)t 1/ 2 exp {- d~:Y~.) 2}. Using the network output Y(Xi) ~ \ny(Xi) to estimate the conditional mean and using the auxiliary output a-2(Xi) ~ cr2(xd \nto estimate the conditional variance, we obtain the monotonically related negative log \nlik lib d \npatterns \ngives the total cost: \n\nII 2 A2() \n2\" n 7rcr Xi + 2\".2(X.) \n\nI P(d I Af\\ -\ni Xi, n J -\n\n. ummatlon over \n\ne 00, - n \n\n[di-y(Xi)]2 S \n\n. \n\nall \n\nC = ! ,,{ [di =- y(xd] \n\n2 ~ cr 2(Xi) \n\n, \n\n2 + Ina-2(Xi) + In27r} \n\n(1) \n\nTo write explicit weight-update equations, we must specify the network unit transfer func(cid:173)\ntions. Here we choose a linear activation function for the y-unit, tanh functions for the \nhidden units, and an exponential function for the a-2 -unit. We can then take derivatives of \nthe cost C with respect to the network weights. To update weights connected to the Y and \na-2 -units we have: \n\n11 a-2~i) [di - Y(Xi)] hj(Xi) \n11 2a-2~Xi) {[di - y(Xi)f - a-2 (Xi) } hk (Xi) \n\n(2) \n\n(3) \n\nwhere 11 is the learning rate. For weights not connected to the output, the weight-update \nequations are derived using the chain rule in the same way as in standard backpropagation. \nNote that Eq. (3) is equivalent to training a separate function-approximation network for \na-2(x) where the targets are the squared errors [di - y(Xi)]2]. Note also that if a-2(Xj) is \n\n\f492 \n\nDavid A. Nix, Andreas S. Weigend \n\nconstant, Eqs. (1)-(2) reduce to their familiar forms for standard backpropagation with a \nsum-squared error cost function. \n\nThe 1/&2(X) term in Eqs. (2)-(3) can be interpreted as a form of \"weighted regression,\" \nincreasing the effective learning rate in low-noise regions and reducing it in high-noise \nregions. 
As a result, the network emphasizes obtaining small errors on those patterns where it can (low σ̂²); it discounts learning patterns for which the expected error is going to be large anyway (large σ̂²). This weighted-regression term can be highly beneficial where outliers (i.e., samples from high-noise regions) would ordinarily pull network resources away from fitting low-noise regions that would otherwise be well approximated.

For simplicity, we use simple gradient descent for training. Other nonlinear minimization techniques could be applied, but only if the following problem is avoided. If the weighted-regression term described above is allowed a significant influence early in learning, local minima frequently result. This is because input patterns for which low errors are initially obtained are interpreted as \"low noise\" in Eqs. (2)-(3) and overemphasized in learning. Conversely, patterns for which large errors are initially obtained (because significant learning of ŷ has not yet taken place) are erroneously discounted as being in \"high-noise\" regions, and little subsequent learning takes place for these patterns, leading to highly suboptimal solutions. This problem can be avoided if we separate training into the following three phases:

Phase I (Initial estimate of the conditional mean): Randomly split the available data into equal halves, sets A and B. Assuming σ²(x) is constant, learn the estimate of the conditional mean ŷ(x) using set A as the training set. This corresponds to \"traditional\" training using gradient descent on a simple squared-error cost function, i.e., Eqs. (1)-(2) without the 1/σ̂²(x) terms. To reduce overfitting, training is considered complete at the minimum of the squared error on the cross-validation set B, monitored at the end of each complete pass through the training data.
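Phase I is ordinary early-stopped gradient descent on the squared error. A minimal sketch of the split-and-monitor logic, using a stand-in linear model rather than the paper's tanh network (the model, learning rate, and epoch count here are illustrative assumptions):

```python
import numpy as np

# Phase I sketch: train on half A, monitor squared error on half B after
# each pass, and keep the weights from the epoch where B's error was lowest.

rng = np.random.default_rng(2)
x = rng.uniform(0.0, np.pi / 2, size=1000)
d = np.sin(3 * x) * np.sin(5 * x) + 0.1 * rng.normal(size=x.size)

perm = rng.permutation(x.size)          # random split into equal halves A and B
A, B = perm[:500], perm[500:]

w = np.zeros(2)                         # stand-in model: d ~ w[0] + w[1] * x
best = (np.inf, w.copy())
eta = 1e-3
for epoch in range(200):
    err = (w[0] + w[1] * x[A]) - d[A]
    grad = 2 * np.array([err.mean(), (err * x[A]).mean()])
    w -= eta * grad                     # gradient descent on mean squared error
    val = np.mean(((w[0] + w[1] * x[B]) - d[B]) ** 2)
    if val < best[0]:
        best = (val, w.copy())          # remember the best-on-B weights
w_phase1 = best[1]
```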
\n\nPhase II (Initial estimate of the conditional variance): Attach a layer of hidden units \nconnected to both the inputs and the hidden units of the network from Phase I (see Figure 1). \nFreeze the weights trained in Phase I, and train the &2-unit to predict the squared errors \n(see Eq. (3\u00bb, again using simple gradient descent as in Phase I. The training set for this \nphase is set 8, with set A used for cross-validation. If set A were used as the training set \nin this phase as well, any overfitting in Phase I could result in seriously underestimating \nu 2(x). To avoid this risk, we interchange the data sets. The initial value for the offset (3 \nof the &2-unit is the natural logarithm of the mean squared error (from Phase I) of set 8. \nPhase II stops when the squared error on set A levels off or starts to increase. \nPhase ill (Weighted regression): Re-split the available data into two new halves, A' and \n8'. Unfreeze all weights and train all network parameters to minimize the full cost function \nC on set A'. Training is considered complete when C has reached its minimum on set 8'. \n\n2 Examples \n\nExample #1: To demonstrate this method, we construct a one-dimensional example prob(cid:173)\nlem where y(x) and u2(x) are known. We take the equation y(x) = sin(wax) sin(w,Bx) \nwithwa = 3 andw,B = 5. We then generate (x, d) pairs by picking x uniformly from the in(cid:173)\nterval [0, 7r /2] and obtaining the corresponding target d by adding normally distributed noise \nn(x) = N[0,u2(x)] totheunderlyingy(x), whereu2(x) = 0.02+0.25 x [1-sin(w,Bx)j2. \n\n\fLearnillg Local Error Bars for Nonlinear Regression \n\n493 \n\nTable 1: Results for Example #1. ENMS denotes the mean squared error divided by the overall \nvariance of the target; \"Mean cost\" represents the cost function (Eq. (1)) averaged over all patterns . \nRow 4 lists these values for the ideal model (true y(x) and a2 (x)) given the data generated. 
Row 5 gives the correlation coefficient between the network's predictions for the standard error (i.e., the square root of the σ̂²-unit's activation) and the actually occurring L1 residual errors, |d(x_i) - ŷ(x_i)|. Row 6 gives the correlation between the true σ(x) and these residual errors. Rows 7-9 give the percentage of residuals smaller than one and two standard deviations for the obtained and ideal models as well as for an exact Gaussian.

   |                                      | Training (N = 10^3) | Evaluation (N = 10^5)
   |                                      | E_NMS  | Mean cost  | E_NMS  | Mean cost
 1 | Phase I                              | 0.576  | 0.853      | 0.593  | 0.882
 2 | Phase II                             | 0.576  | 0.542      | 0.593  | 0.566
 3 | Phase III                            | 0.552  | 0.440      | 0.570  | 0.462
 4 | n(x) (exact additive noise)          | 0.545  | 0.430      | 0.563  | 0.441
   |                                      | ρ                   | ρ
 5 | ρ(σ̂(x), residual errors)             | 0.564               | 0.548
 6 | ρ(σ(x), residual errors)             | 0.602               | 0.584
   |                                      | 1 std  | 2 std      | 1 std  | 2 std
 7 | % of errors < σ̂(x); 2σ̂(x)            | 64.8   | 95.4       | 67.0   | 94.6
 8 | % of errors < σ(x); 2σ(x)            | 66.6   | 96.0       | 68.4   | 95.4
 9 | (exact Gaussian)                     | 68.3   | 95.4       | 68.3   | 95.4

We generate 1000 patterns for training and an additional 10^5 patterns for post-training evaluation.

Training follows exactly the three phases described above, with the following details.² Phase I uses a network with one hidden layer of 10 tanh units and η = 10^-2. For Phase II we add an auxiliary layer of 10 tanh hidden units connected to the σ̂²-unit (see Figure 1) and use the same η. Finally, in Phase III the composite network is trained with η = 10^-4.

At the end of Phase I (Figure 2a), the only available estimate of σ²(x) is the global root-mean-squared error on the available data, and the model misspecification is roughly uniform over x, a typical solution were we training with only the traditional squared-error cost function. The corresponding error measures are listed in Table 1.
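For readers who want to reproduce Example #1, the data-generating process defined above (y(x) = sin(3x) sin(5x) with noise variance 0.02 + 0.25 [1 - sin(5x)]² on x ∈ [0, π/2]) can be transcribed directly:

```python
import numpy as np

# Example #1 data: y(x) = sin(3x) sin(5x), with input-dependent Gaussian
# noise of variance 0.02 + 0.25 * (1 - sin(5x))^2, x uniform on [0, pi/2].

rng = np.random.default_rng(3)

def make_data(n):
    x = rng.uniform(0.0, np.pi / 2, size=n)
    y = np.sin(3 * x) * np.sin(5 * x)
    var = 0.02 + 0.25 * (1.0 - np.sin(5 * x)) ** 2
    d = y + rng.normal(size=n) * np.sqrt(var)    # d(x) = y(x) + n(x)
    return x, d, y, var

x_train, d_train, _, _ = make_data(1000)         # training set
x_eval, d_eval, _, _ = make_data(100_000)        # post-training evaluation set
```

The noise variance ranges from 0.02 (where sin(5x) = 1) up to 1.02 (where sin(5x) = -1), so low- and high-noise regions alternate across the input interval.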
At the end of Phase II, however, we have obtained an initial estimate of σ²(x) (since the weights to the ŷ-unit are frozen during this phase, no modification of ŷ is made). Finally, at the end of Phase III, we have better estimates of both y(x) and σ²(x). First, we note that the correlations between the predicted errors and the actual errors listed in Table 1 underscore the near-optimal prediction of local errors. We also see that these errors correspond, as expected, to the assumed Gaussian error model. Second, we note that not only has the value of the cost function dropped from Phase II to Phase III, but the generalization error has also dropped, indicating an improved estimate of y(x). By comparing Phases I and III we see that the quality of ŷ(x) has improved significantly in the low-noise regions (roughly x < 0.6) at a minor sacrifice of accuracy in the high-noise region.

Example #2: We now apply our method to a set of observed data, the 1000-point laser

2 Further details: all inputs are scaled to zero mean and unit variance. All initial weights feeding into hidden units are drawn from a uniform distribution between -1/i and 1/i, where i is the number of incoming connections. All initial weights feeding into ŷ or σ̂² are drawn from a uniform distribution between -s/i and s/i, where s is the standard deviation of the (overall) target distribution. No momentum is used, and all weight updates are averaged over the forward passes of 20 patterns.

Figure 2: (a) Example #1: Results after each phase of training.
The top row gives the true y(x) (solid line) and the network estimate ŷ(x) (dotted line); the bottom row gives the true σ²(x) (solid line) and the network estimate σ̂²(x) (dotted line). (b) Example #2: state-space embedding of laser data (evaluation set) using linear grey-scaling of 0.50 (lightest) < σ̂(x_t) < 6.92 (darkest). See text for details.

intensity series from the Santa Fe competition.³ Since our method is based on the network's observed errors, the predicted error σ̂²(x) actually represents the sum of the underlying system noise, characterized by σ²(x), and the model misspecification. Here, since we know the system noise is roughly uniform 8-bit sampling-resolution quantization error, we can apply our method to evaluate the local quality of the manifold approximation.⁴

The prediction task is easier if we have more points that lie on the manifold, thus better constraining its shape. In the competition, Sauer (1994) upsampled the 1000 available data points with an FFT method by a factor of 32. This does not change the effective sampling rate, but it \"fills in\" more points, more precisely defining the manifold. We use the same upsampling trick (without filtered embedding) and obtain 31200 full (x, d) patterns for learning. We apply the three-phase approach described above for the simple network of Figure 1 with 25 inputs (corresponding to 25 past values), 12 hidden units feeding the ŷ-unit, and a liberal 30 hidden units feeding the σ̂²-unit (since we are uncertain as to the complexity of σ²(x) for this dataset). We use η = 10^-7 for Phase I and η = 10^-10 for Phases II and III. Since we know the quantization error is ±0.5, error estimates less than this are meaningless. Therefore, we enforce a minimum value of σ²(x) = 0.25 (the quantization error squared) on the squared errors in Phases II and III.
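The variance floor for the laser data amounts to clipping the squared-error targets at the squared quantization error before they are used in Phases II and III. The residual values below are hypothetical, just to show the clipping:

```python
import numpy as np

# Enforcing the quantization floor: since the quantization error is +/-0.5,
# squared-error targets for the variance unit are clipped to at least 0.25.

sq_errors = np.array([0.01, 0.2, 0.3, 2.5])   # hypothetical squared residuals
targets = np.maximum(sq_errors, 0.25)         # enforce sigma^2 >= 0.25
# -> array([0.25, 0.25, 0.3 , 2.5 ])
```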
\n\n3The data set and several predictions and characterizations are described in the vol(cid:173)\nume edited by Weigend & Gershenfeld (1994). The data is available by anonymous ftp \nat ftp.cs.colorado.edu in /pub/Time-Series/SantaFe as A.dat. \nSee also \nhttp://www . cs. colorado. edu/Time-Series/TSWelcome. html for further analyses \nof this and other time series data sets. \n\n4When we make a single-step prediction where the manifold approximation is poor, we have little \nconfidence making iterated predictions based on that predicted value. However, if we know we are in \na low-error region, we can have increased confidence in iterated predictions that involve our current \nprediction. \n\n\fLearning Local Error Bars for Nonlinear Regression \n\n495 \n\nTable 2: Results for Example #2 (See Table 1 caption for definitions). \nrow I \n\n(N -\n\n23 950) \n\n-\n\n(N - 975) I Evaluation \nMean .).\" In \nAdvances in Neural Infonnation Processing Systems 7 (NIPS*94, this volume). San Francisco, CA: \nMorgan Kaufmann. \n\nD.E. Rumelhart, R. Durbin. R. Golden. and Y. Chauvin. (1995) \"Backpropagation: The Basic Theory.\" \nIn Backpropagation: Theory, Architectures and Applications, Y. Chauvin and D.E. Rumelhart, eds., \nLawrence Erlbaum, pp. 1- 34. \n\nT. Sauer. (1994) 'T1l11.e Series Prediction by Using Delay Coordinate Embedding.\" In Time Series \nPrediction: Forecasting the Future and Understanding the Past, A.S. Weigend and N.A. Gershenfeld. \neds., Addison-Wesley, pp. 175-193. \n\nA.N. Srivastava and A.S. Weigend. (1994) \"Computing the Probability Density in Connectionist \nRegression.\" In Proceedings of the IEEE International Conference on Neural Networks (IEEE(cid:173)\nICNN'94), Orlando, FL, p. 3786--3789. IEEE-Press. \n\nA.S. Weigend and N.A. Gershenfeld, eds. (1994) Time Series Prediction: Forecasting the Future and \nUnderstanding the Past. Addison-Wesley. \n\nA.S. Weigend and B. LeBaron. 
(1994) \"Evaluating Neural Network Predictors by Bootstrapping.\" In \nProceedings of the International Conference on Neural Infonnation Processing (ICONIP'94), Seoul, \n}(orea,pp.1207-1212. \n\nA.S. Weigend and D.A. Nix. (1994) \"Predictions with Confidence Intervals (Local Error Bars).\" In \nProceedings of the International Conference on Neural Information Processing (ICONIP'94), Seoul, \n}(orea, p. 847- 852. \n\n\f", "award": [], "sourceid": 896, "authors": [{"given_name": "David", "family_name": "Nix", "institution": null}, {"given_name": "Andreas", "family_name": "Weigend", "institution": null}]}