{"title": "Robust Parameter Estimation and Model Selection for Neural Network Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 192, "page_last": 199, "abstract": null, "full_text": "Robust Parameter Estimation And \nModel Selection For Neural Network \n\nRegression \n\nYong Liu \n\nDepartment of Physics \n\nInstitute for Brain and Neural Systems \n\nBox 1843, Brown University \n\nProvidence, RI 02912 \nyong~cns.brown.edu \n\nAbstract \n\nIn this paper, it is shown that the conventional back-propagation \n(BPP) algorithm for neural network regression is robust to lever(cid:173)\nages (data with :n corrupted), but not to outliers (data with y \ncorrupted). A robust model is to model the error as a mixture of \nnormal distribution. The influence function for this mixture model \nis calculated and the condition for the model to be robust to outliers \nis given. EM algorithm [5] is used to estimate the parameter. The \nusefulness of model selection criteria is also discussed. Illustrative \nsimulations are performed. \n\n1 \n\nIntroduction \n\nIn neural network research, the back-propagation (BPP) algorithm is the most \npopular algorithm. In the regression problem y = 7](:n, w) + \u00a3, in which 7](:n, 8) \ndenote a neural network with weight 8, the algorithm is equivalent to modeling \nthe error as identically independently normally distributed (i.i.d.), and using the \nmaximum likelihood method to estimate the parameter [13]. Howerer, the training \ndata set may contain surprising data points either due to errors in y space (outliers) \nwhen the response vectors ys of these data points are far away from the underlying \nfunction surface, or due to errors in :n space (leverages), when the the feature vectors \n\n192 \n\n\fRobust Parameter Estimation and Model Selection for Neural Network Regression \n\n193 \n\nxs of these data points are far away from the mass of the feature vectors of the rest \nof the data points. 
These abnormal data points can bias the parameter estimates toward them. A robust algorithm or robust model is one that overcomes the influence of such abnormal data points. \n\nA lot of work has been done on linear robust regression [8, 6, 3]. For neural networks, it is generally believed that the sigmoidal function of the basic computing unit contributes to the robustness of the network to outliers and leverages. In this article, we investigate this more thoroughly. It turns out that the conventional normal model (BPP algorithm) is robust to leverages, due to the sigmoidal property of the neurons, but not to outliers (section 2). From the Bayesian point of view [2], modeling the error as a mixture of normal distributions with different variances, with a flat prior distribution on the variances, is more robust. The influence function for this mixture model is calculated and the condition for the model to be robust to outliers is given (section 3.1). An efficient algorithm for parameter estimation in this situation is the EM algorithm [5] (section 3.2). In section 3.3, we discuss a choice of prior and its properties. In order to choose among different probability models or different forms of priors, and among neural nets with different architectures, we discuss model selection criteria in section 4. Illustrative simulations on the choice of prior, i.e., the t distribution model versus the normal distribution model, are given. A model selection statistic is used to choose the degree of freedom of the t distribution, to choose among different neural networks, and to choose between a t model and a normal model (sections 4 and 5). \n\n2 Issue Of Robustness In Normal Model For Neural Net Regression \n\nOne way to think of outliers and leverages is to regard them as a data perturbation on the data distribution of the good data. 
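The claimed robustness to leverages comes from the saturation of the sigmoid: for a feature value far from the data mass, the factor σ'(wx + t) in the gradient is essentially zero, so the point barely moves the estimate. A minimal numeric sketch (names and values are illustrative, not from the paper):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_w(x, y, a=1.0, w=1.0, t=0.0):
    # d/dw of the squared error 0.5 * (y - a*sigmoid(w*x + t))**2:
    # the factor sigmoid'(z) = s*(1 - s) vanishes for saturated units.
    s = sigmoid(w * x + t)
    return -(y - a * s) * a * s * (1.0 - s) * x

g_typical = abs(grad_w(x=1.0, y=0.0))    # point inside the data mass
g_leverage = abs(grad_w(x=50.0, y=0.0))  # leverage: x far from the mass
print(g_typical, g_leverage)
```

Even though the leverage point multiplies the gradient by a large x, the exponentially small s*(1 - s) dominates, so its gradient contribution is negligible compared with a typical point's.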
Remember that an estimated parameter T = T(F) is an implicit function of the underlying data distribution F. To evaluate the influence on T of this distribution perturbation, we use the influence function [6] of the estimator T at the point z = (x, y) with data distribution F, which is defined as \n\nIF(T, z, F) = lim_{t→0+} [T((1 − t)F + tΔ_z) − T(F)] / t  (1) \n\nin which Δ_x has mass 1 at x.¹ This definition is equivalent to a definition of derivative with respect to F, except that what we are dealing with now is the derivative of a functional. It gives the amount of change in the estimator T with respect to a distribution perturbation tΔ_z at the point z = (x, y). For a robust estimation of the parameter, we expect the estimated parameter not to change significantly under a data perturbation; in other words, the influence function is bounded for a robust estimation. \n\nDenote the conditional probability model of y given x as i.i.d. f(y|x, θ) with parameter θ. If the error function is the negative log-likelihood, with or without a penalty term, then a general property of the influence function of the estimated parameter θ̂ is IF(θ̂, (x_i, y_i), F) ∝ ∇_θ log f(y_i|x_i, θ̂) (for a proof, see [11]). \n\n¹The probability density of the distribution Δ_x is δ(y − x). \n\nDenote the neural net, with h hidden units and output dimension one (d_y = 1), as \n\nη(x, θ) = Σ_{k=1}^{h} a_k σ(w_k x + t_k)  (2) \n\nin which σ(x) is the sigmoidal function 1/(1 + exp(−x)) and θ = {a_k, w_k, t_k}. For a normal model, f(y|x, θ, σ) = N(y; η(x, θ), σ), in which N(y; c, σ) denotes the d_y-variate normal distribution with mean c and covariance matrix σ²I. 
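Definition (1) can be approximated empirically by contaminating the empirical distribution with a small mass t at z and differencing. The sketch below (illustrative, not from the paper) shows the contrast the robustness argument relies on: the influence of the sample mean grows without bound in the contaminating point, while that of the median stays bounded.

```python
def weighted_mean(xs, ws):
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

def weighted_median(xs, ws):
    pairs = sorted(zip(xs, ws))
    half = sum(ws) / 2.0
    acc = 0.0
    for x, w in pairs:
        acc += w
        if acc >= half:
            return x
    return pairs[-1][0]

def influence(T, sample, z, t=1e-3):
    # Finite-difference version of definition (1): put mass t at z,
    # mass (1 - t)/n on each original point, and difference.
    n = len(sample)
    base = T(sample, [1.0 / n] * n)
    contaminated = T(sample + [z], [(1.0 - t) / n] * n + [t])
    return (contaminated - base) / t

sample = [0.1, -0.2, 0.05, 0.3, -0.1]
if_mean = influence(weighted_mean, sample, 1000.0)
if_median = influence(weighted_median, sample, 1000.0)
print(if_mean, if_median)
```

For the mean the finite difference evaluates to roughly z minus the sample mean, i.e., unbounded in z; for the median a far-away contaminating point with small mass does not move the estimate at all.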
Straightforward calculation yields (d_y = 1) \n\nIF(θ̂, (x_i, y_i), F) ∝ (y_i − η(x_i, θ̂)) ∇_θ η(x_i, θ̂). \n\nT_m(w) ≈ (1/n_eff) Σ_{i=1}^{n} w_i^* (y_i − η(x_i, θ̂))²  (9) \n+ (2/n_eff) Σ_{i=1}^{n} w_i r_i g_i^T [Σ_i w_i (g_i g_i^T − r_i ζ_i) + ∇_θ ∇_θ^T α(λ, θ̂)]^{-1} r_i g_i  (10) \n\nin which g_i = ∇_θ η(x_i, θ̂), ζ_i = ∇_θ ∇_θ^T η(x_i, θ̂) and r_i = y_i − η(x_i, θ̂). Thus if the models in comparison contain an improper prior, the above model selection statistic can be used. \n\nIf the models in comparison have closed forms of f(y|x, θ, σ), the average negative log-likelihood can be used as the model selection criterion. In Liu's work [10], an approximate form for the unbiased estimation of the expected negative log-likelihood was provided. If we use the negative log-likelihood plus a penalty term α(λ, θ) as the parameter optimization criterion, the model selection statistic is \n\nT_m(w) = −(1/n) Σ_{i=1}^{n} log f(y_i|x_i, θ̂_{−i}) ≈ −(1/n) Σ_{i=1}^{n} log f(y_i|x_i, θ̂) + (1/n) Tr(D^{-1}C)  (11) \n\nin which C = Σ_{i=1}^{n} ∇_θ log f(y_i|x_i, θ̂) ∇_θ^T log f(y_i|x_i, θ̂) and D = −Σ_{i=1}^{n} ∇_θ ∇_θ^T log f(y_i|x_i, θ̂) + ∇_θ ∇_θ^T α(λ, θ̂). The optimal model is the one that minimizes this statistic. If the true underlying distribution is the normal distribution model and there are no penalization terms, it is easy to prove that C → D as n goes to infinity; the statistic then becomes AIC [1]. \n\nFigure 1: BPP fit to a data set with leverages, and comparison with the BPP fit to the data set without the leverages. A one-hidden-layer neural net with 4 hidden units is fitted, using the conventional BPP method, to a data set with 10 leverages, which are on the right side of x = 3.5. The main body of the data (90 data points) was generated from y = sin(x) + ε, with ε ~ N(ε; 0, σ = 0.2). It can be noticed that the fit on the part of good data points was not dramatically influenced by the leverages. This verifies our theoretical result about the robustness of a neural net with respect to leverages. 
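The EM parameter estimation under the t model (a scale mixture of normals) used in the simulations below can be illustrated on a linear toy problem; the paper fits the neural net η(x, θ) instead, but the weighting scheme is the same. All names, the degree of freedom ν = 3, and the data here are assumptions for illustration.

```python
import random

def fit_t_regression(xs, ys, nu=3.0, iters=50):
    """EM for y = a*x + b + e with e ~ t_nu, viewed as a scale
    mixture of normals: e | u ~ N(0, s2/u), u ~ Gamma(nu/2, nu/2).

    E-step: w_i = E[u_i | r_i] = (nu + 1) / (nu + r_i**2 / s2).
    M-step: weighted least squares for (a, b), weighted update of s2.
    """
    a, b, s2 = 0.0, 0.0, 1.0
    n = len(xs)
    for _ in range(iters):
        r = [y - (a * x + b) for x, y in zip(xs, ys)]
        w = [(nu + 1.0) / (nu + ri * ri / s2) for ri in r]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        a = sxy / sxx
        b = my - a * mx
        s2 = sum(wi * (y - (a * x + b)) ** 2
                 for wi, x, y in zip(w, xs, ys)) / n
    return a, b

random.seed(0)
xs = [i / 10.0 for i in range(30)]
ys = [2.0 * x + 1.0 + random.gauss(0.0, 0.1) for x in xs]
ys[5] = ys[20] = 15.0   # two gross outliers in y
a, b = fit_t_regression(xs, ys)
print(a, b)             # close to the true line y = 2x + 1
```

Points with large residuals receive small latent weights w_i, so the weighted least-squares step effectively ignores the outliers; this is the heavy-tail mechanism the figures below exploit.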
5 Illustrative Simulations \n\nFor the results shown in figures 2 and 3, the training data set contains 93 data points from y = sin(x) + ε, in which ε ~ N(ε; 0, σ = 0.2), and seven y values (outliers) randomly generated from the region [1, 2]. The neural net we use is of the form in equation (2). Denote h as the number of hidden units in the neural net. The caption of each figure (1, 2, 3) explains the usefulness of the parameter estimation algorithm and the model selection. \n\nFigure 2: Model selection statistic T_m for fits to the data set with outliers, tested on an independent data set of 1000 data points from y = sin(x) + ε, where ε ~ N(ε; 0, σ = 0.2). It can be seen that the T_m statistic is consistent with the error on the test data set. The T_m statistic favors the t models with small ν over the normal distribution models. The models compared are (ν, h) = (3,3), (2,3), (4,3), (2,4), (3,4), (5,3), (3,5), (1,3), (3,7), (n,3), (n,4), (n,5), (n,7), where n stands for the normal distribution model (BPP fit); the plotted quantities are the T_m statistic and the MSE on the test set (×10⁻¹). \n\nFigure 3: Fits to the data set with outliers, and comparison with the BPP fit to the data set without the outliers. The best of the four BPP fits (h = 3), according to the T_m statistic, was influenced by the outliers, tending to shift upwards. Although the error distribution is not a t distribution at all, the best fit by the EM algorithm under the t model (ν = 3, h = 3), also chosen by the T_m statistic, gives a better result than the BPP fit; in fact it is almost the same as the BPP fit (h = 3) to the training data set without the outliers. This is because a t distribution has a heavy tail that accommodates the outliers. \n\nAcknowledgements \n\nThe author thanks Leon N Cooper and M. P. Perrone. The author also thanks his wife Cong. This research was supported by grants from NSF, ONR and ARO. \n\nReferences \n\n[1] H. Akaike. Information theory and an extension of the maximum likelihood principle. In Petrov and Csaki, editors, Proceedings of the 2nd International Symposium on Information Theory, pages 267-281, 1973. \n\n[2] J. O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, 1985. \n\n[3] R. D. Cook and S. Weisberg. Characterization of an empirical influence function for detecting influential cases in regression. Technometrics, 22:495-508, 1980. \n\n[4] P. Craven and G. Wahba. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math., 31:377-403, 1979. \n\n[5] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. Ser. B, 39:1-38, 1977. \n\n[6] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986. \n\n[7] P. W. Holland and R. E. Welsch. 
Robust regression using iteratively reweighted least-squares. Commun. Stat. A, 6:813-827, 1977. \n\n[8] P. J. Huber. Robust Statistics. New York: Wiley, 1981. \n\n[9] S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat., 22:79-86, 1951. \n\n[10] Y. Liu. Neural Network Model Selection Using Asymptotic Jackknife Estimator and Cross-Validation Method. In C. L. Giles, S. J. Hanson, and J. D. Cowan, editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann, 1993. \n\n[11] Y. Liu. Robust neural network parameter estimation and model selection for regression. Submitted, 1993. \n\n[12] Y. Liu. Unbiased estimate of generalization error and model selection criterion in neural network. Submitted to Neural Networks, 1993. \n\n[13] D. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1991. \n\n[14] J. E. Moody. The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, 1992. \n\n[15] G. Schwarz. Estimating the dimension of a model. Ann. Stat., 6:461-464, 1978. \n\n[16] M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Stat. Soc. Ser. B, 36:111-147, 1974. \n\n[17] M. Stone. An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. J. Roy. Stat. Soc. Ser. B, 39(1):44-47, 1977. \n", "award": [], "sourceid": 796, "authors": [{"given_name": "Yong", "family_name": "Liu", "institution": null}]}