{"title": "Early Brain Damage", "book": "Advances in Neural Information Processing Systems", "page_first": 669, "page_last": 675, "abstract": null, "full_text": "Early Brain Damage \n\nVolker Tresp, Ralph Neuneier and Hans Georg Zimmermann* \n\nSiemens AG, Corporate Technologies \n\nOtto-Hahn-Ring 6 \n\n81730 München, Germany \n\n*{Volker.Tresp, Ralph.Neuneier, Georg.Zimmermann}@mchp.siemens.de \n\nAbstract \n\nOptimal Brain Damage (OBD) is a method for reducing the number of weights in a neural network. OBD estimates the increase in cost function if weights are pruned and is a valid approximation if the learning algorithm has converged into a local minimum. On the other hand, it is often desirable to terminate the learning process before a local minimum is reached (early stopping). In this paper we show that OBD estimates the increase in cost function incorrectly if the network is not in a local minimum. We also show how OBD can be extended such that it can be used in connection with early stopping. We call this new approach Early Brain Damage, EBD. EBD also allows reviving already pruned weights. We demonstrate the improvements achieved by EBD using three publicly available data sets. \n\n1 Introduction \n\nOptimal Brain Damage (OBD) was introduced by Le Cun et al. (1990) as a method to significantly reduce the number of weights in a neural network. By reducing the number of free parameters, the variance in the prediction of the network is often reduced considerably, which -in some cases- leads to an improvement in the generalization performance of the neural network. OBD might be considered a realization of the principle of Occam's razor, which states that the simplest explanation (of the training data) should be preferred to more complex explanations (requiring more weights). \n\nIf E is the cost function which is minimized during training, OBD calculates the saliency of each parameter w_i, defined as \n\nOBD(w_i) = A(w_i) = (1/2) (∂²E/∂w_i²) w_i². \n\nWeights with a small OBD(w_i) are candidates for removal. OBD(w_i) has the intuitive meaning of being the increase in cost function if weight w_i is set to zero, under the assumptions \n\n• that the cost function is quadratic, \n• that the cost function is \"diagonal\", which means it can be written as E = Bias + (1/2) Σ_i h_i (w_i - w_i*)², where {w_i*}_{i=1}^s are the weights in a (local) optimum of the cost function (Figure 1) and the h_i and Bias are parameters which depend on the training data set, \n• and that w_i ≈ w_i*. \n\nIn practice, all three assumptions are often violated, but experiments have demonstrated that OBD is a useful method for weight removal. \n\nIn this paper we want to take a closer look at the third assumption, i.e. the assumption that weights are close to the optimum. The motivation is that theory and practice have shown that it is often advantageous to perform early stopping, which means that training is terminated before convergence. Early stopping can be thought of as a form of regularization: since training typically starts with small weights, with early stopping weights are biased towards small weights, analogously to other regularization methods such as ridge regression and weight decay. According to the assumptions in OBD we might be able to apply OBD only in heavily overtrained networks, where we lose the benefits of early stopping. In this paper we show that OBD can be extended such that it can work together with early stopping. We call the new criterion Early Brain Damage (EBD). As in OBD, EBD contains a number of simplifying assumptions which are typically invalid in practice. Therefore, experimental results have to demonstrate that EBD has benefits. We validate EBD using three publicly available data sets. 
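As a minimal numerical illustration of the OBD criterion above (this is not code from the paper; the weights and Hessian diagonal are made-up toy values), the saliency under the quadratic "diagonal" assumption can be sketched in a few lines:

```python
import numpy as np

# Toy stand-ins (not from the paper): current weights and the diagonal
# of the Hessian of the cost function, h_i = d^2 E / d w_i^2.
w = np.array([0.8, -0.05, 1.2, 0.01])
h = np.array([2.0, 1.5, 0.5, 3.0])

# OBD saliency: estimated increase in cost if w_i is set to zero,
# assuming a quadratic, "diagonal" cost and w_i close to the optimum.
obd = 0.5 * h * w**2

# Weights with the smallest saliency are candidates for removal.
prune_order = np.argsort(obd)
```

With these toy numbers the two near-zero weights are ranked first for removal, matching the intuition that small weights are cheap to prune under the third assumption.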
\n\n2 Theory \n\nAs in OBD, we approximate the cost function locally by a quadratic function and assume a \"diagonal\" form. Figure 1 illustrates that OBD(w_i) for w_i = w_i* calculates the increase in cost function if w_i is set to zero. In early stopping, where w_i ≠ w_i*, OBD(w_i) calculates the quantity denoted as A_i in Figure 1. Consider \n\nB_i = -(∂E/∂w_i) w_i. \n\nThe saliency of weight w_i in Early Stopping Pruning, \n\nESP(w_i) = A_i + B_i, \n\nis an estimate of how much the cost function increases if the current w_i (i.e. w_i in early stopping) is set to zero. Finally, consider \n\nC_i = (∂E/∂w_i)² / (2 ∂²E/∂w_i²). \n\nFigure 1: The figure shows the cost function E as a function of one weight w_i in the network. w_i* is the optimal weight. w_i is the weight at an early stopping point. If OBD is applied at w_i, it estimates the quantity A_i. ESP(w_i) = A_i + B_i = E(w_i = 0) - E(w_i) estimates the increase in cost function if w_i is pruned. EBD(w_i) = A_i + B_i + C_i = E(w_i = 0) - E(w_i*) is the difference in cost function between setting w_i = 0 and training to convergence. In other words, EBD(w_i) = OBD(w_i*). \n\nThe saliency of weight w_i in EBD is \n\nEBD(w_i) = OBD(w_i*) = A_i + B_i + C_i, \n\nwhich estimates the increase in cost function if w_i is pruned after convergence (i.e. EBD(w_i) = OBD(w_i*)), but based on local information around the current value of w_i. In this sense EBD evaluates the \"potential\" of w_i. Weights with a small EBD(w_i) are candidates for pruning. \n\nNote that all terms required for EBD are easily calculated. With a quadratic cost function E = Σ_{k=1}^K (y^k - NN(x^k))², OBD approximates (OBD-approximation) \n\n∂²E/∂w_i² ≈ 2 Σ_{k=1}^K (∂NN(x^k)/∂w_i)²   (1) \n\nwhere (x^k, y^k)_{k=1}^K are the training data and NN(x^k) is the network response. \n\n3 Extensions \n\n3.1 Revival of Weights \n\nIn some cases it is beneficial to revive weights which have already been pruned. Note that C_i exactly estimates the decrease in cost function if weight w_i is \"revived\". Weights with a large C_i(w_i = 0) are candidates for revival. \n\n3.2 Early Brain Surgeon (EBS) \n\nAfter OBD or EBD is performed, the network needs to be retrained, since the \"diagonal\" approximation is typically violated and there are dependencies between the weights. Optimal Brain Surgeon (OBS, Hassibi and Stork, 1993) does not use the \"diagonal\" approximation and recalculates the new weights without explicit retraining. OBS still assumes a quadratic approximation of the cost function. The saliency in OBS is \n\nL_i = w_i² / (2 [H^{-1}]_{ii}) \n\nwhere [H^{-1}]_{ii} is the i-th diagonal element of the inverse of the Hessian. L_i estimates the increase in cost if the i-th weight is set to zero and all other weights are retrained. To recalculate all weights after weight w_i is removed, apply \n\nw_new = w_old - (w_i / [H^{-1}]_{ii}) H^{-1} e_i \n\nwhere e_i is the unit vector in the i-th direction. \n\nAnalogously to OBS, Early Brain Surgeon (EBS) would first calculate the optimal weight vector using a second order approximation of the cost function, \n\nw* = w - H^{-1} ∂E/∂w, \n\nand then apply OBS using w*. We did not pursue this idea any further, since our initial experiments indicated that w* was not estimated very accurately in practice. Hassibi et al. (1994) achieved good performance with OBS even when weights were far from optimal. \n\n3.3 Approximations to the Hessian and the Gradient \n\nFinnoff et al. 
(1993) have introduced the interesting idea that the relevant quantities for OBD can be estimated from the statistics of the weight changes. \n\nConsider the update in pattern-by-pattern gradient descent learning and a quadratic cost function, \n\nΔw_i = -η ∂E^k/∂w_i = 2η (y^k - NN(x^k)) ∂NN(x^k)/∂w_i, \n\nwith E^k = (y^k - NN(x^k))², where η is the learning rate. We assume that x^k and y^k are drawn online from a fixed distribution (which is strictly not true, since in pattern-by-pattern learning we draw samples from a fixed training data set). Then, using the quadratic and \"diagonal\" approximation of the cost function and assuming that the noise ε in the model \n\ny^k = NN*(x^k) + ε \n\nis additive and uncorrelated with variance σ², we obtain¹ \n\nE[Δw_i] = -(η/K) ∂E/∂w_i   (2) \n\nand \n\nVAR(Δw_i) = VAR(2η (y^k - NN(x^k)) ∂NN(x^k)/∂w_i) \n= 4η² VAR(ε ∂NN(x^k)/∂w_i) + 4η² VAR((w_i* - w_i) (∂NN(x^k)/∂w_i)²) \n= 4η² σ² E[(∂NN(x^k)/∂w_i)²] + 4η² (w_i* - w_i)² VAR((∂NN(x^k)/∂w_i)²), \n\nwhere NN*(x^k) is the network output with optimal weights {w_i*}_{i=1}^s. Note that in the OBD approximation (Equation 1) \n\n∂²E/∂w_i² ≈ 2K E[(∂NN(x^k)/∂w_i)²] \n\nand \n\n∂E/∂w_i = (w_i - w_i*) ∂²E/∂w_i². \n\nIf we make the further assumption that ∂NN(x^k)/∂w_i is Gaussian distributed with zero mean², we obtain \n\nVAR(Δw_i) = (2η²σ²/K) ∂²E/∂w_i² + (2η²/K²) (∂E/∂w_i)².   (3) \n\n¹ E[·] stands for the expected value, with w_i kept at a fixed value. \n² The zero-mean assumption is typically violated but might be enforced by renormalization. \n\nThe first term in Equation 3 is a result of the residual error, which is translated into weight fluctuations; note that weights with a large ∂²E/∂w_i² fluctuate the most. The first term is only active when there is a residual error, i.e. σ² > 0. The second term is non-zero independent of σ² and is due to the fact that in sample-by-sample learning, weight updates have a random component. From Equation 2 and Equation 3, all terms needed in EBD (i.e. ∂E/∂w_i and ∂²E/∂w_i²) are easily estimated. 
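The EBD decomposition of Section 2 can be checked numerically on a one-dimensional quadratic cost. This is a toy verification under the paper's assumptions (quadratic, "diagonal" cost), with made-up values for the curvature h, the current weight w and the optimum w*, and using C_i = (∂E/∂w_i)²/(2 ∂²E/∂w_i²), which follows from the quadratic model:

```python
# One-dimensional quadratic cost E(w) = 0.5 * h * (w - w_star)**2,
# so dE/dw = h * (w - w_star) and d^2E/dw^2 = h.
h, w_star = 2.0, 0.7   # curvature and optimal weight (toy values)
w = 0.2                # current weight at an early stopping point

g = h * (w - w_star)   # gradient at the current weight

A = 0.5 * h * w**2     # what plain OBD reports at the current weight
B = -g * w             # first-order correction term
C = g**2 / (2.0 * h)   # distance-to-optimum term

esp = A + B            # ESP: increase in cost if w is set to zero now
ebd = A + B + C        # EBD: should equal OBD evaluated at the optimum

obd_at_optimum = 0.5 * h * w_star**2
cost = lambda u: 0.5 * h * (u - w_star) ** 2
```

For these numbers, ebd coincides with obd_at_optimum and esp with cost(0) - cost(w), which is exactly the claim illustrated in Figure 1.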
\n\n4 Experimental Results \n\nIn our experiments we studied the performance of OBD, ESP and EBD in connection with early stopping. Although theory tells us that EBD should provide the best estimate of the increase in cost function caused by the removal of weight w_i, it is not obvious how reliable that estimate is when the assumptions (\"diagonal\" quadratic cost function) are violated. Also, we are not really interested in the correct estimate of the increase in cost function but in a ranking of the weights. Since the assumptions which go into OBD, EBD, ESP (and also OBS and EBS) are questionable, the usefulness of the new methods has to be demonstrated in practical experiments. \n\nWe used three different data sets: Breast Cancer Data, Diabetes Data, and Boston Housing Data. All three data sets can be obtained from the UCI repository (ftp://ics.uci.edu/pub/machine-learning-databases). The Breast Cancer Data contains 699 samples with 9 input variables consisting of cellular characteristics and one binary output, with 458 benign and 241 malignant cases. The Diabetes Data contains 768 samples with 8 input variables and one binary output. The Boston Housing Data consists of 506 samples with 13 input variables which potentially influence the housing price (output variable) in a Boston neighborhood (Harrison & Rubinfeld, 1978). \n\nOur procedure is as follows. The data set is divided into training data, validation data and test data. A neural network (MLP) is trained until the error on the validation data set starts to increase. At this point OBD, ESP and EBD are employed and 50% of all weights are removed. After pruning, the networks are retrained until again the error on the validation set starts to increase. At this point the results are compared. 
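The pruning step of this procedure (remove the 50% of weights with the smallest saliency, then retrain) can be sketched as a mask update. The saliencies below are arbitrary toy numbers standing in for OBD, ESP or EBD values, and the retraining step is left out:

```python
import numpy as np

# Toy saliencies for 8 weights (in practice: OBD, ESP or EBD values).
saliency = np.array([0.64, 0.002, 0.36, 0.0002, 0.12, 0.9, 0.01, 0.05])

# Remove the 50% of weights with the smallest saliency: build a 0/1 mask
# that is multiplied into the weight vector before retraining.
n_prune = saliency.size // 2
pruned = np.argsort(saliency)[:n_prune]
mask = np.ones_like(saliency)
mask[pruned] = 0.0
```

Keeping the mask explicit (rather than deleting weights) also leaves room for the revival step of Section 3.1, since a masked weight can simply be switched back on.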
Each experiment was repeated five times with different divisions of the data into training data, validation data and test data, and we report averages over those five experiments. \n\nTable 1 sums up the results. The first row shows the number of data in the training set, validation set and test set. The second row displays the test set error at the (first) early stopping point. Rows 3 to 5 show the test set performance of OBD, ESP and EBD at the stopping point after pruning and retraining (absolute / relative to early stopping). In all three experiments EBD performed best, and OBD was second best in two experiments (Breast Cancer Data and Diabetes Data). In two experiments (Breast Cancer Data and Boston Housing Data) the performance after pruning improved. \n\nTable 1: Comparing OBD, ESP, and EBD. \n\n                 Breast Cancer    Diabetes         Boston Housing Data \nTrain/Val/Test   233/233/233      256/256/256      168/169/169 \nHidden units     10               5                3 \nMSE (Stop)       0.0340           0.1625           0.2283 \nOBD              0.0328 / 0.965   0.1652 / 1.017   0.2275 / 0.997 \nESP              0.0331 / 0.973   0.1657 / 1.020   0.2178 / 0.954 \nEBD              0.0326 / 0.959   0.1647 / 1.014   0.2160 / 0.946 \n\n5 Conclusions \n\nIn our experiments, EBD showed better performance than OBD when used in conjunction with early stopping. The improvement in performance is not dramatic, which indicates that the rankings of the weights in OBD are reasonable as well. \n\nReferences \n\nFinnoff, W., Hergert, F., and Zimmermann, H. (1993). Improving model selection by nonconvergent methods. Neural Networks, Vol. 6, No. 6. \n\nHassibi, B. and Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In: Hanson, S. J., Cowan, J. D., and Giles, C. L. (Eds.), Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann. \n\nHassibi, B., Stork, D. G., and Wolff, G. (1994). Optimal Brain Surgeon: Extensions and performance comparisons. In: Cowan, J. D., Tesauro, G., and Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6, San Mateo, CA: Morgan Kaufmann. \n\nLe Cun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal brain damage. In: D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2, San Mateo, CA: Morgan Kaufmann. \n", "award": [], "sourceid": 1181, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "Ralph", "family_name": "Neuneier", "institution": null}, {"given_name": "Hans-Georg", "family_name": "Zimmermann", "institution": null}]}