Optimal Brain Damage

Yann Le Cun, John S. Denker and Sara A. Solla
AT&T Bell Laboratories, Holmdel, N.J. 07733
Advances in Neural Information Processing Systems, pp. 598-605

ABSTRACT

We have used information-theoretic ideas to derive a class of practical and nearly optimal schemes for adapting the size of a neural network. By removing unimportant weights from a network, several improvements can be expected: better generalization, fewer training examples required, and improved speed of learning and/or classification. The basic idea is to use second-derivative information to make a tradeoff between network complexity and training set error. Experiments confirm the usefulness of the methods on a real-world application.

1 INTRODUCTION

Most successful applications of neural network learning to real-world problems have been achieved using highly structured networks of rather large size [for example (Waibel, 1989; Le Cun et al., 1990a)]. As applications become more complex, the networks will presumably become even larger and more structured. Design tools and techniques for comparing different architectures and minimizing the network size will be needed. More importantly, as the number of parameters in the systems increases, overfitting problems may arise, with devastating effects on the generalization performance. We introduce a new technique called Optimal Brain Damage (OBD) for reducing the size of a learning network by selectively deleting weights. We show that OBD can be used both as an automatic network minimization procedure and as an interactive tool to suggest better architectures.
The basic idea of OBD is that it is possible to take a perfectly reasonable network, delete half (or more) of the weights and wind up with a network that works just as well, or better. It can be applied in situations where a complicated problem must be solved, and the system must make optimal use of a limited amount of training data. It is known from theory (Denker et al., 1987; Baum and Haussler, 1989; Solla et al., 1990) and experience (Le Cun, 1989) that, for a fixed amount of training data, networks with too many weights do not generalize well. On the other hand, networks with too few weights will not have enough power to represent the data accurately. The best generalization is obtained by trading off the training error and the network complexity.

One technique to reach this tradeoff is to minimize a cost function composed of two terms: the ordinary training error, plus some measure of the network complexity. Several such schemes have been proposed in the statistical inference literature [see (Akaike, 1986; Rissanen, 1989; Vapnik, 1989) and references therein] as well as in the NN literature (Rumelhart, 1988; Chauvin, 1989; Hanson and Pratt, 1989; Mozer and Smolensky, 1989).

Various complexity measures have been proposed, including Vapnik-Chervonenkis dimensionality (Vapnik and Chervonenkis, 1971) and description length (Rissanen, 1989). A time-honored (albeit inexact) measure of complexity is simply the number of non-zero free parameters, which is the measure we choose to use in this paper [but see (Denker, Le Cun and Solla, 1990)]. Free parameters are used rather than connections, since in constrained networks, several connections can be controlled by a single parameter.
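As a toy illustration of such a two-term cost (a sketch, not from the paper: the coefficient `lam` and the function name are hypothetical), one can penalize the training MSE by the number of non-zero free parameters:

```python
import numpy as np

def composite_cost(errors, params, lam=0.01):
    """Ordinary training error (MSE) plus lam times a complexity term:
    here, the number of non-zero free parameters."""
    mse = float(np.mean(np.square(errors)))
    complexity = int(np.count_nonzero(params))
    return mse + lam * complexity

# Deleting a near-useless parameter lowers the complexity term by lam
# while (ideally) increasing the MSE term by much less.
dense  = np.array([0.9, -0.3, 0.002, 1.1])
pruned = np.array([0.9, -0.3, 0.0,   1.1])
```

Minimizing such a composite cost trades training error against parameter count, which is the tradeoff that the deletion schemes discussed below try to realize.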
In most cases in the statistical inference literature, there is some a priori or heuristic information that dictates the order in which parameters should be deleted; for example, in a family of polynomials, a smoothness heuristic may require high-order terms to be deleted first. In a neural network, however, it is not at all obvious in which order the parameters should be deleted.

A simple strategy consists in deleting parameters with small "saliency", i.e. those whose deletion will have the least effect on the training error. Other things being equal, small-magnitude parameters will have the least saliency, so a reasonable initial strategy is to train the network and delete small-magnitude parameters in order. After deletion, the network should be retrained. Of course this procedure can be iterated; in the limit it reduces to continuous weight-decay during training (using disproportionately rapid decay of small-magnitude parameters). In fact, several network minimization schemes have been implemented using non-proportional weight decay (Rumelhart, 1988; Chauvin, 1989; Hanson and Pratt, 1989), or "gating coefficients" (Mozer and Smolensky, 1989). Generalization performance has been reported to increase significantly on the somewhat small problems examined. Two drawbacks of these techniques are that they require fine-tuning of the "pruning" coefficients to avoid catastrophic effects, and also that the learning process is significantly slowed down. Such methods include the implicit hypothesis that the appropriate measure of network complexity is the number of parameters (or sometimes the number of units) in the network.

One of the main points of this paper is to move beyond the approximation that "magnitude equals saliency", and propose a theoretically justified saliency measure.
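The magnitude-based baseline just described (train, delete the smallest-magnitude weights, retrain, iterate) can be sketched as follows; this is an illustrative sketch, with the retraining step left abstract and the function name hypothetical:

```python
import numpy as np

def prune_smallest(params, n_delete):
    """Set the n_delete smallest-magnitude surviving parameters to zero,
    implementing the 'magnitude equals saliency' heuristic."""
    p = params.copy()
    alive = np.flatnonzero(p)                    # indices of surviving weights
    order = alive[np.argsort(np.abs(p[alive]))]  # smallest magnitude first
    p[order[:n_delete]] = 0.0                    # delete: freeze at zero
    return p

params = np.array([0.05, -2.0, 0.8, -0.01, 1.2])
pruned = prune_smallest(params, n_delete=2)
# The two smallest-magnitude weights (0.05 and -0.01) are removed;
# in the full procedure the network would now be retrained.
```

The saliency measure proposed below replaces the `np.abs` ranking with a second-derivative-based one.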
Our technique uses the second derivative of the objective function with respect to the parameters to compute the saliencies. The method was validated using our handwritten digit recognition network trained with backpropagation (Le Cun et al., 1990b).

2 OPTIMAL BRAIN DAMAGE

Objective functions play a central role in this field; therefore it is more than reasonable to define the saliency of a parameter to be the change in the objective function caused by deleting that parameter. It would be prohibitively laborious to evaluate the saliency directly from this definition, i.e. by temporarily deleting each parameter and reevaluating the objective function.

Fortunately, it is possible to construct a local model of the error function and analytically predict the effect of perturbing the parameter vector. We approximate the objective function E by a Taylor series. A perturbation \delta U of the parameter vector will change the objective function by

    \delta E = \sum_i g_i \, \delta u_i + \frac{1}{2} \sum_i h_{ii} \, \delta u_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \, \delta u_i \, \delta u_j + O(\|\delta U\|^3)    (1)

Here, the \delta u_i's are the components of \delta U, the g_i's are the components of the gradient G of E with respect to U, and the h_{ij}'s are the elements of the Hessian matrix H of E with respect to U:

    g_i = \frac{\partial E}{\partial u_i}   and   h_{ij} = \frac{\partial^2 E}{\partial u_i \, \partial u_j}    (2)

The goal is to find a set of parameters whose deletion will cause the least increase of E. This problem is practically insoluble in the general case. One reason is that the matrix H is enormous (6.5 x 10^6 terms for our 2600-parameter network), and is very difficult to compute. Therefore we must introduce some simplifying approximations. The "diagonal" approximation assumes that the \delta E caused by deleting several parameters is the sum of the \delta E's caused by deleting each parameter individually; cross terms are neglected, so the third term of the right-hand side of equation 1 is discarded. The "extremal" approximation assumes that parameter deletion will be performed after training has converged.
The parameter vector is then at a (local) minimum of E and the first term of the right-hand side of equation 1 can be neglected. Furthermore, at a local minimum, all the h_{ii}'s are non-negative, so any perturbation of the parameters will cause E to increase or stay the same. Thirdly, the "quadratic" approximation assumes that the cost function is nearly quadratic so that the last term in the equation can be neglected. Equation 1 then reduces to

    \delta E = \frac{1}{2} \sum_i h_{ii} \, \delta u_i^2    (3)

2.1 COMPUTING THE SECOND DERIVATIVES

Now we need an efficient way of computing the diagonal second derivatives h_{ii}. Such a procedure was derived in (Le Cun, 1987), and was the basis of a fast back-propagation method used extensively in various applications (Becker and Le Cun, 1989; Le Cun, 1989; Le Cun et al., 1990a). The procedure is very similar to the back-propagation algorithm used for computing the first derivatives. We will only outline the procedure; details can be found in the references.

We assume the objective function is the usual mean-squared error (MSE); generalization to other additive error measures is straightforward. The following expressions apply to a single input pattern; afterward E and H must be averaged over the training set. The network state is computed using the standard formulae

    z_i = f(a_i)   and   a_i = \sum_j w_{ij} z_j    (4)

where z_i is the state of unit i, a_i its total input (weighted sum), f the squashing function and w_{ij} is the connection going from unit j to unit i. In a shared-weight network like ours, a single parameter u_k can control one or more connections: w_{ij} = u_k for all (i,j) \in V_k, where V_k is a set of index pairs. By the chain rule, the diagonal terms of H are given by

    h_{kk} = \sum_{(i,j) \in V_k} \frac{\partial^2 E}{\partial w_{ij}^2}    (5)

The summand can be expanded (using the basic network equations 4) as:

    \frac{\partial^2 E}{\partial w_{ij}^2} = \frac{\partial^2 E}{\partial a_i^2} \, z_j^2    (6)

The second derivatives are back-propagated from layer to layer:

    \frac{\partial^2 E}{\partial a_i^2} = f'(a_i)^2 \sum_l w_{li}^2 \frac{\partial^2 E}{\partial a_l^2} + f''(a_i) \frac{\partial E}{\partial z_i}    (7)

We also need the boundary condition at the output layer, specifying the second derivative of E with respect to the last-layer weighted sums:

    \frac{\partial^2 E}{\partial a_i^2} = 2 f'(a_i)^2 - 2 (d_i - z_i) f''(a_i)    (8)

for all units i in the output layer.

As can be seen, computing the diagonal Hessian is of the same order of complexity as computing the gradient. In some cases, the second term of the right-hand side of the last two equations (involving the second derivative of f) can be neglected. This corresponds to the well-known Levenberg-Marquardt approximation, and has the interesting property of giving guaranteed positive estimates of the second derivative.

2.2 THE RECIPE

The OBD procedure can be carried out as follows:

1. Choose a reasonable network architecture
2. Train the network until a reasonable solution is obtained
3. Compute the second derivatives h_{kk} for each parameter
4. Compute the saliencies for each parameter: s_k = h_{kk} u_k^2 / 2
5. Sort the parameters by saliency and delete some low-saliency parameters
6. Iterate to step 2

Deleting a parameter is defined as setting it to 0 and freezing it there. Several variants of the procedure can be devised, such as decreasing the values of the low-saliency parameters instead of simply setting them to 0, or allowing the deleted parameters to adapt again after they have been set to 0.

2.3 EXPERIMENTS

The simulation results given in this section were obtained using back-propagation applied to handwritten digit recognition. The initial network was highly constrained and sparsely connected, having 10^5 connections controlled by 2578 free parameters.
It was trained on a database of segmented handwritten zip code digits and printed digits containing approximately 9300 training examples and 3350 test examples. More details can be obtained from the companion paper (Le Cun et al., 1990b).

Figure 1: (a) Objective function (in dB) versus number of parameters for OBD (lower curve) and magnitude-based parameter deletion (upper curve). (b) Predicted and actual objective function versus number of parameters. The predicted value (lower curve) is the sum of the saliencies of the deleted parameters.

Figure 1a shows how the objective function increases (from right to left) as the number of remaining parameters decreases. It is clear that deleting parameters by order of saliency causes a significantly smaller increase of the objective function than deleting them according to their magnitude. Random deletions were also tested for the sake of comparison, but the performance was so bad that the curves cannot be shown on the same scale.

Figure 1b shows how the objective function increases (from right to left) as the number of remaining parameters decreases, compared to the increase predicted by the Quadratic-Extremum-Diagonal approximation. Good agreement is obtained for up to approximately 800 deleted parameters (approximately 30% of the parameters). Beyond that point, the curves begin to split, for several reasons: the off-diagonal terms in equation 1 become disproportionately more important as the number of deleted parameters increases, and higher-than-quadratic terms become more important when larger-valued parameters are deleted.
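Steps 3-5 of the recipe in section 2.2 can be sketched as follows. This is an illustrative sketch: the diagonal Hessian entries h_kk, which would come from the second-derivative back-propagation of section 2.1, are supplied directly as a vector, and all names and sample values are hypothetical.

```python
import numpy as np

def obd_saliencies(params, h_diag):
    """Saliency of each parameter: s_k = h_kk * u_k^2 / 2 (equation 3)."""
    return 0.5 * h_diag * params ** 2

def obd_prune(params, h_diag, n_delete):
    """Delete the n_delete lowest-saliency parameters (set them to 0)."""
    s = obd_saliencies(params, h_diag)
    p = params.copy()
    p[np.argsort(s)[:n_delete]] = 0.0
    return p

u = np.array([0.05, -2.0, 0.8, -0.01, 1.2])   # trained parameter values
h = np.array([400.0, 0.1, 1.0, 500.0, 0.5])   # hypothetical h_kk values
pruned = obd_prune(u, h, n_delete=2)
```

With these sample values OBD deletes the large weight u = -2.0 (its curvature is tiny) while keeping the small weight u = 0.05 (its curvature is large): precisely the case where the "magnitude equals saliency" baseline makes the wrong choice.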