{"title": "Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment", "book": "Advances in Neural Information Processing Systems", "page_first": 107, "page_last": 115, "abstract": null, "full_text": "107 \n\nSKELETONIZATION: \n\nA TECHNIQUE FOR TRIMMING THE FAT \n\nFROM A NETWORK VIA RELEVANCE ASSESSMENT \n\nMichael C. Mozer \nPaul Smolensky \n\nDepartment of Computer Science & \nInstitute of Cognitive Science \nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nABSTRACT \n\nThis paper proposes a means of using the knowledge in a network to determine the functionality or relevance of individual units, both for the purpose of understanding the network's behavior and improving its performance. The basic idea is to iteratively train the network to a certain performance criterion, compute a measure of relevance that identifies which input or hidden units are most critical to performance, and automatically trim the least relevant units. This skeletonization technique can be used to simplify networks by eliminating units that convey redundant information; to improve learning performance by first learning with spare hidden units and then trimming the unnecessary ones away, thereby constraining generalization; and to understand the behavior of networks in terms of minimal \"rules.\" \n\nINTRODUCTION \n\nOne thing that connectionist networks have in common with brains is that if you open them up and peer inside, all you can see is a big pile of goo. Internal organization is obscured by the sheer number of units and connections. Although techniques such as hierarchical cluster analysis (Sejnowski & Rosenberg, 1987) have been suggested as a step in understanding network behavior, one would like a better handle on the role that individual units play. 
This paper proposes one means of using the knowledge in a network to determine the functionality or relevance of individual units. Given a measure of relevance for each unit, the least relevant units can be automatically trimmed from the network to construct a skeleton version of the network. \n\nSkeleton networks have several potential applications: \n\n\u2022 Constraining generalization. By eliminating input and hidden units that serve no purpose, the number of parameters in the network is reduced and generalization will be constrained (and hopefully improved). \n\n\u2022 Speeding up learning. Learning is fast with many hidden units, but a large number of hidden units allows for many possible generalizations. Learning is slower with few hidden units, but generalization tends to be better. One idea for speeding up learning is to train a network with many hidden units and then eliminate the irrelevant ones. This may lead to rapid learning of the training set and then, gradually, an improvement in generalization performance. \n\n\u2022 Understanding the behavior of a network in terms of \"rules\". One often wishes to get a handle on the behavior of a network by analyzing the network in terms of a small number of rules instead of an enormous number of parameters. In such situations, one may prefer a simple network that performs correctly on 95% of the cases over a complex network that performs correctly on 100%. The skeletonization process can discover such a simplified network. \n\nSeveral researchers (Chauvin, 1989; Hanson & Pratt, 1989; David Rumelhart, personal communication, 1988) have studied techniques for the closely related problem of reducing the number of free parameters in back propagation networks. Their approach involves adding extra cost terms to the usual error function that cause nonessential weights and units to decay away. 
We have opted for a different approach - the all-or-none removal of units - which is not a gradient descent procedure. The motivation for our approach was twofold. First, our initial interest was in designing a procedure that could serve to focus \"attention\" on the most important units, hence an explicit relevance metric was needed. Second, our impression is that it is a tricky matter to balance a primary and a secondary error term against one another. One must determine the relative weighting of these terms, weightings that may have to be adjusted over the course of learning. In our experience, it is often impossible to avoid local minima - compromise solutions that partially satisfy each of the error terms. This conclusion is supported by the experiments of Hanson and Pratt (1989). \n\nDETERMINING THE RELEVANCE OF A UNIT \n\nConsider a multi-layer feedforward network. How might we determine whether a given unit serves an important function in the network? One obvious source of information is its outgoing connections. If a unit in layer l has many large-weighted connections, then one might expect its activity to have a big impact on higher layers. However, this need not be. The effects of these connections may cancel each other out; even a large input to units in layer l+1 will have little influence if these units are near saturation; outgoing connections from the innervated units in l+1 may be small; and the unit in l may have a more-or-less constant activity, in which case it could be replaced by a bias on units in l+1. Thus, a more accurate measure of the relevance of a unit is needed. \n\nWhat one really wants to know is, what will happen to the performance of the network when a unit is removed? That is, how well does the network do with the unit versus without it? 
For unit i, then, a straightforward measure of the relevance, ρ_i, is \n\nρ_i = E_without unit i - E_with unit i, \n\nwhere E is the error of the network on the training set. The problem with this measure is that to compute the error with a given unit removed, a complete pass must be made through the training set. Thus, the cost of computing ρ is O(np) stimulus presentations, where n is the number of units in the network and p is the number of patterns in the training set. Further, if the training set is not fixed or is not known to the experimenter, additional difficulties arise in computing ρ. \n\nWe therefore set out to find a good approximation to ρ. Before presenting this approximation, it is first necessary to introduce an additional bit of notation. Suppose that associated with each unit i is a coefficient α_i which represents the attentional strength of the unit (see Figure 1). This coefficient can be thought of as gating the flow of activity from the unit: \n\no_j = f(Σ_i w_ji α_i o_i), \n\nwhere o_j is the activity of unit j, w_ji the connection strength to j from i, and f the sigmoid squashing function. If α_i = 0, unit i has no influence on the rest of the network; if α_i = 1, unit i is a conventional unit. In terms of α, the relevance of unit i can then be rewritten as \n\nρ_i = E_{α_i=0} - E_{α_i=1}. \n\nWe can approximate ρ_i using the derivative of the error with respect to α_i: \n\nlim_{γ→1} (E_{α_i=γ} - E_{α_i=1}) / (γ - 1) = ∂E/∂α_i |_{α_i=1}. \n\nAssuming that this equality holds approximately for γ = 0: \n\n(E_{α_i=0} - E_{α_i=1}) / (0 - 1) ≈ ∂E/∂α_i |_{α_i=1}, \n\nor \n\nρ_i ≈ -∂E/∂α_i |_{α_i=1}. \n\nOur approximation for ρ_i is then ρ̂_i = -∂E/∂α_i. \n\nThis derivative can be computed using an error propagation procedure very similar to that used in adjusting the weights with back propagation. 
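As a concrete sketch of this computation (our illustration, not the authors' code: the tiny 4-2-1 network, its random weights, and the single training pattern are all assumptions), the gated forward pass and the derivative-based relevance ρ̂_i = -∂E/∂α_i for the two hidden units can be written as:

```python
import numpy as np

# Minimal sketch (not the authors' code): a 4-2-1 network whose hidden units
# carry attentional gates alpha. Relevance is approximated as
# rho_i = -dE/d(alpha_i) at alpha = 1, with the linear error E = sum |t - o|.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))        # input -> hidden weights (illustrative)
W2 = rng.normal(size=(1, 2))        # hidden -> output weights (illustrative)

def forward(x, alpha):
    h = np.tanh(W1 @ x)             # hidden activities in (-1, 1)
    o = np.tanh(W2 @ (alpha * h))   # alpha gates the flow of activity
    return h, o

def error(x, t, alpha):
    _, o = forward(x, alpha)
    return np.abs(t - o).sum()      # linear error

def relevance(x, t, alpha):
    h, o = forward(x, alpha)
    dE_do = np.sign(o - t)          # derivative of |t - o| w.r.t. o
    dE_dnet = dE_do * (1.0 - o**2)  # back through the tanh squashing function
    dE_dalpha = (dE_dnet @ W2) * h  # back through the gated weighted sum
    return -dE_dalpha               # rho_hat for each hidden unit

x = np.array([1.0, -1.0, 1.0, -1.0])
t = np.array([1.0])
alpha = np.ones(2)
rho = relevance(x, t, alpha)

# Compare with the exact definition rho_i = E(alpha_i=0) - E(alpha_i=1):
for i in range(2):
    a0 = alpha.copy(); a0[i] = 0.0
    print(i, rho[i], error(x, t, a0) - error(x, t, alpha))
```

The printout shows the derivative-based estimate next to the exact error difference; they agree only approximately, which is the trade made for a single backward pass instead of one full pass per unit.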
Additionally, note that because the approximation assumes that α_i is 1, the α_i never need be changed. Thus, the α_i are not actual parameters of the system, just a bit of notational convenience used in estimating relevance. \n\nFigure 1. A 4-2-3 network with attentional coefficients on the input and hidden units. \n\nIn practice, we have found that ∂E/∂α_i fluctuates strongly in time, and a more stable estimate that yields better results is an exponentially decaying time average of the derivative. In the simulations reported below, we use the following measure: \n\nρ̂_i(t+1) = 0.8 ρ̂_i(t) - 0.2 ∂E(t)/∂α_i. \n\nOne final detail of relevance assessment we need to mention is that relevance is computed based on a linear error function, E_l = Σ_pj |t_pj - o_pj| (where p is an index over patterns, j over output units; t_pj is the target output, o_pj the actual output). The usual quadratic error function, E_q = Σ_pj (t_pj - o_pj)^2, provides a poor estimate of relevance if the output pattern is close to the target. This difficulty with E_q is further elaborated in Mozer and Smolensky (1989). In the results reported below, while E_q is used as the error metric in training the weights via conventional back propagation, ρ̂ is measured using E_l. This involves separate back propagation phases for computing the weight updates and the relevance measures. \n\nA SIMPLE EXAMPLE: THE CUE SALIENCE PROBLEM \n\nConsider a network with four inputs labeled A-D, one hidden unit, and one output. We generated ten training patterns such that the correlations between each input unit and the output are as shown in the first row of Table 1. (In this particular task, a hidden layer is not necessary. The inclusion of the hidden unit simply allowed us to use a standard three-layer architecture for all tasks.) 
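The exponentially decaying average described above can be sketched as follows (illustrative numbers, not the authors' code): each step folds the newest value of -∂E/∂α_i into a decaying average, damping the step-to-step fluctuations of the raw derivative:

```python
# Sketch of the smoothed relevance estimate: an exponentially decaying
# average of -dE/d(alpha_i). The per-step derivatives are made-up numbers.
def update(rho_hat, dE_dalpha):
    # rho_hat(t+1) = 0.8 * rho_hat(t) - 0.2 * dE(t)/d(alpha_i)
    return [0.8 * r - 0.2 * g for r, g in zip(rho_hat, dE_dalpha)]

# Noisy per-step derivatives for two units: unit 0 consistently lowers the
# error when attended to (negative dE/d alpha); unit 1 hovers around zero.
steps = [[-1.2, 0.02], [-0.8, -0.03], [-1.0, 0.01]]
rho_hat = [0.0, 0.0]
for g in steps:
    rho_hat = update(rho_hat, g)
print(rho_hat)   # unit 0 accumulates a large rho_hat; unit 1 stays near zero
```

The decay constants 0.8 and 0.2 are the ones given in the text; the averaging simply trades a little lag for a much more stable ordering of the units.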
\n\nIn this and subsequent simulations, unit activities range from -1 to 1, and input and target output patterns are binary (-1 or 1) vectors. Training continues until all output activities are within some acceptable margin of the target value. Additional details of the training procedure and network parameters are described in Mozer and Smolensky (1989). \n\nTo perform perfectly, the network need only attend to input A. This is not what the input-hidden connections do, however; their weights have the same qualitative profile as the correlations (second row of Table 1).1 In contrast, the relevance values for the input units show A to be highly relevant while B-D have negligible relevance. Further, the qualitative picture presented by the profile of ρ̂_i's is identical to that of the ρ_i's. Thus, while the weights merely reflect the statistics of the training set, ρ̂_i indicates the functionality of the units. \n\nTable 1 \n\n                                      Input Unit \n                                      A       B       C       D \nCorrelation with Output Unit          1.0     0.6     0.2     0.0 \nInput-Hidden Connection Strengths     3.15    1.23    0.83   -0.01 \nρ_i                                   5.36    0.07    0.06    0.00 \nρ̂_i                                   0.46   -0.03    0.01   -0.02 \n\n1 The values reported in Table 1 are an average over 100 replications of the simulation with different initial random weights. Before averaging, however, the signs of the weights were flipped if the hidden-output connection was negative. \n\nTHE RULE-PLUS-EXCEPTION PROBLEM \n\nConsider a network with four binary inputs labeled A-D and one binary output. The task is to learn the function AB + ĀB̄C̄D̄; the output unit should be on whenever both A and B are on, or in the special case that all inputs are off. With two hidden units, back propagation arrives at a solution in which one unit responds to AB - the rule - and the other to ĀB̄C̄D̄ - the exception. Clearly, the AB unit is more relevant to the solution; it accounts for fifteen cases whereas the ĀB̄C̄D̄ unit accounts for only one. This fact is reflected in the ρ̂_i: in 100 replications of the simulation, the mean value of ρ̂_AB was 1.49 whereas ρ̂_ĀB̄C̄D̄ was only .17. These values are extremely reliable; the standard errors are .003 and .005, respectively. \n\nRelevance was also measured using the quadratic error function. With this metric, the AB unit is incorrectly judged as being less relevant than the ĀB̄C̄D̄ unit: ρ̂_AB is .029 and ρ̂_ĀB̄C̄D̄ is .033. As mentioned above, the basis of the failure of the quadratic error function is that E_q grossly underestimates the true relevance as the output error goes to zero. Because the one exception pattern is invariably the last to be learned, the output error for the fifteen non-exception patterns is significantly lower, and consequently, the relevance values computed on the basis of the non-exception patterns are much smaller than those computed on the basis of the one exception pattern. This results in the relevance assessment derived from the exception pattern dominating the overall relevance measure, and in the incorrect relevance assignments described above. However, this problem can be avoided by assessing relevance using the linear error function. \n\nIf we attempted to \"trim\" the rule-plus-exception network by eliminating hidden units, the logical first candidate would be the less relevant ĀB̄C̄D̄ unit. This trimming process would leave us with a simpler network - a skeleton network - whose behavior is easily characterized in terms of a simple rule, but which could only account for 15 of the 16 input cases. \n\nCONSTRUCTING SKELETON NETWORKS \n\nIn the remaining examples we construct skeleton networks using the relevance metric. 
\nThe procedure is as follows: (1) train the network until all output unit activities are within some specified margin around the target value (for details, see Mozer & Smolensky, 1989); (2) compute ρ̂ for each unit; (3) remove the unit with the smallest ρ̂; and (4) repeat steps 1-3 a specified number of times. In the examples below, we have chosen to trim either the input units or the hidden units, not both simultaneously, but there is no reason why this could not be done. \n\nWe have not yet addressed the crucial question of how much to trim away from the network. At present, we specify in advance when to stop trimming. However, the procedure described above makes use only of the ordinal values of the ρ̂. One untapped source of information that may be quite informative is the magnitudes of the ρ̂. A large increase in the minimum ρ̂ value as trimming progresses may indicate that further trimming will seriously disrupt performance in the network. \n\nTHE TRAIN PROBLEM \n\nConsider the task of determining a rule that discriminates the \"east\" trains from the \"west\" trains in Figure 2. There are two simple rules - simple in the sense that the rules require a minimal number of input features: East trains have a long car and triangle load in car, or an open car and white wheels on car. Thus, of the seven features that describe each train, only two are essential for making the east/west discrimination. \n\nA 7-1-1 network trained on this task using back propagation learns quickly, but the final solution takes into consideration nearly all the inputs because 6 of the 7 features are partially correlated with the east/west discrimination. 
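The four-step trimming loop described above can be sketched generically as follows (a toy illustration, not the authors' code: training is stubbed out and the relevance numbers are made up):

```python
# Generic sketch of the skeletonization loop (steps 1-4 above). The two
# callables stand in for the real training pass and the rho_hat computation;
# the fixed relevance numbers below are purely illustrative.
def skeletonize(units, train_to_criterion, compute_relevance, n_trim):
    for _ in range(n_trim):
        train_to_criterion(units)              # (1) retrain to criterion
        rho = compute_relevance(units)         # (2) rho_hat for each unit
        units.remove(min(units, key=rho.get))  # (3) cut the least relevant
    train_to_criterion(units)                  # final retraining pass
    return units

fixed_rho = {'A': 5.4, 'B': 0.1, 'C': 0.06, 'D': 0.0}   # made-up values
units = ['A', 'B', 'C', 'D']
skeletonize(units,
            train_to_criterion=lambda u: None,   # stub: no actual training
            compute_relevance=lambda u: {x: fixed_rho[x] for x in u},
            n_trim=2)
print(units)   # the two least relevant units have been trimmed
```

Recomputing ρ̂ after every cut is the point of the loop: relevances shift as the surviving units are retrained to absorb the removed unit's role.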
When the skeletonization procedure is applied to trim the number of inputs from 7 to 2, however, the network is successfully trimmed to the minimal set of input features - either long car and triangle load, or open car and white wheels on car - on each of 100 replications of the simulation we ran. \n\nThe trimming task is far from trivial. The expected success rate with random removal of the inputs is only 9.5%. Other skeletonization procedures we experimented with resulted in success rates of 50%-90%. \n\nFigure 2. The train problem. Adapted from Medin, Wattenmaker, & Michalski, 1987. \n\nTHE FOUR-BIT MULTIPLEXOR PROBLEM \n\nConsider a network that learns to behave as a four-bit multiplexor. The task is, given 6 binary inputs labeled A-D, M1, and M2, and one binary output, to map one of the inputs A-D to the output contingent on the values of M1 and M2. The logical function being computed is M̄1M̄2A + M̄1M2B + M1M̄2C + M1M2D. \n\nA standard 4-hidden-unit back propagation network was tested against a skeletonized network that began with 8 hidden units initially and was trimmed to 4 (an 8→4 skeleton network). If the network did not reach the performance criterion within 1000 training epochs, we assumed that the network was stuck in a local minimum and counted the run as a failure. \n\nPerformance statistics for the two networks are shown in Table 2, averaged over 100 replications. The standard network fails to reach criterion on 17% of the runs, whereas the skeleton network always obtains a solution with 8 hidden units and the solution is not lost as the hidden layer is trimmed to 4 units.2 The skeleton network with 8 hidden units reaches criterion in about half the number of training epochs required by the standard network. From this point, hidden units are trimmed one at a time from the skeleton network, and after each cut the network is retrained to criterion. Nonetheless, the total number of epochs required to train the initial 8-hidden-unit network and then trim it down to 4 is still less than that required for the standard network with 4 units. Furthermore, as hidden units are trimmed, the performance of the skeleton network remains close to criterion, so the improvement in learning is substantial. \n\nTable 2 \n\narchitecture             failure rate    median epochs to criterion    median epochs to criterion \n                                         (with 8 hidden)               (with 4 hidden) \nstandard 4-hidden net    17%             --                            52 \n8→4 skeleton net         0%              25                            45 \n\nTHE RANDOM MAPPING PROBLEM \n\nThe problem here is to map a set of random 20-element input vectors to random 2-element output vectors. Twenty random input-output pairs were used as the training set. Ten such training sets were generated and tested. A standard 2-hidden-unit network was tested against a 6→2 skeleton network. For each training set and architecture, 100 replications of the simulation were run. If criterion was not reached within 1000 training epochs, we assumed that the network was stuck in a local minimum and counted the run as a failure. \n\nAs Table 3 shows, the standard network failed to reach criterion with two hidden units on 17% of all runs, whereas the skeleton network failed with the hidden layer trimmed to two units on only 8.3% of runs. In 9 of the 10 training sets, the failure rate of the skeleton network was lower than that of the standard network. Both networks required comparable amounts of training to reach criterion with two hidden units, but the skeleton network reaches criterion much sooner with six hidden units, and its performance does not significantly decline as the network is trimmed. These results parallel those of the four-bit multiplexor. \n\n2 Here and below we report median epochs to criterion rather than mean epochs to avoid aberrations caused by the large number of epochs consumed in failure runs. \n\nTable 3 \n\n                 standard network                   6→2 skeleton network \ntraining set     % failures    median epochs       % failures    median epochs    median epochs \n                               to criterion                      to criterion     to criterion \n                               (2 hidden)                        (6 hidden)       (2 hidden) \n1                14            20                  7             11               22 \n2                16            69                  13            12               47 \n3                25            34                  0             7                14 \n4                33            38                  0             10               21 \n5                38            96                  55            35               -- \n6                9             17                  1             9                17 \n7                9             28                  5             14               43 \n8                6             13                  0             8                16 \n9                8             12                  0             8                17 \n10               12            12                  2             8                17 \n\nSUMMARY AND CONCLUSIONS \n\nWe proposed a method of using the knowledge in a network to determine the relevance of individual units. The relevance metric can identify which input or hidden units are most critical to the performance of the network. The least relevant units can then be trimmed to construct a skeleton version of the network. \n\nSkeleton networks have application in two different scenarios, as our simulations demonstrated: \n\n\u2022 Understanding the behavior of a network in terms of \"rules\" \n\n- The cue salience problem. The relevance metric singled out the one input that was sufficient to solve the problem. The other inputs conveyed redundant information. \n\n- The rule-plus-exception problem. The relevance metric was able to distinguish the hidden unit that was responsible for correctly handling most cases (the general rule) from the hidden unit that dealt with an exceptional case. \n\n- The train problem. The relevance metric correctly discovered the minimal set of input features required to describe a category. \n\n\u2022 Improving learning performance \n\n- The four-bit multiplexor. Whereas a standard network was often unable to discover a solution, 
the skeleton network never failed. Further, the skeleton network learned the training set more quickly. \n\n- The random mapping problem. As in the multiplexor problem, the skeleton network succeeded considerably more often with comparable overall learning speed, and less training was required to reach criterion initially. \n\nBasically, the skeletonization technique allows a network to use spare input and hidden units to learn a set of training examples rapidly, and gradually, as units are trimmed, to discover a more concise characterization of the underlying regularities of the task. In the process, local minima seem to be avoided without increasing the overall learning time. \n\nOne somewhat surprising result is the ease with which a network is able to recover when a unit is removed. Conventional wisdom has it that if, say, a network is given excess hidden units, it will memorize the training set, thereby making use of all the hidden units available to it. However, in our simulations, the network does not seem to be distributing the solution across all hidden units because even with no further training, removal of a hidden unit often does not drop performance below the criterion. In any case, there generally appears to be an easy path from the solution with many units to the solution with fewer. \n\nAlthough we have presented skeletonization as a technique for trimming units from a network, there is no reason why a similar procedure could not operate on individual connections instead. Basically, an α coefficient would be required for each connection, allowing for the computation of ∂E/∂α. Yann le Cun (personal communication, 1989) has independently developed a procedure quite similar to our skeletonization technique which operates on individual connections. 
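A per-connection version of the relevance computation can be sketched as follows (our illustration, not le Cun's procedure; a single linear output unit and made-up numbers keep the arithmetic transparent):

```python
# Per-connection sketch: each connection i into a unit gets its own gate
# alpha_i, so with a linear output unit out = sum_i w_i * alpha_i * o_i the
# relevance of connection i is -dE/d(alpha_i) = -dE/d(out) * w_i * o_i at
# alpha = 1. Weights, activities, and target are illustrative assumptions.
w = [0.5, -2.0, 0.1]          # weights into one linear output unit
o_in = [1.0, 1.0, -1.0]       # incoming activities
t = 1.0                       # target output
out = sum(wi * oi for wi, oi in zip(w, o_in))
dE_dout = 1.0 if out > t else -1.0   # derivative of the linear error |t - out|
rho = [-dE_dout * wi * oi for wi, oi in zip(w, o_in)]
print(rho)   # connections that push toward the target get positive relevance
```

Here the first connection pushes the output toward the target and earns a positive relevance, while the other two push it away and earn negative relevances, so the all-or-none trimming criterion carries over to connections unchanged.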
\n\nAcknowledgements \n\nOur thanks to Colleen Seifert for conversations that led to this work; to Dave Goldberg, \nGeoff Hinton, and Yann Ie Cun for their feedback; and to Eric Jorgensen for saving us \nfrom computer hell. This work was supported by grant 87-2-36 from the Sloan Founda(cid:173)\ntion to Geoffrey Hinton, a grant from the James S. McDonnell Foundation to Michael \nMozer, and a Sloan Foundation grant and NSF grants IRI-8609599, ECE-8617947, and \nCDR-8622236 to Paul Smolensky. \n\nRererences \n\nChauvin, Y. (1989). A back-propagation algorithm with optimal use of hidden units. In \nAdvances in Neural Network Information Processing Systems. San Mateo, CA: \nMorgan Kaufmann. \n\nHanson, S. J., & Pratt, L. Y. (1989). Some comparisons of constraints for minimal \nIn Advances in Neural Network \n\nnetwork construction with back propagation. \nInformation Processing Systems. San Mateo, CA: Morgan Kaufmann. \n\nMedin, D. L., Wattenmaker, W. D., & Michalski, R. S. (1987). Constraints and \npreferences in inductive learning: An experimental study of human and machine \nperformance. Cognitive Science, 11, 299-339. \n\nMozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the \nfat from a network via relevance assessment (Technical Report CU-CS-421-89). \nBoulder: University of Colorado, Department of Computer Science. \n\nSejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce \n\nEnglish text. Complex Systems, 1, 145-168. \n\n\f", "award": [], "sourceid": 119, "authors": [{"given_name": "Michael", "family_name": "Mozer", "institution": null}, {"given_name": "Paul", "family_name": "Smolensky", "institution": null}]}