{"title": "Comparing Biases for Minimal Network Construction with Back-Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 177, "page_last": 185, "abstract": null, "full_text": "177 \n\nCOMPARING BIASES FOR MINIMAL NETWORK \nCONSTRUCTION WITH BACK-PROPAGATION \n\nStephen Jo~ Hansont \n\nLorien Y. Pratt \n\nBell Communications Research \nMorristown. New Jersey 07960 New Brunswick. New Jersey 08903 \n\nRutgers University \n\nABSTRACT \n\nlearning \n\nrepresentations during \n\nRumelhart (1987). has proposed a method for choosing minimal or \n\"simple\" \nin Back-propagation \nnetworks. This approach can be used to (a) dynamically select the \nnumber of hidden units. (b) construct a representation that is \nappropriate for the problem and (c) thus improve the generalization \nability of Back-propagation networks. The method Rumelhart suggests \ninvolves adding penalty terms to the usual error function. In this paper \nwe introduce Rumelhart\u00b7s minimal networks idea and compare two \npossible biases on the weight search space. These biases are compared \nin both simple counting problems and a speech recognition problem. \nIn general. the constrained search does seem to minimize the number of \nhidden units required with an expected increase in local minima. \n\nINTRODUCTION \n\nMany supervised connectionist models use gradient descent in error to solve various \nkinds of tasks (Rumelhart. Hinton & Williams. 1986). However. such gradient descent \nmethods tend to be \".opportunistic\" and can solve problems in an arbitrary way dependent \non starting point in weight space and peculiarities of the training set. For example. in \nFigure 1 we show a \"mesh\" problem which consists of a random distribution of \nexemplars from two categories. The spatial geometry of the categories impose a meshed \nor overlapping subset of the exemplars in the two dimensional feature space. 
As the meshed part of the categories increases, the problem becomes more complex and must involve the combination of more linear cuts in feature space and consequently more nonlinear cuts for category separation. In the top left corner, Figure 1(a) shows a mesh geometry requiring only three cuts for category separation. In the bottom center, Figure 1(b) is the projection of the three-cut solution of the mesh in output space. In the top right, Figure 1(c) is a typical solution provided by back-propagation starting with 16 hidden units. This Figure shows the two-dimensional feature space in which 9 of the line cuts are projected (the other 7 are outside the [0,1] unit plane).

† Also member of Cognitive Science Laboratory, 221 Nassau Street, Princeton University, Princeton, New Jersey, 08542

Figure 1: Mesh problem (a), output space (b) and typical back-propagation solution (c)

Examining the weights in the next layer of the network indicates that in fact 7 of these 9 line segments are used in order to construct the output surface shown in Figure 1(b). Consequently, the underlying feature relations determining the output surface and category separation are arbitrary, more complex than necessary, and may result in anomalous generalizations.

Rumelhart (1987) has proposed a way to increase the generalization capabilities of learning networks which use gradient descent methods and to automatically control the resources learning networks use, for example, in terms of "hidden" units. His hypothesis concerns the nature of the representation in the network: "... the simplest most robust network which accounts for a data set will, on average, lead to the best generalization to the population from which the training set has been drawn". 
The basic approach involves adding penalty terms to the usual error function in order to constrain the search and cause weights to differentially decay. This is similar to many proposals in statistical regression where a "simplicity" measure is minimized along with the error term and is sometimes referred to as "biased" regression (Rawlings, 1988). Basically, the statistical concept of biased regression derives from parameter estimation approaches that attempt to achieve a best linear unbiased estimator ("BLUE"). By definition an unbiased estimator is one with the lowest possible variance, and theoretically, unless there is significant collinearity¹ or nonlinearity amongst the variables, a least squares estimator (LSE) can also be shown to be a BLUE. If, on the other hand, input variables are correlated or nonlinear with the output variables (as is the case in back-propagation), then there is no guarantee that the LSE will also be unbiased. Consequently, introducing a bias may actually reduce the variance of the estimator to below that of the theoretically unbiased estimator.

1. For example, Ridge regression is a special case of biased regression which attempts to make a singular correlation matrix non-singular by adding a small arbitrary constant to the diagonal of the matrix. This increase in the diagonal may lower the impact of the off-diagonal elements and thus reduce the effects of collinearity.

Since back-propagation is a special case of multivariate nonlinear regression methods, we must immediately give up on achieving a BLUE. Worse yet, the input variables are also very likely to be collinear in that input data are typically expected to be used for feature extraction. Consequently, the neural network framework leads naturally to the exploration of biased regression techniques. Unfortunately, 
it is not obvious what sorts of biases ought to be introduced and whether they may be problem dependent. Furthermore, the choice of particular biases probably determines the particular representation that is chosen and its nature in terms of size, structure and "simplicity". This representation bias may in turn induce generalization behavior which is greater in accuracy with larger coverage over the domain. Nonetheless, since there is no particular motivation for minimizing a least squares estimator, it is important to begin exploring possible biases that would lead to lower variance and more robust estimators.

In this paper we explore two general types of bias which introduce explicit constraints on the hidden units. First we discuss the standard back-propagation method, various past methods of biasing which have been called "weight decay", the properties of our biases, and finally some simple benchmark tests using parity and a speech recognition task.

BACK-PROPAGATION

The Back-propagation method (Rumelhart, Hinton & Williams, 1986) is a supervised learning technique using gradient descent in an error variable. The error is established by comparing an output value to a desired or expected value. These errors can be accumulated over the sample:

$$E = \sum_{s} \sum_{i} (y_{is} - \hat{y}_{is})^2 \qquad (1)$$

Assuming the output function is differentiable, a gradient of the error can be found, and we require that weight changes move against this derivative so that the error decreases:

$$\Delta w_{ij} \propto -\frac{\partial E}{\partial w_{ij}} \qquad (2)$$

Over multiple layers we pass back a weighted sum of each derivative from units above.

WEIGHT DECAY

Past work² using biases has generally been based on ad hoc arguments that weights should differentially decay, allowing large weights to persist and small weights to tend towards zero sooner. Apparently, this would tend to concentrate more of the input into a smaller number of weights. Generally, 
the intuitive notion is to somehow reduce the complexity of the network as defined by the number of connections and number of hidden units. A simple but inefficient way of doing this is to include a weight decay term in the usual delta updating rule, causing all weights to decay on each learning step (where $w = w_{ij}$ throughout):

$$w_{n+1} = \alpha \left( -\frac{\partial E}{\partial w} \right) + \beta w_n \qquad (3)$$

Solving this difference equation shows that for $\beta < 1.0$ weights are decaying exponentially over steps towards zero:

$$w_n = \alpha \sum_{i=1}^{n} \beta^{n-i} \left( -\frac{\partial E}{\partial w} \right)_i + \beta^n w_0 \qquad (4)$$

This approach introduces the decay term in the derivative itself, causing error terms to also decrease over learning steps, which may not be desirable.

BIAS

The sort of weight decay just discussed can also be derived from general consideration of "costs" on weights. For example, it is possible to consider E with a bias term which in the simple decay case is quadratic with weight value (i.e. $w^2$).

We now combine this bias with E, producing an objective function that includes both the error term and this bias function:

$$O = E + B \qquad (5)$$

where we now want to minimize

$$\frac{\partial O}{\partial w_{ij}} = \frac{\partial E}{\partial w_{ij}} + \frac{\partial B}{\partial w_{ij}} \qquad (6)$$

In the quadratic case the updating rule becomes

$$w_{n+1} = \alpha \left( -\frac{\partial E}{\partial w_{ij}} - 2 w_n \right) + w_n \qquad (7)$$

Solving this difference equation gives a solution of the same form as equation 4:

$$w_n = \alpha \sum_{i=1}^{n} (1 - 2\alpha)^{n-i} \left( -\frac{\partial E}{\partial w_{ij}} \right)_i + (1 - 2\alpha)^n w_0 \qquad (8)$$

In this case, however, without introduction of other parameters, $\alpha$ is both the learning rate and related to the decay term, and must be strictly $< \frac{1}{2}$ for weight decay.

2. Most of the work discussed here has not been previously published but nonetheless has entered into general use in many connectionist models and was recently summarized on the Connectionist Bulletin Board by John Kruschke.
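As a numeric check on the derivation above, the following sketch (our illustration, not from the paper; the gradient sequence g and the constants are made-up values) iterates the update rule of equation 3 and compares the result with the closed form of equation 4. With the quadratic bias, the same form holds with $\beta = 1 - 2\alpha$, as in equations 7 and 8.

```python
# Iterate w_{n+1} = alpha*(-dE/dw) + beta*w_n (equation 3) and compare against
# the closed form w_n = alpha*sum_i beta^(n-i)*(-g_i) + beta^n*w_0 (equation 4).
# alpha, beta, w0 and the gradient sequence g are made-up illustrative values.
alpha, beta, w0 = 0.1, 0.9, 1.0
g = [0.5, -0.2, 0.3, 0.1, -0.4]    # stand-in values of dE/dw at each step

w = w0
for gi in g:                        # one step of the difference equation per g_i
    w = alpha * (-gi) + beta * w

n = len(g)
w_closed = sum(alpha * beta**(n - i) * (-g[i - 1]) for i in range(1, n + 1)) + beta**n * w0

assert abs(w - w_closed) < 1e-12
# For beta < 1 the beta**n * w0 term decays exponentially toward zero, which is
# the weight-decay behavior the text describes.
```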
Uniform weight decay has a disadvantage in that large weights decay at the same rate as small weights. It is possible to design biases that influence weights only when they are relatively small, or even in a particular range of values. For example, Rumelhart has entertained a number of biases; one form in particular that we will also explore is based on a rectangular hyperbolic function:

$$B = \frac{w^2}{1 + w^2} \qquad (9)$$

It is informative to examine the derivative associated with this function in order to understand its effect on the weight updates:

$$-\frac{\partial B}{\partial w_{ij}} = -\frac{2w}{(1 + w^2)^2} \qquad (10)$$

This derivative is plotted in Figure 2 (indicated as Rumelhart) and is non-monotonic, showing a strong differential effect on small weights (+ or -), pushing them towards zero, while near-zero and large weight values are not significantly affected.

BIAS PER UNIT

It is possible to consider bias on each hidden unit weight group. This has the potentially desirable effect of isolating weight changes to hidden unit weight groups and could effectively eliminate hidden units. Consequently, the hidden units are directly determining the bias. In order to do this, first define

$$W_i = \sum_j |w_{ij}| \qquad (11)$$

where i is the ith hidden unit.

Hyperbolic Bias

Now consider a function similar to Rumelhart's but this time with $W_i$, the ith hidden group magnitude, as the variable:

$$B = \frac{\lambda W_i}{1 + \lambda W_i} \qquad (12)$$

The new gradient includes the term from the bias, which is

$$-\frac{\partial B}{\partial w_{ij}} = -\frac{\lambda \, \mathrm{sgn}(w_{ij})}{(1 + \lambda W_i)^2} \qquad (13)$$

Exponential Bias

A similar kind of bias would be to consider the negative exponential:

$$B = -e^{-\lambda W_i} \qquad (14)$$

This bias is similar to the hyperbolic bias term above but involves the exponential, which potentially produces more uniform and gradual rate changes towards zero:

$$-\frac{\partial B}{\partial w_{ij}} = -\lambda \, \mathrm{sgn}(w_{ij}) \, e^{-\lambda W_i} \qquad (15)$$
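The two per-unit penalties can be written down directly. The sketch below is our reconstruction of equations 11-15 (the exact placement of λ is our reading of the garbled original, and λ and the example weight groups are arbitrary choices); it computes the gradient each bias contributes for one hidden unit's incoming weight group.

```python
import numpy as np

def unit_magnitude(w_i):
    """W_i = sum_j |w_ij| over one hidden unit's incoming weights (equation 11)."""
    return np.sum(np.abs(w_i))

def hyperbolic_bias_grad(w_i, lam):
    """dB/dw_ij for B = lam*W_i / (1 + lam*W_i)   (equations 12-13)."""
    return lam * np.sign(w_i) / (1.0 + lam * unit_magnitude(w_i)) ** 2

def exponential_bias_grad(w_i, lam):
    """dB/dw_ij for B = -exp(-lam*W_i)   (equations 14-15)."""
    return lam * np.sign(w_i) * np.exp(-lam * unit_magnitude(w_i))

# Both penalties push the weights of small-magnitude units toward zero while
# leaving units with large total weight essentially untouched.
small = np.array([0.1, -0.1])
large = np.array([5.0, -5.0])
assert np.abs(hyperbolic_bias_grad(small, 1.0)).max() > np.abs(hyperbolic_bias_grad(large, 1.0)).max()
assert np.abs(exponential_bias_grad(small, 1.0)).max() > np.abs(exponential_bias_grad(large, 1.0)).max()
```

Because the gradient depends on the unit's total weight magnitude rather than on each weight separately, whole hidden units either survive or are driven toward zero as a group.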
The behavior of these two biases (hyperbolic, exponential) is shown as a function of weight magnitude in Figure 2. Notice that the exponential bias term is more similar in slope change to Rumelhart's (even though his is non-monotonic) than the hyperbolic as weight magnitude to a hidden unit increases.

Figure 2: Bias function behavior of Rumelhart's, Hyperbolic and Exponential

Obviously there are many more kinds of bias that one can consider. These two were chosen in order to provide a systematic test of varying biases and to explore their differential effectiveness in minimizing network complexity.

SOME COMPARISONS

Parity

These biased Back-propagation methods were applied to several counting problems and to a speech (digit) recognition problem. In the following graphs, for example, we show the results of 100 runs of XOR and 4-bit parity at η = .1 (learning rate) and α = .8 (moving average), starting with 10 hidden units. The parameter λ was optimized for the bias runs.

Figure 3: Exclusive OR runs for standard, hyperbolic and exponential biasing

Shown are runs for the standard case without biases, the hyperbolic bias and the exponential bias. Once a solution was reached, all hidden units were tested individually by removing each of them one at a time from the network and then testing on the training set. Any hidden unit which was unnecessary was removed for data analysis. 
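The removal test just described can be sketched as follows. This is our reconstruction, and the tiny hand-built network (which computes output = x using one useful unit, one bias-carrying unit, and one dead unit) is a made-up example rather than a trained one from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, W1, W2, mask):
    H = sigmoid(X @ W1) * mask                    # mask zeroes out "removed" units
    return (sigmoid(H @ W2) > 0.5).astype(float).ravel()

def functional_units(X, Y, W1, W2):
    """Remove each hidden unit in turn; keep only those whose removal
    breaks performance on the training set."""
    full = np.ones(W1.shape[1])
    keep = []
    for h in range(W1.shape[1]):
        mask = full.copy()
        mask[h] = 0.0                             # remove hidden unit h
        if not np.array_equal(predict(X, W1, W2, mask), Y):
            keep.append(h)                        # removal hurts: unit is functional
    return keep

# Hand-built example: inputs are (x, bias); unit 0 computes the answer,
# unit 1 carries the bias, unit 2 has zero output weight (unnecessary).
X = np.array([[0.0, 1.0], [1.0, 1.0]])
Y = np.array([0.0, 1.0])
W1 = np.array([[20.0, 0.0, 1.0],
               [-10.0, 20.0, 0.0]])
W2 = np.array([[20.0], [-10.0], [0.0]])
print(functional_units(X, Y, W1, W2))             # prints [0, 1]: unit 2 is not functional
```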
Only the number of these "functional units" is reported in the histograms. Notice the number of hidden units decreases with bias runs. An analysis of variance (statistical test) verified this improvement for both the hyperbolic and exponential over the standard. Also note that the exponential is significantly better than the hyperbolic. This is also confirmed for the 4-bit parity case as shown in Figure 4.

Figure 4: Four-bit parity runs for standard, hyperbolic and exponential biasing

Speech Recognition

Samples of 10 spoken digits (0-9) each were collected (same speaker throughout; D. J. Burr kindly supplied data). Samples were then preprocessed using FFTs, retaining the first 12 cepstral coefficients. To avoid ceiling effects, only two tokens each of the 10 digits were used for training ("0", "0", "1", "1", ..., "9", "9") each network. Eight such 2-token samples were used for replications. Another set of 50 spoken digits (5 samples of each of the 10 digits) was collected for transfer. All runs were matched across methods for number of learning sweeps (<300), η = .05, α = .2, and λ = .01, which were optimized for the exponential bias. Shown in the following table are the results of the 8 replications for the standard and the exponential bias.

           Standard                  Constrained (exp)
sample     Transfer   # Hidden Units   Transfer   # Hidden Units
r1         56%        11               64%        10
r2         60%        11               76%        13
r3         62%        18               64%        14
r4         64%        14               74%        14
r5         62%        16               56%        11
r6         56%        19               68%        14
r7         58%        18               54%        11
r8         58%        18               64%        9
mean       --         17 ± .56         65%        12 ± .71

Table 1: Eight replications with transfer for standard and exponential bias. 
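The preprocessing step for the speech data (an FFT followed by the first 12 cepstral coefficients) can be sketched roughly as below. This is a generic real-cepstrum computation under our own assumptions (frame length, sampling rate, toy signal, log floor), not the authors' exact recipe.

```python
import numpy as np

def cepstral_coefficients(frame, n_coeffs=12):
    """Real cepstrum of one frame: inverse FFT of the log magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(frame))
    log_mag = np.log(spectrum + 1e-10)    # small floor avoids log(0)
    return np.fft.irfft(log_mag)[:n_coeffs]

# Toy "speech" frame: a 256-sample sinusoid stands in for a windowed digit sample.
t = np.arange(256) / 8000.0               # assumed 8 kHz sampling rate
frame = np.sin(2 * np.pi * 440.0 * t)
coeffs = cepstral_coefficients(frame)
assert coeffs.shape == (12,)
```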
In this case there appears to be both an improvement in the average number of hidden units (functional ones) and in transfer. A typical correlation of the improved transfer and reduced hidden unit usage for a single replication is plotted in the next graph.

Figure 5: Transfer as a function of hidden unit usage for a single replication (fitted line y = -1.2x + 71.7)

We note that introduction of biases decreases the probability of convergence relative to the standard case (as many as 75% of the parity runs did not converge within the criterion number of sweeps). Since the search problem is made more difficult by introducing biases, it now becomes even more important to explore methods for improving convergence, similar, for example, to simulated annealing (Kirkpatrick, Gelatt & Vecchi, 1983).

CONCLUSIONS

Minimal networks were defined and two types of bias were compared in simple counting problems and a speech recognition problem. In the counting problems under biasing conditions, the number of hidden units tended to decrease towards the minimum required for the problem, although with a concomitant decrease in convergence rate. In the speech problem, also under biasing conditions, the number of hidden units tended to decrease as the transfer rate tended to improve.

Acknowledgements

We thank Dave Rumelhart for discussions concerning the minimal network concept, the Bellcore connectionist group, and members of the Princeton Cognitive Science Lab for a lively environment for the development of these ideas.

References

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P., Optimization by simulated annealing, Science, 220, 671-680, (1983).

Rawlings, J. O., Applied Regression Analysis, Wadsworth & Brooks/Cole, (1988). 
Rumelhart, D. E., Personal Communication, Princeton, (1987).

Rumelhart, D. E., Hinton, G. E., & Williams, R., Learning internal representations by error propagation, Nature, (1986).
", "award": [], "sourceid": 156, "authors": [{"given_name": "Stephen", "family_name": "Hanson", "institution": null}, {"given_name": "Lorien", "family_name": "Pratt", "institution": null}]}