{"title": "A Back-Propagation Algorithm with Optimal Use of Hidden Units", "book": "Advances in Neural Information Processing Systems", "page_first": 519, "page_last": 526, "abstract": null, "full_text": "A BACK-PROPAGATION ALGORITHM WITH OPTIMAL USE OF HIDDEN UNITS \n\nYves Chauvin \n\nThomson-CSF, Inc. (and Psychology Department, Stanford University) \n\n630 Hansen Way (Suite 250), Palo Alto, CA 94306 \n\nABSTRACT \n\nThis paper presents a variation of the back-propagation algorithm that makes optimal use of a network's hidden units by decreasing an \"energy\" term written as a function of the squared activations of these hidden units. The algorithm can automatically find optimal or nearly optimal architectures necessary to solve known Boolean functions, facilitate the interpretation of the activation of the remaining hidden units, and automatically estimate the complexity of architectures appropriate for phonetic labeling problems. The general principle of the algorithm can also be adapted to different tasks: for example, it can be used to eliminate the [0, 0] local minimum of the [-1, +1] logistic activation function while preserving a much faster convergence and forcing binary activations over the set of hidden units. \n\nPRINCIPLE \n\nThis paper describes an algorithm which makes optimal use of the hidden units in a network using the standard back-propagation algorithm (Rumelhart, Hinton & Williams, 1986). Optimality is defined as the minimization of a function of the \"energy\" spent by the hidden units throughout the network, independently of the chosen architecture, and where the energy is written as a function of the squared activations of the hidden units. 
\n\nThe standard back-propagation algorithm is a gradient descent algorithm on the following cost function: \n\nC = Σ_{i∈P} Σ_{j∈O} (d_ij - o_ij)^2   [1] \n\nwhere d is the desired output of an output unit, o the actual output, and where the sum is taken over the set of output units O for the set of training patterns P. \n\nThe following algorithm implements a gradient descent on the following cost function: \n\nC = μ_er Σ_{i∈P} Σ_{j∈O} (d_ij - o_ij)^2 + μ_en Σ_{i∈P} Σ_{j∈H} e(o_ij^2)   [2] \n\nwhere e is a positive monotonic function and where the sum of the second term is now taken over a set or subset of the hidden units H. The first term of this cost function will be called the error term, the second, the energy term. \n\nIn principle, the theoretical minimum of this function is found when the desired activations are equal to the actual activations for all output units and all presented patterns and when the hidden units do not \"spend any energy\". In practical cases, such a minimum cannot be reached and the hidden units have to \"spend some energy\" to solve a given problem. The quantity of energy will be in part determined by the relative importance given to the error and energy terms during gradient descent. In principle, if a hidden unit has a constant activation whatever the pattern presented to the network, it contributes to the energy term only and will be \"suppressed\" by the algorithm. The precise energy distribution among the hidden units will depend on the actual energy function e. \n\nANALYSIS \n\nALGORITHM IMPLEMENTATION \n\nWe can write the total cost function that the algorithm tries to minimize as a weighted sum of an error term and an energy term: \n\nC = μ_er E_er + μ_en E_en   [3] \n\nThe first term is the error term used with the standard back-propagation algorithm in Rumelhart et al. 
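As a concrete illustration of the cost function of Equation [2], here is a minimal NumPy sketch (variable names are my own; `mu_er` and `mu_en` correspond to the two weighting coefficients μ_er and μ_en, and the energy function shown is the logarithmic n = 1 case from the paper's Table 1):

```python
import numpy as np

def combined_cost(d, o, h, mu_er=1.0, mu_en=0.1):
    """Equation [2]: weighted sum of the LMS error term and an
    'energy' term over the hidden-unit activations.

    d, o : (P, O) arrays of desired and actual outputs over P patterns
    h    : (P, H) array of hidden-unit activations over the same patterns
    """
    error = np.sum((d - o) ** 2)            # E_er: summed squared output error
    energy = np.sum(np.log(1.0 + h ** 2))   # E_en with e(o^2) = log(1 + o^2)
    return mu_er * error + mu_en * energy
```

With all hidden activations at zero the energy term vanishes, so the cost reduces to the standard back-propagation error term, matching the "zero-energy" minimum described in the text.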
\nIf we have h hidden layers, we can write the total energy term as a sum of the energy terms corresponding to each hidden layer: \n\nE_en = Σ_{i=1}^{h} Σ_{j∈H_i} e(o_ij^2)   [4] \n\nTo decrease the energy of the uppermost hidden layer H_h, we can compute the derivative of the energy function with respect to the weights. This derivative will be zero for any weight \"above\" the considered hidden layer. For any weight just below the considered hidden layer, we have (using the notation of Rumelhart et al.): \n\nδ_i^en = ∂e(o_i^2)/∂net_i   [5] \n\n= (∂e(o_i^2)/∂o_i^2)(∂o_i^2/∂o_i)(∂o_i/∂net_i) = 2 e'(o_i^2) o_i f'(net_i)   [6] \n\nwhere the derivative of e is taken with respect to the \"energy\" of the unit i and where f corresponds to the logistic function. For any hidden layer below the considered layer h, the chain rule yields: \n\nδ_k^en = f'_k(net_k) Σ_j δ_j^en w_jk   [7] \n\nThis is just standard back-propagation with a different back-propagated term. If we minimize both the error at the output layer and the energy of the hidden layer h, we can compute the complete weight change for any connection below layer h: \n\nΔw_kl = -μ_er δ_k^er o_l - μ_en δ_k^en o_l = -o_l (μ_er δ_k^er + μ_en δ_k^en) = -o_l δ_k^ac   [8] \n\nwhere δ_k^ac is now the delta accumulated for error and energy, which we can write as a function of the deltas of the upper layer: \n\nδ_k^ac = f'_k(net_k) Σ_j δ_j^ac w_jk   [9] \n\nThis means that instead of propagating the delta for both energy and error, we can compute an accumulated delta for hidden layer h and propagate it back throughout the network. If we minimize the energy of the layers h and h-1, the new accumulated delta will equal the previously accumulated delta added to a new delta energy on layer h-1. The procedure can be repeated throughout the complete network. In short, 
the back-propagated error signal used to change the weights of each layer is simply equal to the back-propagated signal used in the previous layer augmented with the delta energy of the current hidden layer. (The algorithm is local and easy to implement.) \n\nENERGY FUNCTION \n\nThe algorithm is sensitive to the energy function e being minimized. The functions used in the simulations described below have the following derivative with respect to the squared activations/energy (only this derivative is necessary to implement the algorithm, see Equation [6]): \n\n∂e/∂o_i^2 = 1 / (1 + o_i^2)^n   [10] \n\nwhere n is an integer that determines the precise shape of the energy function (see Table 1) and modulates the behavior of the algorithm in the following way. For n = 0, e is a linear function of the energy: \"high energy\" and \"low energy\" units are equally penalized. For n = 1, e is a logarithmic function and \"low energy\" units become more penalized than \"high energy\" units, in proportion to the linear case. For n = 2, the energy penalty may reach an asymptote as the energy increases: \"high energy\" units are not penalized more than \"middle energy\" units. In the simulations, as expected, it appears that higher values of n tend to suppress \"low energy\" units. (For n > 2, the behavior of the algorithm was not significantly different from n = 2 for the tests described below.) \n\nTABLE 1: Energy Functions. \n\nn      | 0   | 1            | 2             | n > 2 \ne(o^2) | o^2 | log(1 + o^2) | o^2/(1 + o^2) | ?     \n\nBOOLEAN EXAMPLES \n\nThe algorithm was tested with a set of Boolean problems. In typical tasks, the energy of the network significantly decreases during early learning. Later on, the network finds a better minimum of the total cost function by decreasing the error and by \"spending\" energy to solve the problem. 
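The three energy functions of Table 1 all share the derivative form of Equation [10], which is the only quantity the algorithm needs. A small self-checking sketch (function names are my own) that verifies Equation [10] against Table 1 by finite differences:

```python
import numpy as np

def e(o2, n):
    """Energy functions of Table 1, as functions of the squared activation o2."""
    if n == 0:
        return o2                  # linear: all units penalized alike
    if n == 1:
        return np.log(1.0 + o2)    # logarithmic: "low energy" units penalized more
    return o2 / (1.0 + o2)         # n = 2: penalty saturates for "high energy" units

def de_do2(o2, n):
    """Equation [10]: derivative of e with respect to the energy o2."""
    return 1.0 / (1.0 + o2) ** n

# Numerical check that Equation [10] is the derivative of each Table 1 entry.
o2, eps = 0.37, 1e-6
for n in (0, 1, 2):
    numeric = (e(o2 + eps, n) - e(o2 - eps, n)) / (2 * eps)
    assert abs(numeric - de_do2(o2, n)) < 1e-5
```

Note how for n = 2 the derivative 1/(1 + o²)² shrinks rapidly as o² grows, which is why "high energy" units are barely penalized further in that case.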
Figure 1 shows energy and error as a function of the number of learning cycles during a typical task (XOR) for 4 different runs. For a broad range of the energy learning rate, the algorithm is quite stable and finds the solution to the given problem. This nice behavior is also quite independent of variations in the onset of energy minimization. \n\nEXCLUSIVE OR AND PARITY \n\nThe algorithm was tested with EXOR for various network architectures. Figure 2 shows an example of the activation of the hidden units after learning. The algorithm finds a minimal solution (2 hidden units, \"minimum logic\") to solve the XOR problem when the energy is being minimized. This minimal solution is actually found whatever the starting number of hidden units. If several layers are used, the algorithm finds an optimal or nearly optimal size for each layer. \n\nFigure 1. Energy and error curves as a function of the number of pattern presentations for different values of the \"energy\" rate (0, .1, .2, .4). Each energy curve (\"e\" label) is associated with an error curve (\"+\" label). During learning, the units \"spend\" some energy to solve the given task. \n\nWith parity 3, for a [-1, +1] activation range of the sigmoid function, the algorithm does not find the optimal 2 hidden unit solution but has no problem finding a 3 hidden unit solution, independently of the starting architecture. \n\nSYMMETRY \n\nThe algorithm was tested with the symmetry problem, described in Rumelhart et al. The minimal solution for this task uses 2 hidden units. The simplest form of the algorithm does not actually find this minimal solution because some weights from the hidden units to the output unit can actually grow enough to compensate for the low activations of the hidden units. 
However, a simple weight decay can prevent these weights from growing too much and allows the network to find the minimal solution. In this case, the total cost function being minimized simply becomes: \n\nC = μ_er Σ_{i∈P} Σ_{j∈O} (d_ij - o_ij)^2 + μ_en Σ_{i∈P} Σ_{j∈H} e(o_ij^2) + μ_w Σ_{ij} w_ij^2   [11] \n\nFigure 2. Hidden unit activations of a 4 hidden unit network over the 4 EXOR patterns when (left) standard back-propagation and (right) energy minimization are being used during learning. The network is reduced to minimal size (2 hidden units) when the energy is being minimized. \n\nPHONETIC LABELING \n\nThe algorithm was tested with a phonetic labeling task. The input patterns consisted of spectrograms (single speaker, 10 time frames spaced 3.2 ms apart, centered, 16 frequencies) corresponding to 9 syllables [ba], [da], [ga], [bi], [di], [gi], and [bu], [du], [gu]. The task of the network was to classify these spectrograms (7 tokens per syllable) into three categories corresponding to the three consonants [b], [d], and [g]. Starting with 12 hidden units, the algorithm reduced the network to 3 hidden units. (A hidden unit is considered unused when its activation over the entire range of patterns contributes very little to the activations of the output units.) With standard back-propagation, all of the 12 hidden units are usually being used. The resulting network is consistent with the sizes of the hidden layers used by Elman and Zipser (1987) for similar tasks. \n\nEXTENSION OF THE ALGORITHM \n\nEquation [2] represents a constraint over the set of possible LMS solutions found by the back-propagation algorithm. With such a constraint, 
the \"zero-energy\" level of the hidden units can be (informally) considered as an attractor in the solution space. However, by changing the sign of the energy gradient, such a point now constitutes a repellor in this space. Having such repellors might be useful when a set of activation values is to be avoided during learning. For example, if the activation range of the sigmoid transfer function is [-1, +1], the learning speed of the back-propagation algorithm can be greatly improved, but the [0, 0] unit activation point (zero input, zero output) often behaves as a local minimum. By inverting the sign of the energy gradient during early learning, it is possible to have the point [0, 0] act as a repellor, forcing the network to make \"maximal use\" of its resources (hidden units). This principle was tested on the parity-3 problem with a network of 7 hidden units. For a given set of coefficients, standard back-propagation can solve parity-3 in about 15 cycles but yields about 65% of local minima in [0, 0]. By using the \"repulsion\" constraint, parity-3 can be solved in about 20 cycles with 0% of local minima. \n\nInterestingly, it is also possible to design a \"trajectory\" of such constraints during learning. For example, the [0, 0] activation point can be built as a repellor during early learning in order to avoid the corresponding local minimum, then as an attractor during middle learning to reduce the size of the hidden layer, and as a repellor during late learning, to force the hidden units to have binary activations. This type of trajectory was tested on the parity-3 problem with 7 hidden units. In this case, the algorithm always avoids the [0, 0] local minimum. Moreover, the network can be reduced to 3 or 4 hidden units taking binary values over the set of input patterns. 
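The trajectory of constraints described above amounts to scheduling the sign applied to the energy gradient over learning cycles. A minimal sketch, with hypothetical cycle thresholds (the paper does not report the exact switch points used):

```python
def energy_gradient_sign(cycle, repel_until=50, attract_until=200):
    """Hypothetical schedule for the 'trajectory' of constraints.

    The [0, 0] activation point acts first as a repellor (to escape the
    local minimum), then as an attractor (to shrink the hidden layer),
    then as a repellor again (to drive activations toward binary values).
    The returned sign multiplies the energy delta mu_en * delta_en.
    """
    if cycle < repel_until:
        return -1.0   # early learning: repel from [0, 0]
    if cycle < attract_until:
        return +1.0   # middle learning: standard energy minimization
    return -1.0       # late learning: repel again, forcing binary activations
```

The weight update of Equation [8] would then use `sign * mu_en * delta_en` in place of `mu_en * delta_en`; flipping the sign turns gradient descent on the energy term into gradient ascent, which is what converts the zero-energy attractor into a repellor.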
In contrast, standard back-propagation often gets stuck in local minima and uses the initial 7 hidden units with analog activation values. \n\nCONCLUSION \n\nThe present algorithm simply imposes a constraint over the LMS solution space. It can be argued that limiting such a solution space can in some cases increase the generalizing properties of the network (curve-fitting analogy). Although a complete theory of generalization has yet to be formalized, the present algorithm presents a step toward the automatic design of \"minimal\" networks by imposing constraints on the activations of the hidden units. (Similar constraints on weights can be imposed and have been tested with success by D. E. Rumelhart, Personal Communication. Combinations of constraints on weights and activations are being tested.) What is simply shown here is that this energy minimization principle is easy to implement, is robust to a broad range of parameter values, can find minimal or nearly optimal network sizes when tested with a variety of tasks, and can be used to \"bend\" trajectories of activations during learning. \n\nAcknowledgments \n\nThis research was conducted at Thomson-CSF, Inc. in Palo Alto. I would like to thank the Thomson neural net team for useful discussions. Dave Rumelhart and the PDP team at Stanford University were also very helpful. I am especially grateful to Yoshiro Miyata, from Bellcore, for letting me use his neural net simulator (SunNet) and to Jeff Elman, from UCSD, for letting me use the speech data that he collected. \n\nReferences \n\nJ. L. Elman & D. Zipser. Learning the hidden structure of speech. ICS Technical Report 8701, Institute for Cognitive Science, University of California, San Diego (1987). \n\nD. E. Rumelhart, G. E. Hinton & R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. Cambridge, MA: MIT Press/Bradford Books (1986). \n", "award": [], "sourceid": 133, "authors": [{"given_name": "Yves", "family_name": "Chauvin", "institution": null}]}