{"title": "Does the Neuron \"Learn\" like the Synapse?", "book": "Advances in Neural Information Processing Systems", "page_first": 169, "page_last": 176, "abstract": "", "full_text": "169 \n\nDOES THE NEURON \"LEARN\" LIKE THE SYNAPSE? \n\nRAOUL TAWEL \n\nJet Propulsion Laboratory \n\nCalifornia Institute of Technology \n\nPasadena, CA 91109 \n\nAbstract. An improved learning paradigm that offers a significant reduction in com(cid:173)\nputation time during the supervised learning phase is described. \nIt is based on \nextending the role that the neuron plays in artificial neural systems. Prior work \nhas regarded the neuron as a strictly passive, non-linear processing element, and \nthe synapse on the other hand as the primary source of information processing and \nknowledge retention. In this work, the role of the neuron is extended insofar as allow(cid:173)\ning its parameters to adaptively participate in the learning phase. The temperature \nof the sigmoid function is an example of such a parameter. During learning, both the \nsynaptic interconnection weights w[j and the neuronal temperatures Tr are opti(cid:173)\nmized so as to capture the knowledge contained within the training set. The method \nallows each neuron to possess and update its own characteristic local temperature. \nThis algorithm has been applied to logic type of problems such as the XOR or parity \nproblem, resulting in a significant decrease in the required number of training cycles. \n\nINTRODUCTION \n\nOne of the current issues in the theory of supervised learning concerns the scal(cid:173)\n\ning properties of neural networks. While low-order neural computations are easily \nhandled on sequential or parallel processors, high-order problems prove to be in(cid:173)\ntractable. The computational burden involved in implementing supervised learning \nalgorithms, such as back-propagation, on networks with large connectivity and/or \nlarge training sets is immense and impractical at present. 
Therefore the treatment of 'real' applications in areas such as image recognition or pattern classification requires the development of computationally efficient learning rules. This paper reports such an algorithm. \n\nCurrent neuromorphic models regard the neuron as a strictly passive non-linear element, and the synapse as the primary source of knowledge retention. In these models, information processing is performed by propagating the synaptically weighted neuronal contributions in either a feed-forward, feed-backward, or fully recurrent fashion [1]-[3]. Artificial neural networks commonly take the point of view that the neuron can be modeled by a simple non-linear 'wire' type of device. However, evidence exists that information processing in biological neural networks does occur at the neuronal level [4]. Although neuromorphic nets based on simple neurons are useful as a first approximation, a considerable richness is to be gained by extending 'learning' to the neuron. In this work, such an extension is made. The neuron is then seen to provide an additional, or secondary, source of information processing and knowledge retention. This is achieved by treating both the neuronal and synaptic variables as optimization parameters. The temperature of the sigmoid function is an example of such a neuronal parameter. In much the same way that the synaptic interconnection weights require optimization to reflect the knowledge contained within the training set, so should the temperature terms be optimized. It should be emphasized that the method does not optimize a global neuronal temperature for the whole network, but rather allows each neuron to possess and update its own characteristic local value. 
\n\nADAPTIVE NEURON MODEL \n\nAlthough the principle of neuronal optimization is an entirely general concept, and therefore applicable to any learning scheme, the popular feed-forward back propagation (BP) learning rule has been selected for its implementation and performance evaluation. In this section we develop the mathematical formalism necessary to implement the adaptive neuron model (ANM). \n\nBack propagation is an example of supervised learning where, for each presentation consisting of an input vector i_p and its associated target vector t_p, the algorithm attempts to adjust the synaptic weights so as to minimize the sum-squared error E over all patterns p. In its simplest form, back propagation treats the interconnection weights as the only variables and consequently executes gradient descent in weight space. The error term is given by \n\nE = Σ_p E_p = (1/2) Σ_p Σ_i [t_i^p - o_i^p]^2 \n\nThe quantity t_i^p is the ith component of the pth desired output vector pattern and o_i^p is the activation of the corresponding neuron in the final layer n. For notational ease the summation over p is dropped and a single pattern is considered. On completion of learning, the synaptic weights capture the transformation linking the input to the output variables. In applications other than toy problems, a major drawback of this algorithm is its excessive convergence time. \n\nIn this paper it is shown that a significant decrease in convergence time can be realized by allowing the neurons to participate adaptively in the learning process. This means that each neuron is characterized by a set of parameters, such as temperature, whose values are optimized according to a rule, and not in a heuristic fashion as in simulated annealing. Upon training completion, learning is thus captured in both the synaptic and neuronal parameters. 
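The temperature-dependent activation that the ANM optimizes can be made concrete with a short sketch. The following Python snippet (a minimal illustration, not the paper's original code; the function names are ours) implements the logistic activation with a per-neuron temperature T = 1/β, together with its partial derivatives with respect to the net input and the temperature, which the weight and temperature update rules both rely on.

```python
import math

def act(s, T):
    """Logistic activation with per-neuron temperature T = 1/beta."""
    return 1.0 / (1.0 + math.exp(-s / T))

def dact_ds(s, T):
    """Partial derivative of act with respect to the net input s."""
    y = act(s, T)
    return y * (1.0 - y) / T

def dact_dT(s, T):
    """Partial derivative of act with respect to the temperature T."""
    y = act(s, T)
    return -(s / T**2) * y * (1.0 - y)
```

Lowering T steepens the sigmoid, and dact_dT vanishes at s = 0 and grows with |s|, consistent with the temperature sensitivity discussed below.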
\n\nThe activation of a unit - say the ith neuron on the mth layer - is given by o_i^m. This response is computed by a non-linear operation on the weighted responses of neurons from the previous layer, as seen in Figure 1. A common function to use is the logistic function, \n\no_i^m = 1 / (1 + e^(-β s_i^m)) \n\nwhere T = 1/β is the temperature of the network. The net weighted input to the neuron is found by summing products of the synaptic weights and corresponding neuronal outputs from units on the previous layer, \n\ns_i^m = Σ_j w_ij^(m-1) o_j^(m-1) \n\nFigure 1. Each neuron in a network is characterized by a local, temperature dependent, sigmoidal activation function. \n\nwhere the o_j^(m-1) represent fan-in units and the w_ij^(m-1) represent the pairwise connection strength between neuron i in layer m and neuron j in layer m-1. \n\nWe have investigated several mathematical methods for the determination of the optimal neuronal temperatures. In this paper, the rule that was selected to optimize these parameters is based on executing gradient descent on the sum-squared error E in temperature space. The method requires that the incremental change in the temperature term be proportional to the negative of the derivative of the error term with respect to the temperature. Focusing on the ith neuron on the output layer n, we have \n\nΔT_i^n = -η̃ ∂E/∂T_i^n \n\nIn this expression, η̃ is the temperature learning rate. 
This equation can be expressed as the product of two terms by the chain rule \n\n∂E/∂T_i^n = (∂E/∂o_i^n)(∂o_i^n/∂T_i^n) \n\nSubstituting expressions and leaving the explicit functional form of the activation function unspecified, i.e. o_i^n = f(T_i^n, ...), we obtain \n\n∂E/∂T_i^n = -[t_i - o_i^n] ∂f/∂T_i^n \n\nIn a similar fashion, the temperature update equation for the previous layer is given by \n\nΔT_k^(n-1) = -η̃ ∂E/∂T_k^(n-1) \n\nUsing the chain rule, this can be expressed as \n\n∂E/∂T_k^(n-1) = Σ_i (∂E/∂o_i^n)(∂o_i^n/∂s_i^n)(∂s_i^n/∂o_k^(n-1))(∂o_k^(n-1)/∂T_k^(n-1)) \n\nSubstituting expressions and simplifying reduces the above to \n\n∂E/∂T_k^(n-1) = Σ_i ( -[t_i - o_i^n] (∂f/∂s_i^n) w_ik^(n-1) ) ∂f/∂T_k^(n-1) \n\nBy repeating the above derivation for the previous layer, i.e. determining the partial derivative of E with respect to T_j^(n-2) etc., a simple recursive relationship emerges for the temperature terms. Specifically, the updating scheme for the kth neuronal temperature on the mth layer is given by \n\nΔT_k^m = -η̃ ∂E/∂T_k^m \n\nwhere \n\n∂E/∂T_k^m = -δ_k^m ∂f/∂T_k^m \n\nIn the above expression, the error signal δ_k^m takes on the value \n\nδ_k^n = t_k - o_k^n \n\nif neuron k lies on the output layer, or \n\nδ_k^m = Σ_i δ_i^(m+1) (∂f/∂s_i^(m+1)) w_ik^m \n\nif the neuron lies on a hidden layer. \n\nSIMULATION RESULTS OF TEMPERATURE OPTIMIZATION \n\nThe new algorithm was applied to logic problems. The network was trained on a standard benchmark - the exclusive-or logic problem. This is a classic benchmark because it requires hidden units and because many harder problems involve an XOR as a subproblem. As in plain BP, the application of the proposed learning rule involves two passes. In the first, an input pattern is presented and propagated forward through the network to compute the output values o_j^m. 
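The δ recursion above can be verified numerically. The sketch below (illustrative code with hypothetical names such as temp_grads; a 2-2-1 network with a single output unit is assumed) computes ∂E/∂T for the output and hidden neurons from the error signals, so the result can be compared against finite differences of E = (1/2)(t - o)^2.

```python
import math

def f(s, T):
    return 1.0 / (1.0 + math.exp(-s / T))

def df_ds(s, T):
    y = f(s, T)
    return y * (1.0 - y) / T

def df_dT(s, T):
    y = f(s, T)
    return -(s / T**2) * y * (1.0 - y)

def forward(x, W1, W2, Th, To):
    # net inputs and activations of the hidden layer, then the single output
    sh = [sum(W1[k][j] * x[j] for j in range(len(x))) for k in range(len(Th))]
    oh = [f(sh[k], Th[k]) for k in range(len(Th))]
    so = sum(W2[k] * oh[k] for k in range(len(oh)))
    return sh, oh, so, f(so, To)

def temp_grads(x, t, W1, W2, Th, To):
    """dE/dT for every neuron, following the delta recursion."""
    sh, oh, so, oo = forward(x, W1, W2, Th, To)
    delta_o = t - oo                          # output-layer error signal
    gTo = -delta_o * df_dT(so, To)            # dE/dT, output neuron
    gTh = []
    for k in range(len(Th)):
        delta_k = delta_o * df_ds(so, To) * W2[k]   # hidden-layer delta
        gTh.append(-delta_k * df_dT(sh[k], Th[k]))  # dE/dT, hidden neuron k
    return gTo, gTh
```

A central-difference check of E with respect to each temperature should agree with these analytic gradients to high precision.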
This output is compared to its target value, resulting in an error signal for each output unit. The second pass involves a backward pass through the network during which the error signal is passed along the network and the appropriate weight and temperature changes are made. Note that since the synapses and neurons have their own characteristic learning rates, i.e. η and η̃ respectively, an additional degree of freedom is introduced in the simulation. This is equivalent to allowing for relative updating time scales for the weights and temperatures, i.e. τ_w and τ_T respectively. We have now generated a gradient descent method for finding weights and temperatures in a feed-forward network. \n\nIn deriving the learning rule for temperature optimization in the above section, the derivative of the activation function of a neuron played a key role. We have used a sigmoidal type of function in our simulations whose explicit form is given by \n\nf(s_k^m, T_k^m) = 1 / (1 + e^(-β_k^m s_k^m)) \n\nand in Figure 2 it is shown to be extremely sensitive to small changes in temperature. \n\nFigure 2. Activation function shown plotted for several different temperatures. \n\nThe sigmoid is shown plotted against the net input to a neuron for temperatures ranging from 0.2 to 2.0, in increments of 0.2. However, the steepest curve is for a temperature of 0.01. The derivative of the activation function taken with respect to the temperature is given by \n\n∂f/∂T_k^m = -(s_k^m / (T_k^m)^2) f (1 - f) \n\nAs shown in Figure 3, the XOR architecture selected has two input units, two hidden units, and a single output unit. Each neuron is characterized by a temperature, and neurons are connected by weights. Prior to training the network, both the weights and temperatures were randomized. 
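The two-pass procedure can be sketched end to end for the XOR setup described here. In the sketch below, the initial weight range [-2, 2], the temperature range [0.9, 1.1], and the learning rates η = η̃ = 0.1 follow the paper; the epoch count, random seed, bias handling, exponent clipping, and positive-temperature clamp are illustrative choices of ours, not part of the original formulation.

```python
import math
import random

def f(s, T):
    """Temperature-dependent logistic, clipped to avoid overflow (a guard we add)."""
    z = s / T
    if z < -60.0:
        return 0.0
    if z > 60.0:
        return 1.0
    return 1.0 / (1.0 + math.exp(-z))

random.seed(1)
patterns = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.1 - 0.1, 1.0], 0.0)]

# 2-2-1 network; third input/hidden slot is a bias unit fixed at 1.0
W1 = [[random.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(2)]
W2 = [random.uniform(-2.0, 2.0) for _ in range(3)]
Th = [random.uniform(0.9, 1.1) for _ in range(2)]
To = random.uniform(0.9, 1.1)
eta, eta_T = 0.1, 0.1

def total_error():
    e = 0.0
    for x, t in patterns:
        xi = x + [1.0]
        oh = [f(sum(W1[k][j] * xi[j] for j in range(3)), Th[k]) for k in range(2)] + [1.0]
        e += 0.5 * (t - f(sum(W2[k] * oh[k] for k in range(3)), To)) ** 2
    return e

e_before = total_error()
for _ in range(5000):
    for x, t in patterns:
        xi = x + [1.0]
        sh = [sum(W1[k][j] * xi[j] for j in range(3)) for k in range(2)]
        oh = [f(sh[k], Th[k]) for k in range(2)] + [1.0]
        so = sum(W2[k] * oh[k] for k in range(3))
        oo = f(so, To)
        d_o = t - oo
        dfo_ds = oo * (1.0 - oo) / To
        dfo_dT = -(so / To**2) * oo * (1.0 - oo)
        W2_old = W2[:]
        for k in range(3):
            W2[k] += eta * d_o * dfo_ds * oh[k]        # synaptic update
        To = max(0.05, To + eta_T * d_o * dfo_dT)      # neuronal update (clamped)
        for k in range(2):
            d_k = d_o * dfo_ds * W2_old[k]
            dfk_ds = oh[k] * (1.0 - oh[k]) / Th[k]
            dfk_dT = -(sh[k] / Th[k]**2) * oh[k] * (1.0 - oh[k])
            for j in range(3):
                W1[k][j] += eta * d_k * dfk_ds * xi[j]
            Th[k] = max(0.05, Th[k] + eta_T * d_k * dfk_dT)
e_after = total_error()
```

Each pattern triggers one forward pass and one backward pass in which both parameter families are updated from the same error signal, mirroring the description above.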
The initial and final optimization parameters for a sample training exercise are shown in Figure 3(a) & (b). Specifically, Figure 3(a) shows the values of the randomized weights and temperatures prior to training, and Figure 3(b) shows their values after training the network for 1000 iterations. This is a case where the network has reached a global minimum. In both figures, the numbers associated with the dashed arrows represent the thresholds of the neurons, and the numbers written next to the solid arrows represent the excitatory/inhibitory strengths of the pairwise connections. \n\nFigure 3. Architecture of NN for XOR problem showing neuronal temperatures and synaptic weights before (a) and after training (b). \n\nTo fully evaluate the convergence speed of the proposed algorithm, a benchmark comparison between it and plain BP was made. In both cases the training was started with identical initial random synaptic weights lying within the range [-2.0, +2.0] and the same synaptic weight learning rate η = 0.1. The temperatures of the neurons in the ANM model were randomly selected to lie within the narrow range [0.9, 1.1] and the temperature learning rate η̃ was set at 0.1. Figures 4(a) & (b) summarize the training statistics of this comparison. \n\nFigure 4. 
Comparison of training statistics between the adaptive neuron model and plain back propagation. \n\nIn both figures, the solid lines represent the ANM and the dashed lines represent the plain BP model. In Figure 4(a), the error is plotted against the training iteration number. In Figure 4(b), the standard deviation of the error over the training set is shown plotted against the training iteration. In the first few hundred training iterations in Figure 4(a), the performance of BP and the ANM is similar and appears as a broad shoulder in the curve. Recall that both the weights and temperatures are randomized prior to training, and are therefore far from their final values. As a consequence of the low values of the learning rates used, the error is large, and only begins to get smaller when the weights and temperatures begin to fall in the right domain of values. In the ANM, the shoulder terminus is marked by a phase-transition-like discontinuity in both the error and the standard deviation. For the particular example shown, this occurred at the 637th iteration. A drop of several orders of magnitude in the error and standard deviation is observed within the next 10 iterations. This sharp drop-off is followed by a much more gradual decrease in both the error and the standard deviation. A more detailed analysis of these results will be published in a longer paper. \n\nIn learning the XOR problem using standard BP, it has been observed that the network frequently gets trapped in local minima. In Figures 5(a) & (b) we observe such a case, as shown by the dotted line. In numerous simulations on this problem, we have determined that the ANM is much less likely to become trapped in local minima. 
\"_._---------\n\n10-1 \n\n10-2 \n\ni 10-3 \n\nw \n\n11)'4 \n\n1C)'6 \n\n~ \n100 \n\n101 \n\n102 \n\n103 \n\nITERATION \n\nFigure 5. Training case where the adaptive \nneuron model escapes a local minima and plain \nback propagation does not. \n\nCONCLUSIONS \n\nIn this paper we have attempted to upgrade and enrich the model of the neuron \nfrom a simple static non-linear wire-type construct, to a dynamically reconfigurable \none. From a purely computational point of view, there are definite advantages in \nsuch an extension. Recall that if N is the number of neurons in a network then the \nnumber of synaptic connections typically increases as O(N2). Since the activation \n\n\f176 \n\nTawel \n\nfunction is extremely sensitive to small changes in temperature and that there are \nfar fewer neuronal parameters to update than synaptic weights, suggests that the \nadaptive neuron model should offer a significant reduction in convergence time. \n\nIn this paper we have also shown that the active participation of the neurons \nduring the supervised learning phase led to a significant reduction in the number \nof training cycles required to learn logic type of problems. In the adaptive neuron \nmodel both the synaptic weight interconnection strengths and the neuronal tem(cid:173)\nperature terms are treated as optimization parameters and have their own updating \nscheme and time scales. This learning rule is based on implementing gradient de(cid:173)\nscent in the sum squared error E with respect to both the weights wr] and temper(cid:173)\natures Tim. Preliminary results indicate that the new algorithm can significantly \noutperform back propagation by reducing the learning time by several orders of \nmagnitude. Specifically, the XOR problem was learnt to a very high precision by \nthe network in :::::: 103 training iterations with a mean square error of:::::: 10- 6 versus \nover 106 iterations with a corresponding mean square error of:::::: 10- 3 . \n\nAcknowledgements. 
\nThe work described in this paper was performed by the Jet Propulsion Laboratory, California Institute of Technology, and was supported in part by the National Aeronautics and Space Administration and the Defense Advanced Research Projects Agency through an agreement with the National Aeronautics and Space Administration. \n\nREFERENCES \n\n1. D. Rumelhart, J. McClelland, \"Parallel Distributed Processing,\" M.I.T. Press, Cambridge, MA, 1986. \n\n2. J. J. Hopfield, Neural Networks and Physical Systems with Emergent Collective Computational Abilities, Proceedings of the National Academy of Sciences USA 79 (1982), 2554-2558. \n\n3. F. J. Pineda, Generalization of Backpropagation to Recurrent and Higher Order Neural Networks, in \"Neural Information Processing Systems Proceedings,\" AIP, New York, 1988. \n\n4. L. R. Carley, Presynaptic Neural Information Processing, in \"Neural Information Processing Systems Proceedings,\" AIP, New York, 1988. \n\n", "award": [], "sourceid": 184, "authors": [{"given_name": "Raoul", "family_name": "Tawel", "institution": null}]}