{"title": "A Cost Function for Internal Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 733, "page_last": 740, "abstract": null, "full_text": "A Cost Function for Internal Representations \n\nAnders Krogh \nThe Niels Bohr Institute \nBlegdamsvej 17 \n2100 Copenhagen \nDenmark \n\nG. I. Thorbergsson \nNordita \nBlegdamsvej 17 \n2100 Copenhagen \nDenmark \n\nJohn A. Hertz \nNordita \nBlegdamsvej 17 \n2100 Copenhagen \nDenmark \n\nABSTRACT \n\nWe introduce a cost function for learning in feed-forward neural networks which is an explicit function of the internal representations in addition to the weights. The learning problem can then be formulated as two simple perceptrons and a search for internal representations. Back-propagation is recovered as a limit. The frequency of successful solutions is higher for this algorithm than for back-propagation when weights and hidden units are updated on the same timescale, i.e. once every learning step. \n\n1 INTRODUCTION \n\nIn their review of back-propagation in layered networks, Rumelhart et al. (1986) describe the learning process in terms of finding good \"internal representations\" of the input patterns on the hidden units. However, the search for these representations is an indirect one, since the variables adjusted in its course are the connection weights, not the activations of the hidden units themselves when specific input patterns are fed into the input layer. Rather, the internal representations are encoded implicitly in the connection weight values. \n\nMore recently, Grossman et al. (1988, 1989)1 suggested a way in which the search for internal representations could be made much more explicit. They proposed to make the activations of the hidden units for each of the input patterns explicit variables, to be adjusted iteratively (together with the weights) in the learning process. However, although they found that the algorithm they gave for making these adjustments could be effective in some test problems, it is rather ad hoc, and it is difficult to see whether it will converge to a good solution. \n\n1 See also the paper by Grossman in this volume. \n\nIf an optimization task is posed in terms of a cost function which is systematically reduced as the algorithm runs, one is in a much better position to answer such questions. This is the motivation for the present work, in which we construct a cost function that is an explicit function of the internal representations as well as the connection weights. Learning is then a descent on the cost-function surface, and variations in the algorithm, corresponding to variations in the parameters of the cost function, can be studied systematically. Both the conventional back-propagation algorithm and that of Grossman et al. can be recovered in special limits of ours. It is also easy to modify the algorithm to include constraints on the learning. \n\nA method somewhat similar to ours has been proposed by Rohwer (1989)2. He considers networks with feedback, whereas in this paper we study feed-forward networks. Le Cun has also been working along the same lines, but in a quite different formulation (Le Cun, 1987). \n\nThe learning problem for a two-layer perceptron is reduced to learning in two simple perceptrons and the search for internal representations. This search can be carried out by gradient descent on the cost function or by an iterative method. \n\n2 THE COST FUNCTION \n\nWe work within the standard architecture, with three layers of units and two layers of connections. Input pattern number $\mu$ is denoted $\xi_k^\mu$, the corresponding target pattern $\zeta_i^\mu$, and its internal representation $\sigma_j^\mu$. 
We use a convention in which i always labels output units, j labels hidden units, and k labels input units. Thus $w_{ij}$ is always a hidden-to-output weight and $w_{jk}$ an input-to-hidden connection weight. The actual activations of the hidden units when pattern $\mu$ is the input are then \n\n$$S_j^\mu = g(h_j^\mu) = g\Bigl(\sum_k w_{jk}\,\xi_k^\mu\Bigr) \qquad (1)$$ \n\nand those of the output units, when given the internal representations $\sigma_j^\mu$ as inputs, are \n\n$$S_i^\mu = g(h_i^\mu) = g\Bigl(\sum_j w_{ij}\,\sigma_j^\mu\Bigr) \qquad (2)$$ \n\nwhere $g(h)$ is the activation function, which we take to be $\tanh h$. \n\nThe cost function has two terms: one describes simple delta-rule learning (Rumelhart et al., 1986) of the internal representations from the inputs by the first layer of connections, and the other describes the same kind of learning of the target patterns from the internal representations in the second layer of connections. We use the \"entropic\" form for these terms: \n\n$$E = \sum_{i\mu}\sum_{\pm} \tfrac{1}{2}(1 \pm \zeta_i^\mu)\,\ln\frac{1 \pm \zeta_i^\mu}{1 \pm S_i^\mu} + T \sum_{j\mu}\sum_{\pm} \tfrac{1}{2}(1 \pm \sigma_j^\mu)\,\ln\frac{1 \pm \sigma_j^\mu}{1 \pm S_j^\mu} \qquad (3)$$ \n\nThis form of the cost function has been shown to reduce the learning time (Solla et al., 1988). We allow different relative weights for the two terms through the parameter T. This cost function is to be minimized with respect to the two sets of connection weights $w_{ij}$ and $w_{jk}$ and the internal representations $\sigma_j^\mu$. \n\nThe resulting gradient-descent learning equations for the connection weights are simply those of simple one-layer perceptrons: \n\n$$\frac{\partial w_{ij}}{\partial t} \propto -\frac{\partial E}{\partial w_{ij}} = \sum_\mu (\zeta_i^\mu - S_i^\mu)\,\sigma_j^\mu = \sum_\mu \delta_i^\mu \sigma_j^\mu \qquad (4)$$ \n\n$$\frac{\partial w_{jk}}{\partial t} \propto -\frac{\partial E}{\partial w_{jk}} = T\sum_\mu (\sigma_j^\mu - S_j^\mu)\,\xi_k^\mu = T\sum_\mu \delta_j^\mu \xi_k^\mu \qquad (5)$$ \n\nThe new element is the corresponding equation for the adjustment of the internal representations: \n\n$$\frac{\partial \sigma_j^\mu}{\partial t} \propto -\frac{\partial E}{\partial \sigma_j^\mu} = \sum_i \delta_i^\mu w_{ij} + T h_j^\mu - T \tanh^{-1}\sigma_j^\mu \qquad (6)$$ \n\nThe stationary values of the internal representations thus solve \n\n$$\sigma_j^\mu = \tanh\Bigl(h_j^\mu + T^{-1}\sum_i \delta_i^\mu w_{ij}\Bigr) \qquad (7)$$ \n\nwhich has a simple interpretation: the internal representation variables $\sigma_j^\mu$ are like conventional units except that, in addition to the field fed forward into them from the input layer, they also feel the back-propagated error field $b_j^\mu = \sum_i \delta_i^\mu w_{ij}$. The parameter T regulates the relative weights of these terms. \n\nInstead of doing gradient descent we have iterated equation (7) to find the internal representations. \n\nOne of the advantages of formulating the learning problem in terms of a cost function is that it is easy to implement constraints on the learning. Suppose we want to prevent the network from forming the same internal representations for different output patterns. We can then add the term \n\n$$E_\gamma = \frac{\gamma}{2}\sum_{ij}\sum_{\mu\nu} \zeta_i^\mu \zeta_i^\nu \sigma_j^\mu \sigma_j^\nu \qquad (8)$$ \n\nto the energy. We may also want to suppress internal representations in which the units have identical values, by adding a second term (9), of strength $\gamma'$, to the energy; this may be seen as an attempt to produce efficient representations. The parameters $\gamma$ and $\gamma'$ can be tuned to get the best performance. With these new terms, equation (7) for the internal representations acquires the extra gradients $-\partial(E_\gamma + E_{\gamma'})/\partial\sigma_j^\mu$ inside the $\tanh$, and the only change in the algorithm is that this modified equation is iterated rather than (7). These terms lead to better performance in some problems, but the benefit of including them is very problem-dependent. We include in our results an example where they are useful. \n\n2 See also the paper by Rohwer in this volume. 
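The fixed-point iteration of equation (7) is simple enough to sketch in a few lines of NumPy. This is our own illustrative sketch, not the authors' code: the function name relax_reps, the array shapes (patterns by units), the starting point, and the iteration count are all assumptions, and all patterns are relaxed in parallel.

```python
import numpy as np

def relax_reps(W_out, h_in, zeta, T=10.0, n_iter=200):
    # Iterate equation (7): sigma = tanh(h + (1/T) * b), where b is the
    # back-propagated error field b_j = sum_i delta_i * w_ij.
    # h_in:  (patterns, hidden)  fields fed forward from the inputs, eq. (1)
    # W_out: (outputs, hidden)   hidden-to-output weights
    # zeta:  (patterns, outputs) target patterns
    sigma = np.tanh(h_in)                      # start from the feed-forward value
    for _ in range(n_iter):
        S_out = np.tanh(sigma @ W_out.T)       # output activations, eq. (2)
        delta = zeta - S_out                   # output errors
        sigma = np.tanh(h_in + (delta @ W_out) / T)
    return sigma
```

At large T the update is a contraction and the iteration settles quickly on the stationary internal representations; the paper iterates this equation instead of doing gradient descent on the sigmas.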
\n\n3 SIMPLE LIMITS \n\nIt is simple to recover ordinary back-propagation in this model: it is the limit $T \gg 1$. Expanding (7) we obtain \n\n$$\sigma_j^\mu = S_j^\mu + T^{-1}\sum_i \delta_i^\mu w_{ij}\,(1 - \tanh^2 h_j^\mu) \qquad (11)$$ \n\nKeeping only the lowest-order surviving terms, the learning equations for the connection weights then reduce to \n\n$$\frac{\partial w_{ij}}{\partial t} \propto \sum_\mu \delta_i^\mu S_j^\mu \qquad (12)$$ \n\nand \n\n$$\frac{\partial w_{jk}}{\partial t} \propto \sum_\mu \sum_i \delta_i^\mu w_{ij}\,(1 - \tanh^2 h_j^\mu)\,\xi_k^\mu \qquad (13)$$ \n\nwhich are just the standard back-propagation equations (with an entropic cost function). \n\nNow consider the opposite limit, $T \ll 1$. Then the second term dominates in (7): \n\n$$\sigma_j^\mu = \mathrm{sgn}\Bigl(\sum_i \delta_i^\mu w_{ij}\Bigr) \qquad (14)$$ \n\nAn algorithm similar to that of Grossman et al. is then to train the input-to-hidden connection weights with these $\sigma_j^\mu$ as targets, while training the hidden-to-output weights with the $\sigma_j^\mu$ obtained in the other limit (7) as inputs. That is, one alternates between high and low T according to which layer of weights one is adjusting. \n\n4 RESULTS \n\nThere are many ways to do the optimization in practice. To be able to make a comparison with back-propagation, we have made simulations that, at high T, are essentially the same as back-propagation (in terms of weight adjustment). \n\nIn one set of simulations we have kept the internal representations $\sigma_j^\mu$ optimal for the given set of connections. This means that after each step of weight changes we have relaxed the $\sigma$'s fully. One can think of the $\sigma$'s as fast-varying and the weights as slowly varying. In the $T \gg 1$ limit we can use these simulations to get a comparison with back-propagation, as described in the previous section. \n\nIn our second set of simulations we iterate the equation for the $\sigma$'s only once after each step of weight updating. All variables are then updated on the same timescale. This turns out to increase the success rate for learning considerably compared to the back-propagation limit. The $\sigma$'s are updated in random order such that each one is updated once on average. 
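The second simulation scheme, in which weights and internal representations are updated on the same timescale, can be sketched as a single learning step. This is again a minimal sketch under our own assumptions (names, shapes, the learning rate eta, no momentum term), not the authors' code:

```python
import numpy as np

def learning_step(W_out, W_in, xi, zeta, sigma, T=1.0, eta=0.1, rng=None):
    # One learning step: delta-rule updates for the two simple perceptrons
    # (eqs. (4)-(5)) followed by a single random-order sweep of eq. (7).
    if rng is None:
        rng = np.random.default_rng()
    h_in = xi @ W_in.T                               # feed-forward fields, eq. (1)
    delta_out = zeta - np.tanh(sigma @ W_out.T)      # output errors
    W_out = W_out + eta * delta_out.T @ sigma        # eq. (4)
    W_in = W_in + eta * T * (sigma - np.tanh(h_in)).T @ xi   # eq. (5)
    for j in rng.permutation(sigma.shape[1]):        # each sigma updated once,
        b_j = delta_out @ W_out[:, j]                # in random order
        sigma[:, j] = np.tanh(h_in[:, j] + b_j / T)
    return W_out, W_in, sigma
```

Repeating this step is one updating of the weights per learning step, with the stored sigmas carried over between sweeps as the paper describes.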
\n\nThe learning rate, momentum, etc. have been chosen optimally for the back-propagation limit (large T) and kept fixed at these values for other values of T (though no systematic optimization of parameters has been done). \n\nWe have tested the algorithm on the parity and encoding problems for T = 1 and T = 10 (the back-propagation limit). Each problem was run 100 times, and the average error and success rate were measured and plotted as functions of learning steps (time). One learning step corresponds to one updating of the weights. \n\nFor the parity problem (and other similar tasks) the learning did not converge for T lower than about 3. When the weights are small we can expand the tanh on the output in equation (7), \n\n$$\sigma_j^\mu \approx \tanh\Bigl(h_j^\mu + T^{-1}\sum_i w_{ij}\bigl[\zeta_i^\mu - \sum_{j'} w_{ij'}\sigma_{j'}^\mu\bigr]\Bigr) \qquad (15)$$ \n\nso $\sigma_j^\mu$ sits in a spin-glass-like \"local field\" except for the connection to itself. When the algorithm is started with small random weights this self-coupling, of strength $\sum_i (w_{ij})^2$, is dominant. Forcing the self-coupling to be small while the weights are small, and gradually increasing it to full strength when the units saturate, improves the performance considerably. \n\nFor larger networks the self-coupling does not seem to be a problem. \n\nThe specific test problems were: \n\nParity with 4 input units and 4 hidden units and all 16 patterns in the training set. We stop the runs after 300 sweeps of the training set. For T = 1 the self-coupling is suppressed. \n\nEncoding with 8 input, 3 hidden and 8 output units and 8 patterns to learn (same input as output). The 8 patterns have -1 at all units but one. We stop the runs after 500 sweeps of the training set. \n\nBoth problems were run with fast-varying $\sigma$'s and with all variables updated on the same timescale. We determined the average learning time of the successful runs and the percentage of the 100 trials that were successful. 
The success criterion was that the sign of the output was correct. The learning times and success rates are shown in table 1. \n\nTable 1: Learning Times and Success Rates \n\n                                Learning times        Success rate \n                                T=1       T=10        T=1     T=10 \nFast-varying σ's    Parity      130±10    97±6        48%     30% \n                    Encoding    167±10    88±4        98%     95% \nSlow-varying σ's    Parity      146±10    121±6       57%     36% \n                    Encoding    145±8     64±2        100%    99% \n\nIn figure 1 we plot the average error as a function of learning steps and the success rate for each set of runs. \n\nIt can seem a disadvantage of this method that it is necessary to store the values of the $\sigma$'s between learning sweeps. We have therefore tried to start the iteration of equation (7) with the value $\sigma_j^\mu = \tanh(\sum_k w_{jk}\xi_k^\mu)$ on the right-hand side. This does not affect the performance much. \n\nWe have investigated the effect of including the terms (8) and (9) in the energy. For the same parity problem as above we get an improved success rate in the high-T limit. \n\n5 CONCLUSION \n\nThe most striking result is the improvement in the success rate when all variables, weights and hidden units, are updated once every learning step. This is in contrast to back-propagation, where the values of the hidden units are completely determined by the weights and inputs. In our formulation back-propagation corresponds to relaxing the hidden units fully in every learning cycle and having the parameter $T \gg 1$. There is then an advantage in considering the hidden units as additional variables during the learning phase, whose values are not completely determined by the field fed forward to them from the inputs. \n\nThe results indicate that the performance of the algorithm is best in the high-T limit. \n\nFor the parity problem the performance of the algorithm presented here is similar to that of the back-propagation algorithm measured in learning time. 
The real advantage is the higher frequency of successful solutions. For the encoding problem the algorithm is faster than back-propagation, but the success rate is similar (about 100%). The algorithm should also be comparable to back-propagation in cpu time in the limit where all variables are updated on the same timescale (once every learning sweep). \n\nFigure 1: (A) The left plot shows the error as a function of learning time for the 4-parity problem, for those runs that converged within 300 learning steps. The curves are: T = 10 and slow sigmas (solid), T = 10 and fast sigmas (dash-dotted), T = 1 and slow sigmas (dashed), and T = 1 and fast sigmas (dotted). The right plot is the percentage of converged runs as a function of learning time. (B) The same as above, but for the encoding problem. 
\n\nBecause the computational complexity is shifted from the calculation of new weights to the determination of internal representations, it might be easier to implement this method in hardware than back-propagation is. It is possible to use the method without saving the array of internal representations, by using the field fed forward from the inputs to generate an internal representation that then becomes a starting point for iterating the equation for $\sigma$. \n\nThe method can easily be generalized to networks with feedback (as in [Rohwer, 1989]), and it would be interesting to see how it compares to other algorithms for recurrent networks. There are many other directions in which one can continue this work. One is to try another cost function. Another is to use binary units and perceptron learning. \n\nReferences \n\nLe Cun, Y (1987). Modeles Connexionistes de l'Apprentissage. Thesis, Paris. \n\nGrossman, T, R Meir and E Domany (1988). Learning by Choice of Internal Representations. Complex Systems 2, 555. \n\nGrossman, T (1989). The CHIR Algorithm: A Generalization for Multiple Output and Multilayered Networks. Preprint, submitted to Complex Systems. \n\nRohwer, R (1989). The \"Moving Targets\" Training Method. Preprint, Edinburgh. \n\nRumelhart, D E, G E Hinton and R J Williams (1986). Chapter 8 in Parallel Distributed Processing, vol 1 (D E Rumelhart and J L McClelland, eds), MIT Press. \n\nSolla, S A, E Levin and M Fleisher (1988). Accelerated Learning in Layered Neural Networks. Complex Systems 2, 625. ", "award": [], "sourceid": 229, "authors": [{"given_name": "Anders", "family_name": "Krogh", "institution": null}, {"given_name": "C.", "family_name": "Thorbergsson", "institution": null}, {"given_name": "John", "family_name": "Hertz", "institution": null}]}