{"title": "Asymptotic Convergence of Backpropagation: Numerical Experiments", "book": "Advances in Neural Information Processing Systems", "page_first": 606, "page_last": 613, "abstract": null, "full_text": "606 \n\nAhmad, Thsauro and He \n\nAsymptotic Convergence of Backpropagation: \n\nNumerical Experiments \n\nSubutai Ahmad \nICSI \n1947 Center St. \nBerkeley, CA 94704 \n\nGerald Tesauro \nmM Watson Labs. \n\nP. O. Box 704 \n\nYorktown Heights, NY \n\n10598 \n\nABSTRACT \n\nYu He \nDept. of Physics \nOhio State Univ. \nColumbus, OH 43212 \n\nWe have calculated, both analytically and in simulations, the rate \nof convergence at long times in the backpropagation learning al(cid:173)\ngorithm for networks with and without hidden units. Our basic \nfinding for units using the standard sigmoid transfer function is lit \nconvergence of the error for large t, with at most logarithmic cor(cid:173)\nrections for networks with hidden units. Other transfer functions \nmay lead to a 8lower polynomial rate of convergence. Our analytic \ncalculations were presented in (Tesauro, He & Ahamd, 1989). Here \nwe focus in more detail on our empirical measurements of the con(cid:173)\nvergence rate in numerical simulations, which confirm our analytic \nresults. \n\nINTRODUCTION \n\n1 \nBackpropagation is a popular learning algorithm for multilayer neural networks \nwhich minimizes a global error function by gradient descent (Werbos, 1974: Parker, \n1985; LeCun, 1985; Rumelhart, Hinton & Williams, 1986). In this paper, we ex(cid:173)\namine the rate of convergence of backpropagation late in learning when all of the \nerrors are small. In this limit, the learning equations become more amenable to an(cid:173)\nalytic study. By expanding in the small differences between the desired and actual \noutput states, and retaining only the dominant terms, one can explicitly solve for \nthe leading-order behavior of the weights as a function of time. 
This is true both for single-layer networks, and for multilayer networks containing hidden units. We confirm our analysis by empirical measurements of the convergence rate in numerical simulations.

In gradient-descent learning, one minimizes an error function $E$ according to:

$$\Delta \vec{w} = -\epsilon \frac{\partial E}{\partial \vec{w}} \qquad (1)$$

where $\Delta \vec{w}$ is the change in the weight vector at each time step, and the learning rate $\epsilon$ is a small numerical constant. The convergence of equation 1 for single-layer networks with general error functions and transfer functions is studied in section 2. In section 3, we examine two standard modifications of gradient descent: the use of a "margin" variable for turning off the error backpropagation, and the inclusion of a "momentum" term in the learning equation. In section 4 we consider networks with hidden units, and in the final section we summarize our results and discuss possible extensions in future work.

2 CONVERGENCE IN SINGLE-LAYER NETWORKS

The input-output relationship for single-layer networks takes the form:

$$y_p = g(\vec{w} \cdot \vec{x}_p) \qquad (2)$$

where $\vec{x}_p$ represents the state of the input units for pattern $p$, $\vec{w}$ is the real-valued weight vector of the network, $g$ is the input-output transfer function (for the moment unspecified), and $y_p$ is the output state for pattern $p$. We assume that the transfer function approaches 0 for large negative inputs and 1 for large positive inputs.

For convenience of analysis, we rewrite equation 1 for continuous time as:

$$\dot{\vec{w}} = -\epsilon \sum_p \frac{\partial E_p}{\partial \vec{w}} = -\epsilon \sum_p \frac{\partial E_p}{\partial y_p}\, g'(h_p)\, \vec{x}_p \qquad (3)$$

where $E_p$ is the individual error for pattern $p$, $h_p = \vec{w} \cdot \vec{x}_p$ is the total input activation of the output unit for pattern $p$, and the summation over $p$ is for an arbitrary subset of the possible training patterns.
$E_p$ is a function of the difference between the actual output $y_p$ and the desired output $d_p$ for pattern $p$. Examples of common error functions are the quadratic error $E_p = (y_p - d_p)^2$ and the "cross-entropy" error (Hinton, 1987) $E_p = -[d_p \log y_p + (1 - d_p) \log(1 - y_p)]$.

Instead of solving equation 3 for the weights directly, it is more convenient to work with the outputs $y_p$. The outputs evolve according to:

$$\dot{y}_p = -\epsilon\, g'(h_p) \sum_q \frac{\partial E_q}{\partial y_q}\, g'(h_q)\, \vec{x}_q \cdot \vec{x}_p \qquad (4)$$

Let us now consider the situation late in learning when the output states are approaching the desired values. We define new variables $\eta_p = y_p - d_p$, and assume that $\eta_p$ is small for all $p$.

Figure 1: Plots of ln(error) vs. ln(epochs) for single-layer networks learning the majority function using standard backpropagation without momentum. Four different learning runs starting from different random initial weights are shown. In each case, the asymptotic behavior is approximately $E \sim 1/t$, as seen by comparison with a reference line of slope $-1$.

For reasonable error functions, the individual errors $E_p$ will go to zero as some power of $\eta_p$, i.e., $E_p \sim \eta_p^\gamma$. (For the quadratic error, $\gamma = 2$, and for the cross-entropy error, $\gamma = 1$.) Similarly, the slope of the transfer function should approach zero as the output state approaches 1 or 0, and for reasonable transfer functions, this will again follow a power law, i.e., $g'(h_p) \sim \eta_p^\beta$. Using the definitions of $\eta$, $\gamma$ and $\beta$, equation 4 becomes:

$$\dot{\eta}_p \sim |\eta_p|^\beta \sum_q \eta_q^{\gamma - 1} |\eta_q|^\beta\, \vec{x}_q \cdot \vec{x}_p + \text{higher order} \qquad (5)$$

The absolute value appears because $g$ is a non-decreasing function.
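The $1/t$ measurement quoted in the caption of Figure 1 can be reproduced with a minimal single-layer simulation. The sketch below is ours, not the authors' code: the learning rate, epoch count, and use of all $2^5$ majority-function patterns (rather than their 200-instance data set) are guessed settings. It trains one sigmoid unit with quadratic error by batch gradient descent and fits the late-time slope of ln(error) vs. ln(epochs).

```python
import numpy as np

rng = np.random.default_rng(0)

# All 2^5 patterns of the 5-input majority function (linearly separable).
n = 5
X = np.array([[(i >> b) & 1 for b in range(n)] for i in range(2 ** n)], float)
d = (X.sum(axis=1) > n / 2).astype(float)

w = rng.normal(scale=0.1, size=n)   # small random initial weights
b = 0.0
eps = 0.5                            # learning rate: our guess, not the paper's
epochs = 20000
errors = np.empty(epochs)

for t in range(epochs):
    y = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid outputs y_p
    err = y - d                              # eta_p = y_p - d_p
    errors[t] = np.sum(err ** 2)             # quadratic error, gamma = 2
    grad = 2.0 * err * y * (1.0 - y)         # dE_p/dy_p * g'(h_p), beta = 1
    w -= eps * (X.T @ grad)
    b -= eps * grad.sum()

# Slope of ln(error) vs. ln(epochs) over the last decade of training;
# the analysis in the text predicts a slope of about -1.
ts = np.arange(1, epochs + 1)
late = ts > epochs // 10
slope = np.polyfit(np.log(ts[late]), np.log(errors[late]), 1)[0]
print(f"late-time slope: {slope:.2f}")
```

In our runs the fitted slope comes out close to the predicted $-1$, mirroring the straight lines of Figure 1.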
Let $\eta_r$ be the slowest to approach zero among all the $\eta_p$'s. We then have for $\eta_r$:

$$\dot{\eta}_r \sim \eta_r^{2\beta + \gamma - 1} \qquad (6)$$

Upon integrating we obtain:

$$\eta_r \sim t^{-1/(2\beta + \gamma - 2)}; \qquad E \sim \eta_r^\gamma \sim t^{-\gamma/(2\beta + \gamma - 2)} \qquad (7)$$

When $\beta = 1$, i.e., $g' \sim \eta$, the error function approaches zero like $1/t$, independent of $\gamma$. Since $\beta = 1$ for the standard sigmoid function $g(x) = (1 + e^{-x})^{-1}$, one expects to see $1/t$ behavior in the error function in this case. This behavior was in fact first seen in the numerical experiments of (Ahmad, 1988; Ahmad & Tesauro, 1988). The behavior was obtained at relatively small $t$, about 20 cycles through the training set. Figure 1 illustrates this behavior for single-layer networks learning a data set containing 200 randomly chosen instances of the majority function. In each case, the behavior at long times in this plot is approximately a straight line, indicating power-law decrease of the error. The slopes are in each case within a few percent of the theoretically predicted value of $-1$.

It turns out that $\beta = 1$ gives the fastest possible convergence of the error function. This is because $\beta < 1$ yields transfer functions which do not saturate at finite values, and thus are not allowed, while $\beta > 1$ yields slower convergence. For example, if we take the transfer function to be $g(x) = 0.5[1 + (2/\pi) \tan^{-1} x]$, then $\beta = 2$. In this case, the error function will go to zero as $E \sim t^{-\gamma/(\gamma + 2)}$. In particular, when $\gamma = 2$, $E \sim 1/\sqrt{t}$.

3 MODIFICATIONS OF GRADIENT DESCENT

One common modification to strict gradient descent is the use of a "margin" variable $\mu$ such that, if the difference between network output and teacher signal is smaller than $\mu$, no error is backpropagated. This is meant to prevent the network from devoting resources to making its output arbitrarily close to the teacher signal, which is usually unnecessary.
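The margin rule just described amounts to one extra masking step in the error backpropagation. A minimal sketch, assuming the quadratic-error sigmoid unit of section 2 (the function name and default values here are ours, not the paper's):

```python
import numpy as np

def margin_backprop_step(w, X, d, eps=0.5, mu=0.05):
    """One batch gradient-descent step on a single sigmoid unit with
    quadratic error; output errors smaller than the margin mu are not
    backpropagated (they are zeroed before computing the gradient)."""
    y = 1.0 / (1.0 + np.exp(-(X @ w)))            # outputs y_p = g(w . x_p)
    err = y - d                                   # eta_p = y_p - d_p
    err = np.where(np.abs(err) < mu, 0.0, err)    # the margin: drop small errors
    grad = 2.0 * err * y * (1.0 - y)              # dE_p/dy_p * g'(h_p)
    return w - eps * (X.T @ grad), float(np.sum((y - d) ** 2))
```

Once every $|y_p - d_p|$ falls below $\mu$, the update vanishes and the error stays constant, which is the sudden transition to flat error discussed next.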
It is clear from the structure of equations 5, 6 that the margin will not affect the basic $1/t$ error convergence, except in a rather trivial way. When a margin is employed, certain driving terms on the right-hand side of equation 5 will be set to zero as soon as they become small enough. However, as long as some non-zero driving terms are present, the basic polynomial solution of equation 7 will be unaltered. Of course, when all the driving terms disappear because they are all smaller than the margin, the network will stop learning, and the error will remain constant at some positive value. Thus the predicted behavior is $1/t$ decrease in the error followed eventually by a rapid transition to constant non-zero error. This agrees with what is seen numerically in Figure 2.

Figure 2: Plot of ln(error) vs. ln(epochs) for various values of the margin variable $\mu$ as indicated. In each case there is a $1/t$ decrease in the error followed by a sudden transition to constant error. This transition occurs earlier for larger values of $\mu$.

Another popular generalization of equation 1 includes a "momentum" term:

$$\Delta \vec{w}(t) = -\epsilon \frac{\partial E}{\partial \vec{w}}(t) + \alpha\, \Delta \vec{w}(t - 1) \qquad (8)$$

In continuous time, this takes the form:

$$\alpha \ddot{\vec{w}} + (1 - \alpha) \dot{\vec{w}} = -\epsilon \frac{\partial E}{\partial \vec{w}} \qquad (9)$$

Turning this into an equation for the evolution of outputs gives:

$$\alpha \ddot{y}_p - \alpha \frac{g''(h_p)}{[g'(h_p)]^2} [\dot{y}_p]^2 + (1 - \alpha) \dot{y}_p = -\epsilon\, g'(h_p) \sum_q \frac{\partial E_q}{\partial y_q}\, g'(h_q)\, \vec{x}_q \cdot \vec{x}_p \qquad (10)$$

Once again, expanding $y_p$, $E_p$ and $g'$ in small $\eta_p$ yields a second-order differential equation for $\eta_p$ in terms of a sum over other $\eta_q$. As in equation 6, the sum will be controlled by some dominant term $r$, and the equation for this term is:

$$C_1 \ddot{\eta}_r + C_2 \frac{[\dot{\eta}_r]^2}{\eta_r} + C_3 \dot{\eta}_r \sim \eta_r^{2\beta + \gamma - 1} \qquad (11)$$

where $C_1$, $C_2$ and $C_3$ are numerical constants.
For polynomial solutions, $\eta_r \sim t^z$, the first two terms are of order $t^{z-2}$, and can be neglected relative to the third term, which is of order $t^{z-1}$. The resulting equation thus has exactly the same form as in the zero-momentum case of section 2, and therefore the rate of convergence is the same as in equation 7. This is demonstrated numerically in Figure 3. We can see that the error behaves as $1/t$ for large $t$ regardless of the value of the momentum constant $\alpha$. Furthermore, although it is not required by the analytic theory, the numerical prefactor appears to be the same in each case.

Figure 3: Plot of ln(error) vs. ln(epochs) for single-layer networks learning the majority function, with momentum constant $\alpha = 0, 0.25, 0.5, 0.75, 0.99$. Each run starts from the same random initial weights. Asymptotic $1/t$ behavior is obtained in each case, with the same numerical prefactor.

Finally, we have also considered the effect on convergence of schemes for adaptively altering the learning rate constant $\epsilon$. It was shown analytically in (Tesauro, He & Ahmad, 1989) that for the scheme proposed by Jacobs (1988), in which the learning rate could in principle increase linearly with time, the error would decrease as $1/t^2$ for sigmoid units, instead of the $1/t$ result for fixed $\epsilon$.

4 CONVERGENCE IN NETWORKS WITH HIDDEN UNITS

We now consider networks with a single hidden layer. In (Tesauro, He & Ahmad, 1989), it was shown that if the hidden units saturate late in learning, then the convergence rate is no different from the single-layer rate. This should be typical of what usually happens.
However, assuming for purposes of argument that the hidden units do not saturate, when one goes through a small-$\eta$ expansion of the learning equation, one obtains a coupled system of equations of the following form:

$$\dot{\eta} \sim \eta^{2\beta + \gamma - 1}\, [1 + \Omega^2] \qquad (12)$$

$$\dot{\Omega} \sim \eta^{\gamma + \beta - 1} \qquad (13)$$

where $\Omega$ represents the magnitude of the second-layer weights, and for convenience all indices have been suppressed and all terms of order 1 have been written simply as 1.

For $\beta > 1$, this system has polynomial solutions of the form $\eta \sim t^z$, $\Omega \sim t^\lambda$, with $z = -3/(3\gamma + 4\beta - 4)$ and $\lambda = z(\gamma + \beta - 1) + 1$. It is interesting to note that these solutions converge slightly faster than in the single-layer case. For example, with $\gamma = 2$ and $\beta = 2$, $\eta \sim t^{-3/10}$ in the multilayer case, but as shown previously, $\eta$ goes to zero only as $t^{-1/4}$ in the single-layer case. We emphasize that this slight speed-up will only be obtained when the hidden unit states do not saturate. To the extent that the hidden units saturate and their slopes become small, the convergence rate will return to the single-layer rate.

When $\beta = 1$ the above polynomial solution is not possible. Instead, one can verify that the following is a self-consistent leading-order solution to equations 12, 13:

$$\eta \sim (t \ln t)^{-1/\gamma} \qquad (14)$$

$$\Omega^2 \sim \ln t \qquad (15)$$

Figure 4: Plot of ln(error) vs. ln(epochs) for networks with varying numbers of hidden units (0, 3, 10, and 50, as indicated) learning the majority function data set. Approximate $1/t$ behavior is obtained in each case.

Recall that in the single-layer case, $\eta \sim t^{-1/\gamma}$. Therefore, the effect of multiple layers could provide at most only a logarithmic speed-up of convergence when the hidden units do not saturate.
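The competing decay rates above are easy to tabulate. A small sketch, comparing the single-layer exponent of equation 7 with the multilayer exponent quoted for $\beta > 1$ (the function names are ours):

```python
from fractions import Fraction

def single_layer_exponent(gamma, beta):
    """eta ~ t^x with x = -1/(2*beta + gamma - 2), from equation 7."""
    return Fraction(-1, 2 * beta + gamma - 2)

def multilayer_exponent(gamma, beta):
    """eta ~ t^z with z = -3/(3*gamma + 4*beta - 4), valid for beta > 1
    and non-saturating hidden units."""
    return Fraction(-3, 3 * gamma + 4 * beta - 4)

# The example from the text: quadratic error (gamma = 2) with an
# arctan-type transfer function (beta = 2).
print(single_layer_exponent(2, 2))   # -1/4
print(multilayer_exponent(2, 2))     # -3/10: slightly faster decay
```

The multilayer exponent is more negative, reproducing the slight speed-up noted above for the non-saturating case.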
For practical purposes, then, we expect the convergence of networks with hidden units to be no different empirically from networks without hidden units. This is in fact what our simulations find, as illustrated in Figure 4.

5 DISCUSSION

We have obtained results for the asymptotic convergence of gradient-descent learning which are valid for a wide variety of error functions and transfer functions. We typically expect the same rate of convergence to be obtained regardless of whether or not the network has hidden units. However, it may be possible to obtain a slight polynomial speed-up when $\beta > 1$ or a logarithmic speed-up when $\beta = 1$. We point out that in all cases, the sigmoid provides the maximum possible convergence rate, and is therefore a "good" transfer function to use in that sense.

We have not attempted analysis of networks with multiple layers of hidden units; however, the analysis of (Tesauro, He & Ahmad, 1989) suggests that, to the extent that the hidden unit states saturate and the $g'$ factors vanish, the rate of convergence would be no different even in networks with arbitrary numbers of hidden layers.

Another important finding is that the expected rate of convergence does not depend on the use of all $2^n$ input patterns in the training set. The same behavior should be seen for general subsets of training data. This is also in agreement with our numerical results, and with the results of (Ahmad, 1988; Ahmad & Tesauro, 1988).

In conclusion, a combination of analysis and numerical simulations has led to insight into the late stages of gradient-descent learning. It might also be possible to extend our approach to times earlier in the learning process, when not all of the errors are small.
One might also be able to analyze the numbers, sizes and shapes of the basins of attraction for gradient-descent learning in feed-forward networks. Another important issue is the behavior of the generalization performance, i.e., the error on a set of test patterns not used in training, which was not addressed in this paper. Finally, our analysis might provide insight into the development of new algorithms which might scale more favorably than backpropagation.

References

S. Ahmad. (1988) A study of scaling and generalization in neural networks. Master's Thesis, Univ. of Illinois at Urbana-Champaign, Dept. of Computer Science.

S. Ahmad & G. Tesauro. (1988) Scaling and generalization in neural networks: a case study. In D. S. Touretzky et al. (eds.), Proceedings of the 1988 Connectionist Models Summer School, 3-10. San Mateo, CA: Morgan Kaufmann.

G. E. Hinton. (1987) Connectionist learning procedures. Technical Report No. CMU-CS-87-115, Dept. of Computer Science, Carnegie-Mellon University.

R. A. Jacobs. (1988) Increased rates of convergence through learning rate adaptation. Neural Networks 1:295-307.

Y. LeCun. (1985) A learning procedure for asymmetric networks. Proceedings of Cognitiva (Paris) 85:599-604.

D. B. Parker. (1985) Learning-logic. Technical Report No. TR-47, MIT Center for Computational Research in Economics and Management Science.

D. E. Rumelhart, G. E. Hinton, & R. J. Williams. (1986) Learning representations by back-propagating errors. Nature 323:533-536.

G. Tesauro, Y. He & S. Ahmad. (1989) Asymptotic convergence of backpropagation. Neural Computation 1:382-391.

P. Werbos. (1974) Ph.D. Thesis, Harvard University.
", "award": [], "sourceid": 238, "authors": [{"given_name": "Subutai", "family_name": "Ahmad", "institution": null}, {"given_name": "Gerald", "family_name": "Tesauro", "institution": null}, {"given_name": "Yu", "family_name": "He", "institution": null}]}