{"title": "Backpropagation Convergence Via Deterministic Nonmonotone Perturbed Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 383, "page_last": 390, "abstract": null, "full_text": "Backpropagation Convergence Via Deterministic Nonmonotone Perturbed Minimization \n\nO. L. Mangasarian & M. V. Solodov \n\nComputer Sciences Department \n\nUniversity of Wisconsin \n\nMadison, WI 53706 \n\nEmail: olvi@cs.wisc.edu, solodov@cs.wisc.edu \n\nAbstract \n\nThe fundamental backpropagation (BP) algorithm for training artificial neural networks is cast as a deterministic nonmonotone perturbed gradient method. Under certain natural assumptions, such as the series of learning rates diverging while the series of their squares converging, it is established that every accumulation point of the online BP iterates is a stationary point of the BP error function. The results presented cover serial and parallel online BP, modified BP with a momentum term, and BP with weight decay. \n\n1 INTRODUCTION \n\nWe regard training artificial neural networks as an unconstrained minimization problem \n\n$$\\min_{x \\in \\mathbb{R}^n} f(x) := \\sum_{j=1}^{N} f_j(x) \\qquad (1)$$ \n\nwhere $f_j : \\mathbb{R}^n \\to \\mathbb{R}$, $j = 1, \\ldots, N$, are continuously differentiable functions from the n-dimensional real space $\\mathbb{R}^n$ to the real numbers $\\mathbb{R}$. Each function $f_j$ represents the error associated with the j-th training example, and N is the number of examples in the training set. The n-dimensional variable space here is that of the weights associated with the arcs of the neural network and the thresholds of the hidden and output units. For an explicit description of f(x) see (Mangasarian, 1993). We note that our convergence results are equally applicable to any other form of the error function, provided that it is smooth. 
\nBP (Rumelhart, Hinton & Williams, 1986; Khanna, 1989) has long been successfully used by the artificial intelligence community for training artificial neural networks. Curiously, there seem to be no published deterministic convergence results for this method. The primary reason for this is the nonmonotonic nature of the process. Every iteration of online BP is a step in the direction of the negative gradient of a partial error function associated with a single training example, e.g. $f_j(x)$ in (1). It is clear that there is no guarantee that such a step will decrease the full objective function f(x), which is the sum of the errors for all the training examples. Therefore a single iteration of BP may, in fact, increase rather than decrease the objective function f(x) we are trying to minimize. This difficulty makes convergence analysis of BP a challenging problem that has recently attracted the interest of many researchers (Mangasarian & Solodov, 1994; Gaivoronski, 1994; Grippo, 1994; Luo & Tseng, 1994; White, 1989). \n\nBy using stochastic approximation ideas (Kashyap, Blaydon & Fu, 1970; Ermoliev & Wets, 1988), White (1989) has shown that, under certain stochastic assumptions, the sequence of weights generated by BP either diverges or converges almost surely to a point that is a stationary point of the error function. More recently, Gaivoronski obtained stronger stochastic results (Gaivoronski, 1994). It is worth noting that even if the data is assumed to be deterministic, the best that stochastic analysis can do is to establish convergence of certain sequences with probability one. This means that convergence is not guaranteed. Indeed, there may exist some noise patterns for which the algorithm diverges, even though this event is claimed to be unlikely. \n\nBy contrast, our approach is purely deterministic. 
In particular, we show that online BP can be viewed as an ordinary perturbed nonmonotone gradient-type algorithm for unconstrained optimization (Section 3). We note in passing that the term gradient descent, which is widely used in the backpropagation and neural networks literature, is incorrect. From an optimization point of view, online BP is not a descent method, because there is no guaranteed decrease in the objective function at each step. We thus prefer to refer to it as a nonmonotone perturbed gradient algorithm. \n\nWe give a convergence result for serial BP (Algorithm 2.1), parallel BP (Algorithm 2.2), a modified BP with a momentum term, and BP with weight decay. To the best of our knowledge, there is no published convergence analysis, either stochastic or deterministic, for the latter three versions of BP. The proposed parallel algorithm is an attempt to accelerate convergence of BP, which is generally known to be relatively slow. \n\n2 CONVERGENCE OF THE BACKPROPAGATION ALGORITHM AND ITS MODIFICATIONS \n\nWe now turn our attention to the classical BP algorithm for training feedforward artificial neural networks with one layer of hidden units (Rumelhart, Hinton & Williams, 1986; Khanna, 1989). Throughout our analysis the number of hidden units is assumed to be fixed. The choice of network topology is a separate issue that is not addressed in this work. For some methods for choosing the number of hidden units see (Courrien, 1993; Arai, 1993). \n\nWe now summarize our notation. \n\nN : Nonnegative integer denoting number of examples in the training set. \ni = 1, 2, ... : Index number of major iterations (epochs) of BP. Each major iteration consists of going through the entire set of error functions $f_1(x), \\ldots, f_N(x)$. \nj = 1, ..., N : Index of minor iterations. 
Each minor iteration j consists of a step in the direction of the negative gradient $-\\nabla f_{m(j)}(z^{i,j})$ and a momentum step. Here m(j) is an element of the permuted set {1, ..., N}, and $z^{i,j}$ is defined immediately below. Note that if the training set is randomly permuted after every epoch, the map m(\u00b7) depends on the index i. For simplicity, we skip this dependence in our notation. \n$x^i$ : Iterate in $\\mathbb{R}^n$ of major iteration (epoch) i = 1, 2, .... \n$z^{i,j}$ : Iterate in $\\mathbb{R}^n$ of minor iteration j = 1, ..., N, within major iteration i = 1, 2, .... Iterates $z^{i,j}$ can be thought of as elements of a matrix with N columns and an infinite number of rows, with row i corresponding to the i-th epoch of BP. \n\nBy $\\eta_i$ we shall denote the learning rate (the coefficient multiplying the gradient), and by $\\alpha_i$ the momentum rate (the coefficient multiplying the momentum term). For simplicity we shall assume that the learning and momentum rates remain fixed within each major iteration. In a manner similar to that of conjugate gradients (Polyak, 1987) we reset the momentum term to zero periodically. \n\nAlgorithm 2.1. Serial Online BP with a Momentum Term. \nStart with any $x^0 \\in \\mathbb{R}^n$. Having $x^i$, stop if $\\nabla f(x^i) = 0$, else compute $x^{i+1}$ as follows: \n\n$$z^{i,1} = x^i \\qquad (2)$$ \n\n$$z^{i,j+1} = z^{i,j} - \\eta_i \\nabla f_{m(j)}(z^{i,j}) + \\alpha_i \\Delta z^{i,j}, \\quad j = 1, \\ldots, N \\qquad (3)$$ \n\n$$x^{i+1} = z^{i,N+1} \\qquad (4)$$ \n\nwhere \n\n$$\\Delta z^{i,j} = \\begin{cases} 0 & \\text{if } j = 1 \\\\ z^{i,j} - z^{i,j-1} & \\text{otherwise} \\end{cases} \\qquad (5)$$ \n\nRemark 2.1. Note that the stopping criterion of this algorithm is the one typically used in first-order optimization methods, and is not explicitly related to the ability of the neural network to generalize. However, since we are concerned with convergence properties of BP as a numerical algorithm, this stopping criterion is justified. Another point related to the issue of generalization versus convergence is the following. 
Our analysis allows the use of a weight decay term in the objective function (Hinton, 1986; Weigend, Huberman & Rumelhart, 1990) which often yields a network with better generalization properties. In the latter case the minimization problem becomes \n\n$$\\min_{x \\in \\mathbb{R}^n} f(x) := \\sum_{j=1}^{N} f_j(x) + \\lambda \\|x\\|^2 \\qquad (6)$$ \n\nwhere $\\lambda$ is a small positive scaling factor. \n\nRemark 2.2. The choice of $\\alpha_i = 0$ reduces Algorithm 2.1 to the original BP without a momentum term. \n\nRemark 2.3. We can easily handle the \"mini-batch\" methods (M\u00f8ller, 1992) by merely redefining the meaning of the partial error function $f_j$ to represent the error associated with a subset of training examples. Thus \"mini-batch\" methods also fall within our framework. \n\nWe next present a parallel modification of BP. Suppose we have k parallel processors, $k \\geq 1$. We consider a partition of the set {1, ..., N} into the subsets $J_l$, $l = 1, \\ldots, k$, so that each example is assigned to at least one processor. Let $N_l$ be the cardinality of the corresponding set $J_l$. In the parallel BP each processor performs one (or more) cycles of serial BP on its set of training examples. Then a synchronization step is performed that consists of averaging the iterates computed by all the k processors. From the mathematical point of view this is equivalent to each processor $l \\in \\{1, \\ldots, k\\}$ handling the partial error function $f^l(x)$ associated with the corresponding set of training examples $J_l$. In this setting we have \n\n$$f^l(x) = \\sum_{j \\in J_l} f_j(x), \\qquad f(x) = \\sum_{l=1}^{k} f^l(x)$$ \n\nWe note that in training a neural network it might be advantageous to assign some training examples to more than one parallel processor. We thus allow for the possibility of overlapping sets $J_l$. \n\nThe notation for Algorithm 2.2 is similar to that for Algorithm 2.1, except for the index l that is used to label the partial error function and minor iterates associated with the l-th parallel processor. 
We now state the parallel BP with a momentum term. \n\nAlgorithm 2.2. Parallel Online BP with a Momentum Term. \nStart with any $x^0 \\in \\mathbb{R}^n$. Having $x^i$, stop if $x^{i+1} = x^i$, else compute $x^{i+1}$ as follows: \n\n(i) Parallelization. For each parallel processor $l \\in \\{1, \\ldots, k\\}$ do \n\n$$z_l^{i,1} = x^i \\qquad (7)$$ \n\n$$z_l^{i,j+1} = z_l^{i,j} - \\eta_i \\nabla f^l_{m(j)}(z_l^{i,j}) + \\alpha_i \\Delta z_l^{i,j}, \\quad j = 1, \\ldots, N_l \\qquad (8)$$ \n\nwhere \n\n$$\\Delta z_l^{i,j} = \\begin{cases} 0 & \\text{if } j = 1 \\\\ z_l^{i,j} - z_l^{i,j-1} & \\text{otherwise} \\end{cases} \\qquad (9)$$ \n\n$$0 < \\eta_i < 1, \\quad 0 \\leq \\alpha_i < 1$$ \n\n(ii) Synchronization \n\n$$x^{i+1} = \\frac{1}{k} \\sum_{l=1}^{k} z_l^{i,N_l+1} \\qquad (10)$$ \n\nWe give below in Table 1 a flowchart of this algorithm. \n\nTable 1. Flowchart of the Parallel BP. Major iteration i starts from $x^i$; each processor l = 1, ..., k sets $z_l^{i,1} := x^i$ and runs serial BP on the examples in $J_l$ to obtain $z_l^{i,N_l+1}$; major iteration i+1 then sets $x^{i+1} = \\frac{1}{k} \\sum_{l=1}^{k} z_l^{i,N_l+1}$. \n\nRemark 2.4. It is well known that ordinary backpropagation is a relatively slow algorithm. One appealing remedy is parallelization (Zhang, Mckenna, Mesirov & Waltz, 1990). The proposed Algorithm 2.2 is a possible step in that direction. Note that in Algorithm 2.2 all processors typically use the same program for their computations. Thus load balancing is easily achieved. \n\nRemark 2.5. We wish to point out that synchronization strategies other than (10) are possible. For example, one may choose among the k sets of weights and thresholds the one that best classifies the training data. 
\n\nTo the best of our knowledge there are no published deterministic convergence proofs for either Algorithm 2.1 or Algorithm 2.2. Using new convergence analysis for a class of nonmonotone optimization methods with perturbations (Mangasarian & Solodov, 1994), we are able to derive deterministic convergence properties for online BP and its modifications. Once again we emphasize the equivalence of either of those methods to a deterministic nonmonotone perturbed gradient-type algorithm. \n\nWe now state our main convergence theorem. An important result used in the proof is given in the Mathematical Appendix. We refer interested readers to (Mangasarian & Solodov, 1994) for more details. \n\nTheorem 2.1. If the learning and momentum rates are chosen such that \n\n$$\\sum_{i=0}^{\\infty} \\eta_i = \\infty, \\quad \\sum_{i=0}^{\\infty} \\eta_i^2 < \\infty, \\quad \\sum_{i=0}^{\\infty} \\alpha_i \\eta_i < \\infty, \\qquad (11)$$ \n\nthen for any sequence $\\{x^i\\}$ generated by any of the Algorithms 2.1 or 2.2, it follows that $\\{f(x^i)\\}$ converges, $\\{\\nabla f(x^i)\\} \\to 0$, and for each accumulation point $\\bar{x}$ of the sequence $\\{x^i\\}$, $\\nabla f(\\bar{x}) = 0$. \n\nRemark 2.6. We note that conditions (11) imply that both the learning and momentum rates asymptotically tend to zero. These conditions are similar to those used in (White, 1989; Luo & Tseng, 1994) and seem to be the inevitable price paid for rigorous convergence. For practical purposes the learning rate can be fixed or adjusted to some small but finite number to obtain an approximate solution to the minimization problem. For state-of-the-art techniques of computing the learning rate see (Le Cun, Simard & Pearlmutter, 1993). \n\nRemark 2.7. We wish to point out that Theorem 2.1 covers BP with momentum and/or decay terms for which there is no published convergence analysis of any kind. \n\nRemark 2.8. 
We note that the approach of perturbed minimization provides theoretical justification for the well-known properties of robustness and recovery from damage for neural networks (Sejnowski & Rosenberg, 1987). In particular, this approach shows that the net should recover from any reasonably small perturbation. \n\nRemark 2.9. Establishing convergence to a stationary point seems to be the best one can do for a first-order minimization method without any additional restrictive assumptions on the objective function. There have been some attempts to achieve global descent in training, see for example, (Cetin, Burdick & Barhen, 1993). However, convergence to global minima was not proven rigorously in the multidimensional case. \n\n3 MATHEMATICAL APPENDIX: CONVERGENCE OF ALGORITHMS WITH PERTURBATIONS \n\nIn this section we state a new result that enables us to establish convergence properties of BP. The full proof is nontrivial and is given in (Mangasarian & Solodov, 1994). \n\nTheorem 3.1. General Nonmonotonic Perturbed Gradient Convergence (subsumes BP convergence). \nSuppose that f(x) is bounded below and that $\\nabla f(x)$ is bounded and Lipschitz continuous on the sequence $\\{x^i\\}$ defined below. Consider the following perturbed gradient algorithm. Start with any $x^0 \\in \\mathbb{R}^n$. Having $x^i$, stop if $\\nabla f(x^i) = 0$, else compute \n\n$$x^{i+1} = x^i + \\eta_i d^i \\qquad (12)$$ \n\nwhere \n\n$$d^i = -\\nabla f(x^i) + e^i \\qquad (13)$$ \n\nfor some $e^i \\in \\mathbb{R}^n$, $\\eta_i \\in \\mathbb{R}$, $\\eta_i > 0$ and such that for some $\\gamma > 0$ \n\n$$\\sum_{i=0}^{\\infty} \\eta_i = \\infty, \\quad \\sum_{i=0}^{\\infty} \\eta_i^2 < \\infty, \\quad \\sum_{i=0}^{\\infty} \\eta_i \\|e^i\\| < \\infty, \\quad \\|e^i\\| \\leq \\gamma \\ \\ \\forall i \\qquad (14)$$ \n\nIt follows that $\\{f(x^i)\\}$ converges, $\\{\\nabla f(x^i)\\} \\to 0$, and for each accumulation point $\\bar{x}$ of the sequence $\\{x^i\\}$, $\\nabla f(\\bar{x}) = 0$. If, in addition, the number of stationary points of f(x) is finite, then the sequence $\\{x^i\\}$ converges to a stationary point of f(x). \n\nRemark 3.1. 
The error function of BP is nonnegative, and thus the boundedness condition on f(x) is satisfied automatically. There are a number of ways to ensure that f(x) has Lipschitz continuous and bounded gradient on $\\{x^i\\}$. In (Luo & Tseng, 1994) a simple projection onto a box is introduced which ensures that the iterates remain in the box. In (Grippo, 1994) a regularization term as in (6) is added to the error function so that the modified objective function has bounded level sets. We note that the latter provides a mathematical justification for weight decay (Hinton, 1986; Weigend, Huberman & Rumelhart, 1990). In either case the iterates remain in some compact set, and since f(x) is an infinitely smooth function, its gradient is bounded and Lipschitz continuous on this set as desired. \n\nAcknowledgements \n\nThis material is based on research supported by Air Force Office of Scientific Research Grant F49620-94-1-0036 and National Science Foundation Grant CCR-9101801. \n\nReferences \n\nM. Arai. (1993) Bounds on the number of hidden units in binary-valued three-layer neural networks. Neural Networks, 6:855-860. \n\nB. C. Cetin, J. W. Burdick, and J. Barhen. (1993) Global descent replaces gradient descent to avoid local minima problem in learning with artificial neural networks. In IEEE International Conference on Neural Networks, (San Francisco), volume 2, 836-842. \n\nP. Courrien. (1993) Convergent generator of neural networks. Neural Networks, 6:835-844. \n\nYu. Ermoliev and R.J.-B. Wets (editors). (1988) Numerical Techniques for Stochastic Optimization Problems. Springer-Verlag, Berlin. \n\nA.A. Gaivoronski. (1994) Convergence properties of backpropagation for neural networks via theory of stochastic gradient methods. Part 1. Optimization Methods and Software, 1994, to appear. \n\nL. Grippo. (1994) A class of unconstrained minimization methods for neural network training. 
Optimization Methods and Software, 1994, to appear. \n\nG. E. Hinton. (1986) Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, 1-12, Hillsdale. Erlbaum. \n\nR. L. Kashyap, C. C. Blaydon and K. S. Fu. (1970) Applications of stochastic approximation methods. In J. M. Mendel and K. S. Fu, editors, Adaptive, Learning, and Pattern Recognition Systems. Academic Press. \n\nT. Khanna. (1989) Foundations of neural networks. Addison-Wesley, New Jersey. \n\nY. Le Cun, P.Y. Simard, and B. Pearlmutter. (1993) Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In S.J. Hanson, J.D. Cowan, and C.L. Giles, editors, Advances in Neural Information Processing Systems 5, 156-163, San Mateo, California. Morgan Kaufmann. \n\nZ.-Q. Luo and P. Tseng. (1994) Analysis of an approximate gradient projection method with applications to the backpropagation algorithm. Optimization Methods and Software, 1994, to appear. \n\nO.L. Mangasarian. (1993) Mathematical programming in neural networks. ORSA Journal on Computing, 5(4), 349-360. \n\nO.L. Mangasarian and M.V. Solodov. (1994) Serial and parallel backpropagation convergence via nonmonotone perturbed minimization. Optimization Methods and Software, 1994, to appear. Proceedings of Symposium on Parallel Optimization 3, Madison July 7-9, 1993. \n\nM.F. M\u00f8ller. (1992) Supervised learning on large redundant training sets. In Neural Networks for Signal Processing 2. IEEE Press. \n\nB.T. Polyak. (1987) Introduction to Optimization. Optimization Software, Inc., Publications Division, New York. \n\nD.E. Rumelhart, G.E. Hinton, and R.J. Williams. (1986) Learning internal representations by error propagation. In D.E. Rumelhart and J.L. McClelland, editors, Parallel Distributed Processing, 318-362, Cambridge, Massachusetts. MIT Press. \n\nT.J. Sejnowski and C.R. Rosenberg. 
(1987) Parallel networks that learn to pronounce English text. Complex Systems, 1:145-168. \n\nA.S. Weigend, B.A. Huberman, and D.E. Rumelhart. (1990) Predicting the future: a connectionist approach. International Journal of Neural Systems, 1:193-209. \n\nH. White. (1989) Some asymptotic results for learning in single hidden-layer feedforward network models. Journal of the American Statistical Association, 84(408):1003-1013. \n\nX. Zhang, M. Mckenna, J. P. Mesirov, and D. L. Waltz. (1990) The backpropagation algorithm on grid and hypercube architectures. Parallel Computing, 14:317-327. \n\n", "award": [], "sourceid": 821, "authors": [{"given_name": "O. L.", "family_name": "Mangasarian", "institution": null}, {"given_name": "M. V.", "family_name": "Solodov", "institution": null}]}