{"title": "A Study of Parallel Perturbative Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 803, "page_last": 810, "abstract": null, "full_text": "A Study of Parallel Perturbative Gradient Descent \n\nD. Lippe\u00b7 J. Alspector \n\nBellcore \n\nMorristown, NJ 07960 \n\nAbstract \n\nWe have continued our study of a parallel perturbative learning method [Alspector et al., 1993] and implications for its implementation in analog VLSI. Our new results indicate that, in most cases, a single parallel perturbation (per pattern presentation) of the function parameters (weights in a neural network) is theoretically the best course. This is not true, however, for certain problems and may not generally be true when faced with issues of implementation such as limited precision. In these cases, multiple parallel perturbations may be best, as indicated in our previous results. \n\n1 INTRODUCTION \n\nMotivated by difficulties in analog VLSI implementation of back-propagation [Rumelhart et al., 1986] and related algorithms that calculate gradients based on detailed knowledge of the neural network model, several similar recent papers proposed to use a parallel [Alspector et al., 1993, Cauwenberghs, 1993, Kirk et al., 1993] or a semi-parallel [Flower and Jabri, 1993] perturbative technique which has the property that it measures (with the physical neural network) rather than calculates the gradient. This technique is closely related to methods of stochastic approximation [Kushner and Clark, 1978], which have recently been investigated by workers in fields other than neural networks. [Spall, 1992] showed that averaging multiple parallel perturbations for each pattern presentation may be asymptotically preferable in the presence of noise. Our own results [Alspector et al., 1993] indicated \n\n\u00b7Present address: Dept. 
of EECS; MIT; Cambridge, MA 02139; dalippe@mit.edu \n\nthat multiple parallel perturbations are also preferable when only limited precision is available in the learning rate, which is realistic for a physical implementation. In this work we have investigated whether multiple parallel perturbations for each pattern are non-asymptotically preferable theoretically (without noise). We have also studied this empirically, to the limited degree that simulations allow, by removing the precision constraints of our previous work. \n\n2 GRADIENT ESTIMATION BY PARALLEL WEIGHT PERTURBATION \n\nFollowing our previous work, one can estimate the gradient of the error, E(w), with respect to any weight, w_i, by perturbing w_i by \\delta w_i and measuring the change in the output error, \\delta E, as the entire weight vector, w, except for component w_i, is held constant: \n\n\\delta E / \\delta w_i = [E(w + \\delta w_i \\hat{e}_i) - E(w)] / \\delta w_i \n\nwhere \\hat{e}_i is the unit vector along w_i. We now consider perturbing all weights simultaneously. However, we wish to have the perturbation vector, \\delta w, chosen uniformly on a hypercube. Note that this requires only a random sign multiplying a fixed perturbation and is natural for VLSI using a parallel noise generator [Alspector et al., 1991]. \n\nThis leads to the approximation (ignoring higher order terms) \n\n\\delta E / \\delta w_i \\approx \\partial E / \\partial w_i + \\sum_{j \\neq i} (\\partial E / \\partial w_j)(\\delta w_j / \\delta w_i)    (1) \n\nThe last term has expectation value zero for random and independently distributed \\delta w_j. The weight change rule \n\n\\Delta w_i = -\\eta (\\delta E / \\delta w_i) \n\nwhere \\eta is a learning rate, will follow the gradient on the average but with considerable noise. \n\nFor each pattern, one can reduce the variance of the noise term in (1) by repeating the random parallel perturbation many times to improve the statistical estimate. If we average over P perturbations, we have \n\n\\Delta w_i = -(\\eta / P) \\sum_{p=1}^{P} \\delta E^{(p)} / \\delta w_i^{(p)} \n\nwhere p indexes the perturbation number. 
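The update rule above can be sketched in a few lines of NumPy. This is only an illustrative software simulation (the paper's point is that a physical network measures \delta E directly); the quadratic error surface A, the learning rate, and the perturbation size below are hypothetical choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic error surface standing in for the network error E(w).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def E(w):
    return 0.5 * w @ A @ w

def perturbative_update(w, eta=0.05, sigma=0.01, P=1):
    """One weight update: average delta_E / delta_w_i over P parallel
    perturbations, then step against the averaged estimate."""
    grad_est = np.zeros_like(w)
    for _ in range(P):
        # Random sign times a fixed magnitude: a corner of the hypercube,
        # as produced by the parallel noise generator described in the text.
        dw = sigma * rng.choice([-1.0, 1.0], size=w.shape)
        dE = E(w + dw) - E(w)        # one measured change in the error
        grad_est += dE / dw          # elementwise delta_E / delta_w_i
    return w - eta * grad_est / P

w = np.array([1.0, -1.0])
for _ in range(500):
    w = perturbative_update(w, P=1)
print(E(w))  # small residual error: the rule descends on average
```

Every weight divides the same measured \delta E by its own signed perturbation, so a single update follows the gradient only on average; the cross terms of (1) appear as zero-mean noise.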
\n\n3 THEORETICAL RELATIVE EFFICIENCY \n\n3.1 BACKGROUND \n\nSpall [Spall, 1992] shows in an asymptotic sense that multiple perturbations may be faster if only a noisy measurement of E(w) is available, and that one perturbation is superior otherwise. His results are asymptotic in that they compare the rates of convergence to the local minimum if the algorithms run for infinite time. Thus, his results may only indicate that 1 perturbation is superior close to a local minimum. Furthermore, his result implicitly assumes that P perturbations per weight update take P times as long as 1 perturbation per weight update. Experience shows that the time required to present patterns to the hardware is often the bottleneck in VLSI implementations of neural networks [Brown et al., 1992]. In a hardware implementation of a perturbative learning algorithm, a few perturbations might be performed with no time penalty while waiting for the next pattern presentation. \n\nThe remainder of this section sketches an argument that multiple perturbations may be desirable for some problems in a non-asymptotic sense, even in a noise-free environment and under the assumption of a multiplicative time penalty for performing multiple perturbations. On the other hand, the argument also shows that there is little reason to believe in practice that any given problem will be learned more quickly by multiple perturbations. Space limitations prevent us from reproducing the full argument and discussion of its relevance, which can be found in [Lippe, 1994]. \n\nThe argument fixes a point in weight space and considers the expectation value of the change in the error induced by one weight update under both the 1 perturbation case and the multiple perturbation case. 
[Cauwenberghs, 1994] contains a somewhat related analysis of the relative speed of one parallel perturbation and weight perturbation as described in [Jabri and Flower, 1991]. The analysis is only truly relevant far from a local minimum because close to a local minimum the variance of the change of the error is as important as the mean of the change of the error. \n\n3.2 Calculations \n\nIf P is the number of perturbations, then our learning rule is \n\n\\Delta w_i = -(\\eta / P) \\sum_{p=1}^{P} \\delta E^{(p)} / \\delta w_i^{(p)}    (2) \n\nIf W is the number of weights, then \\Delta E, calculated to second order in \\eta, is \n\n\\Delta E = \\sum_{i=1}^{W} (\\partial E / \\partial w_i) \\Delta w_i + (1/2) \\sum_{i=1}^{W} \\sum_{j=1}^{W} (\\partial^2 E / \\partial w_i \\partial w_j) \\Delta w_i \\Delta w_j    (3) \n\nExpanding \\delta E^{(p)} to second order in \\sigma (where \\delta w_i = \\pm\\sigma), we obtain \n\n\\delta E^{(p)} = \\sum_{j=1}^{W} (\\partial E / \\partial w_j) \\delta w_j^{(p)} + (1/2) \\sum_{j=1}^{W} \\sum_{k=1}^{W} (\\partial^2 E / \\partial w_j \\partial w_k) \\delta w_j^{(p)} \\delta w_k^{(p)}    (4) \n\n[Lippe, 1994] shows that combining (2)-(4), retaining only first and second order terms, and taking expectation values gives \n\n< \\Delta E > = -\\eta X + (\\eta^2 / 2P)(Y + PZ)    (5) \n\nwhere \n\nX = \\sum_{i=1}^{W} (\\partial E / \\partial w_i)^2 \n\nand Y and Z are functions of the first and second derivatives of E whose explicit forms are given in [Lippe, 1994]: Z is the curvature term already present in strict gradient descent, while Y/P arises from the noise of the perturbative gradient estimate (see section 3.3). \n\nNote that the first term in (5) is strictly less than or equal to 0 since X is a sum of squares\u00b9. The second term, on the other hand, can be either positive or negative. Clearly then a sufficient condition for learning is that the first term dominates the second term. By making \\eta small enough, we can guarantee that learning occurs. Strictly speaking, this is not a necessary condition for learning. However, it is important to keep in mind that we are only focusing on one point in weight space. 
If, at this point in weight space, < \\Delta E > is negative but the second term's magnitude is close to the first term's magnitude, it is not unlikely that at some other point in weight space < \\Delta E > will be positive. Thus, we will assume that for efficient learning to occur, it is necessary that \\eta be small enough to make the first term dominate the second term. \n\nAssume that some problem can be successfully learned with one perturbation, at learning rate \\eta(1). Then the first order term in (5) dominates the second order terms. Specifically, at any point in weight space we have, for some large constant \\mu, \n\n\\eta(1) X \\geq \\mu \\eta(1)^2 |Y + Z| \n\nIn order to learn with P perturbations, we apparently need \n\n\\eta(P) X \\geq \\mu (\\eta(P)^2 / P) |Y + PZ|    (6) \n\nThe assumption that the first order term of (5) dominates the second order terms implies that convergence time is proportional to 1/\\eta(P). Thus, learning is more efficient in the multiple perturbation case if \n\n\\eta(P) / P > \\eta(1)    (7) \n\nIt turns out, as shown in [Lippe, 1994], that the conditions (6) and (7) can be met simultaneously with multiple perturbations if -Y/Z \\geq 2. \n\n\u00b9If we are at a stationary point then the first term in (5) is 0. \n\nIt is shown in [Lippe, 1994], by using the fact that the Hessian of a quadratic function with a minimum is positive semi-definite, that if E is quadratic and has a minimum, then Y and Z have the same sign (and hence -Y/Z < 2). Any well behaved function acts quadratically sufficiently close to a stationary point. Thus, we cannot get < \\Delta E > more than a factor of P larger by using P perturbations near local minima of well behaved functions. Although, as mentioned earlier, we are entirely ignoring the issue of the variance of \\Delta E, this may be some indication of the asymptotic superiority of 1 perturbation. 
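The two statistical facts behind the calculation above (the parallel estimate is unbiased, and averaging P perturbations cuts its variance roughly as 1/P) can be checked with a short Monte Carlo. The error function and operating point below are hypothetical, chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical smooth error function with a cross term, used only to probe
# the statistics of the parallel-perturbation gradient estimate.
def E(w):
    return np.sum(w ** 2) + 0.5 * w[0] * w[1]

def true_grad(w):
    g = 2.0 * w.copy()
    g[0] += 0.5 * w[1]
    g[1] += 0.5 * w[0]
    return g

def estimate(w, sigma=1e-3, P=1):
    """Average of P parallel-perturbation gradient estimates."""
    acc = np.zeros_like(w)
    for _ in range(P):
        dw = sigma * rng.choice([-1.0, 1.0], size=w.shape)
        acc += (E(w + dw) - E(w)) / dw
    return acc / P

w = np.array([0.3, -0.7, 1.1])
g = true_grad(w)

for P in (1, 7, 49):
    samples = np.array([estimate(w, P=P) for _ in range(2000)])
    bias = np.abs(samples.mean(axis=0) - g).max()
    var = samples.var(axis=0).mean()
    print(P, round(bias, 3), round(var, 3))
# The mean stays near the true gradient for every P, while the variance
# shrinks roughly as 1/P.
```

This is exactly the trade the section analyzes: extra perturbations buy a less noisy gradient estimate at P times the measurement cost.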
\n\n3.3 Discussion of Results \n\nThe result that multiple perturbations are superior when -Y/Z \\geq 2 may seem somewhat mysterious. Some light is shed on it by rewriting (5) as \n\n< \\Delta E > = -\\eta X + (\\eta^2 / 2)(Y/P + Z). \n\nFor strict gradient descent, the corresponding equation is \n\n< \\Delta E > = \\Delta E = -\\eta X + (\\eta^2 / 2) Z. \n\nThe difference between strict gradient descent and perturbative gradient descent, on average, is the second order term (\\eta^2 / 2)(Y/P). This is the term which results from not following the gradient exactly, and it obviously goes down as P goes up and the gradient measurement becomes more accurate. Thus, if Z and Y have different signs, P can be used to make the second order term disappear. There is no way to know whether this situation will occur frequently. Furthermore, it is important to keep in mind that if Y is negative and Z is positive, then raising P may make the magnitude of the second order term smaller, but it makes the term itself larger. Thus, in general, there is little reason to believe that multiple perturbations will help with a randomly chosen problem. \n\nAn example where multiple perturbations help is when we are at a point where the error surface is convex along the gradient direction, and concave in most other directions. Curvature due to second derivative terms in Y and Z helps when the gradient direction is followed, but can hurt when we stray from the gradient. In this case, Z < 0 and possibly Y > 0, so multiple perturbations might be preferable in order to follow the gradient direction very closely. \n\n4 SIMULATIONS OF SINGLE AND MULTIPLE PARALLEL PERTURBATION \n\n4.1 CONSTANT LEARNING RATES \n\nThe second order terms in (5) can be reduced either by using a small learning rate, or by using more perturbations, as discussed briefly in [Cauwenberghs, 1993]. Thus, if \\eta is kept constant, we expect a minimum necessary number of perturbations in order to learn. 
This in itself might be of importance in a limited precision implementation. If there is a non-trivial lower bound on \\eta, then it might be necessary to use multiple perturbations in order to learn. This is the effect that was noticed in [Alspector et al., 1993]. At that time we thought that we had found empirically that multiple perturbations were necessary for learning. The problem was that we failed to decrease the learning rate with the number of perturbations. \n\nTable 1: Running times for the first initial weight vector \n\nP   \\eta     Time for < .5   Time for < .1 \n1   .0005    32,179          1,121,459 \n1   .001     18,534          831,684 \n1   .002     11,008          784,768 \n1   .003     9,933           *494,029* \n1   .004     *9,728*         1,695,974 \n7   .00625   23,834          707,840 \n7   .008     16,845          *583,654* \n7   .0125    13,261          922,880 \n7   .025     *12,006*        1,010,355 \n7   .035     17,024          Not tested \n\n4.2 EMPIRICAL RELATIVE EFFICIENCY OF SINGLE AND MULTIPLE PERTURBATION ALGORITHMS \n\nSection 3 showed that, in theory, multiple perturbations might be faster than 1 perturbation. We investigated whether or not this is the case for the 7 input Hamming error correction problem as described in [Biggs, 1989]. This is basically a nearest neighbor problem. There exist 16 distinct 7 bit binary code words. When presented with an arbitrary 7 bit binary word, the network is to output the code word with the least Hamming distance from the input. \n\nAfter preliminary tests with 50, 25, 7, and 1 perturbation, it seemed that 7 perturbations provided the fastest learning, so we concentrated on running simulations for both the 1 perturbation and the 7 perturbation case. Specifically, we chose two different (randomly generated) initial weight vectors, and five different seeds for the pseudo-random function used to generate the \\delta w_i. 
For each of these ten cases, we tested both 1 perturbation and 7 perturbations with various learning rates in order to obtain the fastest possible learning. \n\nThe 128 possible input patterns were repeatedly presented in order. We investigated how many pattern presentations were necessary to drive the MSE below .1 and how many presentations were necessary to drive it below .5. Recalling the theory developed in section 3, we know that multiple perturbations can be helpful only far away from a stationary point. Thus, we expected that 7 perturbations might be quicker reaching .5 but would be slower reaching .1. \n\nThe results are summarized in tables 1 and 2. Each table summarizes information for a different initial weight vector. All of the data presented are averaged over 5 runs, one with each of the different random seeds. The two columns labeled "Time for < .5" and "Time for < .1" are adjusted according to the assumption that one weight update at 7 perturbations takes 7 times as long as one weight update at 1 perturbation. In each table, the following four numbers appear in italics (marked *thus*): the shortest time to reach .1 with 1 perturbation, the shortest time to reach .1 with 7 perturbations, the shortest time to reach .5 with 1 perturbation, and the shortest time to reach .5 with 7 perturbations. \n\n7 perturbations were a loss in three out of four of the experiments. Surprisingly, 
Surprisingly, \n\n\fA Study of Parallel Perturbative Gradient Descent \n\n809 \n\nl' \n1 \n1 \n1 \n1 \n1 \n1 \n1 \n1 \n1 \n\n'T/ \n.001 \n.002 \n.003 \n.004 \n.00625 \n.008 \n.0125 \n.025 \n.035 \n\nTable 2: Running times for the second initial weight vector \n\nTIme for < .1 \n928,236 \n719 , 078 \n154,139 \n1,603,354 \n629 , 530 \n611,610 \n912,333 \n1,580,442 \nNot tested \n\nTIme for < .5 \n22,133 \n12,817 \n10,675 \n11,150 \n21,059 \n19,112 \n15,949 \n14,515 \n11,141 \n\nthe one time that multiple perturbations helped was in reaching .1 from the second \ninitial weight vector. There are several possible explanations for this. To begin \nwith, these learning times are averages over only five simulations each, which makes \ntheir statistical significance somewhat dubious. Unfortunately, it was impractical \nto perform too many experiments as the data obtained required 180 computer sim(cid:173)\nulations, each of which sometimes took more than a day to complete. \n\nAnother possible explanation is that .1 may not be \"asymptotic enough.\" The \nnumbers .5 and .1 were chosen somewhat arbitrarily to represent non-asymptotic \nand asymptotic results. However, there is no way of predicting from the theory how \nclose the error must be to its minimum before asymptotic results become relevant. \n\nThe fact that 1 perturbation outperformed 7 perturbations in three out of four cases \nis not surprising. As explained in section 3, there is in general no reason to believe \nthat multiple perturbations will help on a randomly chosen problem. \n\n5 CONCLUSION \n\nOur results show that, under ideal computational conditions, where the learning \nrate can be adjusted to proper size, that a single parallel perturbation is, except \nfor unusual problems, superior to multiple parallel perturbations. 
However, under the precision constraints imposed by analog VLSI implementation, where learning rates may not be adjustable and presenting a pattern takes longer than performing a perturbation, multiple parallel perturbations are likely to be the best choice. \n\nAcknowledgment \n\nWe thank Gert Cauwenberghs and James Spall for valuable and insightful discussions. \n\nReferences \n\n[Alspector et al., 1991] Alspector, J., Gannett, J. W., Haber, S., Parker, M. B., and Chu, R. (1991). A VLSI-efficient technique for generating multiple uncorrelated noise sources and its application to stochastic neural networks. IEEE Transactions on Circuits and Systems, 38:109-123. \n\n[Alspector et al., 1993] Alspector, J., Meir, R., Yuhas, B., Jayakumar, A., and Lippe, D. (1993). A parallel gradient descent method for learning in analog VLSI neural networks. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 836-844, San Mateo, California. Morgan Kaufmann Publishers. \n\n[Biggs, 1989] Biggs, N. L. (1989). Discrete Mathematics. Oxford University Press. \n\n[Brown et al., 1992] Brown, T. X., Tran, M. D., Duong, T., and Thakoor, A. P. (1992). Cascaded VLSI neural network chips: Hardware learning for pattern recognition and classification. Simulation, 58(5):340-347. \n\n[Cauwenberghs, 1993] Cauwenberghs, G. (1993). A fast stochastic error-descent algorithm for supervised learning and optimization. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 244-251, San Mateo, California. Morgan Kaufmann Publishers. \n\n[Cauwenberghs, 1994] Cauwenberghs, G. (1994). Analog VLSI Autonomous Systems for Learning and Optimization. PhD thesis, California Institute of Technology. \n\n[Flower and Jabri, 1993] Flower, B. and Jabri, M. (1993). 
Summed weight neuron perturbation: An O(n) improvement over weight perturbation. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 212-219, San Mateo, California. Morgan Kaufmann Publishers. \n\n[Jabri and Flower, 1991] Jabri, M. and Flower, B. (1991). Weight perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. Neural Computation, 3:546-565. \n\n[Kirk et al., 1993] Kirk, D., Kerns, D., Fleischer, K., and Barr, A. (1993). Analog VLSI implementation of gradient descent. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 789-796, San Mateo, California. Morgan Kaufmann Publishers. \n\n[Kushner and Clark, 1978] Kushner, H. and Clark, D. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, New York. \n\n[Lippe, 1994] Lippe, D. A. (1994). Parallel, perturbative gradient descent methods for learning in analog VLSI neural networks. Master's thesis, Massachusetts Institute of Technology. \n\n[Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L., editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, page 318. MIT Press, Cambridge, MA. \n\n[Spall, 1992] Spall, J. C. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332-341. \n", "award": [], "sourceid": 911, "authors": [{"given_name": "D.", "family_name": "Lippe", "institution": null}, {"given_name": "Joshua", "family_name": "Alspector", "institution": null}]}