{"title": "Backpropagation without Multiplication", "book": "Advances in Neural Information Processing Systems", "page_first": 232, "page_last": 239, "abstract": null, "full_text": "Backpropagation without Multiplication \n\nPatrice Y. Simard \n\nAT &T Bell Laboratories \n\nHolmdel, NJ 07733 \n\nHans Peter Graf \n\nAT&T Bell Laboratories \n\nHolmdel, NJ 07733 \n\nAbstract \n\nThe back propagation algorithm has been modified to work with(cid:173)\nout any multiplications and to tolerate comput.ations with a low \nresolution, which makes it. more attractive for a hardware imple(cid:173)\nmentatioll. Numbers are represented in float.ing point format with \n1 bit mantissa and 3 bits in the exponent for the states, and 1 bit \nmantissa and 5 bit exponent. for the gradients, while the weights are \n16 bit fixed-point numbers. In this way, all the computations can \nbe executed with shift and add operations. Large nehvorks with \nover 100,000 weights were t.rained and demonstrat.ed the same per(cid:173)\nformance as networks comput.ed with full precision. An estimate of \na circuit implementatioll shows that a large network can be placed \non a single chip , reaching more t.han 1 billion weight updat.es pel' \nsecond. A speedup is also obtained on any machine where a mul(cid:173)\ntiplication is slower than a shift operat.ioJl. \n\n1 \n\nINTRODUCTION \n\nOne of the main problems for implement.ing the backpropagation algorithm in hard(cid:173)\nware is the large number of multiplications t.hat. have to be executed. Fast multipli(cid:173)\ners for operands wit.h a high resolution l'eqllire a large area. Hence the multipliers \nare the elements dominating t.he area of a circuit. i\\'Iany researchers have tried to \nreduce the size of a circuit by limit.ing the resolution of the computation. Typically, \nthis is done by simply reducing the number of bits utilized for the computation. For \na forward pass a reduction tOjllst a few , 4 to 6. 
bits, often degrades the performance very little, but learning requires considerably more resolution. Requirements ranging anywhere from 8 bits to more than 16 bits were reported to be necessary to make learning converge reliably (Sakaue et al., 1993; Asanovic, Morgan and Wawrzynek, 1993; Reyneri and Filippi, 1991). But there is no general theory of how much resolution is enough; it depends on several factors, such as the size and architecture of the network as well as on the type of problem to be solved. \n\nSeveral researchers have tried to train networks where the weights are limited to powers of two (Kwan and Tang, 1993; White and Elmasry, 1992; Marchesi et al., 1993). In this way all the multiplications can be reduced to shift operations, an operation that can be implemented with much less hardware than a multiplication. But restricting the weight values severely impacts the performance of a network, and it is tricky to make the learning procedure converge. In fact, some researchers keep weights with full resolution off-line and update these weights in the backward pass, while the weights with reduced resolution are used in the forward pass (Marchesi et al., 1993). Similar tricks are usually used when networks implemented in analog hardware are trained. Weights with a high resolution are stored in an external, digital memory while the analog network with its limited resolution is used in the forward pass. If a high resolution copy is not stored, the weight update process needs to be modified. This is typically done by using a stochastic update technique, such as weight dithering (Vincent and Myers, 1992), or weight perturbation (Jabri and Flower, 1992).
 \nWe present here an algorithm that, instead of reducing the resolution of the weights, reduces the resolution of all the other values, namely those of the states, gradients and learning rates, to powers of two. This eliminates multiplications without affecting the learning capabilities of the network. Therefore we obtain the benefit of a much more compact circuit without any compromises on the learning performance. Simulations of large networks with over 100,000 weights show that this algorithm performs as well as standard backpropagation computed with 32 bit floating point precision. \n\n2 THE ALGORITHM \n\nThe forward propagation for each unit j is given by the equation: \n\nx_j = f( sum_i w_ji x_i )   (1) \n\nwhere f is the unit function, w_ji is the weight from unit i to unit j, and x_i is the activation of unit i. The backpropagation algorithm is robust with regard to the unit function as long as the function is nonlinear, monotonically increasing, and a derivative exists (the most commonly used function is depicted in Figure 1, left). A saturated ramp function (see Figure 1, middle), for instance, performs as well as the differentiable sigmoid. The binary threshold function, however, is too much of a simplification and results in poor performance. The choice of our function is dictated by the fact that we would like to have only powers of two for the unit values. This function is depicted in Figure 1, right. It gives performances comparable to the sigmoid or the saturated ramp. Its values can be represented by a 1 bit mantissa (the sign) with a 2 or 3 bit exponent (negative powers of two). \n\nThe derivative of this function is a sum of Dirac delta functions, but we take instead the derivative of a piecewise linear ramp function (see Figure 1). One could actually consider this a low pass filtered version of the real derivative.
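For concreteness, a unit function of this kind can be sketched in software as follows. This is our own illustrative reconstruction, not the paper's circuit: the exponent range (2^0 down to 2^-3) follows the text, but the exact saturation behavior near zero is an assumption.

```python
import math

def power_of_two_activation(x, min_exp=-3):
    """Saturated power-of-two unit function: the output is a signed
    power of two, representable as a 1 bit mantissa (the sign) plus a
    small exponent. Values below the smallest exponent saturate to
    +/- 2**min_exp in this sketch (an assumption)."""
    if x == 0.0:
        return 0.0
    sign = 1.0 if x > 0 else -1.0
    mag = min(abs(x), 1.0)            # saturate at +/- 1
    exp = round(math.log2(mag))       # nearest exponent
    exp = max(min(exp, 0), min_exp)   # clamp to the 2 bit exponent range
    return sign * 2.0 ** exp

def ramp_derivative(x):
    """Derivative of the piecewise linear ramp, used in place of the
    (Dirac) derivative of the discrete function."""
    return 1.0 if abs(x) < 1.0 else 0.0
```

With these definitions, for example, an input of 0.3 maps to 0.25 and an input of -0.6 maps to -0.5, so every unit value is a signed power of two.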
After the gradients of all the units have been computed using the equation \n\ng_i = f'(sum_i) sum_j w_ji g_j   (2) \n\nwhere sum_i is the weighted input to unit i, we will discretize the values to be a power of two (with sign). This introduces noise into the gradient and its effect on the learning has to be considered carefully. \n\nFigure 1: Left: sigmoid function with its derivative. Middle: piecewise linear function with its derivative. Right: saturated power of two function with a power of two approximation of its derivative (identical to the piecewise linear derivative). \n\nThis problem will be discussed in section 4. The backpropagation algorithm can now be implemented with additions (or subtractions) and shifting only. The weight update is given by the equation: \n\ndelta_w_ji = -eta g_j x_i   (3) \n\nSince both g_j and x_i are powers of two, the weight update also reduces to additions and shifts. \n\n3 RESULTS \n\nA large structured network with five layers and overall more than 100,000 weights was used to test this algorithm.
The application analyzed is recognizing handwritten character images. A database of 800 digits was used for training and 2000 handwritten digits were used for testing. A description of this network can be found in (Le Cun et al., 1990). Figure 2 shows the learning curves on the test set for various unit functions and discretization processes. \n\nFirst, it should be noted that the results given by the sigmoid function and the saturated ramp with full precision on unit values, gradients, and weights are very similar. This is actually a well known behavior. The surprising result comes from the fact that reducing the precision of the unit values and the gradients to a 1 bit mantissa does not reduce the classification accuracy and does not even slow down the learning process. During these tests the learning process was interrupted at various stages to check that both the unit values (including the input layer, but excluding the output layer) and the gradients (all gradients) were restricted to powers of two. It was further confirmed that only 2 bits were sufficient for the exponent of the unit \n\nFigure 2: Training and testing error during learning (error in percent versus training age in thousands of patterns). The filled squares (resp. empty squares) represent the points obtained with the vanilla backpropagation and a sigmoid function (resp. piecewise linear function) used as an activation function.
 \nThe circles represent the same experiment done with a power of two function used as the activation function, and with all unit gradients discretized to the nearest power of two. \n\nvalues (from 2^0 to 2^-3) and 4 bits were sufficient for the exponent of the gradients (from 2^0 to 2^-15). \n\nTo test whether there was any asymptotic limit on performance, we ran a long term experiment (several days) with our largest network (17,000 free parameters) for handwritten character recognition. The training set (60,000 patterns) was made out of 30,000 patterns of the original NIST training set (easy) and 30,000 patterns out of the original NIST testing set (hard). Using the most basic backpropagation algorithm (with a guessed constant learning rate) we got the training raw error rate down to under 1% in 20 epochs, which is comparable to our standard learning time. Performance on the test set was not as good with the discrete network (it took twice as long to reach equal performance with the discrete network). This was attributed to the unnecessary discretization of the output units (see footnote 1). \n\nThese results show that gradients and unit activations can be discretized to powers of two with negligible loss in performance and convergence speed! The next section will present theoretical explanations for why this is at all possible and why it is generally the case. \n\nFootnote 1: Since the output units are not multiplied by anything, there is no need to use a discrete activation function. As a matter of fact the continuous sigmoid function can be implemented by just changing the target values (using the inverse sigmoid function) and by using no activation function for the output units. This modification was not introduced, but we believe it would improve the performance on the test set, especially when fancy decision rules (with confidence evaluation) are used, since they require high precision on the output units.
 \n\nFigure 3: Top left: histogram of the gradients of one output unit after more than 20 epochs of learning over a training set of 60,000 patterns. Bottom left: same histogram assuming that the distribution is constant between powers of two. Right: simplified network architectures for the noise effect analysis (best case: noise is uncorrelated and all weights are equal; worst case: noise is correlated or the weights are unequal). \n\n4 DISCUSSION \n\nDiscretizing the gradient is potentially very dangerous. Convergence may no longer be guaranteed, learning may become prohibitively slow, and final performance after learning may be too poor to be interesting. We will now explain why these problems do not arise for our choice of discretization. \n\nLet g_i(p) be the error gradient at weight i and pattern p. Let mu_i and sigma_i be the mean and standard deviation of g_i(p) over the set of patterns. The mean mu_i is what is driving the weights to their final values; the standard deviation sigma_i represents the amplitude of the variations of the gradients from pattern to pattern. In batch learning, only mu_i is used for the weight update, while in stochastic gradient descent, each g_i(p) is used for the weight update. If the learning rate is small enough, the effects of the noise (measured by sigma_i) of the stochastic variable g_i(p) are negligible, but the frequent weight updates in stochastic gradient descent result in important speedups. \n\nTo explain why the discretization of the gradient to a power of two has negligible effect on the performance, consider that in stochastic gradient descent, the noise on the gradient is already so large that it is minimally affected by a rounding (of the gradient) to the nearest power of two.
Indeed asymptotically, the gradient average (mu_i) tends to be negligible compared to its standard deviation (sigma_i), meaning that from pattern to pattern the gradient can undergo sign reversals. Rounding to the nearest power of two, in comparison, is a change of at most 33%, but never a change in sign. This additional noise can therefore be compensated by a slight decrease in the learning rate which will hardly affect the learning process. \n\nThe histogram of g_i(p) after learning in the last experiment described in the results section is shown in Figure 3 (over the training set of 60,000 patterns). It is easy to see in the figure that mu_i is small with respect to sigma_i (in this experiment mu_i was one to two orders of magnitude smaller than sigma_i, depending on the layer). We can also see that rounding each gradient to the nearest power of two will not significantly affect the variance sigma_i, and therefore the learning rate will not need to be decreased to achieve the same performance. \n\nWe will now try to evaluate the effect of rounding to the nearest power of two more quantitatively. The standard deviation of the gradient for any weight can be written as \n\nsigma_i^2 = (1/N) sum_p (g_i(p) - mu_i)^2 = (1/N) sum_p g_i(p)^2 - mu_i^2 ~= (1/N) sum_p g_i(p)^2   (4) \n\nThis approximation is very good asymptotically (after a few epochs of learning). For instance if |mu_i| < sigma_i / 10, the above formula gives the standard deviation to 1%. Rounding the gradient g_i to the nearest power of two (while keeping the sign) can be viewed as the effect of a multiplicative noise n_i in the equation \n\ng_i' = 2^k = g_i (1 + n_i)   for some integer k   (5) \n\nwhere g_i' is the nearest power of two from g_i. It can be easily verified that this implies that n_i ranges from -1/3 to 1/3. From now on, we will view n_i as a random variable which models as noise the effect of discretization.
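The +/- 1/3 bound on n_i, and the per-layer variance factor 1 + 1/27 that the uniform-noise model produces, can be checked numerically with a short sketch (our own illustration; the rounding helper and its name are not from the paper):

```python
import math
import random

def round_to_power_of_two(g):
    """Round g to the (arithmetically) nearest power of two, keeping its sign."""
    if g == 0.0:
        return 0.0
    sign = 1.0 if g > 0 else -1.0
    a = abs(g)
    lo = 2.0 ** math.floor(math.log2(a))   # power of two just below |g|
    hi = 2.0 * lo                          # power of two just above |g|
    return sign * (lo if a - lo <= hi - a else hi)

# Empirically, the multiplicative noise n = g'/g - 1 stays within [-1/3, 1/3].
random.seed(0)
max_n = max(abs(round_to_power_of_two(g) / g - 1.0)
            for g in (random.uniform(0.001, 10.0) for _ in range(100000)))
assert max_n <= 1.0 / 3.0 + 1e-12

# For n uniform on [-1/3, 1/3], E[(1+n)^2] = (3/2) * integral of (1+n)^2 dn
# over that interval, which evaluates to exactly 1 + 1/27 per layer.
factor = (3.0 / 2.0) * (((4.0 / 3.0) ** 3 - (2.0 / 3.0) ** 3) / 3.0)
assert abs(factor - (1.0 + 1.0 / 27.0)) < 1e-12
```

The closed-form integral confirms that each layer of discretization inflates the gradient variance by only about 3.7%.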
To simplify the computation we will assume that n_i has a uniform distribution. The effect of this assumption is depicted in Figure 3, where the bottom histogram has been assumed constant between any two powers of two. \n\nTo evaluate the effect of the noise n_i in a multilayer network, let n_li be the multiplicative noise introduced at layer l (l = 1 for the output, and l = L for the first layer above the input) for weight i. Let's further assume that there is only one unit per layer (a simplified diagram of the network architecture is shown in Figure 3). This is the worst case analysis. If there are several units per layer, the gradients will be summed to units in a lower layer. The gradients within a layer are correlated from unit to unit (they all originate from the same desired values), but the noise introduced by the discretization can only be less correlated, not more. The summation of the gradient in a lower layer can therefore only decrease the effect of the discretization. The worst case analysis is therefore when there is only one unit per layer, as depicted in Figure 3, extreme right. We will further assume that the noise introduced by the discretization in one layer is independent from the noise introduced in the next layer. This is not really true, but it greatly simplifies the derivation. \n\nLet mu_i' and sigma_i' be the mean and standard deviation of the discretized gradient g_i'(p). Since n_li has zero mean, mu_i' = mu_i, and mu_i' is negligible with respect to g_i'(p). In the worst case, when the gradient has to be backpropagated all the way to the input, the standard deviation is: \n\nsigma_i'^2 = (1/N) sum_p g_i(p)^2 prod_{l=1..L} (3/2) int_{-1/3}^{1/3} (1 + n_li)^2 dn_li - mu_i^2 ~= sigma_i^2 (1 + 1/27)^L   (6) \n\nAs learning progresses, the minimum average distance of each weight to the weight corresponding to a local minimum becomes proportional to the variance of the noise on that weight, divided by the learning rate. Therefore, asymptotically (which is where most of the time is spent), for a given convergence speed, the learning rate should be inversely proportional to the variance of the noise in the gradient. This means that to compensate the effect of the discretization, the learning rate should be divided by \n\nsigma_i' / sigma_i = sqrt( (1 + 1/27)^L ) ~= 1.02^L   (7) \n\nEven for a 10 layer network this value is only 1.2 (sigma_i' is 20% larger than sigma_i). The assumption that the noise is independent from layer to layer tends to slightly underestimate this number, while the assumption that the noise from unit to unit in the same layer is completely correlated tends to overestimate it. \n\nAll things considered, we do not expect that the learning rate should be decreased by more than 10 to 20% for any practical application. In all our simulations it was actually left unchanged! \n\n5 HARDWARE \n\nThis algorithm is well suited for integrating a large network on a single chip. The weights are implemented with a resolution of 16 bits, while the states need only 1 bit in the mantissa and 3 bits in the exponent, the gradients 1 bit in the mantissa and 5 bits in the exponent, and for the learning rate a 1 bit mantissa and a 4 bit exponent suffice. In this way, all the multiplications of weights with states, and of gradients with learning rates and states, reduce to add operations on the exponents. \n\nFor the forward pass the weights are multiplied with the states and then summed. The multiplication is executed as a shift operation of the weight values. For summing two products their mantissae have to be aligned, again a shift operation, and then they can be added.
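This shift-and-add inner loop can be sketched in software as follows. The encoding here is our own illustration, not the paper's circuit: FRAC (the number of fractional bits of the fixed-point weights), the (sign, exponent) state representation, and the function name are all assumptions.

```python
def forward_unit(weights_q, states):
    """Accumulate sum_i w_ji * x_i using shifts and adds only.

    weights_q: 16 bit fixed-point weights, stored as integers scaled
               by 2**FRAC (illustrative scaling).
    states:    (sign, exp) pairs encoding x_i = sign * 2**exp, with
               sign in {-1, 0, +1} and exp <= 0.
    Returns the partial sum at full resolution, as the paper requires.
    """
    acc = 0
    for w, (sign, exp) in zip(weights_q, states):
        if sign == 0:
            continue
        prod = w >> (-exp)               # multiply by 2**exp via a shift
        acc += prod if sign > 0 else -prod
    return acc

FRAC = 12                                 # assumed fixed-point scaling
w = [int(0.75 * 2**FRAC), int(-0.5 * 2**FRAC)]
x = [(+1, -1), (-1, -2)]                  # states 0.5 and -0.25
s = forward_unit(w, x)
# 0.75*0.5 + (-0.5)*(-0.25) = 0.5, so s / 2**FRAC recovers 0.5
```

Because every state is a signed power of two, the only per-weight work is one shift and one add, which is what makes the 1,000-transistor ALU estimate below plausible.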
The partial sums are kept at full resolution until the end of the summing process. This is necessary to avoid losing the influence of many small products. Once the sum is computed, it is quantized simply by checking the most significant bit in the mantissa. For the backward propagation the computation runs in the same way, except that now the gradient is propagated through the net, and the learning rate has to be taken into account. \n\nThe only operations required for this algorithm are 'shift' and 'add'. An ALU implementing these operations with the resolution mentioned can be built with less than 1,000 transistors. In order to execute a network fast, its weights have to be stored on-chip. Otherwise, the time required to transfer weights from external memory onto the chip makes the high compute power all but useless. If storage is provided for 10,000 weights plus 2,000 states, this requires less than 256 kbit of memory. Together with 256 ALUs and circuitry for routing the data, this leads to a circuit with about 1.7 million transistors, where over 80% of them are contained in the memory. This assumes that the memory is implemented with static cells; if dynamic memory is used instead, the transistor count drops considerably. An operating speed of 40 MHz results in a compute rate of 10 billion operations per second. With such a chip a network may be trained at a speed of more than 1 billion weight updates per second. \n\nThis algorithm has been optimized for an implementation on a chip, but it can also provide a considerable speedup when executed on a standard computer. Due to the small resolution of the numbers, several states can be packed into a 32 bit number and hence many more fit into a cache.
Moreover, on a machine without a hardware multiplier, where the multiplication is executed with microcode, shift operations may be much faster than multiplications. Hence a substantial speedup may be observed. \n\nReferences \n\nAsanovic, K., Morgan, N., and Wawrzynek, J. (1993). Using Simulations of Reduced Precision Arithmetic to Design a Neuro-Microprocessor. J. VLSI Signal Processing, 6(1):33-44. \n\nJabri, M. and Flower, B. (1992). Weight Perturbation: An optimal architecture and learning technique for analog VLSI feedforward and recurrent multilayer networks. IEEE Trans. Neural Networks, 3(3):154-157. \n\nKwan, H. and Tang, C. (1993). Multiplierless Multilayer Feedforward Neural Network Design Suitable for Continuous Input-Output Mapping. Electronic Letters, 29(14):1259-1260. \n\nLe Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1990). Handwritten Digit Recognition with a Back-Propagation Network. In Touretzky, D., editor, Neural Information Processing Systems, volume 2, (Denver, 1989). Morgan Kaufmann. \n\nMarchesi, M., Orlando, G., Piazza, F., and Uncini, A. (1993). Fast Neural Networks without Multipliers. IEEE Trans. Neural Networks, 4(1):53-62. \n\nReyneri, L. and Filippi, E. (1991). An Analysis on the Performance of Silicon Implementations of Backpropagation Algorithms for Artificial Neural Networks. IEEE Trans. Computers, 40(12):1380-1389. \n\nSakaue, S., Kohda, T., Yamamoto, H., Maruno, S., and Shimeki, Y. (1993). Reduction of Required Precision Bits for Back-Propagation Applied to Pattern Recognition. IEEE Trans. Neural Networks, 4(2):270-275. \n\nVincent, J. and Myers, D. (1992). Weight dithering and Wordlength Selection for Digital Backpropagation Networks. BT Technology J., 10(3):124-133. \n\nWhite, B. and Elmasry, M. (1992).
The Digi-Neocognitron: A Digital Neocognitron Neural Network Model for VLSI. IEEE Trans. Neural Networks, 3(1):73-85.", "award": [], "sourceid": 833, "authors": [{"given_name": "Patrice", "family_name": "Simard", "institution": null}, {"given_name": "Hans", "family_name": "Graf", "institution": null}]}