{"title": "Analog VLSI Implementation of Multi-dimensional Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 789, "page_last": 796, "abstract": null, "full_text": "Analog VLSI Implementation of Multi-dimensional Gradient Descent\n\nDavid B. Kirk, Douglas Kerns, Kurt Fleischer, Alan H. Barr\n\nCalifornia Institute of Technology\n\nBeckman Institute 350-74\n\nPasadena, CA 91125\n\nE-mail: dk@egg.gg.caltech.edu\n\nAbstract\n\nWe describe an analog VLSI implementation of a multi-dimensional gradient estimation and descent technique for minimizing an on-chip scalar function f(\u00b7). The implementation uses noise injection and multiplicative correlation to estimate derivatives, as in [Anderson, Kerns 92]. One intended application of this technique is setting circuit parameters on-chip automatically, rather than manually [Kirk 91]. Gradient descent optimization may be used to adjust synapse weights for a backpropagation or other on-chip learning implementation. The approach combines the features of continuous multi-dimensional gradient descent and the potential for an annealing style of optimization. We present data measured from our analog VLSI implementation.\n\n1 Introduction\n\nThis work is similar to [Anderson, Kerns 92], but represents two advances. First, we describe the extension of the technique to multiple dimensions. Second, we demonstrate an implementation of the multi-dimensional technique in analog VLSI, and provide results measured from the chip. Unlike previous work using noise sources in adaptive systems, we use the noise as a means of estimating the gradient of a function f(y), rather than performing an annealing process [Alspector 88]. We also estimate gradients continuously in position and time, in contrast to [Umminger 89] and [Jabri 91], which utilize discrete position gradient estimates. 
\n\nIt is interesting to note the existence of related algorithms, also presented in this volume [Cauwenberghs 93] [Alspector 93] [Flower 93]. The main difference is that our implementation operates in continuous time, with continuous differentiation and integration operators. The other approaches realize the integration and differentiation processes as discrete addition and subtraction operations, and use unit perturbations. [Cauwenberghs 93] provides a detailed derivation of the convergence and scaling properties of the discrete approach, and a simulation. [Alspector 93] provides a description of the use of the technique as part of a neural network hardware architecture, and provides a simulation. [Flower 93] derived a similar discrete algorithm from a node perturbation perspective in the context of multi-layered feedforward networks. Our work is similar in spirit to [Dembo 90] in that we don't make any explicit assumptions about the \"model\" that is embodied in the function f(\u00b7). The function may be implemented as a neural network. In that case, the gradient descent is on-chip learning of the parameters of the network.\n\nWe have fabricated a working chip containing the continuous-time multi-dimensional gradient descent circuits. This paper includes chip data for individual circuit components, as well as the entire circuit performing multi-dimensional gradient descent and annealing.\n\n2 The Gradient Estimation Technique\n\nFigure 1: Gradient estimation technique from [Anderson, Kerns 92]\n\nAnderson and Kerns [Anderson, Kerns 92] describe techniques for one-dimensional gradient estimation in analog hardware. The gradient is estimated by correlating (using a multiplier) the output of a scalar function f(v(t)) with a noise source n(t), as shown in Fig. 1. 
The function input y(t) is additively \"contaminated\" by the noise n(t) to produce v(t) = y(t) + n(t). A scale factor B is used to set the scale of the noise to match the function output, which improves the signal-to-noise ratio. The signals are \"high-pass\" filtered to approximate differentiation (shown as d/dt operators in Fig. 1) directly before the multiplication. The results of the multiplication are \"low-pass\" filtered to approximate integration.\n\nThe gradient estimate is integrated over time, to smooth out some of the noise and to damp the response. This smoothed estimate is compared with a \"zero\" reference, using an amplifier A, and the result is fed back to the input, as shown in Fig. 2. The contents of Fig. 1 are represented by the \"Gradient Estimation\" box in Fig. 2.\n\nFigure 2: Closing the loop: performing gradient descent using the gradient estimate.\n\nWe have chosen to implement the multi-dimensional technique in analog VLSI. We will not reproduce here the one-dimensional analysis from [Anderson, Kerns 92], but summarize some of the more important results, and provide a multi-dimensional derivation. [Anderson 92] provides a more detailed theoretical discussion.\n\n3 Multi-dimensional Derivation\n\nThe multi-dimensional gradient descent operation that we are approximating can be written as follows:\n\n$$\\mathbf{y}'(t) = -k \\nabla f(\\mathbf{y}(t)) \\qquad (1)$$\n\nwhere y and y' are vectors, and the solution is obtained continuously in time t, rather than at discrete $t_i$. The circuit described in the block diagram in Fig. 1 computes an approximation to the gradient:\n\n$$E\\left[ n_i'(t) \\frac{d}{dt} f(\\mathbf{y}(t) + \\mathbf{n}(t)) \\right] \\approx \\alpha \\frac{\\partial f}{\\partial y_i} \\qquad (2)$$\n\nWe approximate the operations of differentiation and integration in time by realizable high-pass and low-pass filters, respectively. To see that Eq. 2 is valid, and that this result is useful for approximating Eq. 
1, we sketch an N-dimensional extension of [Anderson 92]. Using the chain rule,\n\n$$\\frac{d}{dt} f(\\mathbf{y}(t) + \\mathbf{n}(t)) = \\sum_j \\left( y_j'(t) + n_j'(t) \\right) \\frac{\\partial f}{\\partial y_j} \\qquad (3)$$\n\nAssuming $n_j'(t) \\gg y_j'(t)$, the rhs is approximated to produce\n\n$$\\frac{d}{dt} f(\\mathbf{y}(t) + \\mathbf{n}(t)) = \\sum_j n_j'(t) \\frac{\\partial f}{\\partial y_j} \\qquad (4)$$\n\nMultiplying both sides by $n_i'(t)$, and taking the expectation integral operator E[ ] of each side,\n\n$$E\\left[ n_i'(t) \\frac{d}{dt} f(\\mathbf{y}(t) + \\mathbf{n}(t)) \\right] = E\\left[ n_i'(t) \\sum_j n_j'(t) \\frac{\\partial f}{\\partial y_j} \\right] \\qquad (5)$$\n\nIf the noise sources $n_i(t)$ and $n_j(t)$ are uncorrelated, $n_i'(t)$ is independent of $n_j'(t)$ when $i \\neq j$, and the sum on the right has a contribution only when i = j,\n\n$$E\\left[ n_i'(t) \\frac{d}{dt} f(\\mathbf{y}(t) + \\mathbf{n}(t)) \\right] = E\\left[ n_i'(t) n_i'(t) \\frac{\\partial f}{\\partial y_i} \\right] \\qquad (6)$$\n\nThe expectation operator E[ ] can be used to smooth random variations of the noise $n_j(t)$. So, we have\n\n$$E\\left[ n_i'(t) \\frac{d}{dt} f(\\mathbf{y}(t) + \\mathbf{n}(t)) \\right] \\approx \\alpha \\frac{\\partial f}{\\partial y_i} \\qquad (7)$$\n\n$$= \\alpha (\\nabla f)_i \\qquad (8)$$\n\nSince the descent rate k is arbitrary, we can absorb $\\alpha$ into k. Using equation 8, we can approximate the gradient descent technique as follows:\n\n$$y_i'(t) \\approx -k E\\left[ n_i'(t) \\frac{d}{dt} f(\\mathbf{y}(t) + \\mathbf{n}(t)) \\right] \\qquad (9)$$\n\n4 Elements of the Multi-dimensional Implementation\n\nWe have designed, fabricated, and tested a chip which allows us to test these ideas. The chip implementation can be decomposed into six distinct parts:\n\nnoise source(s): an analog VLSI circuit which produces a noise function. An independent, correlation-free noise source is needed for each input dimension, designated $n_i(t)$. The noise circuit is described in [Alspector 91].\n\ntarget function: a scalar function $f(y_1, y_2, \\ldots, y_N)$ of N input variables, bounded below, which is to be minimized [Kirk 91]. The circuit in this case is a 4-dimensional variant of the bump circuit described in [Delbr\u00fcck 91]. In the general case, this f(\u00b7) can be any scalar function or error metric, computed by some circuit. 
Specifically, the function may be a neural network.\n\ninput signal(s): the inputs $y_i(t)$ to the function f(\u00b7). These will typically be on-chip values, or real-world inputs.\n\nmultiplier circuit(s): the multiplier computes the correlation between the noise values and the function output. Offsets in the multiplication appear as systematic errors in the gradient estimate, so it is important to compensate for the offsets. Linearity is not especially important, although monotonicity is critical. Ideally, the multiplication will also have a \"tanh-like\" character, limiting the output range for extreme inputs.\n\nintegrator: an integration over time is approximated by a low-pass filter.\n\ndifferentiator: the time derivatives of the noise signals and the function are approximated by a high-pass filter.\n\nThe N inputs, $y_i(t)$, are additively \"contaminated\" with the noise signals, $n_i(t)$, by capacitive coupling, producing $v_i(t) = y_i(t) + n_i(t)$, the inputs to the function f(\u00b7). The function output is differentiated, as are the noise functions. Each differentiated noise signal is correlated with the differentiated function output, using the multipliers. The results are low-pass filtered, providing N partial derivative estimates, for the N input dimensions, shown for 4 dimensions in Fig. 3.\n\nThe function f(\u00b7) is implemented as a 4-dimensional extension of Delbr\u00fcck's [Delbr\u00fcck 91] bump circuit. Details of the N-dimensional bump circuit can be found in [Kirk 93]. For learning and other applications, the function f(\u00b7) can implement some other error metric to be minimized.\n\nFigure 3: Block diagram for a 4-dimensional gradient estimation circuit.\n\n5 Chip Results\n\nWe have tested chips implementing the gradient estimation and gradient descent techniques described in this paper. Figure 4 shows the gradient estimate, without the closed loop descent process. Figure 5 shows the trajectories of two state variables during the 2D gradient descent process. Figure 6 shows the gradient descent process in operation on a 2D bump surface, and Fig. 7 shows how, using an appropriate choice of noise scale, we can perform annealing using the gradient estimation hardware.\n\nFigure 4: Measured Chip Data: 1D Gradient Estimate. Upper curves are 1D bump output as the input y(t) is a slow triangle wave. Lower curves are gradient estimates. (left) raw data, and (right) average of 1024 runs.\n\nFigure 5: Measured Chip Data: 2D Gradient Descent. The curves above show the function optimization by gradient descent for 2 variables. Each curve represents the path of one of the state variables y(t) from some initial values to the values for which the function f(\u00b7) is minimized. (left) raw data, and (right) average of S runs.\n\n6 Conclusions\n\nWe have implemented an analog VLSI structure for performing continuous multi-dimensional gradient descent, and the gradient estimation uses only local information. The circuitry is compact and easily extensible to higher dimensions. This implementation leads to on-chip multi-dimensional optimization, such as is needed to perform on-chip learning for a hardware neural network. We can also perform a kind of annealing by adding a schedule to the scale of the noise input. Our approach also has some drawbacks, however. The gradient estimation is sensitive to the input offsets in the multipliers and integrators, since those offsets result in systematic errors. Also, the gradient estimation technique adds noise to the input signals. 
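As a rough software illustration of the loop summarized above, the following Python sketch imitates the noise-injection, correlation, and feedback steps in discrete time. It is not the chip's circuit: the finite differences stand in for the high-pass filters, an exponential moving average stands in for the low-pass filter, and a quadratic bowl replaces the bump circuit. The function name `noise_correlation_descent`, the test function `bowl`, and all parameter values (`sigma`, `tau`, `k`, `steps`) are assumptions chosen only for illustration.

```python
import numpy as np


def noise_correlation_descent(f, y0, steps=20000, sigma=0.1, tau=50.0, k=0.1, seed=0):
    """Discrete-time sketch of the descent rule in Eq. 9 (illustrative parameters).

    Returns the final state y and the filtered correlation g. For the white
    Gaussian noise assumed here, g approximates alpha * grad f with
    alpha = 2 * sigma**2 when y changes slowly.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y0, dtype=float).copy()
    n_prev = sigma * rng.standard_normal(y.shape)  # one independent source per input
    f_prev = f(y + n_prev)
    g = np.zeros_like(y)
    for _ in range(steps):
        n = sigma * rng.standard_normal(y.shape)
        f_now = f(y + n)           # the function sees the contaminated inputs v = y + n
        dn = n - n_prev            # finite difference in place of the high-pass filter
        df = f_now - f_prev
        g += (dn * df - g) / tau   # multiply, then low-pass (moving average)
        y -= k * g                 # feed the estimate back to descend
        n_prev, f_prev = n, f_now
    return y, g


# Hypothetical 4-input target function with its minimum at `target`.
target = np.array([0.5, -0.3, 0.2, 0.1])


def bowl(v):
    return float(np.sum((v - target) ** 2))


y_final, _ = noise_correlation_descent(bowl, np.zeros(4))
print("final state:", y_final)
```

Setting `k=0.0` holds the inputs fixed, so the returned `g` divided by `2 * sigma**2` can be compared against the analytic gradient; a larger `tau` averages longer and gives a smoother estimate. Scaling `sigma` over time would be one way to experiment with the annealing schedule mentioned above.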
\n\nWe hope that with only small additional circuit complexity, the performance of analog VLSI circuits can be greatly increased by permitting them to be intrinsically adaptive. On-chip implementation of an approximate gradient descent technique is an important step in this direction.\n\nFigure 6: Measured Chip Data: 2D Gradient Descent. Here we see the results for 2D gradient descent on a 2D bump surface. Both the bump surface and the descent path are actual data measured from our chips.\n\nFigure 7: Measured Chip Data: 2D Gradient Descent and Annealing. Here we see the effects of varying the amplitude of the noise. The dots represent points along the optimization path. At left, with small magnitude noise, the process descends to a local minimum. At right, with larger magnitude, the descent process escapes to the global minimum. A schedule of gradually decreasing noise amplitude could reduce the probability of getting caught in undesirable local minima, and increase the probability of converging to a small region near a more desirable minimum, or even the global minimum.\n\nAcknowledgements\n\nThis work was supported in part by an AT&T Bell Laboratories Ph.D. Fellowship, and by grants from Apple, DEC, Hewlett Packard, and IBM. Additional support was provided by NSF (ASC-89-20219), as part of the NSF/DARPA STC for Computer Graphics and Scientific Visualization. All opinions, findings, conclusions, or recommendations expressed in this document are those of the author and do not necessarily reflect the views of the sponsoring agencies.\n\nReferences\n\n[Alspector 93] Alspector, J., R. Meir, B. Yuhas, and A. Jayakumar, \"A Parallel Gradient Descent Method for Learning in Analog VLSI Neural Networks,\" in Advances in Neural Information Processing Systems, Vol. 
5, Morgan Kaufmann, San Mateo, CA, 1993.\n\n[Alspector 91] Alspector, J., J. W. Gannett, S. Haber, M. B. Parker, and R. Chu, \"A VLSI-Efficient Technique for Generating Multiple Uncorrelated Noise Sources and Its Application to Stochastic Neural Networks,\" IEEE Transactions on Circuits and Systems, Vol. 38, no. 1, pp. 109-123, January, 1991.\n\n[Alspector 88] Alspector, J., B. Gupta, and R. B. Allen, \"Performance of a stochastic learning microchip,\" in Advances in Neural Information Processing Systems, vol. I, Denver Colorado, Nov. 1988. D. S. Touretzky, ed., Morgan Kaufmann Publishers, 1989, pp. 748-760.\n\n[Anderson, Kerns 92] Anderson, Brooke P., and Douglas Kerns, \"Using Noise Injection and Correlation in Analog Hardware to Estimate Gradients,\" submitted to IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications.\n\n[Anderson 92] Anderson, Brooke P., \"Low-pass Filters as Expectation Operators for Multiplicative Noise,\" submitted to IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications.\n\n[Cauwenberghs 93] Cauwenberghs, Gert, \"A Fast Stochastic Error-Descent Algorithm for Supervised Learning and Optimization,\" in Advances in Neural Information Processing Systems, Vol. 5, Morgan Kaufmann, San Mateo, CA, 1993.\n\n[Delbr\u00fcck 91] Delbr\u00fcck, Tobias, \"'Bump' Circuits for Computing Similarity and Dissimilarity of Analog Voltages,\" Proceedings of International Joint Conference on Neural Networks, July 8-12, 1991, Seattle Washington, pp. 1-475-479. (Extended version as Caltech Computation and Neural Systems Memo Number 10.)\n\n[Dembo 90] Dembo, A., and T. Kailath, \"Model-Free Distributed Learning,\" IEEE Transactions on Neural Networks, Vol. 1, No. 1, pp. 58-70, 1990.\n\n[Flower 93] Flower, B., and M. 
Jabri, \"Summed Weight Neuron Perturbation: An O(n) Improvement over Weight Perturbation,\" in Advances in Neural Information Processing Systems, Vol. 5, Morgan Kaufmann, San Mateo, CA, 1993.\n\n[Jabri 91] Jabri, M., S. Pickard, P. Leong, Z. Chi, and B. Flower, \"Architectures and Implementations of Right Ventricular Apex Signal Classifiers for Pacemakers,\" IEEE Neural Information Processing Systems 1991 (NIPS 91), Morgan Kaufmann, San Diego, 1991.\n\n[Kerns 92] Kerns, Douglas, \"A Compact Noise Source for VLSI Applications,\" submitted to IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications.\n\n[Kirk 91] Kirk, David, Kurt Fleischer, and Alan Barr, \"Constrained Optimization Applied to the Parameter Setting Problem for Analog Circuits,\" IEEE Neural Information Processing Systems 1991 (NIPS 91), Morgan Kaufmann, San Diego, 1991.\n\n[Kirk 93] Kirk, David, \"Accurate and Precise Computation using Analog VLSI, with Applications to Computer Graphics and Neural Networks,\" Ph.D. Thesis, California Institute of Technology, Caltech-CS-TR-93-??, June, 1993.\n\n[Mead 89] Mead, Carver, \"Analog VLSI and Neural Systems,\" Addison-Wesley, 1989.\n\n[Platt 89] Platt, John, \"Constrained Optimization for Neural Networks and Computer Graphics,\" Ph.D. Thesis, California Institute of Technology, Caltech-CS-TR-89-07, June, 1989.\n\n[Umminger 89] Umminger, Christopher B., and Steven P. DeWeerth, \"Implementing Gradient Following in Analog VLSI,\" Advanced Research in VLSI, MIT Press, Boston, 1989, pp. 195-208.\n", "award": [], "sourceid": 632, "authors": [{"given_name": "David", "family_name": "Kirk", "institution": null}, {"given_name": "Douglas", "family_name": "Kerns", "institution": null}, {"given_name": "Kurt", "family_name": "Fleischer", "institution": null}, {"given_name": "Alan", "family_name": "Barr", "institution": null}]}