{"title": "Adaptive Neural Networks Using MOS Charge Storage", "book": "Advances in Neural Information Processing Systems", "page_first": 761, "page_last": 768, "abstract": null, "full_text": "761 \n\nAdaptive Neural Networks Using MOS Charge Storage \n\nD. B. Schwartz 1, R. E. Howard and W. E. Hubbard \n\nAT&T Bell Laboratories \n\nCrawfords Corner Rd. \nHolmdel, N.J. 07733 \n\nAbstract \n\nMOS charge storage has been demonstrated as an effective method to store \nthe weights in VLSI implementations of neural network models by several \nworkers 2 . However, to achieve the full power of a VLSI implementation of \nan adaptive algorithm, the learning operation must built into the circuit. We \nhave fabricated and tested a circuit ideal for this purpose by connecting a \npair of capacitors with a CCD like structure, allowing for variable size weight \nchanges as well as a weight decay operation. A 2.51-' CMOS version achieves \nbetter than 10 bits of dynamic range in a 140/' X 3501-' area. A 1.25/' chip \nbased upon the same cell has 1104 weights on a 3.5mm x 6.0mm die and is \ncapable of peak learning rates of at least 2 x 109 weight changes per second. \n\n1 Adaptive Networks \n\nMuch of the recent excitement about neural network models of computation has \nbeen driven by the prospect of new architectures for fine grained parallel compu(cid:173)\ntation using analog VLSI. Adaptive systems are espescially good targets for analog \nVLSI because the ada.ptive process can compensate for the inaccuracy of individual \ndevices as easily as for the variability of the signal. However, silicon VLSI does not \nprovide us with an ideal solution for weight storage. Among the properties of an \nideal storage technology for analog VLSI adaptive systems are: \n\n\u2022 The minimum available weight change ~w must be small. The simplest adap(cid:173)\n\ntive algorithms optimize the weights by minimizing the output error with a \nsteepest descent search in weight space [1]. 
Iterative improvement algorithms such as steepest descent are based on the heuristic assumption that 'better' weights are found in the neighborhood of 'good' ones, a heuristic that fails when the granularity of the weights is not fine enough. In the worst case, the resolution required just to represent a function can grow exponentially in the dimension of the input space. \n\n• The weights must be able to represent both positive and negative values, and the changes must be easily reversible. Frequently, the weights may cycle up and down while the adaptive process is converging, and millions of incremental changes during a single training session is not unreasonable. If the weights cannot easily follow all of these changes, then the learning must be done off chip. \n\n• The parallelism of the network can be exploited to the fullest only if the mechanism controlling weight changes is simple enough to be reproduced at each weight. Ideally, the change is determined by some easily computed combination of information local to each weight and signals global to the entire system. This type of locality, which is as much a property of the algorithm as of the hardware, is necessary to keep the wiring cost associated with learning small. \n\n• Weight decay, w_i ← a w_i with a < 1, is useful although not essential. Global decay of all the weights can be used to extend their dynamic range by rescaling when the average magnitude becomes too large. Decay of randomly chosen weights can be used both to control their magnitude [2] and to help gradient searches escape from local minima. \n\n1 Now at GTE Laboratories, 40 Sylvan Rd., Waltham, Mass 02254 dbs@gte.com%relay.cs.net \n2 For example, see the papers by Mann and Gilbert, Walker and Akers, and Murray et al. in this proceedings 
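The storage requirements listed above can be made concrete with a small numerical sketch, assuming a weight held as the difference of two node voltages with a fixed common mode; the class name, step size and voltage values here are illustrative assumptions, not measurements from the chip:

```python
# Sketch (not the fabricated circuit): a weight stored differentially on two
# capacitor voltages, with reversible fixed-size changes and multiplicative decay.
class DifferentialWeight:
    def __init__(self, v_common=2.5, step=0.005):
        self.v_plus = v_common          # voltage on the positive storage node
        self.v_minus = v_common         # voltage on the negative storage node
        self.step = step                # one charge-transfer packet, in volts

    @property
    def w(self):                        # the weight is the differential voltage
        return self.v_plus - self.v_minus

    def transfer(self, direction=+1):
        # Move one packet between the capacitors; total charge stays constant.
        self.v_plus += direction * self.step
        self.v_minus -= direction * self.step

    def decay(self, a=0.9):
        # Weight decay: w <- a * w with a < 1, common mode unchanged.
        mid = 0.5 * (self.v_plus + self.v_minus)
        half = 0.5 * a * self.w
        self.v_plus, self.v_minus = mid + half, mid - half

w = DifferentialWeight()
for _ in range(100):
    w.transfer(+1)                      # 100 increments ...
for _ in range(40):
    w.transfer(-1)                      # ... partially undone by 40 decrements
print(round(w.w, 6))                    # 60 net transfers * 0.01 V = 0.6
w.decay(0.5)
print(round(w.w, 6))                    # 0.3
```

Because increments and decrements move the same size packet in opposite directions, the changes are exactly reversible, and the decay operation rescales the weight while leaving the common mode voltage untouched.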
\n\nTo implement an analog storage cell in MOS VLSI, the most obvious choices are non-volatile devices such as floating gate and MNOS transistors, multiplying DACs with conventional digital storage, and dynamic analog storage on MOS capacitors. Most non-volatile devices rely upon electron tunneling to change the amount of stored charge, typically requiring a large amount of circuitry to control weight changes. DACs have already proven themselves in situations where 5 bits or less of resolution [3] [4] are sufficient, but higher resolution is prohibitively expensive in terms of area. We will show that the disadvantage of MOS charge storage, its volatility, is more than outweighed by the resolution available and the ease of making weight changes. \n\nRepresentation of both positive and negative weights can be obtained by storing the weights w_i differentially on a pair of capacitors, in which case \n\nw_i ∝ V_i^+ - V_i^-. \n\nDifferential storage can be used to obtain some degree of rejection of leakage and can guarantee that leakage will reduce the magnitude of the weights; by contrast, in a scheme where the weights are defined with respect to a fixed level, a decaying weight can change sign. A constant common mode voltage also eases the design constraints on the differential input multiplier used to read out the weights. An elegant way to manipulate the weights is to transfer charge from one capacitor to the other, keeping the total charge on the system constant and thus maximizing the dynamic range available from the readout circuit. \n\n2 Weight Changes \n\nSmall packets of charge can easily be transferred from one capacitor to the other by exploiting charge injection, a phenomenon carefully avoided by designers of switched capacitor circuits as a source of sampling error [5] [6] [7] [8] [9]. An example of a storage cell with the simplest configuration for a charge transfer system is shown in figure 1. 
A pair of MOS capacitors are connected by a string of narrow MOS transistors: a long one to transfer charge and two minimum-length ones to isolate the charge transfer transistor from the storage nodes. \n\nFigure 1: (a) The simplest storage cell, with provisions for only a single size of increment/decrement operation and no weight decay. (b) A more sophisticated cell with facilities for weight decay. By suitable manipulation of the clock signals, the two charge transfer transistors can be used to obtain different sizes of weight changes. Both circuits are initialized by turning on the access transistors TA and charging the capacitors up to a convenient voltage, typically V_DD/2. \n\nFor the sake of discussion, we can treat the isolation transistors as ideal switches and concentrate on the charge transfer transistor, which we here assume to be an n-channel device. To increase the weight (see figure 1), the charge transfer transistor (TC) and the isolation transistor attached to the positive storage node (TP) are turned on. When the system has reached electrostatic equilibrium, the charge transfer transistor (TC) is disconnected from the plus storage node by turning off TP and connected to the minus storage node by turning on TM. If the charge transfer transistor TC is slowly turned off, the mobile charge in its channel will diffuse into the minus node, lowering its voltage. \n\nA detailed analysis of the charge transfer mechanism has been given elsewhere [10], but for the purpose of a qualitative understanding of the circuit, the inversion charge in the charge transfer transistor's channel can be approximated by \n\nQ_inv = C_ox (V_G - V_TE), 
\n\nwhere V_TE is the effective threshold voltage and C_ox the gate-to-channel capacitance of the charge transfer transistor. The effective threshold voltage is then given by \n\nV_TE = V_T0 + γ ( (2φ_f + V_S)^1/2 - (2φ_f)^1/2 ), \n\nwhere V_T0 is the threshold voltage in the absence of body effect, φ_f the Fermi level, V_S the source-to-substrate voltage, and γ the usual body effect coefficient. An even rougher model can be obtained by linearizing the body effect term [6], \n\nQ_inv = C_eff (V_G - V_T - η V_S), \n\nwhere C_eff contains both the gate oxide capacitance and the effects of parasitic capacitance, and η = γ / (2 (2φ_f)^1/2). Within the linearized approximation, the change in voltage on a storage node with capacitance C_store after n transfers is \n\nV_n = V_0 + (1/η)(V_G - V_T - η V_0)(1 - exp(-α n))    (1) \n\nwith α = C_eff / C_store, where V_0 is the initial voltage on the storage node. Due to the dependence of the size of the transfer on the stored voltage, when the transfer direction is reversed the increment size changes unless the stored voltages on the capacitors are equal. This can be partially compensated for by using complementary pairs of p-channel and n-channel charge transfer transistors, in effect using a string of transmission gates to perform charge transfers. A weight decay operation can be introduced by using the more complex string of charge transfer transistors shown in figure 1b. A weight decay is initiated by turning off the transistor in the middle of the string (TI) and turning on all the other transistors. When the two sides of the charge transfer string have equilibrated with their respective storage nodes, the connections to the storage nodes (TM and TP) are turned off and the two charge transfer transistors (TCP and TCM) are allowed to exchange charge by turning on the transistor, TI, which separates them. 
When two equal charge packets have been obtained, TI is turned off again and the charge packets held by TCP and TCM are injected back into the storage capacitors. The resulting change in the stored weight is \n\nΔV_decay = - (C_eff / C_ox) (V_+ - V_-), \n\nwhich corresponds to multiplying the weight by a constant a < 1, as desired. Besides allowing for weight decay, the more complex charge string shown in figure 1b can also be used to obtain different size weight changes by using different clock sequences. \n\n3 Experimental Evaluation \n\nTest chips have been fabricated in both 1.25µ and 2.5µ CMOS, using the AT&T Twin Tub technology [11]. To evaluate the properties of an individual cell, especially the charge transfer mechanism, an isolated test structure consisting of five storage cells was built on one section of the 2.5µ chip. The storage cells were differentially read out by two-quadrant transconductance amplifiers whose input-output characteristics are shown in figure 2. By using the bias current of the amplifiers as an input, the amplifiers were used as two-quadrant multipliers. Since many neural network models call for a sigmoidal nonlinearity, no attempt was made to linearize the operation of the multiplier. \n\nFigure 2: A family of transfer characteristics from one of the transconductance multipliers for several different values of stored weight. The different branches of the curves are each separated by ten large charge transfers. No attempt was made to linearize the input/output characteristic since many neural network models call for non-linearities. \n\nThe output currents of the five multipliers were summed by a single output wire and the voltages on each of the ten capacitors were 
buffered by voltage followers to allow for detailed examination of the inner workings of the cell. \n\nAfter trading off between hold time, resolution and area, we decided upon 20µ long charge transfer transistors and 2000µ^2 storage capacitors in the 2.5µ technology, based upon the minimum channel width of 2.5µ. For a 20µ long channel and a 2.5V gate-to-source voltage, the channel transit time τ_0 is approximately 5 ns, and charge transfer clock frequencies exceeding 10 MHz are possible without measurable pumping of charge into the substrate. The 2.5µ wide access transistors were 12µ long, leading to leakage rates from the individual capacitors of about 1% of the stored value in 100 s, limited by surface leakage in our unpassivated test structures. Even with uncapped wafers, the leakage was small enough to allow all the tests described here to be made without special provisions for environmental control of either temperature or humidity. As mentioned earlier, the more complex set of charge transfer transistors needed to introduce weight decay can also be used to obtain several different sizes of charge transfer: a small weight change by using the two long transistors in sequence, and a coarse one by treating the two long transistors and the isolation transistor separating them as a single device. Using the small weight changes, the worst case resolution was 10 bits (near ΔV = 0) and the results were in excellent agreement with the predictions of equation 1, using the effective capacitance as a fitting parameter. \n\nFigure 3: The voltage on the two storage capacitors when the weight is initially set to saturation using large increments and then reduced back towards zero using weight decay. The granularity of the curves is an experimental artifact of the digital voltmeter's resolution. \n\nIn figure 3 we use large charge transfers to quickly increment the weight up to its maximum value and then reduce it back to zero with weight decays, demonstrating the expected exponential dependence of the stored voltage on the number of weight decays. Even under repeated cycling up and down through the entire differential voltage range of the cell, the total amount of charge on the cell remained constant for frequencies under 10 MHz, with the exception of the expected losses due to leakage. \n\nThe long term goal of this work is to develop analog VLSI chips that are complete 'learning machines', capable of modifying their own weights when provided with input data and some feedback based on the output of the network. However, the study of learning algorithms is in a state of flux and few, if any, algorithms have been optimized for VLSI implementation. Rather than cast an inappropriate algorithm in silicon, we have designed our first chips to be used as adaptive systems with an external controller, allowing us to develop algorithms that are appropriate for the medium once we understand its properties. The networks are organized as rectangular matrix multipliers with voltage inputs and current outputs, with 46 inputs and 24 outputs in a 96 pin package for the 1.25µ chip. Since none of the analog input/output lines of the chip are multiplexed, larger and more complicated networks can be built by cascading several chips. \n\nTo the digital controller, the chip looks like a 1104 x 2 static RAM with some extra clock inputs to drive the charge transfers. 
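The 1104 x 2 RAM organization suggests two control bits per weight, which can hold a ternary change code. One update cycle, as seen from the controller, can be sketched as follows; the particular bit encoding and all names here are illustrative assumptions, not the chip's documented interface:

```python
# Sketch of the controller's view of one learning step (illustrative only):
# two RAM bits per weight select "no change", "increment" or "decrement", and a
# global clock burst then applies every selected change in parallel.
N_WEIGHTS = 1104

def encode(delta):
    # Assumed encoding: (0,0) = disconnected, (1,0) = forward, (0,1) = reverse.
    return [(1, 0) if d == +1 else (0, 1) if d == -1 else (0, 0) for d in delta]

def apply_transfers(weights, ram, n_clocks=1, step=1):
    # Each clock of the global transfer lines moves one packet per selected
    # weight; every selected weight changes on the same burst.
    return [w + (fwd - rev) * step * n_clocks
            for w, (fwd, rev) in zip(weights, ram)]

weights = [0] * N_WEIGHTS
delta = [0] * N_WEIGHTS
delta[:3] = [+1, -1, +1]                 # change only the first three weights
weights = apply_transfers(weights, encode(delta), n_clocks=5)
print(weights[:4])                       # [5, -5, 5, 0]
```

Writing the RAM is serial, but the clock burst applies every selected weight change simultaneously, which is where the parallelism of the architecture comes from.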
The charge transfer clock signals are distributed globally and are connected to the individual strings of charge transfer transistors through a pair of 2 x 2 crossbar switches controlled by two bits of static RAM local to each cell. The use of a pair of crossbar switches is necessitated by the facilities for weight decay; if the simpler charge transfer string shown in figure 1a were used, then only a single switch would be needed. When both of a cell's RAM bits are zeroed, the global charge transfer lines are not connected to the charge transfer transistors. The global lines are connected to the individual strings of charge transfer transistors either normally or in reverse, depending upon which RAM cell contains a one. By reversing the order of the signals on the charge transfer lines, a weight change can also be reversed. Neglecting the dependence of the size of the charge transfer upon the stored weight, the RAMs represent a weight change vector Δw with components Δw_ij ∈ {-1, 0, 1}. Once a weight change vector has been written serially to the RAMs, the weight changes along that vector are made in parallel by manipulating the charge transfer lines. This architecture is also a powerful way to implement programmable networks of fixed weights, since an arbitrary matrix of 10 bit weights can be written to the chip in a few milliseconds or less if an efficient decomposition of the desired weight matrix into global charge transfers is made. In view of the speed with which the chip can evaluate the output of a network, an overhead of less than a percent for a refresh operation is acceptable in many applications. \n\n4 Conclusions \n\nWe have implemented a generic chip to facilitate studying adaptive networks by building them in analog VLSI. 
By exploiting the well known properties of charge storage and charge injection in a novel way, we have achieved a high enough level of complexity (> 10^3 weights and 10 bits of analog depth) to be interesting, in spite of the limitation to a modest 6.0mm x 3.5mm die size required by a multi-project fabrication run. If the cell were optimized to represent fixed weight networks by eliminating weight decay and bi-directional weight changes, the density could easily be increased by a factor of two with no loss in resolution. Once a weight change vector has been written to the RAM cells, charge transfers can be clocked at a rate of 2 MHz, which with 1104 weights updated in parallel corresponds to a peak learning rate of 2 x 10^9 updates/second, exceeding the speeds of 'digital neurocomputers' based upon DSP chips by two orders of magnitude. \n\nAcknowledgements \n\nA large group of people assisted the authors in taking this work from concept to silicon, a few of whom we single out for mention here. The IDA design tools used for the layouts were provided and supported by D. D. Hill and D. D. Shugard at Murray Hill, and the 1.25µ process was supported by D. Wroge and R. Ashton. The first author wishes to acknowledge helpful discussions with H. P. Graf, S. Mackie and G. Taylor, with special thanks to R. G. Swartz. \n\nReferences \n\n[1] Bernard Widrow and Samuel D. Stearns. Adaptive Signal Processing. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1985. \n\n[2] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147, 1985. \n\n[3] Jack Raffel, James Mann, Robert Berger, Antonio Soares, and Sheldon Gilbert. A generic architecture for wafer-scale neuromorphic systems. In IEEE First International Conference on Neural Networks, Volume III, page 501, 1987. \n\n[4] Joshua Alspector, Bhusan Gupta, and Robert B. Allen. Performance of a stochastic learning microchip. 
In Advances in Neural Information Processing Systems, 1988. \n\n[5] William B. Wilson, Hisham Z. Massoud, Eric J. Swanson, Rhett T. George, and Richard B. Fair. Measurement and modeling of charge feedthrough in n-channel MOS analog switches. IEEE Journal of Solid-State Circuits, SC-20(6):1206-1213, 1985. \n\n[6] George Wegmann, Eric A. Vittoz, and Fouad Rahali. Charge injection in analog MOS switches. IEEE Journal of Solid-State Circuits, SC-22(6):1091-1097, 1987. \n\n[7] James A. Kuo, Robert W. Dutton, and Bruce A. Wooley. MOS pass transistor turn-off transient analysis. IEEE Transactions on Electron Devices, ED-33(10):1545-1555, 1986. \n\n[8] James R. Kuo, Robert W. Dutton, and Bruce A. Wooley. Turn-off transients in circular geometry MOS pass transistors. IEEE Journal of Solid-State Circuits, SC-21(5):837-844, 1986. \n\n[9] Je-Hurn Shieh, Mahesh Patil, and Bing J. Sheu. Measurement and analysis of charge injection in MOS analog switches. IEEE Journal of Solid-State Circuits, SC-22(2):277-281, 1987. \n\n[10] R. E. Howard, D. B. Schwartz, and W. E. Hubbard. A programmable analog neural network chip. IEEE Journal of Solid-State Circuits, 24, 1989. \n\n[11] J. Argraz-Guerena, R. A. Ashton, W. J. Bertram, R. C. Melin, R. C. Sun, and J. T. Clemens. Twin Tub III - A third generation CMOS. In Proceedings of the International Electron Device Meeting, 1984. Citation P63-6. \n", "award": [], "sourceid": 110, "authors": [{"given_name": "Daniel", "family_name": "Schwartz", "institution": null}, {"given_name": "R.", "family_name": "Howard", "institution": null}, {"given_name": "Wayne", "family_name": "Hubbard", "institution": null}]}