{"title": "Digital Boltzmann VLSI for constraint satisfaction and learning", "book": "Advances in Neural Information Processing Systems", "page_first": 896, "page_last": 903, "abstract": null, "full_text": "Digital Boltzmann VLSI for \n\nconstraint satisfaction and learning \n\nMichael Murray t \n\nMing-Tak Leung t \n\nKan Boonyanit t \n\nKong Kritayakirana t \n\nJames B. Burrt* \n\nGregory J. Wolff+ \n\nTakahiro Watanabe+ \n\nEdward Schwartz+ \n\nDavid G. Storktt \n\nAllen M. Petersont \n\nt Department of Electrical Engineering \n\nStanford University \n\nStanford, CA 94305-4055 \n\n+Ricoh California Research Center \n\n2882 Sand Hill Road Suite 115 \nMenlo Park, CA 94025-7022 \n\n* \nSun Mlcrosystems \n\nand \n. \n\n2550 Garcia Ave., MTV-29, room 203 \n\nMountain View, CA 94043 \n\nAbstract \n\nWe built a high-speed, digital mean-field Boltzmann chip and SBus \nboard for general problems in constraint satjsfaction and learning. \nEach chip has 32 neural processors and 4 weight update processors, \nsupporting an arbitrary topology of up to 160 functional neurons. \nOn-chip learning is at a theoretical maximum rate of 3.5 x 108 con(cid:173)\nnection updates/sec; recall is 12000 patterns/sec for typical condi(cid:173)\ntions. The chip's high speed is due to parallel computation of inner \nproducts, limited (but adequate) precision for weights and activa(cid:173)\ntions (5 bits), fast clock (125 MHz), and several design insights. \n\n896 \n\n\fDigital Boltzmann VLSI for Constraint Satisfaction and Learning \n\n897 \n\n1 \n\nINTRODUCTION \n\nA vast number of important problems can be cast into a form of constraint satisfac(cid:173)\ntion. A crucial difficulty when solving such problems is the fact that there are local \nminima in the solution space, and hence simple gradient descent methods rarely suf(cid:173)\nfice. 
Simulated annealing via the Boltzmann algorithm (BA) is attractive because it can avoid local minima better than many other methods (Aarts and Korst, 1989). It is well known that the problem of learning also generally has local minima in weight (parameter) space; a Boltzmann algorithm has been developed for learning which is effective at avoiding local minima (Ackley and Hinton, 1985). The BA has not received extensive attention, however, in part because of its slow operation, which is due to the annealing stages in which the network is allowed to slowly relax into a state of low error. Consequently there is a great need for fast and efficient special-purpose VLSI hardware implementing the algorithm. Analog Boltzmann chips have been described by Alspector, Jayakumar and Luna (1992) and by Arima et al. (1990); both implement the stochastic BA. Our digital chip is the first to implement the deterministic mean-field BA (Hinton, 1989), and although its raw throughput is somewhat lower than that of the analog chips just mentioned, ours has unique benefits in capacity, ease of interfacing, and scalability (Burr, 1991, 1992). \n\n2 BOLTZMANN THEORY \n\nThe problems of constraint satisfaction and of learning are unified through the Boltzmann learning algorithm. Given a partial pattern and a set of constraints, the BA completes the pattern by means of annealing (gradually lowering a computational \"temperature\" until the lowest energy state is found), an example of constraint satisfaction. Over a set of training patterns, the learning algorithm modifies the constraints to model the relationships in the data. 
\n\n2.1 CONSTRAINT SATISFACTION \n\nA general constraint satisfaction problem over variables x_i (e.g., neural activations) is to find the set of x_i that minimizes a global energy function E = -(1/2) Σ_ij w_ij x_i x_j, where the w_ij are the (symmetric) connection weights between neurons i and j and represent the problem constraints. \n\nThere are two versions of the BA approach to minimizing E. In one version, the stochastic BA, each binary neuron x_i ∈ {-1, +1} is polled randomly, independently and repeatedly, and its state is given a candidate perturbation. The probability of acceptance of this perturbation depends upon the amount of the energy change and the temperature. Early in the annealing schedule (i.e., at high temperature) the probability of acceptance is nearly independent of the change in energy; late in annealing (i.e., at low temperature), candidate changes that lead to lower energy are accepted with higher probability. \n\nIn the deterministic mean-field BA, each continuous-valued neuron (-1 < x_i ≤ 1) is updated simultaneously and in parallel; its new activation is set to x_i = f(Σ_j w_ij x_j), where f(·) is a monotonic non-linearity, typically a sigmoid, which corresponds to a stochastic unit at a given temperature (assuming independent inputs). The inverse slope of the non-linearity is proportional to the temperature; at the end of the anneal the slope is very high and f(·) is effectively a step function. It has been shown that if certain non-restrictive assumptions hold, and if the annealing schedule is sufficiently slow, then the final binary states (at zero temperature) will be those of minimum E (Hinton, 1989; Peterson and Hartman, 1989). 
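As a concrete sketch of the mean-field procedure just described, the following short simulation (ours, for illustration only; the toy network, weights, NumPy formulation, and geometric schedule are assumptions, not the chip's fixed-point arithmetic) anneals a small network by iterating x_i = f(Σ_j w_ij x_j) while lowering the temperature:

```python
import numpy as np

def mean_field_anneal(W, x, clamped, temps):
    # Deterministic mean-field Boltzmann anneal.
    # W: symmetric weight matrix (the constraints), zero diagonal
    # x: initial activations in (-1, 1); clamped: mask of fixed (input) units
    # temps: annealing schedule, from high temperature down to low
    x = x.copy()
    for T in temps:
        # Parallel update x_i = f(sum_j w_ij x_j); the inverse slope of the
        # sigmoid (here tanh) is proportional to the temperature T.
        x_new = np.tanh(W @ x / T)
        x[~clamped] = x_new[~clamped]   # clamped units never change
    return x

# Toy problem: two free units constrained (positive weights) to agree
# with one clamped input unit; all should settle near +1.
W = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
x0 = np.array([1.0, 0.01, -0.01])
clamped = np.array([True, False, False])
temps = np.geomspace(10.0, 0.01, 50)    # slowly lowered temperature
x_final = mean_field_anneal(W, x0, clamped, temps)
```

By the end of the schedule tanh(·/T) is effectively a step function, so the free units saturate to the binary states that minimize E = -(1/2) Σ_ij w_ij x_i x_j.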
\n\n2.2 LEARNING \n\nThe problem of Boltzmann learning is the following: given a network topology of input and output neurons, interconnected by hidden neurons, and given a set of training patterns (input and desired output), find a set of weights that leads to a high probability of the desired output activations for the corresponding input activations. In the Boltzmann algorithm such learning is achieved using two main phases, the Teacher phase and the Student phase, followed by the actual Weight update. During the Teacher phase the network is annealed with the inputs and outputs clamped (held at the values provided by the omniscient teacher). During the anneal of the Student phase, only the inputs are clamped; the outputs are allowed to vary. The weights are updated according to: \n\nΔw_ij = ε(⟨x_i^T x_j^T⟩ - ⟨x_i^S x_j^S⟩)    (1) \n\nwhere ε is a learning rate, ⟨x_i^T x_j^T⟩ is the coactivation of neurons i and j at the end of the Teacher phase, and ⟨x_i^S x_j^S⟩ that at the end of the Student phase (Ackley and Hinton, 1985). Hinton (1989) has shown that Eq. 1 effectively performs gradient descent on the cross-entropy distance between the probability of a state in the Teacher (clamped) and the Student (free-running) phases. \n\nRecent simulations by Galland (1993) have shown limitations of the deterministic BA for learning in networks having hidden units directly connected to other hidden units. While his results do not cast doubt on the deterministic BA for constraint satisfaction, they do imply that the deterministic BA for learning is most successful in networks with a single hidden layer. Fortunately, with enough hidden units this topology has the expressive power to represent all but the most pathological input-output mappings. \n\n3 FUNCTIONAL DESIGN AND CHIP OPERATION \n\nFigure 1 shows the functional block diagram of our chip. 
The most important units are the Weight memory, Neural processors, Weight update processors, Sigmoid, and Rotating Activation Storage (RAS); their operation is best explained in terms of constraint satisfaction and learning. \n\n3.1 CONSTRAINT SATISFACTION \n\nFor constraint satisfaction, the weights (constraints) are loaded into the Weight memory, the form of the transfer function is loaded into the Sigmoid unit, and the values and durations of the annealing temperatures (the annealing schedule) are loaded into the Temperature unit. Then an input pattern is loaded into a bank of the RAS to be annealed. Such an anneal occurs as follows: at an initial high temperature, the 32 Neural processors compute the sums Σ_j w_ij x_j in parallel for the hidden units. A 4× multiplexing here permits networks of up to 128 neurons to be annealed, with the remaining 32 neurons used as (non-annealed) inputs. Thus our chip supports networks of up to 160 neurons total. These activations are stored in the Neural Processor Latch and then passed sequentially to the Sigmoid unit, where they are multiplied by the reciprocal of the instantaneous temperature. The Sigmoid unit employs a lookup table to convert the inputs to neural outputs by means of the non-linearity f(·). These outputs are sequentially loaded back into the activation store. The temperature is lowered (according to the annealing schedule), the new activations are calculated as before, and so on. The final set of activations x_i (i.e., at the lowest temperature) represents the solution. \n\n[Figure 1 block diagram: Rotating Activation Storage, 4 weight update processors, weight update cache, Weight memory, 32 Neural Processors (NP), Sigmoid unit.] \n\nFigure 1: Boltzmann VLSI block diagram. 
The rotating activation storage (black) consists of three banks, which for learning problems contain the last pattern (already annealed), the current pattern (being annealed), and the next pattern (to be annealed) read onto the chip through the external interface. \n\n3.2 LEARNING \n\nWhen the chip is used for learning, the weight memory is initialized with random weights and the first, second and third training patterns are loaded into the RAS. The three-bank RAS is crucial for our chip's speed because it allows a three-fold concurrency: 1) a current pattern of activations is annealed, while 2) the annealed last pattern is used to update the weights, while 3) the next pattern is being loaded from off-chip. The three banks form a circular buffer, each with a Student and a Teacher activation store. \n\nDuring the Teacher anneal phase (for the current pattern), activations of the input and output neurons are held at the values given by the teacher, and the values of the hidden units are found by annealing (as described in the previous subsection). After the last such annealing step (i.e., at the lowest temperature), the final activations are left in the Teacher activation store; the Teacher phase is then complete. The annealing schedule is then reset to its initial temperature, and the above process is repeated for the Student phase; here only the input activations are clamped to their values and the outputs are free to vary. At the end of this Student anneal, the final activations are left in the Student activation store. \n\nIn steady state, the MUX then rotates the storage banks of the RAS such that the next, current, and last banks are now called the current, last, and next, respectively. 
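The Teacher/Student cycle culminates in the weight update of Eq. 1, which can be sketched in software as follows (a hypothetical host-side illustration, not the chip's fixed-point microcode; the outer-product formulation, the learning rate, and the toy activations are our assumptions):

```python
import numpy as np

def boltzmann_weight_update(W, x_teacher, x_student, lr=0.1):
    # Eq. 1: dw_ij = lr * (teacher coactivation - student coactivation),
    # where x_teacher holds the final activations of the Teacher anneal
    # (inputs AND outputs clamped) and x_student those of the Student
    # anneal (inputs clamped only).
    dW = lr * (np.outer(x_teacher, x_teacher) - np.outer(x_student, x_student))
    np.fill_diagonal(dW, 0.0)   # no self-connections
    return W + dW

# If the Student phase gets the output unit (index 2) wrong, the update
# strengthens the weights pulling it toward the teacher's value.
x_t = np.array([1.0, 1.0, 1.0])     # Teacher-phase final activations
x_s = np.array([1.0, 1.0, -1.0])    # Student-phase final activations
W = boltzmann_weight_update(np.zeros((3, 3)), x_t, x_s)
```

Where the teacher and student coactivations agree (units 0 and 1 here) the weight is unchanged; the weights into the erring output unit grow by 2·lr each, the gradient-descent step on the cross-entropy distance described by Hinton (1989).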
\n\nTo update the weights, the activations in the Student and Teacher storage banks for the pattern just annealed (now called the \"last\" pattern) are sent to the four Weight update processors, along with the weights themselves. The Weight update processors compute the updated weights according to Eq. 1, and write them back to the Weight memory. While such a weight update is occurring for the last pattern, the current pattern is annealing and the next pattern is being loaded from off chip. \n\nAfter the chip has been trained with all of the patterns, it is ready for use in recall. During recall, a test pattern is loaded into the input units of an activation bank (Student side), the machine performs a Student anneal, the final output activations are placed in the Student activation store, and these are then read off the chip to the host computer as the result. In a constraint satisfaction problem, we merely download the weights (constraints) and perform a Student anneal. \n\n4 HARDWARE IMPLEMENTATION \n\nFigure 2 shows the chip die. The four main blocks of the Weight memory are at the top, surrounded by 32 Neural processors (above and below this memory), and four Weight update processors (between the memory banks). The three banks of the Rotating Activation Store are at the bottom of the chip. The Sigmoid processor is at the lower left, and the instruction cache and external interface at the lower right. Most of the rest of the chip consists of clocking and control circuitry. \n\n4.1 VLSI \n\nThe chip mixes dynamic and static memory on the same die. The Activation and Temperature memories are static RAM (which needs no refresh circuitry) while the Weight memory is dynamic (for area efficiency). The system clock is distributed to various local clock drivers in order to reduce the global clock capacitance and to selectively disable the clocks in inactive subsystems, reducing power consumption. 
\n\nFigure 2: Boltzmann VLSI chip die. \n\nEach functional block has its own finite state machine control which communicates asynchronously. For diagnostic purposes, the State Machines and counters are observable through the External Interface. There is a Single Step mode which has been very useful in verifying sub-system performance. Figure 3 shows the power dissipation throughout a range of frequencies. Note that the power is less than 2 Watts throughout. \n\nExtensive testing of the first silicon revealed two main classes of chip error: electrical and circuit. Most of the electrical problems can be traced to fast edge rates on the DRAM sense-amp equalization control signals, which cause inductive voltage transients of roughly 1 Volt on the power supply rails. This appears to be at least partly responsible for the occasional loss of data in dynamic storage nodes. There also seems to be insufficient latchup protection in the pads, which is aggravated by the on-chip voltage surges. The circuit problems can be traced to having to modify the circuits used in the layout for full-chip simulation. \n\nIn light of these problems, we have simulated the circuit in great detail in order to explore possible corrective steps. We have modified the design to provide improved electrical isolation, resized drivers, and reduced the logic depth in several components. These corrections solve the problems in simulation, and give us confidence that the next fab run will yield a fully working chip. \n\n4.2 BOARD AND SBus INTERFACE \n\nAn SBus interface board was developed to allow the Boltzmann chip to be used with a SparcStation host. 
The registers and memory in the chip can be memory mapped so that they are directly accessible to user software. The board can support 20-bit transfers to the chip at a sustained rate in excess of 8 Mbytes/second. The board uses reconfigurable Xilinx FPGAs (field-programmable gate arrays) to allow flexibility for testing with and without the chip installed. \n\nTable 1: Boltzmann VLSI chip specifications \n\nArchitecture: n-layer, arbitrary interconnections \nSize: 9.5 mm × 9.8 mm \nNeurons: 32 processors → 160 virtual \nWeight memory: 20,480 5-bit weights (on chip) \nActivation store: 3 banks, 160 teacher & 160 student values in each \nTechnology: 1.2 µm CMOS \nTransistors: 400,000 \nPins: 84 \nClock: 125 MHz (on chip) \nI/O rate: 3 × 10^7 activations/sec (sustained) \nLearning rate: 3.5 × 10^8 connection updates/sec (on chip) \nRecall rate: 12000 patterns/sec \nPower dissipation: ≤ 2 Watts (see Figure 3) \n\n4.3 SOFTWARE \n\nThe chip control program is written in C (roughly 1,500 lines of code) and communicates with the Boltzmann interface card through virtual memory. The user can read/write all activation and weight memory locations, and all functions of the chip (learning, recall, annealing, etc.) can thus be specified in software. \n\n5 CONCLUSIONS AND FUTURE WORK \n\nThe chip was designed so that interchip communications could be easily incorporated by means of high-speed parallel busses. The SBus board, interface and software described above will require only minor changes to incorporate a multi-chip module (MCM) containing several such chips (for instance 16). There is minimal inter-chip communication delay (< 3% overhead), and thus MCM versions of our system promise to be extremely powerful learning systems for large neural network problems (Murray et al., 1992). \n\nFigure 3: Power dissipation of the chip during full operation at 5 Volts. \n\nAcknowledgements \n\nThanks to Martin Boliek and Donald Wynn for assistance in design and construction of the SBus board. Research support by NASA through grant NAGW419 is gratefully acknowledged; VLSI fabrication by MOSIS. Send reprint requests to Dr. Stork: stork@crc.ricoh.com. \n\nReferences \n\nE. Aarts & J. Korst. (1989) Simulated Annealing and Boltzmann Machines: A stochastic approach to combinatorial optimization and neural computing. New York: Wiley. \n\nD. H. Ackley & G. E. Hinton. (1985) A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169. \n\nJ. Alspector, A. Jayakumar & S. Luna. (1992) Experimental evaluation of learning in a neural microsystem. Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson & R. P. Lippmann (eds.), San Mateo, CA: Morgan Kaufmann, 871-878. \n\nY. Arima, K. Mashiko, K. Okada, T. Yamada, A. Maeda, H. Kondoh & S. Kayano. (1990) A self-learning neural network chip with 125 neurons and 10K self-organization synapses. In Symposium on VLSI Circuits, Solid State Circuits Council Staff, Los Alamitos, CA: IEEE Press, 63-64. \n\nJ. B. Burr. (1991) Digital Neural Network Implementations. Neural Networks: Concepts, Applications, and Implementations, Volume 2, P. Antognetti & V. Milutinovic (eds.), 237-285, Englewood Cliffs, NJ: Prentice Hall. \n\nJ. B. Burr. (1992) Digital Neurochip Design. Digital Parallel Implementations of Neural Networks. K. Wojtek Przytula & Viktor K. 
Prasanna (eds.), Englewood Cliffs, NJ: Prentice Hall. \n\nC. C. Galland. (1993) The limitations of deterministic Boltzmann machine learning. Network 4, 355-379. \n\nG. E. Hinton. (1989) Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation 1, 143-150. \n\nC. Peterson & E. Hartman. (1989) Explorations of the mean field theory learning algorithm. Neural Networks 2, 475-494. \n\nM. Murray, J. B. Burr, D. G. Stork, M.-T. Leung, K. Boonyanit, G. J. Wolff & A. M. Peterson. (1992) Deterministic Boltzmann machine VLSI can be scaled using multi-chip modules. Proc. of the International Conference on Application Specific Array Processors, Berkeley, CA (August 4-7), Los Alamitos, CA: IEEE Press, 206-217. \n", "award": [], "sourceid": 864, "authors": [{"given_name": "Michael", "family_name": "Murray", "institution": null}, {"given_name": "Ming-Tak", "family_name": "Leung", "institution": null}, {"given_name": "Kan", "family_name": "Boonyanit", "institution": null}, {"given_name": "Kong", "family_name": "Kritayakirana", "institution": null}, {"given_name": "James", "family_name": "Burr", "institution": null}, {"given_name": "Gregory", "family_name": "Wolff", "institution": null}, {"given_name": "Takahiro", "family_name": "Watanabe", "institution": null}, {"given_name": "Edward", "family_name": "Schwartz", "institution": null}, {"given_name": "David", "family_name": "Stork", "institution": null}, {"given_name": "Allen", "family_name": "Peterson", "institution": null}]}