{"title": "Relaxation Networks for Large Supervised Learning Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1015, "page_last": 1021, "abstract": null, "full_text": "Relaxation Networks for Large Supervised Learning Problems \n\nJoshua Alspector Robert B. Allen Anthony Jayakumar \n\nTorsten Zeppenfeld and Ronny Meir \n\nBellcore \n\nMorristown, NJ 07962-1910 \n\nAbstract \n\nFeedback connections are required so that the teacher signal on the output \nneurons can modify weights during supervised learning. Relaxation methods \nare needed for learning static patterns with full-time feedback connections. \nFeedback network learning techniques have not achieved wide popularity \nbecause of the still greater computational efficiency of back-propagation. We \nshow by simulation that relaxation networks of the kind we are implementing in \nVLSI are capable of learning large problems just like back-propagation \nnetworks. A microchip incorporates deterministic mean-field theory learning as \nwell as stochastic Boltzmann learning. A multiple-chip electronic system \nimplementing these networks will make high-speed parallel learning in them \nfeasible in the future. \n\n1. INTRODUCTION \n\nFor supervised learning in neural networks, feedback connections are required so that the \nteacher signal on the output neurons can affect the learning in the network interior. Even \nthough back-propagation[l] networks are feedforward in processing, they have implicit \nfeedback paths during learning for error propagation. Networks with explicit, full-time \nfeedback paths can perform pattern completion[2] and can have interesting temporal and \ndynamical properties in contrast to the single forward pass processing of multilayer \nperceptrons trained with back-propagation or other means. Because of the potential for \ncomplex dynamics, feedback networks require a reliable method of relaxation for \nlearning and retrieval of static patterns. 
The Boltzmann machine[3] uses stochastic \nsettling, while the mean-field theory (MFT) version[4][5] uses a more computationally \nefficient deterministic technique. \nNeither of these feedback network learning techniques has achieved wide popularity \nbecause of the greater computational efficiency of back-propagation. However, this is \nlikely to change in the near future because the feedback networks will be implemented in \nVLSI[6], making them available for learning experiments on high-speed parallel hardware. \n\nIn this paper, we therefore raise the following questions: whether these types of learning \nnetworks have the same representational and learning power as the more thoroughly \nstudied back-propagation methods, how learning in such networks scales with problem \nsize, and whether they can solve usefully large problems. Such questions are difficult to \nanswer with computer simulations because of the large amount of computer time \nrequired compared to back-propagation, but, as we show, the indications are promising. \n\n2. SIMULATIONS \n\n2.1 Procedure \n\nIn this section, we compare back-propagation, Boltzmann machine, and MFT networks \non a variety of test problems. The back-propagation technique performs gradient descent \nin weight space by differentiation of an objective function, usually the error, \n\nE = Σ_{outputs k} (s_k^+ - s_k^-)^2 \n\nwhere s_k^+ is the target output and s_k^- is the actual output. We choose to use the function \n\nG = Σ_{outputs k} [ s_k^+ log(s_k^+/s_k^-) + (1 - s_k^+) log((1 - s_k^+)/(1 - s_k^-)) ]    (1) \n\nfor a more direct comparison to the Boltzmann machine[7], which has \n\nG = Σ_{global states g} P_g^+ log(P_g^+/P_g^-)    (2) \n\nwhere P_g is the probability of a global state. 
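For concreteness, the two objective functions above can be written out as a short Python sketch (our own illustration; the function and variable names are not from the paper, and an epsilon guard on the logarithms is added for numerical safety):

```python
import math

def squared_error(target, actual):
    # E = sum over output neurons k of (s_k^+ - s_k^-)^2
    return sum((t - a) ** 2 for t, a in zip(target, actual))

def cross_entropy_G(target, actual, eps=1e-12):
    # Output-space G: sum_k [ s_k^+ log(s_k^+ / s_k^-)
    #                         + (1 - s_k^+) log((1 - s_k^+) / (1 - s_k^-)) ]
    # eps guards the logarithms when an activation saturates at 0 or 1.
    g = 0.0
    for t, a in zip(target, actual):
        t = min(max(t, eps), 1.0 - eps)
        a = min(max(a, eps), 1.0 - eps)
        g += t * math.log(t / a) + (1.0 - t) * math.log((1.0 - t) / (1.0 - a))
    return g
```

Both measures are zero when the actual outputs match the targets and grow as they diverge; G penalizes an output that saturates on the wrong side far more heavily than the squared error does.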
\n\nIndividual neurons in the Boltzmann machine have a probabilistic decision rule such that \nneuron k is in state s_k = 1 with probability \n\np_k = 1 / (1 + e^{-net_k/T})    (3) \n\nwhere net_i = Σ_j w_ij s_j is the net input to each neuron and T is a parameter that acts like \ntemperature in a physical system and is represented by the noise term in Eq. (4), which \nfollows. In the relaxation models, each neuron performs the activation computation \n\ns_i = f(gain × (net_i + noise_i))    (4) \n\nwhere f is a monotonic non-linear function such as tanh. In simulations of the \nBoltzmann machine, this is a step function corresponding to a high value of gain. The \nnoise is chosen from a zero-mean gaussian distribution whose width is proportional to the \ntemperature. This closely approximates the distribution in Eq. (3) and matches our \nhardware implementation, which supplies uncorrelated noise to each neuron. The noise \nis slowly reduced as annealing proceeds. For MFT learning, the noise is zero but the \ngain term has a finite value proportional to 1/T taken from the annealing schedule. Thus \nthe non-linearity sharpens as 'annealing' proceeds. \n\nThe network is annealed in two phases, + and -, corresponding to clamping the outputs \nin the desired state and allowing them to run free at each pattern presentation. The \nlearning rule which adjusts the weight w_ij from neuron j to neuron i is \n\nΔw_ij = sgn[ (s_i s_j)^+ - (s_i s_j)^- ]    (5) \n\nNote that this measures the instantaneous correlations after annealing. For both phases, \neach synapse memorizes the correlations measured at the end of the annealing cycle, and \nweight adjustment is then made (i.e., online). The sgn matches our hardware \nimplementation, which changes weights by one each time. \n\n2.2 Scaling \n\nTo study learning time as a function of problem size, we chose as benchmarks the parity \nand replication (identity) problems. 
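The two-phase procedure of Eqs. (3)-(5) above can be summarized in a small Python sketch (a simplified illustration under our own naming; the toy annealing schedule and fully connected layout here stand in for the hardware's):

```python
import math
import random

def relax(weights, states, clamped, T_hi=10.0, steps=20, updates=3):
    # Anneal: divide the temperature by 1.4 at each of 20 steps; the noise is
    # drawn from a zero-mean gaussian whose width tracks T, as in Eq. (4).
    n = len(states)
    T = T_hi
    for _ in range(steps):
        for _ in range(updates):
            for i in range(n):
                if i in clamped:
                    continue  # clamped neurons hold their teacher value
                net = sum(weights[i][j] * states[j] for j in range(n) if j != i)
                noise = random.gauss(0.0, T)
                states[i] = math.tanh(net + noise)  # f = tanh
        T /= 1.4
    return states

def boltzmann_update(weights, corr_plus, corr_minus):
    # Eq. (5): delta w_ij = sgn[(s_i s_j)^+ - (s_i s_j)^-];
    # each weight moves by exactly one unit, as in the hardware.
    n = len(weights)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = corr_plus[i][j] - corr_minus[i][j]
            if d != 0:
                weights[i][j] += 1 if d > 0 else -1
    return weights
```

In use, the network would be relaxed once with the outputs clamped (the + phase) and once free-running (the - phase), the end-of-anneal correlations recorded for each, and `boltzmann_update` applied per pattern.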
The parity problem is the generalization of \nexclusive-OR for arbitrary input size, n. It is difficult because the classification regions \nare disjoint with every change of input bit, but it has only one output. The goal of the \nreplication problem is for the output to duplicate the bit pattern found on the input after \nbeing transformed by the hidden layer. There are as many output neurons as inputs. For \nthe replication problem, we chose the hidden layer to have the same number of neurons \nas the input layer, while for parity we chose the hidden layer to have twice the number as \nthe input layer. \n\nFor back-propagation simulations, we used a learning rate of 0.3 and zero momentum. \nFor MFT simulations, we started at a high temperature of T_hi = K (1.4)^10 (fanin), where \nK = 1-10. We annealed in 20 steps, dividing the temperature by 1.4 each time. The \nfanin parameter is the number of inputs from other neurons to a neuron in the hidden \nlayer. We did 3 neuron update cycles at each temperature. For Boltzmann, we increased \nthis to 11 updates because of the longer equilibration time. We used high gain rather \nthan strictly binary units because of the possibility that the binary Boltzmann units would \nhave exactly zero net input, making annealing fruitless. \n\n[Figure 1: two log-scale plots of training cycles versus input bits, comparing the BP, MFT, and BZ networks on parity (1a) and replication (1b).] \n\nFigure 1. Scaling of Parity (1a) and Replication (1b) Problems with Input Size \n\nFig. 1a plots the results of an average of 10 runs and shows that the number of patterns \nrequired to learn to 90% correct for parity scales as an exponential in n for all three \nnetworks. 
This is not surprising since the training set size is exponential and no \nconstraints were imposed to help the network generalize from a small amount of data. \nAn activation range of -1 to 1 was used on both this problem and the replication problem. \nThere is no appreciable difference in learning as a function of patterns presented. Actual \ncomputer time is larger by an additional factor of n^2 to account for the increase in the \nnumber of connections. Direct parallel implementation will reduce this additional factor \nto less than n. Computer time for MFT learning was an additional factor of 10 slower \nthan back-propagation, and stochastic Boltzmann learning was yet another factor of 10 \nslower. The hardware implementation will make these techniques roughly equal in speed \nand far faster than any simulation of back-propagation. Fig. 1b shows analogous results \nfor the replication problem. \n\n2.3 NETtalk \n\nAs an example of a large problem, we chose the NETtalk[8] corpus with 20,000 words. \nFig. 2 shows the learning curves for back-propagation, Boltzmann, and MFT learning. \nAn activation range of 0 to 1 gave the best results on this problem, possibly due to the \nsparse coding of text and phonemes. We can see that back-propagation does better on \nthis problem, which we believe may be due to the ambiguity in mapping letters to \nmultiple phonemic outputs. \n\n[Figure 2: learning curves of fraction correct versus training cycles (up to 10^5) for BP, MFT (inc), and BZ.] \n\nFigure 2. 
Learning Curves for NETtalk \n\n2.4 Dynamic Range Manipulation \n\nFor all problems, we checked to see if reducing the dynamic range of the weights to 5 \nbits, equivalent to our VLSI implementation, would hinder learning. In most cases, there \nwas no effect. Dynamic range was a limitation for the two largest replication problems \nwith MFT. By adding an occasional global decay which decremented the absolute value \nof the weights, we were able to achieve good learning. Our implementation is capable of \ndoing this. There was also a degradation of performance on the back-propagation \nversion of the parity problem, which took about a factor of three longer to learn with a \n5-bit weight range. \n\n3. VLSI IMPLEMENTATION \n\nThe previous section shows that relaxation networks are as capable as back-propagation \nnetworks of learning large problems even though they are slower in computer \nsimulations. We are, however, implementing these feedback networks in VLSI, which \nwill speed up learning by many orders of magnitude. Our choice of learning technique \nfor implementation is due mainly to the local learning rule, which makes it much easier to \ncast these networks into electronics than back-propagation. \n\nFigure 3. Photo of 32-Neuron Bellcore Learning Chip \n\nFig. 3 shows a microchip which has been fabricated. It contains 32 neurons and 992 \nconnections (496 bidirectional synapses). On the extreme right is a noise generator \nwhich supplies 32 uncorrelated pseudo-random noise sources[9] to the neurons to their \nleft. These noise sources are summed along with the weighted post-synaptic signals from \nother neurons at the input to each neuron in order to implement the simulated annealing \nprocess of the stochastic Boltzmann machine. The neuron amplifiers implement a \nnon-linear activation function which has variable gain to provide for the gain-sharpening \nfunction of the MFT technique. 
The range of neuron gain can also be adjusted to allow \nfor scaling in summing currents due to adjustable network size. \n\nMost of the area is occupied by the synapse array. Each synapse digitally stores a weight \nranging from -15 to +15 as 4 bits plus a sign. It multiplies the voltage input from the \npresynaptic neuron by this weight to output a current. One conductance direction can be \ndisconnected so that we can experiment with asymmetric networks in accordance with \nour recent findings[10]. Although the synapses can have their weights set externally, they \nare designed to be adaptive. They store correlations using the local learning rule of \nEq. (5) and adjust their weights accordingly. \n\nAlthough the chip is still being tested, some measurements can be reported. Fig. 4a \nshows a family of transfer functions of a neuron, showing how the gain is continuously \nadjustable by varying a control voltage. Fig. 4b shows the transfer function of a synapse \nas different weights are loaded. The input linear range is about 2 volts. \n\n[Figure 4: measured neuron transfer functions (output voltage versus input current in pA) at several gain settings, and measured synapse transfer functions (output current versus input voltage) for several loaded weights.] \n\nFigure 4. Transfer Functions of Electronic Neuron and Synapse \n\nFig. 5 shows two different neuron outputs with a decreasing noise signal added in. The \nupper trace shows a neuron driven by a function generator while the center trace shows \nan undriven neuron. The lower trace is the noise control voltage common to all neurons. \n\nThe chip is designed to be cascaded with other similar chips in a board-level system \nwhich can be accessed by a computer. 
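The 5-bit signed weight storage and one-step sgn adjustment described above, together with the global decay used in Sec. 2.4, can be modeled behaviorally in a few lines of Python (our own sketch, not the chip's circuitry):

```python
def adjust_weight(w, corr_plus, corr_minus, w_min=-15, w_max=15):
    # One-step sgn update of Eq. (5), clipped to the range representable
    # by 4 magnitude bits plus a sign, as stored in each synapse.
    d = corr_plus - corr_minus
    if d > 0:
        w += 1
    elif d < 0:
        w -= 1
    return max(w_min, min(w_max, w))

def global_decay(weights):
    # Occasional global decay: decrement the absolute value of every
    # weight by one, keeping learning within the limited dynamic range.
    return [w - 1 if w > 0 else (w + 1 if w < 0 else 0) for w in weights]
```

Note how a weight already at the rail simply stays there under further same-sign updates; the decay is what frees up dynamic range again.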
The nodes which sum current from synapses for \nnet input into a neuron are available externally for connection to other chips and for \nexternal clamping of neurons or other external input. We expect to be able to present \nroughly 100,000 patterns per second to the chip for learning, as was determined from a \nprevious prototype system[6] that was not cascadable. This speed will not be strongly \naffected by the increased network size of a multiple-chip system because of the inherent \nparallelism whereby each neuron and synapse updates its own state. \n\n4. CONCLUSION \n\nWe have shown by simulation that relaxation networks of the kind we are implementing \nare as capable of learning large problems as back-propagation networks. A multiple-chip \nelectronic system implementing these networks will make high-speed parallel learning in \nthem feasible in the future. \n\nFigure 5. Neuron Signals in the Presence of Noise Generator Input \n\nREFERENCES \n\n1. D.E. Rumelhart, G.E. Hinton, & R.J. Williams, \"Learning Internal Representations by Error \nPropagation\", in Parallel Distributed Processing: Explorations in the Microstructure of \nCognition, Vol. 1: Foundations, D.E. Rumelhart & J.L. McClelland (eds.), MIT Press, \nCambridge, MA (1986), p. 318. \n\n2. J.J. Hopfield, \"Neural Networks and Physical Systems with Emergent Collective \nComputational Abilities\", Proc. Natl. Acad. Sci. USA, 79, 2554-2558 (1982). \n\n3. D.H. Ackley, G.E. Hinton, & T.J. Sejnowski, \"A Learning Algorithm for Boltzmann \nMachines\", Cognitive Science, 9, 147-169 (1985). \n\n4. C. Peterson & J.R. Anderson, \"A Mean Field Theory Learning Algorithm for Neural Networks\", \nComplex Systems, 1, 995-1019 (1987). \n\n5. G. Hinton, \"Deterministic Boltzmann Learning Performs Steepest Descent in Weight-Space\", \nNeural Computation, 1, 143-150 (1989). \n\n6. J. Alspector, B. Gupta, & R.B. 
Allen, \"Perfonnance of a Stochastic Learning Microchip\" in \n\nAdvances in Neural Information Processing Systems edited by D. Tourctzky (Morgan(cid:173)\nKaufmann, Palo Alto), pp. 748-760. (1989). \n\n7.1.1. Hopfield, \"Learning Algorithms and Probability Distributions in Feed-Forward and Feed(cid:173)\n\nBack networks\", Proc. Natl. Acad. Sci. USA, 84, 8429-8433 (1987). \n\n8. T.l. Sejnowski & C.R. Rosenberg, \"Parallel Networks that Learn to Pronounce English Text\", \n\nComplex Systems, 1, 145-168 (1987). \n\n9.1. Alspector, 1.W. Gannett, S. Haber, M.B. Parker, & R. Chu, \"A VLSI-Efficient Technique for \nGenerating Multiple Uncorrclated Noise Sources and Its Application to Stochastic Neural \nNetworks\", IEEE Trans. Circuits & Systems, 38, 109, (Jan., 1991). \n\n10. R.B. Allen & 1. Alspector, \"Learning of Stable States in Stochastic Asymmetric Networks\", \n\nIEEE Trans. Neural Networks. 1,233-238, (1990). \n\n\f", "award": [], "sourceid": 367, "authors": [{"given_name": "Joshua", "family_name": "Alspector", "institution": null}, {"given_name": "Robert", "family_name": "Allen", "institution": null}, {"given_name": "Anthony", "family_name": "Jayakumar", "institution": null}, {"given_name": "Torsten", "family_name": "Zeppenfeld", "institution": null}, {"given_name": "Ronny", "family_name": "Meir", "institution": null}]}