{"title": "Convergence and Pattern-Stabilization in the Boltzmann Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 511, "page_last": 518, "abstract": null, "full_text": "511 \n\nCONVERGENCE AND PATTERN STABILIZATION \n\nIN THE BOLTZMANN MACHINE \n\nMosheKam \n\nRoger Cheng \n\nDept. of Electrical and Computer Eng. \nDrexel University, Philadelphia PA 19104 \n\nDept. of Electrical Eng. \n\nPrinceton University, NJ 08544 \n\nABSTRACT \n\nThe Boltzmann Machine has been introduced as a means to perform \nglobal optimization for multimodal objective functions using the \nprinciples of simulated annealing. In this paper we consider its utility \nas a spurious-free content-addressable memory, and provide bounds on \nits performance in this context. We show how to exploit the machine's \nability to escape local minima, in order to use it, at a constant \ntemperature, for unambiguous associative pattern-retrieval in noisy \nenvironments. An association rule, which creates a sphere of influence \naround each stored pattern, is used along with the Machine's dynamics \nto match the machine's noisy input with one of the pre-stored patterns. \nSpurious fIxed points, whose regions of attraction are not recognized by \nthe rule, are skipped, due to the Machine's fInite probability to escape \nfrom any state. The results apply to the Boltzmann machine and to the \nasynchronous net of binary threshold elements (Hopfield model'). They \nprovide the network designer with worst-case and best-case bounds for \nthe network's performance, and allow polynomial-time tradeoff studies \nof design parameters. \n\nI. INTRODUCTION \n\nThe suggestion that artificial neural networks can be utilized for classification, pattern \nrecognition and associative recall has been the main theme of numerous studies which \nappeared in the last decade (e.g. Rumelhart and McClelland (1986) and Grossberg (1988) -\nand their references.) 
Among the most popular families of neural networks are fully connected networks of binary threshold elements (e.g. Amari (1972), Hopfield (1982).) These structures, and the related family of fully connected networks of sigmoidal threshold elements, have been used as error-correcting decoders in many applications, among which were interesting applications in optimization (Hopfield and Tank, 1985; Tank and Hopfield, 1986; Kennedy and Chua, 1987.) A common drawback of the many studied schemes is the abundance of 'spurious' local minima, which 'trap' the decoder in undesirable, and often non-interpretable, states during the process of input / stored-pattern association. It is generally accepted now that while the number of arbitrary binary patterns that can be stored in a fully-connected network is of the order of magnitude of N (N = number of neurons in the network,) the number of created local attractors in the network's state space is exponential in N. \n\nIt was proposed (Ackley et al., 1985; Hinton and Sejnowski, 1986) that fully-connected binary neural networks, which update their states on the basis of stochastic state-reassessment rules, could be used for global optimization when the objective function is multimodal. The suggested architecture, the Boltzmann machine, is based on the principles of simulated annealing (Kirkpatrick et al., 1983; Geman and Geman, 1984; Aarts et al., 1985; Szu, 1986), and has been shown to perform interesting tasks of decision making and optimization. However, the learning algorithm that was proposed for the Machine, along with its \"cooling\" procedures, does not lend itself to real-time operation. Most studies so far have concentrated on the properties of the Machine in global optimization, and only a few studies have mentioned possible utilization of the Machine (at constant 'temperature') as a content-addressable memory (e.g. for local optimization.) \n\nIn the present paper we propose to use the Boltzmann machine for associative retrieval, and develop bounds on its performance as a content-addressable memory. We introduce a learning algorithm for the Machine, which locally maximizes the stabilization probability of learned patterns. We then proceed to calculate (in polynomial time) upper and lower bounds on the probability that a tuple at a given initial Hamming distance from a stored pattern will get attracted to that pattern. A proposed association rule creates a sphere of influence around each stored pattern, and is indifferent to 'spurious' attractors. Because the Machine has a nonzero probability of escape from any state, the 'spurious' attractors are ignored. The obtained bounds allow the assessment of retrieval probabilities, different learning algorithms and necessary learning periods, network 'temperatures', and coding schemes for items to be stored. \n\nII. THE MACHINE AS A CONTENT ADDRESSABLE MEMORY \n\nThe Boltzmann Machine is a multi-connected network of N simple processors called probabilistic binary neurons. The ith neuron is characterized by N-1 real numbers representing the synaptic weights (w_ij, j = 1,2,...,i-1,i+1,...,N; w_ii is assumed to be zero for all i), a real threshold (τ_i), and a binary activity level (u_i ∈ B = {-1,1}), which we shall also refer to as the neuron's state. The binary N-tuple U = [u_1, u_2, ..., u_N] is called the network's state. We distinguish between two phases of the network's operation: \n\na) The learning phase - when the network parameters w_ij and τ_i are determined. This determination could be achieved through autonomous learning of the binary pattern environment by the network (unsupervised learning); through learning of the environment with the help of a 'teacher' which supplies evaluative reinforcement signals (supervised learning); or by an external fixed assignment of parameter values. 
\nb) The production phase - when the network's state U is determined. This determination could be performed synchronously, by all neurons at the same predetermined time instants, or asynchronously - each neuron reassesses its state independently of the other neurons, at random times. The decisions of the neurons regarding their states during reassessment can be arrived at deterministically (the set of neuron inputs determines the neuron's state) or stochastically (the set of neuron inputs shapes a probability distribution for the state-selection law.) \n\nWe shall describe first the (asynchronous and stochastic) production rule which our network employs. At random times during the production phase, asynchronously and independently of all other neurons, the ith neuron decides upon its next state, using the probabilistic decision rule \n\nu_i = 1 with probability 1/(1 + e^(-ΔE_i/T_e)), and u_i = -1 with probability e^(-ΔE_i/T_e)/(1 + e^(-ΔE_i/T_e)), where ΔE_i = Σ_{j=1, j≠i}^N w_ij u_j - τ_i    (1) \n\nΔE_i is called the ith energy gap, and T_e is a predetermined real number called temperature. The state-updating rule (1) is related to the network's energy level, which is described by \n\nE = -(1/2) Σ_{i=1}^N u_i ( Σ_{j=1, j≠i}^N w_ij u_j - τ_i )    (2) \n\nIf the network is to find a local minimum of E in equation (2), then the ith neuron, when chosen (at random) for state updating, should choose deterministically \n\nu_i = sgn[ Σ_{j=1, j≠i}^N w_ij u_j - τ_i ]    (3) \n\nWe note that the rule in equation (3) is obtained from the rule in equation (1) as T_e → 0. 
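The production rule of equation (1) is compact enough to simulate directly. What follows is a minimal sketch of one asynchronous stochastic state reassessment; it is our illustration, not code from the paper, and the function names and the numpy dependency are our own choices.

```python
import numpy as np

def energy_gap(W, tau, u, i):
    # ith energy gap of eq. (1): sum over j != i of w_ij u_j, minus tau_i.
    # Assumes the diagonal of W is zero (w_ii = 0), as the paper does.
    return float(W[i] @ u - tau[i])

def boltzmann_update(W, tau, u, i, Te, rng):
    # One asynchronous reassessment of neuron i per eq. (1):
    # u_i becomes +1 with probability 1 / (1 + exp(-dE_i / Te)),
    # and -1 otherwise.
    dE = energy_gap(W, tau, u, i)
    u[i] = 1 if rng.random() < 1.0 / (1.0 + np.exp(-dE / Te)) else -1
    return u
```

As T_e → 0 the sigmoid sharpens into a step, and the rule reduces to the deterministic sgn rule of equation (3).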
This deterministic choice of u_i guarantees, under symmetry conditions on the weights (w_ij = w_ji), that the network's state will stabilize at a fixed point in the 2^N-tuple state space of the network (Hopfield, 1982), where \n\nDefinition 1: A state U_f ∈ B^N is called a fixed point in the state space of the N-neuron network if \n\nPr[U^(γ+1) = U_f | U^(γ) = U_f] = 1. \n\nA fixed point found through iterations of equation (3) (with i chosen at random at each iteration) may not be the global minimum of the energy in equation (2). A mechanism which seeks the global minimum should avoid local-minimum \"traps\" by allowing 'uphill' climbing with respect to the value of E. The decision scheme of equation (1) is devised for that purpose, allowing an increase in E with nonzero probability. This provision for progress in the locally 'wrong' direction, in order to reap a 'global' advantage later, is in accordance with the principles of simulated annealing techniques, which are used in multimodal optimization. In our case, the probabilities of choosing the locally 'right' decision (equation (3)) and the locally 'wrong' decision are determined by the ratio of the energy gap ΔE_i to the 'temperature' shaping constant T_e: \n\nPr{u_i = 1} / Pr{u_i = -1} = e^(ΔE_i/T_e)    (4) \n\nThe Boltzmann Machine has been proposed for global minimization, and a considerable amount of effort has been invested in devising a good cooling scheme, namely a means to control T_e in order to guarantee the finding of a global minimum in a short time (Geman and Geman, 1984; Szu, 1987.) However, the network may also be used as a selective content-addressable memory which does not suffer from inadvertently-installed spurious local minima. 
\nWe consider the following application of the Boltzmann Machine as a scheme for pattern classification under noisy conditions: let an encoder emit a sequence of N x 1 binary code vectors from a set of Q codewords (or 'patterns'), each having a probability of occurrence of Π_m (m = 1,2,...,Q). The emitted pattern encounters noise and distortion before it arrives at the decoder, resulting in some of its bits being reversed. The Boltzmann Machine, which is the decoder at the receiving end, accepts this distorted pattern as its initial state (U^(0)), and observes the consequent time evolution of the network's state U. At a certain time instant n_0, the Machine will declare that the input pattern U^(0) is to be associated with pattern B_m if U at that instant (U^(n_0)) is 'close enough' to B_m. For this purpose we define \n\nDefinition 2: The d_max-sphere of influence of pattern B_m, σ(B_m, d_max), is \n\nσ(B_m, d_max) = {U ∈ B^N : HD(U, B_m) ≤ d_max}    (5) \n\nd_max is prespecified. Let Σ(d_max) = ∪_m σ(B_m, d_max), and let n_0 be the smallest integer such that U^(n_0) ∈ Σ(d_max). \n\nThe rule of association is: associate U^(0) with B_m at time n_0 if U^(n_0), which has evolved from U^(0), satisfies U^(n_0) ∈ σ(B_m, d_max). \n\nDue to the finite probability of escape from any minimum, the energy minima which correspond to spurious fixed points are skipped by the network on its way to the energy valleys induced by the designed fixed points (i.e. B_1,...,B_Q.) \n\nIII. ONE-STEP CONTRACTION PROBABILITIES \n\nUsing the association rule, the utility of the Boltzmann machine for error correction involves the probabilities \n\nPr{HD[U^(n), B_m] ≤ d_max | HD[U^(0), B_m] = d}, m = 1,2,...,Q    (6) \n\nfor predetermined n and d_max. 
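The rule of association of Definition 2 amounts to a Hamming-distance test against each stored pattern. A sketch, ours rather than the paper's (numpy and all names are illustrative):

```python
import numpy as np

def hamming(u, b):
    # HD(U, B): number of positions in which two {-1, +1} tuples disagree.
    return int(np.sum(np.asarray(u) != np.asarray(b)))

def associate(u, patterns, d_max):
    # Rule of association: return the index m such that u lies in the
    # d_max-sphere of influence sigma(B_m, d_max) of Definition 2, or
    # None if u is outside every sphere, in which case production
    # simply continues.
    for m, b in enumerate(patterns):
        if hamming(u, b) <= d_max:
            return m
    return None
```

Note that retrieval is unambiguous only when the spheres are disjoint, i.e. when the minimum Hamming distance between stored patterns exceeds 2 d_max; this is the same separation that enters μ_m in Section IV.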
In order to calculate (6) we shall first calculate the following one-step attraction probabilities: \n\nP(B_m, d, δ) = Pr{HD[U^(n+1), B_m] = d + δ | HD[U^(n), B_m] = d}, where δ = -1, 0, 1    (7) \n\nFor δ = -1 we obtain the probability of convergence; for δ = +1 we obtain the probability of divergence; for δ = 0 we obtain the probability of stalemate. \n\nAn exact calculation of the attraction probabilities in equation (7) is time-exponential, and we shall therefore seek lower and upper bounds which can be calculated in polynomial time. We shall reorder the weights of each neuron according to their contribution to the energy gap ΔE_mi for each pattern, using the notation \n\nw_i^m = {w_i1 b_m1, w_i2 b_m2, ..., w_iN b_mN}; w̃_i^1 = max w_i^m; w̃_i^s = max{w_i^m - {w̃_i^1, w̃_i^2, ..., w̃_i^(s-1)}}; i = 1,2,...,N, s = 2,3,...,N, m = 1,2,...,Q    (8) \n\nLet ΔE_mi^l(d) = ΔE_mi - 2 Σ_{γ=1}^d w̃_i^γ and ΔE_mi^u(d) = ΔE_mi - 2 Σ_{γ=1}^d w̃_i^(N+1-γ)    (9) \n\nThese values represent the minimum and maximum values that the ith energy gap could assume when the network is at HD of d from B_m. Using these extrema, we can find the worst-case attraction probabilities \n\np^wc(B_m, d, -1) = (d/N^2) Σ_{i=1}^N [ U_-1(b_mi) / (1 + e^(-ΔE_mi^l(d)/T_e)) + (1 - U_-1(b_mi)) e^(-ΔE_mi^u(d)/T_e) / (1 + e^(-ΔE_mi^u(d)/T_e)) ]    (10a) \n\np^wc(B_m, d, +1) = ((N-d)/N^2) Σ_{i=1}^N [ U_-1(b_mi) e^(-ΔE_mi^l(d)/T_e) / (1 + e^(-ΔE_mi^l(d)/T_e)) + (1 - U_-1(b_mi)) / (1 + e^(-ΔE_mi^u(d)/T_e)) ]    (10b) \n\nand the best-case attraction probabilities \n\np^bc(B_m, d, -1) = (d/N^2) Σ_{i=1}^N [ U_-1(b_mi) / (1 + e^(-ΔE_mi^u(d)/T_e)) + (1 - U_-1(b_mi)) e^(-ΔE_mi^l(d)/T_e) / (1 + e^(-ΔE_mi^l(d)/T_e)) ]    (11a) \n\np^bc(B_m, d, +1) = ((N-d)/N^2) Σ_{i=1}^N [ U_-1(b_mi) e^(-ΔE_mi^u(d)/T_e) / (1 + e^(-ΔE_mi^u(d)/T_e)) + (1 - U_-1(b_mi)) / (1 + e^(-ΔE_mi^l(d)/T_e)) ]    (11b) \n\nwhere U_-1(·) is the unit step function, and where for both cases \n\np(B_m, d, 0) = 1 - p(B_m, d, -1) - p(B_m, d, 1)    (12) \n\nFor the worst- (respectively, best-) case probabilities, we have used the extreme values of ΔE_mi(d) to underestimate (overestimate) convergence and overestimate (underestimate) divergence, given that there is a disagreement in d of the N positions between the network's state and B_m; we assume that errors are equally likely at each one of the bits. \n\nIV. ESTIMATION OF RETRIEVAL PROBABILITIES \n\nTo estimate the retrieval probabilities, we shall study the Hamming distance of the network's state from a stored pattern. The evolution of the network from state to state, as affecting the distance from a stored pattern, can be interpreted in terms of a birth-and-death Markov process (e.g. Howard, 1971) with the probability transition matrix \n\nΨ(P_b, P_d) = the (N+1) x (N+1) tridiagonal matrix, with rows and columns indexed k = 0, 1, ..., N, whose entries are Ψ_k,k+1 = p_bk, Ψ_k,k-1 = p_dk, and Ψ_k,k = 1 - p_bk - p_dk (with p_bN = p_d0 = 0)    (13) \n\nwhere the birth probability p_bi is the divergence probability of increasing the HD from i to i+1, and the death probability p_di is the contraction probability of decreasing the HD from i to i-1. \n\nGiven that an input pattern was at HD of d_0 from B_m, the probability that after n steps the network will associate it with B_m is \n\nPr{U^(n) → B_m | HD[U^(0), B_m] = d_0} = Σ_{r=0}^{d_max} Pr[HD(U^(n), B_m) = r | HD(U^(0), B_m) = d_0]    (14) \n\nwhere we can use the one-step bounds found in section III in order to calculate the worst-case and best-case probabilities of association. 
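The extrema of equation (9) and a worst-case convergence bound in the spirit of equation (10a) can be sketched in a few lines. This is our illustration only: all names are ours, numpy is an assumed dependency, and the d/N^2 normalization reflects the equal-error assumption stated in the text.

```python
import numpy as np

def gap_extrema(W, tau, b, i, d):
    # Eq. (9): extreme values of the ith energy gap at HD d from pattern b.
    # Assumes w_ii = 0; W[i] * b holds the pattern-aligned weights of eq. (8).
    w = np.sort(W[i] * b)[::-1]            # reordered weights, descending
    dE = float(W[i] @ b - tau[i])          # gap evaluated at U = b
    dE_low = dE - 2.0 * w[:d].sum()        # flip the d strongest supporters
    dE_high = dE - 2.0 * w[len(w) - d:].sum() if d > 0 else dE  # d weakest
    return dE_low, dE_high

def p_conv_wc(W, tau, b, d, Te):
    # Worst-case one-step convergence probability: underestimate the chance
    # that a randomly chosen erroneous bit flips back toward b, averaging
    # over positions (errors equally likely at each bit).
    N = len(b)
    total = 0.0
    for i in range(N):
        lo, hi = gap_extrema(W, tau, b, i, d)
        if b[i] == 1:
            total += 1.0 / (1.0 + np.exp(-lo / Te))   # prob of choosing +1
        else:
            total += np.exp(-hi / Te) / (1.0 + np.exp(-hi / Te))  # choosing -1
    return (d / N**2) * total
```

The best-case counterpart swaps the roles of the two extrema, and divergence probabilities are obtained analogously with the prefactor (N-d)/N^2.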
Using equations (10) and (11) we define two matrices for each pattern B_m: a worst-case matrix, Ψ_m^wc, and a best-case matrix, Ψ_m^bc: \n\nWorst-case matrix: p_bi = p^wc(B_m, i, +1), p_di = p^wc(B_m, i, -1). Best-case matrix: p_bi = p^bc(B_m, i, +1), p_di = p^bc(B_m, i, -1). \n\nUsing these matrices, it is now possible to calculate lower and upper bounds for the association probabilities needed in equation (14): \n\n[π_d0 (Ψ_m^wc)^n]_r ≤ Pr[HD(U^(n), B_m) = r | HD(U^(0), B_m) = d_0] ≤ [π_d0 (Ψ_m^bc)^n]_r    (15a) \n\nwhere [x]_i indicates the ith element of the vector x, and π_d0 is the unit 1 x (N+1) vector \n\n[π_d0]_i = 1 if i = d_0, and 0 otherwise    (15b) \n\nThe bounds of equation (15a) can be used to bound the association probability in equation (14). The upper bound of the association probability is obtained by replacing the summed terms in (14) by their upper-bound values: \n\nPr{U^(n) → B_m | HD[U^(0), B_m] = d_0} ≤ Σ_{r=0}^{d_max} [π_d0 (Ψ_m^bc)^n]_r    (16a) \n\nThe lower bound cannot be treated similarly, since it is possible that at some instant of time prior to the present time-step (n), the network has already associated its state U with one of the other patterns. We shall therefore use as the lower bound on the convergence probability in equation (14): \n\nΣ_{r=0}^{d_max} [π_d0 (Ψ̲_m^wc)^n]_r ≤ Pr{U^(n) → B_m | HD[U^(0), B_m] = d_0}    (16b) \n\nwhere the underlined matrix Ψ̲_m^wc is the birth-and-death matrix (13) with \n\np_bi = 1, p_di = 0 for i = μ_m, μ_m + 1, ..., N    (16c) \n\nμ_m = min_{j=1,...,Q, j≠m} HD(B_m, B_j) - d_max    (16d) \n\nEquations (16c) and (16d) assume that the network wanders into the d_max-sphere of influence of a pattern other than B_m whenever its distance from B_m is μ_m or more. This assumption is very conservative, since μ_m represents the shortest distance to a competing d_max-sphere of influence, and the network could actually wander to distances larger than μ_m and still converge eventually into the d_max-sphere of influence of B_m. 
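The bound computation of equations (13)-(16) reduces to powers of a small tridiagonal matrix. A sketch, ours and not the paper's (numpy assumed; p_b and p_d are length-(N+1) arrays of birth and death probabilities with p_b[N] = p_d[0] = 0):

```python
import numpy as np

def birth_death_matrix(pb, pd):
    # Eq. (13): tridiagonal transition matrix on HD states 0..N.
    K = len(pb)
    Psi = np.zeros((K, K))
    for k in range(K):
        if k + 1 < K:
            Psi[k, k + 1] = pb[k]            # birth: HD k -> k+1
        if k - 1 >= 0:
            Psi[k, k - 1] = pd[k]            # death: HD k -> k-1
        Psi[k, k] = 1.0 - pb[k] - pd[k]      # stalemate
    return Psi

def association_bound(pb, pd, d0, n, d_max):
    # Propagate the unit vector pi_{d0} for n production steps and sum the
    # probability mass at HD r <= d_max, i.e. the summed terms of eq. (14).
    Psi = birth_death_matrix(pb, pd)
    pi = np.zeros(len(pb))
    pi[d0] = 1.0
    dist = pi @ np.linalg.matrix_power(Psi, n)
    return float(dist[: d_max + 1].sum())
```

Feeding best-case one-step probabilities gives an upper bound as in (16a); feeding worst-case probabilities, modified per (16c)-(16d), gives a lower bound as in (16b). Each evaluation costs a matrix power of an (N+1) x (N+1) matrix, hence polynomial time.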
\n\nCONCLUSION \n\nWe have shown how the Boltzmann Machine can be used as a content-addressable memory, exploiting the stochastic nature of its state-selection procedure in order to escape undesirable minima. An association rule in terms of the patterns' spheres of influence is used, along with the Machine's dynamics, in order to match an input tuple with one of the predetermined stored patterns. The system is therefore indifferent to 'spurious' states, whose spheres of influence are not recognized in the retrieval process. We have detailed a technique to calculate upper and lower bounds on the retrieval probabilities of each stored pattern. These bounds are functions of the network's parameters (i.e. assignment or learning rules, and the pattern sets); the initial Hamming distance from the stored pattern; the association rule; and the number of production steps. They allow a polynomial-time assessment of the network's capabilities as an associative memory for a given set of patterns; a comparison of different coding schemes for patterns to be stored and retrieved; an assessment of the length of the learning period necessary in order to guarantee a prespecified probability of retrieval; and a comparison of different learning/assignment rules for the network parameters. Examples and additional results are provided in a companion paper (Kam and Cheng, 1989). \n\nAcknowledgements \n\nThis work was supported by NSF grant IRI 8810186. \n\nReferences \n\n[1] Aarts, E.H.L., Van Laarhoven, P.J.M.: \"Statistical Cooling: A General Approach to Combinatorial Optimization Problems,\" Philips J. Res., Vol. 40, 1985. \n[2] Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: \"A Learning Algorithm for Boltzmann Machines,\" Cognitive Science, Vol. 9, pp. 147-169, 1985. \n[3] Amari, S.-I.: \"Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements,\" IEEE Trans. Computers, Vol. C-21, No. 11, pp. 1197-1206, 1972. \n[4] Geman, S., Geman, D.: \"Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,\" IEEE Trans. Pattern Anal. Mach. Intell., pp. 721-741, 1984. \n[5] Grossberg, S.: \"Nonlinear Neural Networks: Principles, Mechanisms, and Architectures,\" Neural Networks, Vol. 1, 1988. \n[6] Hebb, D.O.: The Organization of Behavior, New York: Wiley, 1949. \n[7] Hinton, G.E., Sejnowski, T.J.: \"Learning and Relearning in the Boltzmann Machine,\" in [14]. \n[8] Hopfield, J.J.: \"Neural Networks and Physical Systems with Emergent Collective Computational Abilities,\" Proc. Nat. Acad. Sci. USA, pp. 2554-2558, 1982. \n[9] Hopfield, J.J., Tank, D.: \"'Neural' Computation of Decisions in Optimization Problems,\" Biological Cybernetics, Vol. 52, pp. 1-12, 1985. \n[10] Howard, R.A.: Dynamic Probabilistic Systems, New York: Wiley, 1971. \n[11] Kam, M., Cheng, R.: \"Decision Making with the Boltzmann Machine,\" Proceedings of the 1989 American Control Conference, Vol. 1, Pittsburgh, PA, 1989. \n[12] Kennedy, M.P., Chua, L.O.: \"Circuit Theoretic Solutions for Neural Networks,\" Proceedings of the 1st Int. Conf. on Neural Networks, San Diego, CA, 1987. \n[13] Kirkpatrick, S., Gelatt, C.D., Jr., Vecchi, M.P.: \"Optimization by Simulated Annealing,\" Science, Vol. 220, pp. 671-680, 1983. \n[14] Rumelhart, D.E., McClelland, J.L. (editors): Parallel Distributed Processing, Volume 1: Foundations, Cambridge: MIT Press, 1986. \n[15] Szu, H.: \"Fast Simulated Annealing,\" in Denker, J.S. (editor): Neural Networks for Computing, New York: American Inst. Phys., Vol. 151, pp. 420-425, 1986. \n[16] Tank, D.W., Hopfield, J.J.: \"Simple 'Neural' Optimization Networks,\" IEEE Transactions on Circuits and Systems, Vol. CAS-33, No. 5, pp. 533-541, 1986. \n", "award": [], "sourceid": 116, "authors": [{"given_name": "Moshe", "family_name": "Kam", "institution": null}, {"given_name": "Roger", "family_name": "Cheng", "institution": null}]}