{"title": "A Method for the Efficient Design of Boltzmann Machines for Classification Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 825, "page_last": 831, "abstract": null, "full_text": "A Method for the Efficient Design of Boltzmann Machines for Classification Problems \n\nAjay Gupta and Wolfgang Maass* \n\nDepartment of Mathematics, Statistics, and Computer Science \nUniversity of Illinois at Chicago \nChicago IL, 60680 \n\nAbstract \n\nWe introduce a method for the efficient design of a Boltzmann machine (or a Hopfield net) that computes an arbitrary given Boolean function f. This method is based on an efficient simulation of acyclic circuits with threshold gates by Boltzmann machines. As a consequence we can show that various concrete Boolean functions f that are relevant for classification problems can be computed by scalable Boltzmann machines that are guaranteed to converge to their global maximum configuration with high probability after constantly many steps. \n\n1 INTRODUCTION \n\nA Boltzmann machine ([AHS], [HS], [AK]) is a neural network model in which the units update their states according to a stochastic decision rule. It consists of a set U of units, a set C of unordered pairs of elements of U, and an assignment of connection strengths S : C → R. A configuration of a Boltzmann machine is a map k : U → {0,1}. The consensus C(k) of a configuration k is given by C(k) = Σ_{{u,v}∈C} S({u,v})·k(u)·k(v). If the Boltzmann machine is currently in configuration k and unit u is considered for a state change, then the acceptance probability for this state change is given by 1/(1 + e^{−ΔC/c}). Here ΔC is the change in the value of the consensus function C that would result from this state change of u, and c > 0 is a fixed parameter (the \"temperature\"). \n\n*This paper was written during a visit of the second author at the Department of Computer Science of the University of Chicago. \n\nAssume that n units of a Boltzmann machine B have been declared as input units and m other units as output units. One says that B computes a function f : {0,1}^n → {0,1}^m if for any clamping of the input units of B according to some a ∈ {0,1}^n the only global maxima of the consensus function of the clamped Boltzmann machine are those configurations where the output units are in the states given by f(a). \n\nNote that even if one leaves the determination of the connection strengths for a Boltzmann machine up to a learning procedure ([AHS], [HS], [AK]), one has to know in advance the required number of hidden units, and how they should be connected (see section 10.4.3 of [AK] for a discussion of this open problem). \n\nAd hoc constructions of efficient Boltzmann machines tend to be rather difficult (and hard to verify) because of the cyclic nature of their \"computations\". \n\nWe introduce in this paper a new method for the construction of efficient Boltzmann machines for the computation of a given Boolean function f (the same method can also be used for the construction of Hopfield nets). We propose to construct first an acyclic Boolean circuit T with threshold gates that computes f (this turns out to be substantially easier). We show in section 2 that any Boolean threshold circuit T can be simulated by a Boltzmann machine B(T) of the same size as T. Furthermore we show in section 3 that a minor variation of B(T) is likely to converge very fast. In section 4 we discuss applications of our method for various concrete Boolean functions. 
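As a small illustration of these definitions, the consensus function and the stochastic update rule can be sketched in Python (our own sketch, not from the paper; the dictionary encoding of U, C and S, and the function names, are assumptions made for this example):

```python
import math
import random

def consensus(S, k):
    # C(k) = sum over pairs {u,v} in C of S({u,v}) * k(u) * k(v);
    # a singleton pair {u} contributes the bias term S({u}) * k(u).
    total = 0.0
    for pair, s in S.items():
        units = tuple(pair)
        if len(units) == 1:
            total += s * k[units[0]]
        else:
            total += s * k[units[0]] * k[units[1]]
    return total

def consider_state_change(S, k, u, c=1.0, rng=random):
    # Consider unit u for a state change; accept the flip with
    # probability 1 / (1 + exp(-dC/c)), where dC is the change
    # of the consensus that the flip would cause.
    flipped = dict(k)
    flipped[u] = 1 - k[u]
    dC = consensus(S, flipped) - consensus(S, k)
    if rng.random() < 1.0 / (1.0 + math.exp(-dC / c)):
        k[u] = flipped[u]
    return k
```

At low temperature c (equivalently, after scaling all connection strengths by a factor δ as in section 3 below), a flip that increases the consensus is accepted with probability close to 1.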
\n\n2 SIMULATION OF THRESHOLD CIRCUITS BY BOLTZMANN MACHINES \n\nA threshold circuit T (see [M], [PS], [R], [HMPST]) is a labeled acyclic directed graph. We refer to the number of edges that are directed into (out of) a node of T as the indegree (outdegree) of that node. Its nodes of indegree 0 are labeled by input variables x_i (i ∈ {1,...,n}). Each node g of indegree l > 0 in T is labeled by some arbitrary Boolean threshold function F_g : {0,1}^l → {0,1}, where F_g(y_1,...,y_l) = 1 if and only if Σ_{i=1}^{l} α_i·y_i ≥ t (for some arbitrary parameters α_1,...,α_l, t ∈ R; w.l.o.g. α_1,...,α_l, t ∈ Z [M]). One views such a node g as a threshold gate that computes F_g. If m nodes of a threshold circuit T are in addition labeled as output nodes, one defines in the usual manner the Boolean function f : {0,1}^n → {0,1}^m that is computed by T. \n\nWe simulate T by the following Boltzmann machine B(T) = <U, C, S> (note that T has directed edges, while B(T) has undirected edges). We reserve for each node g of T a separate unit b(g) of B(T). We set \n\nU := {b(g) | g is a node of T} and \nC := {{b(g'), b(g)} | g', g are nodes of T so that either g' = g or g', g are connected by an edge in T}. \n\nConsider an arbitrary unit b(g) of B(T). We define the connection strengths S({b(g)}) and S({b(g'), b(g)}) (for edges <g', g> of T) by induction on the length of the longest path in T from g to a node of T with outdegree 0. \n\nIf g is a gate of T with outdegree 0 then we define S({b(g)}) := −2t + 1, where t is the threshold of g, and we set S({b(g'), b(g)}) := 2α(<g', g>) (where α(<g', g>) is the weight of the directed edge <g', g> in T). \n\nAssume that g is a threshold gate of T with outdegree > 0. Let g_1,...,g_k be the immediate successors of g in T. 
Set w := Σ_{i=1}^{k} |S({b(g), b(g_i)})| (we assume that the connection strengths S({b(g), b(g_i)}) have already been defined). We define S({b(g)}) := −(2w + 2)·t + w + 1, where t is the threshold of gate g. Furthermore for every edge <g', g> in T we set S({b(g'), b(g)}) := (2w + 2)·α(<g', g>). \n\nRemark: It is obvious that for problems in TC0 (see section 4) the size of connection strengths in B(T) can be bounded by a polynomial in n. \n\nTheorem 2.1 For any threshold circuit T the Boltzmann machine B(T) computes the same Boolean function as T. \n\nProof of Theorem 2.1: Let a ∈ {0,1}^n be an arbitrary input for circuit T. We write g(a) ∈ {0,1} for the output of gate g of T for circuit input a. \n\nConsider the Boltzmann machine B(T)_a with the n units b(g) for input nodes g of T clamped according to a. We show that the configuration K_a of B(T)_a where b(g) is on if and only if g(a) = 1 is the only global maximum (in fact: the only local maximum) of the consensus function C for B(T)_a. \n\nAssume for a contradiction that configuration K of B(T)_a is a global maximum of the consensus function C and K ≠ K_a. Fix a node g of T of minimal depth in T so that K(b(g)) ≠ K_a(b(g)) = g(a). By definition of B(T)_a this node g is not an input node of T. Let K' result from K by changing the state of b(g). We will show that C(K') > C(K), which is a contradiction to the choice of K. \n\nWe have (by the definition of C) \n\nC(K') − C(K) = (1 − 2K(b(g)))·(S1 + S2 + S({b(g)})), where \nS1 := Σ{K(b(g'))·S({b(g'), b(g)}) | <g', g> is an edge in T} \nS2 := Σ{K(b(g'))·S({b(g), b(g')}) | <g, g'> is an edge in T}. \n\nLet w be the parameter that occurs in the definition of S({b(g)}) (set w := 0 if g has outdegree 0). Then |S2| ≤ w. Let p_1,...,p_m be the immediate predecessors of g in T, and let t be the threshold of gate g. Assume first that g(a) = 1. Then S1 = (2w + 2)·Σ_{i=1}^{m} α(<p_i, g>)·p_i(a) ≥ (2w + 2)·t. This implies that S1 + S2 > (2w + 2)·t − w − 1, and therefore S1 + S2 + S({b(g)}) > 0, hence C(K') − C(K) > 0. \n\nIf g(a) = 0 then we have Σ_{i=1}^{m} α(<p_i, g>)·p_i(a) ≤ t − 1, thus S1 = (2w + 2)·Σ_{i=1}^{m} α(<p_i, g>)·p_i(a) ≤ (2w + 2)·t − 2w − 2. This implies that S1 + S2 < (2w + 2)·t − w − 1, and therefore S1 + S2 + S({b(g)}) < 0. We have in this case K(b(g)) = 1, hence C(K') − C(K) = (−1)·(S1 + S2 + S({b(g)})) > 0. □ \n\n3 THE CONVERGENCE SPEED OF THE CONSTRUCTED BOLTZMANN MACHINES \n\nWe show that the constructed Boltzmann machines will converge relatively fast to a global maximum configuration. This positive result holds both if we view B(T) as a sequential Boltzmann machine (in which units are considered for a state change one at a time), and if we view B(T) as a parallel Boltzmann machine (where several units are simultaneously considered for a state change). In fact, it even holds for unlimited parallelism, where every unit is considered for a state change at every step. Although unlimited parallelism appears to be of particular interest in the context of brain models and for the design of massively parallel machines, there are hardly any positive results known for this case (see section 8.3 in [AK]). \n\nIf g is a gate in T with outdegree > 1 then the current state of unit b(g) of B(T) becomes relevant at several different time points (whenever one of the immediate successors of g is considered for a state change). This effect increases the probability that unit b(g) may cause an \"error.\" Therefore the error probability of an output unit of B(T) does not just depend on the number of nodes in T, but on the number N(T) of nodes in a tree T' that results if we replace in the usual fashion the directed graph of T by a tree T' of the same depth (one calls a directed graph a tree if all of its nodes have outdegree ≤ 1). 
\nTo be precise, we define by induction on the depth of g for each gate g of T a tree Tree(g) that replaces the subcircuit of T below g. If g_1,...,g_k are the immediate predecessors of g in T then Tree(g) is the tree which has g as root and Tree(g_1),...,Tree(g_k) as immediate subtrees (it is understood that if some g_i has another immediate successor g' ≠ g then different copies of Tree(g_i) are employed in the definition of Tree(g) and Tree(g')). \n\nWe write |Tree(g)| for the number of nodes in Tree(g), and N(T) for Σ{|Tree(g)| | g is an output node of T}. It is easy to see that if T is synchronous (i.e. depth(g'') = depth(g') + 1 for all edges <g', g''> in T) then |Tree(g)| ≤ s^{d−1} for any node g in T of depth d which has s nodes in the subcircuit of T below g. Therefore N(T) is polynomial in n if T is of constant depth and polynomial size (this can be achieved for all problems in TC0, see section 4). \n\nWe write B_δ(T) for the variation of the Boltzmann machine B(T) of section 2 where each connection strength in B(T) is multiplied by δ (δ > 0). Equivalently one could view B_δ(T) as a machine with the same connection strengths as B(T) but a lower \"temperature\" (replace c by c/δ). \n\nTheorem 3.1 Assume that T is a threshold circuit of depth d that computes a Boolean function f : {0,1}^n → {0,1}^m. Let B_δ(T)_a be the Boltzmann machine that results from clamping the input units of B_δ(T) according to a (a ∈ {0,1}^n). \n\nAssume that 0 ≤ q_0 < q_1 < ... < q_d are arbitrary numbers such that for every i ∈ {1,...,d} and every gate g of depth i in T the corresponding unit b(g) is considered for a state change at some step during interval (q_{i−1}, q_i]. There is no restriction on how many other units are considered for a state change at any step. Let t be an arbitrary time step with t > q_d. 
Then the output units of B_δ(T)_a are at the end of step t with probability ≥ 1 − N(T)·1/(1 + e^{δ/c}) in the states given by f(a). \n\nRemarks: \n\n1. For δ := n this probability converges to 1 for n → ∞ if T is of constant depth and polynomial size. \n\n2. The condition on the timing of state changes in Theorem 3.1 has been formulated in a very general fashion in order to make it applicable to all of the common types of Boltzmann machines. For a sequential Boltzmann machine (see [AK], section 8.2) one can choose q_i − q_{i−1} sufficiently large (for example polynomial in the size of T) so that with high probability every unit of B(T) is considered for a state change during the interval (q_{i−1}, q_i]. On the other hand, for a synchronous Boltzmann machine with limited parallelism ([AK], section 8.3) one may apply the result to the case where every unit b(g) with g of depth i in T is considered for a state change at step i (set q_i := i). Theorem 3.1 also remains valid for unlimited parallelism ([AK], section 8.3), where every unit is considered for a state change at every step (set q_i := i). In fact, not even synchronicity is required for Theorem 3.1, and it also applies to asynchronous parallel Boltzmann machines ([AK], section 8.3.2). \n\n3. For sequential Boltzmann machines in general the available upper bounds for their convergence speed are very unsatisfactory. In particular no upper bounds are known which are polynomial in the number of units (see section 3.5 of [AK]). For Boltzmann machines with unlimited parallelism one can in general not even prove that they converge to a global maximum of their consensus function (section 8.3 of [AK]). \n\nProof of Theorem 3.1: We prove by induction on i that for every gate g of depth i in T and every step t ≥ q_i the unit b(g) is at the end of step t with probability ≥ 1 − |Tree(g)|·1/(1 + e^{δ/c}) in state g(a). 
\n\nAssume that g_1,...,g_k are the immediate predecessors of gate g in T. By definition we have |Tree(g)| = 1 + Σ_{j=1}^{k} |Tree(g_j)|. Let t' ≤ t be the last step before t at which b(g) has been considered for a state change. Since t ≥ q_i we have t' > q_{i−1}. Thus for each j = 1,...,k we can apply the induction hypothesis to unit b(g_j) and step t' − 1 ≥ q_{depth(g_j)}. Hence with probability ≥ 1 − (|Tree(g)| − 1)·1/(1 + e^{δ/c}) the states of the units b(g_1),...,b(g_k) at the end of step t' − 1 are g_1(a),...,g_k(a). Assume now that the unit b(g_j) is at the end of step t' − 1 in state g_j(a), for j = 1,...,k. If b(g) is at the beginning of step t' not in state g(a), then a state change of unit b(g) would increase the consensus function by ΔC ≥ δ (independently of the current states of the units b(g') for immediate successors g' of g in T). Thus b(g) accepts in this case the change to state g(a) with probability 1/(1 + e^{−ΔC/c}) ≥ 1/(1 + e^{−δ/c}) = 1 − 1/(1 + e^{δ/c}). On the other hand, if b(g) is already at the beginning of step t' in state g(a), then a change of its state would decrease the consensus by at least δ. Thus b(g) remains with probability ≥ 1 − 1/(1 + e^{δ/c}) in state g(a). The preceding considerations imply that unit b(g) is at the end of step t' (and hence at the end of step t) with probability ≥ 1 − |Tree(g)|·1/(1 + e^{δ/c}) in state g(a). □ \n\n4 APPLICATIONS \n\nThe complexity class TC0 is defined as the class of all Boolean functions f : {0,1}* → {0,1}* for which there exists a family (T_n)_{n∈N} of threshold circuits of some constant depth so that for each n the circuit T_n computes f for inputs of length n, and so that the number of gates in T_n and the absolute values of the weights of threshold gates in T_n (all weights are assumed to be integers) are bounded by a polynomial in n ([HMPST], [PS]). 
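Before turning to concrete functions, the inductive weight assignment of section 2 can be sketched in Python (our illustration, not the authors' code; the function name build_boltzmann and the dictionary encoding of circuits are assumptions of this sketch):

```python
def build_boltzmann(thresholds, weights):
    # thresholds: dict node -> integer threshold t (None for input nodes).
    # weights: dict (g1, g2) -> integer weight alpha(<g1, g2>) of a
    # directed edge of the acyclic threshold circuit T.
    # Returns (bias, strength) with bias[g] = S({b(g)}) and
    # strength[(g1, g2)] = S({b(g1), b(g2)}).
    successors = {g: [] for g in thresholds}
    for (g1, g2) in weights:
        successors[g1].append(g2)
    bias, strength = {}, {}

    def process(g):
        # induction on the longest path from g to a node of outdegree 0:
        # handle all successors of g before g itself
        if g in bias:
            return
        for h in successors[g]:
            process(h)
        # w = 0 for a gate of outdegree 0, so the general formula
        # -(2w+2)*t + w + 1 also yields the base case -2t + 1
        w = sum(abs(strength[(g, h)]) for h in successors[g])
        bias[g] = -(2 * w + 2) * thresholds[g] + w + 1
        for (g1, g2), alpha in weights.items():
            if g2 == g:
                strength[(g1, g2)] = (2 * w + 2) * alpha

    for g in thresholds:
        if thresholds[g] is not None:  # input nodes carry no bias
            process(g)
    return bias, strength
```

Clamping the input units and running the stochastic update rule of section 1 on these strengths then drives the output units toward f(a), as Theorem 2.1 guarantees for the global maxima.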
\n\nCorollary 4.1 (to Theorems 2.1, 3.1): Every Boolean function f that belongs to the complexity class TC0 can be computed by scalable (i.e. polynomial size) Boltzmann machines whose connection strengths are integers of polynomial size and which converge for state changes with unlimited parallelism with high probability in constantly many steps to a global maximum of their consensus function. \n\nThe following Boolean functions are known to belong to the complexity class TC0: AND, OR, PARITY; SORTING, ADDITION, SUBTRACTION, MULTIPLICATION and DIVISION of binary numbers; DISCRETE FOURIER TRANSFORM, and approximations to arbitrary analytic functions with a convergent rational power series ([CVS], [R], [HMPST]). \n\nRemarks: \n\n1. One can also use the method from this paper for the efficient construction of a Boltzmann machine B_{p_1,...,p_k} that can decide very fast to which of k stored \"patterns\" p_1,...,p_k ∈ {0,1}^n the current input x ∈ {0,1}^n to the Boltzmann machine has the closest \"similarity.\" \n\nFor arbitrary fixed \"patterns\" p_1,...,p_k ∈ {0,1}^n let f_{p_1,...,p_k} : {0,1}^n → {0,1}^k be the pattern classification function whose ith output bit is 1 if and only if the Hamming distance between the input x ∈ {0,1}^n and p_i is less than or equal to the Hamming distance between x and p_j, for all j ≠ i. \n\nWe write HD(x, y) for the Hamming distance Σ_{i=1}^{n} |x_i − y_i| of strings x, y ∈ {0,1}^n. One has HD(x, y) = Σ_{i: y_i=0} x_i + Σ_{i: y_i=1} (1 − x_i), and therefore HD(x, p_j) − HD(x, p_l) = Σ_{i=1}^{n} β_i·x_i + c for suitable coefficients β_i ∈ {−2,−1,0,1,2} and c ∈ Z (that depend on the fixed patterns p_j, p_l ∈ {0,1}^n). Thus there is a threshold circuit that consists of a single threshold gate which outputs 1 if HD(x, p_j) ≤ HD(x, p_l), and 0 otherwise. 
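This reduction of a Hamming-distance comparison to a single linear threshold test can be checked mechanically. The following Python sketch (ours, with hypothetical helper names) computes the coefficients β_i and c for two fixed patterns and verifies the identity by brute force over all x ∈ {0,1}^3:

```python
from itertools import product

def hamming(x, y):
    # HD(x, y) = sum of |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y))

def linear_form(pj, pl):
    # The coefficient of x_i in HD(x, p) is +1 if p_i = 0 and -1 if
    # p_i = 1 (i.e. 1 - 2*p_i), and the constant term is the number
    # of ones in p.  Subtracting the two expansions gives beta and c.
    beta = [(1 - 2 * a) - (1 - 2 * b) for a, b in zip(pj, pl)]
    c = sum(pj) - sum(pl)
    return beta, c

pj, pl = (0, 1, 1), (1, 0, 1)   # two example patterns (an assumption)
beta, c = linear_form(pj, pl)
for x in product((0, 1), repeat=3):
    lhs = hamming(x, pj) - hamming(x, pl)
    rhs = sum(b * xi for b, xi in zip(beta, x)) + c
    assert lhs == rhs
    # the gate outputs 1 iff HD(x, pj) <= HD(x, pl), i.e. iff the
    # threshold test sum_i (-beta_i) * x_i >= c succeeds
    gate = sum(-b * xi for b, xi in zip(beta, x)) >= c
    assert gate == (hamming(x, pj) <= hamming(x, pl))
```

For patterns in {0,1}^n the coefficients even lie in {−2, 0, 2}, inside the range {−2,...,2} stated above.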
\n\nThe function f_{p_1,...,p_k} can be computed by a threshold circuit T of depth 2 whose jth output gate is the AND of k − 1 gates as above which check for l ∈ {1,...,k} − {j} whether HD(x, p_j) ≤ HD(x, p_l) (note that the underlying graph of T is the same for any choice of the patterns p_1,...,p_k). The desired Boltzmann machine B_{p_1,...,p_k} is the Boltzmann machine B(T) for this threshold circuit T. \n\n2. Our results are also of interest in the context of learning algorithms for Boltzmann machines. For example, the previous remark provides a single graph <U, C> of a Boltzmann machine with n input units, k output units, and k^2 − k hidden units, that is able to compute with a suitable assignment of connection strengths (that may arise from a learning algorithm for Boltzmann machines) any function f_{p_1,...,p_k} (for any choice of p_1,...,p_k ∈ {0,1}^n). \n\nSimilarly we get from Theorem 2.1 together with a result from [M] the graph <U, C> of a Boltzmann machine with n input units, n hidden units, and one output unit, that can compute with a suitable assignment of connection strengths any symmetric function f : {0,1}^n → {0,1} (f is called symmetric if f(x_1,...,x_n) depends only on Σ_{i=1}^{n} x_i; examples of symmetric functions are AND, OR, PARITY). \n\nAcknowledgment: We would like to thank Georg Schnitger for his suggestion to investigate the convergence speed of the constructed Boltzmann machines. \n\nReferences \n\n[AK] E. Aarts, J. Korst, Simulated Annealing and Boltzmann Machines, John Wiley & Sons (New York, 1989). \n\n[AHS] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science, 9, 1985, pp. 147-169. \n\n[HS] G.E. Hinton, T.J. Sejnowski, Learning and relearning in Boltzmann machines, in: D.E. Rumelhart, J.L. McClelland, & the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press (Cambridge, 1986), pp. 282-317. \n\n[CVS] A.K. Chandra, L.J. Stockmeyer, U. Vishkin, Constant depth reducibility, SIAM J. Comput., 13, 1984, pp. 423-439. \n\n[HMPST] A. Hajnal, W. Maass, P. Pudlak, M. Szegedy, G. Turan, Threshold circuits of bounded depth, to appear in J. of Comput. and Syst. Sci. (for an extended abstract see Proc. of the 28th IEEE Conf. on Foundations of Computer Science, 1987, pp. 99-110). \n\n[M] S. Muroga, Threshold Logic and its Applications, John Wiley & Sons (New York, 1971). \n\n[PS] I. Parberry, G. Schnitger, Relating Boltzmann machines to conventional models of computation, Neural Networks, 2, 1989, pp. 59-67. \n\n[R] J. Reif, On threshold circuits and polynomial computation, Proc. of the 2nd Annual Conference on Structure in Complexity Theory, IEEE Computer Society Press, Washington, 1987, pp. 118-123. \n", "award": [], "sourceid": 418, "authors": [{"given_name": "Ajay", "family_name": "Gupta", "institution": null}, {"given_name": "Wolfgang", "family_name": "Maass", "institution": null}]}