{"title": "Analyzing the Energy Landscapes of Distributed Winner-Take-All Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 626, "page_last": 633, "abstract": null, "full_text": "ANALYZING THE ENERGY LANDSCAPES \n\nOF DISTRIBUTED \n\nWINNER-TAKE-ALL NETWORKS \n\nDavid S. Touretzky \n\nSchool of Computer Science \nCarnegie Mellon University \n\nPittsburgh, PA 15213 \n\nABSTRACT \n\nDCPS (the Distributed Connectionist Production System) is a neural \nnetwork with complex dynamical properties. Visualizing the energy \nlandscapes of some of its component modules leads to a better intuitive \nunderstanding of the model, and suggests ways in which its dynamics \ncan be controlled in order to improve performance on difficult cases. \n\nINTRODUCTION \n\nCompetition through mutual inhibition appears in a wide variety of network designs. \nThis paper discusses a system with unusually complex competitive dynamics. The \nsystem is DCPS, the Distributed Connectionist Production System of Touretzky \nand Hinton (1988). DCPS is a Boltzmann machine composed of five modules, \ntwo of which, labeled \"Rule Space\" and \"Bind Space,\" are winner-take-all (WTA) \nnetworks. These modules interact via their effects on two attentional modules called \nclause spaces. Clause spaces are another type of competitive architecture based on \nmutual inhibition, but they do not produce WTA behavior. Both clause spaces \nprovide evidential input to both WTA nets, but since connections are symmetric \nthey also receive top-down \"guidance\" from the WTA nets. Thus, unlike most \nother competitive architectures, in DCPS the external input to a WTA net does \nnot remain constant as its state evolves. Rather, the present output of the WTA \nnet helps to determine which evidence will become visible in the clause spaces in the \nfuture. 
This dynamic attentional mechanism allows rule and bind spaces to work \ntogether even though they are not directly connected. \n\nDCPS actually uses a distributed version of winner-take-all (DWTA) networks whose operating \ncharacteristics differ slightly from the non-distributed version. Analyzing the \nenergy landscapes of DWTA networks has led to a better intuitive understanding \nof their dynamics. For a complete discussion of the role of DWTA nets in DCPS, \nand the ways in which insights gained from visualization led to improvements in \nthe system's stochastic search behavior, see [Touretzky, 1989]. \n\nDISTRIBUTED WINNER-TAKE-ALL NETWORKS \n\nIn classical WTA nets [Feldman & Ballard, 1982], a unit's output value is a continuous \nquantity that reflects its activation level. In this paper we analyze a stochastic, \ndistributed version of winner-take-all dynamics using Boltzmann machines, whose \nunits have only binary outputs [Hinton & Sejnowski, 1986]. The amount of evidential \ninput to a unit determines its energy gap [Hopfield, 1982], which in turn \ndetermines its probability of being active. The network's degree of confidence in \na hypothesis is thus reflected in the amount of time the unit spends in the active \nstate. A good instantaneous approximation to strength of support can be obtained \nby representing each hypothesis with a clique of k independent units looking at a \ncommon evidence pool. The number of active units in a clique reflects the strength \nof that hypothesis. DCPS uses cliques of size 40. Units in rival cliques compete via \ninhibitory connections. \n\nIf all units in a clique have identical receptive fields, the result is an \"ensemble\" \nBoltzmann machine [Derthick & Tebelskis, 1988]. 
In DCPS the units have only \nmoderately sized, but highly overlapped, receptive fields, so the amount of evidence \nindividual units perceive is distributed binomially. Small excitatory weights between \nsibling units help make up for variations in external evidence. They also turn states \nin which all the units in a single clique are active into powerful attractors. \n\nEnergy tours in a DWTA net take one of four basic shapes. Examples may be seen in \nFigure 1a. Let e be the amount of external evidence available to each unit, θ the \nunit's threshold, k the clique size, and w_s the excitatory weight between siblings. \nThe four shapes are: \n\nEager vee: the evidence is above threshold (e > θ). The system is eager to \nturn units on; energy decreases as the number of active units goes up. We \nhave a broad, deep energy well, which the system will naturally fall into given \nthe chance. \n\nReluctant vee: the evidence is below threshold, but a little bit of sibling \ninfluence (fewer than k/2 siblings) is enough to make up the difference and \nput the system over the energy barrier. We have e < θ < e + w_s(k - 1)/2. The \nsystem is initially reluctant to turn units on because that causes the energy to \ngo up, but once over the hump it willingly turns on more units. With all units \nin the clique active, the system is in an energy well whose energy is below \nzero. \n\nDimpled peak: with higher thresholds the total energy of the network may \nremain above zero even when all units are on. This happens when more than \nhalf of the siblings must be active to boost each unit above threshold, i.e., \ne + w_s(k - 1) > θ > e + w_s(k - 1)/2. The system can still be trapped in \nthe small energy well that remains, but only at low temperatures. The well \nis hard to reach since the system must first cross a large energy barrier by \ntraveling far uphill in energy space. Even if it does visit the well, the system \nmay easily bounce out of it again if the well is shallow. 
\n\nSmooth peak: when θ > e + w_s(k - 1), units will be below threshold even \nwith full sibling support. In this case there is no energy well, only a peak. \nThe system wants to turn all units off. \n\nVISUALIZING ENERGY LANDSCAPES \n\nLet's examine the energy landscape of one WTA space when there is ample evidence \nin the clause spaces for the winning hypothesis. We select three hypotheses, A, B, \nand C, with disjoint evidence populations. Let hypothesis B be the best supported \none with evidence 100, and let A have evidence 40 and C have evidence 5. We will \nsimplify the situation slightly by assuming that all units in a clique perceive exactly \nthe same evidence. In the left half of Figure 1b we show the energy curves for A, \nB, and C, using a value of 69 for the unit thresholds.¹ Each curve is generated by \nstarting with all units turned off; units for a particular hypothesis are turned on one \nat a time until all 40 are on; then they are turned off again one at a time, making \nthe curve symmetric. Since the evidence for hypothesis A is a bit below threshold, \nits curve is of the \"reluctant vee\" type. The evidence for hypothesis B is well above \nthreshold, so its curve is an \"eager vee.\" Hypothesis C has almost no evidence; its \n\"dimpled peak\" shape is due almost entirely to sibling support. (Sibling weights \nhave a value of +2; rival weights a value of -2.) \n\nNote that the energy well for B is considerably deeper than for A. This means at \nmoderate temperature the model can pop out of A's energy well, but it is more \nlikely to remain in B's well. The well for B is also somewhat broader than the well \nfor A, making it easier for the B attractor to capture the model; its attractor region \nspans a larger portion of state space. \n\nThe energy tours for hypotheses A, B, and C correspond to traversing three orthogonal \nedges extending from a corner of a 40 x 40 x 40 cube. 
A point at location \n(x, y, z) in this cube corresponds to x A units, y B units, and z C units being \nactive. During the stochastic search, A and B units will be flickering on and off \nsimultaneously, so the model will also visit internal points of the cube not covered \nin the energy tour diagram. To see these points we will use two additional graphic \nrepresentations of energy landscapes. First, note that hypothesis C gets so little \nsupport that we can safely ignore it and concentrate on A and B. This allows us \nto focus on just the front face of the state space cube. In Figure 2a, the number \nof active A units runs from zero to forty along the vertical axis, and the number of \nactive B units runs from zero to forty along the horizontal axis. The arrows at each \npoint on the graph show legal state transitions at zero temperature. For example, \nat the point where there are 38 active B units and 3 active A units there are \ntwo arrows, pointing down and to the right. This means there are two states the \nmodel could enter next: it could either turn off one of the active A units, or turn \non one more B unit, respectively. At nonzero temperatures other state transitions \nare possible, corresponding to uphill moves in energy space, but these two remain \nthe most probable. \n\n¹ All the weights and thresholds used in this paper are actual DCPS values taken from \n[Touretzky & Hinton, 1988]. \n\nThe points in the upper left and lower right corners of Figure 2a are marked by \n\"Y\" shapes. These represent point attractors at the bottoms of energy wells; the \nmodel will not move out of these states unless the temperature is greater than zero. \nOther points in state space are said to be within the region of a particular attractor \nif all legal transition sequences (at T = 0) from those points lead eventually to the \nattractor. 
The attractor regions of A and B are outlined in the figure. Note that \nthe B attractor covers more area than A, as predicted by its greater breadth in \nthe energy tour diagram. Note also that there is a small ridge between the two \nattractor regions. From starting points on the ridge the model can end up in either \nfinal state. \n\nFigure 2b shows the depths of the two attractors. The energy well for B is substantially \ndeeper than the well for A. Starting at the point in the lower left corner where \nthere are zero A units and zero B units active, the energy falls off immediately when \nmoving in the B direction (right), but rises initially in the A direction (left) before \ndropping into a modest energy well when most of the A units are on. Points in \nthe interior of the diagram, representing a combination of A and B units active, \nhave higher energies than points along the edges due to the inhibitory connections \nbetween units in rival cliques. \n\nWe can see from Figures 1b and 2 that the attractor for A, although narrower and \nshallower than the one for B, is still sizable. This is likely to mislead the model, so \nthat some of the time it will get trapped in the wrong energy well. The fact that \nthere is an attractor for A at all is due largely to sibling support, since the raw \nevidence for A is less than the rule unit threshold. \n\nWe can eliminate the unwanted energy well for A by choosing thresholds that exceed \nthe maximum sibling support of 2 x 39 = 78. DCPS uses a value of 119. In this \nexample, 119 also exceeds A's evidence plus full sibling support (40 + 78 = 118), so \nA's curve becomes a smooth peak. However, \nearly in the stochastic search the evidence visible in the clause spaces will be lower \nthan at the conclusion of the search; high thresholds combined with low evidence \nwould make the B attractor small and very hard to find. (See the right half of \nFigure 1c, and Figure 3.) Under these conditions the largest attractor is the one \nwith all units turned off: the null hypothesis. 
\n\nDISCUSSION \n\nOur analysis of energy landscapes pulls us in two directions: we need low thresholds \nso the correct attractor is broad and easy to find, but we need high thresholds to \neliminate unwanted attractors associated with local energy minima. Two solutions \nhave been investigated. The first is to start out with low thresholds and raise them \ngradually during the stochastic search. This \"pulls the rug out from under\" poorly \nsupported hypotheses while giving the model time to find the desired winner. The \nsecond solution involves clipping a corner from the state space hypercube so that \nthe model may never have fewer than 40 units active at a time. This prevents the \nmodel from falling into the null attractor. When it attempts to drop the number of \nactive units below 40 it is kicked away from the clipped edge by forcing it to turn \non a few inactive units at random. \n\nAlthough DCPS is a Boltzmann machine it does not search the state space by \nsimulated annealing in the usual sense. True annealing implies a slow reduction \nin temperature over many update cycles. Stochastic search in DCPS takes place \nat a single temperature that has been empirically determined to be the model's \napproximate \"melting point.\" The search is only allowed to take a few cycles; \ntypically it takes fewer than 10. Therefore the shapes of energy wells and the dynamics \nof the search are particularly important, as they determine how likely the model is \nto wander into particular attractor regions. \n\nThe work reported here suggests that stochastic search dynamics may be improved \nby manipulating parameters other than just absolute temperature and cooling rate. \nThreshold growing and corner clipping appear useful in the case of DWTA nets. \nAdditional details are available in [Touretzky, 1989]. 
\n\nAcknowledgments \n\nThis research was supported by the Office of Naval Research under contract \nN00014-86-K-0678, and by National Science Foundation grant EET-8716324. I thank Dean \nPomerleau, Roni Rosenfeld, Paul Gleichauf, and Lokendra Shastri for helpful comments, \nand Geoff Hinton for his collaboration in the development of DCPS. \n\nReferences \n\n[1] Derthick, M. A., & Tebelskis, J. M. (1988) \"Ensemble\" Boltzmann machines \nhave collective computational properties like those of Hopfield and Tank neurons. \nIn D. Z. Anderson (ed.), Neural Information Processing Systems. New \nYork: American Institute of Physics. \n\n[2] Feldman, J. A., & Ballard, D. H. (1982) Connectionist models and their properties. \nCognitive Science 6:205-254. \n\n[3] Hinton, G. E., & Sejnowski, T. J. (1986) Learning and relearning in Boltzmann \nmachines. In D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed \nProcessing: Explorations in the Microstructure of Cognition, volume 1. \nCambridge, MA: Bradford Books/The MIT Press. \n\n[4] Hopfield, J. J. (1982) Neural networks and physical systems with emergent \ncollective computational abilities. Proceedings of the National Academy of Sciences \nUSA, 79:2554-2558. \n\n[5] Touretzky, D. S., & Hinton, G. E. (1988) A distributed connectionist production \nsystem. Cognitive Science 12(3):423-466. \n\n[6] Touretzky, D. S. (1989) Controlling search dynamics by manipulating energy \nlandscapes. Technical report CMU-CS-89-113, School of Computer Science, \nCarnegie Mellon University, Pittsburgh, PA. \n\nFigure 1: (a) four basic shapes for DWTA energy tours; (b) comparison of low \nvs. high thresholds in energy tours where there is a high degree of evidence for \nhypothesis B; (c) corresponding tours with low evidence for B. \n[Panel labels: (b) Evidence: A=40, B=100, C=5, with Threshold = 69 and Threshold = 119; \n(c) Evidence: A=40, B=60, C=5, with Threshold = 69 and Threshold = 119.] \n\nFigure 2: low thresholds and high evidence, as in the left half of Figure 1b. \n(a) Legal state transitions at zero temperature; (b) the corresponding energy surface. \n[Panel label: Evidence: A=40, B=100; Threshold = 69.] \n\n[Figure 3: caption and plots not recoverable from the extracted text.] \n", "award": [], "sourceid": 160, "authors": [{"given_name": "David", "family_name": "Touretzky", "institution": null}]}