{"title": "Learning in Networks of Nondeterministic Adaptive Logic Elements", "book": "Neural Information Processing Systems", "page_first": 840, "page_last": 849, "abstract": null, "full_text": "840 \n\nLEARNING IN NETWORKS OF \n\nNONDETERMINISTIC ADAPTIVE LOGIC ELEMENTS \n\nRichard C. Windecker* \n\nAT&T Bell Laboratories, Middletown, NJ 07748 \n\nABSTRACT \n\nfrom \n\nThis \n\nholds \n\nsimpler nondeterministic adaptive \n\nThis paper presents a model of nondeterministic adaptive automata that are \nconstructed \ninformation processing \nelements. The first half of the paper describes the model. The second half discusses \nsome of its significant adaptive properties using computer simulation examples. \nChief among these properties is that network aggregates of the model elements can \nadapt appropriately when a single reinforcement channel provides the same positive \nor negative reinforcement signal to all adaptive elements of the network at the same \nfor multiple-input, multiple-output, multiple-layered, \ntime. \ncombinational and sequential networks. It also holds when some network elements \nare \"hidden\" \nthe external \nenvironment. \n\ntheir outputs are not directly seen by \n\nin \n\nthat \n\nINTRODUCTION \n\nThere are two primary motivations for studying models of adaptive automata \nconstructed from simple parts. First, they let us learn things about real biological \nsystems whose properties are difficult to study directly: We form a hypothesis \nabout such systems, embody it in a model, and then see if the model has reasonable \nlearning and behavioral properties. 
In the present work, the hypothesis being tested is: that much of an animal's behavior as determined by its nervous system is intrinsically nondeterministic; that learning consists of incremental changes in the probabilities governing the animal's behavior; and that this is a consequence of the animal's nervous system consisting of an aggregate of information processing elements, some of which are individually nondeterministic and adaptive. The second motivation for studying models of this type is to find ways of building machines that can learn to do (artificially) intelligent and practical things. This approach has the potential of complementing the currently more developed approach of programming intelligence into machines.

We do not assert that there is necessarily a one-to-one correspondence between real physiological neurons and the postulated model information processing elements. Thus, the model may be loosely termed a "neural network model," but is more accurately described as a model of adaptive automata constructed from simple adaptive parts.

* The main ideas in this paper were conceived and initially developed while the author was at the University of Chiang Mai, Thailand (1972-73). The ideas were developed further and put in a form consistent with existing switching and automata theory during the next four years. For two of those years, the author was at the University of Guelph, Ontario, supported by National Research Council of Canada Grant #A6983.

© American Institute of Physics 1988

It almost certainly has to be a property of any acceptable model of animal learning that a single reinforcement channel providing reinforcement to all the adaptive elements in a network (or subnetwork) can effectively cause that network to adapt appropriately. Otherwise, methods of providing separate, specific reinforcement to all adaptive elements in the network must be postulated.
Clearly, the environment reinforces an animal as a whole, and the same reinforcement mechanism can cause the animal to adapt to many types of situation. Thus, the reinforcement system is non-specific to particular adaptive elements and particular behaviors. The model presented here has this property.

The model described here is a close cousin to the family of models recently described by Barto and coworkers [1-4]. The most significant differences are: 1) In the present model, we define the timing discipline for networks of elements more explicitly and completely. This particular timing discipline makes the present model consistent with a nondeterministic extension of switching and automata theory previously described [5]. 2) In the present model, the reinforcement algorithm that adjusts the weights is kept very simple. With this algorithm, positive and negative reinforcement have symmetric and opposite effects on the weights. This ensures that the logical signals are symmetric opposites of each other. (Even small differences in the reinforcement algorithm can make both subtle as well as profound differences in the behavior of the model.) We also allow null, or zero, reinforcement.

As in the family of models described by Barto, networks constructed within the present model can get "stuck" at a suboptimal behavior during learning and therefore not arrive at the optimal adapted state. The complexity of the Barto reinforcement algorithm is designed partly to overcome this tendency. In the present work, we emphasize the use of training strategies when we wish to ensure that the network arrives at an optimal state. (In nature, it seems likely that getting "stuck" at suboptimal behavior is common.) In all networks studied so far, it has been easy to find strategies that prevent the network from getting stuck.
\n\nthese \n\nThe chief contributions of the present work are: 1) The establishment of a \nclose connection between \ntypes of models and ordinary, nonadaptive, \nswitching and automata theory 0. This makes the wealth of knowledge in this area, \nespecially network synthesis and analysis methods, readily applicable to the study \nof adaptive networks. \nthat sequential \n(\"recurrent\") nondeterministic adaptive networks can adapt appropriately. Such \nnetworks can learn to produce outputs that depend on the recent sequence of past \ninputs, not just the current inputs. 3) The demonstration that the use of training \nstrategies can not only prevent a network from getting stuck, but may also result in \nmore rapid learning. Thus, such strategies may be able to compensate, or even \nmore than compensate, for reduced complexity in the model itself. \n\n2) The experimental demonstration \n\nReferences 2-4 and 6 provide a comprehensive background and guide to the \nliterature on both deterministic and nondeterministic adaptive automata including \nthose constructed from simple parts and those not. \n\nTHE MODEL ADAPTIVE ELEMENT \n\nThe model adaptive element postulated in this work is a nondeterministic, \nadaptive generalization of \nthese elements \nNondeterministic Adaptive Threshold-logic gates (NATs). The output chosen by a \nNAT at any given time is not a function of its inputs. Rather, it is chosen by a \nstochastic process according to certain probabilities. It is these probabilities that \nare a function of the inputs. \n\nlogic 7. Thus, we call \n\nthreshold \n\nA NAT is like an ordinary logic gate in that it accepts logical inputs that are \ntwo-valued and produces a logical output that is two-valued. We let these values be \n\n\f842 \n\n+ 1 and -1. A NAT also has a timing input channel and a reinforcement input \nchannel. The NAT operates on a three-part cycle: 1) Logical input signals are \nchanged and remain constant. 
2) A timing signal is received and the NAT selects a new output based on the inputs at that moment. The new output remains constant. 3) A reinforcement signal is received and the weights are incremented according to certain rules.

Let N be the number of logical input channels, let x_i represent the i-th input signal, and let z be the output. The NAT has within it N+1 "weights," w_0, w_1, ..., w_N. The weights are confined to integer values. For a given set of inputs, the gate calculates the quantity W:

    W = w_0 + sum_{i=1}^{N} w_i x_i                                          (1)

Then the probability that output z = +1 is chosen is:

    P(z = +1) = (1/(sqrt(2 pi) sigma)) integral_{-inf}^{W} exp(-x^2/(2 sigma^2)) dx
              = (1/sqrt(pi)) integral_{-inf}^{W/(sqrt(2) sigma)} exp(-xi^2) d(xi)    (2)

where xi = x/(sqrt(2) sigma). (An equivalent formulation is to let the NAT generate a random number, W_q, according to the normal distribution with mean zero and variance sigma^2. Then if W > -W_q, the gate selects the output z = +1. If W < -W_q, the gate selects output z = -1. If W = -W_q, the gate selects output -1 or +1 with equal probability.)

Reinforcement signals, R, may have one of three values: +1, -1, and 0, representing positive, negative, and no reinforcement, respectively. If +1 reinforcement is received, each weight is incremented by one in the direction that makes the current output, z, more likely to occur in the future when the same inputs are applied; if -1 reinforcement is received, each weight is incremented in the direction that makes the current output less likely; if 0 reinforcement is received, the weights are not changed. These rules may be summarized: dw_0 = zR and dw_i = x_i zR for i > 0.

NATs operate in discrete time because if the NAT can choose output +1 or -1, depending on a stochastic process, it has to be told when to select a new output. It cannot "run freely," or it could be constantly changing output. Nor can it change output only when its inputs change, because it may need to select a new output even when they do not change.
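The element just described can be sketched in a few lines of Python. The class and method names are illustrative, not from the paper; the paper specifies only the selection rule of Eqs. (1)-(2), the update rule dw_0 = zR, dw_i = x_i zR, and sigma = 10:

```python
import random

class NAT:
    """Sketch of a Nondeterministic Adaptive Threshold-logic gate.

    Inputs and outputs take values +1/-1; weights are integers.
    """

    def __init__(self, n_inputs, sigma=10.0, rng=None):
        self.w = [0] * (n_inputs + 1)   # w[0] is the bias weight w_0
        self.sigma = sigma
        self.rng = rng or random.Random()
        self.x = [1] * n_inputs         # last inputs seen
        self.z = 1                      # last output selected

    def select_output(self, x):
        """On a timing signal: compute W (Eq. 1) and choose z stochastically."""
        self.x = list(x)
        W = self.w[0] + sum(wi * xi for wi, xi in zip(self.w[1:], x))
        # Equivalent formulation from the paper: draw Wq ~ N(0, sigma^2)
        # and select +1 iff W > -Wq; this realizes Eq. (2).
        Wq = self.rng.gauss(0.0, self.sigma)
        self.z = 1 if W > -Wq else -1
        return self.z

    def reinforce(self, R):
        """Apply R in {+1, 0, -1}: dw_0 = z*R, dw_i = x_i*z*R."""
        self.w[0] += self.z * R
        for i, xi in enumerate(self.x, start=1):
            self.w[i] += xi * self.z * R
```

Note that a +1 output followed by +1 reinforcement changes the weights exactly as a -1 output followed by -1 reinforcement, which is the no-hysteresis property discussed below.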
\n\nreinforcement, \n\nrespectively. \n\nThe normal distribution is used for heuristic reasons. If a real neuron (or an \naggregate of neurons) uses a stochastic process to produce nondeterministic \nbehavior, it is likely that process can be described by the normal distribution. In \nany case, the exact relationship between P{z = + 1) and W is not critical. What is \nimportant is that P(z = + 1) be monotonically increasing in W, go to 0 and 1 \nasymptotically as W goes to - 00 and + 00, respectively, and equal 0.5 at W = O. \n\nThe parameter u \n\nis adjustable. We use 10 in \n\nthe computer simulation \nexperiments described below. Experimentally, values near 10 work reasonably well \nfor networks of NATs having few inputs. Note that as u goes to zero, the behavior \nof a NAT approximates that of an ordinary deterministic ada pt,ive threshold logic \ngate with the difference that the output for the case W = 0 is not arbitrary: The \nNAT will select output +1 or -1 with equal probability. \nNote that for all values of W, the probabilit,ies are greater than zero that \neither + 1 or -1 will be chosen, although for large values of W (relative to u) for all \n\n\f843 \n\npractical purposes, the behavior is deterministic. There are many values of the \nweights that cause \nthe NAT to approximate the behavior of a deterministic \nthreshold logic gate. ~or the same reasons that deterministic threshold logic gates \nfunctions of N variables 7, so a NAT cannot learn to \ncannot realize all 22 \napproximate any deterministic function; only the threshold logic functions. \n\nNote also that when the weights are near zero, a NAT adapts most rapidly \nwhen both positive and negative reinforcement are used in approximately equal \namounts. As the NAT becomes more likely to produce the appropriate behavior, \nthe opportunity to use negative reinforcement decreases while the opportunity to \nuse positive reinforcement increases. 
This means that a NAT cannot learn to (nearly) always select a certain output if negative reinforcement alone is used. Thus, positive reinforcement has an important role in this model. (In most deterministic models, positive reinforcement is not useful.)

Note further that there is no hysteresis in NAT learning. For a given configuration of inputs, a +1 output followed by a +1 reinforcement has exactly the same effect on all the weights as a -1 output followed by a -1 reinforcement. So the order of such events has no effect on the final values of the weights.

Finally, if only negative reinforcement is applied to a NAT, independent of output, for a particular combination of inputs, the weights will change in the direction that makes W tend toward zero and, once there, follow a random walk centered on zero. (The further W is from zero, the more likely its next step will be toward zero.) If all possible input combinations are applied with more or less equal probability, all the weights will tend toward zero and then follow random walks centered on zero. In this case, the NAT will select +1 or -1 with more or less equal probability, without regard to its inputs.

NETWORKS

NATs may be connected together in networks (NAT-nets). The inputs to a NAT in such a network can be selected from among: 1) the set of inputs to the entire network, 2) the set of outputs from other NATs in the network, and 3) its own output. The outputs of the network may be chosen from among: 1) the inputs to the network as a whole, and 2) the outputs of the various NATs in the network.

Following Ref. 5, we impose a timing discipline on a NAT-net. The network is organized into layers such that each NAT belongs to one layer. Letting L be the number of layers, the network operates as follows: 1) All NATs in a given layer receive timing signals at the same time and select a new output at the same time.
\n2) Timing signals are received by the different layers, in sequence, from 1 to L. 3) \nInputs to the network as a whole are levels that may change only before Layer 1 \nreceives its timing signal. Similarly, outputs from the network as a whole are \navailable to the environment only after Layer L has received its timing signal. \nReinforcement to the network as a whole is accepted only after outputs are made \navailable to the environment. The same reinforcement signal is distributed to all \nNATs in the network at the same time. \n\nWith these rules, NAT-nets operate through a sequence of timing cycles. In \neach cycle: 1) Network inputs are changed. 2) Layers 1 through L select new \noutputs, in sequence. 3) Network outputs are made available to the environment. \n4) Reinforcement is received from the environment. We call each such cycle a \n\"trial\" and a sequence of such trials is a \"session.\" \n\nThis model is very general. If, for each gate, inputs are selected only from \namong the inputs to the network as a whole and from the outputs of gates in layers \npreceding it in the timing cycle, then the network is combinational. In this case, the \nprobability of the network producing a given output configuration is a function of \nthe inputs at the start of the timing cycle. If at least one NAT has one input from a \n\n\f844 \n\nit to remember \n\nNAT in the same layer or from a subsequent layer in the timing cycle, then the \nnetwork is sequential. In this case, the network may have \"internal states\" that \nallow \nto the next. Thus, the \nprobabilities governing its choice of outputs may depend on inputs in previous \ncycles. So sequential NAT-nets may have short-term memory embodied in internal \nstates and long-term memory embodied in the weights. In Ref. 5, we showed that \nsequential networks can be constructed by adding feedback paths to combinational \nnetworks and any sequential network can be put in this standard form. 
\n\ninformation from one cycle \n\nIn information-theoretic terms: 1) A NAT-net with no inputs and some \noutputs is an \"information source.\" 2) A NAT-net with both inputs and outputs is \nan information \"channel.\" 3) A combinational NAT-net is \"memory-less\" while a \nsequential NAT-net has memory. In this context, note that a NAT-net may operate \nin an environment that is either deterministic or nondeterministic. Both the logical \nand the reinforcement inputs can be selected by stochastic processes. Note also \nthat nondeterministic and deterministic elements as well as adaptive and \nnonadaptive elements can be combined in one network. \n(It may be that the \ndecision-making parts of an animal's nervous system are nondeterministic and \nadaptive while the information transmitting parts (sensory data-gathering and the \nmotor output parts) are deterministic and nonadaptive.) \n\nOne capability that combinational NAT-nets possess is that of \"pattern \nrecognizers.\" A network having many inputs and one or a few outputs can \n\"recognize\" a small subset of the potential input patterns by producing a particular \noutput pattern with high probability when a member of the recognized subset \nappears and a different output pattern otherwise. \nIn practice, the number of \npossible input patterns may be so large that we cannot present them all for training \npurposes and must be content to train the network to recognize one subset by \ndistinguishing it (with different output pattern) from another subset. In this case, \nif a pattern is subsequently presented to the network that has not been in one of \nthe training sets, the probabilities governing its output may approach one or zero, \nbut may well be closer to 0.5. The exact values will depend on the details of the \ntraining period. If the new pattern is similar to those in one of the training sets, the \nNAT-net will often have a high probability of producing the same output as for that \nset. 
This associative property is the analog of the well known associative property in deterministic models. If the network lacks sufficient complexity for the separation we wish to make, then it cannot be trained. For example, a single N-input NAT cannot be trained to recognize any arbitrary set of input patterns by selecting the +1 output when one of them is presented and -1 otherwise. It can only be trained to make separations that correspond to threshold functions.

A combinational NAT-net can also produce patterns. By analogy with a pattern recognizer, a NAT-net with none or a few inputs and a larger number of outputs can learn, for each input pattern, to produce a particular subset of the possible output patterns. Since the mapping may be few-to-many, instead of many-to-few, the goal of training in this case may or may not be to have the network approximate deterministic behavior. Clearly, the distinction between pattern recognizers and pattern producers is somewhat arbitrary: in general, NAT-nets are pattern transducers that map subsets of input patterns into subsets of output patterns. A sequential network can "recognize" patterns in the time-sequence of network inputs and produce patterns in the time-sequence of outputs.

SIMULATION EXPERIMENTS

In this Section, we discuss computer simulation results for three types of multiple-element networks. For two of these types, certain strategies are used to train the networks. In general, these strategies have two parts that alternate, as needed. The first part is a general scheme for providing network inputs and reinforcement that tends to train all elements in the network in the desired direction. The second part is substituted temporarily when it becomes apparent that the network is getting stuck in some suboptimal behavior. It is focused on getting the network unstuck. The strategies used here are intuitive.
In general, there appear to be many strategies that will lead the network to the desired behavior. While we have made some attempt to find strategies that are reasonably efficient, it is very unlikely that the ones used are optimal. Finally, these strategies have been tested in hundreds of training sessions. Although they worked in all such sessions, there may be some (depending on the sequence of random numbers generated) in which they would not work.

In describing the networks simulated, Figs. 1-3, we use the diagrammatic conventions defined in Ref. 5: We put all NATs in the same layer in a vertical line, with the various layers arranged from left to right in their order in the timing cycle. Inputs to the entire network come in from the left; outputs go out to the right. Because the timing cycle is fixed, we omit the timing inputs in these figures. For similar reasons, we also omit the reinforcement inputs.

In the simulations described here, the weights in the NATs start at zero, making the network outputs completely random in the sense that on any given trial, all outputs are equally likely to occur, independent of past or present inputs. As learning proceeds, some or all the weights become large, so that the NAT-net's selection of outputs is strongly influenced by some or all of its inputs and internal connections. (Note that if the weights do not start at zero, they can be driven close to zero by using negative reinforcement.) In general, the optimum behavior toward which the network adapts is deterministic. However, because the probabilities are never identically equal to zero or one, we apply an arbitrary criterion and say that a NAT-net has learned the appropriate behavior when that criterion is satisfied. In real biological systems, we cannot know the weights or the exact probabilities governing the behavior of the individual adaptive elements.
Therefore, it is appropriate to use a criterion based on observable behavior. For example, the criterion might be that the network selects the correct response (and continues to receive appropriate reinforcement) 25 times in a row.

Note that NAT-nets can adapt appropriately when the environment is not deliberately trying to make them behave in a particular way. For example, the environment may provide inputs according to some (not necessarily deterministic) pattern and there may be some independent mechanism that determines whether the NAT-net is responding appropriately or not and provides the reinforcement accordingly. One paradigm for this situation is a game in which the NAT-net and the environment are players. The reinforcement scheme is simple: if, according to the rules of the game, the NAT-net wins a play (= trial) of the game, reinforcement is +1; if it loses, -1.

For a NAT-net to adapt appropriately in this situation, the game must consist of a series of similar plays. If the game is competitive, the best strategy a given player has depends on how much information he has about the opponent, and vice versa. If a player assumes that his opponent is all-knowing, then his best strategy is to minimize his maximum loss, and this often means playing at random, or at least according to certain probabilities. If a player knows a lot about how his opponent plays, his best strategy may be to maximize gain. This often means playing according to some deterministic strategy.

The example networks described here are special cases of three types: pattern recognizing (combinational multiple-input, multiple-layered, few-output) networks, pattern producing (combinational multiple-output) networks, and game playing (sequential) networks.
The associative properties of NATs and NAT-nets are not emphasized here because they are analogous to the well known associative properties of other related models.

A Class of Simple Pattern Producing Networks

A simple class of pattern producing networks consists of the single-layer type shown in Fig. 1. Each of M NATs in such a network has no inputs, only an output. As a consequence, each has only one weight, w_0. The network is a simple, adaptive, information source.

Fig. 1. A Simple Pattern Producing Network

Consider first the case in which the network contains only one NAT and we wish to train it to always produce a simple "pattern," +1. We give positive reinforcement when it selects +1 and negative reinforcement otherwise. If w_0 starts at 0, it will quickly grow large, making the probability of selecting +1 approach unity. The criterion we use for deciding that the network is trained is that it produce a string of 25 correct outputs. Table I shows that in 100 sessions, this one-NAT network selected +1 output for the next 25 trials starting, on average, at trial 13.

Next consider a network with two NATs. They can produce four different output patterns. If both weights are 0, they will produce each of the patterns with equal probability. But they can be trained to produce one pattern (nearly) all the time. If we wish to train this subnetwork to produce the pattern (in vector notation) [+1 +1], one strategy is to give no reinforcement if it produces patterns [-1 +1] or [+1 -1], give it positive reinforcement if it produces [+1 +1], and negative reinforcement if it produces [-1 -1].
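The one-NAT session just described can be sketched directly. Function and variable names are illustrative; the paper specifies the selection rule, the update dw_0 = zR, sigma = 10, and the 25-in-a-row criterion. Since z*R = +1 whether the gate is rewarded for a +1 or punished for a -1, the single weight w_0 increases by one on every trial:

```python
import random

def train_one_nat(sigma=10.0, run_length=25, max_trials=10_000, rng=None):
    """Train a single no-input NAT to always emit +1.

    Reinforcement: +1 when the output is +1, -1 otherwise.  Returns the trial
    at which a run of `run_length` consecutive +1 outputs began, or None.
    """
    rng = rng or random.Random()
    w0, run, trial = 0, 0, 0
    while trial < max_trials:
        trial += 1
        z = 1 if w0 > -rng.gauss(0.0, sigma) else -1   # select output
        R = 1 if z == 1 else -1                        # reinforce
        w0 += z * R                                    # dw_0 = z*R (always +1 here)
        run = run + 1 if z == 1 else 0
        if run == run_length:
            return trial - run_length + 1              # first trial of the run
    return None
```

Averaging the returned value over many seeded sessions lands in the neighborhood of the paper's reported average of trial 13.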
Table I shows that in 100 sessions, this network learned to produce the desired pattern (by producing a string of 25 correct outputs) in about 25 trials. Because we initially gave reinforcement only about 50% of the time, it took longer to train two NATs than one.

Table I. Training Times For Networks Per Fig. 1.

    M     Min   Ave   Max
    1       1    13    26
    2       8    25    43
    4      18    35    60
    8      44    70   109
    16     49   115   215

Next, consider the 16-NAT network in Fig. 1. Now there are 2^16 possible patterns the network can produce. When all the weights are zero, each has probability 2^-16 of being produced. An ineffective strategy for training this network is to provide positive reinforcement when the desired pattern is produced, negative reinforcement when its opposite is produced, and zero reinforcement otherwise. A better strategy is to focus on one output of the network at a time, training each NAT separately (as above) to have a high probability of producing the desired output. Once all are trained to a relatively high level, the network as a whole has a reasonable chance of producing exactly the correct output. Now we can provide positive reinforcement when it does and no reinforcement otherwise. With this two-stage hybrid strategy, the network will soon meet the training criterion. The time it takes to train a network of M elements with a strategy of this type is roughly proportional to M, not 2^(M-1), as for the first strategy.

A still more efficient strategy is to alternate between a general substrategy and a substrategy focused on keeping the network from getting "stuck." One effective general substrategy is to give positive reinforcement when more than half of the NATs select the desired output, negative reinforcement when less than half select the desired output, and no reinforcement when exactly half select the desired output.
This substrategy starts out with approximately equal amounts of positive and negative reinforcement being applied. Soon, the network selects more than half of the outputs correctly more and more of the time. Unfortunately, there will tend to be a minority subset with low probability of selecting the correct output. At this stage, we must recognize this subset and switch to a substrategy that focuses on the elements of this subset, following the strategy for one or two elements, above. When all NATs have a sufficiently high probability of selecting the desired output, training can conclude with the first substrategy.

The strategies used to obtain the results for M = 4, 8, and 16 in Table I were slightly more complicated variants of this two-part strategy. In all of them, a running average was kept of the number of right responses given by each NAT. Letting O_j be the "correct" output for z_j, the running average after the t-th trial, A_j(t), is:

    A_j(t) = B A_j(t-1) + O_j z_j(t)                                         (3)

where B is a fraction generally in the range 0.75 to 0.9. If A_j(t) for a particular j gets too far below the combined average for all j, then training focuses on the j-th element until its average improves. The significance of the results given in Table I is not the details of the strategies used, nor how close the training times may be to the optimum. Rather, it is the demonstration that training strategies exist such that the training time grows significantly more slowly than in proportion to M.

A Simple Pattern Recognizing Network

Fig. 2. A Two-Element Network That Learns XOR

As mentioned above, there are fewer threshold logic functions of N variables (for N > 1) than the total possible functions. For N = 2, there are 14. The remaining two are the "exclusive or" (XOR) and its complement.
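The running average of Eq. (3) above, used by the two-part strategies to spot lagging elements, can be sketched as follows. The `lagging_element` helper and its `margin` parameter are hypothetical illustrations of "too far below the combined average"; the paper does not specify that threshold:

```python
def update_running_average(prev_avg, correct_output, actual_output, B=0.8):
    """One step of Eq. (3): A_j(t) = B*A_j(t-1) + O_j*z_j(t).

    With O_j, z_j in {+1, -1}, the average climbs toward 1/(1-B) while the
    element keeps answering correctly and falls when it does not.
    """
    return B * prev_avg + correct_output * actual_output

def lagging_element(averages, margin=1.0):
    """Hypothetical helper: index of the element whose running average is more
    than `margin` below the mean over all elements, or None if none is."""
    mean = sum(averages) / len(averages)
    worst = min(range(len(averages)), key=lambda j: averages[j])
    return worst if averages[worst] < mean - margin else None
```

For B = 0.8 the fixed point under consistently correct responses is 1/(1 - 0.8) = 5, so the scale of `margin` depends on the chosen B.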
Multi-layered networks are needed to realize these functions, and an important test of any adaptive network model is its ability to learn XOR. The network in Fig. 2 is one of the simplest networks capable of learning this function. Table II gives the results of 100 training sessions with this network. The strategy used to obtain these results again had two parts. The general part consisted of supplying each of the four possible input patterns to the network in rotation, giving appropriate reinforcement each trial. The second part involved keeping a running average (similar to Eq. (3)) of the responses of the network by input combination. When the average for one combination fell significantly behind the average for all, training was focused on just that combination until performance improved. The criterion used for deciding when training was complete was a sequence of 50 correct responses (for all input patterns together).

Table II. Training Times For The Network In Fig. 2.

    Network   Function    Min     Ave       Max
    Fig. 2    OR           18      57       106
    Fig. 2    XOR         218     681      1992
    Ref. 2    XOR        ~700   ~3500   ~14,300
    Ref. 8    XOR        2232       -         -

For comparison, Table II shows results for the same network trained to realize the normal OR function. Also shown for comparison are numbers taken from Refs. 2 and 8 for the equivalent network in those different models. These are nondeterministic and deterministic models, respectively. The numbers from Ref. 2 are not exactly comparable with the present results for several reasons. These include: 1) The criterion for judging when the task was learned was not the same; 2) In Ref. 2, the "wrong" reinforcement was deliberately applied 10% of the time to test learning in this situation; 3) Neither model was optimized for the particular task at hand.
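That a two-element, two-layer arrangement has sufficient complexity for XOR can be checked in the deterministic limit sigma -> 0, where each NAT reduces to a threshold gate. The weights below are one hand-picked solution, not weights produced by training, and the wiring (the output gate sees both network inputs plus the hidden gate's output) is an assumption about Fig. 2:

```python
def threshold(w0, weights, inputs):
    """Deterministic limit of a NAT: z = +1 iff W = w0 + sum(w_i * x_i) > 0."""
    W = w0 + sum(w * x for w, x in zip(weights, inputs))
    return 1 if W > 0 else -1

def xor_net(x1, x2):
    # Hidden gate: AND of the two inputs (+1 only when both are +1).
    h = threshold(-1, [1, 1], [x1, x2])
    # Output gate: OR of the inputs, with -2*h vetoing the both-high case.
    return threshold(-1, [1, 1, -2], [x1, x2, h])
```

Neither gate alone computes XOR (it is not a threshold function of two variables), which is why the hidden element is indispensable and why XOR is the standard test here.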
Nonetheless, if these (and other) differences were taken into account, it is likely that the NAT-net would have learned the XOR function significantly faster.

The significance of the present results is that they suggest that the use of a training strategy can not only prevent a network from getting stuck, but may also facilitate more rapid learning. Thus, such strategies can compensate, or more than compensate, for reduced complexity in the reinforcement algorithm.

A Simple Game-Playing Network

Here, we consider NAT-nets in the context of the game of "matching pennies." In this game, each player has a stack of pennies. At each play of the game, each player places one of his pennies, heads up or heads down, but covered, in front of him. Each player uncovers his penny at the same time. If they match, player A adds both to his stack; otherwise, player B takes both.

Game theory says that the strategy of each player that minimizes his maximum loss is to play heads and tails at random. Then A cannot predict B's behavior and at best can win 50% of the time, and likewise for B with respect to A. This is a conservative strategy on the part of each player because each assumes that the other has (or can derive through a sequence of plays), and can use, information about the other player's strategy. Here, we make the different assumptions that: 1) Player B does not play at random, 2) Player B has no information about A's strategy, and 3) Player B is incapable of inferring any information about A through a sequence of plays and in any event is incapable of changing its strategy. Then, if A has no information about B's pattern of playing at the start of the game, A's best course of action is to try to infer a non-random pattern in B's playing through a sequence of plays and subsequently take advantage of that knowledge to win more often than 50% of the time.
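The simplest instance of this kind of inference, an opponent with a fixed bias, can be simulated directly. Here a single no-input NAT plays A, reinforced +1 on a match and -1 otherwise; the 0.55 heads bias and the function name are illustrative. Whichever move A makes, the expected weight change is +0.1 toward heads (if A plays heads, E[dw_0] = 0.55 - 0.45; if tails, E[dw_0] = 0.55 - 0.45 again), so w_0 drifts up until A plays heads essentially always:

```python
import random

def play_matching_pennies(p_heads=0.55, n_plays=5000, sigma=10.0, rng=None):
    """A no-input NAT plays A against an opponent B that plays heads (+1)
    with probability p_heads.  Returns (final bias weight w0, fraction of
    heads A played over the last 1000 plays)."""
    rng = rng or random.Random()
    w0, recent = 0, []
    for _ in range(n_plays):
        a = 1 if w0 > -rng.gauss(0.0, sigma) else -1   # A's move
        b = 1 if rng.random() < p_heads else -1        # B's biased move
        R = 1 if a == b else -1                        # win = pennies match
        w0 += a * R                                    # dw_0 = z*R
        recent.append(a)
        recent = recent[-1000:]
    return w0, sum(1 for a in recent if a == 1) / len(recent)
```

Once adapted, A wins with probability p_heads, which is the best any strategy can do against this particular B.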
An adaptive NAT-net, as A, can adapt appropriately in situations of this type. For example, suppose a single NAT of the type in Fig. 1 plays A, where +1 output means heads and -1 output means tails. A third agent supplies reinforcement +1 if the NAT wins a play, -1 otherwise. Suppose B plays heads with 0.55 probability and tails with 0.45 probability. Then A will learn over time to play heads 100% of the time and thereby maximize its total winnings by winning 55% of the time.

A more complicated situation is the following. Suppose B repeats its own move two plays ago 80% of the time, and plays the opposite 20% of the time. A NAT-net with the potential to adapt to this strategy and win 80% of the time is shown in Fig. 3. This is a sequential network shown in the standard form of a combinational network (in the dotted rectangle) plus a feedback path. The input to the network at time t is B's play at t-1. The output is A's move. The top NAT selects its output at time t based partly on the bottom NAT's output at time t-1. The bottom NAT selects its output at t-1 based on its input at that time, which is B's output at t-2. Thus, the network as a whole can learn to select its output based on B's play two time increments past. Simulation of 100 sessions resulted in the network learning to do this 98 times. On average, it took 468 plays (Min 20, Max 4137) to reach the point at which the network repeated B's move two times past on the next 50 plays. For two sessions the network got stuck (for an unknown number of plays greater than 25,000) playing the opposite of B's last move or always playing tails. (The first two-part strategy found that trains the network to repeat B's output two time increments past without getting stuck (not in the game-playing context) took an average of 260 trials (Min 25, Max 1943) to meet the training criterion.)

Fig. 3. A Sequential Game-Playing Network
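The single-NAT case above can be illustrated with a minimal simulation. The linear reward-inaction update used here is an assumed stand-in for the paper's NAT adjustment rule (which is defined earlier in the paper, not in this section), and `play_session` with its parameters is purely illustrative; the point is only that a stochastic unit rewarded for matching a 0.55-biased B drifts toward playing heads essentially all the time.

```python
import random

# Minimal sketch of the single-NAT matching-pennies example. Assumed
# update rule (reward-inaction): a win (+1 reinforcement) nudges A's
# heads-probability toward the move just played; a loss (-1) leaves it
# unchanged.

def play_session(p_b_heads=0.55, rate=0.01, plays=20000, seed=0):
    rng = random.Random(seed)
    p = 0.5                        # A's probability of playing heads
    for _ in range(plays):
        a_heads = rng.random() < p
        b_heads = rng.random() < p_b_heads
        if a_heads == b_heads:     # pennies match: A wins the play
            target = 1.0 if a_heads else 0.0
            p += rate * (target - p)
        # on a loss, p is left unchanged (reward-inaction)
    return p

# Against a B biased 55/45 toward heads, A's heads-probability
# typically drifts to nearly 1, so A wins about 55% of plays.
print(round(play_session(), 3))
```

Since heads wins 55% of the time and tails only 45%, the expected drift of p is always toward 1, matching the behavior described above for the single NAT.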
A Sequential Game(cid:173)\n\nPlaying Network \n\nThe significance of these results is that a sequential NAT-net can learn to \nproduce appropriate behavior. Note that hidden NATs contributed to appropriate \nbehavior for both this network and the one that learned XOR, above. \n\nCONCLUDING REMARKS \n\nThe examples above have been kept simple in order to make them readily \nunderstandable. They are not exhaustive in the sense of covering all possible types \nof situations in which NAT-nets can adapt appropriately. Nor are they definitive in \nthe sense of proving generally and \nin what situations NAT-nets can adapt \nappropriately. Rather, they are illustrative in the sense of demonstrating a variety \nof significant adaptive abilities. They provide an existence proof that NAT-nets can \nadapt appropriately and relatively easily in a wide variety of situations. \n\nthe hypothesis \n\nThe fact that nondeterministic models can learn when the same reinforcement \nis applied to all adaptive elements, while deterministic models generally cannot, \nsupports \n(partly) \nnondeterministic. Experimental characterization of how animal learning does, or \ndoes not get \"stuck,\" as a function of learning environment or training strategy, \nwould be a useful test of the ideas presented here. \nREFERENCES \n\nthat animal nervous \n\nsystems may be \n\n1. Barto, A. G., \"Game-Theoretic Cooperativity in Networks of Self-Interested \nUnits,\" pp. 41-46 in Neural Networks for Computing, J. S. Denker, Ed., AlP \nConference Proceedings 151, American Institute of Physics, New York, 1986. \n\n2. Barto, A. G., Human Neurobiology, 4, 229-256, 1985. \n3. Barto, A. G., R. S. Sutton, and C. W. Anderson, IEEE Transactions on \n\nSystems, Man, and Cybernetics, SMC-13, No.5, 834-846, 1983. \n\n4. Barto, A. G., and P. Anandan, IEEE Transactions on Systems, Man, and \n\nCybernetics, SMC-15, No.3, 360-375, 1985. \n\n5. Windecker, R. C., Information Sciences, 16, 185-234 (1978). \n6. 
Rumelhart, D. E., and J. L. McClelland, Parallel Distributed Processing, MIT Press, Cambridge, 1986.
7. Muroga, S., Threshold Logic And Its Applications, Wiley-Interscience, New York, 1971.
8. Rumelhart, D. E., G. E. Hinton, and R. J. Williams, Chapter 8 in Ref. 6.