{"title": "Discovering the Structure of a Reactive Environment by Exploration", "book": "Advances in Neural Information Processing Systems", "page_first": 439, "page_last": 446, "abstract": null, "full_text": "Discovering the Structure of a Reactive Environment by Exploration \n\n439 \n\nDiscovering the Structure of a Reactive Environment \n\nby Exploration \n\nMichael C. Mozer \n\nDepartment of Computer Science \nand Institute of Cognitive Science \n\nUniversity of Colorado \nBoulder, CO 80309-0430 \n\nJonatban Bachrach \n\nDepartmentofCompu~ \nand Infonnation Science \n\nUniversity of Massachusetts \n\nAmherst, MA 01003 \n\nABSTRACT \n\nConsider a robot wandering around an unfamiliar environment. performing ac(cid:173)\ntions and sensing the resulting environmental states. The robot's task is to con(cid:173)\nstruct an internal model of its environment. a model that will allow it to predict \nthe consequences of its actions and to determine what sequences of actions to \ntake to reach particular goal states. Rivest and Schapire (1987&, 1987b; \nSchapire. 1988) have studied this problem and have designed a symbolic algo(cid:173)\nrithm to strategically explore and infer the structure of \"finite state\" environ(cid:173)\nments. The heart of this algorithm is a clever representation of the environment \ncalled an update graph. We have developed a connectionist implementation of \nthe update graph using a highly-specialized network architecture. With back \npropagation learning and a trivial exploration strategy -\ntions -\ngorithm on simple problems. The network has the additional strength that it \ncan accommodate stochastic environments. Perhaps the greatest virtue of the \nconnectionist approach is that it suggests generalizations of the update graph \nrepresentation that do not arise from a traditional, symbolic perspective. 
\n\nchoosing random ac(cid:173)\nthe connectionist network can outperfonn the Rivest and Schapire al(cid:173)\n\n1 INTRODUCTION \n\nConsider a robot placed in an unfamiliar environment The robot is allowed to wander \naround the environment, performing actions and sensing the resulting environmental \nstates. With sufficient exploration, the robot should be able to construct an internal \nmodel of the environment, a model that will allow it to predict the consequences of its ac(cid:173)\ntions and to determine what sequence of actions must be taken to reach a particular goal \nstate. In this paper, we describe a connectionist network that accomplishes this task, \nbased on a representation of finite-state automata developed by Rivest and Scbapire \n\n\f440 Mozer and Bachrach \n\n(1987a, 1987b; Schapire. 1988). \nThe environments we wish to consider can be modeled by a finite-state automaton (FSA). \nIn each environment. the robot has a set of discrete actions it can execute to move from \none environmental state to another. At each environmental state. a set of binary-valued \nsensations can be detected by the robot To illustrate the concepts and methods in our \nwork, we use as an extended example a simple environment, the n -room world (from \nRivest and Schapire). The n -room world consists of n rooms arranged in a circular \nchain. Each room is connected to the two adjacent rooms. In each room is a light bulb \nand light switch. The robot can sense whether the light in the room where it currently \nstands is on or off. The robot has three possible actions: move to the next room down \nthe chain (0). move to the next room up the chain (U). and toggle the light switch in the \ncurrent room (T). \n\n2 MODELING THE ENVIRONMENT \n\nIf the FSA corresponding to the n -room world is known, the sensory consequences of \nany sequence of actions can be predicted. Further. the FSA can be used to determine a \nsequence of actions to take to obtain a certain goal state. 
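The n-room world is easy to simulate directly. The following sketch (our own illustration, not code from Rivest and Schapire or from our implementation; the class name and interface are hypothetical) makes the action/sensation interface concrete:

```python
class NRoomWorld:
    """n rooms in a circular chain; the robot senses only whether the
    light in its current room is on.  (Hypothetical helper class.)"""

    def __init__(self, n, lights=None):
        self.n = n
        self.lights = list(lights) if lights is not None else [False] * n
        self.pos = 0  # index of the room the robot currently occupies

    def step(self, action):
        """Perform one of the three actions; return the new sensation."""
        if action == "U":       # move to the next room up the chain
            self.pos = (self.pos + 1) % self.n
        elif action == "D":     # move to the next room down the chain
            self.pos = (self.pos - 1) % self.n
        elif action == "T":     # toggle the light switch in this room
            self.lights[self.pos] = not self.lights[self.pos]
        return self.lights[self.pos]
```

Because the robot senses only a single light, many distinct environmental states produce identical observations; it is exactly this ambiguity that makes an internal model necessary.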
Although one might try developing an algorithm to learn the FSA directly, there are several arguments against doing so (Schapire, 1988). Most important is that the FSA often does not capture structure inherent in the environment. Rather than trying to learn the FSA, Rivest and Schapire suggest learning another representation of the environment called an update graph. The advantage of the update graph is that in environments with regularities, the number of nodes in the update graph can be much smaller than in the FSA (e.g., 2n versus 2^n for the n-room world). Rivest and Schapire's formal definition of the update graph is based on the notion of tests that can be performed on the environment, and the equivalence of different tests. In this section, we present an alternative, more intuitive view of the update graph that facilitates a connectionist interpretation. \n\nConsider a three-room world. To model this environment, the essential knowledge required is the status of the lights in the current room (CUR), the next room up from the current room (UP), and the next room down from the current room (DOWN). Assume the update graph has a node for each of these environmental variables. Further assume that each node has an associated value indicating whether the light in the particular room is on or off. \n\nIf we know the values of the variables in the current environmental state, what will their new values be after taking some action, say U? When the robot moves to the next room up, the new value of CUR becomes the previous value of UP; the new value of DOWN becomes the previous value of CUR; and in the three-room world, the new value of UP becomes the previous value of DOWN. As depicted in Figure 1a, this action thus results in shifting values around in the three nodes. 
This makes sense because moving up does not affect the status of any light, but it does alter the robot's position with respect to the three rooms. Figure 1b shows the analogous flow of information for the action D. Finally, the action T should cause the status of the current room's light to be complemented while the other two rooms remain unaffected (Figure 1c). In Figure 1d, the three sets of links from Figures 1a-c have been superimposed and have been labeled with the appropriate action. One final detail: the Rivest and Schapire update graph formalism does not make use of the \"complementation\" link. To avoid it, each node may be split into two values, one representing the status of a room and the other its complement (Figure 1e). Toggling thus involves exchanging the values of CUR and its complement CUR'. Just as the values of CUR, UP, and DOWN must be shifted for the actions U and D, so must their complements. \n\nGiven the update graph in Figure 1e and the value of each node for the current environmental state, the result of any sequence of actions can be predicted simply by shifting values around in the graph. Thus, as far as predicting the input/output behavior of the environment is concerned, the update graph serves the same purpose as the FSA. \n\nA defining and nonobvious (from the current description) property of an update graph is that each node has exactly one incoming link for each action. We call this the one-input-per-action constraint. For example, CUR gets input from CUR' for the action T, from UP for U, and from DOWN for D. \n\nFigure 1: (a) Links between nodes indicating the desired information flow on performing the action U. 
CUR represents the status of the lights in the current room, UP the status of the lights in the next room up, and DOWN the status of the lights in the next room down. (b) Links between nodes indicating the desired information flow on performing the action D. (c) Links between nodes indicating the desired information flow on performing the action T. The \"-\" on the link from CUR to itself indicates that the value must be complemented. (d) Links from the three separate actions superimposed and labeled by the action. (e) The complementation link can be avoided by adding a set of nodes that represent the complements of the original set. This is the update graph for a three-room world. \n\n3 THE RIVEST AND SCHAPIRE ALGORITHM \n\nRivest and Schapire have developed a symbolic algorithm (hereafter, the RS algorithm) to strategically explore an environment and learn its update graph representation. The RS algorithm formulates explicit hypotheses about regularities in the environment and tests these hypotheses one or a relatively small number at a time. As a result, the algorithm may not make full use of the environmental feedback obtained. It thus seems worthwhile to consider alternative approaches that could allow more efficient use of the environmental feedback, and hence, more efficient learning of the update graph. We have taken a connectionist approach, which has shown quite promising results in preliminary experiments and suggests other significant benefits. We detail these benefits below, but must first describe the basic approach. \n\n4 THE UPDATE GRAPH AS A CONNECTIONIST NETWORK \n\nHow might we turn the update graph into a connectionist network? Start by assuming one unit in a network for each node in the update graph. The activity level of the unit represents the truth value associated with the update graph node. Some of these units serve as \"outputs\" of the network. 
For example, in the three-room world, the output of the network is the unit that represents the status of the current room. In other environments, there may be several sensations, in which case there will be several output units. \n\nWhat is the analog of the labeled links in the update graph? The labels indicate that values are to be sent down a link when a particular action occurs. In connectionist terms, the links should be gated by the action. To elaborate, we might include a set of units that represent the possible actions; these units act to multiplicatively gate the flow of activity between units in the update graph. Thus, when a particular action is to be performed, the corresponding action unit is activated, and the connections that are gated by this action become enabled. If the action units form a local representation, i.e., only one is active at a time, exactly one set of connections is enabled at a time. Consequently, the gated connections can be replaced by a set of weight matrices, one per action. To predict the consequences of a particular action, the weight matrix for that action is plugged into the network and activity is allowed to propagate through the connections. Thus, the network is dynamically rewired contingent on the current action. \n\nThe effect of activity propagation should be that the new activity of a unit is the previous activity of some other unit. A linear activation function is sufficient to achieve this: \n\nX(t) = W_a(t) X(t-1),   (1) \n\nwhere a(t) is the action selected at time t, W_a(t) is the weight matrix associated with this action, and X(t) is the activity vector that results from taking action a(t). Assuming weight matrices which have zeroes in each row except for one connection of strength 1 (the one-input-per-action constraint), the activation rule will cause activity values to be copied around the network. 
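For the three-room update graph of Figure 1e, the weight matrices of Equation 1 are 6x6 permutation-like matrices. A brief NumPy sketch (the unit ordering and names below are our own illustrative choices, not notation from the paper):

```python
import numpy as np

# Unit order: [CUR, UP, DOWN, CUR', UP', DOWN'], where primed units hold
# the complements of the unprimed ones (Figure 1e).
U = np.zeros((6, 6))
for dst, src in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]:
    U[dst, src] = 1.0      # action U: CUR <- UP, UP <- DOWN, DOWN <- CUR
D = U.T.copy()             # action D is the inverse shift
T = np.eye(6)
T[[0, 3], [0, 3]] = 0.0
T[0, 3] = T[3, 0] = 1.0    # action T: swap CUR with its complement CUR'

W = {"U": U, "D": D, "T": T}

def predict(x, actions):
    """Equation 1: X(t) = W_a(t) X(t-1), applied over an action string."""
    for a in actions:
        x = W[a] @ x
    return x
```

Each row of each matrix contains a single 1, so propagation simply copies values around the graph, and the predicted sensation is the activity of the output unit (index 0 here).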
\n\n5 TRAINING THE NETWORK TO BE AN UPDATE GRAPH \n\nWe have described a connectionist network that can behave as an update graph, and now turn to the procedure used to learn the connection strengths in this network. For expository purposes, assume that the number of units in the update graph is known in advance. (This is not necessary, as we show in Mozer & Bachrach, 1989.) A weight matrix is required for each action, with a potential non-zero connection between every pair of units. As in most connectionist learning procedures, the weight matrices are initialized to random values; the outcome of learning will be a set of matrices that represent the update graph connectivity. \n\nIf the network is to behave as an update graph, the one-input-per-action constraint must be satisfied. In terms of the connectivity matrices, this means that each row of each weight matrix should have connection strengths of zero except for one value which is 1. To achieve this property, additional constraints are placed on the weights. We have explored a combination of three constraints: \n\n(1) \u03a3_j w_aij^2 = 1,   (2) \u03a3_j w_aij = 1,   and (3) w_aij \u2265 0, \n\nwhere w_aij is the connection strength to i from j for action a. Constraint 1 is satisfied by introducing an additional cost term to the error function. Constraints 2 and 3 are rigidly enforced by renormalizing the weight vectors w_ai following each weight update. The normalization procedure finds the shortest distance projection from the updated weight vector to the hyperplane specified by constraint 2 that also satisfies constraint 3. \n\nAt each time step t, the training procedure consists of the following sequence of events: \n\n1. An action, a(t), is selected at random. \n\n2. The weight matrix for that action, W_a(t), 
is used to compute the activities at t, X(t), from the previous activities X(t-1). \n\n3. The selected action is performed on the environment and the resulting sensations are observed. \n\n4. The observed sensations are compared with the sensations predicted by the network (i.e., the activities of units chosen to represent the sensations) to compute a measure of error. To this error is added the contribution of constraint 1. \n\n5. The back propagation \"unfolding-in-time\" procedure (Rumelhart, Hinton, & Williams, 1986) is used to compute the derivative of the error with respect to weights at the current and earlier time steps, W_a(t-i), for i = 0 ... \u03c4-1. \n\n6. The weight matrices for each action are updated using the overall error gradient and then are renormalized to enforce constraints 2 and 3. \n\n7. The temporal record of unit activities, X(t-i) for i = 0 ... \u03c4, which is maintained to permit back propagation in time, is updated to reflect the new weights. (See further explanation below.) \n\n8. The activities of the output units at time t, which represent the predicted sensations, are replaced by the actual observed sensations. \n\nSteps 5-7 require further elaboration. The error measured at time t may be due to incorrect propagation of activities from time t-1, which would call for modification of the weight matrix W_a(t). But the error may also be attributed to incorrect propagation of activities at earlier times. Thus, back propagation is used to assign blame to the weights at earlier times. One critical parameter of training is the amount of temporal history, \u03c4, to consider. We have found that, for a particular problem, error propagation beyond a certain critical number of steps does not improve learning performance, although any fewer does indeed harm performance. 
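Returning to step 6: the renormalization is a Euclidean projection of each weight row onto the set {w : \u03a3_j w_j = 1, w_j \u2265 0}. The text above leaves the routine unspecified beyond this description, so the following is one standard algorithm for projecting onto the probability simplex, not necessarily the routine we used:

```python
def project_to_simplex(row):
    """Shortest-distance projection of a weight row onto the hyperplane
    sum(w) = 1 (constraint 2) intersected with w >= 0 (constraint 3).
    Standard sort-and-threshold simplex projection; one algorithmic
    choice among several, offered here as an illustration."""
    u = sorted(row, reverse=True)
    cumsum, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumsum += ui
        t = (cumsum - 1.0) / i
        if ui - t > 0:         # u_i still above the running threshold
            theta = t
    return [max(w - theta, 0.0) for w in row]
```

Applying this after every gradient step keeps each row a probability vector, so that, together with the constraint-1 cost term, learning can converge toward the 0/1 rows demanded by the one-input-per-action constraint.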
In the results described below, we set \u03c4 for a particular problem to what appeared to be a safe limit: one less than the number of nodes in the update graph solution of the problem. \n\nTo back propagate error in time, we maintain a temporal record of unit activities. However, a problem arises with these activities following a weight update: the activities are no longer consistent with the weights, i.e., Equation 1 is violated. Because the error derivatives computed by back propagation are exact only when Equation 1 is satisfied, future weight updates based on the inconsistent activities are not assured of being correct. Empirically, we have found the algorithm extremely unstable if we do not address this problem. \n\nIn most situations where back propagation is applied to temporally extended sequences, the sequences are of finite length. Consequently, it is possible to wait until the end of the sequence to update the weights, at which point consistency between activities and weights no longer matters because the system starts fresh at the beginning of the next sequence. In the present situation, however, the sequence of actions does not terminate. We thus were forced to consider alternative means of ensuring consistency. The most successful approach involved updating the activities after each weight change to force consistency (step 7 of the list above). To do this, we propagated the earliest activities in the temporal record, X(t-\u03c4), forward again to time t, using the updated weight matrices. \n\n6 RESULTS \n\nFigure 2 shows the weights in the update graph network for the three-room world after the robot has taken 6,000 steps. The figure depicts a connectivity pattern identical to that of the update graph of Figure 1e. To explain the correspondence, think of the diagram as being in the shape of a person who has a head, left and right arms, left and right legs, and a heart. 
For the action U, the head (the output unit) receives input from the left leg, the left leg from the heart, and the heart from the head, thereby forming a three-unit loop. The other three units (the left arm, right arm, and right leg) form a similar loop. For the action D, the same two loops are present but in the reverse direction. These two loops also appear in Figure 1e. For the action T, the left and right arms, heart, and left leg each keep their current value, while the head and the right leg exchange values. This corresponds to the exchange of values between the CUR and CUR' nodes of Figure 1e. \n\nFigure 2: Weights learned after 6,000 exploratory steps in the three-room world. Each large diagram represents the weights corresponding to one of the three actions. Each small diagram contained within a large diagram represents the connection strengths feeding into a particular unit for a particular action. There are six units, hence six small diagrams. The output unit, which indicates the state of the light in the current room, is the protruding \"head\" of the large diagram. A white square in a particular position of a small diagram represents the strength of connection from the unit in the homologous position in the large diagram to the unit represented by the small diagram. The area of the square is proportional to the connection strength. \n\nIn addition to learning the update graph connectivity, the network has simultaneously learned the correct activity values associated with each node for the current state of the environment. Armed with this information, the network can predict the outcome of any sequence of actions. Indeed, the prediction error drops to zero, causing learning to cease and the network to become completely stable. 
\nNow for the bad news: the network does not converge for every set of random initial weights, and when it does, it requires on the order of 6,000 steps. However, when the weight constraints are removed, the network converges without fail and in about 300 steps. In Mozer and Bachrach (1989), we consider why the weight constraints are harmful and suggest several remedies. Without weight constraints, the resulting weight matrix, which contains a collection of positive and negative weights of varying magnitudes, is not readily interpreted. In the case of the n-room world, one reason why the final weights are difficult to interpret is that the net has discovered a solution that does not satisfy the RS update graph formalism; it has discovered the notion of complementation links of the sort shown in Figure 1d. With the use of complementation links, only three units are required, not six. Consequently, the three unnecessary units are either cut out of the solution or encode information redundantly. \n\nTable 1 compares the performance of the RS algorithm against that of the connectionist network without weight constraints for several environments. Performance is measured in terms of the median number of actions the robot must take before it is able to predict the outcome of subsequent actions. (Further details of the experiments can be found in Mozer and Bachrach, 1989.) In simple environments, the connectionist update graph can outperform the RS algorithm. This result is quite surprising when considering that the action sequence used to train the network is generated at random, in contrast to the RS algorithm, which involves a strategy for exploring the environment. We conjecture that the network does as well as it does because it considers and updates many hypotheses in parallel at each time step. In complex environments, however, the network does poorly. 
\nBy \"complex\", we mean that the number of nodes in the update graph is quite large and \nthe number of distinguishing environmental sensations is relatively small. For example, \nthe network failed to learn a 32-room world, whereas the RS algorithm succeeded. An \nintelligent exploration strategy seems necessary in this case: random actions will take \ntoo long to search the state space. This is one direction our future work will take. \nBeyond the potential speedups offered by connectionist learning algorithms, the connec(cid:173)\ntionist approach has other benefits. \n\nTable 1: Nwnber of Steps Required to Learn Update Graph \n\nEnvironment \n\nLittle Prince Wodd \nCar Radio World \nFour-Room World \n32-Room World \n\nRS \n\nConnectionist \nAlgorithm Update Graph \n\n200 \n27,695 \n1,388 \n52,436 \n\n91 \n8,167 \n1,308 \nfails \n\n\f446 \n\nMozer and Bachrach \n\n\u2022 Perfonnance of the network appears insensitive to prior knowledge of the number of \nnodes in the update graph being learned. In contrast, the RS algorithm requires an \nupper bound on the update graph complexity, and performance degrades significantly if \nthe upper bound isn't tight. \n\n\u2022 The network is able to accommodate \"noisy\" environments, also in contrast to the RS \n\nalgorithm. \n\n\u2022 Owing learning, the network continually makes predictions about what sensations will \nresult from a particular action, and these predictions improve with experience. The RS \nalgorithm cannot make predictions until learning is complete; it could perhaps be \nmodified to do so, but there would be an associated cost. \n\n\u2022 Treating the update graph as matrices of connection strengths has suggested generali(cid:173)\nzations of the update graph formalism that don't arise from a more traditional analysis. \nFirst, there is the fairly direct extension of allowing complementation links. Second, \nbecause the connectionist network is a linear system. 
any rank-preserving linear transform of the weight matrices will produce an equivalent system, but one that does not have the local connectivity of the update graph (see Mozer & Bachrach, 1989). The linearity of the network also allows us to use tools of linear algebra to analyze the resulting connectivity matrices. \n\nThese benefits indicate that the connectionist approach to the environment-modeling problem is worthy of further study. We do not wish to claim that the connectionist approach supersedes the impressive work of Rivest and Schapire. However, it offers complementary strengths and alternative conceptualizations of the learning problem. \n\nAcknowledgements \n\nOur thanks to Rob Schapire, Paul Smolensky, and Rich Sutton for helpful discussions. This work was supported by a grant from the James S. McDonnell Foundation to Michael Mozer, grant 87-2-36 from the Sloan Foundation to Geoffrey Hinton, and grant AFOSR-87-0030 from the Air Force Office of Scientific Research, Bolling AFB, to Andrew Barto. \n\nReferences \n\nMozer, M. C., & Bachrach, J. (1989). Discovering the structure of a reactive environment by exploration (Technical Report CU-CS-451-89). Boulder, CO: University of Colorado, Department of Computer Science. \n\nRivest, R. L., & Schapire, R. E. (1987a). Diversity-based inference of finite automata. In Proceedings of the Twenty-Eighth Annual Symposium on Foundations of Computer Science (pp. 78-87). \n\nRivest, R. L., & Schapire, R. E. (1987b). A new approach to unsupervised learning in deterministic environments. In P. Langley (Ed.), Proceedings of the Fourth International Workshop on Machine Learning (pp. 364-375). \n\nRumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume I: Foundations (pp. 
318-362). Cambridge, MA: MIT Press/Bradford Books. \n\nSchapire, R. E. (1988). Diversity-based inference of finite automata. Unpublished master's thesis, Massachusetts Institute of Technology, Cambridge, MA.", "award": [], "sourceid": 292, "authors": [{"given_name": "Michael", "family_name": "Mozer", "institution": null}, {"given_name": "Jonathan", "family_name": "Bachrach", "institution": null}]}