{"title": "Viewing Classifier Systems as Model Free Learning in POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 989, "page_last": 995, "abstract": null, "full_text": "Viewing Classifier Systems \n\nas Model Free Learning in POMDPs \n\nAkira Hayashi and Nobuo Suematsu \n\nFaculty of Information Sciences \n\nHiroshima City University \n\n3-4-1 Ozuka-higashi, Asaminami-ku, Hiroshima, 731-3194 Japan \n\n{ akira,suematsu }@im.hiroshima-cu.ac.jp \n\nAbstract \n\nClassifier systems are now viewed disappointing because of their prob(cid:173)\nlems such as the rule strength vs rule set performance problem and the \ncredit assignment problem. In order to solve the problems, we have de(cid:173)\nveloped a hybrid classifier system: GLS (Generalization Learning Sys(cid:173)\ntem). In designing GLS, we view CSs as model free learning in POMDPs \nand take a hybrid approach to finding the best generalization, given the \ntotal number of rules. GLS uses the policy improvement procedure by \nJaakkola et al. for an locally optimal stochastic policy when a set of \nrule conditions is given. GLS uses GA to search for the best set of rule \nconditions. \n\n1 \n\nINTRODUCTION \n\nClassifier systems (CSs) (Holland 1986) have been among the most used in reinforcement \nlearning. Some of the advantages of CSs are (1) they have a built-in feature (the use of \ndon't care symbols \"#\") for input generalization, and (2) the complexity of pOlicies can \nbe controlled by restricting the number of rules. In spite of these attractive features, CSs \nare now viewed somewhat disappointing because of their problems (Wilson and Goldberg \n1989; Westerdale 1997). Among them are the rule strength vs rule set performance prob(cid:173)\nlem, the definition of the rule strength parameter, and the credit assignment (BBA vs PSP) \nproblem. \n\nIn order to solve the problems, we have developed a hybrid classifier system: GLS (Gener(cid:173)\nalization Learning System). 
GLS is based on the recent progress of RL research in partially observable Markov decision processes (POMDPs). In POMDPs, the environments are really Markovian, but the agent cannot identify the state from the current observation. This may be due to noisy sensing or to perceptual aliasing. Perceptual aliasing occurs when the sensor returns the same observation in multiple states. Note that even for a completely observable MDP, the use of don't care symbols for input generalization will make the process behave as if it were partially observable. \n\nIn designing GLS, we view CSs as RL in POMDPs and take a hybrid approach to finding the best generalization, given the total number of rules. GLS uses the policy improvement procedure in Jaakkola et al. (1994) for a locally optimal stochastic policy when a set of rule conditions is given. GLS uses a GA to search for the best set of rule conditions. \n\nThe paper is organized as follows. Since the CS problems are easier to understand from the GLS perspective, we first introduce Jaakkola et al. (1994), then propose GLS, and then discuss the CS problems. \n\n2 LEARNING IN POMDPS \n\nJaakkola et al. (1994) consider POMDPs with perceptual aliasing and memoryless stochastic policies. Following the authors, let us call the observations messages. A policy is then a mapping from messages to probability distributions (PDs) over the actions. Given a policy π, the value of a state s, V^π(s), is defined for POMDPs just as for MDPs. Then, the value of a message m under policy π, V^π(m), can be defined as follows: \n\nV^π(m) = Σ_{s∈S} P^π(s|m) V^π(s)   (1) \n\nwhere P^π(s|m) is the probability that the state is s when the message is m under the policy π. \n\nThen, the following holds. 
\n\nV^π(s) = lim_{N→∞} Σ_{t=1}^{N} E{R(s_t, a_t) − R̄ | s_1 = s}   (2) \n\nV^π(m) = E{V^π(s) | s → m}   (3) \n\nwhere s_t and a_t refer to the state and the action taken at the t-th step respectively, R(s_t, a_t) is the immediate reward at the t-th step, and R̄ is the (unknown) gain (i.e. the average reward per step). s → m refers to all the instances where m is observed in s, and E{· | s → m} is a Monte-Carlo expectation. \n\nIn order to compute E{V^π(s) | s → m}, Jaakkola et al. showed a Monte-Carlo procedure: \n\nV_t(m) = (1/k) { R_{t_1} + γ_{1,1} R_{t_1+1} + γ_{1,2} R_{t_1+2} + … + γ_{1,t−t_1} R_t \n+ R_{t_2} + γ_{2,1} R_{t_2+1} + γ_{2,2} R_{t_2+2} + … + γ_{2,t−t_2} R_t \n+ … \n+ R_{t_k} + γ_{k,1} R_{t_k+1} + … + γ_{k,t−t_k} R_t }   (4) \n\nwhere t_k denotes the time step corresponding to the k-th occurrence of the message m, R_t = R(s_t, a_t) − R̄ for every t, and γ_{k,τ} indicates the discounting at the τ-th step in the k-th sequence. By estimating R̄ and by suitably setting γ_{k,τ}, V_t(m) converges to V^π(m). Q^π(m, a), the Q-value of the message m for the action a under the policy π, is defined and computed in the same way. \n\nJaakkola et al. have developed a policy improvement method: \n\nStep 1 Evaluate the current policy π by computing V^π(m) and Q^π(m, a) for each m and a. \n\nStep 2 Test for any m whether max_a Q^π(m, a) > V^π(m) holds. If not, then return π. \n\nStep 3 For each m and a, define π_1(a|m) as follows: π_1(a|m) = 1.0 when a = argmax_a Q^π(m, a), π_1(a|m) = 0.0 otherwise. Then, define π' as π'(a|m) = (1 − ε) π(a|m) + ε π_1(a|m). \n\nStep 4 Set the new policy as π = π', and go to Step 1. \n\n3 GLS \n\nEach rule in GLS consists of a condition part, an action part, and an evaluation part: Rule = (Condition, Action, Evaluation). The condition part is a string c over the alphabet {0, 1, #}, and is compared with a binary sensor message. # is a don't care symbol, and matches 0 and 1. 
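As a concrete illustration (our own minimal sketch in Python, not code from the paper), matching a rule condition with don't care symbols against a binary sensor message can be written as:

```python
def matches(condition: str, message: str) -> bool:
    """Check whether a rule condition matches a binary sensor message.

    The condition is a string over {'0', '1', '#'}; '#' is a don't care
    symbol and matches both '0' and '1'.
    """
    if len(condition) != len(message):
        return False
    return all(c == '#' or c == m for c, m in zip(condition, message))

# e.g. "1#0" matches "100" and "110", but not "101"
```
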
When the condition c matches the message, the action is randomly selected using the PD in the action part: Action = (p(a_1|c), p(a_2|c), …, p(a_|A||c)), Σ_{j=1}^{|A|} p(a_j|c) = 1.0, where |A| is the total number of actions. The evaluation part records the value of the condition V(c) and the Q-values of the condition action pairs Q(c, a): Evaluation = (V(c), Q(c, a_1), Q(c, a_2), …, Q(c, a_|A|)). \n\nEach rule set consists of N rules, {Rule_1, Rule_2, …, Rule_N}. N, the total number of rules in a rule set, is a design parameter to control the complexity of policies. All the rules except the last one are called standard rules. The last rule, Rule_N, is a special rule which is called the default rule. The condition part of the default rule is a string of #'s and matches any message. \n\nLearning in GLS proceeds as follows: (1) Initialization: randomly generate an initial population of M rule sets. (2) Policy Evaluation and Improvement: for each rule set, repeat a policy evaluation and improvement cycle for a suboptimal policy, then record the gain of the policy for each rule set. (3) Genetic Algorithm: use the gain of each rule set as its fitness measure and produce a new generation of rule sets. (4) Repeat: repeat from the policy evaluation and improvement step with the new generation of rule sets. \n\nIn (2) Policy Evaluation and Improvement, GLS repeats the following cycle for each rule set. \n\nStep 1 Set ε sufficiently small. Set t_max sufficiently large. \n\nStep 2 Repeat for 1 ≤ t ≤ t_max. \n\n1. Make an observation of the environment and receive a message m_t from the sensor. \n\n2. From all the rules whose condition matches the message m_t, find the rule whose condition is the most specific¹. Let us call this rule the active rule. \n\n3. 
Select the next action a_t randomly according to the PD in the action part of the active rule, execute the action, and receive the reward R(s_t, a_t) from the environment. (The state s_t is not observable.) \n\n4. Update the current estimate of the gain R̄ from its previous estimate and R(s_t, a_t). Let R_t = R(s_t, a_t) − R̄. For each rule, consider its condition c_i as (a generalization of) a message, and update its evaluation part V(c_i) and Q(c_i, a) (a ∈ A) using Eq.(4). \n\nStep 3 Check whether the following holds. If not, exit. \n\n∃i (1 ≤ i ≤ N), max_a Q(c_i, a) > V(c_i) \n\nStep 4 Improve the current policy according to the method in the previous section, update the action part of the corresponding rules, and go to Step 2. \n\n¹The most specific rule has the least number of #'s. This is intended only to save on the number of rules. \n\nGLS extracts the condition parts of all the rules in a rule set and concatenates them to form a string. The string will be an individual to be manipulated by the genetic algorithm (GA). The genetic algorithm used in GLS is a fairly standard one. GLS combines the SGA (the simple genetic algorithm) (Goldberg 1989) with the elitist keeping strategy. The SGA is composed of three genetic operators: selection, crossover, and mutation. Fitness proportional selection and single-point crossover are used. The three operators are applied to an entire population at each generation. Since the original SGA does not consider #'s in the rule conditions, we modified the SGA as follows. When GLS randomly generates an initial population of rule sets, it generates # at each allele position in the rule conditions according to the probability P#. \n\n4 CS PROBLEMS AND GLS \n\nIn the history of classifier systems, there were two quite different approaches: the Michigan approach (Holland and Reitman 1978), and the Pittsburgh (Pitt) approach (Dejong 1988). 
\nIn the Michigan approach, each rule is considered as an individual and the rule set as the population in the GA. Each rule has its strength parameter, which is based on its future payoff and is used as the fitness measure in the GA. These aspects of the approach cause many problems. One is the rule strength vs rule set performance problem. Can we collect only strong rules and get the best rule set performance? Not necessarily. A strong rule may cooperate with weak rules to increase its payoff. Then, how can we define and compute the strength parameter for the best rule set performance? In spite of its problems, this approach is now so much more popular than the other that when people simply say classifier systems, they refer to Michigan type classifier systems. In the Pitt approach, the problems of the Michigan approach are avoided by requiring the GA to evaluate a whole rule set. In this approach, a rule set is considered as an individual and multiple rule sets are kept as the population. The problem of the Pitt approach is its computational difficulty. \n\nGLS can be considered as a combination of the Michigan and Pitt approaches. The GA in GLS works as that in the Pitt approach does. It evaluates a whole rule set, and completely avoids the rule strength vs rule set performance problem of the Michigan approach. Like Michigan type CSs, GLS evaluates each rule to improve the policy. This alleviates the computational burden of the Pitt approach. Moreover, GLS evaluates each rule in a more formal and sound way than the Michigan approach. The values V(c) and Q(c, a) are defined on the basis of POMDPs, and the policy improvement procedure using the values is guaranteed to find a local maximum. \n\nWesterdale (1997) has recently made an excellent analysis of the problematic behaviors of Michigan type CSs. 
Two popular methods for credit assignment in CSs are the bucket brigade algorithm (BBA) (Holland 1986) and the profit sharing plan (PSP) (Grefenstette 1988). Westerdale shows that BBA does not work in POMDPs. He insists that PSP with an infinite time span is necessary for the right credit assignment, although he does not show how to carry out the computation. GLS does not use BBA or PSP. GLS uses the Monte Carlo procedure, Eq.(4), to compute the value of each condition action pair. The series in Eq.(4) is slow to converge, but this is the cost we have to pay for the right credit assignment in POMDPs. Westerdale points out another CS problem. He claims that a distinction must be made between the availability and the payoff of rules. We agree with him. As he says, if the expected payoff of Rule 1 is twice as much as that of Rule 2, then we want to always choose Rule 1. GLS makes the distinction. The probability of a stochastic policy π(a|c) in GLS corresponds to the availability, and the value of a condition action pair Q(c, a) corresponds to the payoff. \n\nThe Samuel system (Grefenstette et al. 1990) can also be considered as a combination of the Michigan and Pitt approaches. Samuel is a highly sophisticated system which has lots of features. We conjecture, however, that Samuel is not free from the CS problems which Westerdale has analyzed. This is because Samuel uses PSP for credit assignment, uses the payoff of each rule for action selection, and does not make a distinction between the availability and the payoff of rules. \n\nXCS (Wilson 1995) seems to be an exceptionally reliable Michigan type CS. In XCS, each rule's fitness is based not on its future payoff but on the prediction accuracy of its future payoff (XCS uses BBA for credit assignment). 
Wilson reports that XCS's population tends to form a complete and accurate mapping from sensor messages and actions to payoff predictions. We conjecture that XCS tries to build the most general Markovian model of the environment. Therefore, it will be difficult to apply XCS when the environment is not Markovian, or when we cannot afford enough rules to build a Markovian model of the environment, even if the environment itself is Markovian. As we will see in the next section, GLS is intended exactly for these situations. \n\nKaelbling et al. (1996) survey methods for input generalization when reward is delayed. The methods use a function approximator to represent the value function by mapping a state description to a value. Since they use value iteration or Q-learning anyway, it is difficult to apply these methods when the generalization violates the Markov assumption and induces a POMDP. \n\n5 EXPERIMENTS \n\nWe have tested GLS with some of the representative problems in the CS literature. Fig. 1 shows the Gref1 world (Grefenstette 1987). In the Gref1 world, we used GLS to find the smallest rule set which is necessary for the optimal performance. Since this is not a POMDP but an MDP, the optimal policy can easily be learned when we have a corresponding rule for each of the 16 states. However, when the total number of rules is less than the number of states, the environment looks like a POMDP to the learning agent, even if the environment itself is an MDP. The graph shows how the gain of the best rule set in the population changes with the generation. We can see from the figure that four rules are enough for the optimal performance. Also note that the saving of rules is achieved by selecting the most specific matching rule as the active rule. A rule set with this rule selection is called a default hierarchy in the CS literature. 
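For concreteness, the most-specific-rule selection behind the default hierarchy can be sketched as follows. This is our own illustrative sketch, not code from the paper: rules are simplified to (condition, action) pairs with deterministic actions, and the example rule set is the N = 4 default hierarchy for GREF1 reported in Figure 1.

```python
def active_rule(rules, message):
    """Select the active rule for a message: among all rules whose
    condition matches, pick the most specific one (fewest '#' symbols).
    The default rule (all '#') matches any message, so a match always exists.
    Conditions are strings over {'0', '1', '#'}; '#' matches either bit."""
    def matches(cond):
        return all(c == '#' or c == m for c, m in zip(cond, message))
    matching = [r for r in rules if matches(r[0])]  # r = (condition, action)
    return min(matching, key=lambda r: r[0].count('#'))

# The N = 4 default hierarchy for GREF1 from Figure 1:
rules = [("0101", "a"), ("1010", "c"), ("##11", "d"), ("####", "b")]
```

For message 0011 (State 3), both ##11 and the default rule #### match, and the more specific ##11 wins, so action d is chosen; for a message matched by no standard rule, the default rule fires.
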
\n\nFigure 1: LEFT: GREF1 World. States {0, 1, 2, 3} are the start states and states {12, 13, 14, 15} are the end states. In each state, the agent gets the state number (4 bits) as a message, and chooses an action a, b, c, or d. When the agent reaches the end states, he receives reward 1000 in state 13, but reward 0 in the other states. Then the agent is put in one of the start states with equal probability. We added 10% action errors to make the process ergodic. When an action error occurs, the agent moves to one of the 16 states with equal probability. \nRIGHT: Gain of the best rule set. Parameters: t_max = 10000, ε = 0.10, M = 10, N = 2, 3, 4, P# = 0.33. For N = 4, the best rule set at the 40th generation was { if 0101 (State 5) then a 1.0, if 1010 (State 10) then c 1.0, if ##11 (States 3, 7, 11, 15) then d 1.0, if #### (Default Rule) then b 1.0 }. \n\nFigure 2: LEFT: McCallum's Maze. We show the state numbers on the left, and the messages on the right. States 8 and 9 are the start states, and state G is the goal state. In each state, the agent receives a sensor message which is 4 bits long. Each bit in the message tells whether a wall exists in each of the four directions. From each state, the agent moves to one of the adjacent states. When the agent reaches the goal state, he receives reward 1000. 
The agent is then put in one of the start states with equal probability. \nRIGHT: Gain of the best rule set. Parameters: t_max = 50000, ε = 0.10, M = 10, N = 5, 6, P# = 0.33. \n\nFig. 2 is a POMDP known as McCallum's Maze (McCallum 1993). Thanks to the use of stochastic policies, GLS achieves near optimal gain for memoryless policies. Note that no memoryless deterministic policy can take the agent to the goal in this problem. \n\nWe have seen GLS's generalization capability for an MDP in the Gref1 world, and the advantage of stochastic policies for a POMDP in McCallum's Maze. In Woods7 (Wilson 1994), we attempt to test GLS's generalization capability for a POMDP. See Fig. 3. Since each sensor message is 16 bits long, and the conditions of GLS rules can have either 0, 1, or # for each of the 16 bits, there are 3^16 possible conditions in total. When we notice that there are only 92 different actual sensor messages in the environment, it seems quite difficult to discover them only by using the GA. In fact, when we ran GLS for the first time, the standard rules very rarely matched the messages and the default rule took over most of the time. In order to avoid this no matching rule problem, we made the number of rules in a rule set large (N = 100) and increased P# from 0.33 in the previous problems to 0.70. \n\nThe problem was independently attacked by other methods. Wilson applied his ZCS, a zeroth level classifier system, to Woods7 (Wilson 1994). The gain was 0.20. ZCS has a special covering procedure to work around the no matching rule problem. The covering procedure generates a rule which matches a message when none of the current rules matches the message. We expect further improvement in the gain if we equip GLS with some covering procedure. \n\n6 SUMMARY \n\nIn order to solve the CS problems such as the rule strength vs rule set performance problem and the credit assignment problem, we have developed a hybrid classifier system: GLS. 
\nWe notice that generalization often leads to state aliasing. Therefore, in designing GLS, we view CSs as model free learning in POMDPs and take a hybrid approach to finding the best generalization, given the total number of rules. GLS uses the policy improvement procedure by Jaakkola et al. for a locally optimal stochastic policy when a set of rule conditions is given. GLS uses a GA to search for the best set of rule conditions. 
\n\nFigure 3: LEFT: Woods7. Each cell is either empty \".\", contains a stone \"O\", or contains food \"F\". The cells which contain a stone are not passable, and the cells which contain food are goals. 
In each cell, the agent receives a 2 × 8 = 16 bit long sensor message, which tells the contents of the eight adjacent cells. From each cell, the agent can move to one of the eight adjacent cells. When the agent reaches a cell which contains food, he receives reward 1. The agent is then put in one of the empty cells with equal probability. \nRIGHT: Gain of the best rule set. Parameters: t_max = 10000, ε = 0.10, M = 10, N = 100, P# = 0.70. \n\nReferences \n\nDejong, K. A. (1988). Learning with genetic algorithms: An overview. Machine Learning, 3:121-138. \n\nGoldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. \n\nGrefenstette, J. J. (1987). Multilevel credit assignment in a genetic learning system. In Proc. Second Int. Conf. on Genetic Algorithms, pp. 202-209. \n\nGrefenstette, J. J. (1988). Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, 3:225-245. \n\nGrefenstette, J. J., C. L. Ramsey, and A. C. Schultz (1990). Learning sequential decision rules using simulation and competition. Machine Learning, 5:355-381. \n\nHolland, J. H. (1986). Escaping brittleness: the possibilities of general purpose learning algorithms applied to parallel rule-based systems. In Machine Learning II, pp. 593-623. Morgan Kaufmann. \n\nHolland, J. H. and J. S. Reitman (1978). Cognitive systems based on adaptive algorithms. In D. A. Waterman and F. Hayes-Roth (Eds.), Pattern-directed inference systems. Academic Press. \n\nJaakkola, T., S. P. Singh, and M. I. Jordan (1994). Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 7, pp. 345-352. \n\nKaelbling, L. P., M. L. Littman, and A. W. Moore (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285. \n\nMcCallum, R. A. (1993). 
Overcoming incomplete perception with utile distinction memory. In Proc. the Tenth Int. Conf. on Machine Learning, pp. 190-196. \n\nWesterdale, T. H. (1997). Classifier systems - no wonder they don't work. In Proc. Second Annual Genetic Programming Conference, pp. 529-537. \n\nWilson, S. W. (1994). ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1):1-18. \n\nWilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149-175. \n\nWilson, S. W. and D. E. Goldberg (1989). A critical review of classifier systems. In Proc. Third Int. Conf. on Genetic Algorithms, pp. 244-255. \n", "award": [], "sourceid": 1492, "authors": [{"given_name": "Akira", "family_name": "Hayashi", "institution": null}, {"given_name": "Nobuo", "family_name": "Suematsu", "institution": null}]}