{"title": "Viewing Classifier Systems as Model Free Learning in POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 989, "page_last": 995, "abstract": null, "full_text": "Viewing Classifier Systems as Model Free Learning in POMDPs \n\nAkira Hayashi and Nobuo Suematsu \n\nFaculty of Information Sciences \n\nHiroshima City University \n\n3-4-1 Ozuka-higashi, Asaminami-ku, Hiroshima, 731-3194 Japan \n\n{akira,suematsu}@im.hiroshima-cu.ac.jp \n\nAbstract \n\nClassifier systems are now viewed as disappointing because of problems such as the rule strength vs rule set performance problem and the credit assignment problem. In order to solve these problems, we have developed a hybrid classifier system: GLS (Generalization Learning System). In designing GLS, we view CSs as model free learning in POMDPs and take a hybrid approach to finding the best generalization, given the total number of rules. GLS uses the policy improvement procedure by Jaakkola et al. to find a locally optimal stochastic policy when a set of rule conditions is given. GLS uses GA to search for the best set of rule conditions. \n\n1 INTRODUCTION \n\nClassifier systems (CSs) (Holland 1986) have been among the most used methods in reinforcement learning. Some of the advantages of CSs are (1) they have a built-in feature (the use of don't care symbols \"#\") for input generalization, and (2) the complexity of policies can be controlled by restricting the number of rules. In spite of these attractive features, CSs are now viewed as somewhat disappointing because of their problems (Wilson and Goldberg 1989; Westerdale 1997). Among them are the rule strength vs rule set performance problem, the definition of the rule strength parameter, and the credit assignment (BBA vs PSP) problem. 
\n\nIn order to solve these problems, we have developed a hybrid classifier system: GLS (Generalization Learning System). GLS is based on recent progress in RL research on partially observable Markov decision processes (POMDPs). In POMDPs, the environments are really Markovian, but the agent cannot identify the state from the current observation. This may be due to noisy sensing or perceptual aliasing. Perceptual aliasing occurs when the sensor returns the same observation in multiple states. Note that even for a completely observable MDP, the use of don't care symbols for input generalization makes the process appear partially observable to the agent. \n\nIn designing GLS, we view CSs as RL in POMDPs and take a hybrid approach to finding the best generalization, given the total number of rules. GLS uses the policy improvement procedure in Jaakkola et al. (1994) to find a locally optimal stochastic policy when a set of rule conditions is given. GLS uses GA to search for the best set of rule conditions. \n\nThe paper is organized as follows. Since CS problems are easier to understand from the GLS perspective, we introduce Jaakkola et al. (1994), propose GLS, and then discuss CS problems. \n\n2 LEARNING IN POMDPS \n\nJaakkola et al. (1994) consider POMDPs with perceptual aliasing and memoryless stochastic policies. Following the authors, let us call the observations messages. A policy is therefore a mapping from messages to probability distributions (PDs) over the actions. \n\nGiven a policy $\\pi$, the value of a state $s$, $V^{\\pi}(s)$, is defined for POMDPs just as for MDPs. Then, the value of a message $m$ under policy $\\pi$, $V^{\\pi}(m)$, can be defined as follows: \n\n$$V^{\\pi}(m) = \\sum_{s \\in S} P^{\\pi}(s|m) V^{\\pi}(s) \\qquad (1)$$ \n\nwhere $P^{\\pi}(s|m)$ is the probability that the state is $s$ when the message is $m$ under the policy $\\pi$. \n\nThen, the following holds. 
\n\n$$V^{\\pi}(s) = \\lim_{N \\to \\infty} \\sum_{t=1}^{N} E\\{R(s_t, a_t) - \\bar{R} \\mid s_1 = s\\} \\qquad (2)$$ \n\n$$V^{\\pi}(m) = E\\{V(s) \\mid s \\to m\\} \\qquad (3)$$ \n\nwhere $s_t$ and $a_t$ refer to the state and the action taken at the $t$th step respectively, $R(s_t, a_t)$ is the immediate reward at the $t$th step, and $\\bar{R}$ is the (unknown) gain (i.e. the average reward per step). $s \\to m$ refers to all the instances where $m$ is observed in $s$, and $E\\{\\cdot \\mid s \\to m\\}$ is a Monte-Carlo expectation. \n\nIn order to compute $E\\{V(s) \\mid s \\to m\\}$, Jaakkola et al. showed a Monte-Carlo procedure: \n\n$$\\begin{aligned} V_t(m) = \\frac{1}{k} \\{ & R_{t_1} + \\Gamma_{1,1} R_{t_1+1} + \\Gamma_{1,2} R_{t_1+2} + \\cdots + \\Gamma_{1,t-t_1} R_t \\\\ + & R_{t_2} + \\Gamma_{2,1} R_{t_2+1} + \\Gamma_{2,2} R_{t_2+2} + \\cdots + \\Gamma_{2,t-t_2} R_t \\\\ & \\vdots \\\\ + & R_{t_k} + \\Gamma_{k,1} R_{t_k+1} + \\cdots + \\Gamma_{k,t-t_k} R_t \\} \\end{aligned} \\qquad (4)$$ \n\nwhere $t_k$ denotes the time step corresponding to the $k$th occurrence of the message $m$, $R_t = R(s_t, a_t) - \\bar{R}$ for every $t$, and $\\Gamma_{k,\\tau}$ indicates the discounting at the $\\tau$th step in the $k$th sequence. By estimating $\\bar{R}$ and by suitably setting $\\Gamma_{k,\\tau}$, $V_t(m)$ converges to $V^{\\pi}(m)$. $Q^{\\pi}(m, a)$, the Q-value of the message $m$ for the action $a$ under the policy $\\pi$, is also defined and computed in the same way. \n\nJaakkola et al. have developed a policy improvement method: \n\nStep 1 Evaluate the current policy $\\pi$ by computing $V^{\\pi}(m)$ and $Q^{\\pi}(m, a)$ for each $m$ and $a$. \n\nStep 2 Test for any $m$ whether $\\max_a Q^{\\pi}(m, a) > V^{\\pi}(m)$ holds. If not, then return $\\pi$. \n\nStep 3 For each $m$ and $a$, define $\\pi^1(a|m)$ as follows: $\\pi^1(a|m) = 1.0$ when $a = \\arg\\max_a Q^{\\pi}(m, a)$, $\\pi^1(a|m) = 0.0$ otherwise. Then, define $\\pi^{\\epsilon}$ as $\\pi^{\\epsilon}(a|m) = (1 - \\epsilon)\\pi(a|m) + \\epsilon\\pi^1(a|m)$. \n\nStep 4 Set the new policy as $\\pi = \\pi^{\\epsilon}$, and goto Step 1. \n\n3 GLS \n\nEach rule in GLS consists of a condition part, an action part, and an evaluation part: Rule = (Condition, Action, Evaluation). 
The condition part is a string $c$ over the alphabet {0, 1, #}, and is compared with a binary sensor message. # is a don't care symbol, and matches 0 and 1. When the condition $c$ matches the message, the action is randomly selected using the PD in the action part: Action = $(p(a_1|c), p(a_2|c), \\ldots, p(a_{|A|}|c))$, $\\sum_{j=1}^{|A|} p(a_j|c) = 1.0$, where $|A|$ is the total number of actions. The evaluation part records the value of the condition $V(c)$ and the Q-values of the condition action pairs $Q(c, a)$: Evaluation = $(V(c), Q(c, a_1), Q(c, a_2), \\ldots, Q(c, a_{|A|}))$. \n\nEach rule set consists of $N$ rules, $\\{Rule_1, Rule_2, \\ldots, Rule_N\\}$. $N$, the total number of rules in a rule set, is a design parameter to control the complexity of policies. All the rules except the last one are called standard rules. The last rule $Rule_N$ is a special rule which is called the default rule. The condition part of the default rule is a string of #'s and matches any message. \n\nLearning in GLS proceeds as follows: (1) Initialization: randomly generate an initial population of $M$ rule sets, (2) Policy Evaluation and Improvement: for each rule set, repeat a policy evaluation and improvement cycle for a suboptimal policy, then record the gain of the policy for each rule set, (3) Genetic Algorithm: use the gain of each rule set as its fitness measure and produce a new generation of rule sets, (4) Repeat: repeat from the policy evaluation and improvement step with the new generation of rule sets. \n\nIn (2) Policy Evaluation and Improvement, GLS repeats the following cycle for each rule set. \n\nStep 1 Set $\\epsilon$ sufficiently small. Set $t_{max}$ sufficiently large. \n\nStep 2 Repeat for $1 \\le t \\le t_{max}$. \n\n1. Make an observation of the environment and receive a message $m_t$ from the sensor. \n\n2. 
From all the rules whose condition matches the message $m_t$, find the rule whose condition is the most specific (footnote 1). Let us call the rule the active rule. \n\n3. Select the next action $a_t$ randomly according to the PD in the action part of the active rule, execute the action, and receive the reward $R(s_t, a_t)$ from the environment. (The state $s_t$ is not observable.) \n\n4. Update the current estimate of the gain $\\bar{R}$ from its previous estimate and $R(s_t, a_t)$. Let $R_t = R(s_t, a_t) - \\bar{R}$. For each rule, consider its condition $c_i$ as (a generalization of) a message, and update its evaluation part $V(c_i)$ and $Q(c_i, a)$ ($a \\in A$) using Eq.(4). \n\nStep 3 Check whether the following holds. If not, exit. \n\n$\\exists i$ $(1 \\le i \\le N)$, $\\max_a Q(c_i, a) > V(c_i)$ \n\nStep 4 Improve the current policy according to the method in the previous section, update the action part of the corresponding rules, and goto Step 2. \n\nFootnote 1: The most specific rule has the fewest #'s. This is intended only to reduce the number of rules. \n\nGLS extracts the condition parts of all the rules in a rule set and concatenates them to form a string. The string will be an individual to be manipulated by the genetic algorithm (GA). The genetic algorithm used in GLS is a fairly standard one. GLS combines the SGA (the simple genetic algorithm) (Goldberg 1989) with the elitist keeping strategy. The SGA is composed of three genetic operators: selection, crossover, and mutation. The fitness proportional selection and the single-point crossover are used. The three operators are applied to an entire population at each generation. Since the original SGA does not consider #'s in the rule conditions, we modified SGA as follows. 
When GLS randomly generates an initial population of rule sets, it generates # at each allele position in rule conditions according to the probability P#. \n\n4 CS PROBLEMS AND GLS \n\nIn the history of classifier systems, there have been two quite different approaches: the Michigan approach (Holland and Reitman 1978), and the Pittsburgh (Pitt) approach (De Jong 1988). In the Michigan approach, each rule is considered as an individual and the rule set as the population in GA. Each rule has its strength parameter, which is based on its future payoff and is used as the fitness measure in GA. These aspects of the approach cause many problems. One is the rule strength vs rule set performance problem. Can we collect only strong rules and get the best rule set performance? Not necessarily. A strong rule may cooperate with weak rules to increase its payoff. Then, how can we define and compute the strength parameter for the best rule set performance? In spite of its problems, this approach is now so much more popular than the other that when people simply say classifier systems, they refer to Michigan type classifier systems. In the Pitt approach, the problems of the Michigan approach are avoided by requiring GA to evaluate a whole rule set. In this approach, a rule set is considered as an individual and multiple rule sets are kept as the population. The problem of the Pitt approach is its computational difficulty. \n\nGLS can be considered as a combination of the Michigan and Pitt approaches. GA in GLS works as in the Pitt approach. It evaluates a whole rule set, and completely avoids the rule strength vs rule set performance problem of the Michigan approach. Like the Michigan type CSs, GLS evaluates each rule to improve the policy. This alleviates the computational burden of the Pitt approach. Moreover, GLS evaluates each rule in a more formal and sound way than the Michigan approach. 
The values $V(c)$ and $Q(c, a)$ are defined on the basis of POMDPs, and the policy improvement procedure using the values is guaranteed to find a local maximum. \n\nWesterdale (1997) has recently made an excellent analysis of problematic behaviors of Michigan type CSs. Two popular methods for credit assignment in CSs are the bucket brigade algorithm (BBA) (Holland 1986) and the profit sharing plan (PSP) (Grefenstette 1988). Westerdale shows that BBA does not work in POMDPs. He argues that PSP with infinite time span is necessary for the right credit assignment, although he does not show how to carry out the computation. GLS does not use BBA or PSP. GLS uses the Monte Carlo procedure, Eq.(4), to compute the value of each condition action pair. The series in Eq.(4) is slow to converge, but this is the cost we have to pay for the right credit assignment in POMDPs. Westerdale points out another CS problem. He claims that a distinction must be made between the availability and the payoff of rules. We agree with him. As he says, if the expected payoff of Rule 1 is twice as much as that of Rule 2, then we want to always choose Rule 1. GLS makes the distinction. The probability of a stochastic policy $\\pi(a|c)$ in GLS corresponds to the availability, and the value of a condition action pair $Q(c, a)$ corresponds to the payoff. \n\nSamuel System (Grefenstette et al. 1990) can also be considered as a combination of the Michigan and Pitt approaches. Samuel is a highly sophisticated system which has lots of features. We conjecture, however, that Samuel is not free from the CS problems which 
Westerdale has analyzed. This is because Samuel uses PSP for credit assignment, uses the payoff of each rule for action selection, and does not make a distinction between the availability and the payoff of rules. \n\nXCS (Wilson 1995) seems to be an exceptionally reliable Michigan-type CS. In XCS, each rule's fitness is based not on its future payoff but on the prediction accuracy of its future payoff (XCS uses BBA for credit assignment). Wilson reports that XCS's population tends to form a complete and accurate mapping from sensor messages and actions to payoff predictions. We conjecture that XCS tries to build the most general Markovian model of the environment. Therefore, it will be difficult to apply XCS when the environment is not Markovian, or when we cannot afford enough rules to build a Markovian model of the environment, even if the environment itself is Markovian. As we will see in the next section, GLS is intended exactly for these situations. \n\nKaelbling et al. (1996) survey methods for input generalization when reward is delayed. The methods use a function approximator to represent the value function by mapping a state description to a value. Since they use value iteration or Q-learning anyway, it is difficult to apply the methods when the generalization violates the Markov assumption and induces a POMDP. \n\n5 EXPERIMENTS \n\nWe have tested GLS with some of the representative problems in the CS literature. Fig. 1 shows the Gref1 world (Grefenstette 1987). In the Gref1 world, we used GLS to find the smallest rule set which is necessary for the optimal performance. Since this is not a POMDP but an MDP, the optimal policy can easily be learned when we have a corresponding rule for each of the 16 states. However, when the total number of rules is less than that of states, the environment looks like a POMDP to the learning agent, even if the environment itself is an MDP. 
The graph shows how the gain of the best rule set in the population changes with the generation. We can see from the figure that four rules are enough for the optimal performance. Also note that the saving of rules is achieved by selecting the most specific matching rule as the active rule. A rule set with this rule selection is called a default hierarchy in the CS literature. \n\n[Figure 1 graphics omitted. LEFT: diagram of the Gref1 world. RIGHT: plot of payoff against generations, with one curve each for N = 4, N = 3, and N = 2.] \n\nFigure 1: LEFT: Gref1 World. States {0, 1, 2, 3} are the start states and states {12, 13, 14, 15} are the end states. In each state, the agent gets the state number (4 bits) as a message, and chooses an action a, b, c, or d. When the agent reaches the end states, he receives reward 1000 in state 13, but reward 0 in other states. Then the agent is put in one of the start states with equal probability. We added 10% action errors to make the process ergodic. When an action error occurs, the agent moves to one of the 16 states with equal probability. \nRIGHT: Gain of the best rule set. Parameters: $t_{max} = 10000$, $\\epsilon = 0.10$, $M = 10$, $N = 2, 3, 4$, P# = 0.33. For $N = 4$, the best rule set at the 40th generation was { if 0101 (State 5) then a 1.0, if 1010 (State 10) then c 1.0, if ##11 (States 3, 7, 11, 15) then d 1.0, if #### (Default Rule) then b 1.0 }. \n\n[Figure 2 graphics omitted. LEFT: diagram of McCallum's Maze, shown once with state numbers and once with sensor messages. RIGHT: plot against generations, with one curve each for N = 6 and N = 5.] \n\nFigure 2: LEFT: McCallum's Maze. We show the state numbers on the left, and the messages on the right. States 8 and 9 are the start states, and state G is the goal state. In each state, the agent receives a sensor message which is 4 bits long. Each bit in the message tells whether a wall exists in each of the four directions. From each state, the agent moves to one of the adjacent states. When the agent reaches the goal state, he receives reward 1000. The agent is then put in one of the start states with equal probability. \nRIGHT: Gain of the best rule set. Parameters: $t_{max} = 50000$, $\\epsilon = 0.10$, $M = 10$, $N = 5, 6$, P# = 0.33. \n\nFig. 2 is a POMDP known as McCallum's Maze (McCallum 1993). Thanks to the use of stochastic policies, GLS achieves near optimal gain for memoryless policies. Note that no memoryless deterministic policy can take the agent to the goal for this problem. \n\nWe have seen GLS's generalization capability for an MDP in the Gref1 World, and the advantage of stochastic policies for a POMDP in McCallum's Maze. In Woods7 (Wilson 1994), we attempt to test GLS's generalization capability for a POMDP. See Fig. 3. Since each sensor message is 16 bits long, and the conditions of GLS rules can have either 0, 1, or # for each of the 16 bits, there are $3^{16}$ possible conditions in total. Given that only 92 different sensor messages actually occur in the environment, it seems quite difficult to discover conditions that match them by using GA alone. In fact, when we ran GLS for the first time, the standard rules very rarely matched the messages and the default rule took over most of the time. In order to avoid the no matching rule problem, we made the number of rules in a rule set large (N = 100) and increased P# from 0.33 in the previous problems to 0.70. 
\n\nThe problem was independently attacked by other methods. Wilson applied his ZCS, zeroth level classifier system, to Woods7 (Wilson 1994). The gain was 0.20. ZCS has a special covering procedure to get around the no matching rule problem. The covering procedure generates a rule which matches a message when none of the current rules matches the message. We expect further improvement in the gain if we equip GLS with some covering procedure. \n\n6 SUMMARY \n\nIn order to solve CS problems such as the rule strength vs rule set performance problem and the credit assignment problem, we have developed a hybrid classifier system: GLS. We note that generalization often leads to state aliasing. Therefore, in designing GLS, we view CSs as model free learning in POMDPs and take a hybrid approach to finding the best generalization, given the total number of rules. GLS uses the policy improvement procedure by Jaakkola et al. to find a locally optimal stochastic policy when a set of rule conditions is given. GLS uses GA to search for the best set of rule conditions. \n\n[Figure 3 graphics omitted. LEFT: grid map of the Woods7 environment. RIGHT: plot of the gain against generations.] \n\nFigure 3: LEFT: Woods7. Each cell is either empty \".\", contains a stone \"O\", or contains food \"F\". The cells which contain a stone are not passable, and the cells which contain food are goals. In each cell, the agent receives a 2 * 8 = 16 bit long sensor message, which tells the contents of the eight adjacent cells. From each cell, the agent can move to one of the eight adjacent cells. When the agent reaches a cell which contains food, he receives reward 1. The agent is then put in one of the empty cells with equal probability. \nRIGHT: Gain of the best rule set. Parameters: $t_{max} = 10000$, $\\epsilon = 0.10$, $M = 10$, $N = 100$, P# = 0.70. \n\nReferences \n\nDe Jong, K. A. (1988). Learning with genetic algorithms: An overview. Machine Learning, 3:121-138. \n\nGoldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley. \n\nGrefenstette, J. J. (1987). Multilevel credit assignment in a genetic learning system. In Proc. Second Int. Conf. on Genetic Algorithms, pp. 202-209. \n\nGrefenstette, J. J. (1988). Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, 3:225-245. \n\nGrefenstette, J. J., C. L. Ramsey, and A. C. Schultz (1990). Learning sequential decision rules using simulation and competition. Machine Learning, 5:355-381. \n\nHolland, J. H. (1986). 
Escaping brittleness: the possibilities of general purpose learning algorithms applied to parallel rule-based systems. In Machine Learning II, pp. 593-623. Morgan Kaufmann. \n\nHolland, J. H. and J. S. Reitman (1978). Cognitive systems based on adaptive algorithms. In D. A. Waterman and F. Hayes-Roth (Eds.), Pattern-directed inference systems. Academic Press. \n\nJaakkola, T., S. P. Singh, and M. I. Jordan (1994). Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in Neural Information Processing Systems 7, pp. 345-352. \n\nKaelbling, L. P., M. L. Littman, and A. W. Moore (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285. \n\nMcCallum, R. A. (1993). Overcoming incomplete perception with utile distinction memory. In Proc. Tenth Int. Conf. on Machine Learning, pp. 190-196. \n\nWesterdale, T. H. (1997). Classifier systems - no wonder they don't work. In Proc. Second Annual Genetic Programming Conference, pp. 529-537. \n\nWilson, S. W. (1994). ZCS: A zeroth order classifier system. Evolutionary Computation, 2(1):1-18. \n\nWilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149-175. \n\nWilson, S. W. and D. E. Goldberg (1989). A critical review of classifier systems. In Proc. Third Int. Conf. on Genetic Algorithms, pp. 244-255. \n", "award": [], "sourceid": 1492, "authors": [{"given_name": "Akira", "family_name": "Hayashi", "institution": null}, {"given_name": "Nobuo", "family_name": "Suematsu", "institution": null}]}