{"title": "Generalization and Scaling in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 550, "page_last": 557, "abstract": null, "full_text": "550 \n\nAckley and Littman \n\nGeneralization and scaling in reinforcement \n\nlearning \n\nDavid H. Ackley \n\nMichael L. Littman \n\nCognitive Science Research Group \n\nBellcore \n\nMorristown, NJ 07960 \n\nABSTRACT \n\nIn associative reinforcement learning, an environment generates input \nvectors, a learning system generates possible output vectors, and a re(cid:173)\ninforcement function computes feedback signals from the input-output \npairs. The task is to discover and remember input-output pairs that \ngenerate rewards. Especially difficult cases occur when rewards are \nrare, since the expected time for any algorithm can grow exponentially \nwith the size of the problem. Nonetheless, if a reinforcement function \npossesses regularities, and a learning algorithm exploits them, learning \ntime can be reduced below that of non-generalizing algorithms. This \npaper describes a neural network algorithm called complementary re(cid:173)\ninforcement back-propagation (CRBP), and reports simulation results \non problems designed to offer differing opportunities for generalization. \n\n1 REINFORCEMENT LEARNING REQUIRES SEARCH \nReinforcement learning (Sutton, 1984; Barto & Anandan, 1985; Ackley, 1988; Allen, \n1989) requires more from a learner than does the more familiar supervised learning \nparadigm. Supervised learning supplies the correct answers to the learner, whereas \nreinforcement learning requires the learner to discover the correct outputs before \nthey can be stored. The reinforcement paradigm divides neatly into search and \nlearning aspects: When rewarded the system makes internal adjustments to learn \nthe discovered input-output pair; when punished the system makes internal adjust(cid:173)\nments to search elsewhere. 
\n\n\fGeneralization and Scaling in Reinforcement Learning \n\n551 \n\n1.1 MAKING REINFORCEMENT INTO ERROR \n\nFollowing work by Anderson (1986) and Williams (1988), we extend the backprop(cid:173)\nagation algorithm to associative reinforcement learning. Start with a \"garden va(cid:173)\nriety\" backpropagation network: A vector i of n binary input units propagates \nthrough zero or more layers of hidden units, ultimately reaching a vector 8 of m \nsigmoid units, each taking continuous values in the range (0,1). Interpret each 8j \nas the probability that an associated random bit OJ takes on value 1. Let us call \nthe continuous, deterministic vector 8 the search vector to distinguish it from the \nstochastic binary output vector o. \n\nGiven an input vector, we forward propagate to produce a search vector 8, and \nthen perform m independent Bernoulli trials to produce an output vector o. The \ni -\n0 pair is evaluated by the reinforcement function and reward or punishment \nensues. Suppose reward occurs. We therefore want to make 0 more likely given i. \nBackpropagation will do just that if we take 0 as the desired target to produce an \nerror vector (0 - 8) and adjust weights normally. \nNow suppose punishment occurs, indicating 0 does not correspond with i. By choice \nof error vector, backpropagation allows us to push the search vector in any direction; \nwhich way should we go? In absence of problem-specific information, we cannot pick \nan appropriate direction with certainty. Any decision will involve assumptions. A \nvery minimal \"don't be like 0\" assumption-employed in Anderson (1986), Williams \n(1988), and Ackley (1989)-pushes s directly away from 0 by taking (8 - 0) as the \nerror vector. A slightly stronger \"be like not-o\" assumption-employed in Barto & \nAnandan (1985) and Ackley (1987)-pushes s directly toward the complement of 0 \nby taking ((1 - 0) - 8) as the error vector. 
Although the two approaches always agree on the signs of the error terms, they differ in magnitudes. In this work, we explore the second possibility, embodied in an algorithm called complementary reinforcement back-propagation (CRBP). \n\nFigure 1 summarizes the CRBP algorithm. The algorithm in the figure reflects three modifications to the basic approach just sketched. First, in step 2, instead of using the s_j's directly as probabilities, we found it advantageous to \"stretch\" the values using a parameter ν. When ν < 1, it is not necessary for the s_j's to reach zero or one to produce a deterministic output. Second, in step 6, we found it important to use a smaller learning rate for punishment compared to reward. Third, consider step 7: another forward propagation is performed, another stochastic binary output vector o* is generated (using the procedure from step 2), and o* is compared to o. If they are identical and punishment occurred, or if they are different and reward occurred, then another error vector is generated and another weight update is performed. This loop continues until a different output is generated (in the case of failure) or until the original output is regenerated (in the case of success). This modification improved performance significantly, and added only a small percentage to the total number of weight updates performed. \n\n0. Build a backpropagation network with input dimensionality n and output dimensionality m. Let t = 0 and te = 0. \n\n1. Pick a random i ∈ 2^n and forward propagate to produce the s_j's. \n\n2. Generate a binary output vector o. Given uniform random variables ξ_j ∈ [0,1] and a parameter 0 < ν < 1, let o_j = 1 if (s_j - 1/2)/ν + 1/2 ≥ ξ_j, and o_j = 0 otherwise. \n\n3. Compute reinforcement r = f(i, o). Increment t. If r < 0, let te = t. \n\n4. Generate output errors e_j: if r > 0, let t_j = o_j, otherwise let t_j = 1 - o_j; then e_j = (t_j - s_j) s_j (1 - s_j). \n\n5. 
Backpropagate errors. \n\n6. Update weights: Δw_jk = η e_k s_j, using η = η+ if r > 0 and η = η- otherwise, with parameters η+, η- > 0. \n\n7. Forward propagate again to produce new s_j's. Generate a temporary output vector o*. If (r > 0 and o* ≠ o) or (r < 0 and o* = o), go to 4. \n\n8. If te ≪ t, exit returning te; else go to 1. \n\nFigure 1: Complementary Reinforcement Back-Propagation (CRBP) \n\n2 ON-LINE GENERALIZATION \n\nWhen there are many possible outputs and correct pairings are rare, the computational cost associated with the search for the correct answers can be profound. The search for correct pairings will be accelerated if the search strategy can effectively generalize the reinforcement received on one input to others. The speed of an algorithm on a given problem relative to non-generalizing algorithms provides a measure of generalization that we call on-line generalization. \n\n0. Let z be an array of length 2^n. Set the z[i] to random numbers from 0 to 2^m - 1. Let t = te = 0. \n\n1. Pick a random input i ∈ 2^n. \n\n2. Compute reinforcement r = f(i, z[i]). Increment t. \n\n3. If r < 0, let z[i] = (z[i] + 1) mod 2^m, and let te = t. \n\n4. If te ≪ t, exit returning te; else go to 1. \n\nFigure 2: The table-based learner Tref \n\n3 SCALING EXPERIMENTS \n\n3.1 n-MAJORITY \n\nAs a first example, consider the n-majority problem: [if most input bits are 1 then o = 1^n else o = 0^n]. The i-o mapping is many-to-1. This problem provides an opportunity for what Anderson (1986) called \"output generalization\": since there are only two correct output states, every pair of output bits is completely correlated in the cases when reward occurs. \n\n[Figure 3: The n-majority problem. Learning time (log scale) vs. n for the table learner and for CRBP with (n-n-n) and without (n-n) hidden units.] \n\nFigure 3 displays the simulation results. Note that although Tref is faster than CRBP at small values of n, CRBP's slower growth rate (1.6^n vs. 4.2^n) allows it to cross over and begin outperforming Tref at about 6 bits. 
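The table-based learner can be rendered directly; a sketch, in which the concrete `patience` exit test (standing in for the paper's \"te ≪ t\" condition) and the integer encoding of bit vectors are our assumptions:

```python
import random

def tref(f, n, m, rng, patience):
    """Table learner: one guess z[i] per input; punishment increments it mod 2^m.
    Exits after `patience` consecutive unpunished trials."""
    z = [rng.randrange(2 ** m) for _ in range(2 ** n)]
    t = te = 0
    while t - te < patience:
        i = rng.randrange(2 ** n)
        t += 1
        if f(i, z[i]) < 0:               # punished: try the next output value
            z[i] = (z[i] + 1) % (2 ** m)
            te = t
    return te, z

def majority_f(i, o):
    """n-majority for n = m = 3, inputs/outputs encoded as integers."""
    target = 0b111 if 2 * bin(i).count("1") > 3 else 0
    return 1 if o == target else -1

te, z = tref(majority_f, 3, 3, random.Random(1), patience=200)
```

Because each punished entry simply cycles through all 2^m outputs, nothing learned for one input ever transfers to another, which is what makes Tref the non-generalizing baseline.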
Note also--in violation of some conventional wisdom--that although n-majority is a linearly separable problem, the performance of CRBP with hidden units is better than without. Hidden units can be helpful--even on linearly separable problems--when there are opportunities for output generalization. \n\n(Footnote 1: For n = 1 to 12, we used η+ = {2.000, 1.550, 1.130, 0.979, 0.783, 0.709, 0.623, 0.525, 0.280, 0.219, 0.170, 0.121}.) \n\n3.2 n-COPY AND THE 2^k-ATTRACTORS FAMILY \n\nAs a second example, consider the n-copy problem: [o = i]. The i-o mapping is now 1-1, and the values of output bits in rewarding states are completely uncorrelated, but the value of each output bit is completely correlated with the value of the corresponding input bit. \n\n[Figure 4: The n-copy problem. Learning time (log scale) vs. n for the table learner and for CRBP with (n-n-n) and without (n-n) hidden units; the fitted curves are 150*2.0^n and 12*2.2^n.] \n\nFigure 4 displays the simulation results. Once again, at low values of n, Tref is faster, but CRBP rapidly overtakes Tref as n increases. In n-copy, unlike n-majority, CRBP performs better without hidden units. \n\nThe n-majority and n-copy problems are extreme cases of a spectrum. n-majority can be viewed as a \"2-attractors\" problem in that there are only two correct outputs--all zeros and all ones--and the correct output is the one that i is closer to in Hamming distance. By dividing the input and output bits into two groups and performing the majority function independently on each group, one generates a \"4-attractors\" problem. In general, by dividing the input and output bits into 1 ≤ k ≤ n groups, one generates a \"2^k-attractors\" problem. When k = 1, n-majority results, and when k = n, n-copy results. 
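The 2^k-attractors construction can be sketched as a generator of reinforcement functions; a sketch under the description above, where the handling of ties (possible when group sizes are even) is our assumption, since the paper does not specify it:

```python
def attractors_f(n, k):
    """Reinforcement function for the 2^k-attractors problem.

    Splits the n input and n output bits into k equal groups; each output
    group must be all-0 or all-1 according to the majority of its input
    group. Ties map to 0 here -- an assumption, not from the paper."""
    assert n % k == 0
    g = n // k  # bits per group
    def f(i, o):
        for a in range(0, n, g):
            maj = 1 if 2 * sum(i[a:a + g]) > g else 0
            if any(bit != maj for bit in o[a:a + g]):
                return -1
        return 1
    return f

# k = 1 gives n-majority; k = n gives n-copy
f_maj = attractors_f(8, 1)
f_copy = attractors_f(8, 8)
```

With g = 1 the majority of a single bit is the bit itself, so the k = n case rewards exactly o = i, recovering n-copy as claimed.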
\nFigure 5 displays simulation results on the n = 8-bit problems generated when k is varied from 1 to n. The advantage of hidden units for low values of k is evident, as is the advantage of \"shortcut connections\" (direct input-to-output weights) for larger values of k. Note also that the combination of both hidden units and shortcut connections performs better than either alone. \n\n[Figure 5: The 2^k-attractors family at n = 8. Learning time vs. k for the table learner and for CRBP 8-10-8, 8-8, and 8-10-8/s (with shortcut connections).] \n\n3.3 n-EXCLUDED MIDDLE \n\nAll of the functions considered so far have been linearly separable. Consider this \"folded majority\" function: [if the number of 1-bits in i falls strictly between two fixed thresholds then o = 0^n else o = 1^n]. Now, like n-majority, there are only two rewarding output states, but the determination of which output state is correct is not linearly separable in the input space. When n = 2, the n-excluded middle problem yields the EQV (i.e., the complement of XOR) function, but whereas functions such as n-parity [if the number of 1-bits in i is even then o = 0^n else o = 1^n] get more non-linear with increasing n, n-excluded middle does not. \n\n[Figure 6: The n-excluded middle problem. Learning time (log scale) vs. n for the table learner and CRBP n-n-n/s; the fitted curve is 1700*1.6^n.] \n\nFigure 6 displays the simulation results. CRBP is slowed somewhat compared to the linearly separable problems, yielding a higher \"cross over point\" of about 8 bits. \n\n4 STRUCTURING DEGENERATE OUTPUT SPACES \n\nAll of the scaling problems in the previous section are designed so that there is a single correct output for each possible input. 
This allows for difficult problems even at small sizes, but it rules out an important aspect of generalizing algorithms for associative reinforcement learning: if there are multiple satisfactory outputs for given inputs, a generalizing algorithm may impose structure on the mapping it produces. \n\nWe have two demonstrations of this effect, \"Bit Count\" and \"Inverse Arithmetic.\" The Bit Count problem simply states that the number of 1-bits in the output should equal the number of 1-bits in the input. When n = 9, Tref rapidly finds solutions involving hundreds of different output patterns. CRBP is slower--especially with relatively few hidden units--but it regularly finds solutions involving just 10 output patterns that form a sequence from 0^9 to 1^9 with one bit changing per step. \n\n0 + 0 × 4 = 0    0 + 2 × 4 = 8     0 + 4 × 4 = 16    0 + 6 × 4 = 24 \n1 + 0 × 4 = 1    1 + 2 × 4 = 9     1 + 4 × 4 = 17    1 + 6 × 4 = 25 \n2 + 0 × 4 = 2    2 + 2 × 4 = 10    2 + 4 × 4 = 18    2 + 6 × 4 = 26 \n3 + 0 × 4 = 3    3 + 2 × 4 = 11    3 + 4 × 4 = 19    3 + 6 × 4 = 27 \n4 + 0 × 4 = 4    4 + 2 × 4 = 12    4 + 4 × 4 = 20    4 + 6 × 4 = 28 \n5 + 0 × 4 = 5    5 + 2 × 4 = 13    5 + 4 × 4 = 21    5 + 6 × 4 = 29 \n6 + 0 × 4 = 6    6 + 2 × 4 = 14    6 + 4 × 4 = 22    6 + 6 × 4 = 30 \n7 + 0 × 4 = 7    7 + 2 × 4 = 15    7 + 4 × 4 = 23    7 + 6 × 4 = 31 \n\n2 + 2 - 4 = 0    2 + 2 + 4 = 8     6 + 6 + 4 = 16    0 + 6 × 4 = 24 \n3 + 2 - 4 = 1    3 + 2 + 4 = 9     7 + 6 + 4 = 17    1 + 6 × 4 = 25 \n2 + 2 ÷ 4 = 2    2 + 2 × 4 = 10    2 + 4 × 4 = 18    2 + 6 × 4 = 26 \n3 + 2 ÷ 4 = 3    3 + 2 × 4 = 11    3 + 4 × 4 = 19    3 + 6 × 4 = 27 \n6 + 2 - 4 = 4    6 + 2 + 4 = 12    4 × 4 + 4 = 20    4 + 6 × 4 = 28 \n7 + 2 - 4 = 5    7 + 2 + 4 = 13    5 + 4 × 4 = 21    5 + 6 × 4 = 29 \n6 + 2 ÷ 4 = 6    6 + 2 × 4 = 14    6 + 4 × 4 = 22    6 + 6 × 4 = 30 \n7 + 2 ÷ 4 = 7    7 + 2 × 4 = 15    7 + 4 × 4 = 23    7 + 6 × 4 = 31 \n\nFigure 7: Sample CRBP solutions to Inverse Arithmetic \n\nThe Inverse Arithmetic problem can be summarized as follows: given i ∈ 2^5, find x, y, z ∈ 2^3 and operators ∘, ⋄ ∈ {+ (00), - (01), × (10), ÷ (11)} such that x ∘ y ⋄ z = i. 
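Under the evaluation rules the paper uses for this problem (× and ÷ bind tighter than + and -, otherwise left-to-right, integer division, division by zero fails), a checker for the Figure 7 entries can be sketched; the function name and ASCII operator tokens are our own:

```python
def evaluate(x, op1, y, op2, z):
    """Evaluate x op1 y op2 z: '*' and '/' bind tighter than '+' and '-',
    otherwise strictly left-to-right; '/' is floor division (a stand-in for
    the paper's integer division), and division by zero returns None."""
    def apply(a, op, b):
        if op == "+": return a + b
        if op == "-": return a - b
        if op == "*": return a * b
        return None if b == 0 else a // b
    if op1 in "+-" and op2 in "*/":
        right = apply(y, op2, z)             # tighter-binding tail goes first
        return None if right is None else apply(x, op1, right)
    left = apply(x, op1, y)                  # otherwise left-to-right
    return None if left is None else apply(left, op2, z)
```

For example, evaluate(2, "+", 2, "-", 4) reproduces the first entry of the lower table in Figure 7, and evaluate(7, "+", 2, "/", 4) shows how integer division makes 7 + 2 ÷ 4 = 7.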
In all there are 13 bits of output, interpreted as three 3-bit binary numbers and two 2-bit operators, and the task is to pick an output that evaluates to the given 5-bit binary input under the usual rules: operator precedence, left-right evaluation, integer division, and division by zero fails. \n\nAs shown in Figure 7, CRBP sometimes solves this problem essentially by discovering positional notation, and sometimes produces less-globally structured solutions, particularly as outputs for lower-valued i's, which have a wider range of solutions. \n\n5 CONCLUSIONS \n\nSome basic concepts of supervised learning appear in different guises when the paradigm of reinforcement learning is applied to large output spaces. Rather than a \"learning phase\" followed by a \"generalization test,\" in reinforcement learning the search problem is a generalization test, performed simultaneously with learning. Information is put to work as soon as it is acquired. \n\nThe problem of \"overfitting\" or \"learning the noise\" seems to be less of an issue, since learning stops automatically when consistent success is reached. In experiments not reported here we gradually increased the number of hidden units on the 8-bit copy problem from 8 to 25 without observing the performance decline associated with \"too many free parameters.\" \n\nThe 2^k-attractors (and 2^k-folds, generalizing excluded middle) families provide a starter set of sample problems with easily understood and distinctly different extreme cases. \n\nIn degenerate output spaces, generalization decisions can be seen directly in the discovered mapping. Network analysis is not required to \"see how the net does it.\" \n\nThe possibility of ultimately generating useful new knowledge via reinforcement learning algorithms cannot be ruled out. \n\nReferences \n\nAckley, D.H. 
(1987) A connectionist machine for genetic hillclimbing. Boston, MA: Kluwer Academic Publishers. \n\nAckley, D.H. (1989) Associative learning via inhibitory search. In D.S. Touretzky (ed.), Advances in Neural Information Processing Systems 1, 20-28. San Mateo, CA: Morgan Kaufmann. \n\nAllen, R.B. (1989) Developing agent models with a neural reinforcement technique. IEEE Systems, Man, and Cybernetics Conference. Cambridge, MA. \n\nAnderson, C.W. (1986) Learning and problem solving with multilayer connectionist systems. University of Massachusetts Ph.D. dissertation. COINS TR 86-50. Amherst, MA. \n\nBarto, A.G. (1985) Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229-256. \n\nBarto, A.G., & Anandan, P. (1985) Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15, 360-374. \n\nRumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986) Learning representations by back-propagating errors. Nature, 323, 533-536. \n\nSutton, R.S. (1984) Temporal credit assignment in reinforcement learning. University of Massachusetts Ph.D. dissertation. COINS TR 84-2. Amherst, MA. \n\nWilliams, R.J. (1988) Toward a theory of reinforcement-learning connectionist systems. College of Computer Science of Northeastern University Technical Report NU-CCS-88-3. Boston, MA. \n", "award": [], "sourceid": 208, "authors": [{"given_name": "David", "family_name": "Ackley", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}]}