Part of Advances in Neural Information Processing Systems 2 (NIPS 1989)

*David Ackley, Michael Littman*

In associative reinforcement learning, an environment generates input vectors, a learning system generates possible output vectors, and a re(cid:173) inforcement function computes feedback signals from the input-output pairs. The task is to discover and remember input-output pairs that generate rewards. Especially difficult cases occur when rewards are rare, since the expected time for any algorithm can grow exponentially with the size of the problem. Nonetheless, if a reinforcement function possesses regularities, and a learning algorithm exploits them, learning time can be reduced below that of non-generalizing algorithms. This paper describes a neural network algorithm called complementary re(cid:173) inforcement back-propagation (CRBP), and reports simulation results on problems designed to offer differing opportunities for generalization.

1 REINFORCEMENT LEARNING REQUIRES SEARCH Reinforcement learning (Sutton, 1984; Barto & Anandan, 1985; Ackley, 1988; Allen, 1989) requires more from a learner than does the more familiar supervised learning paradigm. Supervised learning supplies the correct answers to the learner, whereas reinforcement learning requires the learner to discover the correct outputs before they can be stored. The reinforcement paradigm divides neatly into search and learning aspects: When rewarded the system makes internal adjustments to learn the discovered input-output pair; when punished the system makes internal adjust(cid:173) ments to search elsewhere.

Generalization and Scaling in Reinforcement Learning

551

1.1 MAKING REINFORCEMENT INTO ERROR

Following work by Anderson (1986) and Williams (1988), we extend the backprop(cid:173) agation algorithm to associative reinforcement learning. Start with a "garden va(cid:173) riety" backpropagation network: A vector i of n binary input units propagates through zero or more layers of hidden units, ultimately reaching a vector 8 of m sigmoid units, each taking continuous values in the range (0,1). Interpret each 8j as the probability that an associated random bit OJ takes on value 1. Let us call the continuous, deterministic vector 8 the search vector to distinguish it from the stochastic binary output vector o.

Given an input vector, we forward propagate to produce a search vector 8, and then perform m independent Bernoulli trials to produce an output vector o. The i - 0 pair is evaluated by the reinforcement function and reward or punishment ensues. Suppose reward occurs. We therefore want to make 0 more likely given i. Backpropagation will do just that if we take 0 as the desired target to produce an error vector (0 - 8) and adjust weights normally. Now suppose punishment occurs, indicating 0 does not correspond with i. By choice of error vector, backpropagation allows us to push the search vector in any direction; which way should we go? In absence of problem-specific information, we cannot pick an appropriate direction with certainty. Any decision will involve assumptions. A very minimal "don't be like 0" assumption-employed in Anderson (1986), Williams (1988), and Ackley (1989)-pushes s directly away from 0 by taking (8 - 0) as the error vector. A slightly stronger "be like not-o" assumption-employed in Barto & Anandan (1985) and Ackley (1987)-pushes s directly toward the complement of 0 by taking ((1 - 0) - 8) as the error vector. Although the two approaches always agree on the signs of the error terms, they differ in magnitudes. In this work, we explore the second possibility, embodied in an algorithm called complementary reinforcement back-propagation ( CRBP).

Figure 1 summarizes the CRBP algorithm. The algorithm in the figure reflects three modifications to the basic approach just sketched. First, in step 2, instead of using the 8j'S directly as probabilities, we found it advantageous to "stretch" the values using a parameter v. When v < 1, it is not necessary for the 8i'S to reach zero or one to produce a deterministic output. Second, in step 6, we found it important to use a smaller learning rate for punishment compared to reward. Third, consider step 7: Another forward propagation is performed, another stochastic binary out(cid:173) put vector 0* is generated (using the procedure from step 2), and 0* is compared to o. If they are identical and punishment occurred, or if they are different and reward occurred, then another error vector is generated and another weight update is performed. This loop continues until a different output is generated (in the case of failure) or until the original output is regenerated (in the case of success). This modification improved performance significantly, and added only a small percentage to the total number of weight updates performed.

552

Ackley and Littman

O. Build a back propagation network with input dimensionality n and output

dimensionality m. Let t = 0 and te = O.

- Pick random i E 2n and forward propagate to produce a/s. 2. Generate a binary output vector o. Given a uniform random variable ~ E [0,1]

and parameter 0 < v < 1,

OJ = {1, if(sj - !)/v+! ~ ~j

Do not remove: This comment is monitored to verify that the site is working properly