Stable Fitted Reinforcement Learning

Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)

Bibtex Metadata Paper


Geoffrey J. Gordon


We describe the reinforcement learning problem, motivate algo(cid:173) rithms which seek an approximation to the Q function, and present new convergence results for two such algorithms.



Imagine an agent acting in some environment. At time t, the environment is in some state Xt chosen from a finite set of states. The agent perceives Xt, and is allowed to choose an action at from some finite set of actions. The environment then changes state, so that at time (t + 1) it is in a new state Xt+1 chosen from a probability distribution which depends only on Xt and at. Meanwhile, the agent experiences a real-valued cost Ct, chosen from a distribution which also depends only on Xt and at and which has finite mean and variance.

Such an environment is called a Markov decision process, or MDP. The reinforce(cid:173) ment learning problem is to control an MDP to minimize the expected discounted cost Lt ,tCt for some discount factor, E [0,1]. Define the function Q so that Q(x, a) is the cost for being in state x at time 0, choosing action a, and behaving optimally from then on. If we can discover Q, we have solved the problem: at each step, we may simply choose at to minimize Q(xt, at). For more information about MDPs, see (Watkins, 1989, Bertsekas and Tsitsiklis, 1989). We may distinguish two classes of problems, online and offline. In the offline prob(cid:173) lem, we have a full model of the MDP: given a state and an action, we can describe the distributions of the cost and the next state. We will be concerned with the online problem, in which our knowledge of the MDP is limited to what we can dis(cid:173) cover by interacting with it. To solve an online problem, we may approximate the transition and cost functions, then proceed as for an offline problem (the indirect approach); or we may try to learn the Q function without the intermediate step (the direct approach). Either approach may work better for any given problem: the

Stable Fitted Reinforcement Learning