Instance-Based State Identification for Reinforcement Learning

Part of Advances in Neural Information Processing Systems 7 (NIPS 1994)

R. Andrew McCallum


This paper presents instance-based state identification, an approach to reinforcement learning and hidden state that builds disambiguating amounts of short-term memory on-line, and also learns with an order of magnitude fewer training steps than several previous approaches. Inspired by a key similarity between learning with hidden state and learning in continuous geometrical spaces, this approach uses instance-based (or "memory-based") learning, a method that has worked well in continuous spaces.


When a robot's next course of action depends on information that is hidden from the sensors because of problems such as occlusion, restricted range, bounded field of view and limited attention, the robot suffers from hidden state. More formally, we say a reinforcement learning agent suffers from the hidden state problem if the agent's state representation is non-Markovian with respect to actions and utility.

The hidden state problem arises as a case of perceptual aliasing: the mapping between states of the world and sensations of the agent is not one-to-one [Whitehead, 1992]. If the agent's perceptual system produces the same outputs for two world states in which different actions are required, and if the agent's state representation consists only of its percepts, then the agent will fail to choose correct actions. Note that even if an agent's state representation includes some internal state beyond its immediate percepts, the agent can still suffer from hidden state if it does not keep enough internal state to uncover the non-Markovianness of its environment.
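The mismatch can be made concrete with a toy example. In the sketch below (the world states, percepts, and actions are invented for illustration, not drawn from the paper), a percept is flagged as aliased when two world states that share it demand different optimal actions:

```python
from collections import defaultdict

# Hypothetical toy world: two distinct states produce the same percept,
# so a policy keyed only on percepts cannot act correctly in both.
observation = {
    "corridor_left":  "corridor",   # optimal action here: go left
    "corridor_right": "corridor",   # optimal action here: go right
    "goal_room":      "goal",
}
optimal_action = {"corridor_left": "left", "corridor_right": "right"}

# Group the required actions by percept; a percept that demands more
# than one distinct action is perceptually aliased.
actions_by_percept = defaultdict(set)
for state, percept in observation.items():
    actions_by_percept[percept].add(optimal_action.get(state))
aliased = {p for p, acts in actions_by_percept.items() if len(acts) > 1}
print(aliased)  # {'corridor'}
```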

One solution to the hidden state problem is simply to avoid passing through the aliased states. This is the approach taken in Whitehead's Lion algorithm [Whitehead, 1992]. Whenever the agent finds a state that delivers inconsistent reward, it sets that state's utility so low that the policy will never visit it again. The success of this algorithm depends on a deterministic world and on the existence of a path to the goal that consists of only unaliased states.
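A minimal sketch of this avoidance rule, assuming a simple tabular setting (the function and variable names here are my own shorthand, not Whitehead's):

```python
# Hypothetical simplification of the Lion algorithm's avoidance rule:
# a percept that ever yields inconsistent reward for the same action is
# assigned a utility so low the greedy policy never revisits it.
def lion_update(utility, seen_reward, percept, action, reward):
    key = (percept, action)
    if key in seen_reward and seen_reward[key] != reward:
        utility[percept] = float("-inf")  # never visit this state again
    else:
        seen_reward[key] = reward

utility, seen_reward = {}, {}
lion_update(utility, seen_reward, "door", "open", 1.0)   # consistent so far
lion_update(utility, seen_reward, "door", "open", -1.0)  # inconsistent: aliased
print(utility)  # {'door': -inf}
```

Note how this sidesteps rather than solves aliasing: the suppressed state may lie on the only efficient path to the goal, which is why the approach needs an all-unaliased route to exist.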

Other solutions do not avoid aliased states, but instead do the best they can given a non-Markovian state representation [Littman, 1994; Singh et al., 1994; Jaakkola et al., 1995]. They involve either learning deterministic policies that execute incorrect actions in some aliased states, or learning stochastic policies with action choice probabilities matching the proportions of the different underlying aliased world states. These approaches do not depend on a path of unaliased states, but they have other limitations: when faced with many aliased states, a stochastic policy degenerates into a random walk; when faced with potentially harmful results from incorrect actions, deterministically incorrect or probabilistically incorrect action choice may prove too dangerous; and when faced with performance-critical tasks, inefficiency that is proportional to the amount of aliasing may be unacceptable.

The most robust solution to the hidden state problem is to augment the agent's state representation on-line so as to disambiguate the aliased states. State identification techniques uncover the hidden state information, that is, they make the agent's internal state space Markovian. This transformation from an imperfect state information model to a perfect state information model has been formalized in the decision and control literature, and involves adding previous percepts and actions to the definition of agent internal state [Bertsekas and Shreve, 1978]. By augmenting the agent's perception with history information (short-term memory of past percepts, actions and rewards), the agent can distinguish perceptually aliased states, and can then reliably choose correct actions from them.
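A sketch of such a history-augmented state, assuming for illustration a fixed memory depth (the class and names are hypothetical, not the paper's algorithm):

```python
from collections import deque

# Hypothetical sketch: extend the raw percept with a short window of past
# (action, percept) pairs, so that aliased percepts map to distinct keys.
class HistoryState:
    def __init__(self, depth=2):
        self.memory = deque(maxlen=depth)  # short-term memory

    def observe(self, action, percept):
        self.memory.append((action, percept))
        return tuple(self.memory)  # key used in place of the raw percept

# Two trajectories ending in the same percept "corridor" remain
# distinguishable because they arrived there by different actions.
a, b = HistoryState(), HistoryState()
key_a = a.observe("left", "corridor")
key_b = b.observe("right", "corridor")
print(key_a != key_b)  # True
```

The instance-based approach the paper introduces builds the needed depth on-line rather than fixing it in advance; the next paragraph explains why a fixed depth is problematic.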

Predefined, fixed memory representations such as order n Markov models (also known as constant-sized perception windows, linear traces or tapped-delay lines) are often undesirable. When the length of the window is more than needed, they exponentially increase the number of internal states for which a policy must be stored and learned; when the length of the memory is less than needed, the agent reverts to the disadvantages of undistinguished hidden state. Even if the agent designer understands the task well enough to know its maximal memory requirements, the agent is at a disadvantage with constant-sized windows because, for most tasks, different amounts of memory are needed at different steps of the task.
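The exponential cost of an over-long fixed window is easy to quantify; the counts below are illustrative numbers, not figures from the paper:

```python
# An order-n window over |P| percepts and |A| actions yields
# (|P| * |A|) ** n composite states, each needing its own policy entry,
# whether or not the extra memory is ever needed.
def window_state_count(n_percepts, n_actions, window_length):
    return (n_percepts * n_actions) ** window_length

print(window_state_count(10, 4, 1))  # 40
print(window_state_count(10, 4, 3))  # 64000 -- window of 3 where 1 suffices
```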

The on-line memory creation approach has been adopted in several reinforcement learning algorithms. The Perceptual Distinctions Approach [Chrisman, 1992] and Utile Distinction Memory [McCallum, 1993] are both based on splitting states of a finite state machine by doing off-line analysis of statistics gathered over many steps. Recurrent-Q [Lin, 1993] is based on training recurrent neural networks. Indexed Memory [Teller, 1994] uses genetic programming to evolve agents that use load and store instructions on a register bank. A chief disadvantage of all these techniques is that they require a very large number of steps for training.
