Improved Switching among Temporally Abstract Actions

Richard S. Sutton, Satinder Singh, Doina Precup, Balaraman Ravindran

Part of Advances in Neural Information Processing Systems 11 (NIPS 1998)
In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs.
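For orientation, the planning referred to in the abstract treats each controller as an action in a semi-Markov decision process. The corresponding Bellman optimality equation over a set of options \(\mathcal{O}\) (notation following Sutton, Precup & Singh, 1998; restated here as background, not part of the original abstract) is

\[
V^{*}_{\mathcal{O}}(s) \;=\; \max_{o \in \mathcal{O}_s} \; \mathbb{E}\!\left[\, r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k}\, V^{*}_{\mathcal{O}}(s_{t+k}) \,\middle|\, \mathcal{E}(o, s, t) \,\right],
\]

where \(\mathcal{E}(o, s, t)\) denotes the event that option \(o\) is initiated in state \(s\) at time \(t\), \(\mathcal{O}_s\) is the set of options admissible in \(s\), and \(k\) is the (random) duration of \(o\). Planning at this level searches over state-option pairs rather than state-action pairs, which is the source of the complexity reduction cited above.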
In many applications, solutions to parts of a task are known, either because they were hand-crafted by people or because they were previously learned or planned. For example, in robotics applications, there may exist controllers for moving joints to positions, picking up objects, controlling eye movements, or navigating along hallways. More generally, an intelligent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally extended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions.
Recently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (e.g., Singh, 1992; Kaelbling, 1993; Dayan & Hinton, 1993; Thrun & Schwartz, 1995; Sutton, 1995; Dietterich, 1998; Parr & Russell, 1998; McGovern, Sutton & Fagg, 1997). Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) together with a condition for terminating, which together we refer to as an option (Sutton, Precup & Singh, 1998). In this paper we consider the problem of effectively combining given options into one overall policy, generalizing prior work by Kaelbling (1993). Sections 1-3 introduce the framework; our new results are in Sections 4 and 5.
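To make the option construct concrete, the following is a minimal Python sketch of an option as a policy paired with a termination condition, together with a routine that executes it in a generic environment. All names here (Option, env.step, beta, and so on) are illustrative assumptions, not an interface defined in the paper.

    import random
    from dataclasses import dataclass
    from typing import Callable, Set, Tuple

    @dataclass
    class Option:
        """A temporally extended action: a policy plus a termination condition."""
        initiation_set: Set[int]         # states in which the option may be invoked
        policy: Callable[[int], int]     # maps state -> primitive action
        beta: Callable[[int], float]     # probability of terminating in each state

    def execute_option(env, state: int, option: Option,
                       gamma: float = 0.9) -> Tuple[int, float, int]:
        """Run an option until its termination condition fires.

        Returns (next_state, discounted_return, duration). Assumes the
        hypothetical interface env.step(state, action) -> (next_state, reward).
        """
        ret, discount, k = 0.0, 1.0, 0
        while True:
            action = option.policy(state)          # follow the option's policy
            state, reward = env.step(state, action)
            ret += discount * reward
            discount *= gamma
            k += 1
            if random.random() < option.beta(state):  # stochastic termination
                return state, ret, k

An overall policy over options then selects among such objects in the states where they are admissible; the switching question studied in the paper is when terminating an executing option early, rather than waiting for its own termination condition, yields a better overall policy.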