Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)

*Daniela Farias, Benjamin Roy*

We introduce a new algorithm based on linear programming that approximates the differential value function of an average-cost Markov decision process via a linear combination of pre-selected basis functions. The algorithm carries out a form of cost shaping and minimizes a version of Bellman error. We establish an error bound that scales gracefully with the number of states without imposing the (strong) Lyapunov condition required by its counter- part in [6]. We propose a path-following method that automates selection of important algorithm parameters which represent coun- terparts to the "state-relevance weights" studied in [6].

1 Introduction

Over the past few years, there has been a growing interest in linear programming (LP) approaches to approximate dynamic programming (DP). These approaches offer algorithms for computing weights to fit a linear combination of pre-selected basis functions to a dynamic programming value function. A control policy that is "greedy" with respect to the resulting approximation is then used to make real-time decisions.

Empirically, LP approaches appear to generate effective control policies for high- dimensional dynamic programs [1, 6, 11, 15, 16]. At the same time, the strength and clarity of theoretical results about such algorithms have overtaken counterparts available for alternatives such as approximate value iteration, approximate policy iteration, and temporal-difference methods. As an example, a result in [6] implies that, for a discrete-time finite-state Markov decision process (MDP), if the span of the basis functions contains the constant function and comes within a distance of of the dynamic programming value function then the approximation generated by a certain LP will come within a distance of O( ). Here, the coefficient of the O( ) term depends on the discount factor and the metric used for measuring distance, but not on the choice of basis functions. On the other hand, the strongest results available for approximate value iteration and approximate policy iteration only promise O( ) error under additional requirements on iterates generated in the course of executing

the algorithms [3, 13]. In fact, it has been shown that, even when = 0, approximate value iteration can generate a diverging sequence of approximations [2, 5, 10, 14].

In this paper, we propose a new LP for approximating optimal policies. We work with a formulation involving average cost optimization of a possibly infinite-state MDP. The fact that we work with this more sophisticated formulation is itself a contribution to the literature on LP approaches to approximate DP, which have been studied for the most part in finite-state discounted-cost settings. But we view as our primary contributions the proposed algorithms and theoretical results, which strengthen in important ways previous results on LP approaches and unify certain ideas in the approximate DP literature. In particular, highlights of our contributions include:

```
1. Relaxed Lyapunov Function dependence. Results in [6] suggest that in order for the LP approach presented there to scale gracefully to large problems a certain linear combination of the basis functions must be a "Lyapunov function," satisfying a certain strong Lyapunov condition. The method and results in our current paper eliminate this requirement. Further, the error bound is strengthened because it alleviates an undesirable dependence on the Lyapunov function that appears in [6] even when the Lyapunov condition is satisfied.
2. Restart Distribution Selection. Applying the LP studied in [6] requires manual selection of a set of parameters called state-relevance weights. That paper illustrated the importance of a good choice and provided intuition on how one might go about making the choice. The LP in the current paper does not explicitly make use of state-relevance weights, but rather, an analog which we call a restart distribution, and we propose an automated method for finding a desirable restart distribution.
3. Relation to Bellman-Error Minimization. An alternative approach for approximate DP aims at minimizing "Bellman error" (this idea was first suggested in [16]). Methods proposed for this (e.g., [4, 12]) involve stochastic steepest descent of a complex nonlinear function. There are no results indicating whether a global minimum will be reached or guaranteeing that a local minimum attained will exhibit desirable behavior. In this paper, we explain how the LP we propose can be thought of as a method for minimizing a version of Bellman error. The important differences here are that our method involves solving a linear rather than a nonlinear (and nonconvex) program and that there are performance guarantees that can be made for the outcome.
```

The next section introduces the problem formulation we will be working with. Sec- tion 3 presents the LP approximation algorithm and an error bound. In Section 4, we propose a method for computing a desirable reset distribution. The LP approx- imation algorithm works with a perturbed version of the MDP. Errors introduced by this perturbation are studied in Section 5. A closing section discusses relations to our prior work on LP approaches to approximate DP [6, 8].

2 Problem Formulation and Perturbation Via Restart

Consider an MDP with a countable state space S and a finite set of actions A available at each state. Under a control policy u : S A, the system dynamics are defined by a transition probability matrix Pu |S||S|, where for policies u and u and states x and y, (Pu)xy = (Pu)xy if u(x) = u(x). We will assume

that, under each policy u, the system has a unique invariant distribution, given by u(x) = limt(P tu)yx, for all x, y S. A cost g(x, a) is associated with each state-action pair (x, a). For shorthand, given any policy u, we let gu(x) = g(x, u(x)). We consider the problem of computing a policy that minimizes the average cost u = Tu gu. Let = minu u and define the differential value function h(x) = minu limT Eux[ T (g t=0 u(xt) - )]. Here, the superscript u of the expectation operator denotes the control policy and the subscript x denotes conditioning on x0 = x. It is easy to show that there exists a policy u that simultaneously minimizes the expectation for every x. Further, a policy u is optimal if and only if u(x) arg minu(g(x, a) + (P y u)xy h(y)) for all x S.

While in principle h can be computed exactly by dynamic programming algorithms, this is often infeasible due to the curse of dimensionality. We consider approximating h using a linear combination K r k=1 kk of fixed basis functions 1, . . . , K : S . In this paper, we propose and analyze an algorithm for computing weights r K to approximate: h K k=1 k(x)rk. It is useful to define a matrix |S|K so that our approximation to h can be written as r.

The algorithm we will propose operates on a perturbed version of the MDP. The nature of the perturbation is influenced by two parameters: a restart probability (1 - ) [0, 1] and a restart distribution c over the state space. We refer to the new system as an (, c)-perturbed MDP. It evolves similarly with the original MDP, except that at each time, the state process restarts with probability 1 - ; in this event, the next state is sampled randomly according to c. Hence, the perturbed MDP has the same state space, action space, and cost function as the original one, but the transition matrix under each policy u are given by P,u = Pu + (1 - )ecT .

We define some notation that will streamline our discussion and analysis of per- turbed MDPs. Let ,u(x) = limt(P t,u)yx, ,u = T,ugu, = minu ,u, and let h be the differential value function for the (, c)-perturbed MDP, and let u be a policy satisfying u(x) arg minu(g(x, a) + (P y ,u)xy h (y)) for all x S . Finally, we will make use of dynamic programming operators T,uh = gu + P,uh and Th = minu T,uh.

Do not remove: This comment is monitored to verify that the site is working properly