#### Authors

Evan Greensmith, Peter Bartlett, Jonathan Baxter

#### Abstract

We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learn- ing problems. The ﬁrst approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sam- ple paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the vari- ance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the op- timal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary exper- iments that illustrate the performance improvements on a simple control problem.

1 Introduction, Background, and Preliminary Results

In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially ob- servable Markov decision process (POMDP). Gradient ascent methods (e.g., [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal-

(cid:3)Most of this work was performed while the authors were with the Research School of Information

Sciences and Engineering at the Australian National University.

culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good esti- mate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investi- gate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f : X ! R, and we know the

integral of another function ’ : X ! R. Since RX f = RX (f (cid:0) ’) +RX ’, the integral

of f (cid:0) ’ can be estimated instead. Obviously if ’ = f then the variance is zero. More generally, Var(f (cid:0) ’) = Var(f ) (cid:0) 2Cov(f; ’) + Var(’), so that if (cid:30) and f are strongly correlated, the variance of the estimate is reduced.

In this paper, we consider two approaches of this form. The ﬁrst (Section 2) is the technique of adding a baseline. We ﬁnd the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice.

The second approach (Section 3) is the use of an approximate value function. Such actor- critic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the ﬁxed value function and the true value function. In this paper we show that this may not be the best approach: selecting the ﬁxed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a ﬁxed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a ﬁxed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a ﬁxed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements.

POMDP with Reactive, Parameterized Policy

A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices fP(u) : u 2 Ug, each with components pij (u) for i; j 2 S; u 2 U, an observation pro- cess (cid:23) : S ! PY , where PY is the space of probability distributions over Y, and a reward function r : S ! R. We assume that S; U; Y are ﬁnite, although all our re- sults extend easily to inﬁnite U and Y, and with more restrictive assumptions can be extended to inﬁnite S. A reactive, parameterized policy for a POMDP is a set of map- pings f(cid:22)((cid:1); (cid:18)) : Y ! PU j(cid:18) 2 RKg. Together with the POMDP, this deﬁnes the con- trolled POMDP (S; U; Y; P ; (cid:23); r; (cid:22)). The joint state, observation and control process, fXt; Yt; Utg, is Markov. The state process, fXtg, is also Markov, with transition prob-

abilities pij ((cid:18)) = Py2Y;u2U (cid:23)y(i)(cid:22)u(y; (cid:18))pij (u), where (cid:23)y(i) denotes the probability of

observation y given the state i, and (cid:22)u(y; (cid:18)) denotes the probability of action u given pa- rameters (cid:18) and observation y. The Markov chain M((cid:18)) = (S; P((cid:18))) then describes the behaviour of the process fXtg.

Assumption 1 The controlled POMDP (S; U; Y; P ; (cid:23); r; (cid:22)) satisﬁes: For all (cid:18) 2 RK there exists a unique stationary distribution satisfying (cid:25) 0((cid:18)) P((cid:18)) = (cid:25)0((cid:18)). There is an R < 1 such that, for all i 2 S, jr(i)j (cid:20) R. There is a B < 1 such that, for all u 2 U, y 2 Y and (cid:18) 2 RK the deriva- tives @(cid:22)u(y; (cid:18))=@(cid:18)k (1 (cid:20) k (cid:20) K) exist, and the vector of these derivatives satisﬁes kr(cid:22)u(y; (cid:18))=(cid:22)u(y; (cid:18))k (cid:20) B, where k (cid:1) k denotes the Euclidean norm on RK.

implies that this limit exists, and does not depend on the start state X0. The aim is to select a policy to maximize this quantity. Deﬁne the discounted value function, J(cid:12)(i; (cid:18)) def=

We consider the average reward, (cid:17)((cid:18)) def= limT !1 Eh 1 t=0 r(Xt)i. Assumption 1 X0 = i i. Throughout the rest of the paper, dependences limT !1 EhPT (cid:0)1 Var(A) = Eh(A (cid:0) E [A])2i, where a2 denotes a0a, and a0 denotes the transpose of the

upon (cid:18) are assumed, and dropped in the notation. For a random vector A, we denote

t=0 (cid:12)tr(Xt)(cid:12)(cid:12)(cid:12)