Gregory Grudic, Lyle Ungar
We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, Q. Theory is presented showing that linear function approximation representations of Q can degrade the rate of convergence of performance gradient estimates, relative to when no function approximation of Q is used, by a factor that grows with both M, the number of possible actions, and L, the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by a factor that depends on M, the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
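To make the role of the bias term concrete, the following is a minimal sketch (our own toy construction, not the paper's algorithm or experiments): a single-state, M-action softmax policy in which REINFORCE-style performance gradient estimates are formed from sampled returns minus a bias term. The true action values `Q`, the noise model, and the choice of bias are illustrative assumptions; the sketch only shows that a well-chosen non-zero bias reduces the variance of the gradient estimates, which is the mechanism behind faster convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                  # number of possible actions
theta = np.zeros(M)                    # softmax policy parameters (toy)
Q = np.array([1.0, 2.0, 0.5, 1.5])     # assumed true action values (toy)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_estimates(bias, n_samples=5000):
    """Monte Carlo policy-gradient samples using (return - bias) * score."""
    pi = softmax(theta)
    grads = np.empty((n_samples, M))
    for i in range(n_samples):
        a = rng.choice(M, p=pi)        # sample an action from the policy
        r = Q[a] + rng.normal()        # noisy sampled return for that action
        score = -pi.copy()             # d log pi(a) / d theta for softmax:
        score[a] += 1.0                #   e_a - pi
        grads[i] = (r - bias) * score
    return grads

g_zero = grad_estimates(bias=0.0)          # no bias term
g_bias = grad_estimates(bias=Q.mean())     # non-zero bias near the mean value
print("total variance, zero bias    :", g_zero.var(axis=0).sum())
print("total variance, non-zero bias:", g_bias.var(axis=0).sum())
```

Both estimators have the same expectation (the score function has zero mean under the policy, so subtracting a constant bias does not change the estimated gradient), but the non-zero bias yields a markedly smaller sample variance, so fewer samples are needed for a gradient estimate of a given accuracy.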