Steady State Analysis of Episodic Reinforcement Learning

Part of Advances in Neural Information Processing Systems 33 (NeurIPS 2020)


Authors

Huang Bojun

Abstract

Reinforcement Learning (RL) tasks generally divide into two kinds: continual learning and episodic learning. The concept of steady state has played a foundational role in the continual setting, where a unique steady-state distribution is typically presumed to exist in the task being studied. This assumption enables a principled conceptual framework as well as efficient data-collection methods for continual RL algorithms. On the other hand, the concept of steady state has been widely considered irrelevant for episodic RL tasks, in which the decision process terminates in finite time. Alternative concepts, such as episode-wise visitation frequency, are used in episodic RL algorithms instead; these are not only inconsistent with their counterparts in continual RL, but also make it harder to design and analyze RL algorithms in the episodic setting.

In this paper we prove that unique steady-state distributions pervasively exist in the learning environments of episodic learning tasks, and that the marginal distribution of the system state indeed approaches the steady state in essentially all episodic tasks. This observation supports an interesting reversal of conventional wisdom: while steady states are traditionally presumed to exist in continual learning and considered less relevant in episodic learning, it turns out they are guaranteed to exist for the latter under any behavior policy. We further develop connections between important concepts that have been treated separately in episodic and continual RL. On the practical side, the existence of a unique and approachable steady state implies a general, reliable, and efficient way to collect data in episodic RL algorithms. We apply this method to policy gradient algorithms, based on a new steady-state policy gradient theorem. We also propose and experimentally evaluate a perturbation method that enforces faster convergence to the steady state in real-world episodic RL tasks.
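
To make the convergence claim concrete, the following is a minimal numerical sketch (not the paper's code): a small episodic chain under a fixed behavior policy is stitched into a continual chain by resetting terminal transitions to the initial-state distribution, and the per-step state marginals are then seen to converge to that chain's unique stationary distribution. The environment, behavior policy, and all parameters below are hypothetical choices made purely for illustration.

```python
# Sketch: an episodic MDP augmented with a reset from the terminal state back to
# the initial-state distribution forms a single continual Markov chain; under a
# fixed behavior policy this chain has a unique stationary distribution, and the
# per-step state marginals converge to it.
import numpy as np

n_states = 5                                       # states 0..3 ordinary, state 4 "terminal"
init_dist = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # episodes start in state 0

# Transition matrix under a fixed behavior policy (rows sum to 1).
# From each ordinary state: stay with prob 0.4, move right with prob 0.6
# (moving right from state 3 terminates the episode).
P = np.zeros((n_states, n_states))
for s in range(4):
    P[s, s] = 0.4
    P[s, s + 1] = 0.6
# Reset transition: after termination, the next state is drawn from init_dist,
# which stitches episodes into one continual chain (the "learning environment").
P[4, :] = init_dist

# Unique stationary distribution: the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
stat = np.abs(np.real(eigvecs[:, np.argmax(np.real(eigvals))]))
stat /= stat.sum()

# Per-step marginal distributions approach the stationary distribution.
mu = init_dist.copy()
for t in range(200):
    mu = mu @ P
print("stationary distribution:", np.round(stat, 4))
print("marginal at t = 200:    ", np.round(mu, 4))
print("total variation gap:    ", 0.5 * np.abs(mu - stat).sum())
```

Because the reset makes the chain irreducible and the self-loops make it aperiodic, the total variation gap printed at the end is numerically zero; sampling states from the stitched chain therefore draws from the same steady-state distribution regardless of where within an episode the samples fall, which is the data-collection property exploited above.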