Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann
We revisit the identification of an ε-optimal policy in average-reward Markov Decision Processes (MDPs). In such MDPs, two measures of complexity have appeared in the literature: the diameter, D, and the optimal bias span, H, which satisfy H ≤ D. Prior work has studied the complexity of ε-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with D ≃ H for which the sample complexity to output an ε-optimal policy is Ω(SAD/ε²), where S and A are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order SAH/ε² has been proposed, but it requires the knowledge of H. We first show that the sample complexity required to estimate H is not bounded by any function of S, A and H, ruling out the possibility of easily making the previous algorithm agnostic to H. By relying instead on a diameter estimation procedure, we propose the first algorithm for (ε,δ)-PAC policy identification that does not need any form of prior knowledge on the MDP. Its sample complexity scales as SAD/ε² in the regime of small ε, which is near-optimal. In the online setting, our first contribution is a lower bound implying that a sample complexity polynomial in H cannot be achieved. Then, we propose an online algorithm with a sample complexity of order SAD²/ε², as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound.
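For context (this recall is not part of the abstract itself), the two complexity measures are usually defined as follows, using standard average-reward MDP notation with transition kernel p, reward r, optimal gain g* and bias h*, which we assume here:

\[
D = \max_{s \neq s'} \min_{\pi} \mathbb{E}^{\pi}\!\left[\tau_{s'} \mid s_1 = s\right],
\qquad
H = \mathrm{sp}(h^\star) = \max_{s} h^\star(s) - \min_{s} h^\star(s),
\]

where \tau_{s'} denotes the first hitting time of state s' and (g^\star, h^\star) solves the Bellman optimality equation g^\star + h^\star(s) = \max_{a} \{ r(s,a) + \sum_{s'} p(s' \mid s, a)\, h^\star(s') \}. Under the common assumption of rewards in [0, 1], one has H ≤ D, which is the inequality quoted in the abstract and the reason bounds in H are the more ambitious target.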