On the Sample Complexity Bounds of Bilevel Reinforcement Learning

Mudit Gaur, Utsav Singh, Amrit Singh Bedi, Raghu Pasupathy, Vaneet Aggarwal

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) Main Conference Track

Bilevel reinforcement learning (BRL) has emerged as a powerful framework for aligning generative models, yet its theoretical foundations, especially sample complexity bounds, remain relatively underexplored. In this work, we present the first sample complexity bound for BRL, establishing a rate of $\tilde{\mathcal{O}}(\epsilon^{-3})$ in continuous state-action spaces. Traditional MDP analysis techniques do not extend to BRL due to its nested structure and non-convex lower-level problem. We overcome these challenges by leveraging the Polyak-Ɓojasiewicz (PL) condition and the MDP structure to obtain closed-form gradients, enabling a tight sample complexity analysis. Our analysis also extends to general bilevel optimization settings with non-convex lower levels, where we achieve a state-of-the-art sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$, improving upon the existing bound of $\tilde{\mathcal{O}}(\epsilon^{-6})$. Additionally, we address the computational bottleneck of hypergradient estimation by proposing a fully first-order, Hessian-free algorithm suitable for large-scale problems.