NeurIPS 2020

On Efficiency in Hierarchical Reinforcement Learning


Meta Review

Quoting from the reviewers:

R1: The paper presents a novel framework for analyzing potential efficiencies in reinforcement learning due to hierarchical structure in MDPs. This framework formally defines several useful concepts (subMDPs, equivalent subMDPs, exit states, and exit profiles) that allow for an elegant refinement of regret bounds in a well-defined regime. The identification of particular properties (subMDPs, exit state set, and equivalence of subMDPs) provides a clear and useful framework for theoretical analysis of hierarchical reinforcement learning. Overall, this paper provides an elegant, concrete framework for formalizing hierarchical structure and quantifying the efficiency such structure may allow. The paper provides a theoretical analysis of hierarchical reinforcement learning, deriving results on learning and planning efficiency when the reinforcement learning problem has repeated structure. The analysis is based on a decomposition of the base MDP into sub-MDPs using state partitions, capturing structure that is repeated exactly in multiple parts of the base MDP.

R3: There are two results. First, the authors extend an earlier regret bound by Osband et al. (2013) and show the reduction in the bound possible through the repeated hierarchical structure in the MDP. This analysis is based on the algorithm PSRL (posterior sampling for reinforcement learning). Second, the authors analyze planning with options that are generated based on the repeated structure in the MDP. This analysis is based on Value Iteration and assumes that the state transition graph corresponding to an optimal policy is acyclic. The authors provide a bound on the quality of the solution found based on the quality of the options (more specifically, the exit profiles that define the options). They also provide sufficient conditions for high-quality exit profiles. The paper formalises some of the benefits of hierarchical reinforcement learning, showing the precise impact of repeated structure on learning and planning efficiency. I found it useful and enjoyed reading the paper. The analysis can be a foundation for further work in the area, including new approaches to option discovery.
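To make the option-level planning idea discussed by R3 concrete, the following is a minimal Python sketch, not taken from the paper: value iteration carried out only over the entry states of sub-MDPs, where each option summarizes a sub-MDP by its expected internal reward and a distribution over the exit states it terminates in. The names `ExitOption` and `plan_over_exits`, and the simplification of folding the option's duration discounting into the exit distribution, are illustrative assumptions rather than the authors' construction.

```python
from dataclasses import dataclass

@dataclass
class ExitOption:
    """An option that runs inside one sub-MDP until it reaches an exit state.

    expected_reward: entry_state -> expected discounted reward accumulated
                     while executing the option inside the sub-MDP.
    exit_distribution: entry_state -> {exit_state: probability of leaving there}.
    (Duration discounting is folded into exit_distribution for brevity.)
    """
    expected_reward: dict
    exit_distribution: dict

def plan_over_exits(entry_states, options, gamma=0.95, iters=500, tol=1e-8):
    """SMDP-style value iteration restricted to entry/exit states,
    treating each option as a single temporally extended action."""
    V = {s: 0.0 for s in entry_states}
    for _ in range(iters):
        delta = 0.0
        for s in entry_states:
            # Backup: option's internal reward plus discounted value at its exits.
            best = max(
                opt.expected_reward[s]
                + gamma * sum(p * V.get(x, 0.0)
                              for x, p in opt.exit_distribution[s].items())
                for opt in options if s in opt.expected_reward
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return V

# Tiny usage example with two hypothetical options from a single entry state "a":
opts = [
    ExitOption({"a": 1.0}, {"a": {"b": 1.0}}),   # safe option, exits at "b"
    ExitOption({"a": 0.5}, {"a": {"a": 1.0}}),   # option that re-enters "a"
]
print(plan_over_exits(["a", "b"], opts))
```

Because the high-level backup only touches entry and exit states, the planning loop scales with the size of the exit structure rather than with the full state space, which is the kind of efficiency gain the analysis quantifies.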