NeurIPS 2020

Preference-based Reinforcement Learning with Finite-Time Guarantees

Meta Review

This paper generated considerable discussion among the reviewers. On the positive side, this paper makes a solid contribution to the emerging literature on preference-based RL (PBRL), a topic of some importance, and offers some interesting insights (e.g., on the potential lack of a “winning policy”) as well as novel algorithmic contributions. Conversely, some reviewers raised issues with some of the assumptions made in the paper and with the presentation (which seems to assume familiarity with PBRL and its motivations/rationale). The author response was thoughtful and generated some discussion (some of which is not reflected in the reviews, a couple of which unfortunately failed to get updated). On my own reading of the paper, I agree that it makes a useful contribution to PBRL, especially from a technical and conceptual perspective (although I don’t believe it makes PBRL more practical at this stage). Apart from some weaknesses raised in the other reviews, and some limitations (which could lead to extensions in future work), I do have some methodological qualms with the paper. Specifically, the direct application of dueling bandit methods over trajectories to drive comparisons of policies seems methodologically flawed. It is “straightforward” for humans/users to compare trajectories (noise-free or noisily). But to translate such a comparison into a "vote" for the value of a policy that (could have) induced the trajectory seems to be simply a way of directly applying dueling bandits without much additional work. IMO, the reason we ask users to compare trajectories is to get a sense of the reward function. The induced (possibly stochastic) constraints on the reward function can then be directly translated into constraints on the value function. I don’t see a legitimate rationale for wanting to think of these as stochastic samples of "policy comparisons".
From that perspective, Prop 1 seems much less interesting than the paper claims, and is, if not a "formal" consequence, then at least an "intuitive" consequence of social choice/preference aggregation theory. However, this type of approach is a part of some important PBRL frameworks, and as such, I believe it is a worthy contribution to the PBRL literature; the scientific community should judge the value/legitimacy of some of the assumptions made. Hence, I recommend acceptance. The reviewers make many detailed and valuable suggestions, and the author(s) is (are) strongly encouraged to revise the paper to account for them.