NeurIPS 2020

Causal Imitation Learning With Unobserved Confounders


Meta Review

Summary: This paper studies the feasibility of imitation learning for decision making from a causal perspective. The paper considers a very general setting with possibly unobserved confounders, where the expert and the imitating policy can have different inputs and the reward is unobserved. The work presents multiple criteria for ensuring successful imitation, in particular criteria based on proxy variables for the task reward.

Meta review: I think this is a very important paper, and I want to recommend it for an oral presentation, despite the review scores of 5, 7, 7, 6, for the following reasons:

- Imitation learning is an essential method in reinforcement learning, used e.g. when expert demonstrations are available but the utility function is difficult to specify explicitly (often the case for complex tasks in robotics, or in a multi-agent / social learning context), or even as a sub-routine of other RL approaches (getting learning off the ground faster [Mastering the game of Go with deep neural networks and tree search, Nature] or kick-starting hard-exploration problems).

- Imitation learning can fail, and its failure cases have not been characterized prior to this work; furthermore, as far as I know, there is little awareness in the applied RL community that imitation learning is not guaranteed to work. This work can raise awareness of this fundamental problem.

- The paper presents multiple, clearly formulated sufficient conditions for imitation learning to succeed. These criteria represent a large advance in our understanding of how to identify situations and assumptions sufficient for imitation learning. I therefore expect this paper to have a large impact in the NeurIPS community.

All reviewers (apart from Rev 1; reasons for discounting that review are given below) agree that the paper addresses an important topic and that it gives valuable insights into possible solutions. The main reservations, voiced by Rev 1 and to some extent by Revs 3 and 4, that the "assumptions are too strong", that "practical application" could be limited, and that there could be issues with "limited scalability", have to be weighed against the current practice of imitation learning as applied in RL, where its feasibility (and the necessary assumptions) is not studied at all and is replaced by the hope that it will just work. The authors convincingly show with a small example that without any assumptions imitation learning can fail, and that the only way to make progress is to formulate sensible assumptions. I therefore recommend setting this criticism aside and, accordingly, heavily discounting the score from Rev 1.