Review for NeurIPS paper: Offline Imitation Learning with a Misspecified Simulator

NeurIPS 2020

Review 1

Summary and Contributions: The paper proposes an algorithm that makes learning possible in the real world when only a misspecified simulator is available. The algorithm has two main modules: (1) a horizon-adaptive inverse dynamics model that is trained on real-world expert data, and (2) a policy in the simulator that is trained to imitate the real-world expert. The simulator policy is trained using imitation from observation to generate states similar to those of the real-world expert, and the inverse dynamics model is then used to generate the actions to be taken in the real world.

Strengths: Results show improvements compared to baselines. The changes made to GAILfO improve the IfO results, which could be interesting to investigate on its own. The overall method and the claims seem sound, and the problem being addressed relates to imitation learning and sim-to-real, which is relevant to the NeurIPS community.

Weaknesses: The writing can be improved significantly; the paper is hard to follow. Some examples: Line 45: s_g is used without being defined (the whole paragraph is hard to follow given the amount of information provided up to that point). Line 93: The full name of the algorithm should be given when an abbreviation is used for the first time. Algorithm 1 is vague. Line 161: It is the other way around. Some notation is confusing. For instance, different notations are used for the goal state: s_g, s_{t+k}, s_{t+h}. Line 167 says the horizon is K, but everywhere else the horizon is denoted by H, etc.

Correctness: The claims and the method seem to be correct and the experiments and the baselines make sense.

Clarity: The writing can be improved significantly, as detailed under Weaknesses.

Relation to Prior Work: The authors mention some previous work and how their work relates to it. More papers on sim-to-real and imitation learning could be discussed.

Reproducibility: No

Additional Feedback:

Review 2

Summary and Contributions: The authors propose an improvement on existing approaches to imitation learning of policies for embodied agents. The approach is a hybrid between sim-to-real RL approaches (which require a simulator closely matching the real world) and real-world imitation learning approaches such as GAIL. The general idea of the paper is that a simulator is available but is allowed to have different dynamics from the "real world". In particular, the assumption is that two policies can reach the same goal state from the same starting point within H steps in the real world. The algorithm is tested on OpenAI Gym environments, where both the real world and the simulator are simulations (with different parametrizations).

Strengths: The paper is theoretically well grounded and represents an advance over the state of the art. The empirical evaluation is standard for this type of work (OpenAI Gym / MuJoCo). The proposed algorithm is particularly significant and novel because it tackles a setup that is very important for real-world learning: the existence of some demonstrations for a task and an imperfect model of the environment.

Weaknesses: - The proposed deployment algorithm requires running the simulator at every step of policy execution. This is much more complex than what typical policy learning algorithms do, and could be a serious limitation in real-world deployments. - While we understand that adapting between the different dynamics of two simulators is convenient, the paper would be much stronger if the authors broke out of the world of MuJoCo and tried their ideas in the real world.

Correctness: As far as I can tell, the claims, method and evaluation approach are correct.

Clarity: Overall, the paper is very well written. Some comments are below: - In the introduction, s_g and s_t are used without definition. - Algorithm 1 does not specify what to do with the different actions returned by the K pi_HID policies. One needs to look into Section 3.2 to see that what the authors do is to "select the action that is closest to the mean value". This is not at all obvious.
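To make the selection rule quoted above concrete, here is a minimal sketch of one way to read "select the action that is closest to the mean value". It is an illustration only, not the authors' implementation: the function name `select_closest_to_mean`, the use of Euclidean distance, and the representation of actions as lists of floats are all assumptions.

```python
import math


def select_closest_to_mean(actions):
    """Return the candidate action nearest to the ensemble mean.

    `actions` holds one action vector per ensemble member (K in total).
    ASSUMPTION: closeness is measured with Euclidean distance; the review
    does not say which metric the paper uses.
    """
    if not actions:
        raise ValueError("need at least one candidate action")
    dim = len(actions[0])
    # Component-wise mean over the K candidate actions.
    mean = [sum(a[i] for a in actions) / len(actions) for i in range(dim)]
    # Pick an actual ensemble output, not the mean itself.
    return min(actions, key=lambda a: math.dist(a, mean))
```

Under this reading, the executed action is always one that some ensemble member actually proposed, which would distinguish the rule from simply averaging the K outputs.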

Relation to Prior Work: The paper makes it clear how it relates to previous algorithms (e.g., GAILfO), and there is significant novelty.

Reproducibility: Yes

Additional Feedback: I read the author feedback, which does not change my review.

Review 3

Summary and Contributions: The paper presents an approach that leverages a misspecified simulator and a few expert demonstrations to accelerate learning on a system with a limited interaction budget. The key idea is to extend one-step state matching to H-step state matching between the two systems, under the assumption that the state distributions overlap. Convincing experiments on a simulated system are presented to support the idea.

Strengths: Various components of the paper have been explored and studied before. The key contribution of the paper is the idea of horizon-adaptive inverse dynamics (HID), which tries to align states that are H steps apart and uses inverse dynamics to recover actions. The authors show that the multi-step equivalence of HID introduces the right inductive biases to constrain the dynamics mismatch between the two systems and uses the expert data more efficiently. This overcomes some of the shortcomings of a closely related previous method (Christiano et al. [5]) that they compare against. To validate this claim, the authors show in Figure 5 that the distribution induced by HIDIL is very close to the expert distribution.

Weaknesses: - The cornerstone of the paper is H-step state equivalence (HID). This idea isn't novel and has been successfully explored in multiple works: https://arxiv.org/abs/1903.01973, https://arxiv.org/pdf/1910.11956.pdf, etc.

Correctness: The presented results are in simulation, and the parameters chosen to induce dynamics mismatch are not very representative of real-world scenarios. Mismatch in the real world arises from (a) incorrect modeling and (b) unmodelled phenomena. Some phenomena are more catastrophic than others (delay, noise, etc.). The experimental results, while correct, can be improved to strengthen the claims. A real-world result would be ideal.

Clarity: - Figure 1 can be made more effective; it's a little hard to follow. - The last term of Eq. 1 isn't introduced. - It would help to clarify early in the paper the domain in which the expert data is gathered. It becomes clear only quite late in the paper. - Section 4.2 typo: remove 'of'. - Remove '.' at the end of equations.

Relation to Prior Work: - Differences from Christiano et al. [5] are clearly outlined. - The relation to prior work on HID needs more work.

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper studies the problem of offline imitation learning where the simulator is misspecified and a small set of demonstrations from the real environment is provided. Given an observation from the real environment, the simulator is set to that observation and the policy is rolled out for a limited horizon. A goal is picked from the visited states and the inverse dynamics problem is solved. The generated action is executed in the real environment. The overall contributions of the paper are: - Using a multi-step horizon to solve the ID problem, which alleviates simulator misspecification. - Using the uncertainty of the ensemble of HIDs as weights for selecting the next action.
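The deployment procedure summarized above can be sketched roughly as follows. This is a hedged illustration of the reviewer's description, not the paper's algorithm: `deploy_step`, `sim_policy`, `sim_step`, and `inverse_dynamics` are hypothetical names, and picking the last visited state as the goal is a placeholder assumption, since the summary only says that a goal is picked from the visited states.

```python
def deploy_step(real_state, sim_policy, sim_step, inverse_dynamics, horizon):
    """Choose one real-world action from a short simulator rollout.

    All callables are hypothetical stand-ins:
      sim_policy(state) -> action taken by the policy trained in the simulator
      sim_step(state, action) -> next simulator state
      inverse_dynamics(state, goal) -> real-world action heading toward goal
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    # Set the simulator to the real observation and roll the policy out.
    state = real_state
    visited = []
    for _ in range(horizon):
        state = sim_step(state, sim_policy(state))
        visited.append(state)
    # ASSUMPTION: use the final visited state as the goal; the actual
    # goal-selection rule is not specified in this summary.
    goal = visited[-1]
    # Recover the action to execute in the real environment.
    return inverse_dynamics(real_state, goal)
```

Note that the simulator is queried inside every real-world decision, and that the sketch presumes the simulator can be reset to an arbitrary real observation.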

Strengths: - Using multi-step horizon ID gives considerable improvement on top of GAILfO in continuous control settings with smooth transitions. - Combining an ensemble of models by using the uncertainty of the HIDs as weights is interesting.

Weaknesses: - The multi-step horizon is used rather naively. The authors assume that, H steps into the future, the action will change only mildly, which is the basis for HID. It is not clear to me under which conditions "two policies in different dynamics can reach the same state s_g from any s_t within H steps" is satisfied. - It is assumed that the state of the simulator can be set from observations of the real environment. Besides assuming that the simulator can be arbitrarily modified, I think this ignores the POMDP assumption.

Correctness: The claims are somewhat correct. Line 8 in the abstract is not discussed in the paper, and it is not clear whether the states are fully observable. The empirical methodology is correct, with well-known benchmark results.

Clarity: The paper is somewhat clearly written. I outlined some of the concerns above.

Relation to Prior Work: It is clearly discussed.

Reproducibility: Yes

Additional Feedback: - Could you clarify whether the environment reward in the simulator is used in training the agent? - Line 211: double 'of'. - Line 151: recovers --> recover.