NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 6278 Meta-Inverse Reinforcement Learning with Probabilistic Context Variables

### Reviewer 1

The paper identifies the unsolved problem of meta-inverse reinforcement learning (meta-IRL): learning a reward function for an unseen task from a single expert trajectory for that task, using a batch of expert trajectories for different but related tasks as training data. The task being solved by each training trajectory is not communicated to the learning algorithm. Because IRL is used rather than imitation learning, a reward function is learned for each task (or rather, a single reward function parameterized by the latent variable m, which is supposed to capture the task).

The paper then formulates a framework for training neural networks to solve this problem, building on past work on Adversarial IRL (AIRL) and adding latent task variables to handle the variation across tasks. A network q_\psi is used to infer the task variable from a demonstration. The proposed objective includes a term that encourages high mutual information between the latent task variable and the trajectories induced by the reward function. The authors also derive a tractable sample-based method for estimating gradients of this objective.

Overall I enjoyed this paper. It feels polished and complete, and the writing is quite good aside from a few awkward sentences. The problem tackled and the proposed solution are moderately novel, being a reasonably straightforward combination of past work on meta-learning and inverse reinforcement learning. The significance is high given the popularity of meta-learning and IRL.

One thing I wonder about is the importance of the mutual information term. On line 114 the paper states: "Without further constraints over m, directly applying AIRL to learn the reward function could simply ignore the context variable m. So we need to explicitly maximize the “correlation” between the reward function and the latent variable m." Has this been empirically verified?
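For reference, a mutual information term of this kind is commonly made tractable through a standard variational lower bound. The sketch below is the generic form of that bound, written with the q_\psi and p_\theta(\tau | m) notation used in this review; the prior p(m) is an assumption introduced here for illustration, and the expression is not necessarily the exact term used in the paper.

```latex
% Generic variational lower bound on mutual information (illustrative only).
% p(m) is an assumed prior over the latent task variable.
\begin{align}
I(m; \tau) &= H(m) - H(m \mid \tau) \\
           &\ge H(m) + \mathbb{E}_{m \sim p(m),\ \tau \sim p_\theta(\tau \mid m)}
              \left[ \log q_\psi(m \mid \tau) \right]
\end{align}
```

The inequality follows because the KL divergence between the true posterior p(m | \tau) and q_\psi(m | \tau) is non-negative, so the bound is tight only when q_\psi matches the true posterior.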
It would seem to be in the discriminator's interest to make use of m, since doing so makes life harder for the generator, and if the discriminator uses m then the generator would have to as well. It therefore seems possible to omit the mutual information term, which would simplify the framework somewhat. Gradients would then, of course, have to be backpropagated through q_\psi when training both the generator and the discriminator. This starts to look like a variational autoencoder for trajectories, with the GAN taking the place of the VAE decoder. It might be interesting to test against this kind of architecture to verify the contribution of the mutual information term.

The experimental methodology seems sound for the most part. In some ways the baselines seem a bit weak; for example, none of them is really equipped to solve the manipulation performed in Section 5.2. But assuming this really is the first method to tackle non-tabular meta-IRL, there is little other choice.

I have a few misgivings about the coverage of the experiments. One is the focus on deterministic environments. I would be interested to see performance on a stochastic environment, since this would make expert trajectories less informative about the goal that the expert is attempting to reach, and might make one-shot imitation/IRL much more difficult; also, Lemma 1 and Theorem 1 both assume deterministic environments. In that setting I would also be interested to see whether, for a single test task, performance improves as more expert trajectories are observed for that task.

Another worry I have is that the generalization abilities required by the tested environments are fairly minimal. For example, in point mass, the tasks seen at test time are presumably quite similar to some of the tasks seen at training time, and this applies to the other environments as well. In the Ant environment it seems no generalization ability is required at all, since there are only two tasks.
It would be interesting to evaluate this method in more realistic environments where generalization can be tested more thoroughly.

Finally, I'm not completely sure about this but I think the dimensionality of the latent variables was set equal to the true dimensionality of the task space (i.e. 2 dimensions for all envs tested). What happens to performance if this dimension is misspecified?

Typos:

- Eq 6: should be \tau \sim p_\theta(\tau | m), rather than \tau \sim p(\tau | m, \theta). (See also eq 13 in supplemental and several other places.)
- Eq 9: (\tau) should not be in subscript.
- Line 224: imitation
- Line 274: imitation
- Supp line 214: one of these should be "meta-testing"
- Supp line 426: missing a word, or added an "and"