Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Post-rebuttal update: I appreciate the additional explanation of why overshooting is needed in empirical methods, and the clarity of the response regarding stochastic models. My issue was with Sec 2.2's claim that next-step prediction is insufficient to produce belief states, which is only an issue of approximation error when dealing with empirical results. This is not clearly explained in the paper, but it is clarified much more nicely in the rebuttal. Resolving this misunderstanding would cause me to raise my score from a 3 to a 4, but I still do not find this paper worthy of acceptance. I do not think the insights are particularly surprising, and it seems the sole merit of this paper is an empirical one, impressive because of performance on complex tasks. In that case, I strongly believe that those environments and their method should be open-sourced. Getting image-based, model-based RL working is not a trivial task: the PlaNet paper relies on many tricks to reach its final performance, its results would be completely non-reproducible without them, and its code is open-sourced. Without seeing those insights and being able to build on top of the results, this paper is not useful to practitioners and other researchers.
--------------------------------------
While the authors present a detailed list of related work in model-based reinforcement learning, it is unclear what is novel in this work and what the message is. They experiment with various techniques introduced in prior work, such as self-supervised losses, overshooting, and memory architectures, and report performance results in a handful of environments.
Originality: New environments are presented, but the paper uses existing techniques to perform a survey over combinations already shown in other work.
Quality: The submission does not provide theoretical justification for its claims, but it does have significant results showing the spread in performance across a variety of loss functions, overshoot lengths, and architectures.
Clarity: The introduction is clear about the hypothesis being presented, but the experimental results are harder to follow.
Significance: This paper has low significance. It tests the hypothesis that stochastic generative models benefit more from overshooting than deterministic ones, while failing to theoretically establish that next-step predictive models are insufficient for learning belief states. Only empirical results are presented showing that overshooting increases performance, but this does not necessarily mean that next-step prediction is insufficient for forming belief states, since increasing the overshoot length also increases the amount of supervision and the number of prediction targets.
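The confound raised here (a longer overshoot horizon supervises more targets per trajectory) can be made concrete with a quick count. This is a purely illustrative sketch; the names `T` and `H` are my own, not notation from the paper.

```python
def num_targets(T: int, H: int) -> int:
    """Number of prediction targets for a trajectory of length T when,
    from each start time t, the model predicts steps t+1 .. t+H
    (truncated at the end of the trajectory)."""
    return sum(min(H, T - 1 - t) for t in range(T - 1))

T = 50
print(num_targets(T, H=1))  # next-step prediction: 49 targets
print(num_targets(T, H=5))  # 5-step overshooting: 235 targets, ~5x more supervision
```

So a performance gain from overshooting could simply reflect the extra supervision rather than any inadequacy of next-step prediction.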
This paper provides an interesting comparison between different methodological approaches and nicely addresses some interesting theoretical issues associated with model conditioning for POMDPs. Many of the technical details are original and the presented work seems theoretically sound. However, the clarity of the paper suffers from the amount of material covered, and the whole could be better organised: the description of the contribution shifted throughout the paper. After some re-organisation of the document, I think this will make a solid contribution to the community. Below I highlight specific issues and points of clarification.
In the last line of the abstract, the authors say "In practice, using an expressive generative model in RL is computationally expensive and propose a scheme to reduce this computational burden allowing us to build agents that are competitive with model-free baselines". It is not clear where this is demonstrated; please clarify.
The authors confront the fact that they do not consider planning algorithms, and a summary of the work is "how good is the belief state at enabling standard RL algorithms". This seems fine, as the focus is on the beliefs formed by the model, not action per se. However, the comparison to [?] is important because the focus on decoding from the belief state is covered in [?]. Could the authors say a little more about the work's relationship to [?]?
A belief state is defined as the sufficient statistics of future states. This seems ambiguous and potentially misleading: is it necessarily over future states, or is it the sufficient statistics of the current state? Is it not a probability distribution over world states? They cite [9,10] saying that a belief state is a vector representation that is sufficient to predict future observations. Again, the original statement is potentially misleading. The difference between belief state and state matters here: i.e., SimCore starts with a belief state and then predicts a state at some future time.
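One way to pin down the ambiguity raised above is the standard POMDP filtering definition (my assumption of the intended meaning, not necessarily the paper's usage): the belief is a distribution over the *current* hidden state, and it is a sufficient statistic for predicting *future* observations.

```latex
% Belief as the posterior over the current hidden state:
b_t(s) \;=\; p(s_t = s \mid o_{1:t}, a_{1:t-1}),
% which is sufficient for the future: for any k \ge 1,
p(o_{t+k} \mid o_{1:t}, a_{1:t+k-1})
  \;=\; \sum_{s} p(o_{t+k} \mid s_t = s,\, a_{t:t+k-1})\, b_t(s).
```

Under this reading, "sufficient statistics of future states" conflates the object (a distribution over the current state) with its role (predicting the future), which may be the source of the confusion.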
In the table this is cleared up somewhat, where they say that state initialization is done such that s = b. But why the difference? Some intuition as to why ConvDRAW + GECO solves conditioning would be nice.
I guess the fact that a map can be reconstructed from the LSTM state is not hugely surprising. The authors note that the contrastive loss is poor at mapping but good at localisation. I think the theme emerging here is the tension between local and global predictions; could the authors comment on this?
The agent is trained using IMPALA, a policy-gradient method. This is, presumably, for the agent core, not the simulation core. The authors state that running speed decreased 20-40% compared to an agent without a model. Why would this be the case? Surely the model is more of a computational burden.
It seems pretty obvious why map construction should fail when there is a task, because the environment is only partially sampled. I think this is a good example of why building a map per se is not really helpful for RL in general. I really think the paper would benefit from discussing these ideas in the context of model-based reinforcement learning.
The voxel environment in Section 4.3 is interesting, but there is little information about it and it is not clear what to conclude. There is also no consideration of where this work sits in the literature.
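The local-vs-global intuition about the contrastive loss can be illustrated with a minimal InfoNCE-style objective: it only asks the belief to discriminate the true future embedding from in-batch negatives, a local objective that need not preserve globally consistent map information. This is a generic CPC-style sketch, not the paper's exact loss; all names and shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce_loss(belief, targets):
    """CPC-style contrastive loss (illustrative only).

    belief:  (B, D) embeddings predicted from the belief state
    targets: (B, D) true future-observation embeddings; row i is the
             positive for belief i, all other rows act as negatives.
    """
    logits = belief @ targets.T                   # (B, B) similarity scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # cross-entropy on the diagonal

B, D = 8, 16
targets = rng.normal(size=(B, D))
# A belief that matches its own target scores lower loss than one that ignores it:
good = info_nce_loss(targets + 0.01 * rng.normal(size=(B, D)), targets)
bad = info_nce_loss(rng.normal(size=(B, D)), targets)
print(good < bad)
```

Because the loss is satisfied as soon as the positive is separable from the negatives at hand, it rewards locally discriminative features (good for localisation) without requiring the globally reconstructive detail that mapping needs.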
This submission includes a number of original and significant contributions and showcases them in different, challenging first-person environments and various experiments. Rather than training forward models in isolation and then using them for control, the authors train them jointly with a model-free policy, using the internal state of its recurrent network as a belief state from which future frames are predicted. Hence, the paper addresses several current research topics of high interest: (1) training powerful environment models, (2) effective memory methods for control in POMDPs, and (3) improving the sample complexity of model-free RL algorithms. The authors make a case for the necessity of overshooting during RL. I find the motivation insightful, since overshooting is usually motivated by the desire to produce coherent multi-step predictions (for planning) rather than by improving conditioning. The experimental evaluation is clear and insightful but, as mentioned in the conclusion, leaves the reader with a few unanswered questions; in particular, it would have been nice to further investigate the interplay between policy and forward-model performance. That being said, the experiments are certainly impressive, especially the ones in the voxel environment that demonstrate data efficiency, and they support the main theses of the paper quite well.
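The joint-training setup described here (one recurrent state feeding both a model-free policy head and a future-frame predictor) can be sketched in a few lines. This is my own minimal illustration of the idea, not the paper's architecture; all names, sizes, and the linear heads are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

D_OBS, D_BELIEF, N_ACT = 32, 64, 4  # hypothetical sizes

# Randomly initialised weights stand in for learned parameters.
W_in = rng.normal(scale=0.1, size=(D_OBS, D_BELIEF))
W_rec = rng.normal(scale=0.1, size=(D_BELIEF, D_BELIEF))
W_pi = rng.normal(scale=0.1, size=(D_BELIEF, N_ACT))    # policy head (RL loss)
W_dec = rng.normal(scale=0.1, size=(D_BELIEF, D_OBS))   # frame-prediction head

def step(belief, obs):
    """One step of the shared recurrent core: the same belief state feeds
    both the model-free policy and the future-frame predictor."""
    belief = np.tanh(obs @ W_in + belief @ W_rec)
    logits = belief @ W_pi          # consumed by the RL objective (e.g. IMPALA)
    pred_next_obs = belief @ W_dec  # consumed by the prediction objective
    return belief, logits, pred_next_obs

belief = np.zeros(D_BELIEF)
obs_seq = rng.normal(size=(10, D_OBS))
pred_loss = 0.0
for t in range(len(obs_seq) - 1):
    belief, logits, pred = step(belief, obs_seq[t])
    pred_loss += np.mean((pred - obs_seq[t + 1]) ** 2)  # auxiliary loss on shared state
print(pred_loss > 0)
```

The design point is that the prediction loss shapes the same recurrent state the policy reads from, rather than training the forward model separately and handing it to a planner.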