NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
Main Ideas

The high-level motivation of this work is to consider alternatives to learning good forward models, which may not be a desirable solution in all cases. The hypothesis is that a predictive model may arise as an emergent property if such prediction were useful for the agent. The authors test this hypothesis by constraining the agent to observe states only at certain timesteps, requiring a model to learn to fill in the gaps. The model is not trained with a forward-prediction objective.

Connection to Prior Work

The method introduced in this work seems novel in the context of other literature that trains forward models. Other works have also attempted to overcome the difficulties of training forward models, for example by using inverse models [1]. The Predictron [2] also learns an implicit model, as this submission does. Would the authors be able to include a discussion comparing their work with these two types of approaches ((1) inverse models and (2) implicit models) in the related work section?

Quality
- Strengths: the authors are careful and honest about evaluating the strengths and weaknesses of their work. They evaluated their idea on simple, easy-to-analyze tasks and also demonstrated the generality of their method on various domains.

Clarity
- Strengths: the paper is very well written and motivated.

Originality
- Strengths: the framing of the problem and the method the authors use seem novel.

Significance
- Strengths: this work is a proof of concept that training an implicit model with observational dropout is, in some cases, sufficient for learning policies.
- Weakness: one of the appeals of learning a world model is that such models help facilitate generalization to different tasks in the same domain. For example, one could train on car racing going forwards and test on car racing going backwards, which a forward model trained to predict the next state given the current state and action could presumably handle. However, the implicit model studied in this submission is inherently tied to the training task, and it is unclear whether such implicit models would help with that kind of generalization. Would the authors be able to provide a thorough experiment analyzing the limits and capabilities of how their implicit model facilitates generalization to unseen tasks?

Overall, I like the perspective of this paper and I think it is well written, well motivated, and thorough. The key drawback I see is the lack of analysis of how well the proposed method fares when generalizing to unseen tasks. This analysis is, in my opinion, crucial because a large motivation for learning models in the first place is to facilitate such generalization.

[1] Pathak, Deepak, et al. "Zero-Shot Visual Imitation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
[2] Silver, David, et al. "The Predictron: End-to-End Learning and Planning." Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017.

UPDATE AFTER REBUTTAL

I appreciate that the authors have conducted the experiment comparing their proposed method with a model-based baseline. I think a more thorough analysis of generalization would make the paper much stronger, and I believe the significance of this work remains of a preliminary flavor. The paper is executed carefully and the results are consistent with the authors' claims, but this is only a first step.
I keep my original score, but would agree with Reviewer 3's comment about motivation and urge the authors to provide a more compelling argument through empirical evaluation for why one would want to consider the proposed idea in the camera-ready.
Reviewer 2
The authors set out to explore how much data needs to be observed from the world and how much can be imagined, or simulated through one's own forward model, in model-based reinforcement learning tasks. To this end, observational dropout is introduced: a method that at each time step probabilistically substitutes the real-world observation with the prediction of a learned world model, which is optimized only to support the learning of key skills on search-and-avoid or navigation tasks during dropout. Through this mechanism a forward-prediction model arises implicitly (a minimal sketch of this substitution mechanism is given at the end of this review). Results suggest that observational dropout improves the generalization of a trained world model. Interestingly, the dropout rate seems to be crucial for successful training. Keeping too many ground-truth observations seems to prevent the world model from learning anything, since the world model is then not needed to perform the task. Keeping too few ground-truth observations also diminishes performance, since the agent is then too strongly disconnected from the world and relies solely on its own predictions, which can strongly deviate from the ground truth over time. There seems to be a sweet spot for the right amount of dropout at which both the performance of the implicit forward predictor and the task performance peak. These results are shown in a race-car environment and a grid-world-like environment in which an agent is tasked with searching for or avoiding certain items, and which will be made publicly available.

Originality: The idea of dropping observations to implicitly train a world model in a model-based reinforcement learning setup is novel and original.

Quality: A well-written, instructive paper with actionable implications for future research.

Clarity: Overall the paper is clear and well written.

Significance: The results are significant but have only been tested on toy problems. It would be great to see how the presented findings hold up in more complex scenarios.

All in all, I would like to argue for accepting this paper. The paper is very instructive and clearly written, and the idea of training a world model implicitly through dropout is novel and original. The results are interesting and applicable to model-based reinforcement learning setups. The major insight and contribution lie in the analysis of how many of the observations are actually needed, or should be dropped, to train a generalizable forward-predicting world model. The submission includes code, and the test environment will be published. The only caveat is that the experiments are performed on very simple navigation tasks; more complex control experiments closer to real-world scenarios would have made the results more convincing. Conceptually this is interesting work, but it is hard to tell whether it would apply to scenarios outside of simulation.

EDIT - The authors addressed concerns regarding comparison to prior work and generalizability in the rebuttal, and discuss how, in their experiments, generalization is task-dependent. Assuming the authors execute the changes proposed in the rebuttal and put them in the final draft, I would still argue to accept the paper, but would not be upset if it gets rejected.
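To make the mechanism concrete, here is how I understand the rollout loop. This is a minimal sketch, not the authors' implementation: the env/policy/world_model interfaces (reset, step, act, predict_next), the p_keep parameter, and the function name are illustrative placeholders of mine.

```python
import numpy as np

def rollout_with_observational_dropout(env, policy, world_model, p_keep, seed=0):
    """Roll out one episode under observational dropout.

    At every step the agent sees the real observation with probability
    p_keep; otherwise it only sees the world model's prediction rolled
    forward from whatever it saw last. No prediction loss appears anywhere:
    the world model is shaped purely by the episode return it helps achieve.
    env, policy and world_model are assumed interfaces used for illustration.
    """
    rng = np.random.default_rng(seed)
    real_obs = env.reset()
    seen_obs = real_obs          # what the policy is actually shown
    total_reward, done = 0.0, False
    while not done:
        action = policy.act(seen_obs)
        real_obs, reward, done, _ = env.step(action)
        total_reward += reward
        if rng.random() < p_keep:
            seen_obs = real_obs                                    # peek at the real world
        else:
            seen_obs = world_model.predict_next(seen_obs, action)  # imagine instead
    return total_reward
```

The parameters of both the policy and the world model would then be optimized jointly to maximize this return (apparently with an evolutionary method, given the Appendix's mention of training generations), so any forward-prediction ability of the model is purely emergent.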
Reviewer 3
Edit after author response: I appreciate the model-based baseline and the frank discussion. I think the main pending question for this work is whether there is a reason to learn a model this way instead of with an explicit prediction objective. However, I think this is an interesting direction for exploration.

Overall this paper has an interesting idea and does some preliminary work to evaluate it. However, at this point the authors show only that it might be possible to do this, not that there is any reason one might want to. This is a clever paper and I would like to read a more complete version.

Originality
This idea is fairly original, proposing to implicitly learn what is recognizably a predictive model of the environment without a prediction objective. I would be interested to see further exploration of this idea, and in particular a demonstration of cases where it has an advantage over explicit model learning. The most similar work to this is probably "The Predictron: End-To-End Learning and Planning" (Silver et al.), and the authors should include a discussion of the similarities.

Quality
The main weakness of this work is in its experiments. The results shown in Figs. 2 & 5 seem unimpressive, and the work contains no comparisons to any other methods or variants of the model. It is unacceptable that it does not include a comparison to explicit model learning. The results in Figs. 4 & 6 are qualitatively interesting but somewhat hard to interpret; they seem to indicate that the learned model is only vaguely related to ground-truth prediction. The comparison between architectures in Fig. 5 shows no significant difference; furthermore, without reporting the amount of data used to train the two architectures it is impossible to evaluate (as inductive biases will be washed out with sufficient data). The number of environment steps used may be computable from the number of training generations, etc., in the Appendix, but it should be explicitly stated in the main body. It is also clear that with an expressive policy that is able to distinguish between real and generated observations, there is no reason the implicit "model" should need to make forward predictions at all; in that case the policy as a whole reduces to a recurrent neural network policy. It would be important to include a discussion of this limitation.

Clarity
The writing quality in this work is high and I enjoyed reading it. However, there are a few details which could use elaboration:
- All experiments should include the number of environment samples used.
- The observation space for the cartpole example is not explicitly stated.

Significance
Currently the significance of this paper is low-to-medium. It has a clever idea but does not establish it well enough to motivate follow-on work by others.